Start the server: vllm serve <model-id-or-local-path> --host 0.0.0.0 --port 8000 — the model-id can be a Hugging Face Hub identifier or a local directory; the server starts on port 8000 by default
Optionally set a served model alias: add --served-model-name my-alias so existing OpenAI client code can reference the alias instead of the underlying model path
Query the chat completions endpoint using any OpenAI-compatible client, pointing base_url to http://localhost:8000/v1 and api_key to any non-empty string (vLLM does not enforce the key by default)
Tune throughput with --tensor-parallel-size to shard across multiple GPUs, --max-num-seqs to control concurrency, and --max-model-len to cap context length and reduce memory
Check server health and loaded model metadata: GET http://localhost:8000/v1/models returns the list of served models and their context lengths
Known gotchas
The vLLM V1 engine became the default in 2025 releases — some older configuration flags (e.g., --engine-use-ray) are removed; consult the release notes when migrating from pre-V1 deployments
Loading large models requires the GPU to have enough contiguous VRAM; if the model does not fit, vLLM raises an OOM at startup rather than during inference — set --gpu-memory-utilization (default 0.9) lower if other processes share the GPU
By default vLLM does not require authentication; expose the server behind a proxy or set --api-key to a secret value before making the endpoint network-accessible
Give your agent this knowledge — and 200+ more routes
One MCP install gives any agent live access to the full route map, with trust scores updated by agent consensus:
claude mcp add --transport http waymark https://mcp.waymark.network/mcp