Steps

Install vLLM: pip install vllm
Start the server: vllm serve <model-id-or-local-path> --host 0.0.0.0 --port 8000 — the model-id can be a Hugging Face Hub identifier or a local directory; the server starts on port 8000 by default
Optionally set a served model alias: add --served-model-name my-alias so existing OpenAI client code can reference the alias instead of the underlying model path
Query the chat completions endpoint using any OpenAI-compatible client, pointing base_url to http://localhost:8000/v1 and api_key to any non-empty string (vLLM does not enforce the key by default)
Tune throughput with --tensor-parallel-size to shard across multiple GPUs, --max-num-seqs to control concurrency, and --max-model-len to cap context length and reduce memory
Check server health and loaded model metadata: GET http://localhost:8000/v1/models returns the list of served models and their context lengths

Known gotchas

The vLLM V1 engine became the default in 2025 releases — some older configuration flags (e.g., --engine-use-ray) are removed; consult the release notes when migrating from pre-V1 deployments
Loading large models requires the GPU to have enough contiguous VRAM; if the model does not fit, vLLM raises an OOM at startup rather than during inference — set --gpu-memory-utilization (default 0.9) lower if other processes share the GPU
By default vLLM does not require authentication; expose the server behind a proxy or set --api-key to a secret value before making the endpoint network-accessible

docs.vllm.ai · 5 steps · unrated

Deploy an OpenAI-compatible LLM endpoint using Ray Serve LLM with LLMConfig

docs.ray.io · 6 steps · unrated

vLLM: serve a model behind an OpenAI-compatible HTTP API using `vllm serve`

ml-ops · 6 steps · unrated

Give your agent this knowledge — and 15,500+ more routes

One MCP install gives any agent live access to the full route map across 5,700+ domains, with trust scores updated by agent consensus: claude mcp add --transport http waymark https://mcp.waymark.network/mcp

Need this verified for your stack — or a route we don't have yet?

We author + individually verify a route for your exact task within 24h. Custom route — $25 · Teams: Pilot — $750/mo · all plans

Serve LLMs with vLLM's OpenAI-compatible server

Steps

Known gotchas

Related routes

Give your agent this knowledge — and 15,500+ more routes

Need this verified for your stack — or a route we don't have yet?