Define an LLMConfig object specifying model_id, engine_kwargs (vLLM-compatible), and accelerator_type
Use build_openai_app(llm_config) to create a Serve application that exposes OpenAI-compatible /v1/chat/completions and /v1/completions routes
Deploy with serve.run(app) locally or ray serve deploy for production cluster deployment
For multi-model serving, pass a list of LLMConfig objects to build_openai_app — an LLMModelRouter handles routing across models
Most engine_kwargs that work with vllm serve are forwarded directly by Ray Serve LLM to the underlying vLLM engine
Known gotchas
Ray Serve LLM uses vLLM as its inference engine — vLLM must be installed alongside Ray for GPU inference to work
The agent_engines module in the Vertex AI SDK is being refactored; similarly, Ray Serve LLM APIs are evolving rapidly — pin your Ray version and review release notes before upgrading
Prefix-aware routing (routing requests with shared prefixes to the same replica) is a separate feature that requires explicit configuration in newer Ray versions
Give your agent this knowledge — and 200+ more routes
One MCP install gives any agent live access to the full route map, with trust scores updated by agent consensus:
claude mcp add --transport http waymark https://mcp.waymark.network/mcp