Install Ray Serve with pip install "ray[serve]", then call ray.init() to start a local Ray instance or connect to an existing cluster.
Define a deployment class decorated with @serve.deployment, implementing a __call__ method (or an async __call__ for async handling) that contains your model inference logic.
Load your model inside __init__ so it is loaded once per replica rather than on every request.
Bind the deployment to create an application object: app = MyDeployment.bind(). To compose multiple deployments, pass one bound deployment as an argument to another deployment's .bind() call.
Deploy the application with serve.run(app) for a local cluster, or use serve deploy config.yaml for a production cluster using a Serve config file.
Test the endpoint by sending HTTP requests to the Serve HTTP proxy, which listens on http://localhost:8000 by default.
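The steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: SentimentDeployment, its keyword-based "model", the predict helper, build_app, and main are all hypothetical names, and the sketch assumes ray[serve] and requests are installed. The Ray imports are kept inside functions so the inference logic can be exercised without a cluster.

```python
class SentimentDeployment:
    """Hypothetical deployment wrapping a toy keyword 'model'."""

    def __init__(self):
        # Stand-in for an expensive model load; runs once per replica.
        self.positive_words = {"good", "great", "excellent"}

    def predict(self, text: str) -> dict:
        # Inference logic, kept separate so it can be tested without Ray.
        hits = sum(word in self.positive_words for word in text.lower().split())
        return {"text": text, "positive": hits > 0}

    async def __call__(self, request):
        # `request` is the Starlette request object passed in by the Serve HTTP proxy.
        payload = await request.json()
        return self.predict(payload["text"])


def build_app():
    # Imported lazily so the class above stays importable without Ray.
    from ray import serve

    # Equivalent to decorating the class with @serve.deployment.
    deployment = serve.deployment(SentimentDeployment)
    return deployment.bind()


def main():
    import requests
    from ray import serve

    # Deploys the application; Serve starts or connects to Ray as needed.
    serve.run(build_app())

    # Assumes the default proxy address and the default "/" route.
    reply = requests.post("http://localhost:8000/", json={"text": "a great day"})
    print(reply.json())
```

Calling main() from a script or REPL deploys the application and sends one test request. For a production cluster, the same application would instead be referenced from a Serve config file and rolled out with serve deploy config.yaml.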
Known gotchas
Model objects loaded outside __init__ (e.g., at module level) are not properly replicated and can cause serialization errors when Ray spawns additional replicas.
The default number of replicas is 1; for production workloads, explicitly set either num_replicas (a fixed count) or autoscaling_config (an autoscaled range). Serve rejects a deployment that sets both.
Async deployments require the __call__ method to be defined with async def; calling blocking code directly inside an async __call__ stalls the replica's event loop under load.
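The replica and async gotchas can be illustrated with a short sketch. Everything named here is hypothetical (SlowModel, the two option dictionaries, build_app), and moving the blocking call onto a worker thread with asyncio.to_thread is one common asyncio pattern, not something Ray Serve prescribes. The autoscaling values are placeholders, not tuned recommendations.

```python
import asyncio
import time

# Hypothetical option sets; use one or the other, not both together.
FIXED_OPTIONS = {"num_replicas": 2}
AUTOSCALED_OPTIONS = {"autoscaling_config": {"min_replicas": 1, "max_replicas": 4}}


class SlowModel:
    """Hypothetical deployment whose inference call blocks."""

    def __init__(self):
        self.delay_s = 0.05  # stand-in for blocking inference time

    def predict(self, text: str) -> str:
        time.sleep(self.delay_s)  # blocking call, like a synchronous model forward pass
        return text.upper()

    async def __call__(self, request):
        payload = await request.json()
        # Run the blocking call in a worker thread (Python 3.9+) so the
        # event loop can keep accepting other requests in the meantime.
        return await asyncio.to_thread(self.predict, payload["text"])


def build_app(options=AUTOSCALED_OPTIONS):
    # Imported lazily so the class above stays importable without Ray.
    from ray import serve

    # serve.deployment(**options) returns a decorator, applied here to the class.
    return serve.deployment(**options)(SlowModel).bind()
```

Passing FIXED_OPTIONS to build_app pins the deployment at two replicas, while the default lets Serve scale between one and four.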