Install the Modal Python package (pip install modal) and run modal token new to authenticate, which links your CLI to your Modal account
Define a Modal App and an inference class decorated with @app.cls, specifying the GPU type via the gpu parameter of @app.cls (for example gpu="A10G"); the GPU is part of the container configuration, not of @modal.method
Use @modal.method() to expose inference logic as a remotely callable method; use @modal.web_endpoint() (renamed @modal.fastapi_endpoint() in recent Modal releases) to expose it over HTTP if you need a public URL
Build your container image within the Modal definition using modal.Image, installing model dependencies and downloading model weights during the build phase so they are cached in the image
Deploy the app with modal deploy; Modal prints the live endpoint URL, and the deployment scales to zero by default when no requests arrive
Test the endpoint by sending HTTP POST requests to the deployed URL; Modal spins up a container on demand and scales back to zero after the idle timeout
Known gotchas
Model weights downloaded at runtime (not baked into the image during build) are re-downloaded on every cold start, adding significant latency; cache weights in the container image by downloading them in a build step (for example with Image.run_function), or keep them on a modal.Volume
Scale-to-zero is the default behavior; if your application cannot tolerate cold start latency, use the keep_warm parameter (named min_containers in recent Modal releases) to maintain a minimum number of warm containers, which incurs continuous GPU cost
Modal GPU availability varies by GPU type and region; requesting scarce GPU types (e.g., H100) can result in request queuing during high demand periods rather than immediate container startup
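To see how much the cold start matters for a given deployment, it can help to time the first request after an idle period against a few follow-up requests. The sketch below uses only the Python standard library; the helper names build_request, timed, and compare_cold_and_warm are hypothetical, as are the URL and the {"prompt": ...} payload shape in the commented usage, which depend on how your endpoint is defined.

```python
import json
import time
import urllib.request


def build_request(url: str, payload: dict) -> urllib.request.Request:
    # Build (but do not send) a JSON POST request for the deployed endpoint.
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def timed(call):
    # Run call() and return (result, elapsed_seconds).
    start = time.perf_counter()
    result = call()
    return result, time.perf_counter() - start


def compare_cold_and_warm(send, n_warm: int = 3):
    # The first call may hit a cold container; later calls should reuse it.
    _, first = timed(send)
    warm = [timed(send)[1] for _ in range(n_warm)]
    return first, min(warm)


# Example usage (hypothetical URL; substitute the one printed by modal deploy):
#
#   req = build_request("https://example.invalid/generate", {"prompt": "Hi"})
#   send = lambda: urllib.request.urlopen(req, timeout=300).read()
#   first, warm = compare_cold_and_warm(send)
#   print(f"first request: {first:.1f}s, best warm request: {warm:.1f}s")
```

A large gap between the first and the warm timings suggests that weight loading or container startup dominates, which is where baking weights into the image or keeping warm containers is likely to pay off. A long wait with no response at all may instead point to GPU queuing.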