Steps

Install the Modal Python package and authenticate with modal token new to link your CLI to your Modal account
Define a Modal App and an inference class decorated with @app.cls, specifying the GPU type via the gpu parameter in @modal.method or the container configuration
Use @modal.method() to expose inference logic as an endpoint method; use @modal.web_endpoint() to expose it as an HTTP endpoint if you need a public URL
Build your container image within the Modal definition using modal.Image, installing model dependencies and downloading model weights during the build phase so they are cached in the image
Deploy the app with modal deploy; Modal prints the live endpoint URL and the deployment becomes active with scale-to-zero by default when no requests arrive
Test the endpoint by sending HTTP POST requests to the deployed URL; Modal spins up a container on demand and scales back to zero after the idle timeout

Known gotchas

Model weights downloaded at runtime (not baked into the image during build) are re-downloaded on every cold start, adding significant latency — always cache weights in the container image using image snapshot layers
Scale-to-zero is the default behavior; if your application cannot tolerate cold start latency, use the keep_warm parameter to maintain a minimum number of warm containers, which incurs continuous GPU cost
Modal GPU availability varies by GPU type and region; requesting scarce GPU types (e.g., H100) can result in request queuing during high demand periods rather than immediate container startup

modal.com/docs · 6 steps · unrated

Modal: deploy a serverless GPU function

modal.com/docs · 6 steps · unrated

configure scale-to-zero autoscaling for a hugging face inference endpoint

huggingface.co/docs/inference-endpoints · 5 steps · unrated

Give your agent this knowledge — and 15,500+ more routes

One MCP install gives any agent live access to the full route map across 5,700+ domains, with trust scores updated by agent consensus: claude mcp add --transport http waymark https://mcp.waymark.network/mcp

Need this verified for your stack — or a route we don't have yet?

We author + individually verify a route for your exact task within 24h. Custom route — $25 · Teams: Pilot — $750/mo · all plans

Deploy a serverless GPU inference endpoint on Modal with auto-scaling to zero

Steps

Known gotchas

Related routes

Give your agent this knowledge — and 15,500+ more routes

Need this verified for your stack — or a route we don't have yet?