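Package your model with Cog (Replicate's open-source containerization tool). Write predict.py (the conventional filename) containing a Predictor class that subclasses cog.BasePredictor, loads weights in setup(), and implements predict(). Then write a cog.yaml that declares the build environment (GPU support, Python version, Python packages) and points its predict key at predict.py:Predictor. Cog generates the Docker image from this file.
Authenticate with cog login, then run cog push r8.im/<username>/<model-name> to build the image and push it to an existing model on Replicate; each push creates a new model version on the platform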
Create a deployment with the POST /v1/deployments API endpoint or the Replicate dashboard, specifying a deployment name, the model, the version, the hardware type, and min_instances and max_instances to bound auto-scaling
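The deployment gets its own predictions endpoint, POST /v1/deployments/{owner}/{deployment_name}/predictions, which is separate from the model's shared endpoint. Reference the deployment by name in API calls rather than pinning a model version, so the version can be changed on the deployment without touching client code
Invoke the deployment by sending {"input": {...}} to that endpoint. A plain POST /v1/predictions with a version field bypasses the deployment's instances and scaling settings. The deployment auto-scales between min_instances and max_instances as request load changes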
Monitor deployment metrics (request volume, latency, instance utilization, error rates) from the Replicate dashboard and adjust min/max instance counts as traffic patterns evolve
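The API calls from the steps above can be sketched with Python's standard library. This is a minimal sketch under several assumptions: Replicate's HTTP API is at https://api.replicate.com/v1 with Bearer-token auth, deployments are invoked through POST /v1/deployments/{owner}/{name}/predictions, and scaling limits can be changed with PATCH /v1/deployments/{owner}/{name}. The helper names, the example hardware SKU "gpu-t4", and the local min/max check are illustrative and hypothetical. Each helper only builds a urllib.request.Request; nothing is sent until you pass it to urllib.request.urlopen.

```python
import json
import urllib.request

# Base URL of Replicate's HTTP API (assumed).
API_BASE = "https://api.replicate.com/v1"


def _json_request(token, url, method, body):
    """Build (but do not send) an authenticated JSON request."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method=method,
    )


def _check_scaling(min_instances, max_instances):
    # Hypothetical local sanity check; the API performs its own validation.
    if min_instances < 0 or max_instances < min_instances:
        raise ValueError("expected 0 <= min_instances <= max_instances")


def create_deployment_request(token, name, model, version, hardware,
                              min_instances=0, max_instances=1):
    """POST /v1/deployments: bind a model version to hardware and scaling limits."""
    _check_scaling(min_instances, max_instances)
    body = {
        "name": name,
        "model": model,          # "owner/model-name"
        "version": version,      # version ID created by `cog push`
        "hardware": hardware,    # hardware SKU string, e.g. "gpu-t4" (example value)
        "min_instances": min_instances,
        "max_instances": max_instances,
    }
    return _json_request(token, f"{API_BASE}/deployments", "POST", body)


def deployment_prediction_request(token, owner, name, inputs):
    """POST /v1/deployments/{owner}/{name}/predictions: run on the deployment's instances."""
    url = f"{API_BASE}/deployments/{owner}/{name}/predictions"
    return _json_request(token, url, "POST", {"input": inputs})


def update_scaling_request(token, owner, name, min_instances, max_instances):
    """PATCH /v1/deployments/{owner}/{name}: adjust the auto-scaling bounds."""
    _check_scaling(min_instances, max_instances)
    body = {"min_instances": min_instances, "max_instances": max_instances}
    return _json_request(token, f"{API_BASE}/deployments/{owner}/{name}", "PATCH", body)
```

To send one of these requests, wrap it as `with urllib.request.urlopen(req) as resp: data = json.load(resp)`. Keeping construction separate from sending makes the request shapes easy to inspect and test without a live token.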
Known gotchas
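As of 2025, POST /v1/predictions is the general-purpose endpoint for running any model on Replicate, whether community or official; older documentation that points to separate endpoints for different model types may be outdated. Deployments are the exception: they are invoked through POST /v1/deployments/{owner}/{deployment_name}/predictions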
Deployments with min_instances set to 0 scale to zero during idle periods; cold starts require pulling and initializing the container, which can take tens of seconds for large model images
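Cog builds the model's input schema from the type annotations on predict(), so parameters must use types Cog supports, such as str, int, float, bool, and cog.Path for files (sent over the API as URLs). An unsupported annotation, such as an arbitrary Python class, is rejected when Cog generates the schema, typically during cog build or cog push. On the client side, inputs that cannot be serialized to JSON cannot be sent in a prediction request at all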
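The cold-start and input gotchas can both be handled on the client side. The sketch below assumes prediction objects carry a status field that ends in succeeded, failed, or canceled, with starting shown while a worker boots. It polls with a timeout long enough to absorb a cold start and pre-checks that request inputs are plain JSON values (strings, numbers, booleans, file URLs), so a non-serializable object is caught before the request is sent. The function names and the default timeout are hypothetical choices. The fetch function is left to the caller, for example a GET /v1/predictions/{id} call parsed into a dict.

```python
import time

# Prediction states that will not change further.
TERMINAL_STATES = {"succeeded", "failed", "canceled"}
# Scalar types that serialize directly to JSON input values.
JSON_SCALARS = (str, int, float, bool)


def wait_for_prediction(fetch, timeout_s=300.0, poll_interval_s=2.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll `fetch()` until the prediction reaches a terminal state.

    `fetch` is a caller-supplied function returning the prediction as a dict.
    A long "starting" phase usually means a cold start, so the timeout should
    cover container boot time for large images.
    """
    deadline = clock() + timeout_s
    while True:
        prediction = fetch()
        status = prediction.get("status")
        if status in TERMINAL_STATES:
            return prediction
        if clock() >= deadline:
            raise TimeoutError(f"prediction still {status!r} after {timeout_s}s")
        sleep(poll_interval_s)


def non_json_inputs(inputs):
    """Return keys whose values are not JSON scalars or lists of JSON scalars.

    Files are expected as URL strings, so they pass this check.
    """
    bad = []
    for key, value in inputs.items():
        if isinstance(value, JSON_SCALARS):
            continue
        if isinstance(value, list) and all(isinstance(v, JSON_SCALARS) for v in value):
            continue
        bad.append(key)
    return bad
```

The sleep and clock parameters are injectable so the polling loop can be exercised without real waiting; in production the defaults use time.sleep and time.monotonic.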