Package your model artifacts and push them to an S3 bucket; create a SageMaker Model object referencing the artifact path and the inference container image URI
Create an EndpointConfig that includes a ProductionVariant with a ServerlessConfig block, specifying MemorySizeInMB (must be one of the supported values: 1024, 2048, 3072, 4096, 5120, or 6144) and MaxConcurrency
Create a SageMaker Endpoint from the EndpointConfig; the endpoint starts in a Creating state and requires no instance type selection
Invoke the endpoint using the SageMaker Runtime InvokeEndpoint API with a payload up to 4 MB and a processing timeout of up to 60 seconds
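Monitor invocation metrics in CloudWatch, including invocation count (Invocations), model latency (ModelLatency), and billed duration, to understand cost; the ModelSetupTime metric reports the time spent launching compute and loading the model, which makes cold start behavior visible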
Set ProvisionedConcurrency in the ServerlessConfig if cold start latency is unacceptable; provisioned concurrency keeps warm instances ready at additional cost
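The steps above can be sketched with boto3 roughly as follows. This is a minimal sketch, not a definitive implementation: the helper names `build_serverless_config` and `deploy_serverless_endpoint`, the `AllTraffic` variant name, and the `-config` suffix are hypothetical, and the execution role ARN is an assumed extra input that the steps do not mention. The check that ProvisionedConcurrency does not exceed MaxConcurrency reflects my understanding of the API and should be verified against the current documentation.

```python
# Memory sizes listed in the EndpointConfig step above.
SUPPORTED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def build_serverless_config(memory_mb, max_concurrency, provisioned_concurrency=None):
    """Build the ServerlessConfig block for a ProductionVariant, validating inputs locally."""
    if memory_mb not in SUPPORTED_MEMORY_MB:
        raise ValueError(f"MemorySizeInMB must be one of {SUPPORTED_MEMORY_MB}, got {memory_mb}")
    if max_concurrency < 1:
        raise ValueError("MaxConcurrency must be at least 1")
    config = {"MemorySizeInMB": memory_mb, "MaxConcurrency": max_concurrency}
    if provisioned_concurrency is not None:
        # Assumption: provisioned concurrency may not exceed MaxConcurrency.
        if not 1 <= provisioned_concurrency <= max_concurrency:
            raise ValueError("ProvisionedConcurrency must be between 1 and MaxConcurrency")
        config["ProvisionedConcurrency"] = provisioned_concurrency
    return config


def deploy_serverless_endpoint(sm, name, image_uri, model_data_url, role_arn, serverless_config):
    """Create the Model, EndpointConfig, and Endpoint using a SageMaker client `sm`."""
    config_name = f"{name}-config"  # hypothetical naming convention
    sm.create_model(
        ModelName=name,
        PrimaryContainer={"Image": image_uri, "ModelDataUrl": model_data_url},
        ExecutionRoleArn=role_arn,
    )
    sm.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[
            {
                "VariantName": "AllTraffic",  # hypothetical variant name
                "ModelName": name,
                "ServerlessConfig": serverless_config,  # no instance type needed
            }
        ],
    )
    sm.create_endpoint(EndpointName=name, EndpointConfigName=config_name)
    return name


# Usage sketch (requires AWS credentials; all values below are placeholders):
#   import boto3
#   sm = boto3.client("sagemaker")
#   cfg = build_serverless_config(2048, max_concurrency=5)
#   deploy_serverless_endpoint(sm, "my-model", "<image-uri>",
#                              "s3://<bucket>/model.tar.gz", "<role-arn>", cfg)
#   sm.get_waiter("endpoint_in_service").wait(EndpointName="my-model")
```

Passing the client in as an argument keeps the helpers easy to exercise with a stub, and validating the memory size locally surfaces a bad value before any API call is made.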
Known gotchas
Serverless Inference does not support GPU instances, Multi-Model Endpoints, VPC configuration, Model Monitor, or inference pipelines; workloads requiring any of these must use real-time inference endpoints instead
Cold starts occur when no warm instance is available; cold start duration depends on model size and container initialization time and can range from seconds to over a minute for large models
MaxConcurrency caps the number of simultaneous requests; requests beyond this cap are rejected with a throttling error rather than queued, so the caller must implement retry logic
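The retry logic mentioned above might look like the following sketch, which wraps the SageMaker Runtime `invoke_endpoint` call in exponential backoff with jitter. The helper names `is_throttling_error` and `invoke_with_retry` are hypothetical. Two details are assumptions rather than facts from this page: that throttled requests surface as a client error whose code is `ThrottlingException` or whose HTTP status is 429, and that the 4 MB payload limit means 4 * 1024 * 1024 bytes.

```python
import random
import time

# Assumption: the 4 MB payload limit is interpreted as 4 * 1024 * 1024 bytes.
MAX_PAYLOAD_BYTES = 4 * 1024 * 1024


def is_throttling_error(err):
    """Return True if `err` looks like a throttling response (assumed shape)."""
    response = getattr(err, "response", None) or {}
    code = response.get("Error", {}).get("Code")
    status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")
    return code == "ThrottlingException" or status == 429


def invoke_with_retry(runtime, endpoint_name, body, content_type="application/json",
                      max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Invoke the endpoint, retrying throttled requests with exponential backoff."""
    if len(body) > MAX_PAYLOAD_BYTES:
        raise ValueError("payload exceeds the 4 MB serverless limit")
    for attempt in range(1, max_attempts + 1):
        try:
            response = runtime.invoke_endpoint(
                EndpointName=endpoint_name,
                ContentType=content_type,
                Body=body,
            )
            return response["Body"].read()
        except Exception as err:
            # Re-raise anything that is not throttling, or when attempts run out.
            if not is_throttling_error(err) or attempt == max_attempts:
                raise
            # Delay doubles each attempt; jitter spreads out concurrent retries.
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay))


# Usage sketch (requires AWS credentials; the endpoint name is a placeholder):
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   result = invoke_with_retry(runtime, "my-model", b'{"inputs": [1, 2, 3]}')
```

Because rejected requests are not queued, backoff with jitter keeps a burst of callers from retrying in lockstep and hitting the MaxConcurrency cap again at the same moment.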