Create an AsyncInferenceConfig with an output_path S3 prefix for results and, optionally, a failure_path prefix for failed requests
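Deploy the model with sagemaker_model.deploy(async_inference_config=async_config, ...). deploy() itself waits for the endpoint to come into service by default; it is each subsequent invocation that returns immediately instead of blocking on inference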
Upload the input payload to S3 and call predict_async(input_path=s3_input_uri) on the predictor returned by deploy(); the call returns an AsyncInferenceResponse whose output_path is where the result will be written
Poll the output S3 key, or set SNS topics in the AsyncInferenceConfig notification_config (SuccessTopic and ErrorTopic) to receive success and error notifications
Parse the response JSON from the output S3 object once the notification fires or polling detects the key exists
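The configuration and polling steps above can be sketched with plain boto3-style calls. This is a minimal sketch, not the SDK's own implementation: build_async_config, split_s3_uri and wait_for_result are hypothetical helper names, the default of 4 concurrent invocations is arbitrary, the container is assumed to write JSON, and the caller is assumed to have s3:ListBucket so that a missing key surfaces as NoSuchKey rather than AccessDenied.

```python
import json
import time
from urllib.parse import urlparse


def build_async_config(output_uri, failure_uri=None, success_topic=None,
                       error_topic=None, max_concurrent=4):
    """Build the AsyncInferenceConfig dict passed to create_endpoint_config."""
    output = {"S3OutputPath": output_uri}
    if failure_uri:
        output["S3FailurePath"] = failure_uri
    topics = {}
    if success_topic:
        topics["SuccessTopic"] = success_topic
    if error_topic:
        topics["ErrorTopic"] = error_topic
    if topics:
        # SNS topics live under OutputConfig.NotificationConfig, not ClientConfig.
        output["NotificationConfig"] = topics
    return {
        "ClientConfig": {"MaxConcurrentInvocationsPerInstance": max_concurrent},
        "OutputConfig": output,
    }


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def wait_for_result(s3_client, output_uri, timeout_s=900.0, interval_s=5.0,
                    sleep=time.sleep):
    """Poll until the output object exists, then return its parsed JSON."""
    bucket, key = split_s3_uri(output_uri)
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            obj = s3_client.get_object(Bucket=bucket, Key=key)
            return json.loads(obj["Body"].read())
        except s3_client.exceptions.NoSuchKey:
            # Output not written yet: give up at the deadline, else retry.
            if time.monotonic() >= deadline:
                raise TimeoutError(f"no output at {output_uri} after {timeout_s}s")
            sleep(interval_s)
```

With real clients, the dict would go to create_endpoint_config(..., AsyncInferenceConfig=cfg) on boto3.client("sagemaker"), and the URI to poll would be the OutputLocation returned by invoke_endpoint_async on boto3.client("sagemaker-runtime"). The SageMaker SDK's AsyncInferenceResponse also appears to offer get_result for the same purpose, which may be simpler when the SDK is already in use.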
Known gotchas
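Async endpoints do not scale to zero by default. Register the variant with Application Auto Scaling using MinCapacity=0 and attach a scaling policy on a backlog metric such as ApproximateBacklogSizePerInstance; scaling back up from zero instances additionally needs a policy on HasBacklogWithoutCapacity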
Maximum payload size for async inference is 1 GB, but each request is still subject to an invocation timeout (15 minutes by default, configurable up to 1 hour); jobs that run longer should use Batch Transform instead
The output S3 prefix must be in the same region as the endpoint. A cross-region write does not raise an error on the invocation call itself; the failure surfaces only later, through the error notification
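The scale-to-zero gotcha can be sketched as two Application Auto Scaling calls. This is a minimal sketch under stated assumptions: configure_scale_to_zero is a hypothetical helper, the variant name "AllTraffic", the target backlog of 5 requests per instance, the cooldowns and the policy name are illustrative values, and the client passed in is expected to be boto3.client("application-autoscaling").

```python
def configure_scale_to_zero(autoscaling, endpoint_name, variant_name="AllTraffic",
                            max_capacity=2, target_backlog=5.0):
    """Allow an async endpoint variant to scale between 0 and max_capacity."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"

    # MinCapacity=0 is what permits the variant to drop to zero instances.
    autoscaling.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension=dimension,
        MinCapacity=0,
        MaxCapacity=max_capacity,
    )

    # Target tracking on the queue backlog per instance.
    autoscaling.put_scaling_policy(
        PolicyName=f"{endpoint_name}-backlog-tracking",  # illustrative name
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension=dimension,
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": target_backlog,
            "CustomizedMetricSpecification": {
                "MetricName": "ApproximateBacklogSizePerInstance",
                "Namespace": "AWS/SageMaker",
                "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
                "Statistic": "Average",
            },
            "ScaleInCooldown": 600,   # seconds, illustrative
            "ScaleOutCooldown": 300,  # seconds, illustrative
        },
    )
    return resource_id
```

Taking the client as a parameter keeps the function easy to exercise with a stub before pointing it at a real account.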