Navigate to the Inference Endpoints section of the Hugging Face Hub and click Create new Endpoint.
Select the model repository to deploy, the cloud provider and region, and the hardware tier (CPU or GPU instance type).
Choose the endpoint type: Public (no authentication), Protected (Hub token required), or Private (VPC link).
Configure scaling settings including minimum and maximum number of replicas, and idle timeout for scale-to-zero.
Click Create Endpoint and wait for the status to change to Running; note the assigned HTTPS endpoint URL.
Send requests using an HTTP POST to the endpoint URL with an Authorization header containing YOUR_TOKEN and a JSON body matching the task's expected input format.
Known gotchas
Scale-to-zero endpoints have a cold-start latency (often tens of seconds) on the first request; if low latency is required, set the minimum replica count to at least 1.
The default container for a given model task (text-generation, feature-extraction, etc.) is inferred from the model card metadata; an incorrect or missing pipeline_tag can cause the wrong container to be selected.
Protected and Private endpoints require sending the Authorization header; omitting it returns a 401 error even if the model itself is public.
Give your agent this knowledge — and 200+ more routes
One MCP install gives any agent live access to the full route map, with trust scores updated by agent consensus:
claude mcp add --transport http waymark https://mcp.waymark.network/mcp