Steps

Pull the official TGI Docker image (ghcr.io/huggingface/text-generation-inference, hosted on the GitHub Container Registry), selecting the tag appropriate for your hardware (GPU with CUDA or CPU)
Launch the container with docker run, mounting a local model cache directory at /data, setting the MODEL_ID environment variable to the Hugging Face model ID, and publishing the container's HTTP port (80 by default) on a host port
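If deploying a gated model that requires authentication, also pass your Hugging Face access token to the container in that docker run command through the HUGGING_FACE_HUB_TOKEN environment variable (recent TGI documentation names this variable HF_TOKEN)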
Wait for the server to finish loading the model weights; poll the /health endpoint until it returns HTTP 200
Send text generation requests to the /generate endpoint as POST requests with a JSON body containing inputs and a parameters object
For streaming responses, use the /generate_stream endpoint, which returns server-sent events with token-by-token output
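
The steps above can be sketched in Python using only the standard library. This is a minimal sketch, not a definitive implementation: the image tag, host port 8080, the --shm-size value, and the helper names (docker_run_args, wait_until_healthy, build_generate_request, generate) are assumptions for illustration. The token is forwarded from the host environment rather than written on the command line.

```python
import json
import time
import urllib.request

# Assumed values for illustration; pick the tag that matches your hardware.
IMAGE = "ghcr.io/huggingface/text-generation-inference:latest"
BASE_URL = "http://localhost:8080"


def docker_run_args(model_id, cache_dir, host_port=8080, gpu=True, pass_token=False):
    """Build the argv for `docker run`; hand it to subprocess to launch."""
    args = [
        "docker", "run", "--rm", "--shm-size", "1g",
        "-p", f"{host_port}:80",      # TGI listens on port 80 inside the container
        "-v", f"{cache_dir}:/data",   # local model cache volume
        "-e", f"MODEL_ID={model_id}",
    ]
    if gpu:
        args += ["--gpus", "all"]
    if pass_token:
        # `-e NAME` with no value forwards NAME from the host environment.
        args += ["-e", "HUGGING_FACE_HUB_TOKEN"]
    return args + [IMAGE]


def wait_until_healthy(base_url=BASE_URL, timeout_s=600, interval_s=5):
    """Poll /health until it answers 200 or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # connection refused or non-2xx while the weights load
        time.sleep(interval_s)
    return False


def build_generate_request(prompt, max_new_tokens=64, base_url=BASE_URL):
    """Build the POST request for /generate with inputs and parameters."""
    body = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, timeout_s=120, **kwargs):
    """Send the request and return the generated_text field of the response."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.load(resp)["generated_text"]
```

A typical flow would start the container with subprocess.Popen(docker_run_args("your-org/your-model", "/path/to/cache")), where both arguments are placeholders, then call wait_until_healthy() before the first generate(...) call.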

Known gotchas

TGI entered maintenance mode in December 2025, and Hugging Face recommends vLLM or SGLang for new deployments on Inference Endpoints; keep TGI for existing workloads, but start new projects on one of those engines and plan a migration path for the rest
Quantized models (GPTQ, AWQ, bitsandbytes) load only if the TGI image version you run supports the corresponding quantization backend; not every release supports every quantization format
The /generate endpoint returns the full generated text in a single response; for long generations this can cause client-side timeouts if the client timeout is shorter than the generation time, so raise the client timeout or use /generate_stream
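
One way around the timeout gotcha is to read tokens from /generate_stream as they arrive. The following is a minimal sketch under stated assumptions: the helper names (parse_sse_line, stream_tokens) are hypothetical, the base URL assumes the server is published on localhost:8080, and each event is assumed to arrive as a `data:` line whose JSON payload carries the token text under token.text.

```python
import json
import urllib.request


def parse_sse_line(raw_line):
    """Return the JSON payload of an SSE `data:` line, or None for other lines."""
    line = raw_line.decode("utf-8").strip()
    if not line.startswith("data:"):
        return None  # blank separators and comment lines carry no payload
    return json.loads(line[len("data:"):])


def stream_tokens(prompt, base_url="http://localhost:8080",
                  max_new_tokens=512, read_timeout_s=30):
    """Yield token text from /generate_stream as each event arrives."""
    body = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    request = urllib.request.Request(
        f"{base_url}/generate_stream",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # The timeout applies to each blocking socket read, so it limits the wait
    # between chunks rather than the total generation time.
    with urllib.request.urlopen(request, timeout=read_timeout_s) as resp:
        for raw_line in resp:
            event = parse_sse_line(raw_line)
            if event and "token" in event:
                yield event["token"]["text"]
```

Joining the yielded strings, for example "".join(stream_tokens("Hello")), rebuilds the full text while the connection stays active throughout the generation.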

huggingface.co/docs/text-generation-inference · 6 steps · unrated

Related routes

Call a Hugging Face Text Generation Inference server with the OpenAI-compatible Messages API

huggingface.co/docs/text-generation-inference · 5 steps · unrated

Create a Hugging Face Dedicated Inference Endpoint with custom container settings and autoscaling

huggingface.co/docs/inference-endpoints · 5 steps · unrated

Deploy a Hugging Face Text Generation Inference (TGI) server via Docker for self-hosted LLM serving
