Pull the official TGI Docker image (ghcr.io/huggingface/text-generation-inference, hosted on GitHub Container Registry), selecting the tag appropriate for your hardware (GPU with CUDA or CPU)
Launch the container with docker run, mounting a local model cache volume, setting the MODEL_ID environment variable to the Hugging Face model ID, and exposing the HTTP port
If deploying a gated model that requires authentication, also pass the HUGGING_FACE_HUB_TOKEN environment variable to the container in the same docker run command
Wait for the server to finish loading the model weights; poll the health endpoint until it returns a healthy status
Send text generation requests to the /generate endpoint as POST requests with a JSON body containing inputs and a parameters object
For streaming responses, use the /generate_stream endpoint, which returns server-sent events with token-by-token output
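The launch, health-check, and request steps above can be sketched in Python using only the standard library. This is a minimal sketch under stated assumptions, not a definitive client: the container is assumed to listen on port 80 and cache weights under /data, the health endpoint is assumed to be /health, and the response is assumed to carry a generated_text field. The helper names, the default host port 8080, and the image argument are hypothetical; substitute the image tag you actually pulled.

```python
import json
import time
import urllib.request


def docker_run_command(image, model_id, cache_dir, host_port=8080,
                       gated=False, use_gpu=True):
    """Build (but do not execute) the `docker run` argument list."""
    cmd = ["docker", "run", "--rm"]
    if use_gpu:
        cmd += ["--gpus", "all"]
    cmd += [
        "-p", f"{host_port}:80",      # assumption: server listens on port 80 in the container
        "-v", f"{cache_dir}:/data",   # assumption: model weights are cached under /data
        "-e", f"MODEL_ID={model_id}",
    ]
    if gated:
        # `-e NAME` with no value forwards the variable from the host environment,
        # which keeps the token out of the command line itself.
        cmd += ["-e", "HUGGING_FACE_HUB_TOKEN"]
    cmd.append(image)
    return cmd


def wait_until_healthy(base_url, timeout_s=600.0, interval_s=5.0,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """Poll the health endpoint until it answers 200 or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with opener(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused or an HTTP error: the model is still loading.
            pass
        sleep(interval_s)
    return False


def build_generate_request(base_url, prompt, max_new_tokens=64):
    """Build the POST request for /generate with inputs and parameters."""
    body = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url, prompt, max_new_tokens=64, timeout_s=120.0):
    """Send one generation request and return the generated text."""
    req = build_generate_request(base_url, prompt, max_new_tokens)
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.load(resp)["generated_text"]
```

A typical sequence would be to pass the list from docker_run_command to subprocess.Popen, call wait_until_healthy("http://localhost:8080"), and then call generate with the same base URL. Building the command as a list rather than a shell string avoids quoting problems in model IDs and paths.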
Known gotchas
TGI entered maintenance mode in December 2025; Hugging Face recommends vLLM or SGLang for new deployments on Inference Endpoints. Keep TGI for existing workloads, plan their migration, and start new projects on one of the recommended engines
Quantized models (GPTQ, AWQ, bitsandbytes) require the corresponding quantization backend to be supported by the specific TGI image version; not all quantization formats are supported in every release
The /generate endpoint returns the full generated text in a single response, so a long generation can outlast the client's timeout; raise the client timeout or use /generate_stream for long outputs
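One way to sidestep the timeout gotcha is to consume /generate_stream, so tokens arrive as they are produced. The sketch below assumes each server-sent event is a line of the form data:{...} whose JSON carries a token object with text and special fields; verify that shape against your TGI version. The function names are hypothetical.

```python
import json
import urllib.request


def parse_sse_line(raw_line):
    """Decode one SSE line; return its JSON payload, or None for non-data lines."""
    line = raw_line.decode("utf-8").strip()
    if not line.startswith("data:"):
        return None  # blank separators and comment lines carry no payload
    return json.loads(line[len("data:"):])


def iter_token_texts(lines):
    """Yield the text of each non-special token from an iterable of SSE lines."""
    for raw in lines:
        event = parse_sse_line(raw)
        if event is None:
            continue
        token = event.get("token") or {}
        if not token.get("special", False):
            yield token.get("text", "")


def stream_generate(base_url, prompt, max_new_tokens=512, timeout_s=30.0):
    """POST to /generate_stream and yield token texts as they arrive."""
    body = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    req = urllib.request.Request(
        f"{base_url}/generate_stream",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # The urlopen timeout applies to each blocking socket operation, so here it
    # bounds the wait for the next chunk rather than the whole generation.
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        yield from iter_token_texts(resp)
```

With this approach a short per-read timeout still detects a stalled server, while a generation that takes minutes in total does not trip it as long as tokens keep arriving.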