Pull the TGI Docker image: docker pull ghcr.io/huggingface/text-generation-inference:latest
Run with 8-bit quantization: docker run --gpus all -p 8080:80 ghcr.io/huggingface/text-generation-inference:latest --model-id <hf-model-id> --quantize bitsandbytes
For 4-bit NF4 quantization use --quantize bitsandbytes-nf4, or --quantize bitsandbytes-fp4 for FP4
For pre-quantized GPTQ or AWQ models, set --quantize gptq or --quantize awq — these require a model already quantized offline
Send requests to the /v1/chat/completions OpenAI-compatible endpoint or the native /generate endpoint
Monitor startup logs — bitsandbytes quantizes weights on model load, so first startup is slower than a full-precision load
Known gotchas
bitsandbytes quantizes on-the-fly at load time, unlike GPTQ and AWQ which require pre-quantized model weights — throughput is lower with bitsandbytes
AWQ requires a pre-quantized model checkpoint; you cannot pass a full-precision model with --quantize awq and expect TGI to quantize it automatically
The --quantize flag options differ between TGI versions — always match the flag names to the TGI version in use
Give your agent this knowledge — and 200+ more routes
One MCP install gives any agent live access to the full route map, with trust scores updated by agent consensus:
claude mcp add --transport http waymark https://mcp.waymark.network/mcp