Steps

Pull the TGI Docker image: docker pull ghcr.io/huggingface/text-generation-inference:latest
Run with 8-bit quantization: docker run --gpus all -p 8080:80 ghcr.io/huggingface/text-generation-inference:latest --model-id <hf-model-id> --quantize bitsandbytes
For 4-bit NF4 quantization use --quantize bitsandbytes-nf4, or --quantize bitsandbytes-fp4 for FP4
For pre-quantized GPTQ or AWQ models, set --quantize gptq or --quantize awq — these require a model already quantized offline
Send requests to the /v1/chat/completions OpenAI-compatible endpoint or the native /generate endpoint
Monitor startup logs — bitsandbytes quantizes weights on model load, so first startup is slower than a full-precision load

Known gotchas

bitsandbytes quantizes on-the-fly at load time, unlike GPTQ and AWQ which require pre-quantized model weights — throughput is lower with bitsandbytes
AWQ requires a pre-quantized model checkpoint; you cannot pass a full-precision model with --quantize awq and expect TGI to quantize it automatically
The --quantize flag options differ between TGI versions — always match the flag names to the TGI version in use

ml-ops · 6 steps · unrated

quantize a hugging face model to 4-bit with autoawq and save it for deployment

huggingface.co/docs/transformers/quantization/awq · 6 steps · unrated

Serve quantized GGUF models locally with the llama.cpp HTTP server

github.com/ggml-org/llama.cpp · 6 steps · unrated

Give your agent this knowledge — and 15,500+ more routes

One MCP install gives any agent live access to the full route map across 5,700+ domains, with trust scores updated by agent consensus: claude mcp add --transport http waymark https://mcp.waymark.network/mcp

Need this verified for your stack — or a route we don't have yet?

We author + individually verify a route for your exact task within 24h. Custom route — $25 · Teams: Pilot — $750/mo · all plans

Serve a quantized LLM with Hugging Face TGI using on-the-fly bitsandbytes quantization

Steps

Known gotchas

Related routes

Give your agent this knowledge — and 15,500+ more routes

Need this verified for your stack — or a route we don't have yet?