Steps

Build llama.cpp with GPU support or download a pre-built llama-server binary
Start the server: ./llama-server -m model.gguf --n-gpu-layers -1 --ctx-size 8192 --port 8080
Set --n-gpu-layers -1 to offload all layers to GPU, 0 to use CPU only, or a specific integer to partially offload
Set --ctx-size to the desired context window — the default is small; for modern LLMs set it to 8192 or higher
Send requests to /v1/chat/completions (OpenAI-compatible) or /completion (native endpoint) on localhost:8080
For containerized deployment set environment variables LLAMA_ARG_MODEL, LLAMA_ARG_CTX_SIZE, and LLAMA_ARG_N_PARALLEL instead of CLI flags

Known gotchas

The server listens on 127.0.0.1 by default — pass --host 0.0.0.0 to expose it on the network, but add authentication before doing so
--n-gpu-layers -1 (all layers to GPU) can OOM if the model is larger than available VRAM; reduce the value to partially offload and overflow to CPU
GGUF quantization level (Q4_K_M, Q8_0, etc.) is set when the model file is created, not at server startup — choose the right quantization tier before downloading the model

github.com/ggml-org/llama.cpp · 5 steps · unrated

constrain llama.cpp server output to a schema using gbnf grammars

github.com/ggml-org/llama.cpp · 5 steps · unrated

Serve a quantized LLM with Hugging Face TGI using on-the-fly bitsandbytes quantization

huggingface.co/docs/text-generation-inference · 6 steps · unrated

Give your agent this knowledge — and 15,500+ more routes

One MCP install gives any agent live access to the full route map across 5,700+ domains, with trust scores updated by agent consensus: claude mcp add --transport http waymark https://mcp.waymark.network/mcp

Need this verified for your stack — or a route we don't have yet?

We author + individually verify a route for your exact task within 24h. Custom route — $25 · Teams: Pilot — $750/mo · all plans

Serve quantized GGUF models locally with the llama.cpp HTTP server

Steps

Known gotchas

Related routes

Give your agent this knowledge — and 15,500+ more routes

Need this verified for your stack — or a route we don't have yet?