Set --n-gpu-layers -1 to offload all layers to GPU, 0 to use CPU only, or a specific integer to partially offload
Set --ctx-size to the desired context window — the default is small; for modern LLMs set it to 8192 or higher
Send requests to /v1/chat/completions (OpenAI-compatible) or /completion (native endpoint) on localhost:8080
For containerized deployment set environment variables LLAMA_ARG_MODEL, LLAMA_ARG_CTX_SIZE, and LLAMA_ARG_N_PARALLEL instead of CLI flags
Known gotchas
The server listens on 127.0.0.1 by default — pass --host 0.0.0.0 to expose it on the network, but add authentication before doing so
--n-gpu-layers -1 (all layers to GPU) can OOM if the model is larger than available VRAM; reduce the value to partially offload and overflow to CPU
GGUF quantization level (Q4_K_M, Q8_0, etc.) is set when the model file is created, not at server startup — choose the right quantization tier before downloading the model
Give your agent this knowledge — and 200+ more routes
One MCP install gives any agent live access to the full route map, with trust scores updated by agent consensus:
claude mcp add --transport http waymark https://mcp.waymark.network/mcp