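Install vLLM and launch the OpenAI-compatible server, passing the target model with --model and a draft model with --speculative-model to enable speculative decoding; newer vLLM releases take the draft settings as JSON through --speculative-config instead, so check which flags your installed version accepts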
Set the number of speculative tokens per step with --num-speculative-tokens to control the draft depth; start with a small value (e.g., 5) and raise it only while the draft acceptance rate stays high, since rejected draft tokens are wasted compute
Prefix caching is enabled by default in current vLLM releases (older releases need --enable-prefix-caching); verify it is active by checking the server startup logs for the prefix caching status line
Send requests with shared system prompts or repeated context prefixes; vLLM reuses KV cache blocks for matching prefixes across requests automatically
Monitor GPU memory utilization and the cache hit rate using vLLM's exposed Prometheus metrics to validate that both features are providing benefit
Benchmark throughput and latency with and without speculative decoding under your target QPS to confirm net improvement, since speculative decoding benefits are workload-dependent
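The launch described in the first two steps can be sketched as a small Python helper that assembles the server command. This is a minimal sketch under stated assumptions, not vLLM's own tooling: build_server_command and the TARGET_MODEL / DRAFT_MODEL placeholders are hypothetical, and the JSON keys passed to --speculative-config (model, num_speculative_tokens) are an assumption to confirm against the --help output of your installed version.

```python
import json
import shlex
import sys


def build_server_command(target_model, draft_model,
                         num_speculative_tokens=5, use_json_config=False):
    """Return the argv list for launching the OpenAI-compatible server.

    use_json_config=False emits the flags named in the steps above;
    use_json_config=True emits the JSON form used by newer releases
    (key names are an assumption, verify them for your version).
    """
    if num_speculative_tokens < 1:
        raise ValueError("num_speculative_tokens must be at least 1")

    argv = [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", target_model,
    ]
    if use_json_config:
        spec = {
            "model": draft_model,
            "num_speculative_tokens": num_speculative_tokens,
        }
        argv += ["--speculative-config", json.dumps(spec)]
    else:
        argv += [
            "--speculative-model", draft_model,
            "--num-speculative-tokens", str(num_speculative_tokens),
        ]
    return argv


if __name__ == "__main__":
    # Placeholder names: substitute a real target/draft pair that
    # shares a tokenizer and vocabulary.
    cmd = build_server_command("TARGET_MODEL", "DRAFT_MODEL")
    print(shlex.join(cmd))
```

The sketch prints the command instead of executing it, so it is safe to run on a machine without a GPU; once the flags are confirmed, the same list can be handed to subprocess.Popen.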
Known gotchas
Speculative decoding reduces inter-token latency primarily under low-to-medium QPS memory-bound workloads; under high QPS compute-bound loads it can reduce throughput rather than improve it
When speculative decoding is active, some vLLM versions do not report prefix cache hit statistics in the logs; the cache still functions, but the hit rate metrics are absent
The draft model must be compatible in vocabulary and tokenizer with the target model; using mismatched tokenizers causes decoding errors or silently incorrect outputs
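Because hit-rate metrics can be missing in some versions, a monitoring script should treat an absent counter as "unknown" rather than as zero. The following is a minimal sketch, assuming the server exposes Prometheus text at /metrics on port 8000; the metric names in the comments are assumptions that differ between vLLM versions, so copy the exact names from your own /metrics output. sum_metric, ratio, and fetch_metrics are hypothetical helper names.

```python
import re
import urllib.request

# One sample per line in the Prometheus text format:
# metric name, optional {labels}, then the value.
# Simplification: label values containing "}" are not handled.
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)')


def sum_metric(text, name):
    """Sum every sample of `name` across label sets; None if absent."""
    total, seen = 0.0, False
    for line in text.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        match = _SAMPLE.match(line)
        if match and match.group(1) == name:
            total += float(match.group(2))
            seen = True
    return total if seen else None


def ratio(text, numerator, denominator):
    """Return numerator/denominator, or None if either is missing or zero."""
    num = sum_metric(text, numerator)
    den = sum_metric(text, denominator)
    if num is None or den is None or den == 0:
        return None
    return num / den


def fetch_metrics(base_url="http://localhost:8000"):
    """Fetch the raw metrics text (host and port are assumptions)."""
    with urllib.request.urlopen(base_url + "/metrics") as resp:
        return resp.read().decode("utf-8")


# Example use against a running server (metric names are assumptions):
#   text = fetch_metrics()
#   hit_rate = ratio(text, "vllm:prefix_cache_hits_total",
#                    "vllm:prefix_cache_queries_total")
#   acceptance = ratio(text, "vllm:spec_decode_num_accepted_tokens_total",
#                      "vllm:spec_decode_num_draft_tokens_total")
```

A None result signals that the counter is not exported by this version or configuration, which is different from a measured hit rate of zero.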