Choose a quantization method: scalar (4x compression, float32 → int8, minimal recall loss), binary (up to 32x, float32 → 1 bit per dimension, best for high-dimensional embeddings with centered distributions), or product (up to 64x, highest compression, most recall loss)
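To make the compression factors concrete, here is a rough per-vector storage estimate. It is a back-of-the-envelope sketch: the function name is made up for illustration, the dimension is assumed to be a multiple of 8, and index overhead and product-quantization codebooks are ignored.

```python
def bytes_per_vector(dim: int) -> dict:
    """Approximate storage per vector for each quantization method."""
    full = dim * 4  # float32 uses 4 bytes per dimension
    return {
        "float32": full,
        "scalar_int8": dim,         # 1 byte per dimension, 4x smaller
        "binary": dim // 8,         # 1 bit per dimension, 32x smaller
        "product_x64": full // 64,  # strongest product setting, 64x smaller
    }


# Example with a 1536-dimensional embedding (an assumed size).
sizes = bytes_per_vector(1536)
print(sizes)
```

For 1536 dimensions this gives 6144 bytes at full precision, 1536 for scalar, 192 for binary, and 96 for product at x64.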
Define quantization in the collection config at creation time: set quantization_config.scalar.type='int8' for scalar, or quantization_config.binary.always_ram=true for binary to keep quantized vectors in RAM
For product quantization set quantization_config.product.compression='x16' (valid values: x4, x8, x16, x32, x64) to control the compression factor
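The three configurations above could be written as collection-creation request bodies roughly like this. This is a sketch that only builds the JSON; the helper name collection_body, the vector sizes, and the Cosine distance are illustrative assumptions, and the body is meant for the collection-creation endpoint (PUT /collections/{collection_name}).

```python
import json


def collection_body(dim: int, method: str, **options) -> dict:
    """Build a create-collection body with one quantization method enabled."""
    if method not in ("scalar", "binary", "product"):
        raise ValueError(f"unknown quantization method: {method}")
    return {
        # Vector size and distance are placeholders for this example.
        "vectors": {"size": dim, "distance": "Cosine"},
        "quantization_config": {method: options},
    }


scalar = collection_body(768, "scalar", type="int8", always_ram=True)
binary = collection_body(1536, "binary", always_ram=True)
product = collection_body(768, "product", compression="x16", always_ram=True)

print(json.dumps(product, indent=2))
```

Only one of scalar, binary, or product is set per collection body here, which keeps each configuration easy to benchmark on its own.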
Enable rescoring at query time by setting params.quantization.rescore=true in the search request; candidates are first retrieved using the quantized vectors, then re-scored with the original full-precision vectors
Tune oversampling: set params.quantization.oversampling (e.g. 2.0) to fetch twice as many candidates as the requested limit before rescoring, which improves recall at a modest latency cost
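A search request with rescoring and oversampling might look like the body below. This is a sketch that only constructs the JSON; the helper name search_body and the toy query vector are assumptions, and the body is intended for the points search endpoint of the collection.

```python
def search_body(vector, limit=10, rescore=True, oversampling=2.0) -> dict:
    """Build a search body that rescores quantized candidates."""
    return {
        "vector": list(vector),
        "limit": limit,
        "params": {
            "quantization": {
                "rescore": rescore,            # re-score with original vectors
                "oversampling": oversampling,  # candidate multiplier
            }
        },
    }


body = search_body([0.1, 0.2, 0.3, 0.4])

# With limit=10 and oversampling=2.0, about 20 candidates are pre-selected
# from the quantized vectors before rescoring trims the list back to 10.
candidates = int(body["limit"] * body["params"]["quantization"]["oversampling"])
print(candidates)
```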
Measure recall with and without rescoring using a benchmark query set before deploying to production
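One simple way to measure this is recall@k against exact (non-quantized) results. The sketch below assumes you have already collected the result IDs for each benchmark query; the ID lists are made-up placeholders to show the arithmetic, not measurements.

```python
def recall_at_k(exact_ids, approx_ids, k: int) -> float:
    """Fraction of the exact top-k that also appears in the approximate top-k."""
    hits = len(set(exact_ids[:k]) & set(approx_ids[:k]))
    return hits / k


def mean_recall(exact_runs, approx_runs, k: int) -> float:
    """Average recall@k over a set of benchmark queries."""
    scores = [recall_at_k(e, a, k) for e, a in zip(exact_runs, approx_runs)]
    return sum(scores) / len(scores)


# Placeholder result IDs for two queries (illustrative only).
exact = [[1, 2, 3, 4], [5, 6, 7, 8]]
without_rescore = [[1, 2, 9, 4], [5, 7, 10, 11]]
with_rescore = [[1, 2, 3, 4], [5, 6, 7, 12]]

recall_without = mean_recall(exact, without_rescore, k=4)
recall_with = mean_recall(exact, with_rescore, k=4)
print(recall_without, recall_with)
```

Running the same comparison at several oversampling values shows where additional candidates stop improving recall.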
Known gotchas
Binary quantization is only effective for vector models that produce centered, roughly normally distributed values (e.g. OpenAI text-embedding-3); it degrades recall for non-centered embeddings
Product quantization requires a training phase on existing vectors; if PQ is configured on an empty collection, training is deferred until enough vectors are loaded
Quantized vectors are stored separately from the original vectors; setting on_disk=true for the original vectors does not by itself decide where the quantized copy lives, so set always_ram in the quantization config explicitly and verify the resulting placement
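One way to make vector placement explicit is to set both flags in the same collection config: on_disk for the original vectors and always_ram for the quantized copy. The body below is a sketch with an assumed vector size and distance; check the resulting placement against your own deployment.

```python
# Original vectors on disk, quantized vectors pinned in RAM.
low_ram_body = {
    "vectors": {
        "size": 768,            # placeholder dimension
        "distance": "Cosine",   # placeholder distance
        "on_disk": True,        # original float32 vectors are kept on disk
    },
    "quantization_config": {
        "scalar": {
            "type": "int8",
            "always_ram": True,  # quantized vectors stay in memory
        }
    },
}

print(low_ram_body["vectors"]["on_disk"],
      low_ram_body["quantization_config"]["scalar"]["always_ram"])
```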