Export the FP32 model to ONNX format and verify it with onnx.checker.check_model()
Prepare a calibration dataset as a CalibrationDataReader subclass whose get_next() returns a dict of inputs keyed by the model's input names, and None once the data is exhausted
Run quantize_static(model_input, model_output, calibration_data_reader, quant_format=QuantFormat.QOperator) to emit quantized operators (e.g. QLinearConv) instead of QuantizeLinear/DequantizeLinear pairs
Load the quantized model with onnxruntime.InferenceSession and run predictions on a validation set to measure accuracy vs the FP32 baseline
Compare model size (file bytes) and latency (wall-clock inference time) between FP32 and INT8 versions on the target hardware
Known gotchas
Static quantization needs a representative calibration dataset, typically at least 100 samples: too few samples produce poor scale/zero-point estimates and significant accuracy degradation
Not all ONNX operators support INT8 quantization — unsupported ops are automatically left in FP32 (a 'mixed precision' graph); inspect the quantized model with Netron to verify key ops were quantized
Quantized inference on CPU vs GPU can produce different numeric results because each backend implements quantized matmul differently, so always benchmark on the actual deployment hardware
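To illustrate why too few calibration samples give poor scale/zero-point estimates, here is a small NumPy sketch of min/max affine quantization to an unsigned 8-bit range. It is a simplified stand-in, not ONNX Runtime's actual calibrator; minmax_qparams, fake_quantize, and the synthetic Gaussian "activations" are assumptions made for illustration.

```python
import numpy as np


def minmax_qparams(samples, qmin=0, qmax=255):
    """Derive an affine scale and zero point from the observed value range."""
    rmin = min(float(samples.min()), 0.0)  # the range must include zero
    rmax = max(float(samples.max()), 0.0)
    if rmax == rmin:
        return 1.0, qmin  # degenerate range: fall back to a unit scale
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = int(round(qmin - rmin / scale))
    return scale, zero_point


def fake_quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Quantize to integers, clip to the 8-bit range, then dequantize."""
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale


# Synthetic stand-in for one tensor's activations over a full dataset.
rng = np.random.default_rng(0)
activations = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Calibrate on progressively larger prefixes and measure the worst-case error
# over the full set: values outside the calibrated range get clipped.
errors = {}
for n in (5, 100, 1000):
    scale, zero_point = minmax_qparams(activations[:n])
    errors[n] = float(np.abs(fake_quantize(activations, scale, zero_point) - activations).max())
    print(f"{n:>5} samples: scale={scale:.5f} zero_point={zero_point} max_abs_error={errors[n]:.3f}")
```

With only a handful of samples the observed range is narrow, so values seen later at inference time are clipped and the worst-case error grows; a larger calibration set widens the range at the cost of a slightly coarser scale.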