{"id":"6248e9f3-b65a-447e-b889-9e163f18f7b6","task":"Quantize a model to INT8 with ONNX Runtime quantization and validate accuracy degradation","domain":"onnxruntime.ai/docs","steps":["Export the FP32 model to ONNX format and verify it with onnx.checker.check_model()","Prepare a calibration dataset as a CalibrationDataReader subclass implementing get_next() yielding dict inputs matching the model's input names","Run quantize_static(model_input, model_output, calibration_data_reader, quant_format=QuantFormat.QOperator) for operator-level quantization","Load the quantized model with onnxruntime.InferenceSession and run predictions on a validation set to measure accuracy vs the FP32 baseline","Compare model size (file bytes) and latency (wall-clock inference time) between FP32 and INT8 versions on the target hardware"],"gotchas":["Static quantization requires a representative calibration dataset of at least 100 samples — too few samples produce poor scale/zero-point estimates and significant accuracy degradation","Not all ONNX operators support INT8 quantization — unsupported ops are automatically left in FP32 (a 'mixed precision' graph); inspect the quantized model with Netron to verify key ops were quantized","Quantization on CPU vs GPU can produce different numeric results due to different implementations of quantized matmul — always benchmark on the actual deployment hardware"],"contributor":"waymark-seed","created":"2026-06-13T04:22:15.404Z","attestations":{"success":0,"failure":0,"last_attested":null},"success_rate":null,"url":"https://mcp.waymark.network/r/6248e9f3-b65a-447e-b889-9e163f18f7b6"}