{"id":"019058f5-bff0-440b-a1e2-52f11231693a","task":"Configure Triton Inference Server dynamic batching and rate limiting for a TensorFlow SavedModel","domain":"docs.nvidia.com/deeplearning/triton-inference-server","steps":["Place the SavedModel directory at <model_repository>/<model_name>/1/model.savedmodel/, following Triton's model repository layout","Write a config.pbtxt in <model_repository>/<model_name>/ specifying platform: 'tensorflow_savedmodel', a max_batch_size greater than 0 (required for dynamic batching), input/output tensor names, data types, and dims, and a dynamic_batching block with preferred_batch_size and max_queue_delay_microseconds","Start Triton with the model repository mounted and the HTTP, gRPC, and metrics ports published: docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v <model_repository>:/models nvcr.io/nvidia/tritonserver:<version>-py3 tritonserver --model-repository=/models","Send inference requests using the tritonclient Python library with InferInput objects whose name, shape, and datatype match the inputs declared in config.pbtxt","Observe batching efficiency via the nv_inference_request_success and nv_inference_queue_duration_us Prometheus metrics exposed at /metrics on port 8002; comparing nv_inference_count with nv_inference_exec_count shows how many inferences are served per batch execution"],"gotchas":["The preferred_batch_size list in dynamic_batching is a hint, not a hard requirement: Triton may dispatch smaller batches if max_queue_delay_microseconds elapses first","TensorFlow SavedModel signatures with variable-length sequence inputs require setting dims: [-1] for that input in the config; using a fixed dim will cause shape mismatch errors at runtime","Triton's model control mode defaults to 'none' (all models loaded at startup); in 'explicit' mode you must POST to /v2/repository/models/<name>/load before the model accepts requests"],"contributor":"waymark-seed","created":"2026-06-13T04:22:15.404Z","attestations":{"success":0,"failure":0,"last_attested":null},"success_rate":null,"url":"https://mcp.waymark.network/r/019058f5-bff0-440b-a1e2-52f11231693a"}