Organize a model repository directory with the structure MODEL_REPO/MODEL_NAME/VERSION_NUMBER/model.EXTENSION where VERSION_NUMBER is an integer subdirectory.
Create a config.pbtxt file in the MODEL_NAME directory specifying at least the platform or backend (e.g., platform 'onnxruntime_onnx', 'pytorch_libtorch', or 'tensorrt_plan'), max_batch_size, and the name, data type, and shape of every input and output tensor.
Pull the Triton server container image (nvcr.io/nvidia/tritonserver) from the NVIDIA NGC registry, choosing the release tag whose bundled backends and CUDA version suit your model and host driver.
Launch the container with the model repository mounted at /models: docker run --gpus all -v /local/model_repo:/models -p 8000:8000 -p 8001:8001 -p 8002:8002 nvcr.io/nvidia/tritonserver:TAG tritonserver --model-repository=/models (port 8000 serves HTTP, 8001 serves gRPC, and 8002 serves Prometheus metrics).
Verify the server is ready by calling GET http://localhost:8000/v2/health/ready, then confirm each model is loaded with GET http://localhost:8000/v2/models/MODEL_NAME/ready; GET http://localhost:8000/v2/models/MODEL_NAME returns the loaded model's metadata.
Send inference requests to the HTTP or gRPC endpoint, both of which follow the KServe v2 inference protocol; the tritonclient Python library wraps both for convenience.
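The repository layout, the config file, and the request format from the steps above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the model name my_model, the tensor names INPUT0 and OUTPUT0, the dims, the max_batch_size of 8, and all helper function names are hypothetical. It uses only the standard library (urllib) instead of tritonclient so that it stays self-contained; the two network helpers assume a server is listening on localhost:8000.

```python
import json
import tempfile
import urllib.request
from pathlib import Path

# Hypothetical names and shapes, chosen only for illustration.
MODEL_NAME = "my_model"
INPUT_NAME = "INPUT0"
OUTPUT_NAME = "OUTPUT0"

CONFIG_PBTXT = """name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ 2 ]
  }
]
"""


def create_repo(root: Path, version: int = 1) -> Path:
    """Create MODEL_REPO/MODEL_NAME/VERSION_NUMBER/ and write config.pbtxt."""
    version_dir = root / MODEL_NAME / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    (root / MODEL_NAME / "config.pbtxt").write_text(CONFIG_PBTXT)
    return version_dir  # the model file (e.g. model.onnx) is copied here


def build_infer_request(rows):
    """Build a KServe v2 JSON request body for a batch of 4-element rows."""
    flat = [float(x) for row in rows for x in row]  # row-major flattening
    return {
        "inputs": [
            {
                "name": INPUT_NAME,
                "shape": [len(rows), 4],  # batch dimension comes first
                "datatype": "FP32",
                "data": flat,
            }
        ],
        "outputs": [{"name": OUTPUT_NAME}],
    }


def server_ready(base_url="http://localhost:8000", timeout=2.0) -> bool:
    """Return True if GET /v2/health/ready answers with HTTP 200."""
    try:
        url = f"{base_url}/v2/health/ready"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, refused connections
        return False


def infer(rows, base_url="http://localhost:8000"):
    """POST the request to /v2/models/MODEL_NAME/infer and decode the reply."""
    request = urllib.request.Request(
        f"{base_url}/v2/models/{MODEL_NAME}/infer",
        data=json.dumps(build_infer_request(rows)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read())
```

Calling create_repo(Path("/local/model_repo")) would produce the directory that the docker run command mounts at /models. Once server_ready() returns True, infer([[1, 2, 3, 4]]) sends a batch of one row.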
Known gotchas
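Tensor shapes in config.pbtxt must match what the model expects; a mismatch causes a model load failure. When max_batch_size is greater than 0, Triton adds the batch dimension itself, so dims must omit it; when max_batch_size is 0, dims must give the full shape. Use -1 for a variable-size dimension.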
Triton decides which versions to serve from the version_policy setting in config.pbtxt (latest, all, or specific); if it is not set, only the highest-numbered version directory is served.
GPU backends require compatible NVIDIA drivers on the host, and the --gpus flag requires the NVIDIA Container Toolkit; the container's CUDA version must be less than or equal to the CUDA version the host driver supports.
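The first two gotchas can be made concrete with a small model of the rules. The sketch below is plain Python that mirrors the documented behavior, not Triton's own code, and the function names are hypothetical. In config.pbtxt the three policies are written as version_policy: { latest { num_versions: 1 } }, version_policy: { all { } }, and version_policy: { specific { versions: [1, 3] } }.

```python
def served_versions(available, policy="latest", num_versions=1, specific=()):
    """Return the version numbers that would be served under a policy."""
    versions = sorted(available)
    if policy == "latest":
        if num_versions < 1:
            raise ValueError("num_versions must be at least 1")
        return versions[-num_versions:]  # the N highest-numbered versions
    if policy == "all":
        return versions
    if policy == "specific":
        wanted = set(specific)
        return [v for v in versions if v in wanted]
    raise ValueError(f"unknown policy: {policy}")


def full_shape(dims, max_batch_size):
    """Shape a request must have: a batch dim is prepended when batching is on."""
    return [-1] + list(dims) if max_batch_size > 0 else list(dims)


def shape_matches(request_shape, dims, max_batch_size):
    """Check a request shape against the configured dims (-1 matches any size)."""
    expected = full_shape(dims, max_batch_size)
    if len(request_shape) != len(expected):
        return False
    if max_batch_size > 0 and request_shape[0] > max_batch_size:
        return False  # the batch exceeds max_batch_size
    return all(e == -1 or e == r for e, r in zip(expected, request_shape))
```

With dims: [ 4 ] and max_batch_size: 8, a request of shape [2, 4] matches, while a bare [4] does not because the batch dimension is missing.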