Pull the prebuilt Triton TensorRT-LLM container from NVIDIA NGC, or build it from the triton-inference-server/tensorrtllm_backend repository
Build a TensorRT-LLM engine from your model, using either the trtllm-build CLI (after converting the weights to a TensorRT-LLM checkpoint) or the high-level Python LLM API
Populate a Triton model repository from the inflight_batcher_llm directory, which contains the config.pbtxt configuration files for the C++ backend, and point those files at your built engine
Choose deployment mode: leader mode (one Triton process per GPU, rank 0 is leader) or orchestrator mode (single orchestrator process that spawns per-GPU workers)
Start Triton: tritonserver --model-repository=/path/to/model-repo and verify readiness on the HTTP health endpoint (GET /v2/health/ready, served on port 8000 by default)
Send inference requests via Triton's HTTP or gRPC endpoint; the backend handles in-flight batching and paged KV caching automatically
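The last two steps can be sketched in Python using only the standard library. This is a minimal sketch under stated assumptions, not the backend's official client: it assumes Triton is reachable over HTTP at a base URL such as http://localhost:8000, that the model repository exposes a model named "ensemble" as in the backend's example repository, and that the request and response fields (text_input, max_tokens, bad_words, stop_words, text_output) match that example. Adjust the names to your own repository.

```python
import json
import time
import urllib.error
import urllib.request


def ready_url(base_url: str) -> str:
    """Return the readiness URL for a Triton HTTP endpoint."""
    return base_url.rstrip("/") + "/v2/health/ready"


def generate_request(base_url: str, prompt: str, max_tokens: int = 64,
                     model: str = "ensemble") -> urllib.request.Request:
    """Build (but do not send) a POST to Triton's generate endpoint.

    The model name and JSON field names are assumptions taken from the
    backend's example model repository.
    """
    payload = {
        "text_input": prompt,
        "max_tokens": max_tokens,
        "bad_words": "",
        "stop_words": "",
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v2/models/{model}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def wait_until_ready(base_url: str, timeout_s: float = 300.0,
                     interval_s: float = 2.0) -> bool:
    """Poll the readiness endpoint until it returns HTTP 200 or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(ready_url(base_url), timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not listening yet, or models still loading
        time.sleep(interval_s)
    return False


def generate(base_url: str, prompt: str, max_tokens: int = 64) -> str:
    """Send one generate request and return the generated text."""
    request = generate_request(base_url, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.loads(resp.read())["text_output"]
```

A typical use would be to call wait_until_ready("http://localhost:8000") after launching tritonserver and, once it returns True, call generate("http://localhost:8000", "What is machine learning?", max_tokens=20). Batching of concurrent requests happens inside the backend, so the client sends each request independently.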
Known gotchas
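Leader mode is simpler for single-model serving; orchestrator mode is the mode intended for serving multiple TRT-LLM models on the same server
TensorRT-LLM engine files are specific to the GPU architecture they were built for: an engine built for H100 (compute capability 9.0) will not run on A100 (compute capability 8.0), so rebuild the engine for each target GPU type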
In-flight batching (continuous batching) is enabled by default in the backend; disabling it reverts to static batching and reduces throughput significantly
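The GPU-architecture gotcha above can be caught before Triton starts with a small pre-flight check. The sketch below is illustrative only: the function names are hypothetical, the compute capability that the engine was built for is assumed to be recorded by you at build time, and the nvidia-smi query field compute_cap is assumed to be available, which requires a reasonably recent driver. A matching compute capability is treated as a necessary condition, not a guarantee that the engine will load.

```python
import subprocess

# Compute capabilities of the two GPUs named in the gotcha above.
KNOWN_COMPUTE_CAPS = {"A100": (8, 0), "H100": (9, 0)}


def parse_compute_cap(text: str) -> tuple[int, int]:
    """Parse a 'major.minor' string such as '9.0' into (9, 0)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def local_compute_cap(gpu_index: int = 0) -> tuple[int, int]:
    """Ask nvidia-smi for one GPU's compute capability.

    Assumes nvidia-smi is on PATH and supports the compute_cap query field.
    """
    out = subprocess.run(
        ["nvidia-smi", f"--id={gpu_index}",
         "--query-gpu=compute_cap", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_compute_cap(out.splitlines()[0])


def engine_may_load(built_for: tuple[int, int],
                    running_on: tuple[int, int]) -> bool:
    """Return False when the engine's build architecture differs from the GPU's."""
    return built_for == running_on
```

On a serving host, one might call engine_may_load(KNOWN_COMPUTE_CAPS["H100"], local_compute_cap()) and refuse to start Triton when it returns False, which surfaces the mismatch earlier than an engine deserialization error would.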