When answering an inbound call, respond with TwiML containing <Connect><Stream url="wss://your-server.example.com/stream" /></Connect> to start a bidirectional Media Stream.
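As a minimal sketch of the TwiML response described above, the helper below builds the document as a string. The function name connect_stream_twiml is hypothetical, and the check for a wss:// scheme is an assumption based on the example URL rather than a documented rule.

```python
from xml.sax.saxutils import quoteattr


def connect_stream_twiml(stream_url: str) -> str:
    """Return TwiML that asks Twilio to open a bidirectional Media Stream."""
    # Assumption: the stream endpoint is a secure WebSocket URL, as in the example.
    if not stream_url.startswith("wss://"):
        raise ValueError("expected a wss:// URL for the stream endpoint")
    # quoteattr() escapes the URL and wraps it in quotes for use as an XML attribute.
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response><Connect>"
        f"<Stream url={quoteattr(stream_url)} />"
        "</Connect></Response>"
    )
```

Your voice webhook handler would return this string with an XML content type.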
Accept the WebSocket upgrade on your server. Twilio first sends a connected message (event: connected), then a start message (event: start) whose start object carries stream metadata, including streamSid, accountSid, callSid, and mediaFormat.
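The start message can be unpacked roughly as follows. This is a sketch that assumes the camelCase field names streamSid, accountSid, callSid, and mediaFormat (with encoding, sampleRate, and channels) nested under a start object. StreamInfo and parse_start are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamInfo:
    stream_sid: str
    account_sid: str
    call_sid: str
    encoding: str
    sample_rate: int
    channels: int


def parse_start(raw: str) -> Optional[StreamInfo]:
    """Return stream metadata for a start message, or None for any other event."""
    msg = json.loads(raw)
    if msg.get("event") != "start":
        return None
    start = msg["start"]          # assumed location of the metadata
    fmt = start["mediaFormat"]    # assumed shape: encoding, sampleRate, channels
    return StreamInfo(
        stream_sid=start["streamSid"],
        account_sid=start["accountSid"],
        call_sid=start["callSid"],
        encoding=fmt["encoding"],
        sample_rate=int(fmt["sampleRate"]),
        channels=int(fmt["channels"]),
    )
```

Keep the returned stream_sid; every message you send back on the socket needs it.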
Media messages arrive as JSON text frames with event set to media. The media.payload field holds the audio as Base64-encoded 8-bit mono mu-law (PCMU) sampled at 8 kHz; decode the Base64 before piping the audio to an STT or audio processor.
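A minimal sketch of the decode step, assuming the payload sits at media.payload in the JSON frame. extract_audio is a hypothetical helper name.

```python
import base64
import json
from typing import Optional


def extract_audio(raw: str) -> Optional[bytes]:
    """Return the raw mu-law bytes from a media message, or None for other events."""
    msg = json.loads(raw)
    if msg.get("event") != "media":
        return None
    # Assumed layout: {"event": "media", "media": {"payload": "<base64>"}}
    return base64.b64decode(msg["media"]["payload"])
```

At 8 kHz with one byte per sample, a 20 ms chunk decodes to 160 bytes.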
To inject audio back into the call, send a JSON media message that carries the streamSid and a Base64-encoded PCMU 8 kHz payload in media.payload.
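One way to build the outbound frames is sketched below. It assumes the message shape event, streamSid, and media.payload. Splitting the audio into 160-byte (20 ms) chunks is a design choice of this sketch, not a stated requirement, and media_frames is a hypothetical name.

```python
import base64
import json
from typing import Iterator


def media_frames(stream_sid: str, mulaw_audio: bytes, chunk_bytes: int = 160) -> Iterator[str]:
    """Yield JSON media messages carrying raw PCMU 8 kHz audio, one chunk each."""
    for offset in range(0, len(mulaw_audio), chunk_bytes):
        chunk = mulaw_audio[offset:offset + chunk_bytes]
        yield json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": base64.b64encode(chunk).decode("ascii")},
        })
```

Each yielded string would be sent as a text frame on the same WebSocket the stream arrived on.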
To discard buffered outbound audio (e.g. to interrupt a TTS response), send a clear message (event: clear) with the streamSid.
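A barge-in handler might build the clear message like this. The sketch assumes the message needs only event and streamSid, and clear_message is a hypothetical name.

```python
import json


def clear_message(stream_sid: str) -> str:
    """Return the JSON text frame that asks Twilio to drop buffered outbound audio."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})
```

Sending this frame when the caller starts speaking stops queued TTS audio from continuing to play.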
Known gotchas
Audio is always PCMU (mu-law) at 8000 Hz; Twilio does not support configurable encoding on Media Streams, so STT services that require linear16 need a mu-law-to-PCM conversion step.
Twilio sends inbound audio in 20 ms chunks, and network jitter can deliver several chunks in a burst. If strict timing matters, buffer the chunks and order or pace them by sequence number.
A mark message sent after injected audio is echoed back by Twilio when that audio finishes playing, which lets you detect precisely when playback has ended.
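A small tracker for outstanding marks might look like the sketch below. It assumes the mark message shape event, streamSid, and mark.name for both the sent and the echoed message. MarkTracker and its methods are hypothetical names.

```python
import json
from typing import Optional, Set


class MarkTracker:
    """Track mark names that were sent but not yet echoed back."""

    def __init__(self) -> None:
        self._pending: Set[str] = set()

    def mark_message(self, stream_sid: str, name: str) -> str:
        """Record the mark as pending and return the JSON frame to send."""
        self._pending.add(name)
        return json.dumps({"event": "mark", "streamSid": stream_sid, "mark": {"name": name}})

    def handle(self, raw: str) -> Optional[str]:
        """Return the name of an echoed mark, or None if raw is another event."""
        msg = json.loads(raw)
        if msg.get("event") != "mark":
            return None
        name = msg["mark"]["name"]
        self._pending.discard(name)
        return name

    @property
    def playing(self) -> bool:
        """True while at least one sent mark has not been echoed yet."""
        return bool(self._pending)
```

Send the mark frame right after the last media frame of an utterance; when handle returns that name, the utterance has finished playing.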