Most Retell AI + Twilio integrations fail because developers treat them as a single system—they're not. Retell handles conversation logic; Twilio handles the phone connection. This tutorial shows you how to wire them together: configure a Retell assistant, create a Twilio phone number, connect inbound calls to Retell via webhook, and handle call state transitions. Result: production-grade AI voice calls that actually work.
Mental model
Retell AI and Twilio operate as separate layers in a voice pipeline. Twilio receives the phone call and streams raw audio over WebSocket using its Media Streams API. Your server acts as a bridge: it receives Twilio's mulaw-encoded 8kHz audio chunks every 20ms, forwards them to Retell AI's WebSocket endpoint, waits for Retell AI to process speech-to-text → LLM inference → text-to-speech, then streams the synthesized audio back to Twilio. The integration requires three simultaneous connections: caller ↔ Twilio ↔ your server ↔ Retell AI. Latency compounds at each hop, so geographic proximity matters. Twilio owns the telephony layer; Retell AI owns the conversation layer; you own the glue code that keeps both sides synchronized.
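The audio numbers above translate into a fixed frame size, which is handy when sanity-checking what arrives on the WebSocket. A minimal sketch, assuming one byte per mulaw sample; the helper names are illustrative and not part of either API:

```javascript
const SAMPLE_RATE_HZ = 8000; // Twilio Media Streams audio rate
const BYTES_PER_SAMPLE = 1;  // mulaw stores one byte per sample
const FRAME_MS = 20;         // chunk interval described above

// Bytes of raw audio in one chunk of the given duration.
function frameBytes(ms = FRAME_MS) {
  return (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * ms) / 1000;
}

// Duration in milliseconds of a base64-encoded mulaw payload.
function payloadDurationMs(base64Payload) {
  const bytes = Buffer.from(base64Payload, 'base64').length;
  return (bytes * 1000) / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE);
}

console.log(frameBytes());    // 160 bytes per 20ms chunk
console.log(1000 / FRAME_MS); // 50 chunks per second
```

If payloads decode to a different size, the stream is probably not 8kHz mulaw and the encoding settings deserve a second look.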
What you need first
- Retell AI account with an API key from the dashboard (used in Authorization: Bearer headers)
- Twilio account with Account SID, Auth Token, and a provisioned phone number (trial accounts work for testing but play a trial notice before your TwiML runs)
- Node.js 18+ (the server below relies on the built-in fetch) with the express, ws, and twilio packages installed via npm
- Public HTTPS endpoint for webhooks (ngrok for local dev, Railway/Render/Fly.io for production)
- SSL certificate (ngrok provides this automatically; production deployments need Let's Encrypt or similar)
- Environment variables for RETELL_API_KEY, RETELL_AGENT_ID, TWILIO_ACCOUNT_SID, and TWILIO_AUTH_TOKEN
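A missing environment variable tends to surface as a confusing 401 or 403 halfway through a call, so it can help to fail at startup instead. A small sketch, assuming the four variable names listed above; missingEnv and assertEnv are hypothetical helpers:

```javascript
// Names taken from the prerequisites list above.
const REQUIRED_ENV = [
  'RETELL_API_KEY',
  'RETELL_AGENT_ID',
  'TWILIO_ACCOUNT_SID',
  'TWILIO_AUTH_TOKEN',
];

// Hypothetical helper: returns the names that are unset or blank.
function missingEnv(env = process.env, names = REQUIRED_ENV) {
  return names.filter((name) => !env[name] || env[name].trim() === '');
}

// Hypothetical helper: call once at startup, before the server listens.
function assertEnv(env = process.env) {
  const missing = missingEnv(env);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
}
```

Calling assertEnv() as the first line of server.js makes a misconfigured deploy crash immediately rather than on the first inbound call.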
Under the hood
When a user dials your Twilio number, Twilio sends an HTTP POST to your /voice webhook. Your server responds with TwiML XML containing a <Connect><Stream> tag pointing to your WebSocket endpoint. Twilio opens the WebSocket and begins sending start, media, and stop events. The media events contain base64-encoded mulaw audio chunks arriving every 20ms (50 frames per second). Your server decodes these chunks and forwards them to Retell AI's WebSocket at wss://api.retellai.com/audio-websocket/{call_id}. Retell AI processes the audio through its STT engine, runs the configured LLM, synthesizes speech via TTS, and returns audio chunks in the same mulaw format. Your server re-encodes and streams these back to Twilio, which plays them to the caller. The handoff requires careful state management: you must track callSid (Twilio's identifier) and call_id (Retell AI's identifier) in a session map to prevent race conditions when events fire out of order.
flowchart LR
A[Caller Dials] --> B[Twilio Voice API]
B --> C[POST /voice webhook]
C --> D[Return TwiML with Stream URL]
D --> E[Twilio Opens WebSocket]
E --> F[Your Server Bridge]
F --> G[Retell AI WebSocket]
G --> H[STT → LLM → TTS]
H --> G
G --> F
F --> E
E --> B
B --> A
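The TwiML step in this flow is easy to get subtly wrong when values are interpolated into XML by hand. The sketch below builds the <Connect><Stream> response with attribute escaping; buildStreamTwiml and escapeXml are illustrative names, and the URL in the usage note is a placeholder:

```javascript
// Escape the five XML special characters so values are safe inside attributes.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Hypothetical helper: builds the <Connect><Stream> TwiML response.
function buildStreamTwiml(streamUrl, params = {}) {
  const parameters = Object.entries(params).map(
    ([name, value]) =>
      `      <Parameter name="${escapeXml(name)}" value="${escapeXml(value)}" />`
  );
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<Response>',
    '  <Connect>',
    `    <Stream url="${escapeXml(streamUrl)}">`,
    ...parameters,
    '    </Stream>',
    '  </Connect>',
    '</Response>',
  ].join('\n');
}
```

For example, buildStreamTwiml('wss://example.com/media/abc', { callSid: 'CA123' }) yields a response with one Parameter element. The twilio package also ships a VoiceResponse builder that produces equivalent XML, which may be preferable to string assembly.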
Latency breakdown: Twilio audio capture (20–40ms) + network to your server (20–100ms) + Retell AI processing (500–2000ms for STT+LLM+TTS) + network back to Twilio (20–100ms) = 560–2240ms total. Mobile networks add 100–400ms jitter. Deploy your server in the same AWS region as Retell AI (us-west-2) to shave 80–120ms off cross-region latency.
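To keep the budget honest as stages change, the sum can be computed rather than done by hand. A small sketch using the ranges quoted above; the stage list is the only input, so swapping in your own measurements is a one-line change:

```javascript
// Stage ranges in milliseconds, copied from the breakdown above.
const stages = [
  { name: 'Twilio audio capture', min: 20, max: 40 },
  { name: 'Network to your server', min: 20, max: 100 },
  { name: 'Retell AI processing (STT+LLM+TTS)', min: 500, max: 2000 },
  { name: 'Network back to Twilio', min: 20, max: 100 },
];

// Sum the best-case and worst-case latency across all stages.
function totalLatency(list) {
  return list.reduce(
    (acc, stage) => ({ min: acc.min + stage.min, max: acc.max + stage.max }),
    { min: 0, max: 0 }
  );
}

const total = totalLatency(stages);
console.log(`${total.min}-${total.max}ms`); // 560-2240ms
```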
Copy-paste setup
This configuration object goes in your Retell AI agent creation call. Every key matters—mismatched audio encoding causes garbled output, wrong sample rate drops packets, missing audio_websocket_protocol breaks the handshake.
const agentConfig = {
  // LLM and voice configuration
  llm_websocket_url: process.env.LLM_WEBSOCKET_URL,
  voice_id: "11labs-voice-id", // Or "openai-voice-id" depending on provider
  agent_name: "Support Agent",
  language: "en-US",
  response_engine: {
    type: "retell-llm",
    llm_id: process.env.RETELL_LLM_ID
  },

  // CRITICAL: Twilio compatibility settings
  audio_encoding: "mulaw", // Twilio only accepts mulaw, not PCM
  audio_websocket_protocol: "twilio", // Enables Twilio-specific handshake
  sample_rate: 8000, // Twilio uses 8kHz, not 16kHz

  // Conversation behavior tuning
  enable_backchannel: true, // "mm-hmm" acknowledgments during user speech
  ambient_sound: "office", // Background noise to prevent dead air
  interruption_sensitivity: 0.7, // 0.3 = slow barge-in, 0.9 = hair-trigger
  responsiveness: 0.8, // Higher = faster replies, more interruptions
  end_call_after_silence_ms: 10000, // Hang up after 10s of silence

  // Optional: Custom prompts and tools
  begin_message: "Hello, how can I help you today?",
  general_prompt: "You are a helpful customer service agent.",
  general_tools: []
};
Tradeoffs: interruption_sensitivity at 0.7 catches most real interruptions but triggers false positives on background noise (dogs barking, car horns). Lower to 0.5 for noisy environments. responsiveness at 0.8 feels snappy but the agent may cut off users who pause mid-sentence. Drop to 0.6 for elderly callers or non-native speakers. enable_backchannel sounds natural but conflicts with Twilio's echo cancellation on some carriers—disable if users report hearing themselves.
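One way to keep these tradeoffs manageable is to express them as named overrides on top of the defaults. The profile names below (noisy, patient, echo) are invented for illustration; only the numeric values come from the tradeoffs above:

```javascript
// Hypothetical profile names; the values come from the tradeoffs described above.
const TUNING_PROFILES = {
  default: { interruption_sensitivity: 0.7, responsiveness: 0.8, enable_backchannel: true },
  noisy: { interruption_sensitivity: 0.5 },  // background noise triggers false barge-ins
  patient: { responsiveness: 0.6 },          // callers who pause mid-sentence
  echo: { enable_backchannel: false },       // users report hearing themselves
};

// Merge the named profiles, in order, over the default tuning values.
function tuningFor(...profiles) {
  return profiles.reduce(
    (config, name) => ({ ...config, ...TUNING_PROFILES[name] }),
    { ...TUNING_PROFILES.default }
  );
}
```

Spreading the result over the base config, as in { ...agentConfig, ...tuningFor('noisy', 'patient') }, keeps per-deployment tuning separate from the Twilio compatibility settings.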
Edge cases
Race condition on WebSocket open: Twilio sends media events 50–150ms before Retell AI's WebSocket handshake completes. Without buffering, you lose the first 2–3 audio chunks, truncating the caller's opening words ("Hello?" becomes "lo?"). Fix: queue incoming audio in an array until retellWs.readyState === WebSocket.OPEN, then flush the buffer.
const audioBuffer = [];

twilioWs.on('message', (data) => {
  const msg = JSON.parse(data);
  if (msg.event === 'media') {
    if (retellWs.readyState === WebSocket.OPEN) {
      retellWs.send(msg.media.payload);
    } else {
      audioBuffer.push(msg.media.payload); // Queue until ready
    }
  }
});

retellWs.on('open', () => {
  while (audioBuffer.length > 0) retellWs.send(audioBuffer.shift());
});
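The array above grows without limit if the Retell AI connection never opens. A bounded queue is one way to cap that. This is a sketch, not part of either SDK; FrameQueue and its 250-frame default (about 5 seconds at 20ms per frame) are assumptions:

```javascript
// Hypothetical helper: a FIFO that drops the oldest frames beyond a cap.
class FrameQueue {
  constructor(maxFrames = 250) { // 250 frames x 20ms = 5s of audio
    this.maxFrames = maxFrames;
    this.frames = [];
    this.dropped = 0;
  }

  // Queue a frame, discarding the oldest one if the cap is exceeded.
  push(payload) {
    this.frames.push(payload);
    if (this.frames.length > this.maxFrames) {
      this.frames.shift();
      this.dropped += 1;
    }
  }

  // Send every queued frame in order, leave the queue empty, return the count.
  flush(send) {
    const pending = this.frames;
    this.frames = [];
    pending.forEach((payload) => send(payload));
    return pending.length;
  }
}
```

In the handler above, queue.push(msg.media.payload) would replace audioBuffer.push(...), and queue.flush((p) => retellWs.send(p)) would replace the while loop. A non-zero dropped counter is a useful signal that the upstream handshake is too slow.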
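Barge-in overlap: User interrupts mid-sentence but TTS audio is still queued, both in any outbound buffer on your server and in Twilio's own playback buffer. Retell AI sends an interrupt event, but if you don't discard that queued audio immediately, old audio plays after the interrupt. Fix: on every interrupt event, drop any agent audio you have not yet forwarded and send a clear message on Twilio's WebSocket, which empties the audio Twilio has buffered for playback. Response time must be < 100ms or users hear overlap.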
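A minimal sketch of that interrupt handling, written against a generic socket so it can be tested without a live call. handleInterrupt and the outbound array are illustrative; the clear message shape follows Twilio's Media Streams protocol:

```javascript
// Builds the message that asks Twilio to discard audio it has buffered for playback.
function buildClearMessage(streamSid) {
  return JSON.stringify({ event: 'clear', streamSid });
}

// Hypothetical handler: `twilioWs` is any object with a send() method,
// `outbound` is an array of synthesized chunks not yet sent to Twilio.
function handleInterrupt(twilioWs, streamSid, outbound) {
  outbound.length = 0;                         // drop unsent agent audio
  twilioWs.send(buildClearMessage(streamSid)); // drop audio Twilio already holds
}
```

Emptying the local array before sending clear means no stale chunk can be forwarded between the two steps.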
Audio format mismatch: Default Retell AI configs use PCM 16kHz. Twilio's Media Streams API only accepts mulaw 8kHz. Symptom: the agent responds with "I didn't catch that" on every turn because the decoder fails silently. Fix: set audio_encoding: "mulaw" and sample_rate: 8000 in agentConfig.
Webhook signature validation skipped: Without validating X-Twilio-Signature, attackers can POST fake CallSid values to your /voice endpoint and rack up thousands of Retell AI sessions. Fix: use twilio.validateRequest(authToken, signature, url, body) before processing any webhook.
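For readers curious what that validation checks, the sketch below reproduces the scheme Twilio documents for form-encoded POST webhooks using only Node's crypto module. It is meant for understanding and testing; in production the twilio package's validateRequest is the safer choice. Function names here are illustrative:

```javascript
const nodeCrypto = require('node:crypto');

// Signature scheme for form-encoded POST webhooks: start with the full URL,
// append each param name and value (sorted by name), HMAC-SHA1 with the
// auth token, then base64-encode the digest.
function computeTwilioSignature(authToken, url, params) {
  const data = Object.keys(params)
    .sort()
    .reduce((acc, key) => acc + key + params[key], url);
  return nodeCrypto.createHmac('sha1', authToken).update(data, 'utf8').digest('base64');
}

// Constant-time comparison against the X-Twilio-Signature header value.
function isValidTwilioSignature(authToken, header, url, params) {
  const expected = Buffer.from(computeTwilioSignature(authToken, url, params));
  const received = Buffer.from(header || '');
  return expected.length === received.length &&
    nodeCrypto.timingSafeEqual(expected, received);
}
```

The URL must match exactly what Twilio requested, including scheme and host. Behind a proxy or ngrok, a mismatch here is a common cause of valid requests being rejected.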
WebSocket timeout after 4 hours: Twilio closes idle WebSockets after 4 hours. Long support calls hit this limit. Fix: send keepalive pings every 30 seconds: setInterval(() => twilioWs.ping(), 30000).
The whole thing in one file
This is the complete production server. Copy this entire block, set environment variables, and run node server.js. It handles Twilio's /voice webhook, WebSocket bridging, session state tracking, and graceful cleanup.
const express = require('express');
const WebSocket = require('ws');
const twilio = require('twilio');

const app = express();
app.use(express.urlencoded({ extended: false }));

const RETELL_API_KEY = process.env.RETELL_API_KEY;
const TWILIO_ACCOUNT_SID = process.env.TWILIO_ACCOUNT_SID;
const TWILIO_AUTH_TOKEN = process.env.TWILIO_AUTH_TOKEN;

// Session state tracking - prevents race conditions
const activeSessions = new Map();

// Twilio voice webhook - initiates call
app.post('/voice', async (req, res) => {
  const callSid = req.body.CallSid;
  const from = req.body.From;

  // Validate Twilio signature to prevent spoofed calls
  const signature = req.headers['x-twilio-signature'];
  const url = `https://${req.headers.host}${req.url}`;
  if (!twilio.validateRequest(TWILIO_AUTH_TOKEN, signature, url, req.body)) {
    return res.status(403).send('Invalid signature');
  }

  try {
    // Create Retell AI agent session (global fetch requires Node.js 18+)
    const response = await fetch('https://api.retellai.com/v2/create-web-call', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${RETELL_API_KEY}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        agent_id: process.env.RETELL_AGENT_ID,
        audio_websocket_protocol: 'twilio',
        audio_encoding: 'mulaw',
        sample_rate: 8000,
        metadata: { callSid, from }
      })
    });
    if (!response.ok) throw new Error(`Retell API error: ${response.status}`);
    const { call_id, access_token } = await response.json();

    // Store session to prevent duplicate processing
    activeSessions.set(callSid, { call_id, audioBuffer: [] });

    // Return TwiML with WebSocket stream
    const twiml = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://${req.headers.host}/media/${call_id}">
      <Parameter name="access_token" value="${access_token}" />
      <Parameter name="callSid" value="${callSid}" />
    </Stream>
  </Connect>
</Response>`;
    res.type('text/xml').send(twiml);
  } catch (error) {
    console.error('Voice webhook error:', error);
    // Twilio only executes TwiML from a successful response, so reply 200 with XML
    res.type('text/xml').send('<Response><Say>Service unavailable</Say></Response>');
  }
});

// WebSocket bridge - handles bidirectional audio
const wss = new WebSocket.Server({ noServer: true });
wss.on('connection', (ws, req) => {
  const call_id = req.url.split('/').pop();
  let streamSid = null;
  let callSid = null;
  let session = null;
  let retellWs = null;

  // Wrap a base64 mulaw payload in the audio message sent to Retell AI
  const toRetellAudio = (payload) => JSON.stringify({
    type: 'audio',
    audio_encoding: 'mulaw',
    sample_rate: 8000,
    data: payload
  });

  // Connect to Retell AI WebSocket and wire Retell AI → Twilio (outgoing audio)
  const connectRetell = (access_token) => {
    retellWs = new WebSocket(`wss://api.retellai.com/audio-websocket/${call_id}`, {
      headers: { 'Authorization': `Bearer ${access_token}` }
    });

    // Flush buffered audio once Retell WebSocket opens
    retellWs.on('open', () => {
      console.log(`[${Date.now()}] Retell WebSocket opened for ${call_id}`);
      while (session.audioBuffer.length > 0) {
        retellWs.send(toRetellAudio(session.audioBuffer.shift()));
      }
    });

    retellWs.on('message', (data) => {
      const retellMsg = JSON.parse(data);
      if (retellMsg.type === 'audio' && ws.readyState === WebSocket.OPEN) {
        // Forward synthesized audio back to Twilio
        ws.send(JSON.stringify({
          event: 'media',
          streamSid: streamSid,
          media: { payload: retellMsg.data }
        }));
      }
      if (retellMsg.type === 'interrupt' && ws.readyState === WebSocket.OPEN) {
        // Barge-in: drop queued audio and tell Twilio to discard buffered playback
        session.audioBuffer.length = 0;
        ws.send(JSON.stringify({ event: 'clear', streamSid: streamSid }));
        console.log(`[${Date.now()}] Barge-in detected - buffers flushed`);
      }
      if (retellMsg.type === 'call_ended') {
        ws.close(); // the close handler below removes the session
      }
    });

    retellWs.on('error', (err) => console.error('Retell WS error:', err));
  };

  // Twilio → Retell AI (incoming audio)
  ws.on('message', (data) => {
    const msg = JSON.parse(data);
    if (msg.event === 'start') {
      // Twilio delivers <Parameter> values in the start message, not in the URL query string
      streamSid = msg.start.streamSid;
      callSid = msg.start.callSid;
      const { access_token } = msg.start.customParameters || {};
      session = activeSessions.get(callSid) || { call_id, audioBuffer: [] };
      console.log(`[${Date.now()}] Stream started: ${streamSid}`);
      connectRetell(access_token);
    }
    if (msg.event === 'media' && retellWs) {
      if (retellWs.readyState === WebSocket.OPEN) {
        // Forward mulaw audio chunks to Retell AI
        retellWs.send(toRetellAudio(msg.media.payload));
      } else if (retellWs.readyState === WebSocket.CONNECTING) {
        // Buffer audio until Retell WebSocket opens
        session.audioBuffer.push(msg.media.payload);
      }
    }
    if (msg.event === 'stop') {
      console.log(`[${Date.now()}] Stream stopped: ${streamSid}`);
      if (retellWs) retellWs.close();
    }
  });

  // Error handling - prevents zombie connections
  ws.on('error', (err) => console.error('Twilio WS error:', err));

  // Keepalive to prevent 4-hour timeout
  const keepalive = setInterval(() => {
    if (ws.readyState === WebSocket.OPEN) ws.ping();
  }, 30000);

  // Cleanup: stop the keepalive, close the Retell side, and drop session state
  ws.on('close', () => {
    clearInterval(keepalive);
    if (retellWs && (retellWs.readyState === WebSocket.OPEN ||
        retellWs.readyState === WebSocket.CONNECTING)) {
      retellWs.close();
    }
    if (callSid) activeSessions.delete(callSid);
  });
});
// Upgrade HTTP to WebSocket
const server = app.listen(process.env.PORT || 3000, () => {
  console.log(`Server running on port ${server.address().port}`);
});

server.on('upgrade', (request, socket, head) => {
  if (request.url.startsWith('/media/')) {
    wss.handleUpgrade(request, socket, head, (ws) => {
      wss.emit('connection', ws, request);
    });
  } else {
    socket.destroy();
  }
});
Run it: Install dependencies with npm install express ws twilio. Set environment variables: export RETELL_API_KEY="your_key", export RETELL_AGENT_ID="agent_xxx", export TWILIO_ACCOUNT_SID="ACxxx", export TWILIO_AUTH_TOKEN="your_token". For local testing, run ngrok http 3000 and copy the HTTPS URL. In Twilio Console, configure your phone number's Voice webhook to https://your-ngrok-url.ngrok.io/voice. Call your Twilio number—the AI agent answers immediately.
Production deployment: Replace ngrok with a real domain (Railway, Render, Fly.io all work). Add session cleanup with TTL expiration: setTimeout(() => activeSessions.delete(callSid), 3600000) to prevent memory leaks on abandoned calls. Implement retry logic for Retell API failures with exponential backoff. Monitor activeSessions.size—if it grows unbounded, you have a cleanup bug in your call_ended handler.
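The retry advice can be sketched as a small wrapper. The delays (250ms base, 4s cap) and the withRetry name are assumptions chosen for illustration; a voice webhook has to answer before Twilio gives up on the request, so the retry count should stay small:

```javascript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelay(attempt, baseMs = 250, capMs = 4000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Default sleep; injectable so tests do not have to wait on real timers.
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical wrapper: runs `fn`, retrying up to `retries` more times on failure.
async function withRetry(fn, { retries = 3, sleep = defaultSleep } = {}) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await fn(attempt);
    } catch (err) {
      if (attempt >= retries) throw err; // out of retries: surface the last error
      await sleep(backoffDelay(attempt));
    }
  }
}
```

Wrapping the Retell API request in withRetry(() => fetch(...)) only retries on thrown errors, so a non-2xx response would need to be turned into an exception first, as the server above already does with its response.ok check.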
Common questions
How does Retell AI handle real-time audio streaming with Twilio?
Retell AI connects via WebSocket to your server, which bridges Twilio's Media Streams API. Twilio sends mulaw 8kHz audio chunks every 20ms. Your server forwards these to Retell AI's WebSocket at wss://api.retellai.com/audio-websocket/{call_id}. Retell AI processes STT → LLM → TTS internally and returns synthesized audio in the same mulaw format. Your server streams this back to Twilio, which plays it to the caller. The streamSid from Twilio and call_id from Retell AI must be tracked in a session map to prevent race conditions.
What's the difference between Retell AI's voice synthesis and Twilio's TTS?
Retell AI handles all voice synthesis internally via its configured voice_id and response_engine. In this setup Twilio does no synthesis of its own; it only carries audio. Avoid Twilio's <Say> tag in the TwiML that starts a Media Stream, because it competes with Retell AI's audio output; the server above uses <Say> only in its error path, where no stream is started. Retell AI owns the entire voice pipeline (transcription, LLM reasoning, TTS), while Twilio is purely the transport layer for phone calls.
Why does audio sometimes cut off mid-sentence when the user interrupts?
Barge-in requires coordinating Twilio's audio stream, Retell AI's VAD, and any agent audio still queued for playback. If interruption_sensitivity is too low (around 0.3), Retell AI won't detect the user's speech quickly enough; raise it to 0.5–0.7. More critically, when Retell AI sends an interrupt event, you must discard queued agent audio immediately: drop anything you have not yet forwarded and send Twilio a clear message so it empties its playback buffer. If old TTS audio is still queued, it plays after the interrupt, creating overlap.
What latency should I expect end-to-end?
Typical breakdown: Twilio audio capture (20–40ms) + network to your server (20–100ms) + Retell AI STT processing (200–600ms) + LLM inference (500–2000ms) + TTS synthesis (300–800ms) + network back to Twilio (20–100ms) = 1060–3640ms, or roughly 1.1–3.6 seconds total. This is a more conservative estimate than the 560–2240ms figure earlier, which treats STT, LLM, and TTS as a single 500–2000ms stage. Mobile networks add 100–400ms jitter. To reduce perceived latency, set responsiveness: 0.8 in agentConfig and deploy your server in the same AWS region as Retell AI (us-west-2) to shave 80–120ms off cross-region latency.
How many concurrent calls can one server handle?
Each call requires two WebSocket connections (Twilio + Retell AI) and 2–5MB of memory for buffers and session state. A single Node.js process can handle 50–200 concurrent calls depending on LLM latency and server specs. Beyond that, implement connection pooling and horizontal scaling. Monitor activeSessions.size—if it grows unbounded, you have a session cleanup bug (missing call_ended webhook handlers or no TTL expiration).
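For rough capacity planning, the per-call figures above can be turned into a quick estimate. This is back-of-the-envelope arithmetic only; estimateCapacity is a hypothetical helper and real limits depend on CPU, event-loop load, and payload sizes:

```javascript
// Figures from the answer above: 2 WebSocket connections and 2-5MB per call.
function estimateCapacity(concurrentCalls) {
  return {
    sockets: concurrentCalls * 2,
    memoryMb: { min: concurrentCalls * 2, max: concurrentCalls * 5 },
  };
}

console.log(estimateCapacity(200));
// { sockets: 400, memoryMb: { min: 400, max: 1000 } }
```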
Should I use Retell AI or build custom STT/LLM/TTS with Twilio?
Retell AI abstracts the entire voice AI pipeline—you configure agentConfig once and get production-grade STT, LLM orchestration, TTS, and barge-in handling. Building custom requires wiring Deepgram/Whisper for STT, OpenAI/Anthropic for LLM, ElevenLabs/PlayHT for TTS, plus custom VAD and turn-taking logic. Retell AI's latency (500–2000ms) is competitive with custom stacks because it optimizes the entire pipeline. Use Retell AI unless you need sub-500ms latency or custom audio processing (noise suppression, speaker diarization).
Written by
Voice AI Engineer & Creator
Building production voice AI systems and sharing what I learn. Focused on VAPI, LLM integrations, and real-time communication. Documenting the challenges most tutorials skip.