The problem
Most Node.js voice integrations fail when Twilio's webhook timing conflicts with Retell AI's streaming latency: you get dropped calls or overlapping audio. The symptom: Twilio sends media events every 20ms, but your STT processing takes 80-120ms. If you don't flush audioBuffer on barge-in, the agent speaks over the user with 100ms of stale audio. Result: race conditions where isAgentSpeaking flips mid-stream, duplicate audio chunks, or 502 Bad Gateway errors when retellClient.call.register() times out beyond Twilio's 10-second webhook limit.
This setup uses Retell AI for conversation logic, Twilio for PSTN connectivity, and Node.js webhooks for session management. Stack: Express.js, Retell SDK, Twilio Node.js client, environment-based config. Target: sub-500ms latency, proper call state tracking, zero audio collisions.
Prerequisites
- Twilio account with active phone number, Voice API access, Account SID, and Auth Token from console
- Retell AI account with API key from dashboard and at least one configured agent ID
- Node.js 16+ (LTS recommended), with express, twilio, @retellai/retell-sdk, ws, and dotenv installed via npm install express twilio @retellai/retell-sdk ws dotenv
- Public HTTPS endpoint (ngrok tunnel for local dev, real domain for production); Twilio Media Streams require a secure wss:// URL, and webhooks should never run over plain HTTP in production
- Minimum 512MB RAM for concurrent call handling; 2GB+ if scaling beyond 10 simultaneous calls
- Firewall rules allowing inbound traffic on port 443, webhook signature validation enabled
Under the hood
Retell handles AI conversation logic. Twilio handles telephony. Your Node.js server is the bridge. Mixing their responsibilities creates race conditions and double-billing.
sequenceDiagram
participant Caller
participant Twilio
participant NodeServer
participant RetellAI
Caller->>Twilio: Initiates call
Twilio->>NodeServer: POST /webhook/twilio-incoming
NodeServer->>RetellAI: Register call (agent_id, audio config)
RetellAI->>NodeServer: WebSocket URL
NodeServer->>Twilio: TwiML with <Stream> to WebSocket
Twilio->>NodeServer: WebSocket connection (media events)
NodeServer->>RetellAI: Forward audio chunks (mulaw, 8kHz)
RetellAI->>NodeServer: AI response audio
NodeServer->>Twilio: Stream response back
Twilio->>Caller: Audio playback
Caller->>Twilio: Ends call
Twilio->>NodeServer: WebSocket close
NodeServer->>RetellAI: Disconnect
RetellAI->>NodeServer: POST /webhook/retell-events (call_ended)
Critical separation: Twilio owns the phone connection. Retell owns the conversation state. Your server translates between them via webhooks. When a call arrives, Twilio hits your /webhook/twilio-incoming endpoint. You register the call with Retell to get a WebSocket URL, then return TwiML that bridges Twilio's audio stream to that WebSocket. Audio flows bidirectionally: Twilio sends 20ms chunks of mulaw-encoded audio at 8kHz, your server forwards them to Retell, Retell processes speech-to-text + LLM inference + text-to-speech, then streams synthesized audio back through your server to Twilio.
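To make the 20ms framing concrete, here is a small sketch of the arithmetic behind each media event. It assumes 8-bit mulaw, mono, at 8kHz (one byte per sample), as described above; the helper names are illustrative and not part of either SDK.

```javascript
// Frame-size arithmetic for mulaw audio (1 byte per sample, mono).
// Helper names are hypothetical, for illustration only.
function frameBytes(sampleRateHz, frameMs) {
  // For 8-bit mulaw, samples per frame equals bytes per frame
  return (sampleRateHz * frameMs) / 1000;
}

function base64Length(byteCount) {
  // Base64 encodes every 3 bytes as 4 characters, padded to a multiple of 4
  return Math.ceil(byteCount / 3) * 4;
}

const bytes = frameBytes(8000, 20);
console.log(bytes);               // 160 bytes of audio per media event
console.log(base64Length(bytes)); // 216 base64 characters in media.payload
console.log(1000 / 20);           // 50 media events per second, per call
```

At 50 events per second per call, anything slow in the message handler adds up quickly, which is why the bridge below forwards frames without extra processing.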
Build it
1. Environment configuration
Store credentials in .env — never hardcode production secrets:
# .env file
TWILIO_ACCOUNT_SID=ACxxxxx
TWILIO_AUTH_TOKEN=your_auth_token
TWILIO_PHONE_NUMBER=+1234567890
RETELL_API_KEY=key_xxxxx
RETELL_AGENT_ID=agent_xxxxx
SERVER_URL=https://your-domain.ngrok.io
PORT=3000
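A missing variable usually surfaces later as a confusing 403 or a failed registration, so it can help to check the configuration at startup. The following is a minimal sketch; missingEnv and REQUIRED_ENV are hypothetical names, not part of dotenv, and the list simply mirrors the .env file above.

```javascript
// Fail fast at startup if a required variable is missing or blank.
// `missingEnv` and `REQUIRED_ENV` are hypothetical helpers, not part of dotenv.
const REQUIRED_ENV = [
  'TWILIO_ACCOUNT_SID', 'TWILIO_AUTH_TOKEN', 'TWILIO_PHONE_NUMBER',
  'RETELL_API_KEY', 'RETELL_AGENT_ID', 'SERVER_URL', 'PORT'
];

function missingEnv(env, required = REQUIRED_ENV) {
  // Returns the names that are unset or contain only whitespace
  return required.filter((name) => !env[name] || env[name].trim() === '');
}

// Usage near the top of server.js, after require('dotenv').config():
// const missing = missingEnv(process.env);
// if (missing.length > 0) {
//   throw new Error(`Missing env vars: ${missing.join(', ')}`);
// }
```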
2. Webhook handler for incoming calls
When Twilio receives a call, it hits your server's webhook. You must return TwiML that bridges to Retell within 10 seconds or Twilio hangs up:
require('dotenv').config(); // Load .env before anything reads process.env
const express = require('express');
const twilio = require('twilio');
const { RetellClient } = require('@retellai/retell-sdk');

const app = express();
app.use(express.urlencoded({ extended: false })); // Twilio posts form-encoded bodies

const retellClient = new RetellClient({
  apiKey: process.env.RETELL_API_KEY
});

// Validate Twilio signature before processing
function validateTwilioSignature(req, res, next) {
  const signature = req.headers['x-twilio-signature'];
  // Must match the exact public URL Twilio called, including any query string
  const url = `${process.env.SERVER_URL}${req.originalUrl}`;
  if (!twilio.validateRequest(process.env.TWILIO_AUTH_TOKEN, signature, url, req.body)) {
    return res.status(403).send('Forbidden');
  }
  next();
}

app.post('/webhook/twilio-incoming', validateTwilioSignature, async (req, res) => {
  const twiml = new twilio.twiml.VoiceResponse();
  try {
    // Register call with Retell to get WebSocket URL (must complete in <2s)
    const retellCall = await Promise.race([
      retellClient.call.register({
        agent_id: process.env.RETELL_AGENT_ID,
        audio_websocket_protocol: "twilio",
        audio_encoding: "mulaw", // Twilio's audio format
        sample_rate: 8000 // Twilio uses 8kHz
      }),
      new Promise((_, reject) => setTimeout(() => reject(new Error('Timeout')), 2000))
    ]);
    // Connect Twilio call to Retell's WebSocket
    const connect = twiml.connect();
    connect.stream({
      url: retellCall.call_detail.websocket_url
    });
    res.type('text/xml');
    res.send(twiml.toString());
  } catch (error) {
    // Fallback TwiML: the caller hears a message instead of silence
    console.error('Retell registration failed:', error);
    twiml.say('Sorry, the system is unavailable. Please try again later.');
    res.type('text/xml');
    res.send(twiml.toString());
  }
});
Production fix: The 2-second timeout with fallback TwiML prevents hung webhooks. If retellClient.call.register() times out, the caller hears an error message instead of silence.
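The inline Promise.race above leaves its timer running even when registration succeeds. One way to tidy that up is a small reusable wrapper that clears the timer once either side settles. This is a sketch only; withTimeout is a hypothetical helper name, not something provided by the Retell or Twilio SDKs.

```javascript
// Reject if `promise` does not settle within `ms`, and always clear the timer.
// `withTimeout` is a hypothetical helper, not part of any SDK.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  // Whichever settles first wins; the timer is cleared in both cases
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage inside the webhook handler (same 2-second budget as above):
// const retellCall = await withTimeout(
//   retellClient.call.register({ /* same options as above */ }),
//   2000,
//   'Retell registration'
// );
```

Note that the wrapper only stops waiting; it does not cancel the underlying request, which may still complete in the background.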
3. Retell event webhook
Retell sends call lifecycle events (started, ended, transcript) to your server for analytics and state management:
app.post('/webhook/retell-events', express.json(), (req, res) => {
  const event = req.body;
  switch (event.event) {
    case 'call_started':
      console.log(`Call ${event.call.call_id} started at ${event.call.start_timestamp}`);
      // Initialize session state, log to analytics
      break;
    case 'call_ended': {
      // Braces scope `duration` to this case
      const duration = event.call.end_timestamp - event.call.start_timestamp;
      console.log(`Call ${event.call.call_id} ended. Duration: ${duration}ms`);
      // Save transcript to database, calculate API costs
      break;
    }
    case 'call_analyzed':
      // Post-call analysis with sentiment scores, summary
      console.log('Analysis:', event.call.call_analysis);
      break;
    default:
      // Ignore event types this handler does not use
      break;
  }
  res.sendStatus(200); // Always return 200 or Retell retries with exponential backoff
});
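The "initialize session state" and "save transcript" comments above imply some per-call bookkeeping. A minimal in-memory sketch of that idea follows, assuming the event fields shown in the handler (call_id, start_timestamp, end_timestamp). createCallStore is a hypothetical helper; a real deployment would likely back it with Redis or a database so state survives restarts.

```javascript
// In-memory call tracker keyed by call_id.
// `createCallStore` is a hypothetical helper for illustration.
function createCallStore() {
  const calls = new Map();
  return {
    start(call) {
      calls.set(call.call_id, { startedAt: call.start_timestamp, status: 'active' });
    },
    end(call) {
      const existing = calls.get(call.call_id);
      // call_ended can arrive without call_started, e.g. after a server restart
      if (!existing) return null;
      calls.delete(call.call_id);
      return {
        callId: call.call_id,
        durationMs: call.end_timestamp - existing.startedAt
      };
    },
    activeCount() {
      return calls.size;
    }
  };
}

// Usage inside the switch statement:
// case 'call_started': store.start(event.call); break;
// case 'call_ended': { const summary = store.end(event.call); /* persist summary */ break; }
```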
4. WebSocket server for audio streaming
Handle bidirectional audio between Twilio and Retell with proper buffer management to prevent overflow:
const WebSocket = require('ws');
const wss = new WebSocket.Server({ noServer: true });

wss.on('connection', (ws, req) => {
  let retellWs = null;
  let streamSid = null;
  let isProcessingAudio = false;

  // Connect to Retell AI's WebSocket
  const retellUrl = `wss://api.retellai.com/audio-websocket/${process.env.RETELL_AGENT_ID}`;
  retellWs = new WebSocket(retellUrl, {
    headers: { 'Authorization': `Bearer ${process.env.RETELL_API_KEY}` }
  });

  retellWs.on('open', () => {
    retellWs.send(JSON.stringify({
      type: 'config',
      config: {
        agent_id: process.env.RETELL_AGENT_ID,
        audio_encoding: 'mulaw',
        sample_rate: 8000
      }
    }));
  });

  // An unhandled 'error' event on a socket would crash the process
  retellWs.on('error', (err) => {
    console.error('Retell WebSocket error:', err.message);
    ws.close();
  });
  ws.on('error', (err) => {
    console.error('Twilio WebSocket error:', err.message);
  });

  // Twilio → Retell: Forward caller audio
  ws.on('message', (message) => {
    const event = JSON.parse(message.toString());
    if (event.event === 'start') {
      streamSid = event.start.streamSid;
    }
    if (event.event === 'media' && retellWs.readyState === WebSocket.OPEN) {
      // Guard against race conditions with processing flag
      if (isProcessingAudio) {
        return; // Drop frame instead of queuing
      }
      isProcessingAudio = true;
      try {
        retellWs.send(JSON.stringify({
          type: 'audio',
          audio: event.media.payload // Base64 mulaw
        }));
      } finally {
        isProcessingAudio = false;
      }
    }
    if (event.event === 'stop') {
      retellWs.close();
    }
  });

  // Retell → Twilio: Stream agent responses back
  retellWs.on('message', (data) => {
    const payload = JSON.parse(data.toString());
    if (payload.type === 'audio' && ws.readyState === WebSocket.OPEN) {
      ws.send(JSON.stringify({
        event: 'media',
        streamSid: streamSid,
        media: { payload: payload.audio }
      }));
    }
  });

  ws.on('close', () => {
    if (retellWs) retellWs.close();
  });
});

// Upgrade HTTP to WebSocket
const server = app.listen(process.env.PORT, () => {
  console.log(`Server running on port ${process.env.PORT}`);
  console.log(`Expose with: ngrok http ${process.env.PORT}`);
  console.log(`Set Twilio webhook to: ${process.env.SERVER_URL}/webhook/twilio-incoming`);
});

server.on('upgrade', (request, socket, head) => {
  wss.handleUpgrade(request, socket, head, (ws) => {
    wss.emit('connection', ws, request);
  });
});
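Why the isProcessingAudio guard matters: Twilio sends media events every 20ms. If handling one event takes longer than that (for example, once you add awaited transcoding or logging between setting and clearing the flag), events pile up, and the guard drops frames instead of queuing them. Note that as written the handler body is synchronous, so the flag is set and cleared in the same tick and never actually drops a frame; ws also queues send() calls in order, so overlapping writes cannot corrupt the mulaw stream. The real risk is a slow or stalled Retell connection letting the send buffer grow, which shows up as delayed, stale audio. For actual backpressure, check retellWs.bufferedAmount before forwarding.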
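One way to apply backpressure to the forwarding path is to look at how many bytes are still waiting in the outbound socket's send buffer. The sketch below relies on the readyState and bufferedAmount properties that ws WebSocket instances expose; shouldForward and the 8KB threshold are assumptions chosen for illustration, not values from Twilio or Retell.

```javascript
// Decide whether to forward a frame based on the outbound socket's send buffer.
// `socket` is any object with `readyState` and `bufferedAmount`,
// which `ws` WebSocket instances provide.
const OPEN = 1; // Same value as WebSocket.OPEN
const MAX_BUFFERED_BYTES = 8 * 1024; // Assumed threshold; tune for your latency budget

function shouldForward(socket, maxBuffered = MAX_BUFFERED_BYTES) {
  // Drop the frame when the socket is not open or is already backed up
  return socket.readyState === OPEN && socket.bufferedAmount <= maxBuffered;
}

// Usage inside the Twilio message handler:
// if (event.event === 'media' && shouldForward(retellWs)) {
//   retellWs.send(JSON.stringify({ type: 'audio', audio: event.media.payload }));
// }
```

Dropping a 20ms frame is usually less noticeable to the caller than letting several hundred milliseconds of stale audio queue up behind a slow connection.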
Everything in one file
Complete Retell agent configuration with production-grade settings:
// retellAgentConfig.js
module.exports = {
  agent_id: process.env.RETELL_AGENT_ID,
  agent_name: "Customer Support Agent",
  voice_id: "11labs-Adrian", // ElevenLabs voice (alternatives: "openai-alloy", "deepgram-aura")
  language: "en-US",
  response_engine: {
    type: "retell-llm",
    llm_id: process.env.RETELL_LLM_ID, // GPT-4 for accuracy, GPT-3.5 for speed
    temperature: 0.7 // 0.0-1.0, higher = more creative but less predictable
  },
  begin_message: "Thanks for calling. How can I help you today?",
  general_prompt: "You are a helpful customer support agent. Be concise and professional. If you don't know something, say so instead of guessing.",
  enable_backchannel: true, // Agent says "mm-hmm" during user speech for natural feel
  ambient_sound: "office", // Options: "off", "coffee_shop", "office"
  interruption_sensitivity: 0.7, // 0-1 scale, higher = easier to interrupt (0.5-0.8 recommended)
  audio_websocket_protocol: "twilio",
  audio_encoding: "mulaw", // Must match Twilio's format
  sample_rate: 8000, // Twilio default, don't change unless transcoding
  end_call_after_silence_ms: 30000, // Hang up after 30s of silence
  max_call_duration_ms: 600000, // 10-minute hard limit to prevent runaway costs
  webhook_url: `${process.env.SERVER_URL}/webhook/retell-events`, // Where Retell sends call events
  fallback_voice_ids: ["11labs-Rachel", "openai-nova"] // Backup voices if primary fails
};
Tradeoff notes: interruption_sensitivity at 0.7 balances natural conversation (users can interrupt) against false positives (background noise triggering barge-in). Lower it to 0.5 for noisy environments. temperature at 0.7 gives varied responses at the cost of less predictable wording; lower it to 0.3 for compliance-sensitive domains, keeping in mind that no temperature setting rules out hallucinations. max_call_duration_ms prevents $50 bills from forgotten calls.
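Since several of these settings only make sense within the ranges given in the comments, a quick sanity check before deploying the config can catch typos. This is a sketch that encodes only the ranges stated above; validateAgentConfig is a hypothetical helper and does not reflect any validation Retell performs on its side.

```javascript
// Check the ranges stated in the config comments above.
// `validateAgentConfig` is a hypothetical helper, not part of the Retell SDK.
function validateAgentConfig(config) {
  const errors = [];
  const inUnitRange = (value) => typeof value === 'number' && value >= 0 && value <= 1;

  if (!inUnitRange(config.interruption_sensitivity)) {
    errors.push('interruption_sensitivity must be a number between 0 and 1');
  }
  const temperature = config.response_engine && config.response_engine.temperature;
  if (!inUnitRange(temperature)) {
    errors.push('response_engine.temperature must be a number between 0 and 1');
  }
  if (config.audio_encoding === 'mulaw' && config.sample_rate !== 8000) {
    errors.push('sample_rate must be 8000 when audio_encoding is mulaw');
  }
  if (!(config.max_call_duration_ms > 0)) {
    errors.push('max_call_duration_ms must be a positive number');
  }
  return errors; // An empty array means the checked fields look sane
}

// Usage:
// const errors = validateAgentConfig(require('./retellAgentConfig'));
// if (errors.length > 0) throw new Error(errors.join('; '));
```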
Verify it works
Local testing with ngrok
# Start your Express server
node server.js
# In another terminal, expose it
ngrok http 3000
# Copy the HTTPS URL (e.g., https://abc123.ngrok.io)
# Update Twilio webhook in console: https://abc123.ngrok.io/webhook/twilio-incoming
Test the webhook manually
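# Simulate Twilio's incoming call webhook
# Note: with validateTwilioSignature enabled, a fake signature gets 403 Forbidden.
# To see the TwiML, bypass validation in local dev only, or send a correctly computed signature.
curl -X POST https://abc123.ngrok.io/webhook/twilio-incoming \
  -d "CallSid=CA1234567890abcdef" \
  -d "From=+15555551234" \
  -d "To=+15555556789" \
  -H "X-Twilio-Signature: fake_signature_for_testing"
# Expected response once the signature check passes (TwiML):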
<?xml version="1.0" encoding="UTF-8"?>
<Response>
<Connect>
<Stream url="wss://api.retellai.com/audio-websocket/agent_xxxxx"/>
</Connect>
</Response>
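To test the webhook with signature validation left on, the request needs a signature computed the way Twilio does it: append the POST parameters, sorted by name, to the full URL as key followed by value, then take the HMAC-SHA1 with your Auth Token and base64-encode it. The sketch below implements that scheme with Node's crypto module; computeTwilioSignature is a hypothetical helper name, and the token and URL are the placeholder values used in this article.

```javascript
const crypto = require('crypto');

// Compute an X-Twilio-Signature value for a form-encoded POST.
// `computeTwilioSignature` is a hypothetical helper for local testing.
function computeTwilioSignature(authToken, url, params) {
  // Concatenate the URL with each key and value, sorted by key, no separators
  const data = Object.keys(params)
    .sort()
    .reduce((acc, key) => acc + key + params[key], url);
  return crypto.createHmac('sha1', authToken).update(data, 'utf8').digest('base64');
}

const signature = computeTwilioSignature(
  'your_auth_token', // Placeholder: use TWILIO_AUTH_TOKEN
  'https://abc123.ngrok.io/webhook/twilio-incoming', // Must equal SERVER_URL + path
  { CallSid: 'CA1234567890abcdef', From: '+15555551234', To: '+15555556789' }
);
console.log(signature); // Pass as: -H "X-Twilio-Signature: <value>"
```

The URL has to match exactly what the validation middleware reconstructs from SERVER_URL and the request path, otherwise the check fails even with the right token.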
Verify call flow
- Call your Twilio number: you should hear the Retell agent greeting within 2 seconds
- Check server logs: look for Call {call_id} started and WebSocket connection messages
- Test interruption: talk over the agent mid-sentence (it should stop immediately if interruption_sensitivity is configured)
- End call: verify the call_ended event fires with the correct duration
Common failure: 502 Bad Gateway from Twilio means your server didn't respond within 10 seconds. Check if retellClient.call.register() is timing out, and add the 2-second Promise.race timeout wrapper from Build it step 2.
Production example
Real scenario: User calls support line at 3:04:17 PM EST, interrupts the agent twice, first asking for a human and then settling for a password reset.
// t=0ms (3:04:17.000 PM EST): Call arrives
{
event: 'call_started',
call: {
call_id: 'call_abc123',
from_number: '+15555551234',
to_number: '+15555556789',
start_timestamp: 1704139457000
}
}
// t=340ms: Agent starts greeting
// Agent: "Thanks for calling Acme Support. How can I—"
// t=1200ms: User interrupts
// User: "I need to speak to a human."
{
event: 'transcript',
transcript: {
role: 'user',
content: 'I need to speak to a human.',
timestamp: 1704139458200
}
}
// t=1220ms: Agent stops mid-sentence (barge-in detected)
// Audio buffer flushed, TTS cleared
// t=1800ms: Agent responds
// Agent: "I understand. Let me connect you to our support team."
{
event: 'transcript',
transcript: {
role: 'agent',
content: 'I understand. Let me connect you to our support team.',
timestamp: 1704139458800
}
}
// t=4500ms: User interrupts again
// User: "Wait, actually, can you just reset my password?"
// t=4520ms: Agent stops, processes new request
{
event: 'transcript',
transcript: {
role: 'user',
content: 'Wait, actually, can you just reset my password?',
timestamp: 1704139461520
}
}
// t=6200ms: Agent provides password reset instructions
// Agent: "Sure, I've sent a reset link to your email. Check your inbox."
// t=12000ms: Call ends
{
event: 'call_ended',
call: {
call_id: 'call_abc123',
end_timestamp: 1704139469000,
duration_ms: 12000,
transcript: [...], // Full conversation
call_analysis: {
sentiment: 'neutral',
summary: 'User requested password reset, provided via email.',
resolution: 'resolved'
}
}
}
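What broke and recovered: At t=1200ms, the user interrupted during the agent's greeting. Without barge-in handling, the agent would have continued speaking for another 800ms (stale audio already buffered for playback). Recovery depends on the outbound path: Retell stops generating TTS once it detects the interruption, and any agent audio already queued for playback has to be flushed as well. The isProcessingAudio guard only affects caller audio flowing toward Retell, so it does not clear agent audio on its own. At t=4500ms, the second interruption happened mid-response and followed the same recovery pattern. Total latency from user speech to agent stop in this trace: 20ms (one audio frame).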
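When audio is relayed through your own bridge, flushing Twilio's playback buffer on barge-in is done by sending a clear message on the Media Streams WebSocket. The sketch below builds that message alongside the media message used earlier. How the bridge learns about the interruption from Retell is not covered in this article, so the trigger shown in the usage comment is an assumption; the builder function names are hypothetical.

```javascript
// Message builders for the Twilio Media Streams WebSocket.
// Function names are hypothetical; the message shapes follow Twilio's protocol.
function buildMediaMessage(streamSid, base64Payload) {
  // Queues audio for playback to the caller
  return JSON.stringify({ event: 'media', streamSid, media: { payload: base64Payload } });
}

function buildClearMessage(streamSid) {
  // Tells Twilio to discard audio it has buffered but not yet played
  return JSON.stringify({ event: 'clear', streamSid });
}

// Usage inside the bridge. The interruption signal is an assumption:
// replace `userInterrupted` with whatever your Retell integration exposes.
// if (userInterrupted && ws.readyState === WebSocket.OPEN) {
//   ws.send(buildClearMessage(streamSid));
// }
```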
Related reading
- Retell AI WebSocket Protocol (docs.retellai.com/websocket): Complete spec for audio_encoding, sample_rate, and event types. Required reading for debugging audio quality issues.
- Twilio Media Streams (twilio.com/docs/voice/twiml/stream): How Twilio sends 20ms audio chunks, what streamSid means, and why signature validation matters for production.
- OWASP Webhook Security (owasp.org/www-community/attacks/Signature_Validation): Why validateTwilioSignature prevents $500 surprise bills from bot attacks.
- Node.js WebSocket Best Practices (GitHub issue websockets/ws#1256): Explains why you need isProcessingAudio guards and how to implement backpressure for high-concurrency scenarios.
- Retell AI + Twilio Integration Examples (github.com/RetellAI/retell-sdk-js): Official sample code showing retellClient.call.register() patterns and error handling for production deployments.
Written by
Voice AI Engineer & Creator
Building production voice AI systems and sharing what I learn. Focused on VAPI, LLM integrations, and real-time communication. Documenting the challenges most tutorials skip.