Retell AI Twilio Integration Tutorial: Build AI Voice Calls Step-by-Step

Most Retell AI + Twilio integrations fail because developers treat them as a single system—they're not. Retell handles conversation logic; Twilio handles the phone connection. This tutorial shows you how to wire them together: configure a Retell assistant, create a Twilio phone number, connect inbound calls to Retell via webhook, and handle call state transitions. Result: production-grade AI voice calls that actually work.

Mental model

Retell AI and Twilio operate as separate layers in a voice pipeline. Twilio receives the phone call and streams raw audio over WebSocket using its Media Streams API. Your server acts as a bridge: it receives Twilio's mulaw-encoded 8kHz audio chunks every 20ms, forwards them to Retell AI's WebSocket endpoint, waits for Retell AI to process speech-to-text → LLM inference → text-to-speech, then streams the synthesized audio back to Twilio. The integration requires three simultaneous connections: caller ↔ Twilio ↔ your server ↔ Retell AI. Latency compounds at each hop, so geographic proximity matters. Twilio owns the telephony layer; Retell AI owns the conversation layer; you own the glue code that keeps both sides synchronized.
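To make the 8kHz and 20ms figures concrete, here is a small sketch of the arithmetic for one media frame. The helper name frameDurationMs is hypothetical; the only facts it relies on are that mulaw stores one byte per sample and that Twilio's payloads are base64-encoded.

```javascript
// mulaw (G.711) stores one 8-bit sample per byte, so at 8000 samples/second
// a 20 ms frame carries 8000 * 0.020 = 160 bytes of audio.
const SAMPLE_RATE_HZ = 8000;
const BYTES_PER_SAMPLE = 1;

// Hypothetical helper: how many milliseconds of audio a base64 payload holds.
function frameDurationMs(base64Payload) {
  const byteLength = Buffer.from(base64Payload, 'base64').length;
  return (byteLength * 1000) / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE);
}

// A synthetic 160-byte frame of mulaw silence (0xFF), encoded the way a
// media payload would be.
const silentFrame = Buffer.alloc(160, 0xff).toString('base64');
console.log(frameDurationMs(silentFrame)); // 20
```

A frame that reports a duration other than 20ms is a quick hint that the encoding or sample rate on one side of the bridge is not what you expect.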

What you need first

Retell AI account with API key from dashboard (used in Authorization: Bearer headers)
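Twilio account with Account SID, Auth Token, and a provisioned phone number (trial accounts can receive inbound calls for testing, but callers hear a trial notice first and outbound calls are limited to verified numbers)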
Node.js 18+ (the server below relies on the built-in global fetch) with express, ws, and twilio packages installed via npm
Public HTTPS endpoint for webhooks (ngrok for local dev, Railway/Render/Fly.io for production)
SSL certificate (ngrok provides this automatically; production deployments need Let's Encrypt or similar)
Environment variables for RETELL_API_KEY, RETELL_AGENT_ID, TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN
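
A missing variable usually surfaces as a confusing 401 or a failed signature check on the first call, so it can help to fail fast at startup. The sketch below is one possible way to do that; missingEnv and assertEnv are hypothetical helper names, and the variable list is copied from the prerequisites above.

```javascript
// Names taken from the prerequisites list.
const REQUIRED_ENV = [
  'RETELL_API_KEY',
  'RETELL_AGENT_ID',
  'TWILIO_ACCOUNT_SID',
  'TWILIO_AUTH_TOKEN',
];

// Hypothetical helper: returns the names that are missing or blank.
function missingEnv(env, required = REQUIRED_ENV) {
  return required.filter((name) => !env[name] || env[name].trim() === '');
}

// Hypothetical helper: throw at startup instead of failing on the first call.
function assertEnv(env = process.env) {
  const missing = missingEnv(env);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
}
```

Calling assertEnv() as the first line of server.js would stop the process before it starts listening.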

Under the hood

When a user dials your Twilio number, Twilio sends an HTTP POST to your /voice webhook. Your server responds with TwiML XML containing a <Connect><Stream> tag pointing to your WebSocket endpoint. Twilio opens the WebSocket and begins sending start, media, and stop events. The media events contain base64-encoded mulaw audio chunks arriving every 20ms (50 frames per second). Your server forwards these payloads to Retell AI's WebSocket at wss://api.retellai.com/audio-websocket/{call_id}. Retell AI processes the audio through its STT engine, runs the configured LLM, synthesizes speech via TTS, and returns audio chunks in the same mulaw format. Your server wraps each chunk in a Twilio media message and streams it back to Twilio, which plays it to the caller. Because both sides already speak mulaw at 8kHz, the bridge never has to transcode. The handoff requires careful state management: you must track callSid (Twilio's identifier) and call_id (Retell AI's identifier) in a session map to prevent race conditions when events fire out of order.
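
The TwiML response described above can be assembled with plain string handling. This is a minimal sketch, not the twilio library's own builder: escapeXml and buildStreamTwiml are hypothetical helpers, and the URL in the test is a placeholder. Values passed as params become <Parameter> elements, which Twilio hands to your WebSocket inside the start event rather than in the URL.

```javascript
// Escape the five XML special characters so values cannot break the document.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Hypothetical helper: build <Connect><Stream> TwiML for a WebSocket URL.
function buildStreamTwiml(wsUrl, params = {}) {
  const parameterTags = Object.entries(params)
    .map(([name, value]) =>
      `      <Parameter name="${escapeXml(name)}" value="${escapeXml(value)}" />`)
    .join('\n');
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<Response>',
    '  <Connect>',
    `    <Stream url="${escapeXml(wsUrl)}">`,
    parameterTags,
    '    </Stream>',
    '  </Connect>',
    '</Response>',
  ].filter((line) => line !== '').join('\n');
}
```

Escaping matters here because caller-controlled values such as the From number end up inside attribute values.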

mermaid

flowchart LR
    A[Caller Dials] --> B[Twilio Voice API]
    B --> C[POST /voice webhook]
    C --> D[Return TwiML with Stream URL]
    D --> E[Twilio Opens WebSocket]
    E --> F[Your Server Bridge]
    F --> G[Retell AI WebSocket]
    G --> H[STT → LLM → TTS]
    H --> G
    G --> F
    F --> E
    E --> B
    B --> A

Latency breakdown: Twilio audio capture (20–40ms) + network to your server (20–100ms) + Retell AI processing (500–2000ms for STT+LLM+TTS) + network back to Twilio (20–100ms) = 560–2240ms total. Mobile networks add 100–400ms jitter. Deploy your server in the same AWS region as Retell AI (us-west-2) to shave 80–120ms off cross-region latency.
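
The total above is just the sum of the per-stage ranges. The sketch below reproduces that arithmetic so you can swap in your own measurements; the stage names are illustrative labels, and the numbers are the ones quoted in the breakdown.

```javascript
// Each stage is [minMs, maxMs], copied from the breakdown above.
const stages = {
  twilioCapture: [20, 40],
  networkToServer: [20, 100],
  retellProcessing: [500, 2000], // STT + LLM + TTS combined
  networkToTwilio: [20, 100],
};

// Sum the lower bounds and the upper bounds separately.
function totalLatency(stageRanges) {
  return Object.values(stageRanges).reduce(
    ([minSum, maxSum], [min, max]) => [minSum + min, maxSum + max],
    [0, 0]
  );
}

const [bestCase, worstCase] = totalLatency(stages);
console.log(`${bestCase}-${worstCase}ms`); // 560-2240ms
```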

Copy-paste setup

This configuration object goes in your Retell AI agent creation call. Every key matters—mismatched audio encoding causes garbled output, wrong sample rate drops packets, missing audio_websocket_protocol breaks the handshake.

javascript

const agentConfig = {
  // LLM and voice configuration
  llm_websocket_url: process.env.LLM_WEBSOCKET_URL,
  voice_id: "11labs-voice-id", // Or "openai-voice-id" depending on provider
  agent_name: "Support Agent",
  language: "en-US",
  response_engine: {
    type: "retell-llm",
    llm_id: process.env.RETELL_LLM_ID
  },
  
  // CRITICAL: Twilio compatibility settings
  audio_encoding: "mulaw",              // Twilio only accepts mulaw, not PCM
  audio_websocket_protocol: "twilio",   // Enables Twilio-specific handshake
  sample_rate: 8000,                    // Twilio uses 8kHz, not 16kHz
  
  // Conversation behavior tuning
  enable_backchannel: true,             // "mm-hmm" acknowledgments during user speech
  ambient_sound: "office",              // Background noise to prevent dead air
  interruption_sensitivity: 0.7,        // 0.3 = slow barge-in, 0.9 = hair-trigger
  responsiveness: 0.8,                  // Higher = faster replies, more interruptions
  end_call_after_silence_ms: 10000,     // Hang up after 10s of silence
  
  // Optional: Custom prompts and tools
  begin_message: "Hello, how can I help you today?",
  general_prompt: "You are a helpful customer service agent.",
  general_tools: []
};

Tradeoffs: interruption_sensitivity at 0.7 catches most real interruptions but triggers false positives on background noise (dogs barking, car horns). Lower to 0.5 for noisy environments. responsiveness at 0.8 feels snappy but the agent may cut off users who pause mid-sentence. Drop to 0.6 for elderly callers or non-native speakers. enable_backchannel sounds natural but conflicts with Twilio's echo cancellation on some carriers—disable if users report hearing themselves.
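
One way to apply these tradeoffs is to derive overrides from what you know about the call. The function below is a hypothetical sketch: tuneForCaller and its three flags are made-up names, and the values are the ones suggested in the paragraph above.

```javascript
// Hypothetical helper: derive tuning overrides from call conditions.
function tuneForCaller(
  baseConfig,
  { noisyEnvironment = false, needsLongerPauses = false, echoReported = false } = {}
) {
  const tuned = { ...baseConfig }; // never mutate the shared defaults
  if (noisyEnvironment) tuned.interruption_sensitivity = 0.5;
  if (needsLongerPauses) tuned.responsiveness = 0.6;
  if (echoReported) tuned.enable_backchannel = false;
  return tuned;
}

const defaults = {
  interruption_sensitivity: 0.7,
  responsiveness: 0.8,
  enable_backchannel: true,
};
console.log(tuneForCaller(defaults, { noisyEnvironment: true }));
```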

Edge cases

Race condition on WebSocket open: Twilio sends media events 50–150ms before Retell AI's WebSocket handshake completes. Without buffering, you lose the first 2–3 audio chunks, truncating the caller's opening words ("Hello?" becomes "lo?"). Fix: queue incoming audio in an array until retellWs.readyState === WebSocket.OPEN, then flush the buffer.

javascript

const audioBuffer = [];
twilioWs.on('message', (data) => {
  const msg = JSON.parse(data);
  if (msg.event === 'media') {
    if (retellWs.readyState === WebSocket.OPEN) {
      retellWs.send(msg.media.payload);
    } else {
      audioBuffer.push(msg.media.payload); // Queue until ready
    }
  }
});
retellWs.on('open', () => {
  while (audioBuffer.length > 0) retellWs.send(audioBuffer.shift());
});
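
The array above grows without limit if the upstream socket never opens. A capped queue is one possible guard. This is a sketch under assumptions: BoundedAudioQueue is a hypothetical class, and the default cap of 250 frames (5 seconds at 50 frames per second) is an arbitrary choice, not a documented limit.

```javascript
// Hypothetical bounded FIFO for audio frames. When full, the oldest frame is
// dropped so memory stays flat if the upstream socket never opens.
class BoundedAudioQueue {
  constructor(maxFrames = 250) { // assumed cap: 5 s at 50 frames/second
    this.maxFrames = maxFrames;
    this.frames = [];
    this.dropped = 0;
  }

  push(frame) {
    if (this.frames.length >= this.maxFrames) {
      this.frames.shift();
      this.dropped += 1;
    }
    this.frames.push(frame);
  }

  // Send every queued frame in arrival order, then leave the queue empty.
  flush(send) {
    const pending = this.frames;
    this.frames = [];
    for (const frame of pending) send(frame);
    return pending.length;
  }
}
```

Dropping the oldest frames keeps the most recent speech, which is usually what the agent needs to respond to.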

Barge-in overlap: The user interrupts mid-sentence while TTS audio is still queued. Retell AI sends an interrupt event, but audio already handed to Twilio keeps playing unless you tell Twilio to drop it. Fix: on every interrupt event, discard any agent audio your server has not yet forwarded and send a clear message on Twilio's WebSocket, which empties Twilio's playback buffer. Aim to do both within 100ms or users hear overlap.
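
A sketch of that flush-on-interrupt step follows. createOutboundPlayer is a hypothetical wrapper, and it assumes your server queues agent audio before writing it to Twilio. The media and clear message shapes are Twilio Media Streams messages; everything else is illustrative.

```javascript
// Hypothetical outbound path: agent audio is queued here before being written
// to Twilio, so an interrupt can discard whatever has not been sent yet.
function createOutboundPlayer(sendToTwilio, streamSid) {
  let pending = [];

  return {
    enqueue(base64Audio) {
      pending.push(base64Audio);
    },
    // Write queued frames to Twilio as media messages.
    drain() {
      for (const payload of pending) {
        sendToTwilio(JSON.stringify({ event: 'media', streamSid, media: { payload } }));
      }
      pending = [];
    },
    // On barge-in: drop unsent frames, then ask Twilio to drop the audio it
    // has already buffered for playback.
    interrupt() {
      const discarded = pending.length;
      pending = [];
      sendToTwilio(JSON.stringify({ event: 'clear', streamSid }));
      return discarded;
    },
  };
}
```

In the bridge, sendToTwilio would be something like (text) => ws.send(text), with interrupt() called from the handler for Retell AI's interrupt event.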

Audio format mismatch: Default Retell AI configs use PCM 16kHz. Twilio's Media Streams API only accepts mulaw 8kHz. Symptom: the agent responds with "I didn't catch that" on every turn because the decoder fails silently. Fix: set audio_encoding: "mulaw" and sample_rate: 8000 in agentConfig.

Webhook signature validation skipped: Without validating X-Twilio-Signature, attackers can POST fake CallSid values to your /voice endpoint and rack up thousands of Retell AI sessions. Fix: use twilio.validateRequest(authToken, signature, url, body) before processing any webhook.
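
For intuition about what that check does, the sketch below recomputes the signature for a form-encoded webhook using only Node's crypto module: the full URL, followed by every POST parameter sorted by name and appended as name plus value, signed with HMAC-SHA1 using the auth token and base64-encoded. The function names are hypothetical, and in the server itself the library's twilio.validateRequest remains the safer choice.

```javascript
const crypto = require('crypto');

// Recompute the signature for a form-encoded webhook request.
function computeTwilioSignature(authToken, url, params) {
  const data = Object.keys(params)
    .sort()
    .reduce((acc, key) => acc + key + params[key], url);
  return crypto
    .createHmac('sha1', authToken)
    .update(Buffer.from(data, 'utf-8'))
    .digest('base64');
}

// Constant-time comparison so the check does not leak how much matched.
function isValidSignature(authToken, headerSignature, url, params) {
  const expected = Buffer.from(computeTwilioSignature(authToken, url, params));
  const received = Buffer.from(String(headerSignature || ''));
  return expected.length === received.length
    && crypto.timingSafeEqual(expected, received);
}
```

The URL must match exactly what Twilio requested, including scheme and host, which is why validation often fails behind a proxy that rewrites the Host header.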

WebSocket timeout after 4 hours: Twilio closes idle WebSockets after 4 hours. Long support calls hit this limit. Fix: send keepalive pings every 30 seconds: setInterval(() => twilioWs.ping(), 30000).

The whole thing in one file

This is the complete server in one file. Copy the entire block, save it as server.js, set the environment variables, and run node server.js. It handles Twilio's /voice webhook, WebSocket bridging, session state tracking, and graceful cleanup.

javascript

const express = require('express');
const WebSocket = require('ws');
const twilio = require('twilio');

const app = express();
app.use(express.urlencoded({ extended: false }));

const RETELL_API_KEY = process.env.RETELL_API_KEY;
const TWILIO_ACCOUNT_SID = process.env.TWILIO_ACCOUNT_SID;
const TWILIO_AUTH_TOKEN = process.env.TWILIO_AUTH_TOKEN;

// Session state tracking - prevents race conditions
const activeSessions = new Map();

// Twilio voice webhook - initiates call
app.post('/voice', async (req, res) => {
  const callSid = req.body.CallSid;
  const from = req.body.From;
  
  // Validate Twilio signature to prevent spoofed calls
  const signature = req.headers['x-twilio-signature'];
  const url = `https://${req.headers.host}${req.url}`;
  if (!twilio.validateRequest(TWILIO_AUTH_TOKEN, signature, url, req.body)) {
    return res.status(403).send('Invalid signature');
  }
  
  try {
    // Create Retell AI agent session
    const response = await fetch('https://api.retellai.com/v2/create-web-call', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${RETELL_API_KEY}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        agent_id: process.env.RETELL_AGENT_ID,
        audio_websocket_protocol: 'twilio',
        audio_encoding: 'mulaw',
        sample_rate: 8000,
        metadata: { callSid, from }
      })
    });
    
    if (!response.ok) throw new Error(`Retell API error: ${response.status}`);
    const { call_id, access_token } = await response.json();
    
    // Store session (including the Retell token) so the WebSocket handler can find it
    activeSessions.set(callSid, { call_id, access_token, audioBuffer: [] });
    
    // Return TwiML with WebSocket stream
    const twiml = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://${req.headers.host}/media/${call_id}">
      <Parameter name="access_token" value="${access_token}" />
      <Parameter name="callSid" value="${callSid}" />
    </Stream>
  </Connect>
</Response>`;
    
    res.type('text/xml').send(twiml);
  } catch (error) {
    console.error('Voice webhook error:', error);
    res.status(500).send('<Response><Say>Service unavailable</Say></Response>');
  }
});

// WebSocket bridge - handles bidirectional audio
const wss = new WebSocket.Server({ noServer: true });

wss.on('connection', (ws, req) => {
  const call_id = req.url.split('/').pop();
  
  // Twilio delivers <Parameter> values in the start event
  // (msg.start.customParameters), never as URL query parameters, so look the
  // session up by the call_id carried in the path instead.
  const entry = [...activeSessions].find(([, s]) => s.call_id === call_id);
  if (!entry) {
    ws.close();
    return;
  }
  const [callSid, session] = entry;
  
  // Connect to Retell AI WebSocket
  const retellWs = new WebSocket(`wss://api.retellai.com/audio-websocket/${call_id}`, {
    headers: { 'Authorization': `Bearer ${session.access_token}` }
  });
  
  let streamSid = null;
  
  // Twilio → Retell AI (incoming audio)
  ws.on('message', (data) => {
    const msg = JSON.parse(data);
    
    if (msg.event === 'start') {
      streamSid = msg.start.streamSid;
      console.log(`[${Date.now()}] Stream started: ${streamSid}`);
    }
    
    if (msg.event === 'media') {
      if (retellWs.readyState === WebSocket.OPEN) {
        // Forward mulaw audio chunks to Retell AI
        retellWs.send(JSON.stringify({
          type: 'audio',
          audio_encoding: 'mulaw',
          sample_rate: 8000,
          data: msg.media.payload
        }));
      } else {
        // Buffer audio until Retell WebSocket opens
        session.audioBuffer.push(msg.media.payload);
      }
    }
    
    if (msg.event === 'stop') {
      console.log(`[${Date.now()}] Stream stopped: ${streamSid}`);
      retellWs.close();
    }
  });
  
  // Retell AI → Twilio (outgoing audio)
  retellWs.on('message', (data) => {
    const retellMsg = JSON.parse(data);
    
    if (retellMsg.type === 'audio' && ws.readyState === WebSocket.OPEN) {
      // Forward synthesized audio back to Twilio
      ws.send(JSON.stringify({
        event: 'media',
        streamSid: streamSid,
        media: { payload: retellMsg.data }
      }));
    }
    
    if (retellMsg.type === 'interrupt') {
      // Clear audio buffer on barge-in
      session.audioBuffer = [];
      ws.send(JSON.stringify({ event: 'clear', streamSid: streamSid }));
      console.log(`[${Date.now()}] Barge-in detected - buffers flushed`);
    }
    
    if (retellMsg.type === 'call_ended') {
      ws.close();
      activeSessions.delete(callSid);
    }
  });
  
  // Flush buffered audio once Retell WebSocket opens
  retellWs.on('open', () => {
    console.log(`[${Date.now()}] Retell WebSocket opened for ${call_id}`);
    while (session.audioBuffer.length > 0) {
      retellWs.send(JSON.stringify({
        type: 'audio',
        audio_encoding: 'mulaw',
        sample_rate: 8000,
        data: session.audioBuffer.shift()
      }));
    }
  });
  
  // Error handling - prevents zombie connections
  ws.on('error', (err) => console.error('Twilio WS error:', err));
  retellWs.on('error', (err) => console.error('Retell WS error:', err));
  
  ws.on('close', () => {
    if (retellWs.readyState === WebSocket.OPEN) retellWs.close();
  });
  
  // Keepalive to prevent 4-hour timeout
  const keepalive = setInterval(() => {
    if (ws.readyState === WebSocket.OPEN) ws.ping();
  }, 30000);
  
  ws.on('close', () => clearInterval(keepalive));
});

// Upgrade HTTP to WebSocket
const server = app.listen(process.env.PORT || 3000, () => {
  console.log(`Server running on port ${server.address().port}`);
});

server.on('upgrade', (request, socket, head) => {
  if (request.url.startsWith('/media/')) {
    wss.handleUpgrade(request, socket, head, (ws) => {
      wss.emit('connection', ws, request);
    });
  } else {
    socket.destroy();
  }
});

Run it: Install dependencies with npm install express ws twilio. Set environment variables: export RETELL_API_KEY="your_key", export RETELL_AGENT_ID="agent_xxx", export TWILIO_ACCOUNT_SID="ACxxx", export TWILIO_AUTH_TOKEN="your_token". For local testing, run ngrok http 3000 and copy the HTTPS URL. In Twilio Console, configure your phone number's Voice webhook to https://your-ngrok-url.ngrok.io/voice. Call your Twilio number—the AI agent answers immediately.

Production deployment: Replace ngrok with a real domain (Railway, Render, Fly.io all work). Add session cleanup with TTL expiration: setTimeout(() => activeSessions.delete(callSid), 3600000) to prevent memory leaks on abandoned calls. Implement retry logic for Retell API failures with exponential backoff. Monitor activeSessions.size—if it grows unbounded, you have a cleanup bug in your call_ended handler.
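
The retry logic mentioned above could look like the sketch below. Both helpers are hypothetical, and the base delay, cap, and retry count are placeholder values to tune. Keep the total retry time short when calling it from the /voice handler, since Twilio only waits a limited time for the webhook response.

```javascript
// Hypothetical helper: delay before retry number `attempt` (0-based),
// doubling from baseMs and capped at maxMs.
function backoffDelayMs(attempt, baseMs = 250, maxMs = 4000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Hypothetical wrapper: retry an async operation, sleeping between attempts.
// `sleep` is injectable so the wrapper can be exercised without real delays.
async function withRetry(
  operation,
  { retries = 3, sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)) } = {}
) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < retries) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```

In the server, the fetch call that creates the Retell session would be the operation, for example withRetry(() => createRetellCall(callSid)), where createRetellCall is a stand-in for that request.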

Common questions

How does Retell AI handle real-time audio streaming with Twilio?
Retell AI connects via WebSocket to your server, which bridges Twilio's Media Streams API. Twilio sends mulaw 8kHz audio chunks every 20ms. Your server forwards these to Retell AI's WebSocket at wss://api.retellai.com/audio-websocket/{call_id}. Retell AI processes STT → LLM → TTS internally and returns synthesized audio in the same mulaw format. Your server streams this back to Twilio, which plays it to the caller. The streamSid from Twilio and call_id from Retell AI must be tracked in a session map to prevent race conditions.
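
The Twilio side of that flow comes down to a small dispatcher over the event field. The sketch below is illustrative: handleTwilioMessage and the shape of the state object are hypothetical, while the event names and the start and media fields follow Twilio's Media Streams messages.

```javascript
// Hypothetical reducer for Twilio Media Streams messages. It updates
// per-stream state and returns the audio payload (if any) to forward.
function handleTwilioMessage(state, rawMessage) {
  const msg = JSON.parse(rawMessage);
  switch (msg.event) {
    case 'start':
      state.streamSid = msg.start.streamSid;
      state.callSid = msg.start.callSid;
      // Values from <Parameter> tags in the TwiML arrive here.
      state.customParameters = msg.start.customParameters || {};
      return null;
    case 'media':
      state.framesReceived += 1;
      return msg.media.payload; // base64-encoded mulaw audio
    case 'stop':
      state.stopped = true;
      return null;
    default:
      return null; // e.g. "connected", "mark", "dtmf"
  }
}
```

Keeping the parsing separate from the socket code makes it easy to replay recorded messages when debugging ordering problems.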

What's the difference between Retell AI's voice synthesis and Twilio's TTS?
Retell AI handles all voice synthesis internally via its configured voice_id and response_engine. In this setup Twilio does not synthesize speech; it only streams raw audio. Avoid Twilio's <Say> tag in the TwiML that connects the stream, because its audio competes with Retell AI's output. A <Say> fallback is fine when the stream never opens, as in the error path of the /voice handler. Retell AI owns the entire voice pipeline (transcription, LLM reasoning, TTS), while Twilio is purely the transport layer for phone calls.

Why does audio sometimes cut off mid-sentence when the user interrupts?
Barge-in requires coordinating Twilio's audio stream, Retell AI's VAD, and your TTS buffer. If interruption_sensitivity is too low (default 0.3), Retell AI won't detect the user's speech quickly enough. Increase it to 0.5–0.7. More critically, when Retell AI sends an interrupt event, you must flush audioBuffer immediately—if old TTS audio is still queued, it plays after the interrupt, creating overlap. Implement a flush-on-interrupt handler that clears the buffer before sending the next audio chunk to Twilio.

What latency should I expect end-to-end?
Typical breakdown: Twilio audio capture (20–40ms) + network to your server (20–100ms) + Retell AI STT processing (200–600ms) + LLM inference (500–2000ms) + TTS synthesis (300–800ms) + network back to Twilio (20–100ms) = roughly 1.1–3.6 seconds total. Mobile networks add 100–400ms jitter. To reduce perceived latency, set responsiveness: 0.8 in agentConfig and deploy your server in the same AWS region as Retell AI (us-west-2) to shave 80–120ms off cross-region latency.

How many concurrent calls can one server handle?
Each call requires two WebSocket connections (Twilio + Retell AI) and 2–5MB of memory for buffers and session state. A single Node.js process can handle 50–200 concurrent calls depending on LLM latency and server specs. Beyond that, implement connection pooling and horizontal scaling. Monitor activeSessions.size—if it grows unbounded, you have a session cleanup bug (missing call_ended webhook handlers or no TTL expiration).
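
A session store with built-in expiry is one way to keep that map bounded. This is a sketch, not the tutorial's implementation: SessionStore is a hypothetical class, the one-hour default mirrors the TTL suggested earlier, and the clock is injectable so the expiry logic can be checked without waiting.

```javascript
// Hypothetical session store with time-based expiry.
class SessionStore {
  constructor({ ttlMs = 60 * 60 * 1000, now = Date.now } = {}) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.sessions = new Map();
  }

  set(callSid, session) {
    this.sessions.set(callSid, { session, expiresAt: this.now() + this.ttlMs });
  }

  get(callSid) {
    const entry = this.sessions.get(callSid);
    return entry ? entry.session : undefined;
  }

  delete(callSid) {
    return this.sessions.delete(callSid);
  }

  // Remove every expired session; returns how many were evicted.
  sweep() {
    let evicted = 0;
    for (const [callSid, entry] of this.sessions) {
      if (entry.expiresAt <= this.now()) {
        this.sessions.delete(callSid);
        evicted += 1;
      }
    }
    return evicted;
  }

  get size() {
    return this.sessions.size;
  }
}
```

Running sweep() on an interval and logging size gives you both the cleanup and the metric to watch.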

Should I use Retell AI or build custom STT/LLM/TTS with Twilio?
Retell AI abstracts the entire voice AI pipeline—you configure agentConfig once and get production-grade STT, LLM orchestration, TTS, and barge-in handling. Building custom requires wiring Deepgram/Whisper for STT, OpenAI/Anthropic for LLM, ElevenLabs/PlayHT for TTS, plus custom VAD and turn-taking logic. Retell AI's latency (500–2000ms) is competitive with custom stacks because it optimizes the entire pipeline. Use Retell AI unless you need sub-500ms latency or custom audio processing (noise suppression, speaker diarization).
