Integrate Node.js with Retell AI and Twilio: Lessons from My Setup

Curious about integrating Node.js with Retell AI and Twilio? Discover practical insights and the steps I took to create a powerful AI voice agent.

Misal Azeem

Voice AI Engineer & Creator

The problem

Most Node.js voice integrations fail when Twilio's webhook timing conflicts with Retell AI's streaming latency: you get dropped calls or overlapping audio.

The symptom: Twilio sends a media event every 20ms, but your STT processing takes 80-120ms. If you don't flush audioBuffer on barge-in, the agent speaks over the user with roughly 100ms of stale audio. The results are race conditions where isAgentSpeaking flips mid-stream, duplicate audio chunks, or failed webhooks (often surfacing as 502 Bad Gateway errors) when retellClient.call.register() runs past Twilio's webhook timeout.

This setup uses Retell AI for conversation logic, Twilio for PSTN connectivity, and Node.js webhooks for session management.

  • Stack: Express.js, Retell SDK, Twilio Node.js client, environment-based config
  • Target: sub-500ms latency, proper call state tracking, zero audio collisions
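To make the timing mismatch concrete, here is a rough back-of-the-envelope sketch. The helper names are hypothetical; the 20ms frame interval and the 80-120ms processing time are the figures quoted above, and the byte count assumes 8-bit mulaw at 8kHz.

```javascript
// Hypothetical helper: how many 20ms Twilio frames arrive while one
// processing step (for example STT) is still running.
function framesDuringProcessing(processingMs, frameMs = 20) {
  return Math.ceil(processingMs / frameMs);
}

// mulaw at 8kHz is one byte per sample, so 8 bytes per millisecond.
function bytesBuffered(frames, frameMs = 20, bytesPerMs = 8) {
  return frames * frameMs * bytesPerMs;
}

const worstCaseFrames = framesDuringProcessing(120);
console.log(`frames waiting: ${worstCaseFrames}`);
console.log(`bytes waiting: ${bytesBuffered(worstCaseFrames)}`);
```

At the slow end of that range, several frames are already queued by the time the first one has been processed, which is why stale audio has to be dropped explicitly rather than played out.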

Prerequisites

  • Twilio account with active phone number, Voice API access, Account SID, and Auth Token from console
  • Retell AI account with API key from dashboard and at least one configured agent ID
  • Node.js 16+ (LTS recommended), with express, twilio, @retellai/retell-sdk, ws, dotenv installed via npm install express twilio @retellai/retell-sdk ws dotenv
  • Public HTTPS endpoint (ngrok tunnel for local dev, real domain for production). Twilio Media Streams only connects to secure wss:// URLs, so plain HTTP will not work for the audio path
  • Minimum 512MB RAM for concurrent call handling; 2GB+ if scaling beyond 10 simultaneous calls
  • Firewall rules allowing inbound traffic on port 443, webhook signature validation enabled

Under the hood

Retell handles AI conversation logic. Twilio handles telephony. Your Node.js server is the bridge. Mixing their responsibilities creates race conditions and double-billing.

mermaid
sequenceDiagram
    participant Caller
    participant Twilio
    participant NodeServer
    participant RetellAI
    Caller->>Twilio: Initiates call
    Twilio->>NodeServer: POST /webhook/twilio-incoming
    NodeServer->>RetellAI: Register call (agent_id, audio config)
    RetellAI->>NodeServer: WebSocket URL
    NodeServer->>Twilio: TwiML with <Stream> to WebSocket
    Twilio->>NodeServer: WebSocket connection (media events)
    NodeServer->>RetellAI: Forward audio chunks (mulaw, 8kHz)
    RetellAI->>NodeServer: AI response audio
    NodeServer->>Twilio: Stream response back
    Twilio->>Caller: Audio playback
    Caller->>Twilio: Ends call
    Twilio->>NodeServer: WebSocket close
    NodeServer->>RetellAI: Disconnect
    RetellAI->>NodeServer: POST /webhook/retell-events (call_ended)

Critical separation: Twilio owns the phone connection. Retell owns the conversation state. Your server translates between them via webhooks. When a call arrives, Twilio hits your /webhook/twilio-incoming endpoint. You register the call with Retell to get a WebSocket URL, then return TwiML that bridges Twilio's audio stream to that WebSocket. Audio flows bidirectionally: Twilio sends 20ms chunks of mulaw-encoded audio at 8kHz, your server forwards them to Retell, Retell processes speech-to-text + LLM inference + text-to-speech, then streams synthesized audio back through your server to Twilio.
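As a small illustration of what one of those 20ms chunks looks like on the wire, the sketch below parses a Twilio Media Streams "media" message and measures its payload. The function name `describeMediaEvent` and the `MZ_example` stream SID are made up for the example; the message shape (an `event` field, a top-level `streamSid`, and a base64 `media.payload`) follows Twilio's Media Streams format.

```javascript
// Parse one Twilio "media" message and report how much audio it carries.
function describeMediaEvent(rawMessage) {
  const event = JSON.parse(rawMessage);
  if (event.event !== 'media') return null; // ignore start/stop/mark messages

  const audio = Buffer.from(event.media.payload, 'base64');
  return {
    streamSid: event.streamSid,
    bytes: audio.length,
    durationMs: audio.length / 8, // 8-bit mulaw at 8kHz = 8 bytes per ms
  };
}

// Simulated 20ms frame: 160 bytes of mulaw silence (0xff).
const sample = JSON.stringify({
  event: 'media',
  streamSid: 'MZ_example',
  media: { payload: Buffer.alloc(160, 0xff).toString('base64') },
});

console.log(describeMediaEvent(sample));
```

Logging `bytes` and `durationMs` for a few real frames is a quick way to confirm that the encoding and sample rate you registered with Retell match what Twilio is actually sending.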

Build it

1. Environment configuration

Store credentials in .env — never hardcode production secrets:

bash
# .env file
TWILIO_ACCOUNT_SID=ACxxxxx
TWILIO_AUTH_TOKEN=your_auth_token
TWILIO_PHONE_NUMBER=+1234567890
RETELL_API_KEY=key_xxxxx
RETELL_AGENT_ID=agent_xxxxx
SERVER_URL=https://your-domain.ngrok.io
PORT=3000
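A missing variable tends to show up much later as a confusing 403 or a failed registration, so it can help to fail fast at startup. The following is a minimal sketch; `findMissingEnv` and `assertEnv` are hypothetical helper names, and the list mirrors the variables in the .env file above.

```javascript
// Variables the rest of this setup reads from process.env.
const REQUIRED_ENV = [
  'TWILIO_ACCOUNT_SID',
  'TWILIO_AUTH_TOKEN',
  'TWILIO_PHONE_NUMBER',
  'RETELL_API_KEY',
  'RETELL_AGENT_ID',
  'SERVER_URL',
];

// Return the names that are unset or blank.
function findMissingEnv(env, required = REQUIRED_ENV) {
  return required.filter((name) => !env[name] || env[name].trim() === '');
}

// Throw a single readable error listing everything that is missing.
function assertEnv(env = process.env) {
  const missing = findMissingEnv(env);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
}
```

Calling `assertEnv()` right after `require('dotenv').config()` would stop the server before it accepts any calls with an incomplete configuration.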

2. Webhook handler for incoming calls

When Twilio receives a call, it hits your server's webhook. You must return TwiML that bridges to Retell before Twilio's webhook timeout (15 seconds for voice webhooks) or the call fails:

javascript
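require('dotenv').config(); // Load .env before anything reads process.env

const express = require('express');
const twilio = require('twilio');
// Package name and export as used in this article; confirm both against
// the Retell SDK version you have installed.
const { RetellClient } = require('@retellai/retell-sdk');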

const app = express();
app.use(express.urlencoded({ extended: false }));

const retellClient = new RetellClient({
  apiKey: process.env.RETELL_API_KEY
});

// Validate Twilio signature before processing
function validateTwilioSignature(req, res, next) {
  const signature = req.headers['x-twilio-signature'];
  const url = `${process.env.SERVER_URL}${req.originalUrl}`;
  
  if (!twilio.validateRequest(process.env.TWILIO_AUTH_TOKEN, signature, url, req.body)) {
    return res.status(403).send('Forbidden');
  }
  next();
}

app.post('/webhook/twilio-incoming', validateTwilioSignature, async (req, res) => {
  const twiml = new twilio.twiml.VoiceResponse();
  
  try {
    // Register call with Retell to get WebSocket URL (must complete in <2s)
    const retellCall = await Promise.race([
      retellClient.call.register({
        agent_id: process.env.RETELL_AGENT_ID,
        audio_websocket_protocol: "twilio",
        audio_encoding: "mulaw", // Twilio's audio format
        sample_rate: 8000 // Twilio uses 8kHz
      }),
      new Promise((_, reject) => setTimeout(() => reject(new Error('Timeout')), 2000))
    ]);

    // Connect Twilio call to Retell's WebSocket
    const connect = twiml.connect();
    connect.stream({
      url: retellCall.call_detail.websocket_url
    });

    res.type('text/xml');
    res.send(twiml.toString());
  } catch (error) {
    console.error('Retell registration failed:', error);
    twiml.say('Sorry, the system is unavailable. Please try again later.');
    res.type('text/xml');
    res.send(twiml.toString());
  }
});

Production fix: The 2-second timeout with fallback TwiML prevents hung webhooks. If retellClient.call.register() times out, the caller hears an error message instead of silence.
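The inline Promise.race above leaves its timer running even when registration succeeds. One possible refinement, sketched here with a hypothetical `withTimeout` helper that is not part of any SDK, clears the timer once either side settles.

```javascript
// Reject with an Error if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, message = 'Timeout') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(message)), ms);
  });
  // Whichever settles first wins; the timer is cleared either way.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Stand-in for a slow upstream call.
const slowCall = new Promise((resolve) => setTimeout(resolve, 200, 'late'));
withTimeout(slowCall, 50).catch((err) => console.log(`fallback: ${err.message}`));
```

In the webhook handler this would wrap the registration call, for example `await withTimeout(retellClient.call.register({...}), 2000)`, with the existing catch block still returning the fallback TwiML.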

3. Retell event webhook

Retell sends call lifecycle events (started, ended, analyzed) to your server for analytics and state management:

javascript
app.post('/webhook/retell-events', express.json(), (req, res) => {
  const event = req.body;

  switch(event.event) {
    case 'call_started':
      console.log(`Call ${event.call.call_id} started at ${event.call.start_timestamp}`);
      // Initialize session state, log to analytics
      break;
    
    case 'call_ended': {
      // Braces scope `duration` to this case
      const duration = event.call.end_timestamp - event.call.start_timestamp;
      console.log(`Call ${event.call.call_id} ended. Duration: ${duration}ms`);
      // Save transcript to database, calculate API costs
      break;
    }
    
    case 'call_analyzed':
      // Post-call analysis with sentiment scores, summary
      console.log('Analysis:', event.call.call_analysis);
      break;
  }

  res.sendStatus(200); // Always return 200 or Retell retries with exponential backoff
});
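The handler above only logs. For the call state tracking mentioned at the start, a minimal in-memory sketch could look like the following. The `applyRetellEvent` function and the shape of the stored session are assumptions for illustration; the event names and `call` fields are the ones used in the handler above.

```javascript
// In-memory session store keyed by call_id (swap for Redis or a database in production).
const sessions = new Map();

// Fold one webhook payload into the store and return the updated session.
function applyRetellEvent(store, payload) {
  const { event, call } = payload;
  const existing = store.get(call.call_id) || { status: 'unknown' };

  if (event === 'call_started') {
    store.set(call.call_id, { ...existing, status: 'active', startedAt: call.start_timestamp });
  } else if (event === 'call_ended') {
    // Fall back to the payload's own start time if call_started was never seen.
    const startedAt = existing.startedAt ?? call.start_timestamp;
    store.set(call.call_id, {
      ...existing,
      status: 'ended',
      startedAt,
      durationMs: call.end_timestamp - startedAt,
    });
  } else if (event === 'call_analyzed') {
    store.set(call.call_id, { ...existing, analysis: call.call_analysis });
  }
  return store.get(call.call_id);
}
```

Inside the route this would be a single call, `applyRetellEvent(sessions, req.body)`, placed before `res.sendStatus(200)`.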

4. WebSocket server for audio streaming

Handle bidirectional audio between Twilio and Retell with proper buffer management to prevent overflow:

javascript
const WebSocket = require('ws');

const wss = new WebSocket.Server({ noServer: true });

wss.on('connection', (ws, req) => {
  let retellWs = null;
  let streamSid = null;
  let isProcessingAudio = false;

  // Connect to Retell AI's WebSocket
  const retellUrl = `wss://api.retellai.com/audio-websocket/${process.env.RETELL_AGENT_ID}`;
  retellWs = new WebSocket(retellUrl, {
    headers: { 'Authorization': `Bearer ${process.env.RETELL_API_KEY}` }
  });

  retellWs.on('open', () => {
    retellWs.send(JSON.stringify({
      type: 'config',
      config: {
        agent_id: process.env.RETELL_AGENT_ID,
        audio_encoding: 'mulaw',
        sample_rate: 8000
      }
    }));
  });

  // Twilio → Retell: Forward caller audio
  ws.on('message', async (message) => {
    const event = JSON.parse(message);

    if (event.event === 'start') {
      streamSid = event.start.streamSid;
    }

    if (event.event === 'media' && retellWs.readyState === WebSocket.OPEN) {
      // Guard against race conditions with processing flag
      if (isProcessingAudio) {
        return; // Drop frame instead of queuing
      }
      
      isProcessingAudio = true;
      
      try {
        retellWs.send(JSON.stringify({
          type: 'audio',
          audio: event.media.payload // Base64 mulaw
        }));
      } finally {
        isProcessingAudio = false;
      }
    }

    if (event.event === 'stop') {
      retellWs.close();
    }
  });

  // Retell → Twilio: Stream agent responses back
  retellWs.on('message', (data) => {
    const payload = JSON.parse(data);

    if (payload.type === 'audio' && ws.readyState === WebSocket.OPEN) {
      ws.send(JSON.stringify({
        event: 'media',
        streamSid: streamSid,
        media: { payload: payload.audio }
      }));
    }
  });

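  ws.on('close', () => {
    if (retellWs) retellWs.close();
  });

  // A socket 'error' with no listener is thrown as an uncaught exception
  retellWs.on('error', (err) => {
    console.error('Retell WebSocket error:', err.message);
    ws.close();
  });
  ws.on('error', (err) => console.error('Twilio WebSocket error:', err.message));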
});

// Upgrade HTTP to WebSocket
const server = app.listen(process.env.PORT, () => {
  console.log(`Server running on port ${process.env.PORT}`);
  console.log(`Expose with: ngrok http ${process.env.PORT}`);
  console.log(`Set Twilio webhook to: ${process.env.SERVER_URL}/webhook/twilio-incoming`);
});

server.on('upgrade', (request, socket, head) => {
  wss.handleUpgrade(request, socket, head, (ws) => {
    wss.emit('connection', ws, request);
  });
});

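What the isProcessingAudio guard does and does not do: Twilio sends a media event every 20ms. In the handler above, the flag is set and cleared within a single synchronous call to retellWs.send(), and Node.js runs message handlers one at a time, so the guard never actually drops a frame as written. It only starts to matter if you add asynchronous work (for example transcoding) between setting and clearing the flag. If the Retell socket cannot keep up, the practical risk is a growing send queue and rising latency rather than corrupted audio; checking retellWs.bufferedAmount before each send is a more direct way to apply backpressure.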
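One way to apply backpressure on the Twilio to Retell path is to look at how much data the outbound socket already has queued and drop the frame when that grows too large. This is a sketch under assumptions: `forwardIfRoom` and the 16KB threshold are hypothetical, and the fake socket stands in for a real connection. `readyState` and `bufferedAmount` are standard WebSocket properties that the ws package also exposes.

```javascript
const OPEN = 1; // Value of WebSocket.OPEN
const MAX_BUFFERED_BYTES = 16 * 1024; // Hypothetical threshold, tune per deployment

// Send only when the socket is open and its send queue is not backed up.
function forwardIfRoom(socket, message, maxBuffered = MAX_BUFFERED_BYTES) {
  if (socket.readyState !== OPEN) return 'closed';
  if (socket.bufferedAmount > maxBuffered) return 'dropped';
  socket.send(message);
  return 'sent';
}

// Stand-in socket so the sketch runs without a network connection.
const fakeSocket = {
  readyState: OPEN,
  bufferedAmount: 0,
  sent: [],
  send(message) { this.sent.push(message); },
};

console.log(forwardIfRoom(fakeSocket, '{"type":"audio"}'));
fakeSocket.bufferedAmount = 64 * 1024; // simulate a congested connection
console.log(forwardIfRoom(fakeSocket, '{"type":"audio"}'));
```

Dropping a 20ms frame under congestion costs a barely audible gap, whereas queuing it adds latency to every frame that follows.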

Everything in one file

Complete Retell agent configuration with production-grade settings:

javascript
// retellAgentConfig.js
module.exports = {
  agent_id: process.env.RETELL_AGENT_ID,
  agent_name: "Customer Support Agent",
  voice_id: "11labs-Adrian", // ElevenLabs voice (alternatives: "openai-alloy", "deepgram-aura")
  language: "en-US",
  response_engine: {
    type: "retell-llm",
    llm_id: process.env.RETELL_LLM_ID, // GPT-4 for accuracy, GPT-3.5 for speed
    temperature: 0.7 // 0.0-1.0, higher = more creative but less predictable
  },
  begin_message: "Thanks for calling. How can I help you today?",
  general_prompt: "You are a helpful customer support agent. Be concise and professional. If you don't know something, say so instead of guessing.",
  enable_backchannel: true, // Agent says "mm-hmm" during user speech for natural feel
  ambient_sound: "office", // Options: "off", "coffee_shop", "office"
  interruption_sensitivity: 0.7, // 0-1 scale, higher = easier to interrupt (0.5-0.8 recommended)
  audio_websocket_protocol: "twilio",
  audio_encoding: "mulaw", // Must match Twilio's format
  sample_rate: 8000, // Twilio default, don't change unless transcoding
  end_call_after_silence_ms: 30000, // Hang up after 30s of silence
  max_call_duration_ms: 600000, // 10-minute hard limit to prevent runaway costs
  webhook_url: `${process.env.SERVER_URL}/webhook/retell-events`, // Where Retell sends call events
  fallback_voice_ids: ["11labs-Rachel", "openai-nova"] // Backup voices if primary fails
};

Tradeoff notes: interruption_sensitivity at 0.7 balances natural conversation (users can interrupt) against false positives (background noise triggering barge-in). Lower it to 0.5 for noisy environments. temperature at 0.7 gives varied responses at the cost of some predictability; lower it to 0.3 for compliance-sensitive domains. max_call_duration_ms prevents $50 bills from forgotten calls.
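Because several of these settings are plain numbers with narrow valid ranges, a quick sanity check before deploying can catch typos. The sketch below is only illustrative: `validateAgentConfig` is a hypothetical helper, and the ranges it enforces are the ones stated in the comments of the config above, not limits taken from Retell's API reference.

```javascript
// Return a list of human-readable problems; an empty list means the config passed.
function validateAgentConfig(config) {
  const problems = [];
  const inRange = (value, lo, hi) => typeof value === 'number' && value >= lo && value <= hi;

  if (!inRange(config.interruption_sensitivity, 0, 1)) {
    problems.push('interruption_sensitivity must be between 0 and 1');
  }
  if (!inRange(config.response_engine?.temperature, 0, 1)) {
    problems.push('response_engine.temperature must be between 0 and 1');
  }
  if (config.audio_encoding !== 'mulaw' || config.sample_rate !== 8000) {
    problems.push('Twilio streams expect mulaw audio at 8000 Hz');
  }
  if (!(config.max_call_duration_ms > 0)) {
    problems.push('max_call_duration_ms must be a positive number');
  }
  return problems;
}

console.log(validateAgentConfig({
  interruption_sensitivity: 0.7,
  response_engine: { temperature: 0.7 },
  audio_encoding: 'mulaw',
  sample_rate: 8000,
  max_call_duration_ms: 600000,
}));
```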

Verify it works

Local testing with ngrok

bash
# Start your Express server
node server.js

# In another terminal, expose it
ngrok http 3000

# Copy the HTTPS URL (e.g., https://abc123.ngrok.io)
# Update Twilio webhook in console: https://abc123.ngrok.io/webhook/twilio-incoming

Test the webhook manually

bash
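# Simulate Twilio's incoming call webhook.
# With validateTwilioSignature enabled, this fake signature gets a 403 Forbidden.
# To see the TwiML below, send a correctly signed request or bypass validation in local dev only.
# --data-urlencode keeps the "+" in phone numbers from being decoded as a space.
curl -X POST https://abc123.ngrok.io/webhook/twilio-incoming \
  --data-urlencode "CallSid=CA1234567890abcdef" \
  --data-urlencode "From=+15555551234" \
  --data-urlencode "To=+15555556789" \
  -H "X-Twilio-Signature: fake_signature_for_testing"

# Expected response for a validly signed request (TwiML):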
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://api.retellai.com/audio-websocket/agent_xxxxx"/>
  </Connect>
</Response>
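To exercise the webhook with signature validation left on, the test request needs a real signature. The sketch below follows Twilio's documented scheme for form-encoded POSTs: append each parameter name and value to the full URL in alphabetical key order, then take the base64 HMAC-SHA1 of that string keyed with the auth token. The function name `computeTwilioSignature` and the placeholder token and URL are assumptions for the example.

```javascript
const crypto = require('crypto');

// Build the X-Twilio-Signature value for a form-encoded POST.
function computeTwilioSignature(authToken, url, params) {
  const data = Object.keys(params)
    .sort()
    .reduce((acc, key) => acc + key + params[key], url);
  return crypto.createHmac('sha1', authToken).update(Buffer.from(data, 'utf-8')).digest('base64');
}

// Placeholder values: use your real auth token and the exact public URL Twilio would call.
const signature = computeTwilioSignature(
  'your_auth_token',
  'https://abc123.ngrok.io/webhook/twilio-incoming',
  { CallSid: 'CA1234567890abcdef', From: '+15555551234', To: '+15555556789' }
);
console.log(`X-Twilio-Signature: ${signature}`);
```

The URL passed in has to match what the server reconstructs from SERVER_URL plus the request path, and the parameter values have to match what Express parses from the body, otherwise the check still fails.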

Verify call flow

  1. Call your Twilio number — should hear Retell agent greeting within 2 seconds
  2. Check server logs — look for Call {call_id} started and WebSocket connection messages
  3. Test interruption — talk over the agent mid-sentence (should stop immediately if interruption_sensitivity is configured)
  4. End call — verify call_ended event fires with correct duration

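Common failure: a 502 Bad Gateway in Twilio's logs means the webhook request failed, either because the tunnel or proxy could not reach your server or because the server did not answer in time. Check whether retellClient.call.register() is timing out, and add the 2-second Promise.race timeout wrapper from Build it step 2.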

Production example

Scenario: a user calls the support line at 3:04:17 PM EST, interrupts the agent twice, first asking for a human and then settling for a password reset.

javascript
// t=0ms (3:04:17.000 PM EST): Call arrives
{
  event: 'call_started',
  call: {
    call_id: 'call_abc123',
    from_number: '+15555551234',
    to_number: '+15555556789',
    start_timestamp: 1704139457000
  }
}

// t=340ms: Agent starts greeting
// Agent: "Thanks for calling Acme Support. How can I—"

// t=1200ms: User interrupts
// User: "I need to speak to a human."
{
  event: 'transcript',
  transcript: {
    role: 'user',
    content: 'I need to speak to a human.',
    timestamp: 1704139458200
  }
}

// t=1220ms: Agent stops mid-sentence (barge-in detected)
// Audio buffer flushed, TTS cleared

// t=1800ms: Agent responds
// Agent: "I understand. Let me connect you to our support team."
{
  event: 'transcript',
  transcript: {
    role: 'agent',
    content: 'I understand. Let me connect you to our support team.',
    timestamp: 1704139458800
  }
}

// t=4500ms: User interrupts again
// User: "Wait, actually, can you just reset my password?"

// t=4520ms: Agent stops, processes new request
{
  event: 'transcript',
  transcript: {
    role: 'user',
    content: 'Wait, actually, can you just reset my password?',
    timestamp: 1704139461520
  }
}

// t=6200ms: Agent provides password reset instructions
// Agent: "Sure, I've sent a reset link to your email. Check your inbox."

// t=12000ms: Call ends
{
  event: 'call_ended',
  call: {
    call_id: 'call_abc123',
    end_timestamp: 1704139469000,
    duration_ms: 12000,
    transcript: [...], // Full conversation
    call_analysis: {
      sentiment: 'neutral',
      summary: 'User requested password reset, provided via email.',
      resolution: 'resolved'
    }
  }
}

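What broke and recovered: At t=1200ms, the user interrupted during the agent's greeting. Any agent audio already queued for playback keeps playing unless it is flushed, which in this trace would have meant roughly another 800ms of stale speech. The isProcessingAudio guard only touches caller-to-Retell frames, so it cannot stop the agent by itself; cutting the agent off means discarding the outbound audio that is still buffered once the interruption is detected. At t=4500ms, the second interruption happened mid-response and followed the same recovery pattern. In this trace the agent stopped 20ms after the user started speaking, which is one audio frame.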
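For the outbound side of a barge-in, Twilio Media Streams accepts a "clear" message that empties the audio it has buffered for playback on a stream. The sketch below shows one way to combine that with dropping locally queued chunks. The `handleBargeIn` function and the shape of the per-call `callState` object are hypothetical, and how the interruption is detected is deliberately left open.

```javascript
// Twilio "clear" message: discards audio already buffered for this stream.
function buildClearMessage(streamSid) {
  return JSON.stringify({ event: 'clear', streamSid });
}

// Hypothetical barge-in handler: drop unsent agent audio, then tell Twilio to flush.
function handleBargeIn(callState, twilioSocket) {
  callState.pendingAgentAudio.length = 0; // stale chunks not yet sent
  callState.isAgentSpeaking = false;
  twilioSocket.send(buildClearMessage(callState.streamSid));
}

// Stand-in state and socket so the sketch runs without a live call.
const callState = {
  streamSid: 'MZ_example',
  isAgentSpeaking: true,
  pendingAgentAudio: ['chunk1', 'chunk2'],
};
const sentMessages = [];
handleBargeIn(callState, { send: (message) => sentMessages.push(message) });

console.log(sentMessages[0]);
console.log(callState.pendingAgentAudio.length);
```

Clearing both the local queue and Twilio's buffer matters because either one alone still leaves audio that the caller will hear after they have started talking.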


Written by

Misal Azeem

Voice AI Engineer & Creator

Building production voice AI systems and sharing what I learn. Focused on VAPI, LLM integrations, and real-time communication. Documenting the challenges most tutorials skip.
