How to Implement Voice AI with Twilio and VAPI: A Step-by-Step Guide

When a caller dials your Twilio number, the audio doesn't magically reach your AI model—you need a WebSocket bridge that converts telephony streams into something VAPI can process. Most implementations fail because developers treat Twilio and VAPI as plug-and-play components instead of separate systems that require explicit protocol translation.

The 60-second explanation

Twilio handles PSTN telephony (phone network routing); VAPI handles conversational AI (speech-to-text, LLM reasoning, text-to-speech). In this architecture they don't talk to each other directly. You build a proxy server that receives Twilio's Media Streams over a WebSocket, forwards audio packets to VAPI's WebSocket API, then streams VAPI's synthesized responses back through Twilio to the caller. The proxy manages mulaw-to-PCM audio format conversion, sequence tracking to detect dropped or reordered packets, and bidirectional flow control. Your server sits between two WebSocket connections: one from Twilio (incoming call audio) and one to VAPI (AI processing). Without this bridge, call audio never reaches an AI pipeline you control.

Request flow

A caller dials your Twilio number. Twilio's Voice API executes a webhook on your server, which returns TwiML markup containing a <Stream> instruction. This tells Twilio to open a WebSocket connection to your server and stream raw audio packets. Your server receives mulaw-encoded 8kHz audio chunks every 20ms. You forward these chunks to VAPI's WebSocket endpoint, which runs speech-to-text, sends the transcript to an LLM, generates a response, and synthesizes speech. VAPI streams TTS audio back to your server as base64-encoded PCM. You convert it to mulaw and forward it through the Twilio WebSocket, which plays it to the caller.
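
To make the first hop concrete, here is a minimal sketch of decoding one inbound media frame. The field names (`event`, `streamSid`, `media.payload`) follow Twilio's Media Streams message format; the helper name `decodeTwilioMedia` and the example `streamSid` value are hypothetical. It assumes 8-bit mulaw at 8 kHz, so one byte is one sample.

```javascript
// Hypothetical helper: turn one Twilio Media Streams message into raw bytes
// plus its duration. Assumes 8 kHz, 8-bit mu-law (one byte per sample).
const SAMPLE_RATE_HZ = 8000;

function decodeTwilioMedia(rawMessage) {
  const msg = JSON.parse(rawMessage);
  if (msg.event !== 'media') return null; // start, stop, mark, and so on

  const audio = Buffer.from(msg.media.payload, 'base64');
  return {
    streamSid: msg.streamSid,
    audio,
    durationMs: (audio.length * 1000) / SAMPLE_RATE_HZ
  };
}

// A 20 ms frame is 160 bytes of mu-law (0xff encodes silence).
const frame = JSON.stringify({
  event: 'media',
  streamSid: 'MZ-example', // placeholder value
  media: { payload: Buffer.alloc(160, 0xff).toString('base64') }
});
const decoded = decodeTwilioMedia(frame);
console.log(decoded.audio.length, decoded.durationMs);
```

Checking `durationMs` against the expected 20 ms is a cheap way to confirm the stream really is 8 kHz mulaw before you build anything on top of it.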

mermaid

flowchart LR
    A[Caller] -->|PSTN| B[Twilio Number]
    B -->|TwiML Webhook| C[Your Server]
    C -->|WebSocket Stream| B
    C -->|Audio Packets| D[VAPI WebSocket]
    D -->|STT → LLM → TTS| C
    C -->|Synthesized Audio| B
    B -->|PSTN| A

Critical timing: Twilio delivers audio in 20ms packets. If your server waits for complete sentences before forwarding to VAPI, you add 500-1000ms of latency, so forward each packet as it arrives and process partial transcripts immediately. VAPI's endpointing threshold (default 200ms silence) triggers false interruptions on mobile networks; increase it to 300-400ms. The sequenceId attached to each forwarded chunk lets the receiver detect gaps or out-of-order packets during network jitter.

Everything in one file

This configuration object contains every required parameter for production deployment. The TWILIO_AUTH_TOKEN validates webhook signatures to prevent unauthorized calls. The VAPI_ASSISTANT_ID references an assistant you create via VAPI's dashboard (configure model, voice, transcriber settings there). The threshold value of 500 in speech detection is tuned for phone audio—lower values (300) catch speech faster but increase false positives from breathing or background noise.

javascript

// config.js - Complete environment configuration
module.exports = {
  twilio: {
    accountSid: process.env.TWILIO_ACCOUNT_SID,
    authToken: process.env.TWILIO_AUTH_TOKEN, // For signature validation
    phoneNumber: process.env.TWILIO_PHONE_NUMBER
  },
  vapi: {
    apiKey: process.env.VAPI_API_KEY,
    assistantId: process.env.VAPI_ASSISTANT_ID, // Created via dashboard.vapi.ai
    websocketUrl: 'wss://api.vapi.ai/ws',
    endpointingThreshold: 300 // ms silence before finalizing transcript
  },
  server: {
    port: process.env.PORT || 3000,
    publicUrl: process.env.PUBLIC_URL, // ngrok URL for local dev
    audioThreshold: 500, // RMS energy for speech detection
    maxBufferSize: 50, // packets (1000ms at 20ms/packet)
    jitterBuffer: 50 // ms to handle network variance
  },
  costs: {
    twilioPerMinute: 0.0085, // USD
    vapiPerMinute: 0.03 // USD for GPT-4
  }
};

The maxBufferSize prevents memory leaks when VAPI's TTS is slower than Twilio's stream rate. After 50 queued packets (1 second of audio), drop the oldest chunks. The jitterBuffer compensates for mobile network delays—without it, you get choppy audio on 4G connections. Set NODE_ENV=production to disable verbose logging that slows down the event loop.
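
The drop-oldest policy described above can be sketched as a small queue. The class name `BoundedAudioQueue` and its `dropped` counter are hypothetical, introduced only for illustration; the default of 50 mirrors maxBufferSize in the config.

```javascript
// Sketch of a drop-oldest queue for audio chunks waiting to be sent.
class BoundedAudioQueue {
  constructor(maxSize = 50) {
    this.maxSize = maxSize;
    this.chunks = [];
    this.dropped = 0; // how many chunks were discarded, useful for metrics
  }

  // Add a chunk; if the queue is over capacity, discard the oldest ones.
  push(chunk) {
    this.chunks.push(chunk);
    while (this.chunks.length > this.maxSize) {
      this.chunks.shift();
      this.dropped++;
    }
  }

  // Remove and return the oldest chunk (undefined when empty).
  shift() {
    return this.chunks.shift();
  }

  // Discard everything, for example on barge-in.
  clear() {
    this.chunks.length = 0;
  }

  get length() {
    return this.chunks.length;
  }
}
```

Dropping the oldest chunks keeps playback close to real time: stale audio is worth less to the caller than the most recent second.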

Footguns

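Race condition on barge-in: User interrupts the assistant, but the last 600ms of TTS audio still plays because you didn't flush Twilio's buffer. When detectSpeech() fires, do three things immediately: send { type: 'interrupt' } to VAPI, clear your audioBuffer array, and send Twilio a clear message so it discards audio it has already queued for playback. Skip any of them and the assistant talks over the user.

javascript

if (detectSpeech(audioChunk) && isSpeaking) {
  audioBuffer.length = 0; // Flush local buffer
  // Ask Twilio to drop audio it has already buffered for playback.
  // streamSid comes from Twilio's 'start' event.
  ws.send(JSON.stringify({ event: 'clear', streamSid }));
  vapiConnection.send(JSON.stringify({ type: 'interrupt' }));
  isSpeaking = false;
}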

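Webhook signature validation fails after ngrok restart: Twilio's HMAC-SHA1 signature covers the full URL, including query params, exactly as configured on the phone number. If you reconstruct the URL from req.headers.host, validation breaks whenever that header differs from the URL Twilio signed, for example when ngrok assigns a new subdomain or a proxy rewrites the Host header. Store the public base URL in process.env.PUBLIC_URL, update it together with the Twilio webhook setting, and build the signed URL from it.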

Audio format mismatch crashes VAPI connection: Twilio sends mulaw 8kHz, VAPI expects PCM 16kHz. If you forward raw Twilio packets without transcoding, VAPI's WebSocket closes with error code 1003. Add a resampling layer using node-webrtc or configure VAPI's transcriber to accept mulaw (check if supported—most deployments require server-side conversion).
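
A minimal sketch of the server-side conversion, assuming the target really is 16-bit PCM at 16 kHz as stated above. The decoder is the standard G.711 mu-law expansion; the function names are hypothetical. The upsampler is naive linear interpolation, which is enough to illustrate the step but is not a substitute for a properly filtered resampler.

```javascript
// G.711 mu-law byte -> 16-bit linear PCM sample.
function mulawToLinear(byte) {
  const u = ~byte & 0xff;            // mu-law bytes are stored inverted
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return (u & 0x80) ? -magnitude : magnitude;
}

// Decode a mu-law Buffer at 8 kHz into PCM16 at 16 kHz.
// Each input sample yields two output samples: the sample itself and the
// midpoint between it and the next sample (naive linear interpolation).
function mulaw8kToPcm16k(mulaw) {
  const out = new Int16Array(mulaw.length * 2);
  for (let i = 0; i < mulaw.length; i++) {
    const current = mulawToLinear(mulaw[i]);
    const next = i + 1 < mulaw.length ? mulawToLinear(mulaw[i + 1]) : current;
    out[2 * i] = current;
    out[2 * i + 1] = (current + next) >> 1;
  }
  return out;
}
```

To forward the result, wrap it without copying: `Buffer.from(pcm.buffer, pcm.byteOffset, pcm.byteLength).toString('base64')`.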

VAD threshold too low causes echo loops: Default RMS threshold of 0.008 detects the assistant's own audio as user speech, triggering infinite interrupts. Raise to 0.015 for production. Test on mobile networks where background noise is 12% higher than landlines.
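
The thresholds in this guide appear on two scales: 0.008 and 0.015 here, and 500 in the config. A sketch of how they could relate, assuming (this is an assumption, not something the guide states) that the small values are fractions of 16-bit full scale. Under that assumption 0.015 is about 492 and 0.008 is about 262 on the integer scale. The function names are hypothetical.

```javascript
const INT16_FULL_SCALE = 32768;

// Root-mean-square energy of 16-bit PCM samples.
function rmsInt16(samples) {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (const s of samples) sumSquares += s * s;
  return Math.sqrt(sumSquares / samples.length);
}

// Convert a normalized threshold (0..1) to the 16-bit integer scale.
function normalizedToInt16(threshold) {
  return threshold * INT16_FULL_SCALE;
}

// True when the chunk's energy exceeds the (normalized) threshold.
function isSpeech(samples, normalizedThreshold = 0.015) {
  return rmsInt16(samples) > normalizedToInt16(normalizedThreshold);
}
```

Whichever scale you pick, keep the threshold and the samples on the same one; comparing a normalized threshold against integer samples makes every chunk look like speech.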

Twilio timeout after 15 seconds: If VAPI doesn't respond within 15s, Twilio hangs up. Implement keepalive pings every 10s on the VAPI WebSocket. Send { type: 'ping' } and expect { type: 'pong' } within 2s. If no response, reconnect before Twilio terminates the call.
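
The keepalive timing above can be sketched as a small state machine with an injected clock, which keeps it testable without real timers. The class name `Keepalive` is hypothetical, and the { type: 'ping' } / { type: 'pong' } messages are the protocol assumed in this guide rather than a confirmed VAPI API.

```javascript
// Keepalive bookkeeping: decides when to ping and when to give up.
class Keepalive {
  constructor({ intervalMs = 10000, pongTimeoutMs = 2000 } = {}) {
    this.intervalMs = intervalMs;
    this.pongTimeoutMs = pongTimeoutMs;
    this.lastPingAt = null;  // time of the ping still awaiting a pong
    this.lastActivityAt = 0; // time of the last pong (or start)
  }

  // Call periodically with the current time in ms.
  // Returns 'ping' (send one now), 'reconnect' (pong overdue), or 'idle'.
  tick(now) {
    if (this.lastPingAt !== null) {
      return now - this.lastPingAt > this.pongTimeoutMs ? 'reconnect' : 'idle';
    }
    if (now - this.lastActivityAt >= this.intervalMs) {
      this.lastPingAt = now;
      return 'ping';
    }
    return 'idle';
  }

  // Call when a pong arrives.
  onPong(now) {
    this.lastPingAt = null;
    this.lastActivityAt = now;
  }
}
```

In the server you would call `tick(Date.now())` from a `setInterval` running about once a second, send the ping when it returns 'ping', and tear down and reopen the VAPI socket when it returns 'reconnect'.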

Complete working example

This server handles Twilio's incoming call webhook, establishes dual WebSocket connections (Twilio ↔ Server ↔ VAPI), manages bidirectional audio streaming, and validates webhook signatures. Every route, error handler, and audio processing function is included. The validateTwilioSignature function prevents unauthorized webhook calls—production systems must validate every request. The sequenceId prevents audio packet reordering during network jitter. Speech detection uses RMS energy calculation to avoid echo loops where VAPI responds to its own audio.

javascript

// server.js - Production Twilio + VAPI Voice AI Server
const express = require('express');
const WebSocket = require('ws');
const crypto = require('crypto');

const app = express();
const port = process.env.PORT || 3000;

app.use(express.urlencoded({ extended: false }));
app.use(express.json());

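function validateTwilioSignature(req) {
  const signature = req.headers['x-twilio-signature'] || '';
  // Prefer the configured public base URL (no trailing slash): behind ngrok or
  // a proxy, the Host header may not match the URL Twilio actually signed.
  const baseUrl = process.env.PUBLIC_URL || `https://${req.headers.host}`;
  const url = `${baseUrl}${req.originalUrl}`;
  const params = req.body || {};

  // Twilio signs the URL followed by every POST param name and value,
  // concatenated in order of param name.
  const data = Object.keys(params)
    .sort()
    .reduce((acc, key) => acc + key + params[key], url);

  const expectedSig = crypto
    .createHmac('sha1', process.env.TWILIO_AUTH_TOKEN)
    .update(Buffer.from(data, 'utf-8'))
    .digest('base64');

  // Constant-time comparison; timingSafeEqual throws on a length mismatch,
  // so check lengths first.
  const received = Buffer.from(signature);
  const expected = Buffer.from(expectedSig);
  return received.length === expected.length &&
    crypto.timingSafeEqual(received, expected);
}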

app.post('/voice/incoming', (req, res) => {
  if (!validateTwilioSignature(req)) {
    return res.status(403).send('Forbidden');
  }

  const callSid = req.body.CallSid;
  const twiml = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://${req.headers.host}/media/${callSid}">
      <Parameter name="callSid" value="${callSid}" />
    </Stream>
  </Connect>
</Response>`;

  res.type('text/xml');
  res.send(twiml);
});

const wss = new WebSocket.Server({ noServer: true });

wss.on('connection', (ws, req) => {
  const callSid = req.url.split('/').pop();
  let vapiConnection = null;
  let audioBuffer = [];
  let isSpeaking = false;
  let sequenceId = 0;

  console.log(`[${callSid}] Twilio stream connected`);

  vapiConnection = new WebSocket('wss://api.vapi.ai/ws', {
    headers: { 'Authorization': `Bearer ${process.env.VAPI_API_KEY}` }
  });

  vapiConnection.on('open', () => {
    console.log(`[${callSid}] VAPI connection established`);
    vapiConnection.send(JSON.stringify({
      type: 'start',
      assistantId: process.env.VAPI_ASSISTANT_ID,
      metadata: { callSid, source: 'twilio' }
    }));
  });

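  let streamSid = null; // Set from Twilio's start event; required on messages sent back

  ws.on('message', (msg) => {
    const data = JSON.parse(msg);

    if (data.event === 'start') {
      streamSid = data.start.streamSid;
      console.log(`[${callSid}] Media stream started`);
    }

    if (data.event === 'media') {
      const audioChunk = Buffer.from(data.media.payload, 'base64');

      // Twilio sends 8-bit mu-law, one byte per sample. Decode each byte to
      // 16-bit linear PCM (G.711) so the energy is on the threshold's scale.
      let sumSquares = 0;
      for (let i = 0; i < audioChunk.length; i++) {
        const u = ~audioChunk[i] & 0xff;
        const magnitude = ((((u & 0x0f) << 3) + 0x84) << ((u & 0x70) >> 4)) - 0x84;
        const sample = (u & 0x80) ? -magnitude : magnitude;
        sumSquares += sample * sample;
      }
      const rms = Math.sqrt(sumSquares / audioChunk.length);
      const threshold = 500;

      if (rms > threshold) {
        isSpeaking = true;
        if (vapiConnection?.readyState === WebSocket.OPEN) {
          // Forwarded as received (mu-law 8kHz); transcode here if the
          // VAPI side expects PCM.
          vapiConnection.send(JSON.stringify({
            type: 'audio',
            data: data.media.payload,
            sequenceId: sequenceId++
          }));
        }
      }
    }

    if (data.event === 'stop') {
      console.log(`[${callSid}] Media stream stopped`);
      vapiConnection?.close();
    }
  });

  vapiConnection.on('message', (msg) => {
    const data = JSON.parse(msg);

    if (data.type === 'audio') {
      // Twilio ignores outbound media that does not carry the streamSid.
      if (ws.readyState === WebSocket.OPEN && streamSid) {
        ws.send(JSON.stringify({
          event: 'media',
          streamSid,
          media: { payload: data.audio }
        }));
      }
    }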

    if (data.type === 'transcript') {
      console.log(`[${callSid}] Transcript: ${data.text}`);
    }

    if (data.type === 'error') {
      console.error(`[${callSid}] VAPI error:`, data.message);
    }
  });

  vapiConnection.on('error', (error) => {
    console.error(`[${callSid}] VAPI error:`, error);
  });

  ws.on('close', () => {
    console.log(`[${callSid}] Twilio stream closed`);
    vapiConnection?.close();
  });
});

const server = app.listen(port, () => {
  console.log(`Server running on port ${port}`);
});

server.on('upgrade', (req, socket, head) => {
  if (req.url.startsWith('/media/')) {
    wss.handleUpgrade(req, socket, head, (ws) => {
      wss.emit('connection', ws, req);
    });
  } else {
    socket.destroy();
  }
});

Run it:

bash

npm install express ws
export TWILIO_AUTH_TOKEN="your_auth_token"
export VAPI_API_KEY="your_vapi_key"
export VAPI_ASSISTANT_ID="asst_xyz"
node server.js

# Expose locally with ngrok
ngrok http 3000
# Update Twilio phone number webhook to: https://YOUR_NGROK_URL/voice/incoming

Production checklist: Use wss:// for VAPI connections (not ws://). Set NODE_ENV=production to disable verbose logging. Deploy to a server with a valid SSL certificate: Twilio's <Stream> only accepts wss:// URLs, and webhooks should be served over HTTPS so signed requests are not exposed in transit. Monitor response times: webhooks timing out after 5 seconds cause Twilio retries and duplicate events. Measure latency from user speech end to bot response start. Target under 800ms; above 1200ms feels broken.

FAQ

Can I use VAPI's native Twilio integration instead of building a custom WebSocket bridge?

VAPI's native integration handles SIP trunking but abstracts away the media stream layer. If you need custom audio processing (barge-in detection, silence thresholds, real-time transcription manipulation), you must build the WebSocket bridge. The native integration is a black box—you can't inject middleware or modify the audio pipeline. For sub-200ms latency or custom VAD logic, the custom bridge is required.

How do I handle Twilio's mulaw format with VAPI's PCM requirement?

Twilio streams mulaw-encoded audio at 8kHz, VAPI expects PCM 16kHz. Add a transcoding layer in your WebSocket handler using node-webrtc or ffmpeg.wasm. The detectSpeech function works on raw mulaw samples but you'll get better accuracy post-transcoding. Expect 15-30ms overhead per chunk. Alternatively, configure VAPI's transcriber to accept mulaw if supported (check documentation).
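
For the return path (VAPI's PCM back to Twilio), a minimal sketch of the reverse conversion, again assuming 16-bit PCM at 16 kHz on the VAPI side. The encoder is the standard G.711 mu-law compression; the function names are hypothetical. Downsampling averages each pair of samples, a crude low-pass that is fine for a demo but not a replacement for a real resampler.

```javascript
const MULAW_BIAS = 0x84;
const MULAW_CLIP = 32635;

// 16-bit linear PCM sample -> G.711 mu-law byte.
function linearToMulaw(sample) {
  const sign = sample < 0 ? 0x80 : 0x00;
  const magnitude = Math.min(Math.abs(sample), MULAW_CLIP) + MULAW_BIAS;

  // Find the segment (exponent): position of the highest set bit above bit 7.
  let exponent = 7;
  for (let mask = 0x4000; (magnitude & mask) === 0 && exponent > 0; mask >>= 1) {
    exponent--;
  }
  const mantissa = (magnitude >> (exponent + 3)) & 0x0f;
  return ~(sign | (exponent << 4) | mantissa) & 0xff; // stored inverted
}

// PCM16 at 16 kHz -> mu-law Buffer at 8 kHz (average each pair of samples).
function pcm16kToMulaw8k(pcm) {
  const out = Buffer.alloc(Math.floor(pcm.length / 2));
  for (let i = 0; i < out.length; i++) {
    const averaged = (pcm[2 * i] + pcm[2 * i + 1]) >> 1;
    out[i] = linearToMulaw(averaged);
  }
  return out;
}
```

The resulting Buffer can be base64-encoded and used directly as the `media.payload` of a message sent to Twilio.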

Why does my voice AI agent have 800ms+ latency on Twilio calls?

Three bottlenecks: (1) Twilio's media stream arrives in 20ms chunks, so process partial transcripts immediately instead of waiting for full sentences. (2) Cold-start on the VAPI connection: keep a connection pool warm. (3) The speech-detection threshold is too conservative. This tutorial uses 500; lowering it toward 300 makes barge-in trigger sooner, at the cost of more false positives from breathing or background noise. Measure each hop: Twilio → WebSocket (50ms), WebSocket → VAPI (100ms), VAPI → LLM (200ms), LLM → TTS (150ms). Optimize the slowest link first.
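
The per-hop figures above add up to 500ms, which leaves 300ms of headroom against the 800ms target from the production checklist. A small sketch of that arithmetic; the helper name `summarizeLatency` is hypothetical and the numbers are the illustrative ones from this answer, not measurements.

```javascript
// Illustrative per-hop latencies from the text, in milliseconds.
const exampleHops = [
  { name: 'Twilio -> WebSocket', ms: 50 },
  { name: 'WebSocket -> VAPI', ms: 100 },
  { name: 'VAPI -> LLM', ms: 200 },
  { name: 'LLM -> TTS', ms: 150 }
];

// Total the hops, find the slowest, and report headroom against a target.
function summarizeLatency(hops, targetMs = 800) {
  const totalMs = hops.reduce((sum, hop) => sum + hop.ms, 0);
  const slowest = hops.reduce((a, b) => (b.ms > a.ms ? b : a));
  return { totalMs, slowest: slowest.name, headroomMs: targetMs - totalMs };
}

console.log(summarizeLatency(exampleHops));
```

Replace the example values with timestamps you log at each hop; the slowest entry tells you where to spend optimization effort first.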

Should I use Twilio + VAPI or VAPI's standalone phone system?

VAPI's standalone system (SIP trunking) is 150ms faster because it eliminates the WebSocket hop. Use Twilio if you need existing Twilio phone numbers, SMS fallback, call recording with compliance tools, or Flex integration. The custom bridge gives you middleware control (fraud detection, custom analytics) that VAPI's native integration doesn't expose. For greenfield projects with no Twilio dependency, VAPI standalone is simpler and cheaper ($0.012/min vs Twilio's $0.0085/min + VAPI fees).

How do I prevent audio buffer overruns when VAPI's TTS is slower than Twilio's stream rate?

Track how many chunks are queued for playback and compare that against your processing loop. If the backlog exceeds 10 packets (200ms), flush audioBuffer and send Twilio a clear message so it discards audio it has already buffered. Avoid redirecting the call with new TwiML to pause it, because that ends the <Connect><Stream> session. The isSpeaking flag prevents overlapping TTS but doesn't handle queue buildup. Add a max queue size (50 chunks) and drop the oldest packets when full.

What's the difference between Twilio's <Stream> and <Connect> TwiML verbs?

<Stream> is not a standalone verb: it is always nested inside <Start> or <Connect>. <Start><Stream> forks a one-way copy of the call audio to your WebSocket server while the rest of the TwiML keeps executing. <Connect><Stream> opens a bidirectional stream, so your server can also send audio back to the caller, and any TwiML after it does not run until the stream ends. For VAPI integration you need the bidirectional form, which is why the TwiML response in this tutorial wraps <Stream url="wss://..."> in <Connect>.
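
In Twilio's TwiML, <Stream> is nested inside either <Start> (one-way fork) or <Connect> (bidirectional). A minimal sketch that generates both forms; the helper names `escapeXml` and `buildStreamTwiml` are hypothetical.

```javascript
// Escape characters that are unsafe inside an XML attribute value.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// mode 'connect' -> bidirectional stream (server can send audio back)
// mode 'start'   -> one-way fork of the call audio
function buildStreamTwiml(wsUrl, mode = 'connect') {
  const verb = mode === 'start' ? 'Start' : 'Connect';
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<Response>',
    `  <${verb}>`,
    `    <Stream url="${escapeXml(wsUrl)}" />`,
    `  </${verb}>`,
    '</Response>'
  ].join('\n');
}

console.log(buildStreamTwiml('wss://example.com/media/CA123'));
```

Escaping the URL matters once you interpolate request data such as a CallSid into the markup, as the webhook handler in this tutorial does.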
