Implementing Real-Time Audio Streaming in VAPI: What I Learned

Discover how I enhanced user engagement with real-time audio streaming in VAPI using Twilio. Learn the practical steps for a seamless integration.

Misal Azeem

Voice AI Engineer & Creator

The problem

Real-time audio streaming in VAPI breaks when you treat it like batch processing. The symptom: users hear the first syllable cut off, or the agent keeps talking for 2 seconds after being interrupted. The root cause is a race condition between Twilio's Media Stream (which sends audio in 20ms chunks) and VAPI's Voice Activity Detection (which needs 300-500ms to detect speech start). Without buffering, you drop the first syllable. Without barge-in detection, you get overlapping audio—the agent finishes the old sentence while processing the new one. End-to-end latency jumps from 200-400ms to 2-3 seconds, making conversations feel robotic.

Prerequisites

  • VAPI API key from your dashboard
  • Twilio account with active phone number, Account SID, and Auth Token
  • Node.js 16+ with npm or yarn
  • TLS 1.2+ for WebSocket connections
  • Publicly accessible server (use ngrok for local testing: ngrok http 3000)
  • Stable network: 4G/5G or hardwired connection to avoid latency jitter

Install dependencies:

bash
npm install express ws twilio dotenv

Store credentials in .env:

VAPI_API_KEY=your_key_here
TWILIO_ACCOUNT_SID=your_sid
TWILIO_AUTH_TOKEN=your_token
TWILIO_PHONE_NUMBER=+1234567890

The wire format

VAPI and Twilio use incompatible protocols. VAPI's Web SDK streams audio via WebSocket. Twilio's Voice API streams via Media Streams (a different WebSocket protocol). Your server is the bridge.

Call flow:

  1. User dials Twilio number → Twilio hits your /webhook/twilio endpoint
  2. Your server returns TwiML with <Stream> tag pointing to your WebSocket server
  3. Twilio opens WebSocket connection, sends start event with streamSid
  4. Twilio streams audio as media events (base64-encoded mulaw, 20ms chunks)
  5. Your server buffers 400ms (20 chunks), decodes mulaw → PCM, forwards to VAPI WebSocket
  6. VAPI processes audio (STT → LLM → TTS), returns audio chunks
  7. Your server forwards VAPI audio back to Twilio WebSocket
  8. Twilio plays audio to user over PSTN
mermaid
sequenceDiagram
    participant User
    participant Twilio
    participant Server
    participant VAPI
    
    User->>Twilio: Dial phone number
    Twilio->>Server: POST /webhook/twilio
    Server->>Twilio: TwiML with <Stream> tag
    Twilio->>Server: WebSocket connect (streamSid)
    loop Every 20ms
        Twilio->>Server: media event (mulaw chunk)
        Server->>Server: Buffer 400ms (20 chunks)
        Server->>VAPI: Forward PCM audio
        VAPI->>Server: TTS audio response
        Server->>Twilio: media event (audio chunk)
    end
    Twilio->>User: Play audio over PSTN

Critical detail: Twilio's media events arrive at 50Hz (every 20ms). VAPI's VAD fires asynchronously after 300-500ms. If you forward chunks immediately, the first syllable gets dropped because VAD hasn't activated yet. Buffer 400ms minimum before forwarding.
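
The buffering arithmetic can be sketched in isolation. Twilio Media Streams deliver 8 kHz, 8-bit mulaw audio, so a 20ms chunk is 160 bytes and 20 chunks come to 3200 bytes (400ms). The `ChunkBuffer` class below is a hypothetical helper, not part of either SDK:

```javascript
// Twilio Media Streams audio: 8 kHz, 8-bit mulaw, mono.
const SAMPLE_RATE_HZ = 8000;
const CHUNK_MS = 20;
const BYTES_PER_CHUNK = (SAMPLE_RATE_HZ * CHUNK_MS) / 1000; // 160 bytes

// Hypothetical helper: collects base64 chunks and releases them in batches.
class ChunkBuffer {
  constructor(chunksPerFlush = 20) {
    this.chunksPerFlush = chunksPerFlush;
    this.chunks = [];
  }

  // Returns a combined Buffer once enough chunks are queued, otherwise null.
  push(base64Payload) {
    this.chunks.push(Buffer.from(base64Payload, 'base64'));
    if (this.chunks.length < this.chunksPerFlush) return null;
    const combined = Buffer.concat(this.chunks);
    this.chunks = [];
    return combined;
  }
}

// Usage: feed 20 silent chunks (0xFF is mulaw silence); the 20th push flushes.
const buffer = new ChunkBuffer(20);
const silentChunk = Buffer.alloc(BYTES_PER_CHUNK, 0xff).toString('base64');
let flushed = null;
for (let i = 0; i < 20; i++) flushed = buffer.push(silentChunk);
console.log(flushed.length); // 3200 bytes = 20 chunks x 160 bytes = 400ms
```

Keeping the batching in its own small class makes the flush threshold easy to tune without touching the WebSocket handlers.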

The implementation

1. Twilio webhook endpoint

This endpoint receives the incoming call and returns TwiML that starts the Media Stream:

javascript
const express = require('express');
const twilio = require('twilio');

const app = express();
app.use(express.json());
app.use(express.urlencoded({ extended: true }));

app.post('/webhook/twilio', (req, res) => {
  const twiml = new twilio.twiml.VoiceResponse();

  // <Connect><Stream> opens a bidirectional Media Stream, which is required
  // to send agent audio (and `clear` events) back to the caller.
  // <Start><Stream> is one-way: audio sent by the server is never played.
  const connect = twiml.connect();
  connect.stream({
    url: `wss://${req.headers.host}/media`
  });

  // No <Pause> needed: <Connect> holds the call open until the stream closes.

  res.type('text/xml');
  res.send(twiml.toString());
});

What beginners miss: The choice of TwiML verb is critical. <Start><Stream> forks the call audio one way, so the media and clear messages sent in the next steps would never reach the caller. <Connect><Stream> is bidirectional and delivers only the inbound track (the caller's audio), so the agent's own voice is not fed back into the transcriber.
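
In production you would normally also check that webhook requests really come from Twilio. The sketch below reimplements Twilio's documented signature scheme with Node's built-in crypto module so the steps are visible. The function names are hypothetical, and in a real app the `twilio` package's own `validateRequest` helper is the safer choice:

```javascript
const crypto = require('crypto');

// Signature scheme for form-encoded POST webhooks: HMAC-SHA1 over the full
// request URL followed by every POST parameter (sorted by name, name and
// value concatenated), keyed with the auth token and base64-encoded.
function computeTwilioSignature(authToken, url, params) {
  const data = Object.keys(params)
    .sort()
    .reduce((acc, key) => acc + key + params[key], url);
  return crypto.createHmac('sha1', authToken).update(data, 'utf8').digest('base64');
}

// Compares the X-Twilio-Signature header value against the expected signature.
function isValidTwilioRequest(authToken, signatureHeader, url, params) {
  const expected = Buffer.from(computeTwilioSignature(authToken, url, params));
  const received = Buffer.from(signatureHeader || '');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length &&
    crypto.timingSafeEqual(expected, received);
}

// Usage with made-up values (token, URL, and params are placeholders).
const url = 'https://abc123.ngrok.io/webhook/twilio';
const params = { CallSid: 'CA-test', From: '+15550001111', To: '+1234567890' };
const signature = computeTwilioSignature('your_token', url, params);
console.log(isValidTwilioRequest('your_token', signature, url, params)); // true
```

The URL must be the public one Twilio called (the ngrok URL during local testing), not localhost, or the signatures will never match.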

2. WebSocket server setup

Handle Twilio's WebSocket connection and manage session state:

javascript
const WebSocket = require('ws');
const wss = new WebSocket.Server({ noServer: true });

const sessions = new Map();
const SESSION_TTL = 25 * 60 * 1000; // 25 minutes (before VAPI 30min timeout)

function cleanupSession(streamSid) {
  const session = sessions.get(streamSid);
  if (session) {
    if (session.vapiWs && session.vapiWs.readyState === WebSocket.OPEN) {
      session.vapiWs.close();
    }
    clearTimeout(session.ttlTimer);
    sessions.delete(streamSid);
    console.log(`[Cleanup] Session ${streamSid} removed`);
  }
}

const server = app.listen(3000, () => {
  console.log('[Server] Listening on port 3000');
});

server.on('upgrade', (req, socket, head) => {
  wss.handleUpgrade(req, socket, head, (ws) => {
    wss.emit('connection', ws, req);
  });
});

Production failure: Twilio disconnects Media Streams after 4 hours. VAPI sessions time out after 30 minutes of silence. Set SESSION_TTL to 25 minutes and run cleanup on both the TTL timer and explicit stop events.
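
One caveat: the timer in the code above starts when the session is created and never resets, so an active call is cleaned up after 25 minutes even if the caller is still talking. If you want an idle timeout instead, a minimal sketch looks like this. `refreshSessionTtl` is a hypothetical helper, and where you call it from is an assumption:

```javascript
// Hypothetical helper: restarts a session's TTL timer whenever there is activity.
function refreshSessionTtl(session, ttlMs, onExpire) {
  clearTimeout(session.ttlTimer); // clearTimeout ignores null/undefined
  session.ttlTimer = setTimeout(onExpire, ttlMs);
  return session;
}

// Usage: call it on speech activity (for example, each transcript message).
// Twilio sends media frames even while the caller is silent, so refreshing
// on every media event would keep the session alive forever.
const demoSession = { ttlTimer: null };
refreshSessionTtl(demoSession, 25 * 60 * 1000, () => console.log('expired'));
clearTimeout(demoSession.ttlTimer); // demo only: lets the process exit
```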

3. Audio bridge with race condition guard

Forward audio between Twilio and VAPI with buffering and concurrency control:

javascript
wss.on('connection', (ws) => {
  let streamSid = null;
  let isProcessing = false; // Prevents concurrent chunk handling
  let audioBuffer = [];

  ws.on('message', async (msg) => {
    const payload = JSON.parse(msg);

    if (payload.event === 'start') {
      streamSid = payload.start.streamSid;
      const callSid = payload.start.callSid;

      // Connect to VAPI WebSocket
      const vapiWs = new WebSocket('wss://api.vapi.ai/ws', {
        headers: { 'Authorization': `Bearer ${process.env.VAPI_API_KEY}` }
      });

      const session = {
        vapiWs,
        twilioWs: ws,
        callSid,
        ttlTimer: setTimeout(() => cleanupSession(streamSid), SESSION_TTL)
      };
      sessions.set(streamSid, session);

      // Forward VAPI audio back to Twilio
      vapiWs.on('message', (data) => {
        if (ws.readyState === WebSocket.OPEN) {
          ws.send(JSON.stringify({
            event: 'media',
            streamSid,
            media: { payload: data.toString('base64') }
          }));
        }
      });

      vapiWs.on('error', (err) => {
        console.error(`[VAPI Error] ${streamSid}:`, err.message);
        cleanupSession(streamSid);
      });

      console.log(`[Session Start] ${streamSid} → ${callSid}`);
    }

    if (payload.event === 'media' && streamSid) {
      const session = sessions.get(streamSid);
      if (!session || session.vapiWs.readyState !== WebSocket.OPEN) return;

      // Race condition guard: buffer audio if VAPI is processing
      if (isProcessing) {
        audioBuffer.push(payload.media.payload);
        if (audioBuffer.length > 50) audioBuffer.shift(); // Prevent memory leak
        return;
      }

      isProcessing = true;

      // Buffer 400ms (20 chunks) before forwarding to VAPI.
      // The batch is sent as raw mulaw bytes; decode to PCM first if the
      // receiving end expects PCM.
      audioBuffer.push(payload.media.payload);
      if (audioBuffer.length >= 20) {
        const combined = Buffer.concat(
          audioBuffer.map(b64 => Buffer.from(b64, 'base64'))
        );
        session.vapiWs.send(combined);
        audioBuffer = [];
      }

      // Release lock after 20ms
      setTimeout(() => { isProcessing = false; }, 20);
    }

    if (payload.event === 'stop' && streamSid) {
      cleanupSession(streamSid);
    }
  });

  ws.on('close', () => {
    if (streamSid) cleanupSession(streamSid);
  });
});

This will bite you: The isProcessing flag prevents Twilio from flooding VAPI during silence detection delays. Without batching, each call sends 50 messages per second, so two concurrent calls are enough to reach VAPI's rate limit (100 requests/second). Buffering 20 chunks cuts that to 2.5 messages per second per call, a 95% reduction.
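
Step 5 of the call flow mentions decoding mulaw to PCM, which the bridge code does not actually do. A minimal sketch of the standard G.711 mu-law expansion follows. The function names are hypothetical, and whether the receiving side wants 16-bit little-endian PCM at 8 kHz is an assumption to check against its docs:

```javascript
// Expands one 8-bit mu-law byte to a signed 16-bit PCM sample (G.711).
function muLawToPcm16Sample(muLawByte) {
  const u = ~muLawByte & 0xff;       // mu-law bytes are stored inverted
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

// Converts a Buffer of mu-law bytes to a Buffer of 16-bit little-endian PCM.
function muLawToPcm16(muLawBuffer) {
  const pcm = Buffer.alloc(muLawBuffer.length * 2);
  for (let i = 0; i < muLawBuffer.length; i++) {
    pcm.writeInt16LE(muLawToPcm16Sample(muLawBuffer[i]), i * 2);
  }
  return pcm;
}

// Usage: 0xFF is silence, 0x80 and 0x00 are the loudest positive/negative values.
console.log(muLawToPcm16Sample(0xff)); // 0
console.log(muLawToPcm16Sample(0x80)); // 32124
console.log(muLawToPcm16Sample(0x00)); // -32124
```

Decoding doubles the payload size: a 3200-byte mulaw batch (400ms) becomes 6400 bytes of PCM.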

4. Barge-in detection

Handle user interruptions by flushing the TTS buffer:

javascript
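// Replaces the vapiWs.on('message') handler from step 3
vapiWs.on('message', (data) => {
  let msg;
  try {
    msg = JSON.parse(data);
  } catch (err) {
    return; // Ignore frames that are not JSON
  }

  // Partial transcript indicates user is speaking (barge-in)
  if (msg.event === 'transcript' && msg.isFinal === false) {
    if (isProcessing && ws.readyState === WebSocket.OPEN) {
      // Tell Twilio to discard agent audio it has queued but not yet played.
      // audioBuffer is left alone: it holds the caller's inbound audio,
      // including the interruption itself.
      ws.send(JSON.stringify({
        event: 'clear',
        streamSid
      }));
      isProcessing = false;
      console.log(`[Barge-in] Flushed buffer for ${streamSid}`);
    }
  }

  // Forward agent audio to Twilio (msg.audio is assumed to be base64 mulaw)
  if (msg.event === 'audio' && ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify({
      event: 'media',
      streamSid,
      media: { payload: msg.audio }
    }));
  }
});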

Real-world problem: Without barge-in detection, the agent finishes the old sentence while processing the new input, creating overlapping audio. Users hear: "Your appointment is Tuesday at 3 PM and I'll send—" + "Sure, I've changed it to Wednesday" simultaneously.
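
Partial transcripts usually arrive in bursts, so it can help to send a single clear per interruption rather than one per partial. A minimal sketch of that decision as a small state holder follows. `BargeInController` is a hypothetical name, and the `isFinal` flag mirrors the message shape assumed in the handler above:

```javascript
// Hypothetical helper: decides when to tell Twilio to drop queued agent audio.
class BargeInController {
  constructor(streamSid) {
    this.streamSid = streamSid;
    this.agentSpeaking = false;
  }

  // Call whenever an agent audio chunk is forwarded to Twilio.
  onAgentAudio() {
    this.agentSpeaking = true;
  }

  // Call for every transcript message. Returns a Twilio `clear` message on the
  // first partial transcript of an interruption, otherwise null.
  onTranscript(isFinal) {
    if (isFinal || !this.agentSpeaking) return null;
    this.agentSpeaking = false; // one clear per interruption
    return { event: 'clear', streamSid: this.streamSid };
  }
}

// Usage: two partials in a row produce only one clear message.
const controller = new BargeInController('MZ123');
controller.onAgentAudio();
console.log(controller.onTranscript(false)); // { event: 'clear', streamSid: 'MZ123' }
console.log(controller.onTranscript(false)); // null
```

The returned object still needs `JSON.stringify` before it goes over the Twilio WebSocket.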

Minimal viable config

Complete server configuration with all required parameters:

javascript
require('dotenv').config();

const config = {
  server: {
    port: process.env.PORT || 3000,
    host: process.env.HOST || '0.0.0.0'
  },
  
  twilio: {
    accountSid: process.env.TWILIO_ACCOUNT_SID,
    authToken: process.env.TWILIO_AUTH_TOKEN,
    phoneNumber: process.env.TWILIO_PHONE_NUMBER,
    // Validate webhook signatures in production
    validateSignatures: process.env.NODE_ENV === 'production'
  },
  
  vapi: {
    apiKey: process.env.VAPI_API_KEY,
    wsEndpoint: 'wss://api.vapi.ai/ws',
    // Assistant config returned in webhook
    assistant: {
      model: {
        provider: "openai",
        model: "gpt-4", // gpt-3.5-turbo for lower latency
        temperature: 0.7 // 0.3-0.5 for deterministic responses
      },
      voice: {
        provider: "11labs",
        voiceId: "21m00Tcm4TlvDq8ikWAM" // Rachel voice
      },
      transcriber: {
        provider: "deepgram",
        model: "nova-2", // nova-2 = 200ms latency, base = 400ms
        language: "en"
      }
    }
  },
  
  streaming: {
    bufferSize: 20, // chunks (400ms at 20ms/chunk)
    maxBufferSize: 50, // prevent memory leak during jitter
    sessionTTL: 25 * 60 * 1000, // 25 minutes
    heartbeatInterval: 10000 // ping every 10s to prevent timeout
  }
};

module.exports = config;

Tradeoffs:

  • bufferSize: 20 (400ms) balances latency vs. dropped syllables. Increase to 30 (600ms) for mobile networks.
  • model: "gpt-4" gives better responses but adds 200-400ms latency. Use gpt-3.5-turbo for <200ms.
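  • transcriber: "nova-2" is the low-latency choice here (about 200ms, versus about 400ms for base). Deepgram also positions Nova-2 as more accurate than its older base tier, so there is no accuracy reason to prefer base.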
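
The config defines heartbeatInterval, but none of the earlier code uses it. A minimal sketch of a ping/pong liveness check follows, based on the `isAlive` pattern commonly used with the `ws` package. The function names are hypothetical, and the sockets only need to expose `ping()` and `terminate()` as `ws` sockets do:

```javascript
// One heartbeat pass: terminate sockets that never answered the previous
// ping, then ping the rest.
function heartbeatTick(sockets) {
  for (const socket of sockets) {
    if (socket.isAlive === false) {
      socket.terminate(); // no pong since the previous tick
      continue;
    }
    socket.isAlive = false; // flipped back to true by the pong handler
    socket.ping();
  }
}

// Marks a socket as alive; attach with socket.on('pong', () => markAlive(socket)).
function markAlive(socket) {
  socket.isAlive = true;
}

// Wiring into the WebSocket server from step 2 (not executed here):
// wss.on('connection', (ws) => {
//   markAlive(ws);
//   ws.on('pong', () => markAlive(ws));
// });
// const timer = setInterval(
//   () => heartbeatTick(wss.clients),
//   config.streaming.heartbeatInterval
// );
// wss.on('close', () => clearInterval(timer));
```

A socket gets one full interval to answer before it is terminated, so with a 10s interval a dead connection is dropped within roughly 20s.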

Verify it works

1. Health check

bash
curl http://localhost:3000/health

Expected response:

json
{
  "status": "ok",
  "sessions": 0,
  "uptime": 42.3
}
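
None of the earlier code defines a /health route, so the curl above returns a 404 until you add one. A minimal sketch that produces the response shown follows. `buildHealthPayload` is a hypothetical helper, and the wiring assumes the `app` and `sessions` objects from the earlier steps:

```javascript
// Builds the body for GET /health. `sessions` is any object with a numeric
// `size` property, such as the sessions Map from the WebSocket server step.
function buildHealthPayload(sessions, uptimeSeconds = process.uptime()) {
  return {
    status: 'ok',
    sessions: sessions.size,
    uptime: Number(uptimeSeconds.toFixed(1))
  };
}

// Wiring into the Express app from step 1 (not executed here):
// app.get('/health', (req, res) => res.json(buildHealthPayload(sessions)));

console.log(buildHealthPayload(new Map(), 42.3));
// { status: 'ok', sessions: 0, uptime: 42.3 }
```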

2. Test WebSocket connection

bash
# Install wscat for WebSocket testing
npm install -g wscat

# Connect to your WebSocket server
wscat -c ws://localhost:3000/media

Send a test start event:

json
{"event":"start","start":{"streamSid":"test-123","callSid":"CA-test"}}

Expected log output:

[Session Start] test-123 → CA-test

3. End-to-end call test

  1. Expose localhost with ngrok:
bash
ngrok http 3000
# Copy the HTTPS URL (e.g., https://abc123.ngrok.io)

  2. Configure Twilio webhook:

    • Go to Twilio Console → Phone Numbers → Active Numbers
    • Set "A Call Comes In" to https://abc123.ngrok.io/webhook/twilio
    • Save
  3. Call your Twilio number. Expected behavior:

    • Hear "Connecting you to the assistant" (only if you add a TwiML <Say> before the stream; the webhook in step 1 does not include one)
    • Agent responds within 400-600ms
    • Interrupt mid-sentence → agent stops immediately

4. Monitor latency

Check server logs for timing:

[Session Start] MZ123 → CA456
[Audio Buffer] 20 chunks buffered (400ms)
[VAPI Response] Received in 287ms
[Barge-in] Flushed buffer for MZ123

If you see [Audio Buffer] 50 chunks buffered, your network has jitter—increase maxBufferSize to 100.
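
The bridge code does not emit the [VAPI Response] timing line on its own. One way to produce it is a small timer around the forward and response paths, sketched below. `ResponseTimer` is a hypothetical helper, and it measures from the first forwarded batch to the first response, which is only a rough proxy for conversational latency:

```javascript
// Hypothetical helper: measures time from forwarding a batch to VAPI until
// the first response arrives. `now` is injectable so the logic can be tested.
class ResponseTimer {
  constructor(now = () => Date.now()) {
    this.now = now;
    this.sentAt = null;
  }

  // Call right after session.vapiWs.send(combined).
  markForwarded() {
    // Keep the earliest unanswered forward so later batches do not reset it.
    if (this.sentAt === null) this.sentAt = this.now();
  }

  // Call in the VAPI message handler. Returns elapsed ms for the first
  // response after a forward, otherwise null.
  markResponse() {
    if (this.sentAt === null) return null;
    const elapsed = this.now() - this.sentAt;
    this.sentAt = null;
    return elapsed;
  }
}

// Usage
const timer = new ResponseTimer();
timer.markForwarded();
const ms = timer.markResponse();
if (ms !== null) console.log(`[VAPI Response] Received in ${ms}ms`);
```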

Production example

Scenario: User calls to reschedule an appointment. Agent is mid-sentence when user interrupts.

Event timeline:

14:23:01.234 [Session Start] MZ8f7g2 → CA1a2b3c
14:23:01.456 [Twilio] media event #1 (20ms chunk)
14:23:01.476 [Twilio] media event #2
...
14:23:01.856 [Audio Buffer] 20 chunks buffered (400ms)
14:23:01.890 [VAPI] Forwarded 320 bytes PCM audio
14:23:02.177 [VAPI] STT final: "I need to reschedule my appointment"
14:23:02.345 [VAPI] LLM response: "Of course! What day works better for you?"
14:23:02.567 [VAPI] TTS chunk 1/47 streaming
14:23:02.789 [Twilio] Playing: "Of course! What day—"
14:23:03.012 [VAPI] STT partial: "Wait"
14:23:03.234 [Barge-in] Flushed buffer for MZ8f7g2 (42 chunks dropped)
14:23:03.456 [VAPI] STT final: "Wait, make it Wednesday instead"
14:23:03.678 [VAPI] LLM processing correction
14:23:03.901 [VAPI] TTS chunk 1/23 streaming (new response)
14:23:04.123 [Twilio] Playing: "Got it, I've moved your appointment to Wednesday"

What happened:

  1. User spoke at 14:23:01.234. VAPI's VAD detected speech at 14:23:01.890 (656ms delay due to 400ms buffer + 256ms VAD processing).
  2. Agent started responding at 14:23:02.567 (1.333s total latency from user speech start).
  3. User interrupted at 14:23:03.012 (445ms into agent's response).
  4. Barge-in detection fired at 14:23:03.234 (222ms after interruption started—this is the isProcessing lock delay).
  5. Buffer flush dropped 42 TTS chunks (840ms of queued audio).
  6. New response started at 14:23:04.123 (1.111s from interruption—acceptable for conversational AI).
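
The intervals in the list above can be re-derived from the log timestamps. A short sketch of that arithmetic follows; `toMs` and `delta` are hypothetical helper names:

```javascript
// Converts "HH:MM:SS.mmm" to milliseconds since midnight.
function toMs(timestamp) {
  const [h, m, s] = timestamp.split(':');
  return Math.round((Number(h) * 3600 + Number(m) * 60 + Number(s)) * 1000);
}

// Milliseconds elapsed between two timestamps on the same day.
const delta = (from, to) => toMs(to) - toMs(from);

console.log(delta('14:23:01.234', '14:23:01.890')); // 656  (speech start to VAD)
console.log(delta('14:23:01.234', '14:23:02.567')); // 1333 (speech start to first TTS chunk)
console.log(delta('14:23:02.567', '14:23:03.012')); // 445  (into the agent's response)
console.log(delta('14:23:03.012', '14:23:03.234')); // 222  (barge-in detection)
console.log(delta('14:23:03.012', '14:23:04.123')); // 1111 (interruption to new response)
console.log(42 * 20);                               // 840  (ms of TTS audio dropped)
```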

Edge case handled: Without the isProcessing guard, the interruption at 14:23:03.012 would have triggered 3 concurrent LLM calls (one for each partial transcript: "Wait", "Wait, make", "Wait, make it Wednesday"). The guard ensures only the final transcript fires an LLM call.

Production metrics from this call:

  • End-to-end latency: 1.333s (first response)
  • Barge-in detection: 222ms
  • Buffer flush: 42 chunks (840ms of audio dropped)
  • Session cleanup: Triggered at 14:48:01.234 (25 minutes after session start, via the SESSION_TTL timer)
