The problem
Real-time audio streaming in VAPI breaks when you treat it like batch processing. The symptom: users hear the first syllable cut off, or the agent keeps talking for 2 seconds after being interrupted. The root cause is a race condition between Twilio's Media Stream (which sends audio in 20ms chunks) and VAPI's Voice Activity Detection (which needs 300-500ms to detect speech start). Without buffering, you drop the first syllable. Without barge-in detection, you get overlapping audio—the agent finishes the old sentence while processing the new one. End-to-end latency jumps from 200-400ms to 2-3 seconds, making conversations feel robotic.
Prerequisites
- VAPI API key from your dashboard
- Twilio account with active phone number, Account SID, and Auth Token
- Node.js 16+ with npm or yarn
- TLS 1.2+ for WebSocket connections
- Publicly accessible server (use ngrok for local testing: ngrok http 3000)
- Stable network: 4G/5G or hardwired connection to avoid latency jitter
Install dependencies:
npm install express ws twilio dotenv
Store credentials in .env:
VAPI_API_KEY=your_key_here
TWILIO_ACCOUNT_SID=your_sid
TWILIO_AUTH_TOKEN=your_token
TWILIO_PHONE_NUMBER=+1234567890
The wire format
VAPI and Twilio use incompatible protocols. VAPI's Web SDK streams audio via WebSocket. Twilio's Voice API streams via Media Streams (a different WebSocket protocol). Your server is the bridge.
Call flow:
- User dials Twilio number → Twilio hits your /webhook/twilio endpoint
- Your server returns TwiML with <Stream> tag pointing to your WebSocket server
- Twilio opens WebSocket connection, sends start event with streamSid
- Twilio streams audio as media events (base64-encoded mulaw, 20ms chunks)
- Your server buffers 400ms (20 chunks), decodes mulaw → PCM, forwards to VAPI WebSocket
- VAPI processes audio (STT → LLM → TTS), returns audio chunks
- Your server forwards VAPI audio back to Twilio WebSocket
- Twilio plays audio to user over PSTN
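The mulaw → PCM decode step in the call flow can be sketched as a standard G.711 mu-law decoder. This is a minimal sketch, assuming the downstream side wants 16-bit little-endian PCM at the same 8 kHz rate; the function names mulawByteToPcm16 and decodeMulaw are illustrative and not part of any SDK.

```javascript
// Decode one G.711 mu-law byte into a signed 16-bit linear PCM sample.
function mulawByteToPcm16(byte) {
  const u = ~byte & 0xff;            // mu-law bytes are stored bit-inverted
  const sign = u & 0x80;             // top bit carries the sign
  const exponent = (u >> 4) & 0x07;  // 3-bit segment number
  const mantissa = u & 0x0f;         // 4-bit step within the segment
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

// Decode a Buffer of mu-law bytes into a Buffer of 16-bit little-endian PCM.
function decodeMulaw(mulawBuf) {
  const pcm = Buffer.alloc(mulawBuf.length * 2);
  for (let i = 0; i < mulawBuf.length; i++) {
    pcm.writeInt16LE(mulawByteToPcm16(mulawBuf[i]), i * 2);
  }
  return pcm;
}

// Usage with a Twilio media payload (base64-encoded mu-law):
// const pcm = decodeMulaw(Buffer.from(payload.media.payload, 'base64'));
```

Decoding doubles the byte count, so a 160-byte mulaw frame becomes 320 bytes of PCM.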
sequenceDiagram
participant User
participant Twilio
participant Server
participant VAPI
User->>Twilio: Dial phone number
Twilio->>Server: POST /webhook/twilio
Server->>Twilio: TwiML with <Stream> tag
Twilio->>Server: WebSocket connect (streamSid)
loop Every 20ms
Twilio->>Server: media event (mulaw chunk)
Server->>Server: Buffer 400ms (20 chunks)
Server->>VAPI: Forward PCM audio
VAPI->>Server: TTS audio response
Server->>Twilio: media event (audio chunk)
end
Twilio->>User: Play audio over PSTN
Critical detail: Twilio's media events arrive at 50Hz (every 20ms). VAPI's VAD fires asynchronously after 300-500ms. If you forward chunks immediately, the first syllable gets dropped because VAD hasn't activated yet. Buffer 400ms minimum before forwarding.
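The buffer sizes follow from simple frame arithmetic. A small sketch, assuming Twilio's 8 kHz, 8-bit mulaw format; the helper names are illustrative:

```javascript
const SAMPLE_RATE_HZ = 8000; // Twilio Media Streams audio: 8 kHz mu-law
const FRAME_MS = 20;         // one media event every 20ms
const BYTES_PER_SAMPLE = 1;  // mu-law uses 8 bits per sample

// Number of media events needed to cover a given duration.
function framesForMs(ms) {
  return Math.ceil(ms / FRAME_MS);
}

// Raw (decoded from base64) size of one media event payload.
function bytesPerFrame() {
  return (SAMPLE_RATE_HZ * FRAME_MS / 1000) * BYTES_PER_SAMPLE;
}

// Total mu-law bytes held by a buffer of the given duration.
function bufferBytes(ms) {
  return framesForMs(ms) * bytesPerFrame();
}

console.log(framesForMs(400)); // 20
console.log(bytesPerFrame());  // 160
console.log(bufferBytes(400)); // 3200
```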
The implementation
1. Twilio webhook endpoint
This endpoint receives the incoming call and returns TwiML that starts the Media Stream:
const express = require('express');
const twilio = require('twilio');
const app = express();
app.use(express.json());
app.use(express.urlencoded({ extended: true }));
app.post('/webhook/twilio', (req, res) => {
  const twiml = new twilio.twiml.VoiceResponse();
  // Greeting the caller hears before the stream connects
  twiml.say('Connecting you to the assistant');
  // <Connect><Stream> opens a bidirectional stream, so the server can send
  // media and clear events back to Twilio. <Start><Stream> is receive-only.
  const connect = twiml.connect();
  connect.stream({
    url: `wss://${req.headers.host}/media`,
    track: 'inbound_track' // Only capture user audio, not agent echo
  });
  // <Connect> holds the call open until the stream ends, so no <Pause> is needed
  res.type('text/xml');
  res.send(twiml.toString());
});
What beginners miss: Use <Connect><Stream>, not <Start><Stream>. Only a bidirectional stream lets your server send audio back to the caller, and it accepts track: 'inbound_track' only. On a unidirectional stream, both_tracks also captures the agent's audio, which causes feedback loops.
2. WebSocket server setup
Handle Twilio's WebSocket connection and manage session state:
const WebSocket = require('ws');
const wss = new WebSocket.Server({ noServer: true });
const sessions = new Map();
const SESSION_TTL = 25 * 60 * 1000; // 25 minutes (before VAPI 30min timeout)
function cleanupSession(streamSid) {
  const session = sessions.get(streamSid);
  if (session) {
    if (session.vapiWs && session.vapiWs.readyState === WebSocket.OPEN) {
      session.vapiWs.close();
    }
    clearTimeout(session.ttlTimer);
    sessions.delete(streamSid);
    console.log(`[Cleanup] Session ${streamSid} removed`);
  }
}
const server = app.listen(3000, () => {
  console.log('[Server] Listening on port 3000');
});
server.on('upgrade', (req, socket, head) => {
  wss.handleUpgrade(req, socket, head, (ws) => {
    wss.emit('connection', ws, req);
  });
});
Production failure: Twilio disconnects Media Streams after 4 hours. VAPI sessions timeout after 30 minutes of silence. Set SESSION_TTL to 25 minutes and implement cleanup on both timeout and explicit stop events.
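Idle timeouts can also be caught earlier with a ping/pong heartbeat, which the heartbeatInterval setting in the config later in this article hints at but no code implements. A minimal sketch, assuming the socket exposes ping(), terminate(), and 'pong' events as the ws library's WebSocket does; startHeartbeat is a hypothetical helper, and whether the remote end answers pings is an assumption to verify.

```javascript
// Hypothetical helper: pings the socket on an interval and terminates it
// if no pong arrived since the previous ping.
function startHeartbeat(socket, intervalMs = 10000) {
  let isAlive = true;
  socket.on('pong', () => { isAlive = true; });

  function tick() {
    if (!isAlive) {
      // No pong since the last ping: treat the peer as dead
      stop();
      socket.terminate();
      return;
    }
    isAlive = false;
    socket.ping();
  }

  const timer = setInterval(tick, intervalMs);
  function stop() {
    clearInterval(timer);
  }

  // tick is exposed so the logic can be exercised without waiting on timers
  return { tick, stop };
}
```

Calling stop() from cleanupSession would keep the interval from outliving the session.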
3. Audio bridge with race condition guard
Forward audio between Twilio and VAPI with buffering and concurrency control:
wss.on('connection', (ws) => {
  let streamSid = null;
  let isProcessing = false; // Prevents concurrent chunk handling
  let audioBuffer = [];
  ws.on('message', (msg) => {
    const payload = JSON.parse(msg);
    if (payload.event === 'start') {
      streamSid = payload.start.streamSid;
      const callSid = payload.start.callSid;
      // Connect to VAPI WebSocket
      const vapiWs = new WebSocket('wss://api.vapi.ai/ws', {
        headers: { 'Authorization': `Bearer ${process.env.VAPI_API_KEY}` }
      });
      const session = {
        vapiWs,
        twilioWs: ws,
        callSid,
        ttlTimer: setTimeout(() => cleanupSession(streamSid), SESSION_TTL)
      };
      sessions.set(streamSid, session);
      // Forward VAPI audio back to Twilio
      vapiWs.on('message', (data) => {
        if (ws.readyState === WebSocket.OPEN) {
          ws.send(JSON.stringify({
            event: 'media',
            streamSid,
            media: { payload: data.toString('base64') }
          }));
        }
      });
      vapiWs.on('error', (err) => {
        console.error(`[VAPI Error] ${streamSid}:`, err.message);
        cleanupSession(streamSid);
      });
      console.log(`[Session Start] ${streamSid} → ${callSid}`);
    }
    if (payload.event === 'media' && streamSid) {
      const session = sessions.get(streamSid);
      if (!session || session.vapiWs.readyState !== WebSocket.OPEN) return;
      // Race condition guard: buffer audio if VAPI is processing
      if (isProcessing) {
        audioBuffer.push(payload.media.payload);
        if (audioBuffer.length > 50) audioBuffer.shift(); // Prevent memory leak
        return;
      }
      isProcessing = true;
      // Buffer 400ms (20 chunks) before forwarding to VAPI
      audioBuffer.push(payload.media.payload);
      if (audioBuffer.length >= 20) {
        // Note: this forwards the raw mulaw bytes. If the receiving side
        // expects PCM, decode the combined buffer before sending.
        const combined = Buffer.concat(
          audioBuffer.map(b64 => Buffer.from(b64, 'base64'))
        );
        session.vapiWs.send(combined);
        audioBuffer = [];
      }
      // Release lock after 20ms
      setTimeout(() => { isProcessing = false; }, 20);
    }
    if (payload.event === 'stop' && streamSid) {
      cleanupSession(streamSid);
    }
  });
  ws.on('close', () => {
    if (streamSid) cleanupSession(streamSid);
  });
});
This will bite you: The isProcessing flag keeps Twilio from flooding VAPI during silence detection delays. Without batching, every call sends 50 messages/second, so two concurrent calls already reach VAPI's rate limit (100 requests/second). Sending one message per 20 chunks cuts that to 2.5 messages/second per call, a 95% reduction.
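The batching itself can be isolated from the socket handling, which makes it easier to test. A minimal sketch of that idea, not the code above: AudioBatcher is a hypothetical class that collects base64 payloads and hands back one combined Buffer per bufferSize chunks, without the timer-based lock.

```javascript
// Hypothetical helper that groups Twilio media payloads into fixed-size batches.
class AudioBatcher {
  constructor(bufferSize = 20) {
    this.bufferSize = bufferSize; // chunks per batch (20 chunks = 400ms)
    this.chunks = [];
  }

  // Add one base64 payload. Returns a combined Buffer once a full batch
  // has accumulated, otherwise null.
  push(base64Payload) {
    this.chunks.push(Buffer.from(base64Payload, 'base64'));
    if (this.chunks.length < this.bufferSize) return null;
    const combined = Buffer.concat(this.chunks);
    this.chunks = [];
    return combined;
  }

  // Drop anything queued, for example when the caller barges in.
  clear() {
    this.chunks = [];
  }
}

// Usage inside the media handler:
// const batch = batcher.push(payload.media.payload);
// if (batch) session.vapiWs.send(batch);
```

At 50 media events per second, a batch size of 20 yields 2.5 sends per second.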
4. Barge-in detection
Handle user interruptions by flushing the TTS buffer:
// Inside vapiWs.on('message') handler
vapiWs.on('message', (data) => {
  const msg = JSON.parse(data);
  // Partial transcript indicates user is speaking (barge-in)
  if (msg.event === 'transcript' && msg.isFinal === false) {
    if (isProcessing) {
      // Flush TTS buffer immediately
      audioBuffer = [];
      ws.send(JSON.stringify({
        event: 'clear',
        streamSid
      }));
      isProcessing = false;
      console.log(`[Barge-in] Flushed buffer for ${streamSid}`);
    }
  }
  // Forward final audio to Twilio
  if (msg.event === 'audio') {
    ws.send(JSON.stringify({
      event: 'media',
      streamSid,
      media: { payload: msg.audio }
    }));
  }
});
Real-world problem: Without barge-in detection, the agent finishes the old sentence while processing the new input, creating overlapping audio. Users hear: "Your appointment is Tuesday at 3 PM and I'll send—" + "Sure, I've changed it to Wednesday" simultaneously.
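The barge-in decision can also be written as a pure function that maps one incoming message to the Twilio messages to send. This is a sketch under assumptions: the VAPI message fields (event, isFinal, audio) follow this article's examples rather than a verified schema, and toTwilioMessages and agentIsSpeaking are hypothetical names. The clear and media message shapes are Twilio's.

```javascript
// Translate one VAPI message into zero or more Twilio stream messages.
function toTwilioMessages(raw, streamSid, agentIsSpeaking) {
  let msg;
  try {
    msg = JSON.parse(raw);
  } catch {
    return []; // ignore frames that are not JSON
  }
  if (!msg || typeof msg !== 'object') return [];

  // A partial transcript while the agent is talking means the caller barged in:
  // ask Twilio to drop any audio it has queued for playback.
  if (msg.event === 'transcript' && msg.isFinal === false && agentIsSpeaking) {
    return [{ event: 'clear', streamSid }];
  }

  // Agent audio (assumed to be base64 already) goes back as a media message.
  if (msg.event === 'audio' && typeof msg.audio === 'string') {
    return [{ event: 'media', streamSid, media: { payload: msg.audio } }];
  }
  return [];
}

// Usage:
// for (const out of toTwilioMessages(data, streamSid, speaking)) {
//   ws.send(JSON.stringify(out));
// }
```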
Minimal viable config
Complete server configuration with all required parameters:
require('dotenv').config();
const config = {
  server: {
    port: process.env.PORT || 3000,
    host: process.env.HOST || '0.0.0.0'
  },
  twilio: {
    accountSid: process.env.TWILIO_ACCOUNT_SID,
    authToken: process.env.TWILIO_AUTH_TOKEN,
    phoneNumber: process.env.TWILIO_PHONE_NUMBER,
    // Validate webhook signatures in production
    validateSignatures: process.env.NODE_ENV === 'production'
  },
  vapi: {
    apiKey: process.env.VAPI_API_KEY,
    wsEndpoint: 'wss://api.vapi.ai/ws',
    // Assistant config returned in webhook
    assistant: {
      model: {
        provider: "openai",
        model: "gpt-4", // gpt-3.5-turbo for lower latency
        temperature: 0.7 // 0.3-0.5 for deterministic responses
      },
      voice: {
        provider: "11labs",
        voiceId: "21m00Tcm4TlvDq8ikWAM" // Rachel voice
      },
      transcriber: {
        provider: "deepgram",
        model: "nova-2", // nova-2 = 200ms latency, base = 400ms
        language: "en"
      }
    }
  },
  streaming: {
    bufferSize: 20, // chunks (400ms at 20ms/chunk)
    maxBufferSize: 50, // prevent memory leak during jitter
    sessionTTL: 25 * 60 * 1000, // 25 minutes
    heartbeatInterval: 10000 // ping every 10s to prevent timeout
  }
};
module.exports = config;
Tradeoffs:
- bufferSize: 20 (400ms) balances latency vs. dropped syllables. Increase to 30 (600ms) for mobile networks.
- model: "gpt-4" gives better responses but adds 200-400ms latency. Use gpt-3.5-turbo for <200ms.
- transcriber: "nova-2" is Deepgram's fastest model (200ms). Use base for higher accuracy at 400ms latency.
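The validateSignatures flag in the config implies a check of the X-Twilio-Signature header, which the webhook code does not show. The sketch below illustrates Twilio's documented scheme for form-encoded POST webhooks using only Node's crypto module: HMAC-SHA1 over the full URL followed by the sorted parameter names and values, keyed with the Auth Token and base64-encoded. The function names are illustrative. In practice the twilio package's validateRequest helper covers the same check.

```javascript
const crypto = require('crypto');

// Build the signature Twilio would send for this URL and POST body.
function computeTwilioSignature(authToken, url, params = {}) {
  const data = Object.keys(params)
    .sort()
    .reduce((acc, key) => acc + key + params[key], url);
  return crypto
    .createHmac('sha1', authToken)
    .update(Buffer.from(data, 'utf-8'))
    .digest('base64');
}

// Compare the header value against the expected signature in constant time.
function isValidTwilioRequest(authToken, signature, url, params) {
  const expected = Buffer.from(computeTwilioSignature(authToken, url, params));
  const actual = Buffer.from(signature || '');
  return expected.length === actual.length &&
    crypto.timingSafeEqual(expected, actual);
}

// Usage inside the webhook, assuming the public URL Twilio called is known:
// const ok = isValidTwilioRequest(
//   process.env.TWILIO_AUTH_TOKEN,
//   req.headers['x-twilio-signature'],
//   publicUrl,
//   req.body
// );
// if (!ok) return res.status(403).send('Invalid signature');
```

Behind ngrok or a proxy, the URL used for the check must be the public one Twilio requested, not the local address.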
Verify it works
1. Health check
curl http://localhost:3000/health
Expected response:
{
"status": "ok",
"sessions": 0,
"uptime": 42.3
}
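The server code in this article does not define a /health route, so the response above needs one. A minimal sketch, assuming sessions is the Map of active streams from the WebSocket server section; healthPayload is a hypothetical helper name.

```javascript
// Build the health response body from the live session map.
function healthPayload(sessions) {
  return {
    status: 'ok',
    sessions: sessions.size,                        // active Twilio streams
    uptime: Number(process.uptime().toFixed(1))     // seconds since process start
  };
}

// Wiring it into the Express app from the tutorial:
// app.get('/health', (req, res) => res.json(healthPayload(sessions)));
```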
2. Test WebSocket connection
# Install wscat for WebSocket testing
npm install -g wscat
# Connect to your WebSocket server
wscat -c ws://localhost:3000/media
Send a test start event:
{"event":"start","start":{"streamSid":"test-123","callSid":"CA-test"}}
Expected log output:
[Session Start] test-123 → CA-test
3. End-to-end call test
- Expose localhost with ngrok:
ngrok http 3000
# Copy the HTTPS URL (e.g., https://abc123.ngrok.io)
- Configure Twilio webhook:
  - Go to Twilio Console → Phone Numbers → Active Numbers
  - Set "A Call Comes In" to https://abc123.ngrok.io/webhook/twilio
  - Save
- Call your Twilio number. Expected behavior:
  - Hear "Connecting you to the assistant" (TwiML <Say>)
  - Agent responds within 400-600ms
  - Interrupt mid-sentence → agent stops immediately
4. Monitor latency
Check server logs for timing:
[Session Start] MZ123 → CA456
[Audio Buffer] 20 chunks buffered (400ms)
[VAPI Response] Received in 287ms
[Barge-in] Flushed buffer for MZ123
If you see [Audio Buffer] 50 chunks buffered, your network has jitter—increase maxBufferSize to 100.
Production example
Scenario: User calls to reschedule an appointment. Agent is mid-sentence when user interrupts.
Event timeline:
14:23:01.234 [Session Start] MZ8f7g2 → CA1a2b3c
14:23:01.456 [Twilio] media event #1 (20ms chunk)
14:23:01.476 [Twilio] media event #2
...
14:23:01.856 [Audio Buffer] 20 chunks buffered (400ms)
14:23:01.890 [VAPI] Forwarded 6400 bytes PCM audio
14:23:02.177 [VAPI] STT final: "I need to reschedule my appointment"
14:23:02.345 [VAPI] LLM response: "Of course! What day works better for you?"
14:23:02.567 [VAPI] TTS chunk 1/47 streaming
14:23:02.789 [Twilio] Playing: "Of course! What day—"
14:23:03.012 [VAPI] STT partial: "Wait"
14:23:03.234 [Barge-in] Flushed buffer for MZ8f7g2 (42 chunks dropped)
14:23:03.456 [VAPI] STT final: "Wait, make it Wednesday instead"
14:23:03.678 [VAPI] LLM processing correction
14:23:03.901 [VAPI] TTS chunk 1/23 streaming (new response)
14:23:04.123 [Twilio] Playing: "Got it, I've moved your appointment to Wednesday"
What happened:
- User spoke at 14:23:01.234. VAPI's VAD detected speech at 14:23:01.890 (656ms delay due to 400ms buffer + 256ms VAD processing).
- Agent started responding at 14:23:02.567 (1.333s total latency from user speech start).
- User interrupted at 14:23:03.012 (445ms into agent's response).
- Barge-in detection fired at 14:23:03.234 (222ms after the interruption started; this is the isProcessing lock delay).
- Buffer flush dropped 42 TTS chunks (840ms of queued audio).
- New response started at 14:23:04.123 (1.111s from interruption, acceptable for conversational AI).
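The intervals quoted above come straight from the log timestamps. A short sketch that recomputes them; toMs and deltaMs are illustrative helper names:

```javascript
// Convert an HH:MM:SS.mmm timestamp into milliseconds since midnight.
function toMs(ts) {
  const [h, m, rest] = ts.split(':');
  const [s, frac = '0'] = rest.split('.');
  const millis = Number(frac.padEnd(3, '0').slice(0, 3));
  return ((Number(h) * 60 + Number(m)) * 60 + Number(s)) * 1000 + millis;
}

// Elapsed milliseconds between two timestamps on the same day.
function deltaMs(from, to) {
  return toMs(to) - toMs(from);
}

console.log(deltaMs('14:23:01.234', '14:23:01.890')); // 656  (speech start → VAD)
console.log(deltaMs('14:23:01.234', '14:23:02.567')); // 1333 (speech start → first TTS)
console.log(deltaMs('14:23:02.567', '14:23:03.012')); // 445  (TTS start → interruption)
console.log(deltaMs('14:23:03.012', '14:23:03.234')); // 222  (interruption → barge-in)
console.log(deltaMs('14:23:03.012', '14:23:04.123')); // 1111 (interruption → new response)
console.log(42 * 20);                                 // 840  (ms of dropped TTS audio)
```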
Edge case handled: Without the isProcessing guard, the interruption at 14:23:03.012 would have triggered 3 concurrent LLM calls (one for each partial transcript: "Wait", "Wait, make", "Wait, make it Wednesday"). The guard ensures only the final transcript fires an LLM call.
Production metrics from this call:
- End-to-end latency: 1.333s (first response)
- Barge-in detection: 222ms
- Buffer flush: 42 chunks (840ms of audio dropped)
- Session cleanup: Triggered at 14:48:01.234 (25 minutes after session start, via TTL)
Related reading
- VAPI Real-Time Streaming Docs: Official WebSocket API reference, event schemas, and assistant configuration options. Essential for understanding partial transcript handling.
- Twilio Media Streams Guide: Explains the <Stream> TwiML verb, audio format specifications (mulaw vs. PCM), and track selection (inbound_track vs. both_tracks).
- WebSocket Protocol RFC 6455: Low-level spec for WebSocket framing, ping/pong heartbeats, and close handshakes. Read sections 5.5-5.6 for connection lifecycle management.
- VAPI GitHub Examples: Production implementations of Twilio integration, including session management patterns and error recovery strategies.
- Deepgram Nova-2 Model Docs: Benchmarks showing 200ms latency for real-time transcription. Compare with base model (400ms) to understand the accuracy/speed tradeoff.
Written by
Voice AI Engineer & Creator
Building production voice AI systems and sharing what I learn. Focused on VAPI, LLM integrations, and real-time communication. Documenting the challenges most tutorials skip.