Most voice AI integrations fail when STT/TTS latency exceeds 200ms—users perceive lag as unresponsiveness. Build with streaming transcription (partial results), concurrent TTS synthesis, and barge-in detection to keep round-trip under 150ms. Use connection pooling, warm WebSocket handshakes, and regional endpoints. This stack (VAPI + Twilio) handles multi-turn dialogue without dropout or audio overlap.
Mental model
Voice AI integration is a three-stage pipeline where audio flows through transport (Twilio), speech processing (VAPI's STT/TTS), and conversational logic (LLM). Each stage adds latency: STT takes 200-400ms, LLM inference 800-1500ms, TTS synthesis 300-600ms, plus network overhead of 100-200ms per hop. The total 1.4-2.7 seconds exceeds human tolerance for conversational turn-taking. Streaming architectures solve this by processing partial transcripts before the user finishes speaking and synthesizing audio chunks before the full response completes. Barge-in detection cancels active TTS streams when the user interrupts, preventing overlapping speech. Session state management prevents race conditions when multiple events arrive faster than your processing pipeline can handle them.
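To make the budget concrete, the stage ranges quoted above can be summed directly. This is a minimal sketch: `stages` and `sumBudget` are illustrative names, and the figures are the ones from the paragraph above with a single network hop counted.

```javascript
// Stage latency ranges in milliseconds, taken from the paragraph above.
// `stages` and `sumBudget` are illustrative names, not part of any SDK.
const stages = {
  stt: [200, 400],
  llm: [800, 1500],
  tts: [300, 600],
  network: [100, 200], // one hop; add entries for additional hops
};

// Sum the best-case and worst-case latency of a fully sequential pipeline.
function sumBudget(ranges) {
  return Object.values(ranges).reduce(
    ([lo, hi], [min, max]) => [lo + min, hi + max],
    [0, 0]
  );
}

const [bestCase, worstCase] = sumBudget(stages);
console.log(`sequential: ${bestCase}-${worstCase}ms`); // sequential: 1400-2700ms
```

Streaming does not shrink any single stage. It overlaps them, so the user-perceived delay is closer to the slowest stage's time-to-first-chunk than to this sum.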
What you need first
API credentials
- VAPI API key from dashboard
- Twilio Account SID + Auth Token
- VAPI webhook secret for signature validation
Runtime environment
- Node.js 18+ with npm
- Express 4.18+, axios, dotenv packages
- Twilio SDK 3.80+
Infrastructure
- Public HTTPS endpoint (ngrok for local testing)
- Webhook response time under 5 seconds
- Firewall allowing inbound port 443
- Valid SSL certificate for production
Audio knowledge
- PCM 16kHz mono format
- VAD thresholds (0.3-0.6 range)
- Silence detection windows (100-400ms)
- Audio buffer management patterns
The wire format
Audio flows from user microphone through Twilio's PSTN gateway to your webhook server. VAPI receives the audio stream, runs STT to generate partial transcripts (fired every 100-300ms), sends final transcripts to your LLM, receives the response, synthesizes speech via TTS, and streams audio back through Twilio to the user's speaker.
Event sequence:
- User speaks → Twilio captures audio → sends to VAPI
- VAPI STT fires transcript.partial events (incomplete speech)
- Your webhook receives partials, queues them in session state
- VAPI fires transcript.final when user stops speaking
- Your server sends final transcript to LLM
- LLM response triggers VAPI TTS synthesis
- VAPI fires speech-update.started with streamId
- Audio chunks stream to Twilio → user hears response
- If user interrupts: transcript.partial arrives → cancel active TTS stream
Critical timing: VAD detection adds 120ms, STT partial processing 150ms, buffer flush 12ms = 282ms total interrupt latency. Acceptable threshold for conversational AI is under 300ms.
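The event sequence can be read as a small per-call state machine. The sketch below is one way to model it, not VAPI's implementation: the event names mirror the list above, while the state shape, the `step` function, and the action labels are assumptions made for illustration.

```javascript
// Hypothetical per-call state machine for the event sequence above.
// Event names mirror the list; the state shape and `step` are assumptions.
function step(state, event) {
  switch (event.type) {
    case 'transcript.partial':
      // A partial that arrives while the assistant is speaking is a barge-in.
      if (state.speakingStreamId) {
        return { ...state, speakingStreamId: null, partial: event.text, action: 'cancel-tts' };
      }
      return { ...state, partial: event.text, action: 'buffer' };
    case 'transcript.final':
      return { ...state, partial: '', action: 'send-to-llm' };
    case 'speech-update.started':
      return { ...state, speakingStreamId: event.streamId, action: 'track-stream' };
    default:
      return { ...state, action: 'ignore' };
  }
}

const events = [
  { type: 'speech-update.started', streamId: 'stream_abc' },
  { type: 'transcript.partial', text: 'I need' },
  { type: 'transcript.final', text: 'I need a table for four tonight at 7pm' },
];

let state = { speakingStreamId: null, partial: '', action: null };
const actions = events.map((e) => (state = step(state, e)).action);
console.log(actions); // [ 'track-stream', 'cancel-tts', 'send-to-llm' ]
```

Keeping the transition logic pure makes the ordering rules testable without a live call.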
graph LR
A[User Speech] --> B[Twilio PSTN]
B --> C[VAPI STT]
C -->|partial| D[Webhook Server]
C -->|final| D
D --> E[LLM Processing]
E --> F[VAPI TTS]
F --> G[Audio Stream]
G --> B
B --> H[User Hears Response]
C -.->|barge-in| I[Cancel TTS]
I --> F
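Webhook signature validation rejects forged or tampered requests. VAPI sends an x-vapi-signature header containing an HMAC-SHA256 hash of the payload. Your server must compute the same hash over the raw request body using your webhook secret and compare the two with a timing-safe equality check to avoid timing attacks. A signature alone does not stop replays of a captured request; that needs a timestamp or nonce check as well.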
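const crypto = require('crypto');

// rawBody: the exact bytes received (string or Buffer), not a re-serialized object
function validateVapiSignature(rawBody, signature, secret) {
  if (typeof signature !== 'string' || !secret || !rawBody) return false;
  const expected = crypto
    .createHmac('sha256', secret)
    .update(rawBody)
    .digest('hex');
  const a = Buffer.from(signature);
  const b = Buffer.from(expected);
  // timingSafeEqual throws on unequal lengths, so compare lengths first
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}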
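For local testing it helps to produce a signature the same way the sender would. A minimal sketch, assuming a hex-encoded HMAC-SHA256 over the raw body as described above; `signPayload`, `verifyPayload`, and the secret are hypothetical, and the exact scheme should be confirmed against VAPI's documentation.

```javascript
const crypto = require('crypto');

// Hypothetical helper: sign a raw body the way the text describes
// (hex HMAC-SHA256). Confirm the real scheme in the provider's docs.
function signPayload(rawBody, secret) {
  return crypto.createHmac('sha256', secret).update(rawBody).digest('hex');
}

// Verify without throwing: timingSafeEqual requires equal-length buffers.
function verifyPayload(rawBody, signature, secret) {
  if (typeof signature !== 'string') return false;
  const expected = Buffer.from(signPayload(rawBody, secret));
  const received = Buffer.from(signature);
  return expected.length === received.length &&
    crypto.timingSafeEqual(expected, received);
}

const secret = 'test_secret'; // placeholder, not a real credential
const body = JSON.stringify({ type: 'transcript', message: { transcript: 'hi' } });
const sig = signPayload(body, secret);

console.log(verifyPayload(body, sig, secret));       // true
console.log(verifyPayload(body + ' ', sig, secret)); // false (body changed)
console.log(verifyPayload(body, undefined, secret)); // false (header missing)
```

A missing or malformed header returns false here, so the webhook can answer 401 rather than crash with a 500.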
Copy-paste setup
This configuration handles webhook validation, session state management, and streaming audio control. Every key is required for production—missing any causes silent failures or security vulnerabilities.
const express = require('express');
const crypto = require('crypto');
require('dotenv').config();
const app = express();
// Keep the raw bytes so the HMAC is computed over exactly what was sent
app.use(express.json({
  verify: (req, res, buf) => { req.rawBody = buf; }
}));
// Session storage (TTL cleanup is covered under edge cases)
const sessions = new Map();
const activeStreams = new Map();
const SESSION_TTL = 300000; // 5 minutes - adjust based on avg call duration
// Security: validate webhook signatures (REQUIRED)
function validateVapiSignature(rawBody, signature, secret) {
  if (typeof signature !== 'string' || !secret || !rawBody) return false;
  const expected = crypto
    .createHmac('sha256', secret)
    .update(rawBody)
    .digest('hex');
  const a = Buffer.from(signature);
  const b = Buffer.from(expected);
  // timingSafeEqual throws on unequal lengths, so compare lengths first
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}
// Main webhook endpoint
app.post('/webhook/vapi', async (req, res) => {
  const signature = req.headers['x-vapi-signature'];
  // Reject invalid signatures immediately
  if (!validateVapiSignature(req.rawBody, signature, process.env.VAPI_SECRET)) {
    return res.status(401).json({ error: 'Invalid signature' });
  }
  const { type, message } = req.body;
  const sessionId = message?.call?.id;
  // Initialize session if new (skip events that carry no call id)
  if (sessionId && !sessions.has(sessionId)) {
    sessions.set(sessionId, {
      id: sessionId,
      buffer: '',
      isProcessing: false,
      createdAt: Date.now()
    });
  }
  // Always respond 200 within 5 seconds (Vapi timeout)
  res.status(200).json({ received: true });
});
const PORT = process.env.PORT || 3000;
app.listen(PORT);
Tradeoffs: SESSION_TTL at 5 minutes balances memory usage vs call duration. Shorter TTL risks dropping active calls; longer TTL leaks memory on abandoned sessions. timingSafeEqual prevents timing attacks but throws if the two buffers differ in length, so compare lengths before calling it (the function itself has shipped with Node.js since v6.6). Responding with 200 before processing prevents webhook timeouts but requires async handling for long operations.
A real call we ran
Restaurant booking agent receives a call at 14:32:01. Agent starts: "Thank you for calling. I can help you book a table for—" User interrupts at 14:32:02.891: "I need a table for four tonight at 7pm."
Event log:
14:32:01.234 [stream_abc] assistant.speech-started
    payload: { text: "Thank you for calling..." }
14:32:02.891 [stream_abc] transcript.partial
    payload: { text: "I need", isFinal: false }
    action: VAD threshold 0.5 triggered
14:32:02.903 [stream_abc] Buffer flush
    dropped: 1847ms of queued audio
    reason: barge-in detected
14:32:03.156 [stream_abc] transcript.final
    payload: { text: "I need a table for four tonight at 7pm" }
14:32:03.401 [stream_abc] assistant.speech-started
    payload: { text: "Perfect, I can book that for you..." }
What happened: VAD detected speech 120ms after the user started talking. STT produced a partial transcript 150ms later. Our webhook received the partial, looked up stream_abc in activeStreams, found active TTS, and flushed the buffer within 12ms. Total interrupt latency: 282ms from first phoneme to audio cancellation.
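The server-side deltas can be checked straight from the log timestamps. A small sketch, with `toMs` as an illustrative helper; the timestamps are copied from the event log above.

```javascript
// Convert "HH:MM:SS.mmm" log timestamps to milliseconds since midnight.
function toMs(stamp) {
  const [h, m, s] = stamp.split(':');
  return (Number(h) * 3600 + Number(m) * 60) * 1000 + Math.round(Number(s) * 1000);
}

// Timestamps copied from the event log above.
const log = {
  partial: '14:32:02.891',
  flush: '14:32:02.903',
  final: '14:32:03.156',
  reply: '14:32:03.401',
};

const flushDelay = toMs(log.flush) - toMs(log.partial); // partial -> buffer flush
const replyDelay = toMs(log.reply) - toMs(log.final);   // final -> next assistant speech
console.log({ flushDelay, replyDelay }); // { flushDelay: 12, replyDelay: 245 }
```

The 12ms flush is the only part of the 282ms measured on the server. The VAD and STT portions happen upstream, before the first webhook fires.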
The code that handled it:
// Signature validation omitted here for brevity
app.post('/webhook/vapi', async (req, res) => {
  const { type, message } = req.body;
  const sessionId = message?.call?.id;
  if (type === 'transcript' && message?.transcript) {
    // Check for barge-in triggers in partial transcripts.
    // Keyword matching only fires on these phrases; to interrupt on any user
    // speech (as in the call above), drop this check and test activeStreams alone.
    const bargeInTriggers = ['stop', 'wait', 'hold on', 'actually'];
    const shouldInterrupt = bargeInTriggers.some(trigger =>
      message.transcript.toLowerCase().includes(trigger)
    );
    if (shouldInterrupt && activeStreams.has(sessionId)) {
      // Drop the active stream record and any buffered text
      activeStreams.delete(sessionId);
      const session = sessions.get(sessionId);
      if (session) {
        session.buffer = '';
        session.isProcessing = false;
      }
      console.log(`[${sessionId}] Barged in at: ${message.transcript}`);
    }
  }
  res.status(200).send();
});
Why it worked: Partial transcript processing caught the interrupt before the user finished speaking. Buffer flush prevented 1.8 seconds of overlapping audio. Without this, the agent would have talked over the user until the full sentence completed.
Edge cases
Multiple rapid interrupts: User says "Actually—no wait—make that 8pm" in 600ms. Each partial fires a webhook. Without locking, three LLM calls trigger simultaneously, responses arrive out of order, agent says "8pm" then "wait" then "actually."
Fix: Guard with isProcessing flag.
if (session.isProcessing) {
  session.pendingTranscript = message.transcript; // Queue latest
  return res.status(200).send();
}
session.isProcessing = true;
False positive VAD on mobile networks: Dog bark at 85dB triggers VAD at default 0.3 threshold. Agent interrupts itself. Cellular jitter causes 3-5 false positives per minute on noisy calls.
Fix: Raise the VAD threshold to 0.5-0.6 for mobile users (a higher threshold means lower sensitivity). Tradeoff: adds 80-120ms to speech detection.
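The effect of the threshold is easier to see on synthetic data. The sketch below assumes a VAD that emits one speech probability (0-1) per 20ms frame and declares speech after a run of frames at or above the threshold. `detectSpeechOnset` is a hypothetical helper, not a VAPI setting.

```javascript
// Hypothetical onset detector: returns the index of the frame at which speech
// is declared, or -1. Assumes one speech probability (0-1) per 20ms frame.
function detectSpeechOnset(probabilities, threshold, minFrames) {
  let run = 0;
  for (let i = 0; i < probabilities.length; i++) {
    run = probabilities[i] >= threshold ? run + 1 : 0;
    if (run >= minFrames) return i;
  }
  return -1;
}

// A short noise burst (two frames around 0.4) followed by real speech.
const frames = [0.1, 0.4, 0.45, 0.1, 0.1, 0.7, 0.8, 0.9, 0.9];

console.log(detectSpeechOnset(frames, 0.3, 2)); // 2 (fires on the noise burst)
console.log(detectSpeechOnset(frames, 0.5, 2)); // 6 (ignores it, fires on speech)
```

The higher threshold skips the burst entirely. The cost shows up on quiet speakers, whose first frames may sit below 0.5 and delay detection.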
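Webhook timeout on slow LLM: GPT-4 takes 2.1 seconds for a complex prompt, and VAPI webhooks time out at 5 seconds. If your processing creeps up to 4.8s, any extra delay tips requests over the limit and the resulting retries pile onto an already slow handler.
Fix: Respond 202 immediately, process async.
res.status(202).json({ queued: true });
// No await, but attach a catch so a failure cannot become an unhandled rejection
processAsync(sessionId, transcript)
  .catch(err => console.error(`[${sessionId}] async processing failed:`, err));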
Memory leak from abandoned sessions: User hangs up without triggering end-of-call-report. Session stays in Map forever, and memory grows with every abandoned call until the server runs out.
Fix: TTL-based cleanup every 60 seconds.
setInterval(() => {
  const now = Date.now();
  for (const [id, session] of sessions.entries()) {
    if (now - session.createdAt > SESSION_TTL) {
      sessions.delete(id);
      activeStreams.delete(id);
    }
  }
}, 60000);
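Expiring on createdAt will also evict a call that simply runs longer than the TTL. One possible variant expires on inactivity instead. This is a sketch under the assumption that every incoming event refreshes a `lastSeen` timestamp; `touch` and `sweepIdle` are hypothetical names.

```javascript
// Hypothetical variant of the cleanup above: expire on inactivity, not age.
// Assumes each session carries a `lastSeen` timestamp updated on every event.
function touch(sessions, id, now) {
  const session = sessions.get(id) || { id, createdAt: now };
  session.lastSeen = now;
  sessions.set(id, session);
  return session;
}

// Remove sessions idle for longer than ttlMs; returns the removed ids.
function sweepIdle(sessions, now, ttlMs) {
  const removed = [];
  for (const [id, session] of sessions) {
    if (now - session.lastSeen > ttlMs) {
      sessions.delete(id);
      removed.push(id);
    }
  }
  return removed;
}

const demo = new Map();
touch(demo, 'call_long', 0);       // starts at t=0
touch(demo, 'call_abandoned', 0);  // starts at t=0, never heard from again
touch(demo, 'call_long', 290000);  // still active just before the 5 minute mark

const removedIds = sweepIdle(demo, 310000, 300000);
console.log(removedIds); // [ 'call_abandoned' ]
```

Passing `now` in as a parameter keeps the sweep deterministic, which makes it easy to unit test without fake timers.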
Race condition in TTS cancellation: speech-update.started arrives 50ms after transcript.partial. Your code tries to cancel a stream that doesn't exist yet in activeStreams. Agent plays 200ms of stale audio before cancellation takes effect.
Fix: Queue cancellation requests, retry for 100ms.
async function cancelWithRetry(sessionId, maxRetries = 5) {
  for (let i = 0; i < maxRetries; i++) {
    if (activeStreams.has(sessionId)) {
      activeStreams.delete(sessionId);
      return true;
    }
    await new Promise(resolve => setTimeout(resolve, 20));
  }
  return false;
}
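Polling is the simplest fix. An alternative that avoids timers is to remember the early cancellation and apply it the moment the stream registers. The sketch below uses its own `streams` map and a `pendingCancels` set; both names, and the returned status strings, are assumptions for illustration.

```javascript
// Hypothetical alternative to polling: remember a cancellation that arrived
// before its stream and apply it the moment the stream registers.
const streams = new Map();        // sessionId -> streamId
const pendingCancels = new Set(); // sessionIds waiting for a stream to cancel

function requestCancel(sessionId) {
  if (streams.delete(sessionId)) return 'cancelled';
  pendingCancels.add(sessionId);
  return 'queued';
}

// Call this from the speech-update "started" handler.
function registerStream(sessionId, streamId) {
  if (pendingCancels.delete(sessionId)) return 'cancelled-on-arrival';
  streams.set(sessionId, streamId);
  return 'active';
}

const results = [
  requestCancel('call_1'),                // stream not registered yet
  registerStream('call_1', 'stream_abc'), // late arrival is cancelled at once
  registerStream('call_1', 'stream_def'), // next stream proceeds normally
  requestCancel('call_1'),                // normal cancellation
];
console.log(results); // [ 'queued', 'cancelled-on-arrival', 'active', 'cancelled' ]
```

A queued cancellation should expire after a short window (the 100ms above is a reasonable starting point). Otherwise a stale entry could cancel an unrelated later response.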
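Signature validation fails behind proxies and body parsers: Anything that changes the request bytes (a proxy that rewrites the body, or re-serializing the parsed JSON with JSON.stringify) produces a different HMAC hash. Webhook returns 401, VAPI retries 3x, call fails silently.
Fix: Preserve the raw body and compute the HMAC over it.
app.use(express.json({
  verify: (req, res, buf) => {
    // buf is the unparsed request body; hash this, not JSON.stringify(req.body)
    req.rawBody = buf.toString('utf8');
  }
}));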
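The mismatch is easy to reproduce without a proxy. In this sketch the sender's JSON contains spaces that a parse-and-stringify round trip removes; the body, secret, and `hmac` helper are made up for the demonstration.

```javascript
const crypto = require('crypto');

const hmac = (data, secret) =>
  crypto.createHmac('sha256', secret).update(data).digest('hex');

// The sender signs these exact bytes (note the spaces after colons and commas).
const rawBody = '{"type": "transcript", "message": {"transcript": "stop"}}';
const secret = 'test_secret'; // placeholder, not a real credential

const sentSignature = hmac(rawBody, secret);
// What you get by re-stringifying the parsed body instead of keeping the bytes.
const reserialized = JSON.stringify(JSON.parse(rawBody));

console.log(reserialized === rawBody);                     // false
console.log(hmac(reserialized, secret) === sentSignature); // false
console.log(hmac(rawBody, secret) === sentSignature);      // true
```

Key order, whitespace, and Unicode escaping can all differ between serializers, so only the original bytes are safe to hash.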
The whole thing in one file
const express = require('express');
const crypto = require('crypto');
require('dotenv').config();
const app = express();
// Keep the raw bytes so the HMAC is computed over exactly what was sent
app.use(express.json({
  verify: (req, res, buf) => { req.rawBody = buf; }
}));
// State management
const sessions = new Map();
const activeStreams = new Map();
const SESSION_TTL = 300000; // 5 minutes
// Webhook signature validation (returns false instead of throwing on bad input)
function validateVapiSignature(rawBody, signature, secret) {
  if (typeof signature !== 'string' || !secret || !rawBody) return false;
  const expected = crypto
    .createHmac('sha256', secret)
    .update(rawBody)
    .digest('hex');
  const a = Buffer.from(signature);
  const b = Buffer.from(expected);
  // timingSafeEqual throws on unequal lengths, so compare lengths first
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}
// Barge-in handler: clears the stream record and buffered text.
// isProcessing is left alone so the guard below keeps working.
function handleBargeIn(sessionId) {
  if (activeStreams.has(sessionId)) {
    activeStreams.delete(sessionId);
    const session = sessions.get(sessionId);
    if (session) {
      session.buffer = '';
    }
  }
}
// Partial transcript processor with race condition guard
async function processPartialTranscript(session, transcript) {
  if (session.isProcessing) {
    session.pendingTranscript = transcript; // keep only the latest
    return;
  }
  session.isProcessing = true;
  session.buffer = transcript;
  // Check for barge-in triggers
  const triggers = ['stop', 'wait', 'hold on', 'actually'];
  if (triggers.some(t => transcript.toLowerCase().includes(t))) {
    handleBargeIn(session.id);
  }
  try {
    // Your LLM processing here
    // await callLLM(transcript);
  } finally {
    session.isProcessing = false;
    if (session.pendingTranscript) {
      const pending = session.pendingTranscript;
      session.pendingTranscript = null;
      await processPartialTranscript(session, pending);
    }
  }
}
// Main webhook
app.post('/webhook/vapi', (req, res) => {
  const signature = req.headers['x-vapi-signature'];
  if (!validateVapiSignature(req.rawBody, signature, process.env.VAPI_SECRET)) {
    return res.status(401).json({ error: 'Invalid signature' });
  }
  const { type, message } = req.body;
  const sessionId = message?.call?.id;
  if (!sessionId) {
    return res.status(400).json({ error: 'Missing call id' });
  }
  // Acknowledge first so slow processing never holds the webhook open
  res.status(200).json({ received: true });
  // Initialize session (expiry is handled by the cleanup interval below)
  if (!sessions.has(sessionId)) {
    sessions.set(sessionId, {
      id: sessionId,
      buffer: '',
      isProcessing: false,
      createdAt: Date.now()
    });
  }
  const session = sessions.get(sessionId);
  // Handle events
  switch (type) {
    case 'transcript':
      if (message.role === 'user' && message.transcript && !message.isFinal) {
        // Not awaited: the response has already been sent
        processPartialTranscript(session, message.transcript)
          .catch(err => console.error(`[${sessionId}] processing failed:`, err));
      }
      break;
    case 'speech-update':
      if (message.status === 'started') {
        activeStreams.set(sessionId, message.streamId);
      } else if (message.status === 'stopped') {
        activeStreams.delete(sessionId);
      }
      break;
    case 'end-of-call-report':
      sessions.delete(sessionId);
      activeStreams.delete(sessionId);
      break;
  }
});
// Health check
app.get('/health', (req, res) => {
  const now = Date.now();
  const activeSessions = Array.from(sessions.values())
    .filter(s => now - s.createdAt < SESSION_TTL).length;
  res.json({
    status: 'healthy',
    activeSessions,
    activeStreams: activeStreams.size,
    uptime: process.uptime()
  });
});
// Session cleanup
setInterval(() => {
  const now = Date.now();
  for (const [id, session] of sessions.entries()) {
    if (now - session.createdAt > SESSION_TTL) {
      sessions.delete(id);
      activeStreams.delete(id);
    }
  }
}, 60000);
const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
  console.log(`Voice AI webhook server running on port ${PORT}`);
});
Run it:
# Install dependencies
npm install express dotenv
# Set environment variables
export VAPI_SECRET="your_webhook_secret_from_dashboard"
export PORT=3000
# For local testing with ngrok
ngrok http 3000
# Copy the HTTPS URL to VAPI dashboard → Assistant Settings → Server URL
# Start server
node server.js
Test barge-in: Start a call, let the assistant speak for 2 seconds, then say "stop" or "wait". The audio should cut off within 300ms. Check /health endpoint to verify session cleanup after 5 minutes.
Production checklist:
- Enable HTTPS (the webhook endpoint must be publicly reachable over TLS)
- Set SESSION_TTL based on your average call duration
- Monitor activeStreams.size for memory leaks
- Handle retried webhook deliveries idempotently (VAPI retries 3x with exponential backoff)
- Use connection pooling for database queries if storing call transcripts
Common questions
Why does my barge-in detection have 300-500ms latency?
VAD algorithms buffer 100-200ms of audio to distinguish speech from noise. Add STT processing (100-300ms) and you're at 200-500ms before the system recognizes an interrupt. Reduce this by lowering the VAD threshold from 0.5 to 0.3 (catches speech faster but risks false positives from background noise), using low-latency STT models (Deepgram is faster than OpenAI Whisper), or implementing early partial transcript detection to interrupt mid-sentence rather than waiting for complete words.
What's the difference between streaming STT and batch transcription?
Streaming STT processes audio chunks in real-time, delivering partial transcripts within 100-300ms as the user speaks. Batch transcription waits for complete audio, adding 500ms-2s latency. For conversational AI, streaming is mandatory—users expect immediate feedback. Batch only works for post-call analysis. The tradeoff: streaming requires buffer management and partial transcript handling to prevent race conditions, but eliminates the "dead air" problem where users think the system froze.
How do I prevent the agent from talking over itself during rapid interrupts?
Use a session-level isProcessing flag to guard your LLM processing logic. When a partial transcript arrives, check if the previous one is still being processed. If yes, queue the new transcript in session.pendingTranscript and return immediately. After the first LLM call completes, check for queued transcripts and process them. Without this guard, multiple LLM calls trigger simultaneously, responses arrive out of order, and the agent plays overlapping audio.
Should I use VAPI's native TTS or Twilio's?
VAPI integrates ElevenLabs, Google Cloud TTS, and OpenAI TTS natively via voice.provider in your assistant config. Twilio uses its own TTS engine or integrates third-party providers via Media Streams. VAPI's approach is simpler for standard use cases and avoids webhook overhead. Twilio is better if you need fine-grained control over audio streaming or custom voice cloning. Latency is similar (200-400ms). Cost differs: ElevenLabs charges per character, Google per 1M characters, Twilio per minute.
What audio format gives the lowest latency?
PCM 16-bit, 16kHz mono is the standard. It carries twice the data of 8kHz audio but gives STT engines a wider frequency band to work with, and it needs no decode step the way Opus or mulaw do. Most STT engines (OpenAI Whisper, Google Cloud Speech) accept this natively. Twilio Media Streams deliver 8kHz mulaw; convert to PCM if you're piping to a custom STT service. Codec conversion adds 20-50ms latency, so avoid it where possible by configuring your providers to use the same format end-to-end.
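If the conversion cannot be avoided, the mulaw-to-PCM step itself is cheap. The sketch below applies the standard G.711 mu-law expansion to a buffer of bytes; the function names are illustrative, and it leaves the sample rate unchanged, so going from 8kHz to 16kHz would still need a separate resampling step.

```javascript
// Expand one 8-bit G.711 mu-law byte to a signed 16-bit linear PCM sample.
function mulawToPcm16(byte) {
  const u = ~byte & 0xff;            // mu-law bytes are stored bit-inverted
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return (u & 0x80) ? -magnitude : magnitude;
}

// Decode a buffer of mu-law bytes into an Int16Array of PCM samples.
// Sample rate is unchanged: 8kHz in, 8kHz out.
function decodeMulaw(buffer) {
  const pcm = new Int16Array(buffer.length);
  for (let i = 0; i < buffer.length; i++) pcm[i] = mulawToPcm16(buffer[i]);
  return pcm;
}

const pcm = decodeMulaw(Buffer.from([0xff, 0x80, 0x00]));
console.log(Array.from(pcm)); // [ 0, 32124, -32124 ]
```

For a hot path, a 256-entry lookup table built once from `mulawToPcm16` avoids repeating the bit arithmetic per sample.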
How much does webhook latency impact call quality?
Every webhook round-trip adds 50-200ms depending on your server location and network. If your webhook takes 300ms to respond, the user hears a 300ms delay before the next bot response. Keep webhook handlers under 100ms by offloading heavy work to async queues, caching function call results, and using connection pooling for database queries. VAPI webhooks timeout after 5 seconds; if you exceed this, the call fails. Monitor your P95 response times and implement circuit breakers for external API calls.
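Monitoring P95 only needs a list of handler durations. A minimal sketch using the nearest-rank method; the sample durations are invented and `percentile` is an illustrative helper, not a library call.

```javascript
// Nearest-rank percentile over a list of durations in milliseconds.
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical handler durations collected by request timing middleware.
const durationsMs = [12, 15, 18, 22, 25, 31, 40, 44, 58, 63,
                     70, 75, 81, 88, 90, 95, 99, 120, 180, 420];

console.log(percentile(durationsMs, 50)); // 63
console.log(percentile(durationsMs, 95)); // 180
```

In this sample the median looks healthy while P95 already exceeds the 100ms target, which is why the tail is the number to watch.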
Resources
VAPI Documentation – Official API Reference covers assistant configuration, call management, and webhook event schemas. Essential for STT/TTS provider setup and real-time transcription handling.
Twilio Voice API – Twilio Docs provides SIP integration patterns and media stream protocols for bridging voice calls into your pipeline.
WebRTC Audio Standards – RFC 7874 (WebRTC audio codec and processing requirements), RFC 6716 (Opus codec), and PCM 16kHz specs define audio encoding for low-latency STT/TTS pipelines.
GitHub Reference – Search vapi-twilio-integration for open-source examples of webhook validation, session management, and barge-in interrupt handling.



