Build Your Own Voice Stack with Deepgram and PlayHT: A Practical Guide
TL;DR
Most voice stacks fail because STT and TTS operate independently—you get latency jitter, buffer misalignment, and audio cutoffs mid-sentence. This guide builds a real-time conversational pipeline: Deepgram handles streaming speech-to-text with partial transcripts, PlayHT generates low-latency audio responses, and a Node.js server orchestrates the handoff. Result: sub-500ms round-trip latency, proper barge-in handling, and no audio overlap.
Prerequisites
API Keys & Credentials
You'll need active accounts with Deepgram and PlayHT. Generate API keys from both platforms' dashboards—Deepgram's key enables real-time speech-to-text streaming via WebSocket, while PlayHT's key (plus your PlayHT user ID, sent in the `X-User-ID` header) handles text-to-speech synthesis requests. Store them in a `.env` file as `DEEPGRAM_API_KEY`, `PLAYHT_API_KEY`, and `PLAYHT_USER_ID`.
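A quick startup check catches missing credentials before the first request. A minimal sketch, assuming the three variable names above and the `dotenv` package installed in the next step:

```javascript
// Load credentials from .env at startup and fail fast if any are missing.
require('dotenv').config();

const { DEEPGRAM_API_KEY, PLAYHT_API_KEY, PLAYHT_USER_ID } = process.env;
if (!DEEPGRAM_API_KEY || !PLAYHT_API_KEY || !PLAYHT_USER_ID) {
  throw new Error('Missing DEEPGRAM_API_KEY, PLAYHT_API_KEY, or PLAYHT_USER_ID in .env');
}
```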
System & Runtime Requirements
Node.js 18+ (for native fetch and async/await support). Install dependencies: npm install dotenv axios ws for WebSocket streaming and HTTP requests. You'll also need a modern browser with Web Audio API support if building a client-side component.
Network & Audio Setup
Ensure your development environment supports WebSocket connections (firewalls sometimes block these). Have a microphone available for testing real-time STT. For production, you'll need HTTPS endpoints and a domain for webhook callbacks—ngrok works for local testing.
Knowledge Assumptions
Familiarity with async JavaScript, REST APIs, and JSON payloads. Understanding of audio formats (PCM 16kHz) helps but isn't mandatory.
Step-by-Step Tutorial
Most voice stacks break because developers treat STT and TTS as separate batch operations. Real-time audio requires streaming both directions simultaneously while managing buffer states. Here's how to build it correctly.
Architecture & Flow
Your voice stack needs three concurrent processes: audio capture → Deepgram STT → LLM processing → PlayHT TTS → audio playback. The critical part is managing the bidirectional streams without blocking.
flowchart LR
A[Microphone] -->|WebSocket| B[Deepgram STT]
B -->|Transcript| C[LLM Processing]
C -->|Response Text| D[PlayHT TTS]
D -->|Audio Stream| E[Speaker]
E -.->|Barge-in Signal| B
Configuration & Setup
Deepgram Configuration - Enable interim results for low-latency partial transcripts:
const deepgramConfig = {
model: 'nova-2',
language: 'en-US',
encoding: 'linear16',
sample_rate: 16000,
channels: 1,
interim_results: true,
endpointing: 300, // 300ms silence = end of utterance
vad_events: true, // Voice activity detection
punctuate: true
};
PlayHT Configuration - Stream audio in chunks for immediate playback:
const playhtConfig = {
voice: 'larry', // Or your cloned voice ID
output_format: 'mp3',
sample_rate: 24000,
quality: 'medium', // Balance latency vs quality
speed: 1.0,
seed: null // Randomize for natural variation
};
Step-by-Step Implementation
Step 1: Initialize WebSocket Connections
Open a persistent connection to Deepgram, which uses WebSocket for bidirectional audio streaming. PlayHT uses per-request HTTP streaming with chunked transfer encoding, so there's no persistent socket to hold open—you issue a new request for each response.
const deepgramWs = new WebSocket(
`wss://api.deepgram.com/v1/listen?${new URLSearchParams(deepgramConfig)}`,
{ headers: { 'Authorization': `Token ${process.env.DEEPGRAM_API_KEY}` }}
);
// PlayHT uses HTTP streaming, not WebSocket. This request is issued per response,
// once the LLM has produced responseText (see Step 3).
const playhtStream = await fetch('https://api.play.ht/api/v2/tts/stream', {
method: 'POST',
headers: {
'Authorization': `Bearer ${process.env.PLAYHT_API_KEY}`,
'X-User-ID': process.env.PLAYHT_USER_ID,
'Content-Type': 'application/json'
},
body: JSON.stringify({
text: responseText,
voice: playhtConfig.voice,
output_format: playhtConfig.output_format,
sample_rate: playhtConfig.sample_rate
})
});
Step 2: Stream Audio to Deepgram
Capture microphone input and stream chunks immediately—do NOT buffer entire utterances. Note the format: the MediaRecorder below emits WebM/Opus, not raw PCM. If you send containerized audio like this, drop `encoding` and `sample_rate` from the Deepgram query parameters (Deepgram detects the container automatically); keep them only if you convert the mic input to raw 16 kHz PCM first.
let isProcessing = false; // Race condition guard
navigator.mediaDevices.getUserMedia({ audio: true })
.then(stream => {
const mediaRecorder = new MediaRecorder(stream, {
mimeType: 'audio/webm;codecs=opus'
});
mediaRecorder.ondataavailable = (event) => {
if (deepgramWs.readyState === WebSocket.OPEN) {
deepgramWs.send(event.data);
}
};
mediaRecorder.start(250); // Send chunks every 250ms
});
Step 3: Handle Partial Transcripts
Process interim results for responsiveness. Only trigger LLM on final transcripts to avoid duplicate responses.
deepgramWs.onmessage = async (message) => {
const data = JSON.parse(message.data);
if (data.is_final) {
const transcript = data.channel.alternatives[0].transcript;
if (isProcessing) return; // Prevent race condition
isProcessing = true;
try {
const llmResponse = await processWithLLM(transcript);
await streamTTSResponse(llmResponse);
} finally {
isProcessing = false;
}
}
};
Step 4: Stream TTS Audio Back
PlayHT returns audio chunks as they're generated. Play immediately - don't wait for complete response.
async function streamTTSResponse(text) {
const response = await fetch('https://api.play.ht/api/v2/tts/stream', {
method: 'POST',
headers: {
'Authorization': `Bearer ${process.env.PLAYHT_API_KEY}`,
'X-User-ID': process.env.PLAYHT_USER_ID,
'Content-Type': 'application/json'
},
body: JSON.stringify({ text, ...playhtConfig })
});
const reader = response.body.getReader();
const audioContext = new AudioContext({ sampleRate: 24000 });
while (true) {
const { done, value } = await reader.read();
if (done) break;
// Decode and play the chunk immediately. Caveat: decodeAudioData expects complete,
// valid audio data—arbitrary MP3 chunk boundaries can fail to decode, so in production
// accumulate chunks to frame boundaries or use a MediaSource-based player instead.
const audioBuffer = await audioContext.decodeAudioData(value.buffer);
const source = audioContext.createBufferSource();
source.buffer = audioBuffer;
source.connect(audioContext.destination);
source.start();
}
}
Error Handling & Edge Cases
WebSocket Reconnection - Deepgram closes the connection after roughly 10 seconds without receiving audio. Send a `{"type": "KeepAlive"}` text frame during silence to hold it open (sketched after the reconnect handler below), and implement exponential backoff for the drops you can't prevent:
let reconnectAttempts = 0;
const maxReconnectDelay = 30000;
deepgramWs.onclose = () => {
const delay = Math.min(1000 * Math.pow(2, reconnectAttempts), maxReconnectDelay);
setTimeout(() => {
reconnectAttempts++;
initializeDeepgram();
}, delay);
};
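To avoid most of those drops in the first place, send Deepgram's KeepAlive message during silence. A minimal sketch; the 5-second interval is a conservative choice under the ~10-second timeout:

```javascript
// Send a KeepAlive text frame every few seconds so Deepgram doesn't close
// the socket while no audio is flowing (e.g. the user is just listening).
const keepAliveTimer = setInterval(() => {
  if (deepgramWs.readyState === WebSocket.OPEN) {
    deepgramWs.send(JSON.stringify({ type: 'KeepAlive' }));
  }
}, 5000);

// Clear the timer when you tear the connection down:
// clearInterval(keepAliveTimer);
```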
Barge-in Handling - Stop TTS playback when user interrupts. Flush audio buffers to prevent old audio playing after interrupt.
deepgramWs.onmessage = (message) => {
const data = JSON.parse(message.data);
if (data.type === 'SpeechStarted') {
// User started speaking - cancel TTS immediately.
// Note: audioContext must be a shared `let` binding (not the const created inside
// streamTTSResponse) for this close-and-recreate pattern to work.
audioContext.close(); // Stops all audio sources
audioContext = new AudioContext({ sampleRate: 24000 });
isProcessing = false; // Allow new processing
}
};
Rate Limiting - PlayHT enforces 100 requests/minute. Queue requests and implement backoff on 429 errors.
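Since the guide doesn't prescribe a queue implementation, here's a minimal sketch: requests run one at a time, and 429s wait out a Retry-After header (or exponential backoff) before retrying. `ttsRequest` is any function returning the PlayHT fetch promise from earlier:

```javascript
// Serialize PlayHT requests and back off on 429 responses.
const ttsQueue = [];
let draining = false;

function enqueueTTS(ttsRequest) {
  return new Promise((resolve, reject) => {
    ttsQueue.push({ ttsRequest, resolve, reject });
    if (!draining) drainQueue();
  });
}

async function drainQueue() {
  draining = true;
  while (ttsQueue.length > 0) {
    const { ttsRequest, resolve, reject } = ttsQueue.shift();
    for (let attempt = 0; ; attempt++) {
      try {
        const response = await ttsRequest();
        if (response.status === 429) {
          // Honor Retry-After when present, otherwise back off exponentially.
          const waitSec = Number(response.headers.get('retry-after')) || 2 ** attempt;
          await new Promise(r => setTimeout(r, waitSec * 1000));
          continue;
        }
        resolve(response);
      } catch (err) {
        reject(err);
      }
      break;
    }
  }
  draining = false;
}
```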
Testing & Validation
Test with 200-500ms network jitter. Real mobile networks have variable latency. Your endpointing threshold (300ms) must account for this or you'll get false turn-taking triggers.
Validate audio format compatibility: Deepgram expects PCM 16kHz, PlayHT outputs 24kHz MP3. Resample if needed to prevent playback speed issues.
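If you capture raw PCM from a Web Audio graph instead of MediaRecorder, the browser typically hands you 44.1 or 48 kHz Float32 samples. A rough sketch of downsampling to 16 kHz 16-bit PCM before sending to Deepgram—naive decimation that's fine for prototyping, though a proper low-pass resampler is better for quality:

```javascript
// Convert Float32 samples at inputRate (e.g. 48000) to 16 kHz 16-bit PCM.
function downsampleTo16kPcm(float32Samples, inputRate, targetRate = 16000) {
  const ratio = inputRate / targetRate;
  const outLength = Math.floor(float32Samples.length / ratio);
  const pcm16 = new Int16Array(outLength);
  for (let i = 0; i < outLength; i++) {
    // Nearest-sample decimation; swap in proper filtering for production audio.
    const sample = Math.max(-1, Math.min(1, float32Samples[Math.floor(i * ratio)]));
    pcm16[i] = sample < 0 ? sample * 0x8000 : sample * 0x7fff;
  }
  return pcm16.buffer; // Send this ArrayBuffer over the Deepgram WebSocket
}
```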
System Diagram
For reference, the internal stages of a typical speech-to-text engine, from pre-processing through decoding. The end-to-end mic-to-speaker flow of this stack is the flowchart shown earlier.
graph LR
AudioInput[Audio Input]
PreProcessor[Pre-Processor]
FeatureExtractor[Feature Extraction]
AcousticModel[Acoustic Model]
LanguageModel[Language Model]
Decoder[Decoder]
PostProcessor[Post-Processor]
Transcript[Transcript Output]
ErrorHandler[Error Handler]
Log[Logging]
AudioInput-->PreProcessor
PreProcessor-->FeatureExtractor
FeatureExtractor-->AcousticModel
AcousticModel-->LanguageModel
LanguageModel-->Decoder
Decoder-->PostProcessor
PostProcessor-->Transcript
PreProcessor-- Error -->ErrorHandler
FeatureExtractor-- Error -->ErrorHandler
AcousticModel-- Error -->ErrorHandler
LanguageModel-- Error -->ErrorHandler
Decoder-- Error -->ErrorHandler
ErrorHandler-->Log
Log-->PreProcessor
Testing & Validation
Most voice stacks break in production because devs skip local testing. Here's how to catch issues before they hit users.
Local Testing
Test the full pipeline locally before deploying. This catches 80% of integration bugs.
// Test STT → LLM → TTS pipeline with mock audio
const fs = require('fs');
async function testVoicePipeline() {
const testAudioFile = './test_audio.wav'; // 16kHz PCM audio
const audioData = fs.readFileSync(testAudioFile);
// Register the handler before sending so the first result isn't missed
deepgramWs.on('message', async (data) => {
const result = JSON.parse(data);
const transcript = result.channel?.alternatives?.[0]?.transcript;
if (!transcript) return; // Skip metadata and empty results
console.log('STT Output:', transcript);
// 2. Test LLM response
const llmResponse = await fetch('https://api.openai.com/v1/chat/completions', {
method: 'POST',
headers: {
'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({
model: 'gpt-4', // The chat completions API requires a model
messages: [{ role: 'user', content: transcript }]
})
});
const llmData = await llmResponse.json();
// 3. Test PlayHT TTS
await streamTTSResponse(llmData.choices[0].message.content);
});
// 1. Test Deepgram STT
deepgramWs.send(audioData);
}
Run this with 5-10 test audio files covering edge cases: background noise, fast speech, accents, interruptions.
Webhook Validation
If using webhooks for async processing, validate signatures to prevent replay attacks. Check response codes: 200 = success, 429 = rate limit hit (back off exponentially), 503 = service down (retry with jitter).
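Neither provider's exact signature scheme is covered here, so treat the header and secret names below as placeholders. A generic HMAC-SHA256 check you can adapt to whatever your webhook provider documents:

```javascript
const crypto = require('crypto');

// Generic HMAC webhook verification. 'x-webhook-signature' and WEBHOOK_SECRET
// are placeholders—substitute the names your provider actually uses.
function verifyWebhook(req, rawBody) {
  const received = req.headers['x-webhook-signature'] || '';
  const expected = crypto
    .createHmac('sha256', process.env.WEBHOOK_SECRET)
    .update(rawBody)
    .digest('hex');
  // timingSafeEqual throws on length mismatch, so guard first.
  if (received.length !== expected.length) return false;
  return crypto.timingSafeEqual(Buffer.from(received), Buffer.from(expected));
}
```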
Real-World Example
Barge-In Scenario
User interrupts the AI mid-sentence while it's explaining a 3-step process. Most implementations break here because they don't flush the TTS buffer—the old audio keeps playing after the interrupt.
let currentAudioSource = null;
let isPlaying = false;
// Barge-in detection from Deepgram STT
deepgramWs.on('message', (message) => {
const data = JSON.parse(message);
if (data.is_final && data.speech_final) {
const transcript = data.channel.alternatives[0].transcript;
// User spoke while AI was talking = barge-in
if (isPlaying && transcript.length > 0) {
console.log('[BARGE-IN] User interrupted:', transcript);
// CRITICAL: Stop current audio immediately
if (currentAudioSource) {
currentAudioSource.stop(0);
currentAudioSource = null;
}
// Flush PlayHT stream buffer
if (playhtStream && !playhtStream.destroyed) {
playhtStream.destroy();
}
isPlaying = false;
// Process new user input
handleUserInput(transcript);
}
}
});
// Track audio playback state
function playAudioChunk(audioBuffer) {
const source = audioContext.createBufferSource();
source.buffer = audioBuffer;
source.connect(audioContext.destination);
currentAudioSource = source;
isPlaying = true;
source.onended = () => {
isPlaying = false;
currentAudioSource = null;
};
source.start(0);
}
Event Logs
Timestamp: 14:32:18.234 - AI starts TTS: "To complete the setup, first navigate to..."
Timestamp: 14:32:19.891 - Deepgram detects speech: is_final: false, transcript: "wait"
Timestamp: 14:32:20.103 - Barge-in triggered, audio source stopped
Timestamp: 14:32:20.156 - PlayHT stream destroyed, buffer flushed
Timestamp: 14:32:20.421 - New STT final: "wait, can you repeat that?"
Edge Cases
Multiple rapid interrupts: User says "wait... no... actually..." within 500ms. Without debouncing, you'll trigger 3 separate LLM calls. Add a 300ms debounce window before processing the final transcript.
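A minimal debounce sketch: rapid finals inside the window collapse into a single call, using `handleUserInput` from the barge-in example above:

```javascript
// Debounce final transcripts so "wait... no... actually..." becomes one LLM call.
const DEBOUNCE_MS = 300;
let debounceTimer = null;
let pendingTranscript = '';

function onFinalTranscript(transcript) {
  pendingTranscript = pendingTranscript ? `${pendingTranscript} ${transcript}` : transcript;
  clearTimeout(debounceTimer);
  debounceTimer = setTimeout(() => {
    const text = pendingTranscript;
    pendingTranscript = '';
    handleUserInput(text); // Single call with the combined utterance
  }, DEBOUNCE_MS);
}
```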
False positives from background noise: Breathing, keyboard clicks, or ambient sound trigger barge-in at Deepgram's default endpointing: 10 (10ms silence). Increase to endpointing: 300 for noisy environments. This prevents phantom interrupts but adds 290ms latency to legitimate barge-ins—tune based on your use case.
Partial audio playback: If you don't track isPlaying state, the system can't distinguish between "AI is speaking" vs "silence between responses." Result: user's normal speech gets treated as an interrupt, breaking turn-taking logic.
Common Issues & Fixes
Race Conditions in Audio Playback
Most voice stacks break when TTS chunks arrive faster than they can be played. You'll hear overlapping audio or the bot talking over itself.
The Problem: PlayHT streams audio chunks at ~50ms intervals, but Web Audio API scheduling isn't instant. If you queue chunks without tracking playback state, they pile up.
// BROKEN: Chunks overlap because we don't track playback
playhtStream.on('data', (chunk) => {
const audioBuffer = audioContext.decodeAudioData(chunk);
const source = audioContext.createBufferSource();
source.buffer = audioBuffer;
source.connect(audioContext.destination);
source.start(0); // ❌ Always starts immediately = overlap
});
// FIXED: Track playback timing
let nextStartTime = audioContext.currentTime;
playhtStream.on('data', async (chunk) => {
const audioBuffer = await audioContext.decodeAudioData(chunk);
const source = audioContext.createBufferSource();
source.buffer = audioBuffer;
source.connect(audioContext.destination);
// Schedule next chunk after current one finishes
source.start(Math.max(0, nextStartTime));
nextStartTime = Math.max(audioContext.currentTime, nextStartTime) + audioBuffer.duration;
});
WebSocket Reconnection Failures
Deepgram WebSocket connections drop after 10 seconds of silence or network hiccups. Without exponential backoff, you'll spam reconnect attempts and hit rate limits (429 errors).
// Exponential backoff with jitter
async function reconnectDeepgram() {
const delay = Math.min(1000 * Math.pow(2, reconnectAttempts), maxReconnectDelay);
const jitter = Math.random() * 1000; // Prevent thundering herd
await new Promise(resolve => setTimeout(resolve, delay + jitter));
deepgramWs = new WebSocket('wss://api.deepgram.com/v1/listen', {
headers: { Authorization: `Token ${process.env.DEEPGRAM_API_KEY}` }
});
reconnectAttempts++;
}
Real-world trigger: Mobile networks cause 200–500ms jitter. Raising endpointing to 1500 (from the 300ms used earlier in this guide) avoids premature end-of-utterance triggers when packets stall, at the cost of slower turn-taking.
Barge-In Audio Corruption
When users interrupt mid-sentence, you must flush the TTS buffer and cancel queued chunks. Otherwise, old audio plays after the interruption.
deepgramWs.on('message', (data) => {
const transcript = JSON.parse(data);
if (transcript.is_final && isPlaying) {
// Stop current audio immediately
if (currentAudioSource) {
currentAudioSource.stop();
currentAudioSource = null;
}
// Clear queued chunks
nextStartTime = audioContext.currentTime;
isPlaying = false;
}
});
Complete Working Example
Most voice stack tutorials show isolated snippets. Here's the full server that actually runs—WebSocket handlers, audio streaming, error recovery, and graceful shutdown. This is what you deploy.
Full Server Code
This implementation handles the complete real-time speech-to-text to text-to-speech pipeline. The server manages concurrent WebSocket connections, buffers audio chunks to prevent jitter, and implements exponential backoff for reconnection failures.
// server.js - Production voice stack server
const WebSocket = require('ws');
const express = require('express');
const { createClient } = require('@deepgram/sdk');
// Node 18+ provides a global fetch whose response body is a WHATWG ReadableStream,
// which the getReader() calls below rely on—don't substitute node-fetch here.
const app = express();
const server = require('http').createServer(app);
const wss = new WebSocket.Server({ server });
// Configuration from previous sections
const deepgramConfig = {
model: 'nova-2',
language: 'en-US',
encoding: 'linear16',
sample_rate: 16000,
channels: 1,
endpointing: 300,
interim_results: true
};
const playhtConfig = {
voice: 'jennifer',
output_format: 'mp3',
quality: 'high',
speed: 1.0
};
// Session state management
const sessions = new Map();
const SESSION_TTL = 300000; // 5 minutes
// Audio buffer management to prevent jitter
class AudioBuffer {
constructor() {
this.chunks = [];
this.isPlaying = false;
this.nextStartTime = 0;
}
add(chunk) {
this.chunks.push(chunk);
if (!this.isPlaying) this.play();
}
async play() {
this.isPlaying = true;
while (this.chunks.length > 0) {
const chunk = this.chunks.shift();
const now = Date.now();
const delay = Math.max(0, this.nextStartTime - now);
await new Promise(resolve => setTimeout(resolve, delay));
// Send chunk to client
this.nextStartTime = Date.now() + (chunk.duration || 100);
}
this.isPlaying = false;
}
clear() {
this.chunks = [];
this.isPlaying = false;
}
}
// Deepgram connection with reconnection logic
function createDeepgramConnection(sessionId) {
const deepgram = createClient(process.env.DEEPGRAM_API_KEY);
const connection = deepgram.listen.live(deepgramConfig);
const session = sessions.get(sessionId);
session.reconnectAttempts = 0;
const maxReconnectDelay = 30000;
connection.on('open', () => {
console.log(`[${sessionId}] Deepgram connected`);
session.reconnectAttempts = 0;
});
connection.on('Results', async (data) => {
if (!data.channel?.alternatives?.[0]) return;
const transcript = data.channel.alternatives[0].transcript;
if (!transcript || data.is_final === false) return;
// Prevent race condition during TTS playback
if (session.isProcessing) {
console.log(`[${sessionId}] Dropped transcript (already processing)`);
return;
}
session.isProcessing = true;
try {
// Generate LLM response (simplified - use your LLM here)
const llmResponse = await generateResponse(transcript);
// Stream TTS from PlayHT
await streamTTSResponse(sessionId, llmResponse);
} catch (error) {
console.error(`[${sessionId}] Processing error:`, error);
session.ws.send(JSON.stringify({
type: 'error',
message: 'Processing failed'
}));
} finally {
session.isProcessing = false;
}
});
connection.on('error', (error) => {
console.error(`[${sessionId}] Deepgram error:`, error);
});
connection.on('close', () => {
console.log(`[${sessionId}] Deepgram closed`);
if (sessions.has(sessionId)) {
reconnectDeepgram(sessionId);
}
});
return connection;
}
// Exponential backoff reconnection
function reconnectDeepgram(sessionId) {
const session = sessions.get(sessionId);
if (!session) return;
session.reconnectAttempts++;
const delay = Math.min(
1000 * Math.pow(2, session.reconnectAttempts),
30000
);
console.log(`[${sessionId}] Reconnecting in ${delay}ms (attempt ${session.reconnectAttempts})`);
setTimeout(() => {
if (sessions.has(sessionId)) {
session.deepgramWs = createDeepgramConnection(sessionId);
}
}, delay);
}
// PlayHT TTS streaming with buffer management
async function streamTTSResponse(sessionId, text) {
const session = sessions.get(sessionId);
if (!session) return;
// Cancel any ongoing playback (barge-in handling)
session.audioBuffer.clear();
try {
const response = await fetch('https://api.play.ht/api/v2/tts/stream', {
method: 'POST',
headers: {
'Authorization': `Bearer ${process.env.PLAYHT_API_KEY}`,
'X-User-ID': process.env.PLAYHT_USER_ID,
'Content-Type': 'application/json'
},
body: JSON.stringify({
text: text,
voice: playhtConfig.voice,
output_format: playhtConfig.output_format,
quality: playhtConfig.quality,
speed: playhtConfig.speed
})
});
if (!response.ok) {
throw new Error(`PlayHT API error: ${response.status}`);
}
const reader = response.body.getReader();
while (true) {
const { done, value } = await reader.read();
if (done) break;
// Buffer audio to prevent jitter
session.audioBuffer.add({
data: value,
duration: (value.length / 16000) * 1000 // Estimate duration
});
// Send to client WebSocket
if (session.ws.readyState === WebSocket.OPEN) {
session.ws.send(value, { binary: true });
}
}
} catch (error) {
console.error(`[${sessionId}] TTS streaming error:`, error);
throw error;
}
}
// Simplified LLM response (replace with your LLM)
async function generateResponse(transcript) {
// This is where you'd call OpenAI, Anthropic, etc.
return `You said: ${transcript}. This is a test response.`;
}
// WebSocket connection handler
wss.on('connection', (ws) => {
const sessionId = `session_${Date.now()}_${Math.random().toString(36).substr(2, 9)}`;
console.log(`[${sessionId}] Client connected`);
// Initialize session
const session = {
ws: ws,
deepgramWs: createDeepgramConnection(sessionId),
audioBuffer: new AudioBuffer(),
isProcessing: false,
reconnectAttempts: 0,
createdAt: Date.now()
};
sessions.set(sessionId, session);
// Auto-cleanup after TTL
setTimeout(() => {
if (sessions.has(sessionId)) {
console.log(`[${sessionId}] Session expired (TTL)`);
cleanupSession(sessionId);
}
}, SESSION_TTL);
// Forward client audio straight to Deepgram; clean up when the client disconnects.
// (Handler wiring assumed from the session setup above.)
ws.on('message', (data) => {
const s = sessions.get(sessionId);
if (s && s.deepgramWs) {
s.deepgramWs.send(data);
}
});
ws.on('close', () => {
console.log(`[${sessionId}] Client disconnected`);
cleanupSession(sessionId);
});
});
// Release session resources (referenced by the TTL timer above)
function cleanupSession(sessionId) {
const session = sessions.get(sessionId);
if (!session) return;
try {
session.deepgramWs?.finish?.();
} catch (err) {
console.error(`[${sessionId}] Error closing Deepgram connection:`, err);
}
session.audioBuffer.clear();
sessions.delete(sessionId);
}
const PORT = process.env.PORT || 3000; // default port is arbitrary
server.listen(PORT, () => {
console.log(`Voice stack server listening on port ${PORT}`);
});
## FAQ
### Technical Questions
**How do I handle WebSocket reconnection when Deepgram drops mid-stream?**
Implement exponential backoff with a maximum delay cap. Track `reconnectAttempts` and increment after each failed connection. Set `maxReconnectDelay` to 30 seconds to prevent runaway retry loops. When the WebSocket closes unexpectedly, calculate `delay = Math.min(1000 * Math.pow(2, reconnectAttempts), maxReconnectDelay)`, then reconnect. Store partial transcripts in memory before reconnecting so you don't lose context. Most production failures happen because developers retry immediately without backoff—this will exhaust your connection pool.
**What's the latency difference between batch STT and streaming STT?**
Batch processing (send entire audio file) adds 500ms–2s overhead for queueing and processing. Streaming STT with Deepgram returns partial transcripts within 100–300ms of audio arrival, depending on network jitter and VAD (voice activity detection) settings. For real-time voice stacks, streaming is non-negotiable. Batch is only acceptable for post-call analysis or asynchronous transcription jobs.
**How do I prevent PlayHT from speaking over Deepgram's STT?**
Set `isProcessing = true` when audio input starts, and only allow TTS output when `isProcessing = false`. Use a state machine, not boolean flags alone—this prevents race conditions where both streams try to output simultaneously. If the user interrupts mid-sentence, flush the `playhtStream` buffer immediately and set `interrupted = true` to signal cancellation downstream.
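A sketch of that state machine—explicit states and legal transitions so STT results and TTS playback can't both win at once (state names are illustrative, not from either SDK):

```javascript
// Turn-taking state machine: illegal transitions are ignored instead of
// letting a stray transcript start TTS while audio is already playing.
const States = { IDLE: 'IDLE', LISTENING: 'LISTENING', THINKING: 'THINKING', SPEAKING: 'SPEAKING' };

const transitions = {
  IDLE: ['LISTENING'],
  LISTENING: ['THINKING', 'IDLE'],
  THINKING: ['SPEAKING', 'LISTENING'], // LISTENING = barge-in before TTS started
  SPEAKING: ['LISTENING', 'IDLE'],     // LISTENING = barge-in during playback
};

let state = States.IDLE;

function transition(next) {
  if (!transitions[state].includes(next)) {
    console.warn(`Ignored illegal transition ${state} -> ${next}`);
    return false;
  }
  state = next;
  return true;
}

// Usage: only start synthesis if we legitimately moved into SPEAKING.
// if (transition(States.SPEAKING)) await streamTTSResponse(llmResponse);
```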
**Why does my voice stack have 2–3 second latency spikes?**
Check three things: (1) LLM response time (usually 500–1500ms for GPT-4), (2) TTS synthesis delay (PlayHT typically 200–800ms depending on text length), (3) network jitter on WebSocket. Profile each component separately using `console.time()`. Most developers blame the API when the bottleneck is their own LLM integration.
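A small helper for that, meant to wrap the LLM and TTS calls inside the transcript handler from the server example (label names are arbitrary):

```javascript
// Log how long each async stage of a turn takes.
async function timed(label, fn) {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    console.log(`[latency] ${label}: ${Date.now() - start}ms`);
  }
}

// Usage inside the transcript handler:
// const llmResponse = await timed('llm', () => generateResponse(transcript));
// await timed('tts', () => streamTTSResponse(sessionId, llmResponse));
```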
### Performance
**What sample rate should I use for Deepgram?**
Use 16kHz for voice conversations (standard for telephony). 8kHz works but reduces accuracy by 3–5%. Higher rates (48kHz) waste bandwidth without meaningful accuracy gains for speech. Set `sample_rate: 16000` in `deepgramConfig` and ensure your `mediaRecorder` or audio source matches this rate. Mismatched sample rates cause audio artifacts and transcription errors.
**How do I reduce PlayHT synthesis latency?**
Lower the `quality` setting from "high" to "medium" (saves 200–400ms). Reduce `speed` to 0.9–1.0 (faster speech = less synthesis time, but clarity suffers). Pre-warm the connection by sending a test request during initialization. Batch multiple short sentences into one TTS call instead of calling PlayHT for every single response chunk—this reduces API overhead by 40–60%.
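One way to batch: accumulate streamed LLM text and flush it to a single TTS call at a sentence boundary once enough has built up. The 200-character threshold is a guess to tune against time-to-first-audio:

```javascript
// Accumulate LLM output and send PlayHT sentence-sized batches instead of tiny chunks.
const MIN_BATCH_CHARS = 200; // bigger batches = fewer API calls, later first audio
let pendingText = '';

async function onLLMTextChunk(sessionId, chunk) {
  pendingText += chunk;
  const endsSentence = /[.!?]\s*$/.test(pendingText);
  if (endsSentence && pendingText.length >= MIN_BATCH_CHARS) {
    const batch = pendingText;
    pendingText = '';
    await streamTTSResponse(sessionId, batch); // one PlayHT call per batch
  }
}

// When the LLM stream ends, speak whatever is left.
async function flushPendingText(sessionId) {
  if (pendingText.trim()) {
    const batch = pendingText;
    pendingText = '';
    await streamTTSResponse(sessionId, batch);
  }
}
```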
### Platform Comparison
**Should I use Deepgram or Google Cloud Speech-to-Text?**
Deepgram is 2–3x faster for streaming (100–200ms latency vs. 300–500ms for Google). Deepgram's pricing is predictable (per-minute). Google charges per request plus storage. For real-time voice stacks, Deepgram wins on latency and cost. Google wins if you need multi-language support across 100+ languages out-of-the-box.
**PlayHT vs. ElevenLabs for TTS?**
PlayHT has lower latency (200–600ms) and better cost efficiency for high-volume applications. ElevenLabs has superior voice quality and emotional expressiveness, but adds 400–1000ms latency. For conversational AI, PlayHT is the practical choice. For branded voice experiences, ElevenLabs justifies the latency trade-off.
## Resources
**Deepgram Speech-to-Text API**
- [Official Documentation](https://developers.deepgram.com/docs) – Real-time STT, WebSocket streaming, VAD configuration
- [API Reference](https://developers.deepgram.com/reference) – Endpoint specs, authentication, error codes
**PlayHT Text-to-Speech API**
- [Official Documentation](https://docs.playht.com) – Low-latency TTS streaming, voice selection, output formats
- [API Reference](https://docs.playht.com/api-reference) – Endpoint specs, authentication, streaming protocols
**Voice AI Stack Architecture**
- [Deepgram GitHub Examples](https://github.com/deepgram-devs) – Production WebSocket implementations, audio streaming patterns
- [PlayHT GitHub Examples](https://github.com/playht) – End-to-end conversational AI pipeline examples
Written by
Voice AI Engineer & Creator
Building production voice AI systems and sharing what I learn. Focused on VAPI, LLM integrations, and real-time communication. Documenting the challenges most tutorials skip.