Building a Sub-500ms Voice Agent from Scratch
Nick Tikhonov built a voice agent pipeline from individual components: Twilio for telephony, Deepgram for transcription and turn detection, Groq-hosted Llama 3.3 70B for inference, and ElevenLabs for speech synthesis. End-to-end latency came in around 400ms, roughly twice as fast as Vapi's managed stack. The key insight is that LLM time-to-first-token dominates the pipeline's variable cost; Groq's ~80ms TTFT is what makes the sub-500ms target reachable, and keeping TTS connections warm saves another 300ms.

Turn-taking, knowing when a user is actually done speaking versus just pausing, remains the hardest unsolved piece; it requires combining audio-level voice activity detection (VAD) with semantic signals from the transcript. More teams are discovering that the orchestration layer between STT, LLM, and TTS is where voice agents are won or lost.
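The VAD-plus-semantics idea can be sketched as a small endpointing heuristic. Everything here is illustrative, not the pipeline's actual logic: the silence thresholds, the filler-word list, and the function names are all assumptions, standing in for Deepgram's real turn-detection signals.

```python
import re

# Hypothetical trailing words that suggest the speaker isn't finished.
FILLERS = {"um", "uh", "so", "and", "but", "because"}

def looks_complete(transcript: str) -> bool:
    """Crude semantic signal: the utterance doesn't end mid-thought."""
    words = re.findall(r"[a-z']+", transcript.lower())
    if not words:
        return False
    return words[-1] not in FILLERS

def end_of_turn(transcript: str, silence_ms: float) -> bool:
    """Decide whether the user has finished speaking.

    A semantically complete utterance ends the turn after a short
    silence; an apparently unfinished one gets a much longer grace
    period before the agent jumps in. Thresholds are made up.
    """
    if silence_ms >= 1200:  # long silence ends the turn regardless
        return True
    if silence_ms >= 300 and looks_complete(transcript):
        return True
    return False
```

Under these assumed thresholds, "book me a table for two" after 400ms of silence ends the turn, while "book me a table for, um" after the same pause does not; the latency win comes from not waiting the full long timeout when the transcript already reads as complete.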