Job Description
We are seeking an experienced Senior Voice AI Engineer to build the voice infrastructure for an intelligent conversational AI agent serving a US-based client. You will own the real-time voice layer, ensuring natural, low-latency voice interactions that feel human-like and responsive.
This is a hands-on technical role requiring deep expertise in speech technologies, real-time audio systems, and telephony integration. You should have proven experience building production voice systems that handle real user conversations at scale.
Key Responsibilities
Speech and Voice Pipeline
Implement and optimize Speech-to-Text (STT) pipelines for accuracy, latency, and robustness
Integrate and fine-tune Text-to-Speech (TTS) engines for natural prosody and appropriate tone
Implement Voice Activity Detection (VAD) for accurate speech endpoint detection
Handle interruptions, barge-in, and natural turn-taking in conversations
Optimize for real-time performance, targeting end-to-end latency under 500 ms
Real-Time Infrastructure
Build low-latency audio streaming infrastructure using WebSockets/WebRTC
Implement audio preprocessing (noise reduction, echo cancellation, normalization)
Design resilient pipelines that handle network variability and audio quality issues
Build connection management for concurrent voice sessions at scale
Telephony Integration
Integrate with telephony platforms (Twilio, Vonage) for phone-based voice channels
Handle call lifecycle management (inbound, outbound, transfers, hold)
Implement DTMF handling and IVR fallback capabilities
Support multiple audio codecs and telephony protocols
Quality Optimization
Establish metrics for voice quality (latency, Word Error Rate, naturalness)
Build monitoring and alerting for real-time voice pipeline health
Analyze call recordings to identify quality improvement opportunities
Collaborate with the AI/Agent team on seamless voice-to-agent handoff