A digital services team builds a client-facing Voice Bot that answers customer queries using a hybrid speech stack combining open-source ASR/TTS models and vendor fallbacks. The role focuses on streaming audio, real-time STT/TTS and telephony integration (SIP, WebRTC), and requires experience shipping streaming speech in production and working with WebSockets or gRPC.
The mission
You will join a cross-disciplinary community of AI Dev Engineers and Data Scientists responsible for delivering a production voice assistant used by enterprise clients. The project combines open-source ASR (for example Whisper, wav2vec2, NeMo) and neural TTS components with vendor APIs to meet strict latency and reliability targets for live conversations.
Day to day, you will design and implement streaming pipelines for audio ingest, VAD/endpointing, STT, orchestration with LLMs, and streaming TTS. You will own turn-taking logic (barge-in, interruptions, endpointing), measure conversation KPIs (WER by cohort, latency p95) and integrate telephony channels (PSTN/IVR, SIP, CPaaS, WebRTC). As Senior Voice AI Engineer, you will work with teams across Belgium and the wider EU to deploy, monitor and iterate on these services in containerised environments.
Your responsibilities
- Design and deliver production streaming audio pipelines that meet latency and accuracy SLAs, including VAD, STT, orchestration and streaming TTS.
- Implement and tune turn-taking, barge-in and endpointing logic to reduce latency p95 and improve conversation-level KPIs.
- Integrate telephony and app channels, including SIP, PSTN/IVR, WebRTC and CPaaS, handling telephony codecs (u-law/A-law) and the constraints of 8 kHz audio.
- Build resilience into the stack with retries, backpressure, rate limiting and fallbacks between open-source and vendor components.
- Automate builds, container images and CI/CD pipelines (GitLab CI), and implement code, model and data versioning for production deployments.
- Collaborate with Data Scientists and IT Production to define evaluation frameworks (WER by cohort, latency p95), monitoring and retraining strategies.
Your profile
Essential skills
- 4+ years of engineering experience, including at least 2 years shipping streaming speech in production.
- Proven ability with streaming audio, WebSockets or gRPC, and real-time STT and TTS systems.
- Hands-on experience with open-source ASR/TTS such as Whisper, NeMo, wav2vec2 and neural TTS stacks.
- Practical telephony/WebRTC integration experience and familiarity with SIP, PSTN/IVR, CPaaS and codecs (u-law/A-law).
- Strong automation skills: containerisation/virtualisation, CI/CD (GitLab CI), and model/data/code versioning.
- Comfortable evaluating models and systems using WER by cohort, latency p95 and other conversation KPIs.
Preferred skills
- Proficiency in Python plus one systems language (Go, Rust or C++).
- Experience with PostgreSQL, speaker diarization, echo cancellation constraints and semantic VAD/endpointing models.
- Experience operating in regulated environments (banking, insurance, health) and integrating with legacy/distributed systems.
Languages
- English, C1 (mandatory)
- Dutch, B2 (nice to have)
- French, B2 (nice to have)
Education
- Bachelor's degree in Computer Science or Engineering, or equivalent practical experience