VAD explained: why your AI interrupts callers - and how to fix it
VAD - Voice Activity Detection - is the component that decides when a caller has finished speaking. When it is configured too aggressively, it cuts the caller off mid-sentence. When it is too conservative, it makes the AI feel sluggish and unresponsive. Getting VAD right is the single configuration decision that most affects whether a Voice AI conversation feels natural - and it is the one that most teams treat as a default setting rather than a tuned parameter.
There is a specific type of caller complaint that Voice AI teams hear in the first two weeks of production and cannot immediately explain: "The AI keeps interrupting me." The call logs show the AI responding correctly. The STT transcripts are accurate. The LLM is generating appropriate responses. But callers are being cut off mid-sentence and the AI is responding to incomplete inputs.
The cause is almost always VAD misconfiguration. Understanding VAD - what it does, how it works, and how to tune it - is one of the most practically useful pieces of technical knowledge a Voice AI PM can have. This guide explains all of it, including the calibration decisions that the platform documentation rarely spells out clearly.
What VAD is and what it actually does
Voice Activity Detection is a signal processing component that classifies incoming audio frames as either "speech" or "silence/background noise." In a Voice AI system, VAD serves as the gatekeeper for the STT pipeline - it determines which audio to send to the speech-to-text engine and, critically, when to stop listening and trigger the AI's response generation.
The VAD decision happens in near-real-time, processing audio in small frames - typically 10–30 milliseconds each. For each frame, the VAD model outputs a probability that the frame contains speech. When that probability crosses a threshold, the frame is classified as speech. When the probability drops below the threshold for a configured duration - the "end of speech timeout" - VAD signals that the caller has finished speaking and triggers the AI response pipeline.
This sounds straightforward. In practice, human speech is not clean. People pause mid-sentence. They say "um" and then continue. They trail off and then add a final clause. They pause while thinking. VAD has to distinguish between "the caller has finished speaking" and "the caller has paused briefly and is about to continue" - and it has to make that distinction in under 500ms, because waiting longer to be sure means the AI feels slow.
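The frame-by-frame logic described above can be sketched as a small state machine. This is a minimal illustration, not any platform's actual implementation; the frame size, threshold, and timeout values are the example figures from the text:

```python
FRAME_MS = 20            # each audio frame covers 20 ms
SPEECH_THRESHOLD = 0.5   # probability above which a frame counts as speech
END_OF_SPEECH_MS = 450   # continuous silence needed to declare the turn over

def detect_end_of_speech(frame_probs):
    """Return the index of the frame where end-of-speech fires, or None.

    frame_probs: per-frame speech probabilities from a VAD model
    (stubbed here as a plain list for illustration).
    """
    silence_ms = 0
    heard_speech = False
    for i, p in enumerate(frame_probs):
        if p >= SPEECH_THRESHOLD:
            heard_speech = True
            silence_ms = 0          # any speech frame resets the silence timer
        else:
            silence_ms += FRAME_MS
            # only fire after at least one speech frame, so leading
            # silence at the start of a turn never triggers a response
            if heard_speech and silence_ms >= END_OF_SPEECH_MS:
                return i
    return None

# 10 speech frames, then continuous silence: fires once the timer elapses.
probs = [0.9] * 10 + [0.1] * 40
print(detect_end_of_speech(probs))  # → 32 (460 ms of silence observed)
```

The key property to notice: a single speech frame resets the silence timer, which is exactly why a caller's brief "um" keeps the turn open, and why the timeout value, not the threshold, dominates the interruption behaviour.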
The two failure modes - and what they sound like to callers
Failure mode 1: aggressive VAD. The VAD triggers end-of-speech too quickly, cutting the caller off before they have finished their sentence. The AI responds to an incomplete input - with a confused reply, a clarification request, or an answer to the first half of a sentence that had a different meaning when complete.
What callers say: "It kept interrupting me." "It answered before I finished." "It didn't let me explain properly." This complaint spikes on calls where callers pause naturally mid-sentence - elderly callers, callers thinking through a complex query, non-native speakers. Aggressive VAD has a disproportionate impact on vulnerable and non-native caller populations.
Failure mode 2: conservative VAD. The VAD waits too long before triggering end-of-speech, adding hundreds of milliseconds of perceived latency after every caller utterance. The caller finishes speaking. There is an awkward silence. Then the AI responds. The silence feels like the AI is "thinking too hard" - even when the actual LLM processing time is fast.
What callers say: "There was a long pause every time." "It felt slow and unresponsive." "The AI kept hesitating." Conservative VAD adds latency without adding any benefit to STT accuracy. It is a pure UX cost with no corresponding technical gain.
The four VAD parameters that matter
Different platforms expose different VAD parameters, but most implementations are controlled by some combination of four settings. Understanding what each one does is the prerequisite for tuning them correctly.
Speech probability threshold: the minimum probability score (0-1) above which an audio frame is classified as speech. A threshold of 0.5 means any frame where the model is more than 50% confident it contains speech is classified as speech. Increasing the threshold makes VAD more conservative - only frames the model is very confident about count as speech. Decreasing it makes VAD more aggressive. Start at 0.5 and adjust in increments of 0.05.
End-of-speech timeout: the duration of continuous silence (in milliseconds) that the VAD must observe before triggering end-of-speech. This is the single most impactful parameter for the interruption problem. A timeout of 300ms means the AI fires after 300ms of silence - which cuts off callers who pause naturally. A timeout of 700ms feels sluggish. The sweet spot for most English-language business call use cases is 400-500ms. Non-native speakers and elderly callers benefit from 550-650ms.
Minimum speech duration: the shortest stretch of detected speech (in milliseconds) before VAD classifies an audio segment as a genuine utterance. This filter prevents brief noises - a cough, a background sound, a brief "um" - from triggering a false end-of-speech followed by an AI response to nothing meaningful. A minimum speech duration of 100-150ms filters out most accidental triggers without affecting real utterances.
Barge-in sensitivity: controls how quickly the AI stops its own speech when the caller interrupts. High sensitivity means the AI stops speaking almost immediately when speech is detected - which can cause the AI to stop mid-sentence when a caller makes a background noise. Low sensitivity means the AI finishes what it is saying even when the caller is trying to interrupt - which frustrates callers who want to cut to the point. This setting needs to match the conversational style of your use case.
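Pulled together, the four parameters might look like this in a configuration. The key names are hypothetical - every platform labels these differently - and the values are the starting points discussed above:

```python
# Hypothetical VAD configuration. Key names are illustrative, not any
# vendor's API; values follow the starting points discussed in the text.
vad_config = {
    "speech_threshold": 0.5,           # frames above this probability count as speech
    "end_of_speech_timeout_ms": 450,   # silence needed before the AI responds
    "min_speech_duration_ms": 120,     # filters out coughs and brief background noise
    "barge_in_sensitivity": "medium",  # how quickly the AI yields when interrupted
}

def validate_vad_config(cfg):
    """Sanity-check the ranges discussed in the text before deploying."""
    assert 0.0 <= cfg["speech_threshold"] <= 1.0, "threshold is a probability"
    assert 300 <= cfg["end_of_speech_timeout_ms"] <= 800, "outside the sensible range"
    assert cfg["min_speech_duration_ms"] >= 0
    assert cfg["barge_in_sensitivity"] in ("low", "medium", "high")
    return cfg

validate_vad_config(vad_config)
```

A validation step like this is worth keeping in deployment scripts: a typo that sets the timeout to 45 instead of 450 is otherwise invisible until the complaint logs arrive.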
How I calibrate VAD on a new deployment
My VAD calibration process starts with a 20-call recording sample from the target caller population - not my colleagues in a quiet office, but real callers in the real acoustic environment. I listen to each call specifically for natural pause patterns: how long do callers in this demographic pause mid-sentence? How long are their inter-sentence pauses? These numbers set my initial end-of-speech timeout target.
For a financial services client with a primarily 55+ caller demographic, I set the initial end-of-speech timeout at 580ms - significantly higher than the platform default of 400ms. The callers in that demographic pause longer, speak more slowly, and are more likely to add a clause after what sounds like a sentence-ending pause. At 400ms the interruption rate was 12% of turns. At 580ms it dropped to 2.4%.
The rule I follow: never tune VAD using your own voice as the test input. Your speaking patterns, your pause duration, your accent, and your background noise environment are almost certainly different from your callers'. VAD must be calibrated on the actual caller population in the actual call environment. A VAD setting that sounds perfect when you test it from your office chair may cut off half your callers when they ring from a car, a kitchen, or a noisy street.
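If your STT engine returns word-level timestamps, the pause measurement in that calibration step can be scripted rather than done entirely by ear. A sketch, assuming a hypothetical list-of-dicts transcript format with word start/end times in seconds; a high percentile of the mid-turn pause distribution is a reasonable candidate for the end-of-speech timeout:

```python
def pause_durations_ms(words):
    """Gaps between consecutive words within one caller turn, in ms."""
    return [
        round((nxt["start"] - cur["end"]) * 1000)
        for cur, nxt in zip(words, words[1:])
        if nxt["start"] > cur["end"]
    ]

def percentile(values, pct):
    """Nearest-rank percentile; avoids pulling in numpy for a sketch."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[k]

# Illustrative turn: the caller pauses 300 ms and 620 ms mid-sentence.
turn = [
    {"word": "I",      "start": 0.00, "end": 0.10},
    {"word": "need",   "start": 0.15, "end": 0.40},
    {"word": "to",     "start": 0.70, "end": 0.80},   # 300 ms pause before
    {"word": "cancel", "start": 1.42, "end": 1.90},   # 620 ms pause before
]
pauses = pause_durations_ms(turn)
print(pauses)                    # [50, 300, 620]
print(percentile(pauses, 95))    # 620
```

Run over all turns in the 20-call sample, a timeout set just above the 90th-95th percentile of mid-turn pauses captures most natural hesitations without drifting into sluggish territory.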
VAD calibration starting points by use case
| Use case / population | End-of-speech timeout | Notes |
|---|---|---|
| Young professional, structured query | 350-400ms | Short natural pauses, assertive speech patterns |
| General consumer, mixed demographics | 450-500ms | Safe starting point for most inbound use cases |
| 55+ caller demographic | 550-650ms | Longer natural pauses, more mid-sentence additions |
| Non-native English speakers | 550-700ms | Longer processing pauses, more mid-sentence hesitation |
| Healthcare / clinical disclosures | 600-750ms | Callers describing symptoms often pause frequently |
| Outbound structured (confirmation calls) | 300-380ms | Short binary responses - yes/no/1/2. Faster is better. |
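For teams serving several caller populations, the starting points in the table can be kept as a simple lookup. The use-case labels here are illustrative, not a platform API:

```python
# Starting-point end-of-speech timeout ranges (ms) from the table above.
EOS_TIMEOUT_STARTING_POINTS = {
    "young_professional_structured": (350, 400),
    "general_consumer":              (450, 500),
    "age_55_plus":                   (550, 650),
    "non_native_english":            (550, 700),
    "healthcare_clinical":           (600, 750),
    "outbound_structured":           (300, 380),
}

def starting_timeout_ms(use_case):
    """Return the midpoint of the recommended range as an initial setting."""
    lo, hi = EOS_TIMEOUT_STARTING_POINTS[use_case]
    return (lo + hi) // 2

print(starting_timeout_ms("age_55_plus"))  # → 600
```

These are starting points, not final values - the calibration process above, run on real recordings, decides where within (or outside) the range a given deployment lands.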
How to measure VAD performance in production
VAD quality is measurable. Define these three metrics and track them weekly in the first month of production: the interruption rate (the share of caller turns where the caller kept speaking or resumed after the AI started responding), the end-of-speech latency (the gap between the caller's last word and the start of the AI's audio), and the false-trigger rate (the share of AI responses triggered by noise or speech fragments rather than a genuine utterance).
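A sketch of how per-turn call logs might be rolled up into an interruption rate, a false-trigger rate, and a median end-of-speech latency. The log field names are hypothetical; map them onto whatever your platform actually records:

```python
def vad_metrics(turns):
    """Weekly rollup of the three VAD quality metrics from per-turn logs."""
    n = len(turns)
    interrupted = sum(t["caller_resumed"] for t in turns)
    false_triggers = sum(t["trigger_was_noise"] for t in turns)
    latencies = sorted(t["response_latency_ms"] for t in turns)
    return {
        "interruption_rate": interrupted / n,
        "false_trigger_rate": false_triggers / n,
        "median_eos_latency_ms": latencies[n // 2],
    }

# Illustrative logs: 1 interruption and 1 false trigger in 5 turns.
turns = [
    {"caller_resumed": True,  "trigger_was_noise": False, "response_latency_ms": 520},
    {"caller_resumed": False, "trigger_was_noise": False, "response_latency_ms": 480},
    {"caller_resumed": False, "trigger_was_noise": True,  "response_latency_ms": 300},
    {"caller_resumed": False, "trigger_was_noise": False, "response_latency_ms": 450},
    {"caller_resumed": False, "trigger_was_noise": False, "response_latency_ms": 470},
]
print(vad_metrics(turns))
```

Watch how the metrics move together: lengthening the timeout should push the interruption rate down and the latency up, and the tuning question is where that trade-off is acceptable for your caller population.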
"Never tune VAD using your own voice as the test input. Your speaking patterns are almost certainly different from your callers'. A setting that sounds perfect from your office chair may cut off half your callers when they ring from a car."
- The rule I state at the start of every VAD calibration session
The setting that decides whether your AI feels natural
VAD configuration is not glamorous. It does not feature in demos. Vendors do not lead with it in sales conversations. But it is the setting that determines whether callers experience your Voice AI as a natural conversation partner or as a system that cuts them off and makes them repeat themselves.
The calibration process takes half a day if done correctly - listening to real caller recordings, measuring natural pause patterns, setting initial parameters based on the population rather than defaults, and monitoring the three production metrics in the first weeks. That half day of calibration work is visible in every interaction your AI has for the lifetime of the deployment. It is one of the highest-leverage investments a Voice AI PM can make.
Dealing with an AI that interrupts callers?
Get in touch if you are working through VAD calibration and want to discuss your specific caller population and use case.