VAD explained: why your AI interrupts callers - and how to fix it
VAD - Voice Activity Detection - is the component that decides when a caller has finished speaking. When it is configured too aggressively, it cuts the caller off mid-sentence. When it is too conservative, it makes the AI feel sluggish and unresponsive. Getting VAD right is the single configuration decision that most affects whether a Voice AI conversation feels natural - and it is the one that most teams treat as a default setting rather than a tuned parameter.
There is a specific type of caller complaint that Voice AI teams hear in the first two weeks of production and cannot immediately explain: "The AI keeps interrupting me." The call logs show the AI responding correctly. The STT transcripts are accurate. The LLM is generating appropriate responses. But callers are being cut off mid-sentence and the AI is responding to incomplete inputs.
The cause is almost always VAD misconfiguration. Understanding VAD - what it does, how it works, and how to tune it - is one of the most practically useful pieces of technical knowledge a Voice AI PM can have. This guide explains all of it, including the calibration decisions that the platform documentation rarely spells out clearly.
What VAD is and what it actually does
Voice Activity Detection is a signal processing component that classifies incoming audio frames as either "speech" or "silence/background noise." In a Voice AI system, VAD serves as the gatekeeper for the STT pipeline - it determines which audio to send to the speech-to-text engine and, critically, when to stop listening and trigger the AI's response generation.
The VAD decision happens in near-real-time, processing audio in small frames - typically 10–30 milliseconds each. For each frame, the VAD model outputs a probability that the frame contains speech. When that probability crosses a threshold, the frame is classified as speech. When the probability drops below the threshold for a configured duration - the "end of speech timeout" - VAD signals that the caller has finished speaking and triggers the AI response pipeline.
This sounds straightforward. In practice, human speech is not clean. People pause mid-sentence. They say "um" and then continue. They trail off and then add a final clause. They pause while thinking. VAD has to distinguish between "the caller has finished speaking" and "the caller has paused briefly and is about to continue" - and it has to make that distinction in under 500ms, because waiting longer to be sure means the AI feels slow.
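The frame-by-frame logic described above can be sketched as a small state machine. This is a minimal illustration, not any platform's actual implementation; the frame size, threshold, and timeout values are the example figures from the text:

```python
FRAME_MS = 20            # each audio frame covers 20 ms
SPEECH_THRESHOLD = 0.5   # probability above which a frame counts as speech
END_OF_SPEECH_MS = 450   # continuous silence needed to declare the turn over

def detect_end_of_speech(frame_probs):
    """Return the index of the frame where end-of-speech fires, or None.

    frame_probs: per-frame speech probabilities from a VAD model
    (stubbed here as a plain list for illustration).
    """
    silence_ms = 0
    heard_speech = False
    for i, p in enumerate(frame_probs):
        if p >= SPEECH_THRESHOLD:
            heard_speech = True
            silence_ms = 0          # any speech frame resets the silence timer
        else:
            silence_ms += FRAME_MS
            # only fire after at least one speech frame, so leading
            # silence at the start of a turn never triggers a response
            if heard_speech and silence_ms >= END_OF_SPEECH_MS:
                return i
    return None

# 10 speech frames, then continuous silence: fires once the timer elapses.
probs = [0.9] * 10 + [0.1] * 40
print(detect_end_of_speech(probs))  # → 32 (460 ms of silence observed)
```

The key property to notice: a single speech frame resets the silence timer, which is exactly why a caller's brief "um" keeps the turn open, and why the timeout value, not the threshold, dominates the interruption behaviour.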
The two failure modes - and what they sound like to callers
Failure mode 1: aggressive VAD. The VAD triggers end-of-speech too quickly, cutting the caller off before they have finished their sentence. The AI responds to an incomplete input - with a confused reply, a clarification request, or an answer to the first half of a sentence that had a different meaning when complete.
What callers say: "It kept interrupting me." "It answered before I finished." "It didn't let me explain properly." This complaint spikes on calls where callers pause naturally mid-sentence - elderly callers, callers thinking through a complex query, non-native speakers. Aggressive VAD has a disproportionate impact on vulnerable and non-native caller populations.
Failure mode 2: conservative VAD. The VAD waits too long before triggering end-of-speech, adding hundreds of milliseconds of perceived latency after every caller utterance. The caller finishes speaking. There is an awkward silence. Then the AI responds. The silence feels like the AI is "thinking too hard" - even when the actual LLM processing time is fast.
What callers say: "There was a long pause every time." "It felt slow and unresponsive." "The AI kept hesitating." Conservative VAD adds latency without adding any benefit to STT accuracy. It is a pure UX cost with no corresponding technical gain.
The four VAD parameters that matter
Different platforms expose different VAD parameters, but most implementations are controlled by some combination of four settings. Understanding what each one does is the prerequisite for tuning them correctly.
Speech probability threshold: the minimum probability score (0-1) above which an audio frame is classified as speech. A threshold of 0.5 means any frame where the model is more than 50% confident it contains speech is classified as speech. Increasing the threshold makes VAD more conservative - only frames the model is very confident about count as speech. Decreasing it makes VAD more aggressive. Start at 0.5 and adjust in increments of 0.05.
End-of-speech timeout: the duration of continuous silence (in milliseconds) that the VAD must observe before triggering end-of-speech. This is the single most impactful parameter for the interruption problem. A timeout of 300ms means the AI fires after 300ms of silence - which cuts off callers who pause naturally. A timeout of 700ms feels sluggish. The sweet spot for most English-language business call use cases is 400-500ms. Non-native speakers and elderly callers benefit from 550-650ms.
Minimum speech duration: the shortest stretch of detected speech (in milliseconds) before VAD classifies an audio segment as a genuine utterance. This filter prevents brief noises - a cough, a background sound, a brief "um" - from triggering a false end-of-speech followed by an AI response to nothing meaningful. A minimum speech duration of 100-150ms filters out most accidental triggers without affecting real utterances.
Barge-in sensitivity: controls how quickly the AI stops its own speech when the caller interrupts. High sensitivity means the AI stops speaking almost immediately when speech is detected - which can cause the AI to stop mid-sentence when a caller makes a background noise. Low sensitivity means the AI finishes what it is saying even when the caller is trying to interrupt - which frustrates callers who want to cut to the point. This setting needs to match the conversational style of your use case.
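Pulled together, the four parameters might look like this in a configuration. The key names are hypothetical - every platform labels these differently - and the values are the starting points discussed above:

```python
# Hypothetical VAD configuration. Key names are illustrative, not any
# vendor's API; values follow the starting points discussed in the text.
vad_config = {
    "speech_threshold": 0.5,           # frames above this probability count as speech
    "end_of_speech_timeout_ms": 450,   # silence needed before the AI responds
    "min_speech_duration_ms": 120,     # filters out coughs and brief background noise
    "barge_in_sensitivity": "medium",  # how quickly the AI yields when interrupted
}

def validate_vad_config(cfg):
    """Sanity-check the ranges discussed in the text before deploying."""
    assert 0.0 <= cfg["speech_threshold"] <= 1.0, "threshold is a probability"
    assert 300 <= cfg["end_of_speech_timeout_ms"] <= 800, "outside the sensible range"
    assert cfg["min_speech_duration_ms"] >= 0
    assert cfg["barge_in_sensitivity"] in ("low", "medium", "high")
    return cfg

validate_vad_config(vad_config)
```

A validation step like this is worth keeping in deployment scripts: a typo that sets the timeout to 45 instead of 450 is otherwise invisible until the complaint logs arrive.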
How I calibrate VAD on a new deployment
My VAD calibration process starts with a 20-call recording sample from the target caller population - not my colleagues in a quiet office, but real callers in the real acoustic environment. I listen to each call specifically for natural pause patterns: how long do callers in this demographic pause mid-sentence? How long are their inter-sentence pauses? These numbers set my initial end-of-speech timeout target.
For a financial services client with a primarily 55+ caller demographic, I set the initial end-of-speech timeout at 580ms - significantly higher than the platform default of 400ms. The callers in that demographic pause longer, speak more slowly, and are more likely to add a clause after what sounds like a sentence-ending pause. At 400ms the interruption rate was 12% of turns. At 580ms it dropped to 2.4%.
The rule I follow: never tune VAD using your own voice as the test input. Your speaking patterns, your pause duration, your accent, and your background noise environment are almost certainly different from your callers'. VAD must be calibrated on the actual caller population in the actual call environment. A VAD setting that sounds perfect when you test it from your office chair may cut off half your callers when they ring from a car, a kitchen, or a noisy street.
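If your STT engine returns word-level timestamps, the pause measurement in that calibration step can be scripted rather than done entirely by ear. A sketch, assuming a hypothetical list-of-dicts transcript format with word start/end times in seconds; a high percentile of the mid-turn pause distribution is a reasonable candidate for the end-of-speech timeout:

```python
def pause_durations_ms(words):
    """Gaps between consecutive words within one caller turn, in ms."""
    return [
        round((nxt["start"] - cur["end"]) * 1000)
        for cur, nxt in zip(words, words[1:])
        if nxt["start"] > cur["end"]
    ]

def percentile(values, pct):
    """Nearest-rank percentile; avoids pulling in numpy for a sketch."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[k]

# Illustrative turn: the caller pauses 300 ms and 620 ms mid-sentence.
turn = [
    {"word": "I",      "start": 0.00, "end": 0.10},
    {"word": "need",   "start": 0.15, "end": 0.40},
    {"word": "to",     "start": 0.70, "end": 0.80},   # 300 ms pause before
    {"word": "cancel", "start": 1.42, "end": 1.90},   # 620 ms pause before
]
pauses = pause_durations_ms(turn)
print(pauses)                    # [50, 300, 620]
print(percentile(pauses, 95))    # 620
```

Run over all turns in the 20-call sample, a timeout set just above the 90th-95th percentile of mid-turn pauses captures most natural hesitations without drifting into sluggish territory.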
VAD calibration starting points by use case
| Use case / population | End-of-speech timeout | Notes |
|---|---|---|
| Young professional, structured query | 350-400ms | Short natural pauses, assertive speech patterns |
| General consumer, mixed demographics | 450-500ms | Safe starting point for most inbound use cases |
| 55+ caller demographic | 550-650ms | Longer natural pauses, more mid-sentence additions |
| Non-native English speakers | 550-700ms | Longer processing pauses, more mid-sentence hesitation |
| Healthcare / clinical disclosures | 600-750ms | Callers describing symptoms often pause frequently |
| Outbound structured (confirmation calls) | 300-380ms | Short binary responses - yes/no/1/2. Faster is better. |
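For teams serving several caller populations, the starting points in the table can be kept as a simple lookup. The use-case labels here are illustrative, not a platform API:

```python
# Starting-point end-of-speech timeout ranges (ms) from the table above.
EOS_TIMEOUT_STARTING_POINTS = {
    "young_professional_structured": (350, 400),
    "general_consumer":              (450, 500),
    "age_55_plus":                   (550, 650),
    "non_native_english":            (550, 700),
    "healthcare_clinical":           (600, 750),
    "outbound_structured":           (300, 380),
}

def starting_timeout_ms(use_case):
    """Return the midpoint of the recommended range as an initial setting."""
    lo, hi = EOS_TIMEOUT_STARTING_POINTS[use_case]
    return (lo + hi) // 2

print(starting_timeout_ms("age_55_plus"))  # → 600
```

These are starting points, not final values - the calibration process above, run on real recordings, decides where within (or outside) the range a given deployment lands.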
How to measure VAD performance in production
VAD quality is measurable. Define these three metrics and track them weekly in the first month of production: the interruption rate (the share of caller turns where the caller kept speaking or resumed after the AI started responding), the end-of-speech latency (the gap between the caller's last word and the start of the AI's audio), and the false-trigger rate (the share of AI responses triggered by noise or speech fragments rather than a genuine utterance).
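A sketch of how per-turn call logs might be rolled up into an interruption rate, a false-trigger rate, and a median end-of-speech latency. The log field names are hypothetical; map them onto whatever your platform actually records:

```python
def vad_metrics(turns):
    """Weekly rollup of the three VAD quality metrics from per-turn logs."""
    n = len(turns)
    interrupted = sum(t["caller_resumed"] for t in turns)
    false_triggers = sum(t["trigger_was_noise"] for t in turns)
    latencies = sorted(t["response_latency_ms"] for t in turns)
    return {
        "interruption_rate": interrupted / n,
        "false_trigger_rate": false_triggers / n,
        "median_eos_latency_ms": latencies[n // 2],
    }

# Illustrative logs: 1 interruption and 1 false trigger in 5 turns.
turns = [
    {"caller_resumed": True,  "trigger_was_noise": False, "response_latency_ms": 520},
    {"caller_resumed": False, "trigger_was_noise": False, "response_latency_ms": 480},
    {"caller_resumed": False, "trigger_was_noise": True,  "response_latency_ms": 300},
    {"caller_resumed": False, "trigger_was_noise": False, "response_latency_ms": 450},
    {"caller_resumed": False, "trigger_was_noise": False, "response_latency_ms": 470},
]
print(vad_metrics(turns))
```

Watch how the metrics move together: lengthening the timeout should push the interruption rate down and the latency up, and the tuning question is where that trade-off is acceptable for your caller population.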
"Never tune VAD using your own voice as the test input. Your speaking patterns are almost certainly different from your callers'. A setting that sounds perfect from your office chair may cut off half your callers when they ring from a car."
- The rule I state at the start of every VAD calibration session
The setting that decides whether your AI feels natural
VAD configuration is not glamorous. It does not feature in demos. Vendors do not lead with it in sales conversations. But it is the setting that determines whether callers experience your Voice AI as a natural conversation partner or as a system that cuts them off and makes them repeat themselves.
The calibration process takes half a day if done correctly - listening to real caller recordings, measuring natural pause patterns, setting initial parameters based on the population rather than defaults, and monitoring the three production metrics in the first weeks. That half day of calibration work is visible in every interaction your AI has for the lifetime of the deployment. It is one of the highest-leverage investments a Voice AI PM can make.
Dealing with an AI that interrupts callers?
Get in touch if you are working through VAD calibration and want to discuss your specific caller population and use case.