Voice Activity Detection

Hear the speech, skip the silence.

Voice Activity Detection isolates the parts of an audio stream that actually contain speech, before the transcription engine ever sees them. The result: less compute spent on silence, less hallucinated text from background noise, and cleaner segment boundaries for downstream formatting. Configurable through VadSettings on the SpeechToText engine.

Energy threshold · Configurable padding · Stream or batch
Compute

Skip silent regions

No tokens generated for empty audio.

Accuracy

Cleaner transcripts

Fewer hallucinations from noise.

Latency

Faster real-time

Tighter segment boundaries.

UX

Natural segments

Break on real pauses, not arbitrary windows.

Why VAD before STT.

Whisper-class models are powerful, but they don't know that the first 12 seconds of your audio is the speaker walking into the room. Without VAD, the model will still try to produce tokens for that silence, often inventing phrases from training data ("Thanks for watching!" is a classic). VAD removes those regions upstream so the model only sees the parts that actually contain speech.

01

Lower compute

A 90-minute recording with 60% silence becomes a 36-minute transcription job. On a single CPU core that's the difference between "during the call" and "by the time you join the next one."

02

Fewer phantom tokens

VAD removes the input class most likely to provoke Whisper hallucinations. Combine with the engine's multi-layer suppression for production-grade reliability.

03

Natural segment boundaries

Segments break on real pauses (end of sentence, breath, room change) instead of fixed 30-second windows. Subtitles, dictation, and meeting notes all read better.

04

Composable with streaming

VAD runs inline with the streaming pipeline, so OnNewSegment only fires when the model has something to say. Live captions feel snappier and contain less noise.

VadSettings

Four knobs, fully tunable.

Defaults work for most general-purpose audio. Tune any of the four for your environment: quiet office, noisy field recording, rapid-fire conversation, slow dictation. A combined tuning sketch follows the four knobs below.

Energy

Energy threshold

RMS energy floor below which audio is classified as silence. Raise it for noisy environments (background hum, fans), lower it for soft speakers or distant microphones.

Speech

Minimum speech duration

Shortest continuous-speech run that triggers a "speech detected" segment. Filters out single coughs, clicks, and short bursts of noise without losing real speech.

Silence

Minimum silence duration

Gap length that ends a speech segment. Short for rapid conversation (segments break frequently), long for dictation (one paragraph = one segment).

Pad

Speech padding

Extra audio kept around each detected speech region. Prevents the model from clipping the first phoneme of a word when a speaker starts abruptly after silence.
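
To make the four knobs concrete, here is a minimal sketch of setting all of them at once. The VadSettings property names used below (EnergyThreshold, MinSpeechDuration, MinSilenceDuration, SpeechPadding), their units, and the property that attaches the settings to the engine are assumptions for illustration; check the Speech API reference for the shipped names.

VadTuning.cs (illustrative sketch)
using System;
using LMKit.Speech;

// Hypothetical VadSettings property names and units — verify against the API reference.
var vad = new VadSettings
{
    EnergyThreshold    = 0.02f,                           // RMS floor: raise for noisy rooms, lower for soft speakers
    MinSpeechDuration  = TimeSpan.FromMilliseconds(250),  // ignore coughs, clicks, short noise bursts
    MinSilenceDuration = TimeSpan.FromMilliseconds(500),  // gap length that closes a speech segment
    SpeechPadding      = TimeSpan.FromMilliseconds(100)   // keep context so the first phoneme isn't clipped
};

var stt = new SpeechToText(model)   // "model" is an already-loaded speech model, as in the default example below
{
    UseVad = true,
    VadSettings = vad               // assumed property name for attaching the settings to the engine
};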

Configure

Three calibrations, three audio types.

Same engine, three different parameter sets. Pick the calibration that matches your input or compose your own.

The default VadSettings works for typical office audio: a single speaker, a quiet room, a decent microphone. Use it unchanged until you have a reason to tune.

VadDefaults.cs
using LMKit.Speech;

// "model" is a speech model you have already loaded for the engine
var stt = new SpeechToText(model)
{
    UseVad = true          // defaults are tuned for general-purpose audio
};
await stt.TranscribeAsync("meeting.mp3");
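
Only the default calibration is shown above, so here are two more example calibrations sketched under the same assumed VadSettings property names: one biased for noisy field recordings, one for slow dictation. Treat the values as starting points, not shipped presets.

VadCalibrations.cs (illustrative sketch)
using System;
using LMKit.Speech;

// Noisy field recording: higher energy floor, longer runs required before a segment opens.
var fieldRecording = new VadSettings
{
    EnergyThreshold    = 0.05f,
    MinSpeechDuration  = TimeSpan.FromMilliseconds(400),
    MinSilenceDuration = TimeSpan.FromMilliseconds(700),
    SpeechPadding      = TimeSpan.FromMilliseconds(150)
};

// Slow dictation: a long silence gap keeps one paragraph in one segment.
var dictation = new VadSettings
{
    EnergyThreshold    = 0.01f,
    MinSpeechDuration  = TimeSpan.FromMilliseconds(200),
    MinSilenceDuration = TimeSpan.FromSeconds(2),
    SpeechPadding      = TimeSpan.FromMilliseconds(100)
};

Attach whichever set matches your input the same way as in the knob sketch above, via the assumed VadSettings property on the engine.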

Cleaner audio in, cleaner transcript out.

Pair VAD with streaming output for live captions that don't render phantom phrases during quiet moments.
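
A sketch of that pairing, assuming OnNewSegment is an event on the SpeechToText engine (as named earlier on this page); the shape of its event args, including the Text member read here, is an assumption for illustration.

LiveCaptions.cs (illustrative sketch)
using System;
using LMKit.Speech;

var stt = new SpeechToText(model)   // "model" as in the default example above
{
    UseVad = true                   // silence never reaches the model, so no phantom captions
};

// With VAD inline, this fires only for regions classified as speech.
// The event-args member (e.Text) is assumed for illustration.
stt.OnNewSegment += (sender, e) =>
{
    Console.WriteLine(e.Text);      // push the finished segment to the caption overlay
};

await stt.TranscribeAsync("live-feed.wav");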

Download free Speech API reference