Skip silent regions
No tokens generated for empty audio.
Voice Activity Detection isolates the parts of an audio stream that
actually contain speech, before the transcription engine ever sees them.
The result: less compute spent on silence, less hallucinated text from
background noise, and cleaner segment boundaries for downstream
formatting. Configurable through VadSettings on the
SpeechToText engine.
No tokens generated for empty audio.
Fewer hallucinations from noise.
Tighter segment boundaries.
Break on real pauses, not arbitrary windows.
Whisper-class models are powerful, but they don't know that the first 12 seconds of your audio is the speaker walking into the room. Without VAD, the model will still try to produce tokens for that silence, often inventing phrases from training data ("Thanks for watching!" is a classic). VAD removes those regions upstream so the model only sees the parts that actually contain speech.
01
A 90-minute recording with 60% silence becomes a 36-minute transcription job. On a single CPU core that's the difference between "during the call" and "by the time you join the next one."
02
VAD removes the input class most likely to provoke Whisper hallucinations. Combine with the engine's multi-layer suppression for production-grade reliability.
03
Segments break on real pauses (end of sentence, breath, room change) instead of fixed 30-second windows. Subtitles, dictation, and meeting notes all read better.
04
VAD runs inline with the streaming pipeline, so OnNewSegment only fires when the model has something to say. Live captions feel snappier and contain less noise.
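The streaming behavior described above can be sketched as follows. `SpeechToText`, `UseVad`, `OnNewSegment`, and `TranscribeAsync` are taken from this page; the handler signature and the `Text` property on the segment argument are assumptions for illustration, not confirmed API shapes:

```csharp
using System;
using LMKit.Speech;

var stt = new SpeechToText(model)
{
    UseVad = true // silent regions never reach the model, so no phantom segments
};

// OnNewSegment fires only when VAD has passed real speech to the model.
// The event-args shape shown here (a Text property) is an assumption.
stt.OnNewSegment += (sender, segment) =>
{
    Console.WriteLine(segment.Text); // live caption line
};

await stt.TranscribeAsync("live-meeting.mp3");
```

Because silence is filtered upstream, the handler runs only for segments that carry speech, which is what keeps live captions free of filler during quiet moments.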
Defaults work for most general-purpose audio. Tune any of the four for your environment: quiet office, noisy field recording, rapid-fire conversation, slow dictation.
Energy
RMS energy floor below which audio is classified as silence. Raise it for noisy environments (background hum, fans), lower it for soft speakers or distant microphones.
Speech
Shortest continuous-speech run that triggers a "speech detected" segment. Filters out single coughs, clicks, and short bursts of noise without losing real speech.
Silence
Gap length that ends a speech segment. Short for rapid conversation (segments break frequently), long for dictation (one paragraph = one segment).
Pad
Extra audio kept around each detected speech region. Prevents the model from clipping the first phoneme of a word when a speaker starts abruptly after silence.
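Taken together, the four knobs map one-to-one onto `VadSettings` properties. The property names below come from the snippets on this page; the values are illustrative starting points, not documented defaults:

```csharp
stt.VadSettings = new VadSettings
{
    EnergyThreshold = 0.02,                              // Energy: RMS floor below which audio counts as silence
    MinSpeechDuration = TimeSpan.FromMilliseconds(250),  // Speech: shortest run that triggers a segment
    MinSilenceDuration = TimeSpan.FromMilliseconds(700), // Silence: gap length that closes a segment
    SpeechPadding = TimeSpan.FromMilliseconds(200),      // Pad: extra audio kept around each region
};
```

Tuning one knob at a time, then listening to where segments break, is usually faster than adjusting all four at once.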
Same engine, three different parameter sets. Pick the calibration that matches your input or compose your own.
The default VadSettings works for typical office audio: a single speaker, a quiet room, a decent microphone. Use it unchanged until you have a reason to tune.
using LMKit.Speech;
var stt = new SpeechToText(model)
{
UseVad = true // defaults are tuned for general-purpose audio
};
await stt.TranscribeAsync("meeting.mp3");
Field recording or open-plan office: raise the energy threshold so background noise stays below the silence floor, and extend padding so the model never clips the start of a word.
stt.VadSettings = new VadSettings
{
EnergyThreshold = 0.04, // 2x default
MinSpeechDuration = TimeSpan.FromMilliseconds(250),
MinSilenceDuration = TimeSpan.FromMilliseconds(400),
SpeechPadding = TimeSpan.FromMilliseconds(300),
};
Dictation: long silences between sentences, no rapid turn-taking. Increase the silence threshold so a whole paragraph lands as a single segment instead of dozens.
stt.VadSettings = new VadSettings
{
MinSilenceDuration = TimeSpan.FromSeconds(1.2), // long sentence pauses
SpeechPadding = TimeSpan.FromMilliseconds(150),
};
stt.DictationFormatting = true;
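For the opposite case, the rapid-fire conversation mentioned earlier, a shorter silence gap makes segments break on each speaker turn and a shorter speech minimum keeps brief interjections. These values are illustrative, assuming the same `VadSettings` properties shown above, not recommendations from the library:

```csharp
stt.VadSettings = new VadSettings
{
    MinSilenceDuration = TimeSpan.FromMilliseconds(200), // break quickly between turns
    MinSpeechDuration = TimeSpan.FromMilliseconds(150),  // keep short interjections ("yes", "right")
    SpeechPadding = TimeSpan.FromMilliseconds(100),
};
```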
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.
Console demo
Streaming STT with VAD and hallucination suppression.
Open on GitHub →
API reference
API reference for the VAD configuration class.
Open the reference →
How-to guide
End-to-end transcription guide; VAD configuration in section 4.
Read the guide →
Pair VAD with streaming output for live captions that don't render phantom phrases during quiet moments.