Solutions · Speech & Audio · Audio language detection

Know which language is being spoken, before you transcribe it.

Identify the language of any audio input in milliseconds, before a single word is decoded. The same Whisper-class model that drives transcription returns the most likely language and a confidence score, so your pipeline can route to the right downstream model, decide whether to translate, or flag uncertain audio for human review. 100+ languages, fully on-device, no cloud calls.

100+ languages · ISO code + confidence · Whisper-driven
Input

Any audio file

WAV, MP3, FLAC, OGG, M4A, WMA.

Output

ISO code + confidence

"en", "ja", "fr", "ar", ...

Latency

Single-pass

Uses the first ~30 s window only.

Privacy

100% on-device

No upload, no API key, no logs.

Audio in, ISO code out, milliseconds later.

SpeechToText.DetectLanguage runs the encoder on a short window of audio (the first ~30 seconds is enough for nearly all material), reads off the model's language-probability head, and returns the top candidate as an ISO 639-1 code with a confidence score. No full transcription required, no second model, no cloud roundtrip.

Benefits

Six reasons to detect before transcribing.

Detection is fast and cheap; running the wrong transcription model is slow and inaccurate. One detection pass up front saves compute, improves accuracy, and cuts downstream complexity.

01

Auto-route to the right model

Run a fast multilingual model for detection, then dispatch the audio to a language-specialised STT model (or a larger Whisper variant) that's known to be more accurate on that specific language. Best accuracy per cycle, with no hardcoded assumption about user content.
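A minimal sketch of this routing, reusing the `LM.LoadFromModelID` and `DetectLanguageAsync` calls shown on this page. The language-to-model mapping and the model IDs other than `whisper-large-turbo3` are placeholders, not real catalogue entries; substitute the models you actually deploy.

```csharp
using LMKit.Model;
using LMKit.Speech;

// A small detector instance handles the cheap detection pass.
var detector = new SpeechToText(LM.LoadFromModelID("whisper-large-turbo3"));
var detected = await detector.DetectLanguageAsync("meeting.mp3");

// Hypothetical mapping from ISO code to a language-specialised model ID.
var modelByLanguage = new Dictionary<string, string>
{
    ["ja"] = "your-japanese-stt-model",   // placeholder IDs, not real entries
    ["fr"] = "your-french-stt-model",
};

// Fall back to the multilingual default when no specialist is registered.
string modelId = modelByLanguage.TryGetValue(detected.LanguageCode, out var id)
    ? id
    : "whisper-large-turbo3";

var stt = new SpeechToText(LM.LoadFromModelID(modelId));
// ...transcribe with the model best suited to the detected language.
```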

02

Decide when to translate

If the detected language is your target, transcribe directly. If not, switch to SpeechToTextMode.Translation for one-step transcribe-and-translate-to-English. No need to ask the user to declare a source language.
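A sketch of that decision, assuming English is the target. `SpeechToTextMode.Translation` is named on this page; the `Mode` property and the `Transcription` enum member are illustrative assumptions, so check the Speech API reference for the exact members.

```csharp
using LMKit.Model;
using LMKit.Speech;

var stt = new SpeechToText(LM.LoadFromModelID("whisper-large-turbo3"));
var detected = await stt.DetectLanguageAsync("interview.mp3");

// Property and enum member names are illustrative assumptions.
stt.Mode = detected.LanguageCode == "en"
    ? SpeechToTextMode.Transcription   // already in the target language
    : SpeechToTextMode.Translation;    // transcribe and translate to English in one pass
```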

03

Skip work on out-of-scope content

A meeting bot only supports six languages? A help desk only handles English and Japanese? Detect first, then drop or hand off out-of-scope audio before the expensive transcription job even starts.
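The gate itself is a set lookup. A sketch for the two-language help-desk case; the file name and the hand-off step are placeholders.

```csharp
using LMKit.Model;
using LMKit.Speech;

var stt = new SpeechToText(LM.LoadFromModelID("whisper-large-turbo3"));

// Only these languages are in scope; everything else is rejected up front.
var supported = new HashSet<string> { "en", "ja" };

var detected = await stt.DetectLanguageAsync("ticket-audio.mp3");
if (!supported.Contains(detected.LanguageCode))
{
    Console.WriteLine($"Out of scope ({detected.LanguageCode}), skipping transcription.");
    return; // or hand off to a fallback queue
}
// ...otherwise run the expensive transcription job here.
```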

04

Tag content for search and archive

Stamp every recording in a media archive with its language code. Multilingual search, language-faceted browse, and per-language quality KPIs all become trivial once the tag exists.
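A sketch of a batch tagging pass, assuming language codes are persisted as sidecar files next to each recording; in practice you would write to whatever metadata store your archive uses.

```csharp
using LMKit.Model;
using LMKit.Speech;

var stt = new SpeechToText(LM.LoadFromModelID("whisper-large-turbo3"));

foreach (var path in Directory.EnumerateFiles("archive", "*.mp3"))
{
    var detected = await stt.DetectLanguageAsync(path);
    // Sidecar file shown for illustration: "clip.mp3.lang" containing e.g. "fr".
    await File.WriteAllTextAsync(path + ".lang", detected.LanguageCode);
}
```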

05

Flag low-confidence audio for review

Confidence below your threshold? That's a signal: the audio might be silent, music-only, multilingual (code-switching mid-clip), or in an unsupported language. Route those to a human or to a more careful pipeline.
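A sketch of the threshold check, using the `Confidence` score in [0, 1] described above. The 0.80 cutoff and the in-memory review queue are placeholders; tune the threshold on your own corpus.

```csharp
using LMKit.Model;
using LMKit.Speech;

const double Threshold = 0.80; // placeholder value, tune per corpus

var stt = new SpeechToText(LM.LoadFromModelID("whisper-large-turbo3"));
var reviewQueue = new Queue<string>(); // stand-in for your real review pipeline

var detected = await stt.DetectLanguageAsync("clip.mp3");
if (detected.Confidence < Threshold)
{
    // Could be silence, music, mid-clip code-switching, or an unsupported language.
    reviewQueue.Enqueue("clip.mp3");
}
```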

06

Native .NET, no second dependency

Detection lives on the SpeechToText class you already use for transcription. No extra NuGet, no separate model file, no additional licence. One model load, both capabilities.

Three patterns

Detect, then do something useful.

The detection call is one line. The interesting code is what you do with the result. Three production-shaped patterns below.

Bare-minimum call. Returns an ISO 639-1 code ("en", "ja", "fr", ...) plus a confidence score in [0, 1].

Detect.cs
using LMKit.Model;
using LMKit.Speech;

// One model load serves both detection and transcription.
var model = LM.LoadFromModelID("whisper-large-turbo3");
var stt   = new SpeechToText(model);

// Detects on the first ~30 s window; no transcription is performed.
var result = await stt.DetectLanguageAsync("clip.mp3");
Console.WriteLine($"{result.LanguageCode} ({result.Confidence:P0})");

Detect first, then ship the right pipeline.

Pair audio language detection with streaming transcription and VAD for a production-grade multilingual STT pipeline.

Download free Speech API reference