Any audio file
WAV, MP3, FLAC, OGG, M4A, WMA.
Identify the language of any audio input in milliseconds, before a single word is decoded. The same Whisper-class model that drives transcription returns the most likely language and a confidence score, so your pipeline can route to the right downstream model, decide whether to translate, or flag uncertain audio for human review. 100+ languages, fully on-device, no cloud calls.
"en", "ja", "fr", "ar", ...
Uses the first ~30 s window only.
No upload, no API key, no logs.
SpeechToText.DetectLanguage runs the encoder on a short window
of audio (the first ~30 seconds is enough for nearly all material), reads off
the model's language-probability head, and returns the top candidate as an
ISO 639-1 code with a confidence score. No full transcription required, no
second model, no cloud roundtrip.
Detection is fast and cheap; running the wrong transcription model is slow and inaccurate. One detection pass up front saves compute, protects accuracy, and keeps the downstream pipeline simple.
01
Run a fast multilingual model for detection, then dispatch the audio to a language-specialised STT model (or a larger Whisper variant) that's known to be more accurate on that specific language. Best accuracy per cycle, with no hardcoded assumption about user content.
02
If the detected language is your target, transcribe directly. If not, switch to SpeechToTextMode.Translation for one-step transcribe-and-translate-to-English. No need to ask the user to declare a source language.
03
A meeting bot only supports six languages? A help desk only handles English and Japanese? Detect first, then drop or hand off out-of-scope audio before the expensive transcription job even starts (see the sketch after this list).
04
Stamp every recording in a media archive with its language code. Multilingual search, language-faceted browse, and per-language quality KPIs all become trivial once the tag exists.
05
Confidence below your threshold? That's a signal: the audio might be silent, music-only, multilingual (code-switching mid-clip), or in an unsupported language. Route those to a human or to a more careful pipeline.
06
Detection lives on the SpeechToText class you already use for transcription. No extra NuGet, no separate model file, no additional licence. One model load, both capabilities.
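Use cases 03, 04, and 05 compose into a single gate. A minimal sketch, assuming the DetectLanguageAsync call shown below; the two-language scope, the 0.8 threshold, and the hook functions (QueueForReview, HandOff, TagRecording) are illustrative placeholders, not LM-Kit APIs.
using System.Collections.Generic;
using LMKit.Model;
using LMKit.Speech;
var supported = new HashSet<string> { "en", "ja" }; // your in-scope languages
var stt = new SpeechToText(LM.LoadFromModelID("whisper-large-turbo3"));
var detected = await stt.DetectLanguageAsync("ticket.mp3");
if (detected.Confidence < 0.8) // illustrative threshold; tune per corpus
{
    // Might be silence, music, code-switching, or an unsupported language:
    // route to a human or a more careful pipeline.
    QueueForReview("ticket.mp3");
}
else if (!supported.Contains(detected.LanguageCode))
{
    // Out of scope: hand off before the expensive transcription starts.
    HandOff("ticket.mp3", detected.LanguageCode);
}
else
{
    // In scope: stamp the archive record, then transcribe.
    TagRecording("ticket.mp3", detected.LanguageCode);
    var transcript = await stt.TranscribeAsync("ticket.mp3");
}
// Placeholder hooks - wire these to your own review queue and archive.
void QueueForReview(string path) => Console.WriteLine($"review: {path}");
void HandOff(string path, string lang) => Console.WriteLine($"hand off ({lang}): {path}");
void TagRecording(string path, string lang) => Console.WriteLine($"tagged {lang}: {path}");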
The detection call is one line. The interesting code is what you do with the result. Three production-shaped patterns below.
Bare-minimum call. Returns an ISO 639-1 code ("en", "ja", "fr", ...) plus a confidence score in [0, 1].
using LMKit.Model;
using LMKit.Speech;
var model = LM.LoadFromModelID("whisper-large-turbo3");
var stt = new SpeechToText(model);
var result = await stt.DetectLanguageAsync("clip.mp3");
Console.WriteLine($"{result.LanguageCode} ({result.Confidence:P0})");
Detect first, then dispatch to a more accurate STT model for languages or scripts where the speed-optimised turbo variant typically gives ground. CJK and right-to-left scripts benefit most from whisper-large3; everything else stays on the fast turbo path.
var detected = await stt.DetectLanguageAsync("clip.mp3");
// Most languages run great on the speed-optimised turbo variant.
// For CJK and a handful of right-to-left scripts, the slower large
// model is measurably more accurate. Detection picks the variant.
var modelId = detected.LanguageCode switch
{
"ja" or "zh" or "ko" or "ar" or "he" => "whisper-large3",
_ => "whisper-large-turbo3",
};
var tuned = new SpeechToText(LM.LoadFromModelID(modelId));
var transcript = await tuned.TranscribeAsync("clip.mp3");
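One footnote on this pattern: loading a model is the expensive step, so a long-running service would keep each variant resident instead of re-loading per clip. A minimal per-model cache sketch, reusing the LoadFromModelID and SpeechToText calls from above; the pool dictionary is an illustrative pattern, not an LM-Kit feature.
using System.Collections.Generic;
using LMKit.Model;
using LMKit.Speech;
// Keep one SpeechToText per model ID so repeated clips skip the load cost.
var pool = new Dictionary<string, SpeechToText>();
SpeechToText GetStt(string modelId) =>
    pool.TryGetValue(modelId, out var cached)
        ? cached
        : pool[modelId] = new SpeechToText(LM.LoadFromModelID(modelId));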
Detect, then transcribe directly if the language is English, otherwise switch to translation mode so the output lands in English without a second pass.
var detected = await stt.DetectLanguageAsync("clip.mp3");
stt.Mode = detected.LanguageCode == "en"
? SpeechToTextMode.Transcription
: SpeechToTextMode.Translation; // any language -> English
var transcript = await stt.TranscribeAsync("clip.mp3");
LM-Kit ships two distinct language detection paths because they share nothing under the hood and serve different upstream inputs. Pick the one that matches your source data.
Audio
Input: a sound file or microphone stream. Engine: the Whisper encoder's language head. Output: ISO code + confidence. Use it before transcription, translation, or to tag archived recordings.
Text
Input: plain text, an image, or a document. Engine: an LLM with script-aware refiners (CJK, Cyrillic, Slavic). Output: ISO code + confidence + candidate set. Use it for routing chat messages, documents, OCR output, or already-transcribed audio.
Open text language detection →
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.
Console demo
Language detection is part of the streaming STT pipeline.
Open on GitHub →
API reference
API reference for the audio language detection method.
Open the reference →
Related
The sibling capability for text input. Lives under Text Analysis.
Open →
Pair audio language detection with streaming transcription and VAD for a production-grade multilingual STT pipeline.