Any audio file
WAV, MP3, FLAC, OGG, M4A, WMA.
Identify the language of any audio input in milliseconds, before a single word is decoded. The same Whisper-class model that drives transcription returns the most likely language and a confidence score, so your pipeline can route to the right downstream model, decide whether to translate, or flag uncertain audio for human review. 100+ languages, fully on-device, no cloud calls.
"en", "ja", "fr", "ar", ...
Uses the first ~30 s window only.
No upload, no API key, no logs.
SpeechToText.DetectLanguage runs the encoder on a short window
of audio (the first ~30 seconds is enough for nearly all material), reads off
the model's language-probability head, and returns the top candidate as an
ISO 639-1 code with a confidence score. No full transcription required, no
second model, no cloud roundtrip.
Detection is fast and cheap; running the wrong transcription model is slow and inaccurate. One detection pass up front saves compute, protects accuracy, and keeps the downstream pipeline simple.
01
Run a fast multilingual model for detection, then dispatch the audio to a language-specialised STT model (or a larger Whisper variant) that's known to be more accurate on that specific language. Best accuracy per cycle, with no hardcoded assumption about user content.
02
If the detected language is your target, transcribe directly. If not, switch to SpeechToTextMode.Translation for one-step transcribe-and-translate-to-English. No need to ask the user to declare a source language.
03
A meeting bot only supports six languages? A help desk only handles English and Japanese? Detect first, then drop or hand off out-of-scope audio before the expensive transcription job even starts (see the sketch after this list).
04
Stamp every recording in a media archive with its language code. Multilingual search, language-faceted browse, and per-language quality KPIs all become trivial once the tag exists.
05
Confidence below your threshold? That's a signal: the audio might be silent, music-only, multilingual (code-switching mid-clip), or in an unsupported language. Route those to a human or to a more careful pipeline.
06
Detection lives on the SpeechToText class you already use for transcription. No extra NuGet, no separate model file, no additional licence. One model load, both capabilities.
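Use cases 03, 04, and 05 compose into a single gate. A minimal sketch, assuming the DetectLanguageAsync call shown below; the two-language scope, the 0.8 threshold, and the hook functions (QueueForReview, HandOff, TagRecording) are illustrative placeholders, not LM-Kit APIs.
using System.Collections.Generic;
using LMKit.Model;
using LMKit.Speech;
var supported = new HashSet<string> { "en", "ja" }; // your in-scope languages
var stt = new SpeechToText(LM.LoadFromModelID("whisper-large-turbo3"));
var detected = await stt.DetectLanguageAsync("ticket.mp3");
if (detected.Confidence < 0.8) // illustrative threshold; tune per corpus
{
    // Might be silence, music, code-switching, or an unsupported language:
    // route to a human or a more careful pipeline.
    QueueForReview("ticket.mp3");
}
else if (!supported.Contains(detected.LanguageCode))
{
    // Out of scope: hand off before the expensive transcription starts.
    HandOff("ticket.mp3", detected.LanguageCode);
}
else
{
    // In scope: stamp the archive record, then transcribe.
    TagRecording("ticket.mp3", detected.LanguageCode);
    var transcript = await stt.TranscribeAsync("ticket.mp3");
}
// Placeholder hooks - wire these to your own review queue and archive.
void QueueForReview(string path) => Console.WriteLine($"review: {path}");
void HandOff(string path, string lang) => Console.WriteLine($"hand off ({lang}): {path}");
void TagRecording(string path, string lang) => Console.WriteLine($"tagged {lang}: {path}");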
The detection call is one line. The interesting code is what you do with the result. Three production-shaped patterns below.
Bare-minimum call. Returns an ISO 639-1 code ("en", "ja", "fr", ...) plus a confidence score in [0, 1].
using LMKit.Model;
using LMKit.Speech;
var model = LM.LoadFromModelID("whisper-large-turbo3");
var stt = new SpeechToText(model);
var result = await stt.DetectLanguageAsync("clip.mp3");
Console.WriteLine($"{result.LanguageCode} ({result.Confidence:P0})");
Detect first, then dispatch to a more accurate STT model for languages or scripts where the speed-optimised turbo variant typically gives ground. CJK and right-to-left scripts benefit most from whisper-large3; everything else stays on the fast turbo path.
var detected = await stt.DetectLanguageAsync("clip.mp3");
// Most languages run great on the speed-optimised turbo variant.
// For CJK and a handful of right-to-left scripts, the slower large
// model is measurably more accurate. Detection picks the variant.
var modelId = detected.LanguageCode switch
{
"ja" or "zh" or "ko" or "ar" or "he" => "whisper-large3",
_ => "whisper-large-turbo3",
};
var tuned = new SpeechToText(LM.LoadFromModelID(modelId));
var transcript = await tuned.TranscribeAsync("clip.mp3");
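One footnote on this pattern: loading a model is the expensive step, so a long-running service would keep each variant resident instead of re-loading per clip. A minimal per-model cache sketch, reusing the LoadFromModelID and SpeechToText calls from above; the pool dictionary is an illustrative pattern, not an LM-Kit feature.
using System.Collections.Generic;
using LMKit.Model;
using LMKit.Speech;
// Keep one SpeechToText per model ID so repeated clips skip the load cost.
var pool = new Dictionary<string, SpeechToText>();
SpeechToText GetStt(string modelId) =>
    pool.TryGetValue(modelId, out var cached)
        ? cached
        : pool[modelId] = new SpeechToText(LM.LoadFromModelID(modelId));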
Detect, then transcribe directly if the language is English, otherwise switch to translation mode so the output lands in English without a second pass.
var detected = await stt.DetectLanguageAsync("clip.mp3");
stt.Mode = detected.LanguageCode == "en"
? SpeechToTextMode.Transcription
: SpeechToTextMode.Translation; // any language -> English
var transcript = await stt.TranscribeAsync("clip.mp3");
LM-Kit ships two distinct language detection paths because they share nothing under the hood and serve different upstream inputs. Pick the one that matches your source data.
Audio
Input: a sound file or microphone stream. Engine: the Whisper encoder's language head. Output: ISO code + confidence. Use it before transcription, translation, or to tag archived recordings.
Text
Input: plain text, an image, or a document. Engine: an LLM with script-aware refiners (CJK, Cyrillic, Slavic). Output: ISO code + confidence + candidate set. Use it for routing chat messages, documents, OCR output, or already-transcribed audio.
Open text language detection →
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.
Console demo
Language detection is part of the streaming STT pipeline.
Open on GitHub →
API reference
API reference for the audio language detection method.
Open the reference →
Related
The sibling capability for text input. Lives under Text Analysis.
Open →
Pair audio language detection with streaming transcription and VAD for a production-grade multilingual STT pipeline.