Real-time translation
Any language to English in one step.
The fastest and most accurate speech-to-text implementation for .NET. Transform speech into structured, searchable text entirely on-device with advanced hallucination suppression, Voice Activity Detection, intelligent dictation formatting, real-time translation to English, and 100+ language support. A growing local STT stack with zero cloud dependency, continuously improved for better accuracy and performance.
Any language to English in one step.
OnNewSegment event for live UIs.
Perfect for SRT/VTT subtitles.
WAV, MP3, FLAC, OGG, M4A, WMA.
SpeechToText is LM-Kit.NET's high-performance engine for converting
audio content into structured, searchable text. This is the most accurate and
fastest speech-to-text implementation available for .NET, with complete native
integration and zero external dependencies.
Built under continuous innovation, LM-Kit delivers production-ready speech recognition with Voice Activity Detection, advanced hallucination suppression, dictation formatting, real-time translation, and support for 100+ languages. Accuracy and performance improve with each release. The same technology can be leveraged to build thousands of other AI capabilities in your .NET applications.
See it in action: LynxTranscribe is a full-featured, open-source transcription application built with LM-Kit.NET and .NET MAUI. Drag-and-drop audio files, record from microphone, export to multiple formats. A complete integration demonstration of LM-Kit speech-to-text technology running 100% locally.
App
Cross-platform desktop transcription app for Windows and macOS. Full integration demo of LM-Kit.NET speech-to-text capabilities: drag-and-drop audio files, live microphone recording, multiple export formats including SRT/VTT subtitles, all processed entirely on-device.
View LynxTranscribe on GitHub
Minimal console application demonstrating audio transcription with streaming output, model selection, and confidence scoring.
View audio_transcription on GitHub
Multiple model sizes are available today, from ultra-fast edge deployment to maximum accuracy. The catalog grows with every release; swap models through configuration only.
| Model | Model ID | Size | Speed | Best for |
|---|---|---|---|---|
| Whisper Tiny | whisper-tiny | ~40 MB | Fastest | Edge |
| Whisper Base | whisper-base | ~70 MB | Fast | Lightweight |
| Whisper Small | whisper-small | ~240 MB | Fast | Balanced |
| Whisper Medium | whisper-medium | ~760 MB | Balanced | Accurate |
| Whisper Large Turbo V3 | whisper-large-turbo3 | ~810 MB | Fast | Accurate |
| Whisper Large V2 | whisper-large2 | ~1.54 GB | High accuracy | Translation |
| Whisper Large V3 | whisper-large3 | ~1.54 GB | Most accurate | Translation |
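Because models are addressed by ID, a deployment can pick its model from configuration alone. A minimal sketch, assuming hypothetical profile names (the IDs come from the table above, and the resolved ID would then be passed to LM.LoadFromModelID):

```csharp
using System.Collections.Generic;

// Illustrative only: map a deployment profile from app configuration to a
// Whisper model ID from the catalog, so models swap without code changes.
static class ModelSelector
{
    static readonly Dictionary<string, string> Profiles = new()
    {
        ["edge"] = "whisper-tiny",              // ~40 MB, fastest
        ["balanced"] = "whisper-small",         // ~240 MB
        ["accurate"] = "whisper-large-turbo3",  // ~810 MB, fast and accurate
        ["translation"] = "whisper-large3",     // ~1.54 GB, most accurate
    };

    // Unknown profiles fall back to a lightweight default.
    public static string Resolve(string profile) =>
        Profiles.TryGetValue(profile, out var id) ? id : "whisper-base";
}
```

The profile names ("edge", "balanced", and so on) are hypothetical; only the model IDs are from the catalog.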
Full Model Catalog
Browse every speech-to-text model shipped with LM-Kit.NET
Robust adaptive filtering that reduces hallucinations and false positives.
Eliminate phantom text
Speech-to-text models can occasionally produce hallucinated outputs, especially during silent or low-energy audio segments. Common hallucinations include phrases like "Thank you for watching", "Subscribe", "Hello", or other phantom text that does not correspond to actual speech in the audio.
LM-Kit's SuppressHallucinations feature applies advanced adaptive filtering that combines multiple validation strategies, including entropy-based adaptive mathematical analysis and innovative signal processing techniques. This technology is continuously improved by our R&D team, delivering better accuracy with each release. Additional proprietary approaches further enhance detection reliability.
Layer 01
Compare segment energy against adaptive thresholds.
Layer 02
Dynamic threshold based on segment history.
Layer 03
Model confidence in speech presence.
Layer 04
High confidence bypasses additional checks.
Layer 05
Words per second within human range.
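As a rough illustration of what such layered checks can look like (a conceptual sketch under simplified assumptions, not LM-Kit's actual filtering), an energy gate with an adaptive threshold and a speaking-rate plausibility test might be written as:

```csharp
using System;
using System.Linq;

// Conceptual sketch only, not LM-Kit internals: two of the layers above.
static class SegmentChecks
{
    // Layers 01/02: RMS energy of the segment compared against a threshold
    // adapted from the running history of previous segment energies.
    public static bool HasEnoughEnergy(float[] samples, float[] history, float floor = 0.01f)
    {
        float rms = (float)Math.Sqrt(samples.Select(s => s * s).Average());
        float adaptive = history.Length > 0 ? history.Average() * 0.5f : floor;
        return rms >= Math.Max(floor, adaptive);
    }

    // Layer 05: human speech rarely exceeds roughly 4 words per second, so a
    // burst of text over a short, quiet segment is a hallucination signal.
    public static bool PlausibleRate(int wordCount, double durationSeconds, double maxWps = 4.0)
        => durationSeconds > 0 && wordCount / durationSeconds <= maxWps;
}
```

The thresholds (0.01 floor, 0.5 history weight, 4 words/second) are illustrative constants, not values used by the library.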
Everything you need to build production-grade audio transcription pipelines, fully integrated with .NET.
Detection
Automatic language detection across 100+ languages. No manual configuration required. DetectLanguage returns ISO language codes with confidence scores.
VAD
Configurable VAD isolates speech from silence and background noise. Energy thresholds, speech/silence durations, and padding are all adjustable via VadSettings.
Translation
Transcribe any language and translate to English simultaneously using SpeechToTextMode.Translation. One-step multilingual content processing.
Format
Intelligent punctuation and capitalization for dictation workflows. Transform raw speech into properly formatted text automatically.
Timing
Every AudioSegment includes start/end timestamps, text, confidence score, and detected language. Perfect for subtitles, video sync, and searchable archives.
Audio
WAV, MP3, FLAC, OGG, M4A, WMA. Any sample rate, mono or stereo. The WaveFile class handles format detection and conversion automatically.
Suppress
Multi-layer adaptive filtering eliminates false positives and phantom text through RMS analysis, statistical adaptation, and speaking rate validation.
Stream
OnNewSegment event delivers transcription results in real-time as audio is processed. Build responsive UIs with immediate feedback.
.NET
Completely integrated with .NET ecosystem. No external dependencies, no interop complexity. Use familiar C# patterns and async/await throughout.
Configurable VAD parameters for optimal transcription accuracy in any environment.
Isolate what matters
Voice Activity Detection distinguishes speech from background noise, silence, and non-speech audio. By processing only meaningful speech segments, VAD dramatically improves transcription accuracy while reducing processing time and resource usage.
LM-Kit's VadSettings class provides fine-grained control over detection parameters, letting you tune sensitivity for different audio environments, from quiet meeting rooms to noisy field recordings.
Setting
Minimum energy level to detect speech vs silence.
Setting
Minimum continuous speech to trigger detection.
Setting
Gap length that ends a speech segment.
Setting
Extra audio context around detected speech.
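To make these four knobs concrete, here is a conceptual energy-gate VAD sketch (not LM-Kit's implementation; VadSettings handles this for you) showing how the threshold, minimum durations, and padding interact:

```csharp
using System;
using System.Collections.Generic;

// A detected speech region in seconds.
record SpeechRegion(double Start, double End);

// Conceptual sketch: frames already classified as voiced (energy >= threshold)
// are grouped into regions; short blips are dropped, gaps shorter than the
// minimum silence do not split a region, and padding widens each region.
static class SimpleVad
{
    public static List<SpeechRegion> Detect(
        bool[] voicedFrames, double frameSec,
        double minSpeechSec, double minSilenceSec, double paddingSec)
    {
        var regions = new List<SpeechRegion>();
        int start = -1, lastVoiced = -1;

        for (int i = 0; i < voicedFrames.Length; i++)
        {
            if (voicedFrames[i])
            {
                if (start < 0) start = i;   // open a candidate region
                lastVoiced = i;
            }
            else if (start >= 0 && (i - lastVoiced) * frameSec >= minSilenceSec)
            {
                Close();                    // gap long enough to end the region
            }
        }
        Close();                            // flush a region still open at EOF
        return regions;

        void Close()
        {
            if (start < 0) return;
            double dur = (lastVoiced - start + 1) * frameSec;
            if (dur >= minSpeechSec)        // drop blips below MinSpeechDuration
                regions.Add(new SpeechRegion(
                    Math.Max(0, start * frameSec - paddingSec),      // pad start
                    (lastVoiced + 1) * frameSec + paddingSec));      // pad end
            start = -1;
        }
    }
}
```

In a real pipeline, VadSettings applies this kind of logic internally; the sketch only shows what each parameter controls.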
Complete examples showing audio transcription and translation with streaming output.
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;

namespace YourNamespace
{
    class Program
    {
        static void Main(string[] args)
        {
            // Instantiate the Whisper model by ID.
            // See the full model catalog at:
            // https://docs.lm-kit.com/lm-kit-net/guides/getting-started/model-catalog.html
            var model = LM.LoadFromModelID("whisper-large-turbo3");

            // Open the WAV file from disk for transcription
            var wavFile = new WaveFile(@"d:\discussion.wav");

            // Create the speech-to-text engine for streaming, multi-turn transcription
            var engine = new SpeechToText(model);

            // Print each segment of transcription as it's received (e.g., real-time display)
            engine.OnNewSegment += (sender, e) => Console.WriteLine(e.Segment);

            // Transcribe the entire WAV file; returns the full transcription information
            var transcription = engine.Transcribe(wavFile);

            // TODO: handle transcription results (e.g., save to file or process further)
        }
    }
}
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;

namespace YourNamespace
{
    class Program
    {
        static void Main(string[] args)
        {
            // Instantiate the Whisper model by ID.
            // See the full model catalog at:
            // https://docs.lm-kit.com/lm-kit-net/guides/getting-started/model-catalog.html
            var model = LM.LoadFromModelID("whisper-large-turbo3");

            // Open the WAV file from disk for transcription
            var wavFile = new WaveFile(@"d:\discussion.wav");

            // Create the speech-to-text engine for streaming, multi-turn transcription + translation
            SpeechToText engine = new(model)
            {
                Mode = SpeechToText.SpeechToTextMode.Translation
            };

            // Print each segment of transcription as it's received (e.g., real-time display)
            engine.OnNewSegment += (sender, e) => Console.WriteLine(e.Segment);

            // Transcribe the entire WAV file; returns the full transcription information
            var transcription = engine.Transcribe(wavFile);

            // TODO: handle transcription results (e.g., save to file or process further)
        }
    }
}
using LMKit.Model;
using LMKit.Media.Audio;
using LMKit.Speech;

// Load a Whisper model by ID
var model = LM.LoadFromModelID("whisper-large-turbo3");

// Open audio file (any format, any sample rate)
var wavFile = new WaveFile(@"meeting-recording.wav");

// Create transcription engine with full configuration
var stt = new SpeechToText(model)
{
    EnableVoiceActivityDetection = true,
    SuppressHallucinations = true, // Multi-layer adaptive filtering
    VadSettings = new VadSettings
    {
        EnergyThreshold = 0.5f,
        MinSpeechDuration = 0.3f,
        MinSilenceDuration = 0.5f
    }
};

// Stream segments as they're transcribed
stt.OnNewSegment += (sender, e) =>
{
    Console.WriteLine($"[{e.Segment.Start:mm\\:ss} -> {e.Segment.End:mm\\:ss}]");
    Console.WriteLine($"  {e.Segment.Text}");
    Console.WriteLine($"  Language: {e.Segment.Language}, Confidence: {e.Segment.Confidence:P1}");
};

// Track progress
stt.OnProgress += (sender, e) => Console.WriteLine($"Progress: {e.ProgressPercentage:P0}");

// Transcribe the full audio file
var result = stt.Transcribe(wavFile);

Console.WriteLine($"\n=== Full Transcription ===\n{result.Text}");
Console.WriteLine($"\nSegments: {result.Segments.Count}");
A complete integration demonstration of LM-Kit.NET speech-to-text technology. Built with .NET MAUI for Windows and macOS. Your audio stays on your device.
LynxTranscribe
Drag-and-drop audio files, record from microphone, export to multiple formats including SRT/VTT subtitles.
Leverage LM-Kit.NET's text generation to transform transcriptions into actionable outputs.
Summarize
Generate concise summaries from lengthy transcriptions. Extract key points, decisions, and highlights automatically.
Speakers
Identify and label different speakers in conversations. Attribute segments to individual participants.
Actions
Pull out tasks, to-dos, and commitments from meeting transcripts. Generate structured task lists.
Notes
Structure transcriptions with headers, sections, and formatting. Create professional meeting documentation.
Grammar
Fix transcription errors and improve readability. Clean up spoken language into polished written text.
Translate
Convert transcribed text to any target language. Multilingual content distribution from a single source.
From meeting transcription to accessibility features to content indexing.
Meetings
Convert meeting recordings into searchable transcripts. Timestamped segments sync perfectly with video for easy navigation and review.
Healthcare
Capture voice notes and clinical consultations with dictation formatting. Process sensitive medical data entirely on-device for HIPAA compliance.
Subtitles
Generate SRT and VTT subtitle files automatically. Perfect timing with AudioSegment timestamps for video accessibility.
Education
Transcribe multilingual lectures and courses. Make educational content accessible and searchable across languages.
Service
Transcribe support calls for quality analysis and training. Pair with LM-Kit's sentiment analysis for comprehensive customer insights.
Legal
Transcribe depositions, hearings, and consultations with accuracy. On-device processing ensures confidential legal content never leaves your infrastructure.
Voice
Build voice-controlled interfaces with real-time transcription. Low latency on-device processing enables responsive voice interactions.
Index
Make audio and video content searchable. Extract text from podcasts, interviews, and media libraries for full-text search capabilities.
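For the subtitle use case above, segment timestamps map directly onto the SRT cue format. A self-contained sketch, using plain (Start, End, Text) tuples in place of AudioSegment instances (in a real pipeline you would iterate the segments of a TranscriptionResult):

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Build SRT subtitle text from timestamped segments.
static class Srt
{
    // SRT timestamps use the form HH:MM:SS,mmm with a comma before milliseconds.
    public static string ToTimestamp(TimeSpan t) =>
        $"{(int)t.TotalHours:00}:{t.Minutes:00}:{t.Seconds:00},{t.Milliseconds:000}";

    public static string Build(IEnumerable<(TimeSpan Start, TimeSpan End, string Text)> segments)
    {
        var sb = new StringBuilder();
        int index = 1;
        foreach (var s in segments)
        {
            sb.AppendLine((index++).ToString());                                // cue number
            sb.AppendLine($"{ToTimestamp(s.Start)} --> {ToTimestamp(s.End)}");  // cue timing
            sb.AppendLine(s.Text);                                              // cue text
            sb.AppendLine();                                                    // blank separator
        }
        return sb.ToString();
    }
}
```

VTT output differs only in the header line and in using a period instead of a comma before milliseconds.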
Core components for building speech recognition pipelines.
SpeechToText
Main transcription engine. Provides Transcribe, TranscribeAsync, and DetectLanguage methods. Configure VAD, mode (transcription/translation), hallucination suppression, and streaming callbacks.
AudioSegment
Represents a transcribed speech segment with text, start/end timestamps, confidence score, and detected language.
VadSettings
Configuration for Voice Activity Detection. Control energy threshold, speech/silence durations, and speech padding for optimal detection.
TranscriptionResult
Contains the full transcription text and a collection of AudioSegments. Access combined text or iterate segments for timestamps and metadata.
WaveFile
Audio file handler supporting multiple formats. Automatic sample rate and channel detection. Use IsValid to verify audio integrity.
SpeechToTextMode
Enum for transcription modes: Transcription (original language) or Translation (any language to English).
Completely integrated with .NET. Leverage speech recognition alongside thousands of other AI capabilities.
01
No FFmpeg, no native interop complexity. Pure .NET solution.
02
Familiar C# patterns with TranscribeAsync for non-blocking operations.
03
Windows, macOS, Linux. Desktop, mobile, and server deployments.
04
Combine with text generation, embeddings, and other AI capabilities.
Raw STT output is one long unpunctuated string. Real dictation
applications need the user to say "comma", "new line", "open
bracket", "question mark" and have those words become actual
punctuation, line breaks, brackets, and question marks in the
final text. LMKit.Speech.Dictation ships exactly
that: a multilingual command-aware formatter built on top of
the transcription engine.
Class
Formatter
Transforms a raw transcript into formatted text by interpreting spoken formatting commands. Case-insensitive regex matching with Unicode support; the engine handles punctuation, line breaks, brackets, quotes, currency symbols, and arbitrary custom replacements.
Class
FormatterOptions
Holds the command catalog the Formatter applies; override or extend its Commands collection to customize behavior.
Class
Command
Each command maps one or more spoken-form regex patterns (across languages) to a text replacement. Ship the built-in catalog or define your own for domain-specific dictation (medical, legal, code).
Languages
English, French, German, Spanish, Italian, and Portuguese ship out of the box. Each command carries multiple spoken-form patterns, so the same code handles all six languages without language detection. Add more languages by extending the command set.
using LMKit.Speech;
using LMKit.Speech.Dictation;

// 1. Transcribe the audio (same SpeechToText engine as before).
var stt = new SpeechToText(model);
TranscriptionResult raw = await stt.TranscribeAsync(audioStream);

// 2. Run the dictation formatter to interpret spoken commands.
//    "Hello comma how are you question mark new line I am fine period"
//    becomes: "Hello, how are you?\nI am fine."
string formatted = Formatter.Format(raw.Text);

// 3. (Optional) override or extend the command catalog.
var options = new FormatterOptions();
options.Commands.Add(new Command(
    spokenForms: new[] { @"snippet\s*break", @"saut\s*de\s*code" },
    replacement: "\n\n```\n"));
string custom = Formatter.Format(raw.Text, options);
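To make the Command mechanism concrete, here is a conceptual miniature of spoken-form regex replacement (hypothetical patterns, not the built-in catalog; the real Formatter ships a far larger multilingual set):

```csharp
using System.Text.RegularExpressions;

// Illustrative only: each spoken form is a case-insensitive regex (English and
// French patterns shown) mapped to a literal replacement, applied in order.
static class MiniDictation
{
    static readonly (string Pattern, string Replacement)[] Commands =
    {
        (@"\s*\b(comma|virgule)\b\s*", ", "),
        // Longer forms run before shorter ones so "point d'interrogation"
        // is not consumed by the bare "point" (period) pattern below.
        (@"\s*\b(question\s*mark|point\s+d'interrogation)\b\s*", "? "),
        (@"\s*\b(period|full\s*stop|point)\b\s*", ". "),
        (@"\s*\b(new\s*line|à\s+la\s+ligne)\b\s*", "\n"),
    };

    public static string Format(string raw)
    {
        string text = raw;
        foreach (var (pattern, replacement) in Commands)
            text = Regex.Replace(text, pattern, replacement, RegexOptions.IgnoreCase);
        return text.Trim();
    }
}
```

Applying it to the example above, MiniDictation.Format("Hello comma how are you question mark new line I am fine period") yields "Hello, how are you?\nI am fine.", matching the documented Formatter behavior for those commands.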
The dictation layer composes with everything above on this page: VAD-driven segmentation feeds the transcript, hallucination suppression cleans it, the formatter turns it into production-grade text. Suitable for medical and legal dictation, code-by-voice IDEs, voice-driven email composers, and accessibility tools.
Three dedicated capabilities sit alongside the core transcription engine: live streaming output for responsive UIs, configurable voice activity detection for clean audio preprocessing, and audio language detection for multilingual routing.
Live
Stream tokens as they're produced using OnNewSegment. Build live captions, dictation UIs, and meeting transcripts that update segment-by-segment with sub-second latency.
VAD
Isolate speech from silence and background noise before STT runs. Tune energy thresholds, durations, and padding via VadSettings. Cuts compute on silent audio, sharpens transcripts on noisy audio.
Detect
Identify the spoken language of any audio file before transcription. Auto-route to the right model, decide when to translate, tag content for archives, all from a single fast detection call on the same SpeechToText instance.
Two adjacent capabilities live under the Text pillars but compose naturally with audio output: identify the spoken language, or translate transcripts into a target language.
Detect
Identify the language of any text snippet, including a transcript, before routing to translation, classification, or summarisation. Lives under Text Analysis.
Open language detection
Translate
Translate transcripts across many language pairs for a microphone-to-target-language pipeline. Lives under Text Generation.
Open text translation
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.
Console demo: streaming STT with VAD and hallucination suppression.
Open on GitHub →
App
Cross-platform desktop transcription app built on LM-Kit.NET and .NET MAUI.
Open →
How-to guide
Load a model, pick a backend, stream results in real time.
Read the guide →
API reference
Namespace landing for the full Speech & Audio stack.
Open the reference →
The seven pillars of LM-Kit.NET, plus the local runtime they share. The highlighted card is where you are now.
01 · AI Agents
ReAct planning, supervisors, parallel and pipeline orchestrators, persistent memory, MCP clients, custom tools.
AI Agents
02 · Document Intelligence
PDF text and table extraction, on-device OCR reaching SOTA benchmark scores, structured field extraction with grammar-constrained generation.
Document Intelligence
03 · Vision & Multimodal
Image understanding, classification, labeling, multimodal chat, image embeddings, VLM-OCR, background removal. Same conversation surface as LLMs.
Vision & Multimodal
04 · RAG & Knowledge
Built-in vector store, Qdrant connector, embeddings, hybrid retrieval, document chunking, source citations.
RAG & Knowledge
05 · Text Analysis
Built-in classifiers and an extractor that emits typed C# objects via grammar-constrained sampling. Sentiment, keywords, language detection.
Text Analysis
06 · Speech & Audio
A growing local speech-to-text stack: hallucination suppression, Voice Activity Detection, real-time translation, streaming output, 100+ languages.
You are here
07 · Text Generation
Single-turn, multi-turn, and stateless conversation primitives. Translate, correct, rewrite, summarise. Prompt templates, streaming, grammar-constrained outputs.
Text Generation
The foundation
Every capability above runs on this runtime.
Foundation
The runtime all seven pillars sit on. The LM-Kit.NET NuGet ships the complete inference system: open-weight LLMs, vision-language models, embeddings, on-device speech-to-text, OCR and classifiers, accelerated on CPU, AVX2, CUDA 12/13, Vulkan or Metal. One package, zero cloud calls, predictable latency, full data and technology sovereignty.
The fastest and most accurate .NET speech-to-text. On-device transcription with 100+ languages, VAD, hallucination suppression, and real-time translation. Zero cloud dependency.