Solutions · Speech & Audio

On-device audio transcription & translation.

The fastest and most accurate speech-to-text implementation for .NET. Transform speech into structured, searchable text entirely on-device with advanced hallucination suppression, Voice Activity Detection, intelligent dictation formatting, real-time translation to English, and 100+ language support. A growing local STT stack with zero cloud dependency, continuously improved for better accuracy and performance.

100+ languages · Voice Activity Detection · Hallucination suppression
Capability

Real-time translation

Any language to English in one step.

Capability

Streaming output

OnNewSegment event for live UIs.

Capability

Timestamped segments

Perfect for SRT/VTT subtitles.

Capability

Universal audio

WAV, MP3, FLAC, OGG, M4A, WMA.

The fastest .NET speech recognition.

SpeechToText is LM-Kit.NET's high-performance engine for converting audio content into structured, searchable text. This is the most accurate and fastest speech-to-text implementation available for .NET, with complete native integration and zero external dependencies.

LM-Kit delivers production-ready speech recognition with Voice Activity Detection, advanced hallucination suppression, dictation formatting, real-time translation, and support for 100+ languages. Accuracy and performance improve with each release, and the same technology powers thousands of other AI capabilities in your .NET applications.

See it in action: LynxTranscribe is a full-featured, open-source transcription application built with LM-Kit.NET and .NET MAUI. Drag-and-drop audio files, record from microphone, export to multiple formats. A complete integration demonstration of LM-Kit speech-to-text technology running 100% locally.

Available models

Local STT models for every use case.

Multiple model sizes today, from ultra-fast edge deployment to maximum accuracy. The catalog grows with every release; swap models with a single configuration change.

Model | Model ID | Size | Speed | Best for
Whisper Tiny | whisper-tiny | ~40 MB | Fastest | Edge
Whisper Base | whisper-base | ~70 MB | Fast | Lightweight
Whisper Small | whisper-small | ~240 MB | Fast | Balanced
Whisper Medium | whisper-medium | ~760 MB | Balanced | Accurate
Whisper Large Turbo V3 | whisper-large-turbo3 | ~810 MB | Fast | Accurate
Whisper Large V2 | whisper-large2 | ~1.54 GB | High accuracy | Translation
Whisper Large V3 | whisper-large3 | ~1.54 GB | Most accurate | Translation

Full Model Catalog

Browse every speech-to-text model shipped with LM-Kit.NET

View catalog

Hallucination suppression

Multi-layer hallucination suppression.

Adaptive, multi-strategy filtering that reduces hallucinations and false positives.

Eliminate phantom text

Speech-to-text models can occasionally produce hallucinated outputs, especially during silent or low-energy audio segments. Common hallucinations include phrases like "Thank you for watching", "Subscribe", "Hello", or other phantom text that does not correspond to actual speech in the audio.

LM-Kit's SuppressHallucinations feature applies advanced adaptive filtering that combines multiple validation strategies, including entropy-based adaptive mathematical analysis and innovative signal processing techniques. This technology is continuously improved by our R&D team, delivering better accuracy with each release. Additional proprietary approaches further enhance detection reliability.

  • Audio energy analysis: Computes RMS energy for each segment and compares against adaptive thresholds derived from previously transcribed segments
  • Statistical adaptation: Filtering threshold adjusts dynamically based on median RMS, variance, and stability of prior segments
  • No-speech probability: Segments with high no-speech probability scores from the model are filtered out
  • Token confidence: Segments with very high average token probability bypass additional filtering
  • Speaking rate validation: Validates word count relative to segment duration falls within realistic human speaking rates
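Two of the layers above, RMS energy and speaking-rate validation, can be sketched in plain C#. This is an illustrative approximation, not LM-Kit's internal implementation; the threshold and the words-per-second bounds are assumptions chosen for demonstration.

```csharp
using System;
using System.Linq;

static class SuppressionSketch
{
    // RMS energy of a PCM segment (samples normalized to [-1, 1]).
    public static double Rms(float[] samples) =>
        Math.Sqrt(samples.Average(s => (double)s * s));

    // Speaking-rate check: human speech falls roughly between
    // 0.5 and 6 words per second (illustrative bounds).
    public static bool PlausibleRate(int wordCount, double durationSeconds) =>
        durationSeconds > 0 &&
        wordCount / durationSeconds >= 0.5 &&
        wordCount / durationSeconds <= 6.0;

    // Keep a segment only if it carries energy above the (adaptive)
    // threshold AND its word count fits a human speaking rate.
    public static bool KeepSegment(float[] samples, int wordCount,
                                   double durationSeconds, double energyThreshold) =>
        Rms(samples) >= energyThreshold &&
        PlausibleRate(wordCount, durationSeconds);
}
```

In the real engine the threshold is not fixed: it adapts from the median RMS and variance of previously accepted segments, which is what the statistical-adaptation layer contributes.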

Layer 01

RMS energy analysis

Compare segment energy against adaptive thresholds.

Layer 02

Statistical adaptation

Dynamic threshold based on segment history.

Layer 03

No-speech probability

Model confidence in speech presence.

Layer 04

Token confidence

High confidence bypasses additional checks.

Layer 05

Speaking rate validation

Words per second within human range.

Toolkit

Complete speech processing toolkit.

Everything you need to build production-grade audio transcription pipelines, fully integrated with .NET.

Detection

100+ language detection

Automatic language detection across 100+ languages. No manual configuration required. DetectLanguage returns ISO language codes with confidence scores.

VAD

Voice Activity Detection

Configurable VAD isolates speech from silence and background noise. Energy thresholds, speech/silence durations, and padding are all adjustable via VadSettings.

Translation

Real-time translation

Transcribe any language and translate to English simultaneously using SpeechToTextMode.Translation. One-step multilingual content processing.

Format

Dictation formatting

Intelligent punctuation and capitalization for dictation workflows. Transform raw speech into properly formatted text automatically.

Timing

Timestamped segments

Every AudioSegment includes start/end timestamps, text, confidence score, and detected language. Perfect for subtitles, video sync, and searchable archives.

Audio

Universal audio support

WAV, MP3, FLAC, OGG, M4A, WMA. Any sample rate, mono or stereo. The WaveFile class handles format detection and conversion automatically.

Suppress

Hallucination suppression

Multi-layer adaptive filtering eliminates false positives and phantom text through RMS analysis, statistical adaptation, and speaking rate validation.

Stream

Streaming output

OnNewSegment event delivers transcription results in real-time as audio is processed. Build responsive UIs with immediate feedback.

.NET

Native .NET integration

Completely integrated with the .NET ecosystem. No external dependencies, no interop complexity. Use familiar C# patterns and async/await throughout.
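The timestamped segments described above map directly onto SRT subtitle cues. A self-contained sketch, using a hypothetical Segment record that mirrors the AudioSegment fields (not the LM-Kit type itself):

```csharp
using System;
using System.Text;

// Hypothetical segment shape mirroring the timestamped fields described above.
record Segment(TimeSpan Start, TimeSpan End, string Text);

static class SrtSketch
{
    // SRT timestamps use the form HH:MM:SS,mmm.
    static string Stamp(TimeSpan t) =>
        $"{(int)t.TotalHours:00}:{t.Minutes:00}:{t.Seconds:00},{t.Milliseconds:000}";

    public static string ToSrt(Segment[] segments)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < segments.Length; i++)
        {
            sb.AppendLine((i + 1).ToString());                                   // cue index
            sb.AppendLine($"{Stamp(segments[i].Start)} --> {Stamp(segments[i].End)}");
            sb.AppendLine(segments[i].Text);
            sb.AppendLine();                                                     // blank line between cues
        }
        return sb.ToString();
    }
}
```

Swap the arrow separator and comma for a dot in the milliseconds field and the same loop emits WebVTT.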

VAD

Intelligent speech isolation.

Configurable VAD parameters for optimal transcription accuracy in any environment.

Isolate what matters

Voice Activity Detection distinguishes speech from background noise, silence, and non-speech audio. By processing only meaningful speech segments, VAD dramatically improves transcription accuracy while reducing processing time and resource usage.

LM-Kit's VadSettings class provides fine-grained control over detection parameters, letting you tune sensitivity for different audio environments, from quiet meeting rooms to noisy field recordings.

  • Reduce transcription errors from background noise
  • Process only meaningful audio segments
  • Configurable for any audio environment
  • Faster processing with intelligent filtering
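A minimal frame-based state machine shows how these parameters interact: a segment opens after enough consecutive speech frames and closes after enough consecutive silence frames. This is an illustration of the mechanism only; in LM-Kit you configure the detector through VadSettings rather than implement it by hand.

```csharp
using System;
using System.Collections.Generic;

static class VadSketch
{
    // Returns (start, end) frame indices of detected speech segments.
    public static List<(int Start, int End)> DetectSpeech(
        double[] frameEnergies,
        double energyThreshold,   // minimum energy to count a frame as speech
        int minSpeechFrames,      // consecutive speech frames to open a segment
        int minSilenceFrames)     // consecutive silence frames to close it
    {
        var segments = new List<(int, int)>();
        int speechRun = 0, silenceRun = 0, segmentStart = -1;

        for (int i = 0; i < frameEnergies.Length; i++)
        {
            bool isSpeech = frameEnergies[i] >= energyThreshold;
            speechRun = isSpeech ? speechRun + 1 : 0;
            silenceRun = isSpeech ? 0 : silenceRun + 1;

            if (segmentStart < 0 && speechRun >= minSpeechFrames)
                segmentStart = i - speechRun + 1;             // open a segment
            else if (segmentStart >= 0 && silenceRun >= minSilenceFrames)
            {
                segments.Add((segmentStart, i - silenceRun)); // close it
                segmentStart = -1;
            }
        }
        if (segmentStart >= 0)                                // audio ended mid-speech
            segments.Add((segmentStart, frameEnergies.Length - 1));
        return segments;
    }
}
```

The speech-padding setting corresponds to widening each returned (start, end) pair by a few frames so segment boundaries do not clip word onsets.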

Setting

Energy threshold

Minimum energy level to detect speech vs silence.

Setting

Speech duration

Minimum continuous speech to trigger detection.

Setting

Silence duration

Gap length that ends a speech segment.

Setting

Speech padding

Extra audio context around detected speech.

Code samples

Transcribe audio in minutes.

Complete examples showing audio transcription and translation with streaming output.

Transcription.cs
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;

namespace YourNamespace
{
    class Program
    {
        static void Main(string[] args)
        {
            // Instantiate the Whisper model by ID.
            // See the full model catalog at:
            // https://docs.lm-kit.com/lm-kit-net/guides/getting-started/model-catalog.html
            var model = LM.LoadFromModelID("whisper-large-turbo3");

            // Open the WAV file from disk for transcription
            var wavFile = new WaveFile(@"d:\discussion.wav");

            // Create the speech-to-text engine for streaming, multi-turn transcription
            var engine = new SpeechToText(model);

            // Print each segment of transcription as it's received (e.g., real-time display)
            engine.OnNewSegment += (sender, e) =>
                Console.WriteLine(e.Segment);

            // Transcribe the entire WAV file; returns the full transcription information
            var transcription = engine.Transcribe(wavFile);

            // TODO: handle transcription results (e.g., save to file or process further)
        }
    }
}
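For the translation path, a hedged sketch. SpeechToTextMode.Translation, TranscribeAsync, and OnNewSegment are all named elsewhere on this page; exactly how the mode is attached to the engine (shown here as a property) is an assumption, so verify it against the SpeechToText API reference before use.

```csharp
using System;
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;

// The large Whisper models are recommended above for translation.
var model = LM.LoadFromModelID("whisper-large3");

var engine = new SpeechToText(model);
engine.Mode = SpeechToTextMode.Translation; // member name assumed — see API docs

// Stream translated segments as they arrive.
engine.OnNewSegment += (sender, e) => Console.WriteLine(e.Segment);

var wavFile = new WaveFile(@"d:\discussion.wav");

// Non-blocking, one-step transcription + translation to English.
TranscriptionResult result = await engine.TranscribeAsync(wavFile);
Console.WriteLine(result.Text);
```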
Reference app

Open-source .NET MAUI transcription app.

A complete integration demonstration of LM-Kit.NET speech-to-text technology. Built with .NET MAUI for Windows and macOS. Your audio stays on your device.

LynxTranscribe

Drag-and-drop audio files, record from microphone, export to multiple formats including SRT/VTT subtitles.

View on GitHub

Post-processing

Post-processing capabilities.

Leverage LM-Kit.NET's text generation to transform transcriptions into actionable outputs.

Summarize

Summarize

Generate concise summaries from lengthy transcriptions. Extract key points, decisions, and highlights automatically.

Speakers

Speaker detection

Identify and label different speakers in conversations. Attribute segments to individual participants.

Actions

Extract action items

Pull out tasks, to-dos, and commitments from meeting transcripts. Generate structured task lists.

Notes

Format as meeting notes

Structure transcriptions with headers, sections, and formatting. Create professional meeting documentation.

Grammar

Correct grammar

Fix transcription errors and improve readability. Clean up spoken language into polished written text.

Translate

Translate text

Convert transcribed text to any target language. Multilingual content distribution from a single source.

Applications

Built for real applications.

From meeting transcription to accessibility features to content indexing.

Meetings

Meeting transcription

Convert meeting recordings into searchable transcripts. Timestamped segments sync perfectly with video for easy navigation and review.

Healthcare

Healthcare documentation

Capture voice notes and clinical consultations with dictation formatting. Process sensitive medical data entirely on-device for HIPAA compliance.

Subtitles

Subtitles & captions

Generate SRT and VTT subtitle files automatically. Perfect timing with AudioSegment timestamps for video accessibility.

Education

Education & e-learning

Transcribe multilingual lectures and courses. Make educational content accessible and searchable across languages.

Service

Customer service analysis

Transcribe support calls for quality analysis and training. Pair with LM-Kit's sentiment analysis for comprehensive customer insights.

Legal

Legal & compliance

Transcribe depositions, hearings, and consultations with high accuracy. On-device processing ensures confidential legal content never leaves your infrastructure.

Voice

Voice assistants

Build voice-controlled interfaces with real-time transcription. Low latency on-device processing enables responsive voice interactions.

Index

Content indexing

Make audio and video content searchable. Extract text from podcasts, interviews, and media libraries for full-text search capabilities.

Developer Resources

Key classes & methods.

Core components for building speech recognition pipelines.

SpeechToText

Main transcription engine. Provides Transcribe, TranscribeAsync, DetectLanguage methods. Configure VAD, mode (transcription/translation), hallucination suppression, and streaming callbacks.

View documentation

AudioSegment

Represents a transcribed speech segment with text, start/end timestamps, confidence score, and detected language.

View documentation

VadSettings

Configuration for Voice Activity Detection. Control energy threshold, speech/silence durations, and speech padding for optimal detection.

View documentation

TranscriptionResult

Contains full transcription text and collection of AudioSegments. Access combined text or iterate segments for timestamps and metadata.

View documentation

WaveFile

Audio file handler supporting multiple formats. Automatic sample rate and channel detection. Use IsValid to verify audio integrity.

View documentation

SpeechToTextMode

Enum for transcription modes: Transcription (original language) or Translation (any language to English).

View documentation

.NET

Native .NET integration.

Completely integrated with .NET. Leverage speech recognition alongside thousands of other AI capabilities.

01

Zero dependencies

No FFmpeg, no native interop complexity. Pure .NET solution.

02

Async/await

Familiar C# patterns with TranscribeAsync for non-blocking operations.

03

Cross-platform

Windows, macOS, Linux. Desktop, mobile, and server deployments.

04

LM-Kit ecosystem

Combine with text generation, embeddings, and other AI capabilities.

Dictation formatter

From transcript to formatted text.

Raw STT output is one long unpunctuated string. Real dictation applications need the user to say "comma", "new line", "open bracket", "question mark" and have those words become actual punctuation, line breaks, brackets, and question marks in the final text. LMKit.Speech.Dictation ships exactly that: a multilingual command-aware formatter built on top of the transcription engine.

Class

Formatter

Transforms a raw transcript into formatted text by interpreting spoken formatting commands. Case-insensitive regex matching with Unicode support; the engine handles punctuation, line breaks, brackets, quotes, currency symbols, and arbitrary custom replacements.

  • "new line" / "next line" → \n
  • "comma" / "period" / "question mark" → punctuation
  • "open quote" / "close quote" → matched quotes
  • "open paren" / "close paren" → brackets
  • Configurable via FormatterOptions
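The command-to-replacement mechanism can be sketched with plain regexes. This illustrates the idea only, not the library's Formatter; the pattern list and whitespace handling below are simplifications.

```csharp
using System;
using System.Text.RegularExpressions;

static class DictationSketch
{
    // Each spoken form is a case-insensitive regex mapped to replacement text.
    static readonly (string Pattern, string Replacement)[] Commands =
    {
        (@"new line|next line", "\n"),
        (@"question mark",      "? "),
        (@"comma",              ", "),
        (@"period",             ". "),
    };

    public static string Format(string transcript)
    {
        string text = transcript;
        foreach (var (pattern, replacement) in Commands)
            // [ \t]* (not \s*) so already-inserted line breaks survive.
            text = Regex.Replace(text, @"[ \t]*\b(" + pattern + @")\b[ \t]*",
                                 replacement, RegexOptions.IgnoreCase);
        // Drop spaces left hanging before line breaks and at the end.
        return Regex.Replace(text, @"[ \t]+\n", "\n").TrimEnd();
    }
}
```

Multilingual support falls out of the same structure: a Command simply carries additional spoken-form patterns ("virgule", "Komma", ...) that map to the same replacement.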

Class

Command

Each command maps one or more spoken-form regex patterns (across languages) to a text replacement. Ship the built-in catalog or define your own for domain-specific dictation (medical, legal, code).

  • One command, many spoken forms
  • Patterns are regex (Unicode + case-insensitive)
  • Same command works across English, French, German, Spanish, Italian, Portuguese
  • Add custom commands at runtime

Languages

6 built-in, more on demand

English, French, German, Spanish, Italian, Portuguese ship out of the box. Each command carries multiple spoken-form patterns so the same code handles all 6 without language detection. Add additional language patterns by extending the command set.

DictationPipeline.cs
using LMKit.Speech;
using LMKit.Speech.Dictation;

// 1. Transcribe the audio (same SpeechToText engine as before).
var stt = new SpeechToText(model);
TranscriptionResult raw = await stt.TranscribeAsync(audioStream);

// 2. Run the dictation formatter to interpret spoken commands.
// "Hello comma how are you question mark new line I am fine period"
//   becomes: "Hello, how are you?\nI am fine."
string formatted = Formatter.Format(raw.Text);

// 3. (Optional) override or extend the command catalog.
var options = new FormatterOptions();
options.Commands.Add(new Command(
    spokenForms: new[] { @"snippet\s*break", @"saut\s*de\s*code" },
    replacement: "\n\n```\n"));

string custom = Formatter.Format(raw.Text, options);

The dictation layer composes with everything above on this page: VAD-driven segmentation feeds the transcript, hallucination suppression cleans it, the formatter turns it into production-grade text. Suitable for medical and legal dictation, code-by-voice IDEs, voice-driven email composers, and accessibility tools.

LM-Kit.NET pillars

Seven pillars, one foundation.

The seven pillars of LM-Kit.NET, plus the local runtime they share. The highlighted card shows where you are now.

The foundation

Every capability above runs on this runtime.

Foundation

Local Inference

The runtime all seven pillars sit on. The LM-Kit.NET NuGet ships the complete inference system: open-weight LLMs, vision-language models, embeddings, on-device speech-to-text, OCR, and classifiers, accelerated on CPU (AVX2), CUDA 12/13, Vulkan, or Metal. One package, zero cloud calls, predictable latency, full data and technology sovereignty.

Explore the foundation

Ready to add speech recognition?

The fastest and most accurate .NET speech-to-text. On-device transcription with 100+ languages, VAD, hallucination suppression, and real-time translation. Zero cloud dependency.

Download free · API documentation