Real-time translation
Any language to English in one step.
The fastest and most accurate speech-to-text implementation for .NET. Transform speech into structured, searchable text entirely on-device with advanced hallucination suppression, Voice Activity Detection, intelligent dictation formatting, real-time translation to English, and 100+ language support. A growing local STT stack with zero cloud dependency, continuously improved for better accuracy and performance.
Any language to English in one step.
OnNewSegment event for live UIs.
Perfect for SRT/VTT subtitles.
WAV, MP3, FLAC, OGG, M4A, WMA.
SpeechToText is LM-Kit.NET's high-performance engine for converting
audio content into structured, searchable text. This is the most accurate and
fastest speech-to-text implementation available for .NET, with complete native
integration and zero external dependencies.
Built under continuous innovation, LM-Kit delivers production-ready speech recognition with Voice Activity Detection, advanced hallucination suppression, dictation formatting, real-time translation, and support for 100+ languages. Accuracy and performance improve with each release. The same technology can be leveraged to build thousands of other AI capabilities in your .NET applications.
See it in action: LynxTranscribe is a full-featured, open-source transcription application built with LM-Kit.NET and .NET MAUI. Drag-and-drop audio files, record from microphone, export to multiple formats. A complete integration demonstration of LM-Kit speech-to-text technology running 100% locally.
App
Cross-platform desktop transcription app for Windows and macOS. Full integration demo of LM-Kit.NET speech-to-text capabilities: drag-and-drop audio files, live microphone recording, multiple export formats including SRT/VTT subtitles, all processed entirely on-device.
View LynxTranscribe on GitHub
Minimal console application demonstrating audio transcription with streaming output, model selection, and confidence scoring.
View audio_transcription on GitHub
Multiple model sizes are available today, from ultra-fast edge deployment to maximum accuracy. The catalog grows with every release; swap models through configuration only.
| Model | Model ID | Size | Speed | Best for |
|---|---|---|---|---|
| Whisper Tiny | whisper-tiny | ~40 MB | Fastest | Edge |
| Whisper Base | whisper-base | ~70 MB | Fast | Lightweight |
| Whisper Small | whisper-small | ~240 MB | Fast | Balanced |
| Whisper Medium | whisper-medium | ~760 MB | Balanced | Accurate |
| Whisper Large Turbo V3 | whisper-large-turbo3 | ~810 MB | Fast | Accurate |
| Whisper Large V2 | whisper-large2 | ~1.54 GB | High accuracy | Translation |
| Whisper Large V3 | whisper-large3 | ~1.54 GB | Most accurate | Translation |
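Because models are addressed by ID, a deployment can pick its model from configuration alone. A minimal sketch, assuming hypothetical profile names (the IDs come from the table above, and the resolved ID would then be passed to LM.LoadFromModelID):

```csharp
using System.Collections.Generic;

// Illustrative only: map a deployment profile from app configuration to a
// Whisper model ID from the catalog, so models swap without code changes.
static class ModelSelector
{
    static readonly Dictionary<string, string> Profiles = new()
    {
        ["edge"] = "whisper-tiny",              // ~40 MB, fastest
        ["balanced"] = "whisper-small",         // ~240 MB
        ["accurate"] = "whisper-large-turbo3",  // ~810 MB, fast and accurate
        ["translation"] = "whisper-large3",     // ~1.54 GB, most accurate
    };

    // Unknown profiles fall back to a lightweight default.
    public static string Resolve(string profile) =>
        Profiles.TryGetValue(profile, out var id) ? id : "whisper-base";
}
```

The profile names ("edge", "balanced", and so on) are hypothetical; only the model IDs are from the catalog.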
Full Model Catalog
Browse every speech-to-text model shipped with LM-Kit.NET
Robust adaptive filtering that reduces hallucinations and false positives.
Eliminate phantom text
Speech-to-text models can occasionally produce hallucinated outputs, especially during silent or low-energy audio segments. Common hallucinations include phrases like "Thank you for watching", "Subscribe", "Hello", or other phantom text that does not correspond to actual speech in the audio.
LM-Kit's SuppressHallucinations feature applies advanced adaptive filtering that combines multiple validation strategies, including entropy-based adaptive mathematical analysis and innovative signal processing techniques. This technology is continuously improved by our R&D team, delivering better accuracy with each release. Additional proprietary approaches further enhance detection reliability.
Layer 01
Compare segment energy against adaptive thresholds.
Layer 02
Dynamic threshold based on segment history.
Layer 03
Model confidence in speech presence.
Layer 04
High confidence bypasses additional checks.
Layer 05
Words per second within human range.
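As a rough illustration of what such layered checks can look like (a conceptual sketch under simplified assumptions, not LM-Kit's actual filtering), an energy gate with an adaptive threshold and a speaking-rate plausibility test might be written as:

```csharp
using System;
using System.Linq;

// Conceptual sketch only, not LM-Kit internals: two of the layers above.
static class SegmentChecks
{
    // Layers 01/02: RMS energy of the segment compared against a threshold
    // adapted from the running history of previous segment energies.
    public static bool HasEnoughEnergy(float[] samples, float[] history, float floor = 0.01f)
    {
        float rms = (float)Math.Sqrt(samples.Select(s => s * s).Average());
        float adaptive = history.Length > 0 ? history.Average() * 0.5f : floor;
        return rms >= Math.Max(floor, adaptive);
    }

    // Layer 05: human speech rarely exceeds roughly 4 words per second, so a
    // burst of text over a short, quiet segment is a hallucination signal.
    public static bool PlausibleRate(int wordCount, double durationSeconds, double maxWps = 4.0)
        => durationSeconds > 0 && wordCount / durationSeconds <= maxWps;
}
```

The thresholds (0.01 floor, 0.5 history weight, 4 words/second) are illustrative constants, not values used by the library.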
Everything you need to build production-grade audio transcription pipelines, fully integrated with .NET.
Detection
Automatic language detection across 100+ languages. No manual configuration required. DetectLanguage returns ISO language codes with confidence scores.
VAD
Configurable VAD isolates speech from silence and background noise. Energy thresholds, speech/silence durations, and padding are all adjustable via VadSettings.
Translation
Transcribe any language and translate to English simultaneously using SpeechToTextMode.Translation. One-step multilingual content processing.
Format
Intelligent punctuation and capitalization for dictation workflows. Transform raw speech into properly formatted text automatically.
Timing
Every AudioSegment includes start/end timestamps, text, confidence score, and detected language. Perfect for subtitles, video sync, and searchable archives.
Audio
WAV, MP3, FLAC, OGG, M4A, WMA. Any sample rate, mono or stereo. The WaveFile class handles format detection and conversion automatically.
Suppress
Multi-layer adaptive filtering eliminates false positives and phantom text through RMS analysis, statistical adaptation, and speaking rate validation.
Stream
OnNewSegment event delivers transcription results in real-time as audio is processed. Build responsive UIs with immediate feedback.
.NET
Completely integrated with .NET ecosystem. No external dependencies, no interop complexity. Use familiar C# patterns and async/await throughout.
Configurable VAD parameters for optimal transcription accuracy in any environment.
Isolate what matters
Voice Activity Detection distinguishes speech from background noise, silence, and non-speech audio. By processing only meaningful speech segments, VAD dramatically improves transcription accuracy while reducing processing time and resource usage.
LM-Kit's VadSettings class provides fine-grained control over detection parameters, letting you tune sensitivity for different audio environments, from quiet meeting rooms to noisy field recordings.
Setting
Minimum energy level to detect speech vs silence.
Setting
Minimum continuous speech to trigger detection.
Setting
Gap length that ends a speech segment.
Setting
Extra audio context around detected speech.
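To make these four knobs concrete, here is a conceptual energy-gate VAD sketch (not LM-Kit's implementation; VadSettings handles this for you) showing how the threshold, minimum durations, and padding interact:

```csharp
using System;
using System.Collections.Generic;

// A detected speech region in seconds.
record SpeechRegion(double Start, double End);

// Conceptual sketch: frames already classified as voiced (energy >= threshold)
// are grouped into regions; short blips are dropped, gaps shorter than the
// minimum silence do not split a region, and padding widens each region.
static class SimpleVad
{
    public static List<SpeechRegion> Detect(
        bool[] voicedFrames, double frameSec,
        double minSpeechSec, double minSilenceSec, double paddingSec)
    {
        var regions = new List<SpeechRegion>();
        int start = -1, lastVoiced = -1;

        for (int i = 0; i < voicedFrames.Length; i++)
        {
            if (voicedFrames[i])
            {
                if (start < 0) start = i;   // open a candidate region
                lastVoiced = i;
            }
            else if (start >= 0 && (i - lastVoiced) * frameSec >= minSilenceSec)
            {
                Close();                    // gap long enough to end the region
            }
        }
        Close();                            // flush a region still open at EOF
        return regions;

        void Close()
        {
            if (start < 0) return;
            double dur = (lastVoiced - start + 1) * frameSec;
            if (dur >= minSpeechSec)        // drop blips below MinSpeechDuration
                regions.Add(new SpeechRegion(
                    Math.Max(0, start * frameSec - paddingSec),      // pad start
                    (lastVoiced + 1) * frameSec + paddingSec));      // pad end
            start = -1;
        }
    }
}
```

In a real pipeline, VadSettings applies this kind of logic internally; the sketch only shows what each parameter controls.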
Complete examples showing audio transcription and translation with streaming output.
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;

namespace YourNamespace
{
    class Program
    {
        static void Main(string[] args)
        {
            // Instantiate the Whisper model by ID.
            // See the full model catalog at:
            // https://docs.lm-kit.com/lm-kit-net/guides/getting-started/model-catalog.html
            var model = LM.LoadFromModelID("whisper-large-turbo3");

            // Open the WAV file from disk for transcription
            var wavFile = new WaveFile(@"d:\discussion.wav");

            // Create the speech-to-text engine for streaming, multi-turn transcription
            var engine = new SpeechToText(model);

            // Print each segment of transcription as it's received (e.g., real-time display)
            engine.OnNewSegment += (sender, e) => Console.WriteLine(e.Segment);

            // Transcribe the entire WAV file; returns the full transcription information
            var transcription = engine.Transcribe(wavFile);

            // TODO: handle transcription results (e.g., save to file or process further)
        }
    }
}
using LMKit.Media.Audio;
using LMKit.Model;
using LMKit.Speech;

namespace YourNamespace
{
    class Program
    {
        static void Main(string[] args)
        {
            // Instantiate the Whisper model by ID.
            // See the full model catalog at:
            // https://docs.lm-kit.com/lm-kit-net/guides/getting-started/model-catalog.html
            var model = LM.LoadFromModelID("whisper-large-turbo3");

            // Open the WAV file from disk for transcription
            var wavFile = new WaveFile(@"d:\discussion.wav");

            // Create the speech-to-text engine for streaming, multi-turn transcription + translation
            SpeechToText engine = new(model)
            {
                Mode = SpeechToText.SpeechToTextMode.Translation
            };

            // Print each segment of transcription as it's received (e.g., real-time display)
            engine.OnNewSegment += (sender, e) => Console.WriteLine(e.Segment);

            // Transcribe the entire WAV file; returns the full transcription information
            var transcription = engine.Transcribe(wavFile);

            // TODO: handle transcription results (e.g., save to file or process further)
        }
    }
}
using LMKit.Model;
using LMKit.Media.Audio;
using LMKit.Speech;

// Load a Whisper model by ID
var model = LM.LoadFromModelID("whisper-large-turbo3");

// Open audio file (any format, any sample rate)
var wavFile = new WaveFile(@"meeting-recording.wav");

// Create transcription engine with full configuration
var stt = new SpeechToText(model)
{
    EnableVoiceActivityDetection = true,
    SuppressHallucinations = true, // Multi-layer adaptive filtering
    VadSettings = new VadSettings
    {
        EnergyThreshold = 0.5f,
        MinSpeechDuration = 0.3f,
        MinSilenceDuration = 0.5f
    }
};

// Stream segments as they're transcribed
stt.OnNewSegment += (sender, e) =>
{
    Console.WriteLine($"[{e.Segment.Start:mm\\:ss} -> {e.Segment.End:mm\\:ss}]");
    Console.WriteLine($"  {e.Segment.Text}");
    Console.WriteLine($"  Language: {e.Segment.Language}, Confidence: {e.Segment.Confidence:P1}");
};

// Track progress
stt.OnProgress += (sender, e) => Console.WriteLine($"Progress: {e.ProgressPercentage:P0}");

// Transcribe the full audio file
var result = stt.Transcribe(wavFile);

Console.WriteLine($"\n=== Full Transcription ===\n{result.Text}");
Console.WriteLine($"\nSegments: {result.Segments.Count}");
A complete integration demonstration of LM-Kit.NET speech-to-text technology. Built with .NET MAUI for Windows and macOS. Your audio stays on your device.
LynxTranscribe
Drag-and-drop audio files, record from microphone, export to multiple formats including SRT/VTT subtitles.
Leverage LM-Kit.NET's text generation to transform transcriptions into actionable outputs.
Summarize
Generate concise summaries from lengthy transcriptions. Extract key points, decisions, and highlights automatically.
Speakers
Identify and label different speakers in conversations. Attribute segments to individual participants.
Actions
Pull out tasks, to-dos, and commitments from meeting transcripts. Generate structured task lists.
Notes
Structure transcriptions with headers, sections, and formatting. Create professional meeting documentation.
Grammar
Fix transcription errors and improve readability. Clean up spoken language into polished written text.
Translate
Convert transcribed text to any target language. Multilingual content distribution from a single source.
From meeting transcription to accessibility features to content indexing.
Meetings
Convert meeting recordings into searchable transcripts. Timestamped segments sync perfectly with video for easy navigation and review.
Healthcare
Capture voice notes and clinical consultations with dictation formatting. Process sensitive medical data entirely on-device for HIPAA compliance.
Subtitles
Generate SRT and VTT subtitle files automatically. Perfect timing with AudioSegment timestamps for video accessibility.
Education
Transcribe multilingual lectures and courses. Make educational content accessible and searchable across languages.
Service
Transcribe support calls for quality analysis and training. Pair with LM-Kit's sentiment analysis for comprehensive customer insights.
Legal
Transcribe depositions, hearings, and consultations with accuracy. On-device processing ensures confidential legal content never leaves your infrastructure.
Voice
Build voice-controlled interfaces with real-time transcription. Low latency on-device processing enables responsive voice interactions.
Index
Make audio and video content searchable. Extract text from podcasts, interviews, and media libraries for full-text search capabilities.
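For the subtitle use case above, segment timestamps map directly onto the SRT cue format. A self-contained sketch, using plain (Start, End, Text) tuples in place of AudioSegment instances (in a real pipeline you would iterate the segments of a TranscriptionResult):

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Build SRT subtitle text from timestamped segments.
static class Srt
{
    // SRT timestamps use the form HH:MM:SS,mmm with a comma before milliseconds.
    public static string ToTimestamp(TimeSpan t) =>
        $"{(int)t.TotalHours:00}:{t.Minutes:00}:{t.Seconds:00},{t.Milliseconds:000}";

    public static string Build(IEnumerable<(TimeSpan Start, TimeSpan End, string Text)> segments)
    {
        var sb = new StringBuilder();
        int index = 1;
        foreach (var s in segments)
        {
            sb.AppendLine((index++).ToString());                                // cue number
            sb.AppendLine($"{ToTimestamp(s.Start)} --> {ToTimestamp(s.End)}");  // cue timing
            sb.AppendLine(s.Text);                                              // cue text
            sb.AppendLine();                                                    // blank separator
        }
        return sb.ToString();
    }
}
```

VTT output differs only in the header line and in using a period instead of a comma before milliseconds.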
Core components for building speech recognition pipelines.
SpeechToText
Main transcription engine. Provides Transcribe, TranscribeAsync, and DetectLanguage methods. Configure VAD, mode (transcription/translation), hallucination suppression, and streaming callbacks.
AudioSegment
Represents a transcribed speech segment with text, start/end timestamps, confidence score, and detected language.
VadSettings
Configuration for Voice Activity Detection. Control energy threshold, speech/silence durations, and speech padding for optimal detection.
TranscriptionResult
Contains the full transcription text and a collection of AudioSegments. Access combined text or iterate segments for timestamps and metadata.
WaveFile
Audio file handler supporting multiple formats. Automatic sample rate and channel detection. Use IsValid to verify audio integrity.
SpeechToTextMode
Enum for transcription modes: Transcription (original language) or Translation (any language to English).
Completely integrated with .NET. Leverage speech recognition alongside thousands of other AI capabilities.
01
No FFmpeg, no native interop complexity. Pure .NET solution.
02
Familiar C# patterns with TranscribeAsync for non-blocking operations.
03
Windows, macOS, Linux. Desktop, mobile, and server deployments.
04
Combine with text generation, embeddings, and other AI capabilities.
Raw STT output is one long unpunctuated string. Real dictation
applications need the user to say "comma", "new line", "open
bracket", "question mark" and have those words become actual
punctuation, line breaks, brackets, and question marks in the
final text. LMKit.Speech.Dictation ships exactly
that: a multilingual command-aware formatter built on top of
the transcription engine.
Class
Formatter
Transforms a raw transcript into formatted text by interpreting spoken formatting commands. Case-insensitive regex matching with Unicode support; the engine handles punctuation, line breaks, brackets, quotes, currency symbols, and arbitrary custom replacements.
Class
FormatterOptions
Holds the command catalog the Formatter applies; override or extend its Commands collection to customize behavior.
Class
Command
Each command maps one or more spoken-form regex patterns (across languages) to a text replacement. Ship the built-in catalog or define your own for domain-specific dictation (medical, legal, code).
Languages
English, French, German, Spanish, Italian, and Portuguese ship out of the box. Each command carries multiple spoken-form patterns, so the same code handles all six languages without language detection. Add more languages by extending the command set.
using LMKit.Speech;
using LMKit.Speech.Dictation;

// 1. Transcribe the audio (same SpeechToText engine as before).
var stt = new SpeechToText(model);
TranscriptionResult raw = await stt.TranscribeAsync(audioStream);

// 2. Run the dictation formatter to interpret spoken commands.
//    "Hello comma how are you question mark new line I am fine period"
//    becomes: "Hello, how are you?\nI am fine."
string formatted = Formatter.Format(raw.Text);

// 3. (Optional) override or extend the command catalog.
var options = new FormatterOptions();
options.Commands.Add(new Command(
    spokenForms: new[] { @"snippet\s*break", @"saut\s*de\s*code" },
    replacement: "\n\n```\n"));
string custom = Formatter.Format(raw.Text, options);
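To make the Command mechanism concrete, here is a conceptual miniature of spoken-form regex replacement (hypothetical patterns, not the built-in catalog; the real Formatter ships a far larger multilingual set):

```csharp
using System.Text.RegularExpressions;

// Illustrative only: each spoken form is a case-insensitive regex (English and
// French patterns shown) mapped to a literal replacement, applied in order.
static class MiniDictation
{
    static readonly (string Pattern, string Replacement)[] Commands =
    {
        (@"\s*\b(comma|virgule)\b\s*", ", "),
        // Longer forms run before shorter ones so "point d'interrogation"
        // is not consumed by the bare "point" (period) pattern below.
        (@"\s*\b(question\s*mark|point\s+d'interrogation)\b\s*", "? "),
        (@"\s*\b(period|full\s*stop|point)\b\s*", ". "),
        (@"\s*\b(new\s*line|à\s+la\s+ligne)\b\s*", "\n"),
    };

    public static string Format(string raw)
    {
        string text = raw;
        foreach (var (pattern, replacement) in Commands)
            text = Regex.Replace(text, pattern, replacement, RegexOptions.IgnoreCase);
        return text.Trim();
    }
}
```

Applying it to the example above, MiniDictation.Format("Hello comma how are you question mark new line I am fine period") yields "Hello, how are you?\nI am fine.", matching the documented Formatter behavior for those commands.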
The dictation layer composes with everything above on this page: VAD-driven segmentation feeds the transcript, hallucination suppression cleans it, the formatter turns it into production-grade text. Suitable for medical and legal dictation, code-by-voice IDEs, voice-driven email composers, and accessibility tools.
Three dedicated capabilities sit alongside the core transcription engine: live streaming output for responsive UIs, configurable voice activity detection for clean audio preprocessing, and audio language detection for multilingual routing.
Live
Stream tokens as they're produced using OnNewSegment. Build live captions, dictation UIs, and meeting transcripts that update segment-by-segment with sub-second latency.
VAD
Isolate speech from silence and background noise before STT runs. Tune energy thresholds, durations, and padding via VadSettings. Cuts compute on silent audio, sharpens transcripts on noisy audio.
Detect
Identify the spoken language of any audio file before transcription. Auto-route to the right model, decide when to translate, tag content for archives, all from a single fast detection call on the same SpeechToText instance.
Two adjacent capabilities live under the Text pillars but compose naturally with audio output: identify the spoken language, or translate transcripts into a target language.
Detect
Identify the language of any text snippet, including a transcript, before routing to translation, classification, or summarisation. Lives under Text Analysis.
Open language detection
Translate
Translate transcripts across many language pairs for a microphone-to-target-language pipeline. Lives under Text Generation.
Open text translation
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.
Console demo: streaming STT with VAD and hallucination suppression.
Open on GitHub →
App
Cross-platform desktop transcription app built on LM-Kit.NET and .NET MAUI.
Open →
How-to guide
Load a model, pick a backend, stream results in real time.
Read the guide →
API reference
Namespace landing for the full Speech & Audio stack.
Open the reference →
The seven pillars of LM-Kit.NET, plus the local runtime they share. The highlighted card is where you are now.
01 · AI Agents
ReAct planning, supervisors, parallel and pipeline orchestrators, persistent memory, MCP clients, custom tools.
AI Agents
02 · Document Intelligence
PDF text and table extraction, on-device OCR reaching SOTA benchmark scores, structured field extraction with grammar-constrained generation.
Document Intelligence
03 · Vision & Multimodal
Image understanding, classification, labeling, multimodal chat, image embeddings, VLM-OCR, background removal. Same conversation surface as LLMs.
Vision & Multimodal
04 · RAG & Knowledge
Built-in vector store, Qdrant connector, embeddings, hybrid retrieval, document chunking, source citations.
RAG & Knowledge
05 · Text Analysis
Built-in classifiers and an extractor that emits typed C# objects via grammar-constrained sampling. Sentiment, keywords, language detection.
Text Analysis
06 · Speech & Audio
A growing local speech-to-text stack: hallucination suppression, Voice Activity Detection, real-time translation, streaming output, 100+ languages.
You are here
07 · Text Generation
Single-turn, multi-turn, and stateless conversation primitives. Translate, correct, rewrite, summarise. Prompt templates, streaming, grammar-constrained outputs.
Text Generation
The foundation
Every capability above runs on this runtime.
Foundation
The runtime all seven pillars sit on. The LM-Kit.NET NuGet ships the complete inference system: open-weight LLMs, vision-language models, embeddings, on-device speech-to-text, OCR and classifiers, accelerated on CPU, AVX2, CUDA 12/13, Vulkan or Metal. One package, zero cloud calls, predictable latency, full data and technology sovereignty.
The fastest and most accurate .NET speech-to-text. On-device transcription with 100+ languages, VAD, hallucination suppression, and real-time translation. Zero cloud dependency.