
Streaming speech-to-text, segment by segment.

Subscribe to OnNewSegment and the transcription engine pushes each segment to your UI as soon as it lands, complete with timestamps, detected language, and a per-segment confidence score. No buffering until end-of-file. No polling. Sub-second latency on a single CPU core for the smaller Whisper variants.

Sub-second latency · Timestamped segments · 100+ languages
Event: OnNewSegment. Fires per audio segment, not per file.

Output: AudioSegment. Text, start/end times, confidence, language.

Input: Microphone or file. WAV, MP3, FLAC, OGG, M4A, WMA.

Async: Cancellable. A CancellationToken stops mid-stream.

Build for the second the user hears, not the minute the file finishes.

Batch transcription is fine for archival workflows. But for live captions, dictation apps, real-time translation, meeting assistants, accessibility tools, or anything a person is actually waiting on, you need streaming output. SpeechToText exposes streaming as a first-class event, not as a chunking workaround.

Minimal example

Five lines to streaming output.

Subscribe to OnNewSegment, call TranscribeAsync, and segments arrive in your handler as the model produces them.

StreamingStt.cs
using LMKit.Model;
using LMKit.Speech;

// Load a Whisper model and wrap it in the transcription engine.
var model = LM.LoadFromModelID("whisper-large-turbo3");
var stt   = new SpeechToText(model);

// Each segment is raised as soon as the model produces it.
stt.OnNewSegment += (s, seg) =>
    Console.WriteLine($"[{seg.StartTime:hh\\:mm\\:ss}] {seg.Text}");

await stt.TranscribeAsync("meeting.mp3");
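The Cancellable card above notes that a CancellationToken stops mid-stream. A minimal sketch of that shape, assuming TranscribeAsync has an overload that accepts a CancellationToken (stopButton is a hypothetical UI element, not part of the SDK):

```csharp
using var cts = new CancellationTokenSource();
stopButton.Click += (_, _) => cts.Cancel();  // hypothetical stop control

try
{
    // Assumed overload: TranscribeAsync(path, cancellationToken).
    await stt.TranscribeAsync("meeting.mp3", cts.Token);
}
catch (OperationCanceledException)
{
    // Segments already delivered via OnNewSegment remain valid.
}
```

Because segments stream out as they are produced, everything transcribed before cancellation has already reached your handler; nothing is lost by stopping early.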
AudioSegment shape

What every event delivers.

Each segment carries everything you need to render a caption, fix subtitle timing, route by language, or flag low-confidence regions for human review.

Text (recognised text): the decoded text for this segment. Already passed through dictation formatting if enabled (punctuation, capitalisation, sentence boundaries).

Time (start/end timestamps): StartTime and EndTime as TimeSpan. Plug straight into an SRT/VTT writer or sync with a video timeline.

Score (confidence): per-segment confidence in [0, 1]. Threshold it to mark uncertain regions or trigger re-transcription with a larger model.

Lang (detected language): the ISO language code identified for this segment. Useful for multilingual audio: route each segment to the right downstream pipeline.
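The timestamp fields plug straight into subtitle output. A minimal SRT writer over the event, sketched from the fields above (SRT cues use comma-separated milliseconds and a blank line between entries):

```csharp
using System.Text;

var srt = new StringBuilder();
int index = 0;

stt.OnNewSegment += (s, seg) =>
{
    // One SRT cue per segment: index, time range, text, blank line.
    index++;
    srt.AppendLine(index.ToString());
    srt.AppendLine($"{seg.StartTime:hh\\:mm\\:ss\\,fff} --> {seg.EndTime:hh\\:mm\\:ss\\,fff}");
    srt.AppendLine(seg.Text);
    srt.AppendLine();
};

await stt.TranscribeAsync("meeting.mp3");
File.WriteAllText("meeting.srt", srt.ToString());
```

Because the file is assembled as segments arrive, the subtitle track is complete the moment transcription finishes, with no second pass over the audio.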

Patterns

Three real-world streaming patterns.

The same event hook drives live captions, dictation UIs, and meeting transcripts. Pick the shape that matches your app.

Live captions: push each segment to a caption overlay. A confidence threshold marks uncertain segments in grey so a human editor can quickly spot review candidates.

LiveCaptions.cs
stt.OnNewSegment += (s, seg) =>
{
    // Below 0.6 confidence, style the line for editor review.
    var css = seg.Confidence < 0.6 ? "caption low" : "caption";
    captionView.Append(new CaptionLine(seg.Text, seg.StartTime, css));
};
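The meeting-transcript pattern is the same hook feeding an accumulator instead of an overlay. A sketch, assuming the AudioSegment fields described above (the property name Language for the detected ISO code is an assumption, not confirmed by this page):

```csharp
using System.Text;

var transcript = new StringBuilder();

stt.OnNewSegment += (s, seg) =>
{
    // Tag each line with its detected language; flag low confidence for review.
    var marker = seg.Confidence < 0.6 ? " [review]" : "";
    transcript.AppendLine(
        $"{seg.StartTime:hh\\:mm\\:ss} ({seg.Language}){marker} {seg.Text}");
};

await stt.TranscribeAsync("meeting.mp3");
File.WriteAllText("meeting.txt", transcript.ToString());
```

A dictation UI is the same shape again: append seg.Text to the active text field as each segment lands, and pass a CancellationToken so a stop command ends the stream mid-file.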

Ship live captions this week.

Whisper-family models, segment-by-segment streaming, fully on-device. Pair with Voice Activity Detection for cleaner output on noisy audio.

Download free · Streaming guide