
Streaming speech-to-text, segment by segment.

Subscribe to OnNewSegment and the transcription engine pushes each segment to your UI as soon as it lands, complete with timestamps, detected language, and a per-segment confidence score. No buffering until end-of-file. No polling. Sub-second latency on a single CPU core for the smaller Whisper variants.

Sub-second latency · Timestamped segments · 100+ languages
Event: OnNewSegment. Fires per audio segment, not per file.

Output: AudioSegment. Text, start/end times, confidence, language.

Input: Microphone or file. WAV, MP3, FLAC, OGG, M4A, WMA.

Async: Cancellable. A CancellationToken stops mid-stream.

Build for the second the user hears, not the minute the file finishes.

Batch transcription is fine for archival workflows. But for live captions, dictation apps, real-time translation, meeting assistants, accessibility tools, or anything a person is actually waiting on, you need streaming output. SpeechToText exposes streaming as a first-class event, not as a chunking workaround.

Minimal example

Five lines to streaming output.

Subscribe to OnNewSegment, call TranscribeAsync, and segments arrive in your handler as the model produces them.

StreamingStt.cs
using LMKit.Model;
using LMKit.Speech;

// Load a Whisper model and wrap it in the transcription engine.
var model = LM.LoadFromModelID("whisper-large-turbo3");
var stt   = new SpeechToText(model);

// Each segment is raised as soon as the model produces it.
stt.OnNewSegment += (s, seg) =>
    Console.WriteLine($"[{seg.StartTime:hh\\:mm\\:ss}] {seg.Text}");

await stt.TranscribeAsync("meeting.mp3");
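The Cancellable card above notes that a CancellationToken stops mid-stream. A minimal sketch of that shape, assuming TranscribeAsync has an overload that accepts a CancellationToken (stopButton is a hypothetical UI element, not part of the SDK):

```csharp
using var cts = new CancellationTokenSource();
stopButton.Click += (_, _) => cts.Cancel();  // hypothetical stop control

try
{
    // Assumed overload: TranscribeAsync(path, cancellationToken).
    await stt.TranscribeAsync("meeting.mp3", cts.Token);
}
catch (OperationCanceledException)
{
    // Segments already delivered via OnNewSegment remain valid.
}
```

Because segments stream out as they are produced, everything transcribed before cancellation has already reached your handler; nothing is lost by stopping early.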
AudioSegment shape

What every event delivers.

Each segment carries everything you need to render a caption, fix subtitle timing, route by language, or flag low-confidence regions for human review.

Text (recognised text): the decoded text for this segment. Already passed through dictation formatting if enabled (punctuation, capitalisation, sentence boundaries).

Time (start/end timestamps): StartTime and EndTime as TimeSpan. Plug straight into an SRT/VTT writer or sync with a video timeline.

Score (confidence): per-segment confidence in [0, 1]. Threshold it to mark uncertain regions or trigger re-transcription with a larger model.

Lang (detected language): the ISO language code identified for this segment. Useful for multilingual audio: route each segment to the right downstream pipeline.
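The timestamp fields plug straight into subtitle output. A minimal SRT writer over the event, sketched from the fields above (SRT cues use comma-separated milliseconds and a blank line between entries):

```csharp
using System.Text;

var srt = new StringBuilder();
int index = 0;

stt.OnNewSegment += (s, seg) =>
{
    // One SRT cue per segment: index, time range, text, blank line.
    index++;
    srt.AppendLine(index.ToString());
    srt.AppendLine($"{seg.StartTime:hh\\:mm\\:ss\\,fff} --> {seg.EndTime:hh\\:mm\\:ss\\,fff}");
    srt.AppendLine(seg.Text);
    srt.AppendLine();
};

await stt.TranscribeAsync("meeting.mp3");
File.WriteAllText("meeting.srt", srt.ToString());
```

Because the file is assembled as segments arrive, the subtitle track is complete the moment transcription finishes, with no second pass over the audio.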

Patterns

Three real-world streaming patterns.

The same event hook drives live captions, dictation UIs, and meeting transcripts. Pick the shape that matches your app.

Live captions: push each segment to a caption overlay. A confidence threshold marks uncertain segments in grey so a human editor can quickly spot review candidates.

LiveCaptions.cs
stt.OnNewSegment += (s, seg) =>
{
    // Below 0.6 confidence, style the line for editor review.
    var css = seg.Confidence < 0.6 ? "caption low" : "caption";
    captionView.Append(new CaptionLine(seg.Text, seg.StartTime, css));
};
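The meeting-transcript pattern is the same hook feeding an accumulator instead of an overlay. A sketch, assuming the AudioSegment fields described above (the property name Language for the detected ISO code is an assumption, not confirmed by this page):

```csharp
using System.Text;

var transcript = new StringBuilder();

stt.OnNewSegment += (s, seg) =>
{
    // Tag each line with its detected language; flag low confidence for review.
    var marker = seg.Confidence < 0.6 ? " [review]" : "";
    transcript.AppendLine(
        $"{seg.StartTime:hh\\:mm\\:ss} ({seg.Language}){marker} {seg.Text}");
};

await stt.TranscribeAsync("meeting.mp3");
File.WriteAllText("meeting.txt", transcript.ToString());
```

A dictation UI is the same shape again: append seg.Text to the active text field as each segment lands, and pass a CancellationToken so a stop command ends the stream mid-file.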

Ship live captions this week.

Whisper-family models, segment-by-segment streaming, fully on-device. Pair with Voice Activity Detection for cleaner output on noisy audio.

Download free · Streaming guide