
OCR driven by vision-language models.

Modern OCR is no longer a single text-out engine. VLM-OCR reads layouts, tables, formulas, charts, and seals through structured intents, and returns Markdown, JSON, or plain text. It runs locally and posts SOTA scores on public document-OCR benchmarks.

7 structured intents · Markdown / JSON output · SOTA benchmark scores
Class

VlmOcr

VLM-driven OCR engine with structured intents.

Class

LMKitOcr

First-party engine, fast on a single core.

Models

PaddleOCR-VL · GLM-OCR

Choose the right OCR model for your hardware.

Why VLM-OCR

A vision-language model is the OCR engine now.

Traditional OCR returns a flat string. VLM-OCR understands what it sees: paragraphs are paragraphs, tables are tables, charts are charts, signatures are signatures. The output is structured from the start.

01

Text

Plain text extraction with reading order preserved across columns and pages.

02

Markdown

Headings, lists, emphasis, code blocks. Drop straight into a Markdown pipeline.

03

Tables

Structured table extraction with cells, headers, spans. Output as JSON or CSV.

04

Formulas

LaTeX or MathML for inline and display math. Recover equations from scanned scientific papers.

05

Charts

Bar, line, pie, and scatter charts, with axis labels and values. Extract data points from chart images.

06

Coordinates

Bounding boxes per token, line, and paragraph. Anchor downstream redaction or highlighting; see the sketch after this list.

07

Seals & signatures

Detect and extract official stamps, seals, and signatures with bounding boxes. Useful for compliance workflows where legal artefacts must be flagged separately from body text.
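
To make the coordinates intent concrete, here is a minimal redaction sketch. The OcrBox and OcrSpan shapes are hypothetical stand-ins for whatever the real result type exposes, and the spans are hand-made sample data, not library output.

RedactionSketch.cs
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hand-made sample spans standing in for a coordinates-intent result.
var spans = new List<OcrSpan>
{
    new("Invoice #2024-117", new OcrBox(40, 52, 310, 24)),
    new("jane.doe@example.com", new OcrBox(40, 96, 280, 22)),
};

// Every span whose text matches the pattern becomes a rectangle to
// paint over in the page image.
var email = new Regex(@"[\w.+-]+@[\w-]+\.[\w.]+");
foreach (var box in spans.Where(s => email.IsMatch(s.Text)).Select(s => s.Box))
    Console.WriteLine($"redact x={box.X} y={box.Y} w={box.Width} h={box.Height}");

// Hypothetical shapes: pixel coordinates on the source image.
record OcrBox(int X, int Y, int Width, int Height);
record OcrSpan(string Text, OcrBox Box);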

How it works

Pick a model, pick an intent.

VlmOcrExample.cs
using LMKit.Model;
using LMKit.Extraction.Ocr;
using LMKit.Graphics;

var ocrModel = LM.LoadFromModelID("paddleocr-vl:0.9b");
var ocr = new VlmOcr(ocrModel);

// Markdown intent: paragraphs, lists, headings.
var md = await ocr.ExtractAsync(
    Attachment.FromFile("page.png"),
    intent: VlmOcrIntent.Markdown);

// Tables intent: structured cells.
var tables = await ocr.ExtractAsync(
    Attachment.FromFile("financials.png"),
    intent: VlmOcrIntent.Tables);

// Formulas intent: LaTeX output.
var formulas = await ocr.ExtractAsync(
    Attachment.FromFile("paper.png"),
    intent: VlmOcrIntent.Formulas);
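
The snippets above don't show the shape of the returned objects. As a hedged illustration only: assume a tables result flattens to cells with row and column spans; the TableCell type below is a hypothetical stand-in, not the library's published API.

TableToCsvSketch.cs
using System;
using System.Linq;
using System.Text;

// Hand-made cells standing in for a tables-intent result.
var cells = new[]
{
    new TableCell(0, 0, 1, 2, "Q1 revenue"),
    new TableCell(1, 0, 1, 1, "EMEA"),
    new TableCell(1, 1, 1, 1, "4,210"),
};

// Expand spanned cells into a dense grid, then emit quoted CSV.
int rows = cells.Max(c => c.Row + c.RowSpan);
int cols = cells.Max(c => c.Col + c.ColSpan);
var grid = new string[rows, cols];
foreach (var c in cells)
    for (int r = c.Row; r < c.Row + c.RowSpan; r++)
        for (int k = c.Col; k < c.Col + c.ColSpan; k++)
            grid[r, k] = c.Text;

var csv = new StringBuilder();
for (int r = 0; r < rows; r++)
    csv.AppendLine(string.Join(",", Enumerable.Range(0, cols)
        .Select(k => $"\"{(grid[r, k] ?? "").Replace("\"", "\"\"")}\"")));
Console.Write(csv);

// Hypothetical shape; spans mirror HTML rowspan/colspan.
record TableCell(int Row, int Col, int RowSpan, int ColSpan, string Text);
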
Use cases

Where VLM-OCR belongs.

Document digitisation

Convert scanned PDFs to clean Markdown for ingestion into RAG, knowledge bases, or LLM context windows.
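
A minimal ingestion sketch for this use case, using only the Markdown string the engine returns: split on headings so each embedded chunk keeps its section title as retrieval context. Plain C#, no LM-Kit calls assumed.

ChunkByHeadingSketch.cs
using System;
using System.Collections.Generic;

foreach (var chunk in ChunkByHeading("# Title\nIntro.\n## Terms\nClause 1."))
    Console.WriteLine($"--- chunk ---\n{chunk}");

// Start a new chunk at every Markdown heading so each passage carries
// its section title into the embedding.
static IEnumerable<string> ChunkByHeading(string markdown)
{
    var lines = new List<string>();
    foreach (var line in markdown.Split('\n'))
    {
        if (line.StartsWith('#') && lines.Count > 0)
        {
            yield return string.Join("\n", lines);
            lines.Clear();
        }
        lines.Add(line);
    }
    if (lines.Count > 0) yield return string.Join("\n", lines);
}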

Scientific publishing

Extract equations, charts, and tables from published papers with structure intact. Reproduce LaTeX from PDFs.

Compliance & legal

Flag seals, signatures, and official stamps as separate artefacts. Drive automated compliance review.

Mailroom & intake

Read mixed-format paper mail (letters, invoices, contracts) and output structured Markdown for downstream pipelines, as sketched below.
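
A sketch of that intake loop, built from the same calls shown in the example above. The folder path and file filter are placeholders, error handling is elided, and the result is written out via ToString() because the concrete return type isn't shown on this page.

MailroomSketch.cs
using System.IO;
using LMKit.Model;
using LMKit.Extraction.Ocr;
using LMKit.Graphics;

var model = LM.LoadFromModelID("paddleocr-vl:0.9b");
var ocr = new VlmOcr(model);

// Normalize every scanned page in the intake folder (placeholder path)
// to Markdown for the downstream pipeline.
foreach (var path in Directory.EnumerateFiles("intake", "*.png"))
{
    var md = await ocr.ExtractAsync(
        Attachment.FromFile(path),
        intent: VlmOcrIntent.Markdown);
    await File.WriteAllTextAsync(
        Path.ChangeExtension(path, ".md"), md.ToString());
}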

Document Intelligence OCR page →

LM-Kit.NET pillars

Seven pillars, one foundation.

The seven pillars of LM-Kit.NET, plus the local runtime they share.

The foundation

Every capability above runs on this runtime.

Foundation

Local Inference

The runtime all seven pillars sit on. The LM-Kit.NET NuGet ships the complete inference system: open-weight LLMs, vision-language models, embeddings, on-device speech-to-text, OCR, and classifiers, accelerated on CPU (AVX2), CUDA 12/13, Vulkan, or Metal. One package, zero cloud calls, predictable latency, full data and technology sovereignty.

Explore the foundation

Structured OCR, seven intents.

Start in 5 minutes · Back to Vision hub