
OCR driven by vision-language models.

Modern OCR is no longer a single text-out engine. VLM-OCR reads layouts, tables, formulas, charts, and seals through structured intents, and returns Markdown, JSON, or plain text. It runs locally and posts SOTA scores on public document-OCR benchmarks.

7 structured intents · Markdown / JSON output · SOTA benchmark scores
Class

VlmOcr

VLM-driven OCR engine with structured intents.

Class

LMKitOcr

First-party engine, fast on a single core.

Models

PaddleOCR-VL · GLM-OCR

Choose the right OCR model for your hardware.

Why VLM-OCR

A vision-language model is the OCR engine now.

Traditional OCR returns a flat string. VLM-OCR understands what it sees: paragraphs are paragraphs, tables are tables, charts are charts, signatures are signatures. The output is structured from the start.

01

Text

Plain text extraction with reading order preserved across columns and pages.

02

Markdown

Headings, lists, emphasis, code blocks. Drop straight into a Markdown pipeline.

03

Tables

Structured table extraction with cells, headers, spans. Output as JSON or CSV.

04

Formulas

LaTeX or MathML for inline and display math. Recover equations from scanned scientific papers.

05

Charts

Bar, line, pie, and scatter charts, with axis labels and values. Extract data points from chart images.

06

Coordinates

Bounding boxes per token, line, and paragraph. Anchor downstream redaction or highlighting; see the sketch after this list.

07

Seals & signatures

Detect and extract official stamps, seals, and signatures with bounding boxes. Useful for compliance workflows where legal artefacts must be flagged separately from body text.
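
To make the coordinates intent concrete, here is a minimal redaction sketch. The OcrBox and OcrSpan shapes are hypothetical stand-ins for whatever the real result type exposes, and the spans are hand-made sample data, not library output.

RedactionSketch.cs
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hand-made sample spans standing in for a coordinates-intent result.
var spans = new List<OcrSpan>
{
    new("Invoice #2024-117", new OcrBox(40, 52, 310, 24)),
    new("jane.doe@example.com", new OcrBox(40, 96, 280, 22)),
};

// Every span whose text matches the pattern becomes a rectangle to
// paint over in the page image.
var email = new Regex(@"[\w.+-]+@[\w-]+\.[\w.]+");
foreach (var box in spans.Where(s => email.IsMatch(s.Text)).Select(s => s.Box))
    Console.WriteLine($"redact x={box.X} y={box.Y} w={box.Width} h={box.Height}");

// Hypothetical shapes: pixel coordinates on the source image.
record OcrBox(int X, int Y, int Width, int Height);
record OcrSpan(string Text, OcrBox Box);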

How it works

Pick a model, pick an intent.

VlmOcrExample.cs
using LMKit.Model;
using LMKit.Extraction.Ocr;
using LMKit.Graphics;

var ocrModel = LM.LoadFromModelID("paddleocr-vl:0.9b");
var ocr = new VlmOcr(ocrModel);

// Markdown intent: paragraphs, lists, headings.
var md = await ocr.ExtractAsync(
    Attachment.FromFile("page.png"),
    intent: VlmOcrIntent.Markdown);

// Tables intent: structured cells.
var tables = await ocr.ExtractAsync(
    Attachment.FromFile("financials.png"),
    intent: VlmOcrIntent.Tables);

// Formulas intent: LaTeX output.
var formulas = await ocr.ExtractAsync(
    Attachment.FromFile("paper.png"),
    intent: VlmOcrIntent.Formulas);
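
The snippets above don't show the shape of the returned objects. As a hedged illustration only: assume a tables result flattens to cells with row and column spans; the TableCell type below is a hypothetical stand-in, not the library's published API.

TableToCsvSketch.cs
using System;
using System.Linq;
using System.Text;

// Hand-made cells standing in for a tables-intent result.
var cells = new[]
{
    new TableCell(0, 0, 1, 2, "Q1 revenue"),
    new TableCell(1, 0, 1, 1, "EMEA"),
    new TableCell(1, 1, 1, 1, "4,210"),
};

// Expand spanned cells into a dense grid, then emit quoted CSV.
int rows = cells.Max(c => c.Row + c.RowSpan);
int cols = cells.Max(c => c.Col + c.ColSpan);
var grid = new string[rows, cols];
foreach (var c in cells)
    for (int r = c.Row; r < c.Row + c.RowSpan; r++)
        for (int k = c.Col; k < c.Col + c.ColSpan; k++)
            grid[r, k] = c.Text;

var csv = new StringBuilder();
for (int r = 0; r < rows; r++)
    csv.AppendLine(string.Join(",", Enumerable.Range(0, cols)
        .Select(k => $"\"{(grid[r, k] ?? "").Replace("\"", "\"\"")}\"")));
Console.Write(csv);

// Hypothetical shape; spans mirror HTML rowspan/colspan.
record TableCell(int Row, int Col, int RowSpan, int ColSpan, string Text);
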
Use cases

Where VLM-OCR belongs.

Document digitisation

Convert scanned PDFs to clean Markdown for ingestion into RAG, knowledge bases, or LLM context windows.
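
A minimal ingestion sketch for this use case, using only the Markdown string the engine returns: split on headings so each embedded chunk keeps its section title as retrieval context. Plain C#, no LM-Kit calls assumed.

ChunkByHeadingSketch.cs
using System;
using System.Collections.Generic;

foreach (var chunk in ChunkByHeading("# Title\nIntro.\n## Terms\nClause 1."))
    Console.WriteLine($"--- chunk ---\n{chunk}");

// Start a new chunk at every Markdown heading so each passage carries
// its section title into the embedding.
static IEnumerable<string> ChunkByHeading(string markdown)
{
    var lines = new List<string>();
    foreach (var line in markdown.Split('\n'))
    {
        if (line.StartsWith('#') && lines.Count > 0)
        {
            yield return string.Join("\n", lines);
            lines.Clear();
        }
        lines.Add(line);
    }
    if (lines.Count > 0) yield return string.Join("\n", lines);
}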

Scientific publishing

Extract equations, charts, and tables from published papers with structure intact. Reproduce LaTeX from PDFs.

Compliance & legal

Flag seals, signatures, and official stamps as separate artefacts. Drive automated compliance review.

Mailroom & intake

Read mixed-format paper mail (letters, invoices, contracts) and output structured Markdown for downstream pipelines, as sketched below.
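
A sketch of that intake loop, built from the same calls shown in the example above. The folder path and file filter are placeholders, error handling is elided, and the result is written out via ToString() because the concrete return type isn't shown on this page.

MailroomSketch.cs
using System.IO;
using LMKit.Model;
using LMKit.Extraction.Ocr;
using LMKit.Graphics;

var model = LM.LoadFromModelID("paddleocr-vl:0.9b");
var ocr = new VlmOcr(model);

// Normalize every scanned page in the intake folder (placeholder path)
// to Markdown for the downstream pipeline.
foreach (var path in Directory.EnumerateFiles("intake", "*.png"))
{
    var md = await ocr.ExtractAsync(
        Attachment.FromFile(path),
        intent: VlmOcrIntent.Markdown);
    await File.WriteAllTextAsync(
        Path.ChangeExtension(path, ".md"), md.ToString());
}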

Document Intelligence OCR page →

LM-Kit.NET pillars

Seven pillars, one foundation.

The seven pillars of LM-Kit.NET, plus the local runtime they share.

The foundation

Every capability above runs on this runtime.

Foundation

Local Inference

The runtime all seven pillars sit on. The LM-Kit.NET NuGet ships the complete inference system: open-weight LLMs, vision-language models, embeddings, on-device speech-to-text, OCR, and classifiers, accelerated on CPU (AVX2), CUDA 12/13, Vulkan, or Metal. One package, zero cloud calls, predictable latency, full data and technology sovereignty.

Explore the foundation

Structured OCR, seven intents.

Start in 5 minutes · Back to Vision hub