Solutions · Document Intelligence · Extraction

Any document. Any field. Zero hallucinations.

The most advanced local structured data extraction engine available. Extract precise fields from invoices, contracts, medical records, and any document type. Powered by multimodal AI, proprietary symbolic AI layers, and purpose-trained LM-Kit models that eliminate LLM hallucinations. 100% on-device.

Start building free View invoice demo

Symbolic AI layers LM-Kit trained models Built-in OCR

Symbolic AI layers

Dynamic Sampling and adaptive layers eliminate LLM hallucinations.

LM-Kit models

Purpose-trained models optimized for extraction tasks.

Multimodal engine

Images, scans, PDFs, handwritten notes natively.

Schema-driven

JSON schema or high-level API for typed outputs.

75%
Fewer errors

2x
Faster processing

0
Cloud calls

Beyond GenAI

A unique piece of engineering.

LM-Kit.NET delivers the most advanced local structured data extraction engine available. While other solutions rely solely on LLMs that hallucinate, LM-Kit combines generative AI with multiple symbolic AI layers, fuzzy logic, and expert systems to produce extraction results you can actually trust.

This is not just another wrapper around an LLM. Our proprietary symbolic layers work in concert with the language model, dynamically engaged based on content characteristics, domain semantics, and extraction requirements. The system intelligently orchestrates these components for each extraction scenario, achieving accuracy that pure LLM approaches cannot match.

Built by IDP pioneers: Designed by engineers with 20+ years of experience in document processing and data extraction, processing billions of documents in production worldwide.

InvoiceExtraction.cs

using LMKit.Extraction;
using LMKit.Model;

// Load LM-Kit optimized model (recommended)
var model = LM.LoadFromModelID("lmkit-tasks:4b-preview");

// Create extraction instance
var extractor = new TextExtraction(model);

// Define schema from JSON or programmatically
extractor.SetElementsFromJsonSchema(schemaJson);

// Set content: image, PDF, or text
extractor.SetContent(new Attachment("invoice.pdf"));

// Extract with zero hallucinations
var result = extractor.Parse();

// Access structured JSON output
Console.WriteLine(result.Json);

// Or iterate typed elements
foreach (var elem in result.Elements)
    Console.WriteLine($"{elem.Name}: {elem.Value}");

Core technology

Adaptive symbolic AI layers.

Multiple AI paradigms working together, dynamically engaged based on content type, domain semantics, and extraction context.

Why pure LLMs fail at extraction

Large Language Models are designed for fluency, not precision. They hallucinate values, invent data that doesn't exist, and struggle with structured output constraints. For extraction tasks where accuracy matters, this is unacceptable.

LM-Kit solves this with a multi-layer architecture where symbolic AI systems validate, constrain, and correct LLM outputs in real-time. These layers include techniques such as taxonomy matching, ontology validation, fuzzy logic, and rule-based expert systems. Each component is adaptively engaged based on the extraction scenario.

Dynamic Sampling

Adaptive inference with real-time structural awareness and contextual validation.

Taxonomy matching

Domain-specific classification and entity recognition.

Ontology validation

Semantic relationship verification between extracted fields.

Contextual rules

Expert system rules applied based on document type and content.

Key innovation

Dynamic Sampling: the secret weapon.

A foundational component of our symbolic AI stack that fundamentally reimagines how LLMs generate structured output.

Adaptive inference, not just token selection

Standard LLM sampling picks the most probable next token. This works for chat, but fails catastrophically for structured extraction where precision matters.

Dynamic Sampling is our proprietary inference method that goes far beyond probability-based token selection. It maintains real-time structural awareness of the generation process, applies contextual perplexity assessment with fuzzifiers, and leverages auxiliary content as extended context to guide every token decision.

Speculative Grammar: Hybrid approach combining greedy sampling for constants with speculative validation for variables
Contextual Perplexity: Adaptive guidance using fuzzifiers to reduce hallucinations without over-penalizing valid patterns
Auxiliary Content: Extended context mechanism for semantic validation beyond the attention window
Model-Agnostic: Works across any model, any size, no fine-tuning required

Faster sampling

vs. traditional grammar methods

10x

Faster inference

with full optimization stack

Fine-tuning

works out of the box

100%

Local

on your infrastructure

Capability

Real-time structural awareness

Tracks whether the model is inside a JSON string, object, numeric run, or value start. Maintains a persistent CompletionState that enables structural validation at every step.

Capability

Metric-guided token voting

Perplexity scoring identifies uncertainty between candidates. Per-candidate validation loops explore alternatives when top tokens are invalid or overly risky.

Capability

Model-aware JSON rendering

Monitors model preferences for formatting styles and adapts grammar expectations in real-time, ensuring higher parsing success across different model architectures.

Capability

Graceful fallbacks

Immediate error detection with automatic correction through adaptive fallbacks. Prevents error propagation without restarting inference from scratch.

Capability

Contextual repetition detection

Understands when repetition is valid (e.g., "1000000000") versus problematic, avoiding the crude penalties of traditional approaches that break valid outputs.

Capability

Continuously benchmarked

Refined through experimental research cycles and inference benchmarking on large datasets. Updated regularly to maintain state-of-the-art performance.

Recommended

LM-Kit trained models.

Purpose-built models optimized specifically for LM-Kit extraction tasks. The best option for maximum accuracy and speed.

Best performance

LM-Kit Tasks model

A specialized model optimized for LM-Kit pipelines. Achieves state-of-the-art performance in classification, structured data extraction, language detection, and sentiment analysis while also supporting chat, embeddings, text generation, code completion, math reasoning, and vision understanding.

Optimized for extraction accuracy
Seamless integration with LM-Kit pipelines
Multimodal: text and vision capable
Compact size, efficient inference

LM-Kit Tasks

lmkit-tasks

Extraction Classification Vision Chat Code

131K
Context

~3 GB
Size

4-bit
Quantization

LMK
Format

Multimodal by design

Extract from any content source.

Images, scans, PDFs, handwritten notes, Office documents. If it contains data, we can extract it.

Input

Images

Photos, scans, screenshots with automatic orientation detection.

PNG JPG TIFF BMP

Input

PDF documents

Digital PDFs, scanned documents, multi-page extraction.

PDF PDF/A

Input

Office documents

Word, Excel, PowerPoint with layout preservation.

DOCX XLSX PPTX

Input

Handwritten content

Notes, forms, signatures with VLM understanding.

VLM OCR

Text recognition

Built-in local OCR engine.

Powerful OCR capabilities included out of the box, with support for custom OCR engine integration.

Built-in

Built-in OCR

LM-Kit includes a local OCR engine that works seamlessly with the extraction pipeline. No cloud calls, no external dependencies.

Automatic language detection
Orientation detection and correction
100% local processing

Pluggable

Custom OCR integration

Need a different OCR engine? The extraction pipeline supports pluggable OCR engines to match your specific requirements.

Pluggable OCR engine interface
Pre-integrated alternatives available
Custom engine support

Schema-driven extraction

Rich data type support.

Define extraction schemas with typed fields. Get clean, validated output ready for integration.

String

Text values

Integer

Whole numbers

Float

Decimal values

Double

High precision

Bool

True/false

Date

Date values

Char

Single character

Arrays

Lists of any type

Plus nested objects and complex structures. View all supported types →

Try it now

Invoice data extraction demo.

A complete demo showcasing multimodal extraction from invoice documents.

Featured demo

Invoice extraction demo

Interactive console app demonstrating structured data extraction from invoices in multiple languages with VLM and OCR integration, automatic language detection, and JSON output.

Multiple vision-language models supported
Built-in OCR with language detection
Automatic orientation detection
JSON schema configuration
Sample invoices in French, Spanish, English

View sample guide GitHub source code

terminal

$ dotnet run
# Select a vision-language model...
# Loading model... ████████ 100%
# Select invoice: invoice_fr.png
# Detected language: French
# Detected orientation: 0 degrees
# Extracting structured data...

Vendor:         SARL Dupont & Fils
Invoice Number: 2024-0892
Date:           2024-03-15
Total:          1,247.50 EUR

# Completed in 2.34 seconds

Use cases

Extract anything from anything.

Unlimited use cases. Define your schema and extract.

Use case

Financial documents

Invoices, receipts, bank statements, expense reports. Extract vendor, amounts, dates, line items.

Use case

Legal contracts

NDAs, service agreements, employment contracts. Extract parties, dates, clauses, obligations.

Use case

Medical records

Patient records, lab results, prescriptions. Extract patient info, diagnoses, medications.

Use case

HR documents

Resumes, job offers, employment applications. Extract skills, experience, contact info.

Use case

Academic papers

Research papers, citations, abstracts. Extract authors, methodology, findings, references.

Use case

ID documents

Passports, driver licenses, ID cards. Extract name, number, dates, nationality.

Model support

Broad model compatibility.

Use LM-Kit trained models for best results, or bring your own. We support a wide range of models and are constantly adding new ones.

LM-Kit supports many different models for extraction tasks. While third-party models work well, LM-Kit trained models deliver the best performance as they are specifically optimized for our extraction pipelines and symbolic AI layers. We continuously add new model support and refine our purpose-built models.

Browse full model catalog

API reference

Key classes.

The building blocks for structured data extraction applications.

Class

`TextExtraction`

Main extraction class. Set content, define schema, parse. Supports text, images, PDFs, and Office documents.

View documentation

Type

`TextExtractionElement`

Define extraction fields with name, type, description, and nested elements for complex structures.

View documentation

Type

`TextExtractionResult`

Contains extracted elements and their JSON representation. Iterate typed elements or access raw JSON.

View documentation

Type

`Attachment`

Represents input content. Supports PDF, images, Office docs with page-level access.

View documentation

Related capabilities

Extraction plus the rest of Document Intelligence.

Layout understanding

Paragraphs, reading order, and layout-aware spatial search drive extraction accuracy. Post-validate field positions with FindNear and FindInRegion.

Layout page

OCR

Extraction over scans starts with OCR. Native engine plus VLM OCR for tables, formulas, charts, seals. SOTA benchmark accuracy, on-device, no per-page cost.

OCR page

Document classification

Classify first, then run extraction with the right schema per category. Invoices and passports need different fields.

Classification

Document splitting

Split bundled multi-document scans into separate logical units, then extract from each.

Splitting

Structured content creation

The grammar-constrained generation engine that powers extraction. Same primitive, generation-side audience.

Grammar-constrained generation

Demos & docs

Build it. Read it. Try it.

Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.

Demo

Install the SDK

Ready to extract with zero hallucinations?

The most advanced local data extraction engine. Symbolic AI layers and purpose-trained models. 100% on your infrastructure.

Download free Try the demo

Any document. Any field. Zero hallucinations.

A unique piece of engineering.

Why pure LLMs fail at extraction

Adaptive inference, not just token selection

Real-time structural awareness

Metric-guided token voting

Model-aware JSON rendering

Graceful fallbacks

Contextual repetition detection

Continuously benchmarked

LM-Kit Tasks model

Images

PDF documents

Office documents

Handwritten content

Built-in OCR

Custom OCR integration

Invoice extraction demo

Financial documents

Legal contracts

Medical records

HR documents

Academic papers

ID documents

TextExtraction

TextExtractionElement

TextExtractionResult

Attachment

Layout understanding

OCR

Document classification

Document splitting

Structured content creation

Structured data extraction

Structured data extraction walkthrough

Invoice data extraction

Invoice data extraction walkthrough

Resume parser

Resume parser walkthrough

Extract structured data

`TextExtraction`

`TextExtractionElement`

`TextExtractionResult`

`Attachment`