Dynamic Sampling and adaptive layers eliminate LLM hallucinations.
Any document. Any field. Zero hallucinations.
The most advanced local structured data extraction engine available. Extract precise fields from invoices, contracts, medical records, and any document type. Powered by multimodal AI, proprietary symbolic AI layers, and purpose-trained LM-Kit models that eliminate LLM hallucinations. 100% on-device.
Purpose-trained models optimized for extraction tasks.
Images, scans, PDFs, handwritten notes natively.
JSON schema or high-level API for typed outputs.
Fewer errors
Faster processing
Zero cloud calls
A unique piece of engineering.
LM-Kit.NET delivers the most advanced local structured data extraction engine available. While other solutions rely solely on LLMs that hallucinate, LM-Kit combines generative AI with multiple symbolic AI layers, fuzzy logic, and expert systems to produce extraction results you can actually trust.
This is not just another wrapper around an LLM. Our proprietary symbolic layers work in concert with the language model, dynamically engaged based on content characteristics, domain semantics, and extraction requirements. The system intelligently orchestrates these components for each extraction scenario, achieving accuracy that pure LLM approaches cannot match.
Built by IDP pioneers: Designed by engineers with 20+ years of experience in document processing and data extraction, whose systems have processed billions of documents in production worldwide.
Adaptive symbolic AI layers.
Multiple AI paradigms working together, dynamically engaged based on content type, domain semantics, and extraction context.
Why pure LLMs fail at extraction
Large Language Models are designed for fluency, not precision. They hallucinate values, invent data that doesn't exist, and struggle with structured output constraints. For extraction tasks where accuracy matters, this is unacceptable.
LM-Kit solves this with a multi-layer architecture where symbolic AI systems validate, constrain, and correct LLM outputs in real time. These layers include techniques such as taxonomy matching, ontology validation, fuzzy logic, and rule-based expert systems. Each component is adaptively engaged based on the extraction scenario.
Adaptive inference with real-time structural awareness and contextual validation.
Domain-specific classification and entity recognition.
Semantic relationship verification between extracted fields.
Expert system rules applied based on document type and content.
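To make the expert-system layer concrete, here is a minimal illustrative sketch of rule-based post-validation over extracted invoice fields. The rules, field names, and thresholds below are invented for this example and do not reflect LM-Kit's internal rule sets:

```python
from datetime import date

def validate_invoice(fields: dict) -> list[str]:
    """Apply expert-system style rules to an extracted invoice record.

    Returns a list of rule violations; an empty list means the record
    passed every check."""
    violations = []

    # Rule 1: line item amounts must reconcile with the stated total.
    line_sum = sum(item["amount"] for item in fields.get("line_items", []))
    total = fields.get("total", 0.0)
    if abs(line_sum - total) > 0.01:
        violations.append(f"total {total} != line item sum {line_sum}")

    # Rule 2: the due date cannot precede the invoice date.
    if fields["due_date"] < fields["invoice_date"]:
        violations.append("due_date precedes invoice_date")

    return violations
```

Checks like these run after generation, so a hallucinated total that does not reconcile with the extracted line items is caught deterministically rather than trusted.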
Dynamic Sampling: the secret weapon.
A foundational component of our symbolic AI stack that fundamentally reimagines how LLMs generate structured output.
Adaptive inference, not just token selection
Standard LLM sampling picks the most probable next token. This works for chat, but fails catastrophically for structured extraction where precision matters.
Dynamic Sampling is our proprietary inference method that goes far beyond probability-based token selection. It maintains real-time structural awareness of the generation process, applies contextual perplexity assessment with fuzzifiers, and leverages auxiliary content as extended context to guide every token decision.
- Speculative Grammar: Hybrid approach combining greedy sampling for constants with speculative validation for variables
- Contextual Perplexity: Adaptive guidance using fuzzifiers to reduce hallucinations without over-penalizing valid patterns
- Auxiliary Content: Extended context mechanism for semantic validation beyond the attention window
- Model-Agnostic: Works across any model, any size, no fine-tuning required
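Dynamic Sampling itself is proprietary, but the "greedy for constants, speculative validation for variables" idea can be approximated in a toy sketch: schema constants are emitted directly, while ranked model candidates for each variable slot are validated before acceptance. Everything below (function names, the slot model, the scoring) is illustrative, not LM-Kit code:

```python
import json

def pick_valid(candidates, required_type):
    """Token-voting step: walk ranked candidates and return the first one
    that is structurally valid for the slot, instead of blindly trusting
    the single most probable token."""
    for text, score in sorted(candidates, key=lambda c: -c[1]):
        try:
            value = json.loads(text)
        except json.JSONDecodeError:
            continue  # reject malformed candidates outright
        if isinstance(value, required_type):
            return value
    return None  # graceful fallback: no candidate survived validation

def fill_schema(slots, model_proposals):
    """Emit schema constants greedily; validate model output per variable slot."""
    return {name: pick_valid(model_proposals[name], typ)
            for name, typ in slots.items()}
```

In this sketch a high-probability but malformed candidate (say, `"12,5"` for a float field) is rejected in favor of a lower-ranked candidate that actually satisfies the slot's type, which is the essence of validation-aware sampling.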
Faster sampling vs. traditional grammar methods
Faster inference with full optimization stack
No fine-tuning required, works out of the box
100% local, on your infrastructure
Capability
Real-time structural awareness
Tracks whether the model is inside a JSON string, object, numeric run, or value start. Maintains a persistent CompletionState that enables structural validation at every step.
Capability
Metric-guided token voting
Perplexity scoring identifies uncertainty between candidates. Per-candidate validation loops explore alternatives when top tokens are invalid or overly risky.
Capability
Model-aware JSON rendering
Monitors model preferences for formatting styles and adapts grammar expectations in real time, ensuring higher parsing success across different model architectures.
Capability
Graceful fallbacks
Immediate error detection with automatic correction through adaptive fallbacks. Prevents error propagation without restarting inference from scratch.
Capability
Contextual repetition detection
Understands when repetition is valid (e.g., "1000000000") versus problematic, avoiding the crude penalties of traditional approaches that break valid outputs.
Capability
Continuously benchmarked
Refined through experimental research cycles and inference benchmarking on large datasets. Updated regularly to maintain state-of-the-art performance.
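The contextual repetition idea above can be sketched in a few lines: flag a run of identical trailing tokens only when it is not a plausible digit run, so a number like "1000000000" is left alone. This is a simplified illustration, not LM-Kit's detector:

```python
def repetition_risk(tokens, window=6):
    """Flag a run of identical trailing tokens as risky only when it is
    not a plausible numeric run (e.g. the zeros in "1000000000")."""
    if len(tokens) < window:
        return False
    tail = tokens[-window:]
    if len(set(tail)) > 1:
        return False  # no repetition in the recent window
    repeated = tail[0]
    # Digits repeating inside a number are legitimate; words repeating are not.
    return not repeated.isdigit()
```

A crude global repetition penalty would suppress the trailing zeros of a large number; conditioning the check on context avoids breaking valid output.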
LM-Kit trained models.
Purpose-built models optimized specifically for LM-Kit extraction tasks. The best option for maximum accuracy and speed.
LM-Kit Tasks model
A specialized model optimized for LM-Kit pipelines. Achieves state-of-the-art performance in classification, structured data extraction, language detection, and sentiment analysis while also supporting chat, embeddings, text generation, code completion, math reasoning, and vision understanding.
- Optimized for extraction accuracy
- Seamless integration with LM-Kit pipelines
- Multimodal: text and vision capable
- Compact size, efficient inference
lmkit-tasks
Context
Size
Quantization
Format
Extract from any content source.
Images, scans, PDFs, handwritten notes, Office documents. If it contains data, we can extract it.
Input
Images
Photos, scans, screenshots with automatic orientation detection.
Input
PDF documents
Digital PDFs, scanned documents, multi-page extraction.
Input
Office documents
Word, Excel, PowerPoint with layout preservation.
Input
Handwritten content
Notes, forms, signatures with VLM understanding.
Built-in local OCR engine.
Powerful OCR capabilities included out of the box, with support for custom OCR engine integration.
Built-in
Built-in OCR
LM-Kit includes a local OCR engine that works seamlessly with the extraction pipeline. No cloud calls, no external dependencies.
- Automatic language detection
- Orientation detection and correction
- 100% local processing
Pluggable
Custom OCR integration
Need a different OCR engine? The extraction pipeline supports pluggable OCR engines to match your specific requirements.
- Pluggable OCR engine interface
- Pre-integrated alternatives available
- Custom engine support
Rich data type support.
Define extraction schemas with typed fields. Get clean, validated output ready for integration.
String: Text values
Integer: Whole numbers
Float: Decimal values
Double: High precision
Bool: True/false
Date: Date values
Char: Single character
Arrays: Lists of any type
Plus nested objects and complex structures. View all supported types →
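As an illustration, a typed invoice schema with a nested array might look like the following. The field names and exact schema syntax here are hypothetical; consult the LM-Kit documentation for the authoritative format:

```json
{
  "vendor": { "type": "string", "description": "Supplier name" },
  "invoice_date": { "type": "date", "description": "Issue date" },
  "total": { "type": "float", "description": "Grand total including tax" },
  "paid": { "type": "bool" },
  "line_items": {
    "type": "array",
    "items": {
      "description": { "type": "string" },
      "quantity": { "type": "integer" },
      "amount": { "type": "float" }
    }
  }
}
```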
Invoice data extraction demo.
A complete demo showcasing multimodal extraction from invoice documents.
Invoice extraction demo
Interactive console app demonstrating structured data extraction from invoices in multiple languages with VLM and OCR integration, automatic language detection, and JSON output.
- Multiple vision-language models supported
- Built-in OCR with language detection
- Automatic orientation detection
- JSON schema configuration
- Sample invoices in French, Spanish, English
Extract anything from anything.
Unlimited use cases. Define your schema and extract.
Use case
Financial documents
Invoices, receipts, bank statements, expense reports. Extract vendor, amounts, dates, line items.
Use case
Legal contracts
NDAs, service agreements, employment contracts. Extract parties, dates, clauses, obligations.
Use case
Medical records
Patient records, lab results, prescriptions. Extract patient info, diagnoses, medications.
Use case
HR documents
Resumes, job offers, employment applications. Extract skills, experience, contact info.
Use case
Academic papers
Research papers, citations, abstracts. Extract authors, methodology, findings, references.
Use case
ID documents
Passports, driver's licenses, ID cards. Extract name, number, dates, nationality.
Broad model compatibility.
Use LM-Kit trained models for best results, or bring your own. We support a wide range of models and are constantly adding new ones.
LM-Kit supports many different models for extraction tasks. While third-party models work well, LM-Kit trained models deliver the best performance as they are specifically optimized for our extraction pipelines and symbolic AI layers. We continuously add new model support and refine our purpose-built models.
Browse full model catalog →
Key classes.
The building blocks for structured data extraction applications.
Class
TextExtraction
Main extraction class. Set content, define schema, parse. Supports text, images, PDFs, and Office documents.
View documentation →
Type
TextExtractionElement
Define extraction fields with name, type, description, and nested elements for complex structures.
View documentation →
Type
TextExtractionResult
Contains extracted elements and their JSON representation. Iterate typed elements or access raw JSON.
View documentation →
Type
Attachment
Represents input content. Supports PDF, images, Office docs with page-level access.
View documentation →
Extraction plus the rest of Document Intelligence.
Layout understanding
Paragraphs, reading order, and layout-aware spatial search drive extraction accuracy. Post-validate field positions with FindNear and FindInRegion.
OCR
Extraction over scans starts with OCR. Native engine plus VLM OCR for tables, formulas, charts, seals. SOTA benchmark accuracy, on-device, no per-page cost.
Document classification
Classify first, then run extraction with the right schema per category. Invoices and passports need different fields.
Document splitting
Split bundled multi-document scans into separate logical units, then extract from each.
Structured content creation
The grammar-constrained generation engine that powers extraction. The same primitive, presented for the generation-side audience.
Build it. Read it. Try it.
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.
Structured data extraction
Console demo: schema-driven extraction from any document.
Open on GitHub →
Demo
Invoice data extraction
Console demo: extract invoice fields with grammar-constrained generation.
Open on GitHub →
Demo
Resume parser
Console demo: parse resumes into structured candidate records.
Open on GitHub →
How-to guide
Extract structured data
Define a schema, validate output, fix errors automatically.
Read the guide →
Ready to extract with zero hallucinations?
The most advanced local data extraction engine. Symbolic AI layers and purpose-trained models. 100% on your infrastructure.