Live extraction preview
Extract structured information from text, images, PDFs, and scanned documents using AI-powered schema inference. Define custom extraction schemas with JSON, auto-discover fields, and get typed results with confidence scores. 100% local processing with Dynamic Sampling for maximum accuracy.
"Invoice #INV-2024-0892 from Acme Corp dated Jan 15, 2024. Total amount: $4,250.00. Payment due within 30 days. Contact: [email protected]"
{
  "invoiceNumber": "INV-2024-0892",
  "vendor": "Acme Corp",
  "date": "2024-01-15",
  "total": 4250.00,
  "paymentTerms": 30,
  "email": "[email protected]"
}
The TextExtraction engine transforms unstructured content from any source into typed, validated data. Define extraction schemas in code or standard JSON Schema, or let the AI auto-discover fields with SchemaDiscovery. Process text, images, PDFs, and scanned documents with a unified API.
Powered by LM-Kit's proprietary Dynamic Sampling technology, the extraction engine delivers exceptional accuracy even with smaller models, enabling deployment on edge devices without sacrificing quality.
New in 2026: Schema auto-discovery, page-range extraction, spatial bounds for extracted data, per-field confidence scores, VLM-powered document understanding, and fine-tuning dataset export.
Everything you need to build production-grade data extraction pipelines in .NET.
Discovery
Use SchemaDiscovery to automatically detect and suggest extraction fields from your content. Perfect for exploring unknown document formats or bootstrapping new extraction pipelines.
JSON Schema
Define extraction schemas using standard JSON Schema with SetElementsFromJsonSchema. Import existing schemas from your API contracts or database models for seamless integration.
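The invoice fields from the preview above can be expressed as a standard JSON Schema and handed to SetElementsFromJsonSchema. The fragment below is an illustrative sketch; the exact JSON Schema subset the importer accepts may differ, and the `date` format and `required` list are assumptions.

```json
{
  "type": "object",
  "properties": {
    "invoiceNumber": { "type": "string" },
    "vendor": { "type": "string" },
    "date": { "type": "string", "format": "date" },
    "total": { "type": "number" },
    "lineItems": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "integer" },
          "unitPrice": { "type": "number" }
        }
      }
    }
  },
  "required": ["invoiceNumber"]
}
```

Because this is the same schema language used by API contracts and OpenAPI definitions, existing schemas can usually be reused without translation.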
Confidence
Every extraction includes a confidence score (0-1) at both result and field level. Use GetConfidence(path) to validate uncertain fields and implement human review workflows.
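A typical review workflow routes any field whose confidence falls below a cutoff to a human queue. The sketch below uses plain C# with a dictionary standing in for per-field scores; in practice the values would come from GetConfidence(path), and the 0.85 threshold is an arbitrary example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Per-field confidence scores as returned by GetConfidence(path);
// the values below are illustrative stand-ins.
var confidences = new Dictionary<string, double>
{
    ["InvoiceNumber"] = 0.98,
    ["Vendor"]        = 0.91,
    ["Total"]         = 0.62   // below threshold -> needs review
};

// Route any field under the threshold to a human-review queue.
List<string> FieldsNeedingReview(IDictionary<string, double> scores, double threshold) =>
    scores.Where(kv => kv.Value < threshold)
          .Select(kv => kv.Key)
          .ToList();

var review = FieldsNeedingReview(confidences, 0.85);
Console.WriteLine(string.Join(", ", review)); // prints "Total"
```

Fields that pass the threshold flow straight into downstream systems; only the uncertain ones cost human time.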
Nested
Extract complex hierarchical data with unlimited nesting depth. Define object arrays for line items, addresses, and other repeating structures. Access with EnumerateAt(path).
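A nested result with a repeating LineItems structure serializes to ordinary JSON. The sketch below walks that shape with System.Text.Json to show what EnumerateAt("LineItems") traverses; the values are illustrative and the path syntax is taken from this page.

```csharp
using System;
using System.Text.Json;

// JSON shape of a nested extraction result (illustrative values).
const string json = """
{
  "Vendor": "Acme Corp",
  "LineItems": [
    { "Description": "Widget", "Quantity": 3, "UnitPrice": 12.50 },
    { "Description": "Gadget", "Quantity": 1, "UnitPrice": 99.00 }
  ]
}
""";

using var doc = JsonDocument.Parse(json);

// Walk the array the same way EnumerateAt("LineItems") would.
foreach (var item in doc.RootElement.GetProperty("LineItems").EnumerateArray())
{
    var desc = item.GetProperty("Description").GetString();
    var qty  = item.GetProperty("Quantity").GetInt32();
    Console.WriteLine($"{desc}: {qty} units");
}
```

Deeper nesting works the same way: each level of the schema becomes another object or array level in the output.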
Bounds
Extracted elements include Bounds property with spatial location. Know exactly where each value was found in the document for verification, highlighting, or downstream processing.
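To draw a highlight over the source document, bounds must be mapped into the coordinate space of the rendered page. The helper below is hypothetical: it assumes page-relative coordinates in the 0..1 range, which may not match the real Bounds convention, and the page dimensions are just an example.

```csharp
using System;

// Hypothetical normalized bounds (0..1, page-relative) for an extracted value.
// The real Bounds property's coordinate convention may differ.
var (x, y, w, h) = (0.12, 0.08, 0.25, 0.03);

// Scale to pixel coordinates for a rendered page (e.g. A4 at 150 DPI: 1240x1754).
(int X, int Y, int W, int H) ToPixels(double nx, double ny, double nw, double nh, int pageW, int pageH) =>
    ((int)Math.Round(nx * pageW), (int)Math.Round(ny * pageH),
     (int)Math.Round(nw * pageW), (int)Math.Round(nh * pageH));

var rect = ToPixels(x, y, w, h, 1240, 1754);
Console.WriteLine($"Highlight at ({rect.X},{rect.Y}) size {rect.W}x{rect.H}");
```

The same rectangle can drive verification UIs, redaction, or cross-checks against OCR word positions.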
Pages
Extract from specific pages of multi-page documents using SetContent with page index or range parameters. Process only the pages you need for faster extraction.
Process text, images, PDFs, and scanned documents with a unified API powered by OCR and Vision Language Models.
Vision-powered document understanding
LM-Kit's extraction engine supports multiple processing modalities. For clean
digital documents, direct text extraction is fastest. For scanned documents,
integrate the OcrEngine. For complex layouts with tables, forms, and
mixed content, leverage Vision Language Models through
PreferredInferenceModality.
The engine automatically handles handwritten notes, smartphone photos, receipts,
ID cards, and any other visual content. Use VlmOcr for the highest
quality document understanding with layout preservation.
Auto-detect
Digital or scanned pages.
VLM
Photos, scans, screenshots.
Native
Word, Excel, PowerPoint.
OCR
Notes and forms.
Multiple ways to define extraction schemas for maximum flexibility.
Code-first
TextExtractionElement API: define extraction elements programmatically with full type safety and IntelliSense support. Related members: InnerElements, IsArray, TextExtractionElementFormat.
JSON Schema
SetElementsFromJsonSchema: import extraction schemas from standard JSON Schema definitions for seamless integration with existing systems.
Fine-tune how extracted values are formatted and validated with TextExtractionElementFormat.
Case
TextCaseMode: control text casing: uppercase, lowercase, title case, or preserve the original.
Trim
TrimStart: remove leading whitespace or specific characters from extracted values.
Required
IsRequired: mark fields as mandatory; extraction fails if required fields are not found.
Hint
FormatHint: provide format hints like email, phone, or URL with the PredefinedStringFormat enum.
Whitelist
WhitelistedValues: constrain extraction to a predefined set of allowed values for enum-like fields.
Doubt
NullOnDoubt: return null instead of uncertain values when confidence is below the threshold.
Context
MaximumContextLength: limit the context window size for processing large documents efficiently.
Title
Guide extraction with document title and description for better accuracy.
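Taken together, these options describe a normalization pass over each extracted value. The plain C# sketch below mimics that pass for a payment-terms field; it is purely illustrative (the engine applies equivalent constraints during generation), and the 0.80 doubt threshold and whitelist entries are assumptions.

```csharp
using System;
using System.Linq;

// Sketch of what the format options describe: trim, case mode,
// whitelist, and null-on-doubt, applied to one raw value.
string? ApplyFormat(string? raw, double confidence)
{
    const double doubtThreshold = 0.80;                       // NullOnDoubt cutoff (assumed)
    string[] whitelist = { "NET30", "NET60", "DUE ON RECEIPT" };

    if (raw is null || confidence < doubtThreshold) return null;  // NullOnDoubt
    var value = raw.TrimStart().TrimEnd();                        // TrimStart / trimming
    value = value.ToUpperInvariant();                             // TextCaseMode.Uppercase
    return whitelist.Contains(value) ? value : null;              // WhitelistedValues
}

Console.WriteLine(ApplyFormat("  net30 ", 0.95));        // prints "NET30"
Console.WriteLine(ApplyFormat("net90", 0.95) is null);   // True: not whitelisted
```

Constraining output this way is what lets downstream code treat extracted values as typed data rather than free text.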
Complete example showing schema definition, multimodal extraction, and result handling.
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;

// Load model (text or vision-language model for images)
var model = LM.LoadFromModelID("gemma4:e4b");

// Create extraction instance
var extraction = new TextExtraction(model)
{
    Title = "Invoice Parser",
    Description = "Extract invoice details from documents",
    NullOnDoubt = true
};

// Define extraction schema with nested objects and arrays
extraction.Elements = new List<TextExtractionElement>
{
    new("InvoiceNumber", ElementType.String, "Invoice identifier")
    {
        Format = new TextExtractionElementFormat
        {
            IsRequired = true,
            TextCaseMode = TextCaseMode.Uppercase
        }
    },
    new("Vendor", ElementType.String, "Vendor company name"),
    new("Date", ElementType.Date, "Invoice date"),
    new("Total", ElementType.Double, "Total amount"),
    new("LineItems", ElementType.Object, "Invoice line items")
    {
        IsArray = true,
        InnerElements = new List<TextExtractionElement>
        {
            new("Description", ElementType.String),
            new("Quantity", ElementType.Integer),
            new("UnitPrice", ElementType.Double)
        }
    }
};

// Load content (text, image, or PDF)
extraction.SetContent(new Attachment("invoice.pdf"));
// Or extract from specific pages
// extraction.SetContent(new Attachment("invoice.pdf"), pageIndex: 0);

// Parse and get results
TextExtractionResult result = await extraction.ParseAsync();

// Access typed values
var invoiceNum = result.GetValue<string>("InvoiceNumber");
var total = result.GetValue<double>("Total", 0.0);
var confidence = result.GetConfidence("Total");

// Enumerate array items
foreach (var item in result.EnumerateAt("LineItems"))
{
    var desc = item.Get("Description").As<string>();
    var qty = item.Get("Quantity").As<int>();
    Console.WriteLine($"{desc}: {qty} units");
}

// Get raw JSON output
Console.WriteLine(result.Json);

// Overall confidence score
Console.WriteLine($"Confidence: {result.Confidence:P0}");
Explore working examples to accelerate your development. Clone, run, and customize.
Featured · Core
Extract data from invoices, job offers, medical records, and more. Full TextExtraction API demonstration.
CLI
Real-world invoice parsing with vendor, amounts, line items, and payment terms extraction.
CLI · NER
Extract people, organizations, locations, dates, and custom entity types from text.
CLI · Privacy
Detect and extract personally identifiable information for compliance and data protection.
CLI · Batch
Process multiple documents at scale with parallel PII detection and extraction.
CLI · Web
Extract structured JSON data from web pages and HTML content automatically.
Related document processing demos
Transform unstructured content into actionable data across industries.
Finance
Extract vendor, amounts, line items, tax details, and payment terms from invoices in any format.
Legal
Identify parties, obligations, dates, clauses, and terms from legal contracts and agreements.
Healthcare
Parse patient data, diagnoses, medications, lab results, and treatment history from clinical documents.
HR
Extract candidate details, skills, experience, education, and certifications from CVs and resumes.
Retail
Digitize receipts from photos: merchant, items, prices, taxes, and payment method.
Forms
Automate data entry from scanned forms, applications, and surveys with field-level extraction.
LM-Kit's proprietary Dynamic Sampling technology optimizes token generation in real-time, dramatically improving extraction accuracy even with smaller, faster models. This enables deployment on edge devices without sacrificing quality.
The extraction engine continuously adapts sampling parameters based on schema constraints, confidence thresholds, and output validation. The result: structured data extraction that rivals cloud-based solutions, running entirely on your infrastructure.
Accuracy
Achieve 95%+ extraction accuracy on complex documents with optimized sampling.
Speed
Reduced token generation time means faster extraction cycles.
Compact
Get LLM-quality results from 4B parameter models on consumer hardware.
Edge
Run extraction pipelines on laptops, mobile devices, and IoT systems.
Core components for building data extraction pipelines.
TextExtraction: main extraction engine. Configure schema, content, and processing options. Supports sync and async parsing.
TextExtractionResult: extraction results with typed accessors, JSON output, confidence scores, and array enumeration.
TextExtractionElement: schema element definition with name, type, description, nesting, and format options.
TextExtractionElementFormat: output formatting options: case mode, required flag, whitelist, and predefined formats.
Attachment: input container for text, images, PDFs, and Office documents. Supports streams, files, and URIs.
ElementType: enumeration of supported data types including primitives, dates, and array variants.
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.
Console demo: typed C# objects from any document. Open on GitHub →
How-to guide: schema definition, grammar-constrained generation, validation. Read the guide →
How-to guide: have the model infer the schema from examples. Read the guide →
Transform unstructured content into structured data. 100% local, 100% your infrastructure.