Transform Any Content Into Structured Data.
Extract structured information from text, images, PDFs, and scanned documents using AI-powered schema inference. Define custom extraction schemas with JSON, auto-discover fields, and get typed results with confidence scores. 100% local processing with Dynamic Sampling for maximum accuracy.
"Invoice #INV-2024-0892 from Acme Corp dated Jan 15, 2024. Total amount: $4,250.00. Payment due within 30 days. Contact: [email protected]"
{
"invoiceNumber": "INV-2024-0892",
"vendor": "Acme Corp",
"date": "2024-01-15",
"total": 4250.00,
"paymentTerms": 30,
"email": "[email protected]"
}
Turn Chaos Into Structure with AI
The TextExtraction engine transforms unstructured content from any source
into typed, validated data. Define extraction schemas using code, JSON Schema, or let the
AI auto-discover fields with SchemaDiscovery. Process text, images, PDFs,
and scanned documents with a unified API.
Powered by LM-Kit's proprietary Dynamic Sampling technology, the extraction engine delivers exceptional accuracy even with smaller models, enabling deployment on edge devices without sacrificing quality.
New in 2026: Schema auto-discovery, page-range extraction, spatial bounds for extracted data, per-field confidence scores, VLM-powered document understanding, and fine-tuning dataset export.
String
Text values
Integer
Whole numbers
Float
Decimal numbers
Double
High precision
Bool
True/false
Date
Date values
Char
Single character
Object
Nested structure
StringArray
Text lists
IntegerArray
Number lists
DateArray
Date lists
ObjectArray
Nested arrays
Complete Extraction Toolkit
Everything you need to build production-grade data extraction pipelines in .NET.
Schema Auto-Discovery
Use SchemaDiscovery to automatically detect and suggest extraction fields from your content. Perfect for exploring unknown document formats or bootstrapping new extraction pipelines.
JSON Schema Support
Define extraction schemas using standard JSON Schema with SetElementsFromJsonSchema. Import existing schemas from your API contracts or database models for seamless integration.
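As a rough illustration, the invoice fields from the example at the top of this page could be described with a JSON Schema like the one below. This is only a sketch: the `InvoiceSchema` class and its `PropertyNames` helper are hypothetical names introduced here, and the snippet checks that the schema is well-formed JSON using `System.Text.Json` rather than calling LM-Kit. The exact parameters of `SetElementsFromJsonSchema` are not stated on this page, so that call is intentionally left out. The raw string literal assumes C# 11 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Hypothetical holder for a schema document; not part of LM-Kit.
public static class InvoiceSchema
{
    // Standard JSON Schema mirroring the invoice example fields.
    public const string Json = """
    {
      "type": "object",
      "properties": {
        "invoiceNumber": { "type": "string" },
        "vendor": { "type": "string" },
        "date": { "type": "string", "format": "date" },
        "total": { "type": "number" },
        "paymentTerms": { "type": "integer" },
        "email": { "type": "string", "format": "email" }
      },
      "required": ["invoiceNumber"]
    }
    """;

    // Parses the schema and returns its top-level property names in
    // document order, as a quick sanity check before handing the schema
    // to the extraction engine.
    public static List<string> PropertyNames()
    {
        using var doc = JsonDocument.Parse(Json);
        return doc.RootElement
            .GetProperty("properties")
            .EnumerateObject()
            .Select(p => p.Name)
            .ToList();
    }
}
```

A schema string such as `InvoiceSchema.Json` is the kind of input `SetElementsFromJsonSchema` is described as importing. Validating it up front keeps malformed schemas from reaching the extraction step.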
Confidence Scoring
Every extraction includes a confidence score (0-1) at both result and field level. Use GetConfidence(path) to validate uncertain fields and implement human review workflows.
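One possible shape for a human review workflow is sketched below. It assumes the per-field scores have already been collected into a dictionary, for example by calling `GetConfidence(path)` for each field of interest. The `ReviewRouter` class and the 0.8 threshold used in the usage note are hypothetical choices made for illustration, not part of LM-Kit.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of LM-Kit: splits field paths into
// "accepted" and "needs review" based on a confidence threshold.
public static class ReviewRouter
{
    public static (List<string> Accepted, List<string> NeedsReview) Route(
        IReadOnlyDictionary<string, double> confidenceByPath,
        double threshold)
    {
        var accepted = new List<string>();
        var needsReview = new List<string>();

        // Sort by path so the output order is deterministic.
        foreach (var (path, confidence) in
                 confidenceByPath.OrderBy(kv => kv.Key, StringComparer.Ordinal))
        {
            if (confidence >= threshold)
                accepted.Add(path);
            else
                needsReview.Add(path);
        }

        return (accepted, needsReview);
    }
}
```

With scores of 0.98 for `InvoiceNumber`, 0.91 for `Vendor`, and 0.62 for `Total`, calling `ReviewRouter.Route(scores, 0.8)` would accept the first two fields and flag `Total` for manual review. The right threshold depends on the documents and the cost of an error, so it is best tuned on real data.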
Nested Objects & Arrays
Extract complex hierarchical data with unlimited nesting depth. Define object arrays for line items, addresses, and other repeating structures. Access with EnumerateAt(path).
Spatial Bounds
Extracted elements include Bounds property with spatial location. Know exactly where each value was found in the document for verification, highlighting, or downstream processing.
Page-Range Extraction
Extract from specific pages of multi-page documents using SetContent with page index or range parameters. Process only the pages you need for faster extraction.
Extract From Any Source
Process text, images, PDFs, and scanned documents with a unified API powered by OCR and Vision Language Models.
Vision-Powered Document Understanding
LM-Kit's extraction engine supports multiple processing modalities. For clean digital
documents, direct text extraction is fastest. For scanned documents, integrate the
OcrEngine. For complex layouts with tables, forms, and mixed content,
leverage Vision Language Models through PreferredInferenceModality.
The engine automatically handles handwritten notes, smartphone photos, receipts,
ID cards, and any other visual content. Use VlmOcr for the highest
quality document understanding with layout preservation.
- PDF with embedded text or scanned pages
- Office documents (DOCX, XLSX, PPTX)
- Images (PNG, JPEG, WebP, TIFF)
- HTML and Markdown content
- Plain text from any source
- PDF Documents: digital or scanned pages (Auto-detect)
- Images: photos, scans, screenshots (VLM)
- Office Files: Word, Excel, PowerPoint (Native)
- Handwritten: notes and forms (OCR)
Define What to Extract
Multiple ways to define extraction schemas for maximum flexibility.
Define extraction elements programmatically with full type safety and IntelliSense support.
- Name, type, and description for each field
- Nested objects with InnerElements
- Array support with IsArray
- Format constraints via TextExtractionElementFormat
Import extraction schemas from standard JSON Schema definitions for seamless integration with existing systems.
- Standard JSON Schema compatibility
- Reuse API contracts as extraction schemas
- Support for complex nested structures
- Generate from database models
Control Extraction Output
Fine-tune how extracted values are formatted and validated with TextExtractionElementFormat.
TextCaseMode
Control text casing: uppercase, lowercase, title case, or preserve original.
TrimStart
Remove leading whitespace or specific characters from extracted values.
IsRequired
Mark fields as mandatory. Extraction fails if required fields are not found.
FormatHint
Provide format hints like email, phone, URL with PredefinedStringFormat enum.
WhitelistedValues
Constrain extraction to a predefined set of allowed values for enum-like fields.
NullOnDoubt
Return null instead of uncertain values when confidence is below threshold.
MaximumContextLength
Limit context window size for processing large documents efficiently.
Title & Description
Guide extraction with document title and description for better accuracy.
Build an Extraction Pipeline
Complete example showing schema definition, multimodal extraction, and result handling.
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;

// Load model (text or vision-language model for images)
var model = LM.LoadFromModelID("gemma3:4b");

// Create extraction instance
var extraction = new TextExtraction(model)
{
    Title = "Invoice Parser",
    Description = "Extract invoice details from documents",
    NullOnDoubt = true
};

// Define extraction schema with nested objects and arrays
extraction.Elements = new List<TextExtractionElement>
{
    new("InvoiceNumber", ElementType.String, "Invoice identifier")
    {
        Format = new TextExtractionElementFormat
        {
            IsRequired = true,
            TextCaseMode = TextCaseMode.Uppercase
        }
    },
    new("Vendor", ElementType.String, "Vendor company name"),
    new("Date", ElementType.Date, "Invoice date"),
    new("Total", ElementType.Double, "Total amount"),
    new("LineItems", ElementType.Object, "Invoice line items")
    {
        IsArray = true,
        InnerElements = new List<TextExtractionElement>
        {
            new("Description", ElementType.String),
            new("Quantity", ElementType.Integer),
            new("UnitPrice", ElementType.Double)
        }
    }
};

// Load content (text, image, or PDF)
extraction.SetContent(new Attachment("invoice.pdf"));

// Or extract from specific pages
// extraction.SetContent(new Attachment("invoice.pdf"), pageIndex: 0);

// Parse and get results
TextExtractionResult result = await extraction.ParseAsync();

// Access typed values
var invoiceNum = result.GetValue<string>("InvoiceNumber");
var total = result.GetValue<double>("Total", 0.0);
var confidence = result.GetConfidence("Total");

// Enumerate array items
foreach (var item in result.EnumerateAt("LineItems"))
{
    var desc = item.Get("Description").As<string>();
    var qty = item.Get("Quantity").As<int>();
    Console.WriteLine($"{desc}: {qty} units");
}

// Get raw JSON output
Console.WriteLine(result.Json);

// Overall confidence score
Console.WriteLine($"Confidence: {result.Confidence:P0}");
Demo Applications
Explore working examples to accelerate your development. Clone, run, and customize.
Structured Data Extraction
Extract data from invoices, job offers, medical records, and more. Full TextExtraction API demonstration.
CLI Core
Invoice Data Extraction
Real-world invoice parsing with vendor, amounts, line items, and payment terms extraction.
CLI
Named Entity Recognition
Extract people, organizations, locations, dates, and custom entity types from text.
CLI NER
PII Extraction
Detect and extract personally identifiable information for compliance and data protection.
CLI Privacy
Batch PII Extraction
Process multiple documents at scale with parallel PII detection and extraction.
CLI Batch
Web Content Extractor
Extract structured JSON data from web pages and HTML content automatically.
CLI Web
Related Document Processing Demos
Where to Use Data Extraction
Transform unstructured content into actionable data across industries.
Invoice Processing
Extract vendor, amounts, line items, tax details, and payment terms from invoices in any format.
Contract Analysis
Identify parties, obligations, dates, clauses, and terms from legal contracts and agreements.
Medical Records
Parse patient data, diagnoses, medications, lab results, and treatment history from clinical documents.
Resume Parsing
Extract candidate details, skills, experience, education, and certifications from CVs and resumes.
Receipt Capture
Digitize receipts from photos: merchant, items, prices, taxes, and payment method.
Form Processing
Automate data entry from scanned forms, applications, and surveys with field-level extraction.
Powered by Dynamic Sampling
LM-Kit's proprietary Dynamic Sampling technology optimizes token generation in real-time, dramatically improving extraction accuracy even with smaller, faster models. This enables deployment on edge devices without sacrificing quality.
The extraction engine continuously adapts sampling parameters based on schema constraints, confidence thresholds, and output validation. The result: structured data extraction that rivals cloud-based solutions, running entirely on your infrastructure.
Higher Accuracy
Achieve 95%+ extraction accuracy on complex documents with optimized sampling.
Faster Processing
Reduced token generation time means faster extraction cycles.
Smaller Models
Get results comparable to larger LLMs from 4B-parameter models on consumer hardware.
Edge Deployment
Run extraction pipelines on laptops, mobile devices, and IoT systems.
Key Classes & Methods
Core components for building data extraction pipelines.
TextExtraction
Main extraction engine. Configure schema, content, and processing options. Supports sync and async parsing.
TextExtractionResult
Extraction results with typed accessors, JSON output, confidence scores, and array enumeration.
TextExtractionElement
Schema element definition with name, type, description, nesting, and format options.
TextExtractionElementFormat
Output formatting options: case mode, required flag, whitelist, and predefined formats.
Attachment
Input container for text, images, PDFs, and Office documents. Supports streams, files, and URIs.
ElementType
Enumeration of supported data types including primitives, dates, and array variants.
Demo Applications
Explore working examples to accelerate your development. All demos run immediately with no additional setup.
Structured Data Extraction
Extract structured data from invoices, job offers, medical records, and more using customizable schemas.
CLI
Invoice Data Extraction
Practical example of parsing invoices with vendor, amounts, line items, and payment terms.
CLI
Named Entity Recognition
Extract entities like people, organizations, locations, dates, and custom entity types.
CLI
PII Extraction
Detect and extract personally identifiable information for compliance and data protection.
CLI
Batch PII Extraction
Process multiple documents at scale with parallel extraction and aggregated results.
CLI
Web Content Extractor
Extract structured JSON data from web pages and HTML content automatically.
CLI
Related Document Processing Demos
Ready to Build Intelligent Data Extraction?
Transform unstructured content into structured data. 100% local, 100% your infrastructure.