Intelligent Data Extraction

Transform any content into structured data.

Extract structured information from text, images, PDFs, and scanned documents using AI-powered schema inference. Define custom extraction schemas with JSON Schema, auto-discover fields, and get typed results with confidence scores. Processing runs 100% locally, with Dynamic Sampling for higher accuracy.

Schema discovery · Multimodal input · Confidence scoring

Live extraction preview

Unstructured input

"Invoice #INV-2024-0892 from Acme Corp dated Jan 15, 2024. Total amount: $4,250.00. Payment due within 30 days. Contact: [email protected]"

Structured JSON
{
  "invoiceNumber": "INV-2024-0892",
  "vendor": "Acme Corp",
  "date": "2024-01-15",
  "total": 4250.00,
  "paymentTerms": 30,
  "email": "[email protected]"
}

PDF & documents · Images & scans · VLM OCR engine · JSON Schema

10+ data types · Unlimited nested depth · 100% local

Turn chaos into structure with AI.

The TextExtraction engine transforms unstructured content from any source into typed, validated data. Define extraction schemas using code, JSON Schema, or let the AI auto-discover fields with SchemaDiscovery. Process text, images, PDFs, and scanned documents with a unified API.

Powered by LM-Kit's proprietary Dynamic Sampling technology, the extraction engine delivers exceptional accuracy even with smaller models, enabling deployment on edge devices without sacrificing quality.

New in 2026: Schema auto-discovery, page-range extraction, spatial bounds for extracted data, per-field confidence scores, VLM-powered document understanding, and fine-tuning dataset export.

Supported data types:

  • String: text values
  • Integer: whole numbers
  • Float: decimal numbers
  • Double: high precision
  • Bool: true/false
  • Date: date values
  • Char: single character
  • Object: nested structure
  • StringArray: text lists
  • IntegerArray: number lists
  • DateArray: date lists
  • ObjectArray: nested arrays

Core capabilities

Complete extraction toolkit.

Everything you need to build production-grade data extraction pipelines in .NET.

Discovery

Schema auto-discovery

Use SchemaDiscovery to automatically detect and suggest extraction fields from your content. Perfect for exploring unknown document formats or bootstrapping new extraction pipelines.

JSON Schema

JSON Schema support

Define extraction schemas using standard JSON Schema with SetElementsFromJsonSchema. Import existing schemas from your API contracts or database models for seamless integration.
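
As a rough illustration, the sketch below defines a JSON Schema for the invoice fields used elsewhere on this page and checks that it parses before handing it to the extractor. The schema content and property names are illustrative, and the commented-out call assumes SetElementsFromJsonSchema accepts the schema as a string, which is not confirmed here. It uses raw string literals, so it needs C# 11 or later.

```csharp
using System;
using System.Text.Json;

// Illustrative JSON Schema for the invoice fields; not a schema shipped with LM-Kit.
const string invoiceSchema = """
{
  "type": "object",
  "properties": {
    "invoiceNumber": { "type": "string", "description": "Invoice identifier" },
    "vendor":        { "type": "string" },
    "date":          { "type": "string", "format": "date" },
    "total":         { "type": "number" },
    "lineItems": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity":    { "type": "integer" },
          "unitPrice":   { "type": "number" }
        }
      }
    }
  },
  "required": ["invoiceNumber"]
}
""";

// Parse once up front so a malformed schema fails here rather than at extraction time.
using JsonDocument schemaDoc = JsonDocument.Parse(invoiceSchema);
JsonElement properties = schemaDoc.RootElement.GetProperty("properties");

// List each top-level field with its declared JSON type.
foreach (JsonProperty p in properties.EnumerateObject())
{
    string type = p.Value.GetProperty("type").GetString() ?? "unknown";
    Console.WriteLine($"{p.Name}: {type}");
}

// Assumed call shape (check the TextExtraction documentation before relying on it):
// extraction.SetElementsFromJsonSchema(invoiceSchema);
```

Validating the schema text separately keeps schema mistakes apart from extraction failures, which helps when the schema is generated from an API contract.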

Confidence

Confidence scoring

Every extraction includes a confidence score from 0 to 1 at both the result level and the field level. Use GetConfidence(path) to check uncertain fields and route them to human review.
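
A human review workflow can be as simple as a threshold check over the per-field scores. The sketch below is plain C# with no LM-Kit dependency: the dictionary holds hypothetical scores standing in for values read with GetConfidence(path), and both the 0.90 threshold and the FieldsNeedingReview helper are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical per-field scores, standing in for values read with GetConfidence(path).
var fieldConfidence = new Dictionary<string, double>
{
    ["InvoiceNumber"] = 0.98,
    ["Vendor"] = 0.91,
    ["Total"] = 0.62,
    ["Date"] = 0.85,
};

// Returns the fields whose score falls below the threshold, least confident first.
static List<string> FieldsNeedingReview(IReadOnlyDictionary<string, double> scores, double threshold) =>
    scores.Where(kv => kv.Value < threshold)
          .OrderBy(kv => kv.Value)
          .Select(kv => kv.Key)
          .ToList();

const double reviewThreshold = 0.90; // illustrative cut-off; tune per field and document type

List<string> toReview = FieldsNeedingReview(fieldConfidence, reviewThreshold);

Console.WriteLine(toReview.Count == 0
    ? "Auto-approve"
    : $"Send to human review: {string.Join(", ", toReview)}");
// prints "Send to human review: Total, Date"
```

Sorting by ascending score puts the least certain field first, so a reviewer sees the most likely error at the top of the queue.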

Nested

Nested objects & arrays

Extract complex hierarchical data with unlimited nesting depth. Define object arrays for line items, addresses, and other repeating structures. Access with EnumerateAt(path).

Bounds

Spatial bounds

Each extracted element exposes a Bounds property with its spatial location, so you know exactly where a value was found in the document for verification, highlighting, or downstream processing.

Pages

Page-range extraction

Extract from specific pages of multi-page documents using SetContent with page index or range parameters. Process only the pages you need for faster extraction.

Multimodal extraction

Extract from any source.

Process text, images, PDFs, and scanned documents with a unified API powered by OCR and Vision Language Models.

Vision-powered document understanding

LM-Kit's extraction engine supports multiple processing modalities. For clean digital documents, direct text extraction is fastest. For scanned documents, integrate the OcrEngine. For complex layouts with tables, forms, and mixed content, leverage Vision Language Models through PreferredInferenceModality.

The engine automatically handles handwritten notes, smartphone photos, receipts, ID cards, and any other visual content. Use VlmOcr for the highest quality document understanding with layout preservation.

  • PDF with embedded text or scanned pages
  • Office documents (DOCX, XLSX, PPTX)
  • Images (PNG, JPEG, WebP, TIFF)
  • HTML and Markdown content
  • Plain text from any source

Auto-detect

PDF documents

Digital or scanned pages.

VLM

Images

Photos, scans, screenshots.

Native

Office files

Word, Excel, PowerPoint.

OCR

Handwritten

Notes and forms.

Schema definition

Define what to extract.

Multiple ways to define extraction schemas for maximum flexibility.

JSON Schema

SetElementsFromJsonSchema

Import extraction schemas from standard JSON Schema definitions for seamless integration with existing systems.

  • Standard JSON Schema compatibility
  • Reuse API contracts as extraction schemas
  • Support for complex nested structures
  • Generate from database models

Output formatting

Control extraction output.

Fine-tune how extracted values are formatted and validated with TextExtractionElementFormat.

Case

TextCaseMode

Control text casing: uppercase, lowercase, title case, or preserve original.

Trim

TrimStart

Remove leading whitespace or specific characters from extracted values.

Required

IsRequired

Mark fields as mandatory. Extraction fails if required fields are not found.

Hint

FormatHint

Provide format hints such as email, phone, or URL using the PredefinedStringFormat enum.

Whitelist

WhitelistedValues

Constrain extraction to a predefined set of allowed values for enum-like fields.

Doubt

NullOnDoubt

Return null instead of an uncertain value when confidence falls below the threshold.

Context

MaximumContextLength

Limit context window size for processing large documents efficiently.

Title

Title & description

Guide extraction with document title and description for better accuracy.
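
To make the intent of these options concrete, the sketch below imitates three of them (trimming plus uppercase casing, a whitelist, and null-on-doubt) as ordinary post-processing in plain C#. It is an approximation of the behaviour described above, not LM-Kit's implementation, and the Normalize helper, the threshold value, and the currency list are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Approximates the described semantics; this is NOT how LM-Kit applies them internally.
// Returns null when the value is missing, below the confidence threshold, or not whitelisted.
static string? Normalize(string? raw, double confidence, double threshold, ISet<string> allowed)
{
    if (raw is null || confidence < threshold)
        return null;                                    // NullOnDoubt-style behaviour

    string value = raw.TrimStart().ToUpperInvariant();  // TrimStart plus an uppercase case mode
    return allowed.Contains(value) ? value : null;      // WhitelistedValues-style constraint
}

var currencies = new HashSet<string> { "USD", "EUR", "GBP" };

Console.WriteLine(Normalize("  usd", 0.93, 0.80, currencies) ?? "(null)"); // USD
Console.WriteLine(Normalize("eur", 0.41, 0.80, currencies) ?? "(null)");   // (null): low confidence
Console.WriteLine(Normalize("btc", 0.97, 0.80, currencies) ?? "(null)");   // (null): not whitelisted
```

Returning null rather than a best guess lets downstream code tell "not found or uncertain" apart from a real value, which is the main reason to enable an option like NullOnDoubt.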

Code example

Build an extraction pipeline.

Complete example showing schema definition, multimodal extraction, and result handling.

InvoiceExtraction.cs
using System;
using System.Collections.Generic;
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;

// Load model (text or vision-language model for images)
var model = LM.LoadFromModelID("gemma4:e4b");

// Create extraction instance
var extraction = new TextExtraction(model)
{
    Title = "Invoice Parser",
    Description = "Extract invoice details from documents",
    NullOnDoubt = true
};

// Define extraction schema with nested objects and arrays
extraction.Elements = new List<TextExtractionElement>
{
    new("InvoiceNumber", ElementType.String, "Invoice identifier")
    {
        Format = new TextExtractionElementFormat
        {
            IsRequired = true,
            TextCaseMode = TextCaseMode.Uppercase
        }
    },
    new("Vendor", ElementType.String, "Vendor company name"),
    new("Date", ElementType.Date, "Invoice date"),
    new("Total", ElementType.Double, "Total amount"),
    new("LineItems", ElementType.Object, "Invoice line items")
    {
        IsArray = true,
        InnerElements = new List<TextExtractionElement>
        {
            new("Description", ElementType.String),
            new("Quantity", ElementType.Integer),
            new("UnitPrice", ElementType.Double)
        }
    }
};

// Load content (text, image, or PDF)
extraction.SetContent(new Attachment("invoice.pdf"));

// Or extract from specific pages
// extraction.SetContent(new Attachment("invoice.pdf"), pageIndex: 0);

// Parse and get results
TextExtractionResult result = await extraction.ParseAsync();

// Access typed values
var invoiceNum = result.GetValue<string>("InvoiceNumber");
var total = result.GetValue<double>("Total", 0.0);
var confidence = result.GetConfidence("Total");

// Enumerate array items
foreach (var item in result.EnumerateAt("LineItems"))
{
    var desc = item.Get("Description").As<string>();
    var qty = item.Get("Quantity").As<int>();
    Console.WriteLine($"{desc}: {qty} units");
}

// Get raw JSON output
Console.WriteLine(result.Json);

// Overall confidence score
Console.WriteLine($"Confidence: {result.Confidence:P0}");
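
Once the raw JSON is available, it can be consumed with standard .NET tooling. The sketch below uses System.Text.Json on a sample payload that mirrors the structured JSON from the live preview (without the contact field); in a real pipeline the string would come from result.Json, and the key names would follow your own schema. It assumes .NET 6 or later for DateOnly and C# 11 for raw string literals.

```csharp
using System;
using System.Globalization;
using System.Text.Json;

// Sample payload mirroring the preview output; a pipeline would use result.Json instead.
const string json = """
{
  "invoiceNumber": "INV-2024-0892",
  "vendor": "Acme Corp",
  "date": "2024-01-15",
  "total": 4250.00,
  "paymentTerms": 30
}
""";

using JsonDocument invoiceDoc = JsonDocument.Parse(json);
JsonElement root = invoiceDoc.RootElement;

// Read each field with the accessor that matches its expected type.
decimal total = root.GetProperty("total").GetDecimal();
int termsInDays = root.GetProperty("paymentTerms").GetInt32();
DateOnly issued = DateOnly.ParseExact(
    root.GetProperty("date").GetString()!, "yyyy-MM-dd", CultureInfo.InvariantCulture);

// Derive the payment due date from the extracted terms.
DateOnly due = issued.AddDays(termsInDays);

Console.WriteLine($"Total: {total.ToString("F2", CultureInfo.InvariantCulture)}");
Console.WriteLine($"Due:   {due.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture)}");
// prints "Total: 4250.00" then "Due:   2024-02-14"
```

Using decimal for monetary amounts avoids the rounding surprises of binary floating point when totals are summed or compared later.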

Get started

Demo applications.

Explore working examples to accelerate your development. Clone, run, and customize.

CLI

Invoice data extraction

Real-world invoice parsing with vendor, amounts, line items, and payment terms extraction.

CLI · NER

Named entity recognition

Extract people, organizations, locations, dates, and custom entity types from text.

CLI · Privacy

PII extraction

Detect and extract personally identifiable information for compliance and data protection.

CLI · Batch

Batch PII extraction

Process multiple documents at scale with parallel PII detection and extraction.

CLI · Web

Web content extractor

Extract structured JSON data from web pages and HTML content automatically.

Industry applications

Where to use data extraction.

Transform unstructured content into actionable data across industries.

Finance

Invoice processing

Extract vendor, amounts, line items, tax details, and payment terms from invoices in any format.

Legal

Contract analysis

Identify parties, obligations, dates, clauses, and terms from legal contracts and agreements.

Healthcare

Medical records

Parse patient data, diagnoses, medications, lab results, and treatment history from clinical documents.

HR

Resume parsing

Extract candidate details, skills, experience, education, and certifications from CVs and resumes.

Retail

Receipt capture

Digitize receipts from photos: merchant, items, prices, taxes, and payment method.

Forms

Form processing

Automate data entry from scanned forms, applications, and surveys with field-level extraction.

Technology

Powered by Dynamic Sampling.

LM-Kit's proprietary Dynamic Sampling technology optimizes token generation in real time, dramatically improving extraction accuracy even with smaller, faster models. This enables deployment on edge devices without sacrificing quality.

The extraction engine continuously adapts sampling parameters based on schema constraints, confidence thresholds, and output validation. The result: structured data extraction that rivals cloud-based solutions, running entirely on your infrastructure.

Accuracy

Higher accuracy

Achieve 95%+ extraction accuracy on complex documents with optimized sampling.

Speed

Faster processing

Reduced token generation time means faster extraction cycles.

Compact

Smaller models

Get LLM-quality results from 4B-parameter models on consumer hardware.

Edge

Edge deployment

Run extraction pipelines on laptops, mobile devices, and IoT systems.

API reference

Key classes & methods.

Core components for building data extraction pipelines.

TextExtraction

Main extraction engine. Configure schema, content, and processing options. Supports sync and async parsing.

TextExtractionResult

Extraction results with typed accessors, JSON output, confidence scores, and array enumeration.

TextExtractionElement

Schema element definition with name, type, description, nesting, and format options.

TextExtractionElementFormat

Output formatting options: case mode, required flag, whitelist, and predefined formats.

Attachment

Input container for text, images, PDFs, and Office documents. Supports streams, files, and URIs.

ElementType

Enumeration of supported data types including primitives, dates, and array variants.

Ready to build intelligent data extraction?

Transform unstructured content into structured data. 100% local, 100% your infrastructure.