Transform any content into structured data.

Extract structured information from text, images, PDFs, and scanned documents using AI-powered schema inference. Define custom extraction schemas with JSON, auto-discover fields, and get typed results with confidence scores. Processing is 100% local, and Dynamic Sampling improves accuracy.

Schema discovery · Multimodal input · Confidence scoring

Live extraction preview

Unstructured input

"Invoice #INV-2024-0892 from Acme Corp dated Jan 15, 2024. Total amount: $4,250.00. Payment due within 30 days. Contact: [email protected]"

Structured JSON

{
  "invoiceNumber": "INV-2024-0892",
  "vendor": "Acme Corp",
  "date": "2024-01-15",
  "total": 4250.00,
  "paymentTerms": 30,
  "email": "[email protected]"
}

PDF & documents

Images & scans

VLM OCR engine

JSON Schema

10+
Data types

∞
Nested depth

100%
Local

Turn chaos into structure with AI.

The TextExtraction engine transforms unstructured content from any source into typed, validated data. Define extraction schemas in code or with JSON Schema, or let the AI auto-discover fields with SchemaDiscovery. Process text, images, PDFs, and scanned documents with a unified API.

Powered by LM-Kit's proprietary Dynamic Sampling technology, the extraction engine delivers exceptional accuracy even with smaller models, enabling deployment on edge devices without sacrificing quality.

New in 2026: Schema auto-discovery, page-range extraction, spatial bounds for extracted data, per-field confidence scores, VLM-powered document understanding, and fine-tuning dataset export.

String: Text values
Integer: Whole numbers
Float: Decimal numbers
Double: High precision
Bool: True/false
Date: Date values
Char: Single character
Object: Nested structure
StringArray: Text lists
IntegerArray: Number lists
DateArray: Date lists
ObjectArray: Nested arrays

Core capabilities

Complete extraction toolkit.

Everything you need to build production-grade data extraction pipelines in .NET.

Discovery

Schema auto-discovery

Use SchemaDiscovery to automatically detect and suggest extraction fields from your content. Perfect for exploring unknown document formats or bootstrapping new extraction pipelines.

JSON Schema

JSON Schema support

Define extraction schemas using standard JSON Schema with SetElementsFromJsonSchema. Import existing schemas from your API contracts or database models for seamless integration.

Confidence

Confidence scoring

Every extraction includes a confidence score from 0 to 1 at both the result level and the field level. Use GetConfidence(path) to flag uncertain fields and route them to a human review workflow.
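
As a rough illustration of such a review workflow, the sketch below splits field paths into "accepted" and "needs review" buckets by comparing each field's confidence against a threshold. The `ReviewRouter` class, its `Split` method, and the 0.80 threshold are hypothetical names and values chosen for this example, not part of LM-Kit. The confidence lookup is passed in as a delegate so the snippet runs without a model; in a real pipeline it would presumably wrap `result.GetConfidence(path)`, assuming that call returns a numeric score.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of LM-Kit): splits field paths by confidence.
public static class ReviewRouter
{
    public static (List<string> Accepted, List<string> NeedsReview) Split(
        IEnumerable<string> paths,
        Func<string, double> getConfidence,
        double threshold = 0.80) // illustrative threshold, tune per use case
    {
        var accepted = new List<string>();
        var needsReview = new List<string>();

        foreach (var path in paths)
        {
            // Fields at or above the threshold pass; the rest go to a reviewer.
            if (getConfidence(path) >= threshold)
                accepted.Add(path);
            else
                needsReview.Add(path);
        }

        return (accepted, needsReview);
    }
}

public static class Program
{
    public static void Main()
    {
        // Stand-in scores for the demo. In a pipeline the delegate would be
        // something like: path => result.GetConfidence(path)
        var scores = new Dictionary<string, double>
        {
            ["InvoiceNumber"] = 0.97,
            ["Vendor"] = 0.91,
            ["Total"] = 0.62
        };

        var (accepted, needsReview) = ReviewRouter.Split(scores.Keys, p => scores[p]);

        Console.WriteLine($"Accepted: {string.Join(", ", accepted)}");
        Console.WriteLine($"Needs review: {string.Join(", ", needsReview)}");
    }
}
```

Keeping the threshold as a parameter makes it easy to apply stricter limits to high-stakes fields such as amounts than to descriptive fields.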

Nested

Nested objects & arrays

Extract complex hierarchical data with unlimited nesting depth. Define object arrays for line items, addresses, and other repeating structures, then iterate over their items with EnumerateAt(path).

Bounds

Spatial bounds

Extracted elements include a Bounds property that gives their spatial location. Know exactly where each value was found in the document for verification, highlighting, or downstream processing.

Pages

Page-range extraction

Extract from specific pages of multi-page documents using SetContent with a page index or page-range parameters. Process only the pages you need for faster extraction.

Multimodal extraction

Extract from any source.

Process text, images, PDFs, and scanned documents with a unified API powered by OCR and Vision Language Models.

Vision-powered document understanding

LM-Kit's extraction engine supports multiple processing modalities. For clean digital documents, direct text extraction is fastest. For scanned documents, integrate the OcrEngine. For complex layouts with tables, forms, and mixed content, leverage Vision Language Models through PreferredInferenceModality.

The engine handles handwritten notes, smartphone photos, receipts, ID cards, and other visual content. Use VlmOcr for the highest-quality document understanding with layout preservation.

PDF with embedded text or scanned pages
Office documents (DOCX, XLSX, PPTX)
Images (PNG, JPEG, WebP, TIFF)
HTML and Markdown content
Plain text from any source

Auto-detect

PDF documents

Digital or scanned pages.

VLM

Images

Photos, scans, screenshots.

Native

Office files

Word, Excel, PowerPoint.

OCR

Handwritten

Notes and forms.

Schema definition

Define what to extract.

Multiple ways to define extraction schemas for maximum flexibility.

Code-first

`TextExtractionElement` API

Define extraction elements programmatically with full type safety and IntelliSense support.

Name, type, and description for each field
Nested objects with InnerElements
Array support with IsArray
Format constraints via TextExtractionElementFormat

JSON Schema

`SetElementsFromJsonSchema`

Import extraction schemas from standard JSON Schema definitions for seamless integration with existing systems.

Standard JSON Schema compatibility
Reuse API contracts as extraction schemas
Support for complex nested structures
Generate from database models
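
For illustration, here is a minimal sketch of a JSON Schema that mirrors the invoice fields used elsewhere on this page, held in a C# raw string literal (C# 11 or later) and sanity-checked with System.Text.Json before being handed to the engine. The schema content and the `InvoiceSchema` class are assumptions made for this example. The commented-out call at the end assumes that `SetElementsFromJsonSchema` accepts the schema as a string; check the API reference for the actual signature.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class InvoiceSchema
{
    // Hypothetical schema mirroring the invoice fields used on this page.
    public const string Json = """
    {
      "type": "object",
      "properties": {
        "InvoiceNumber": { "type": "string", "description": "Invoice identifier" },
        "Vendor": { "type": "string", "description": "Vendor company name" },
        "Date": { "type": "string", "format": "date", "description": "Invoice date" },
        "Total": { "type": "number", "description": "Total amount" },
        "LineItems": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "Description": { "type": "string" },
              "Quantity": { "type": "integer" },
              "UnitPrice": { "type": "number" }
            }
          }
        }
      },
      "required": ["InvoiceNumber"]
    }
    """;

    // Parses the schema (throws on malformed JSON) and lists its top-level fields.
    public static string[] TopLevelFields()
    {
        using var doc = JsonDocument.Parse(Json);
        var names = new List<string>();
        foreach (var property in doc.RootElement.GetProperty("properties").EnumerateObject())
            names.Add(property.Name);
        return names.ToArray();
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(string.Join(", ", InvoiceSchema.TopLevelFields()));

        // Assumption: the method takes the schema text as a string.
        // extraction.SetElementsFromJsonSchema(InvoiceSchema.Json);
    }
}
```

Parsing the schema up front catches malformed JSON before any model is loaded, which tends to be cheaper than discovering the problem at extraction time.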

Output formatting

Control extraction output.

Fine-tune how extracted values are formatted and validated with TextExtractionElementFormat.

Case

`TextCaseMode`

Control text casing: uppercase, lowercase, title case, or preserve original.

Trim

`TrimStart`

Remove leading whitespace or specific characters from extracted values.

Required

`IsRequired`

Mark fields as mandatory. Extraction fails if required fields are not found.

Hint

`FormatHint`

Provide format hints such as email, phone, or URL with the PredefinedStringFormat enum.

Whitelist

`WhitelistedValues`

Constrain extraction to a predefined set of allowed values for enum-like fields.

Doubt

`NullOnDoubt`

Return null instead of an uncertain value when confidence falls below the threshold.

Context

`MaximumContextLength`

Limit context window size for processing large documents efficiently.

Title

Title & description

Guide extraction with document title and description for better accuracy.

Code example

Build an extraction pipeline.

Complete example showing schema definition, multimodal extraction, and result handling.

InvoiceExtraction.cs

using System;
using System.Collections.Generic;
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;

// Load model (text or vision-language model for images)
var model = LM.LoadFromModelID("gemma4:e4b");

// Create extraction instance
var extraction = new TextExtraction(model)
{
    Title = "Invoice Parser",
    Description = "Extract invoice details from documents",
    NullOnDoubt = true
};

// Define extraction schema with nested objects and arrays
extraction.Elements = new List<TextExtractionElement>
{
    new("InvoiceNumber", ElementType.String, "Invoice identifier")
    {
        Format = new TextExtractionElementFormat
        {
            IsRequired = true,
            TextCaseMode = TextCaseMode.Uppercase
        }
    },
    new("Vendor", ElementType.String, "Vendor company name"),
    new("Date", ElementType.Date, "Invoice date"),
    new("Total", ElementType.Double, "Total amount"),
    new("LineItems", ElementType.Object, "Invoice line items")
    {
        IsArray = true,
        InnerElements = new List<TextExtractionElement>
        {
            new("Description", ElementType.String),
            new("Quantity", ElementType.Integer),
            new("UnitPrice", ElementType.Double)
        }
    }
};

// Load content (text, image, or PDF)
extraction.SetContent(new Attachment("invoice.pdf"));

// Or extract from specific pages
// extraction.SetContent(new Attachment("invoice.pdf"), pageIndex: 0);

// Parse and get results
TextExtractionResult result = await extraction.ParseAsync();

// Access typed values
var invoiceNum = result.GetValue<string>("InvoiceNumber");
var total = result.GetValue<double>("Total", 0.0);
var confidence = result.GetConfidence("Total");

// Enumerate array items
foreach (var item in result.EnumerateAt("LineItems"))
{
    var desc = item.Get("Description").As<string>();
    var qty = item.Get("Quantity").As<int>();
    Console.WriteLine($"{desc}: {qty} units");
}

// Get raw JSON output
Console.WriteLine(result.Json);

// Overall confidence score
Console.WriteLine($"Confidence: {result.Confidence:P0}");
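
Downstream code often wants a typed object rather than raw JSON. The sketch below deserializes a JSON string into a record with System.Text.Json. It assumes that `result.Json` uses the element names from the schema as property names; the `Invoice` record, the `InvoiceJson` helper, and the sample JSON (values borrowed from the live preview on this page) are hypothetical stand-ins so the snippet runs without a model.

```csharp
using System;
using System.Text.Json;

// Hypothetical DTO; property names mirror the schema element names.
public record Invoice(string InvoiceNumber, string Vendor, DateTime Date, double Total);

public static class InvoiceJson
{
    // Deserializes extraction output into a typed record.
    public static Invoice Parse(string json) =>
        JsonSerializer.Deserialize<Invoice>(json)
        ?? throw new InvalidOperationException("JSON payload was null.");
}

public static class Program
{
    public static void Main()
    {
        // Stand-in for result.Json (assumed shape).
        const string json = """
        {"InvoiceNumber":"INV-2024-0892","Vendor":"Acme Corp","Date":"2024-01-15","Total":4250.00}
        """;

        Invoice invoice = InvoiceJson.Parse(json);
        Console.WriteLine($"{invoice.InvoiceNumber} from {invoice.Vendor}, dated {invoice.Date:yyyy-MM-dd}");
    }
}
```

If NullOnDoubt is enabled, some values may come back as null, so nullable property types (for example `DateTime?` and `double?`) are likely the safer choice for fields that are not required.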

Get started

Demo applications.

Explore working examples to accelerate your development. Clone, run, and customize.

Featured · Core

Structured data extraction

Extract data from invoices, job offers, medical records, and more. Full TextExtraction API demonstration.

CLI

Invoice data extraction

Real-world invoice parsing with vendor, amounts, line items, and payment terms extraction.

CLI · NER

Named entity recognition

Extract people, organizations, locations, dates, and custom entity types from text.

CLI · Privacy

PII extraction

Detect and extract personally identifiable information for compliance and data protection.

CLI · Batch

Batch PII extraction

Process multiple documents at scale with parallel PII detection and extraction.

CLI · Web

Web content extractor

Extract structured JSON data from web pages and HTML content automatically.

Where to use data extraction.

Transform unstructured content into actionable data across industries.

Finance

Invoice processing

Extract vendor, amounts, line items, tax details, and payment terms from invoices in any format.

Legal

Contract analysis

Identify parties, obligations, dates, clauses, and terms from legal contracts and agreements.

Healthcare

Medical records

Parse patient data, diagnoses, medications, lab results, and treatment history from clinical documents.

Resume parsing

Extract candidate details, skills, experience, education, and certifications from CVs and resumes.

Retail

Receipt capture

Digitize receipts from photos: merchant, items, prices, taxes, and payment method.

Forms

Form processing

Automate data entry from scanned forms, applications, and surveys with field-level extraction.

Technology

Powered by Dynamic Sampling.

LM-Kit's proprietary Dynamic Sampling technology optimizes token generation in real time, improving extraction accuracy even with smaller, faster models. This enables deployment on edge devices without sacrificing quality.

The extraction engine continuously adapts sampling parameters based on schema constraints, confidence thresholds, and output validation. The result: structured data extraction that rivals cloud-based solutions, running entirely on your infrastructure.

Accuracy

Higher accuracy

Achieve 95%+ extraction accuracy on complex documents with optimized sampling.

Speed

Faster processing

Reduced token generation time means faster extraction cycles.

Compact

Smaller models

Get LLM-quality results from 4B-parameter models on consumer hardware.

Edge

Edge deployment

Run extraction pipelines on laptops, mobile devices, and IoT systems.

API reference

Key classes & methods.

Core components for building data extraction pipelines.

`TextExtraction`

Main extraction engine. Configure schema, content, and processing options. Supports sync and async parsing.

`TextExtractionResult`

Extraction results with typed accessors, JSON output, confidence scores, and array enumeration.

`TextExtractionElement`

Schema element definition with name, type, description, nesting, and format options.

`TextExtractionElementFormat`

Output formatting options: case mode, required flag, whitelist, and predefined formats.

`Attachment`

Input container for text, images, PDFs, and Office documents. Supports streams, files, and URIs.

`ElementType`

Enumeration of supported data types including primitives, dates, and array variants.

Build it. Read it. Try it.

Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.

Ready to build intelligent data extraction?

Transform unstructured content into structured data. 100% local, 100% your infrastructure.
