Intelligent Data Extraction

Transform Any Content Into Structured Data.

Extract structured information from text, images, PDFs, and scanned documents using AI-powered schema inference. Define custom extraction schemas with JSON, auto-discover fields, and get typed results with confidence scores. 100% local processing with Dynamic Sampling for maximum accuracy.

Schema Discovery · Multimodal Input · Confidence Scoring

Live Extraction Preview
Unstructured Input

"Invoice #INV-2024-0892 from Acme Corp dated Jan 15, 2024. Total amount: $4,250.00. Payment due within 30 days. Contact: [email protected]"

Structured JSON
{
  "invoiceNumber": "INV-2024-0892",
  "vendor": "Acme Corp",
  "date": "2024-01-15",
  "total": 4250.00,
  "paymentTerms": 30,
  "email": "[email protected]"
}

PDF & Documents · Images & Scans (VLM) · OCR Engine · JSON Schema

  • 10+ data types
  • ∞ nested depth
  • 100% local
  • 0.9+ confidence

Turn Chaos Into Structure with AI

The TextExtraction engine transforms unstructured content from any source into typed, validated data. Define extraction schemas using code, JSON Schema, or let the AI auto-discover fields with SchemaDiscovery. Process text, images, PDFs, and scanned documents with a unified API.

Powered by LM-Kit's proprietary Dynamic Sampling technology, the extraction engine delivers exceptional accuracy even with smaller models, enabling deployment on edge devices without sacrificing quality.

New in 2026: Schema auto-discovery, page-range extraction, spatial bounds for extracted data, per-field confidence scores, VLM-powered document understanding, and fine-tuning dataset export.

Supported data types:

  • String: Text values
  • Integer: Whole numbers
  • Float: Decimal numbers
  • Double: High precision
  • Bool: True/false
  • Date: Date values
  • Char: Single character
  • Object: Nested structure
  • StringArray: Text lists
  • IntegerArray: Number lists
  • DateArray: Date lists
  • ObjectArray: Nested arrays

Complete Extraction Toolkit

Everything you need to build production-grade data extraction pipelines in .NET.

Schema Auto-Discovery

Use SchemaDiscovery to automatically detect and suggest extraction fields from your content. Perfect for exploring unknown document formats or bootstrapping new extraction pipelines.

JSON Schema Support

Define extraction schemas using standard JSON Schema with SetElementsFromJsonSchema. Import existing schemas from your API contracts or database models for seamless integration.

Confidence Scoring

Every extraction includes a confidence score between 0 and 1 at both the result and the field level. Use GetConfidence(path) to check uncertain fields and to drive human review workflows.
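
As an illustration of a review workflow built on field-level confidence, the sketch below routes low-confidence fields to a human reviewer. It is plain C# and does not call LM-Kit: the scores, the 0.80 threshold, and the FieldsNeedingReview helper are hypothetical, and in a real pipeline the scores would be read with GetConfidence(path).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical per-field confidence scores (0 to 1). In a real pipeline these
// would be read with result.GetConfidence(path) for each field of interest.
var fieldConfidence = new Dictionary<string, double>
{
    ["InvoiceNumber"] = 0.98,
    ["Vendor"] = 0.91,
    ["Date"] = 0.74,
    ["Total"] = 0.62
};

// Assumed review threshold; tune it per document type.
const double reviewThreshold = 0.80;

List<string> toReview = FieldsNeedingReview(fieldConfidence, reviewThreshold);

Console.WriteLine(toReview.Count == 0
    ? "All fields accepted automatically."
    : $"Send to human review: {string.Join(", ", toReview)}");
// prints "Send to human review: Total, Date"

// Returns the names of fields scoring below the threshold, least confident first.
static List<string> FieldsNeedingReview(
    IReadOnlyDictionary<string, double> scores, double threshold) =>
    scores.Where(kv => kv.Value < threshold)
          .OrderBy(kv => kv.Value)
          .Select(kv => kv.Key)
          .ToList();
```

Sorting by ascending confidence puts the most doubtful fields at the top of the reviewer's queue.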

Nested Objects & Arrays

Extract complex hierarchical data with unlimited nesting depth. Define object arrays for line items, addresses, and other repeating structures, then iterate over their items with EnumerateAt(path).

Spatial Bounds

Each extracted element exposes a Bounds property that gives its spatial location. Know exactly where each value was found in the document for verification, highlighting, or downstream processing.

Page-Range Extraction

Extract from specific pages of multi-page documents using SetContent with page index or range parameters. Process only the pages you need for faster extraction.

Extract From Any Source

Process text, images, PDFs, and scanned documents with a unified API powered by OCR and Vision Language Models.

Vision-Powered Document Understanding

LM-Kit's extraction engine supports multiple processing modalities. For clean digital documents, direct text extraction is fastest. For scanned documents, integrate the OcrEngine. For complex layouts with tables, forms, and mixed content, leverage Vision Language Models through PreferredInferenceModality.

The engine automatically handles handwritten notes, smartphone photos, receipts, ID cards, and any other visual content. Use VlmOcr for the highest quality document understanding with layout preservation.

  • PDF with embedded text or scanned pages
  • Office documents (DOCX, XLSX, PPTX)
  • Images (PNG, JPEG, WebP, TIFF)
  • HTML and Markdown content
  • Plain text from any source

  • PDF Documents: digital or scanned pages (Auto-detect)
  • Images: photos, scans, screenshots (VLM)
  • Office Files: Word, Excel, PowerPoint (Native)
  • Handwritten: notes and forms (OCR)

Define What to Extract

Multiple ways to define extraction schemas for maximum flexibility.

Code-First Definition
TextExtractionElement API

Define extraction elements programmatically with full type safety and IntelliSense support.

  • Name, type, and description for each field
  • Nested objects with InnerElements
  • Array support with IsArray
  • Format constraints via TextExtractionElementFormat
JSON Schema Import
SetElementsFromJsonSchema

Import extraction schemas from standard JSON Schema definitions for seamless integration with existing systems.

  • Standard JSON Schema compatibility
  • Reuse API contracts as extraction schemas
  • Support for complex nested structures
  • Generate from database models
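
To make the JSON Schema route concrete, here is a minimal sketch of an invoice schema and a quick well-formedness check using System.Text.Json. The schema contents and the ListFields helper are hypothetical illustrations; the final hand-off to SetElementsFromJsonSchema is shown only as a comment because its exact signature is an assumption to verify against the API reference.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical invoice schema in standard JSON Schema.
// Field names mirror the invoice example on this page.
const string invoiceSchema = """
{
  "type": "object",
  "properties": {
    "invoiceNumber": { "type": "string", "description": "Invoice identifier" },
    "vendor": { "type": "string", "description": "Vendor company name" },
    "date": { "type": "string", "format": "date" },
    "total": { "type": "number" },
    "lineItems": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "integer" },
          "unitPrice": { "type": "number" }
        }
      }
    }
  },
  "required": ["invoiceNumber"]
}
""";

// Check that the schema parses and list its top-level fields
// before handing it to the extraction engine.
List<string> fields = ListFields(invoiceSchema);
foreach (string field in fields)
{
    Console.WriteLine(field);
}

// Assumption: the schema string is then passed to the engine, for example
// extraction.SetElementsFromJsonSchema(invoiceSchema);
// Confirm the exact signature in the TextExtraction documentation.

// Hypothetical helper: returns "name: type" for each top-level property.
static List<string> ListFields(string schemaJson)
{
    using JsonDocument doc = JsonDocument.Parse(schemaJson);
    var result = new List<string>();
    foreach (JsonProperty property in doc.RootElement.GetProperty("properties").EnumerateObject())
    {
        string type = property.Value.GetProperty("type").GetString() ?? "unknown";
        result.Add($"{property.Name}: {type}");
    }
    return result;
}
```

Parsing the schema up front surfaces syntax errors in the JSON itself, separate from any errors the extraction engine might report.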

Control Extraction Output

Fine-tune how extracted values are formatted and validated with TextExtractionElementFormat, alongside engine-level settings such as Title, Description, and NullOnDoubt on TextExtraction.

TextCaseMode

Control text casing: uppercase, lowercase, title case, or preserve original.

TrimStart

Remove leading whitespace or specific characters from extracted values.

IsRequired

Mark fields as mandatory. Extraction fails if required fields are not found.

FormatHint

Provide format hints such as email, phone, or URL with the PredefinedStringFormat enum.

WhitelistedValues

Constrain extraction to a predefined set of allowed values for enum-like fields.

NullOnDoubt

Return null instead of an uncertain value when confidence is below the threshold.

MaximumContextLength

Limit context window size for processing large documents efficiently.

Title & Description

Guide extraction with a document title and description for better accuracy.

Build an Extraction Pipeline

Complete example showing schema definition, multimodal extraction, and result handling.

InvoiceExtraction.cs
using System;
using System.Collections.Generic;
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;

// Load model (text or vision-language model for images)
var model = LM.LoadFromModelID("gemma3:4b");

// Create extraction instance
var extraction = new TextExtraction(model)
{
    Title = "Invoice Parser",
    Description = "Extract invoice details from documents",
    NullOnDoubt = true
};

// Define extraction schema with nested objects and arrays
extraction.Elements = new List<TextExtractionElement>
{
    new("InvoiceNumber", ElementType.String, "Invoice identifier")
    {
        Format = new TextExtractionElementFormat
        {
            IsRequired = true,
            TextCaseMode = TextCaseMode.Uppercase
        }
    },
    new("Vendor", ElementType.String, "Vendor company name"),
    new("Date", ElementType.Date, "Invoice date"),
    new("Total", ElementType.Double, "Total amount"),
    new("LineItems", ElementType.Object, "Invoice line items")
    {
        IsArray = true,
        InnerElements = new List<TextExtractionElement>
        {
            new("Description", ElementType.String),
            new("Quantity", ElementType.Integer),
            new("UnitPrice", ElementType.Double)
        }
    }
};

// Load content (text, image, or PDF)
extraction.SetContent(new Attachment("invoice.pdf"));

// Or extract from a single page
// extraction.SetContent(new Attachment("invoice.pdf"), pageIndex: 0);

// Parse and get results
TextExtractionResult result = await extraction.ParseAsync();

// Access typed values
var invoiceNum = result.GetValue<string>("InvoiceNumber");
var total = result.GetValue<double>("Total", 0.0);
var confidence = result.GetConfidence("Total");

// Enumerate array items
foreach (var item in result.EnumerateAt("LineItems"))
{
    var desc = item.Get("Description").As<string>();
    var qty = item.Get("Quantity").As<int>();
    Console.WriteLine($"{desc}: {qty} units");
}

// Get raw JSON output
Console.WriteLine(result.Json);

// Overall confidence score
Console.WriteLine($"Confidence: {result.Confidence:P0}");
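
When the raw JSON is handed to code that does not reference LM-Kit types, it can be read with System.Text.Json. The sketch below is a standalone illustration: the sample string is copied from the preview on this page (without the email field), the ParseInvoice helper is hypothetical, and the camelCase property names are an assumption about the output casing.

```csharp
using System;
using System.Globalization;
using System.Text.Json;

// Sample output taken from the preview on this page; in a real pipeline
// this string would come from result.Json.
const string sampleJson = """
{
  "invoiceNumber": "INV-2024-0892",
  "vendor": "Acme Corp",
  "date": "2024-01-15",
  "total": 4250.00,
  "paymentTerms": 30
}
""";

var invoice = ParseInvoice(sampleJson);

// Derive the due date from the invoice date and the payment terms in days.
DateTime dueDate = invoice.Date.AddDays(invoice.PaymentTerms);

Console.WriteLine(
    $"{invoice.InvoiceNumber}: " +
    $"{invoice.Total.ToString("0.00", CultureInfo.InvariantCulture)} due " +
    $"{dueDate.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture)}");

// Hypothetical helper: reads the fields into a typed tuple.
static (string InvoiceNumber, DateTime Date, decimal Total, int PaymentTerms) ParseInvoice(string json)
{
    using JsonDocument doc = JsonDocument.Parse(json);
    JsonElement root = doc.RootElement;
    return (
        root.GetProperty("invoiceNumber").GetString() ?? string.Empty,
        DateTime.ParseExact(
            root.GetProperty("date").GetString() ?? string.Empty,
            "yyyy-MM-dd",
            CultureInfo.InvariantCulture),
        root.GetProperty("total").GetDecimal(),
        root.GetProperty("paymentTerms").GetInt32());
}
```

Reading the total as decimal rather than double avoids binary rounding when the amount feeds accounting logic.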

Where to Use Data Extraction

Transform unstructured content into actionable data across industries.

Invoice Processing

Extract vendor, amounts, line items, tax details, and payment terms from invoices in any format.

Contract Analysis

Identify parties, obligations, dates, clauses, and terms from legal contracts and agreements.

Medical Records

Parse patient data, diagnoses, medications, lab results, and treatment history from clinical documents.

Resume Parsing

Extract candidate details, skills, experience, education, and certifications from CVs and resumes.

Receipt Capture

Digitize receipts from photos: merchant, items, prices, taxes, and payment method.

Form Processing

Automate data entry from scanned forms, applications, and surveys with field-level extraction.

Powered by Dynamic Sampling

LM-Kit's proprietary Dynamic Sampling technology optimizes token generation in real time, improving extraction accuracy even with smaller, faster models. This enables deployment on edge devices without sacrificing quality.

The extraction engine continuously adapts sampling parameters based on schema constraints, confidence thresholds, and output validation. The result: structured data extraction that rivals cloud-based solutions, running entirely on your infrastructure.


Higher Accuracy

Achieve 95%+ extraction accuracy on complex documents with optimized sampling.

Faster Processing

Reduced token generation time means faster extraction cycles.

Smaller Models

Get large-model quality results from 4B-parameter models on consumer hardware.

Edge Deployment

Run extraction pipelines on laptops, mobile devices, and IoT systems.

Key Classes & Methods

Core components for building data extraction pipelines.

TextExtraction

Main extraction engine. Configure schema, content, and processing options. Supports sync and async parsing.

TextExtractionResult

Extraction results with typed accessors, JSON output, confidence scores, and array enumeration.

TextExtractionElement

Schema element definition with name, type, description, nesting, and format options.

TextExtractionElementFormat

Output formatting options: case mode, required flag, whitelist, and predefined formats.

Attachment

Input container for text, images, PDFs, and Office documents. Supports streams, files, and URIs.

ElementType

Enumeration of supported data types including primitives, dates, and array variants.


Ready to Build Intelligent Data Extraction?

Transform unstructured content into structured data. 100% local, 100% your infrastructure.