Document Intelligence

One PDF. Multiple Documents. Automatically Separated.

Intelligent document splitting powered by LM-Kit's neuro-symbolic AI engine. Vision language models combined with symbolic validation detect where one document ends and another begins inside multi-page PDFs. No templates, no rules, no training required. Powered by our super-fast document and image processing engine, continuously improved across text and vision modalities. 100% on-device.

Neuro-Symbolic Engine | Zero Templates | Text & Vision Modalities

Neuro-Symbolic Engine

Neural vision models combined with symbolic validation layers for reliable results.

Page Range Detection

Returns exact start/end pages for each logical document.

Automatic Labels

Each segment gets a descriptive label: "Invoice", "Contract", "ID Card".

Continuously Improved

Engine updated with every release across both text and vision modalities.

3 Lines of Code
0 Templates Needed
0 Cloud Calls

The Missing Step in Every Document Pipeline

Scanners, copiers, and email attachments routinely bundle unrelated documents into a single PDF: an invoice stapled to a purchase order, an ID card next to a bank statement, a contract followed by its appendices. Before you can classify, extract, or route these documents, you need to know where each one starts and ends.

LM-Kit.NET's DocumentSplitting class solves this using our internal neuro-symbolic engine. Vision language models analyze each page visually, while symbolic AI layers (grammar constraints, fuzzy logic, and rule-based validation) enforce structural correctness on every output. The result: precise page ranges with descriptive labels, faster processing, and significantly fewer errors than pure LLM approaches.

The neuro-symbolic architecture runs on LM-Kit's fast document and image processing engine and is continuously improved across both text and vision modalities. It handles scanned documents, digital PDFs, mixed layouts, multiple languages, and rotated pages. If a human can see where one document ends and another begins, so can DocumentSplitting.

SplitDocuments.cs
using System;
using LMKit.Extraction;
using LMKit.Model;
using LMKit.Data;

// Load a vision-capable model
var model = LM.LoadFromModelID("qwen3-vl:8b");

// Create the splitter
var splitter = new DocumentSplitting(model);

// Optionally provide guidance
splitter.Guidance = "Invoices and purchase orders.";

// Analyze a multi-page PDF
var result = splitter.Split(
    new Attachment("scanned_batch.pdf"));

// Check results
Console.WriteLine(
    $"Found {result.DocumentCount} documents");
Console.WriteLine(
    $"Confidence: {result.Confidence:P0}");

// Iterate detected segments
foreach (var seg in result.Segments)
{
    Console.WriteLine(
        $"Pages {seg.StartPage}-{seg.EndPage}");
    Console.WriteLine(
        $"  Type: {seg.Label}");
}

Neuro-Symbolic Boundary Detection

Page images are processed by LM-Kit's neuro-symbolic engine: a vision language model generates boundary hypotheses while symbolic AI layers validate, correct, and enforce structural integrity on every result.

Why the Neuro-Symbolic Approach Excels

Traditional rule-based splitters rely on text patterns, barcodes, or separator pages. They break when documents have inconsistent formatting. Pure LLM approaches hallucinate boundaries and produce structurally invalid outputs.

LM-Kit takes a fundamentally different approach with its Dynamic Sampling framework: a vision language model sees each page as an image and understands the visual layout, while symbolic AI layers (grammar constraints, fuzzy logic, taxonomy matching, and rule-based validation) enforce correctness at every generation step. This neuro-symbolic architecture, built on top of LM-Kit's super-fast document and image processing engine, delivers 75% fewer errors and 2× faster processing compared to pure LLM approaches.

1. Page Rendering

LM-Kit's fast image processing engine renders each PDF page for the VLM.

2. Neural Analysis

The VLM classifies each page by document type and detects visual transitions.

3. Symbolic Validation

Grammar constraints and rule-based validation enforce structural correctness on the output.

4. Coverage & Normalization

Page coverage is validated with no gaps or overlaps. Labels are normalized for consistency.

Split First, Then Process

Document splitting is the natural first step in any document intelligence pipeline. LM-Kit's neuro-symbolic engine handles the split, then each document flows to specialized downstream processing.

Ingest

Load multi-page PDF from scanner, email, or upload.

Split

Neuro-symbolic engine detects boundaries and returns page ranges with labels.

Classify

Route each segment by type: invoice, contract, ID, form.

Extract

Apply schema-specific extraction to each individual document.
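
The Classify step above can be sketched as a simple label lookup. This is a minimal illustration only: the `Segment` record is a hypothetical stand-in for a detected segment (it is not the LM-Kit.NET `DocumentSegment` class), and the queue names are invented for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a detected segment; not the LM-Kit.NET DocumentSegment class.
public record Segment(int StartPage, int EndPage, string Label);

public static class SegmentRouter
{
    // Illustrative label-to-queue mapping; the queue names are made up for this sketch.
    private static readonly Dictionary<string, string> Routes =
        new(StringComparer.OrdinalIgnoreCase)
        {
            ["Invoice"] = "accounts-payable",
            ["Contract"] = "legal",
            ["ID Card"] = "kyc",
        };

    // Returns the queue for a segment, or "manual-review" when the label is unknown.
    public static string Route(Segment segment) =>
        Routes.TryGetValue(segment.Label.Trim(), out var queue) ? queue : "manual-review";
}
```

Sending unknown labels to a review queue, rather than throwing, keeps a batch moving when the splitter returns a label the mapping has not seen before.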

Built for Real-World Document Batches

Handles the messy reality of production document processing.

Reliable, Accurate, Flexible

Powered by LM-Kit's neuro-symbolic engine, DocumentSplitting handles the edge cases that break template-based and pure LLM systems: mixed document types, varying page counts, scanned vs digital content, and documents in multiple languages. The engine is continuously improved with every release across both text and vision modalities.

  • Confidence Scoring: Each result includes a confidence score so you can flag low-confidence splits for human review
  • Semantic Guidance: Provide hints about expected document types ("invoices and purchase orders") for higher accuracy
  • Multi-page Documents: Correctly groups multi-page documents (e.g., a 5-page contract) using page-numbering and continuation markers
  • Optional OCR Integration: Plug in an OCR engine for scanned documents that benefit from text-level analysis
  • Async Support: Both synchronous and asynchronous APIs with cancellation token support
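
As a rough sketch of the confidence-scoring bullet above, a result can be triaged against a threshold before it is accepted automatically. The `ReviewPolicy` helper and the 0.85 threshold are assumptions made for illustration, and the sketch assumes the confidence is a fraction between 0 and 1.

```csharp
public enum SplitDisposition { AutoAccept, HumanReview }

public static class ReviewPolicy
{
    // Arbitrary example threshold; tune it against your own document batches.
    public const double DefaultThreshold = 0.85;

    // Accepts a split when confidence meets the threshold, otherwise flags it for review.
    public static SplitDisposition Triage(double confidence, double threshold = DefaultThreshold) =>
        confidence >= threshold ? SplitDisposition.AutoAccept : SplitDisposition.HumanReview;
}
```

A caller would pass the confidence value reported for a split into `Triage` and queue anything marked `HumanReview` for an operator.
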

0 Templates: No rules or patterns to maintain
0 Training Data: Works out of the box
0 Cloud Calls: 100% local processing
Any Document Type: Invoices, contracts, IDs, forms...

Page-Number Awareness

Detects pagination markers like "Page 2/5" and continuation headers to keep multi-page documents together as a single segment.
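
To make the idea concrete, here is a small sketch of pagination-marker detection on page text. It is not LM-Kit's internal logic: the `PaginationMarkers` helper and the regular expression are assumptions, and the pattern only covers the "Page 2/5" and "Page 2 of 5" forms.

```csharp
using System.Text.RegularExpressions;

public static class PaginationMarkers
{
    // Matches markers such as "Page 2/5" or "page 2 of 5" (case-insensitive).
    private static readonly Regex Marker = new(
        @"\bpage\s+(\d{1,4})\s*(?:/|of)\s*(\d{1,4})\b",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Reads (current, total) from a page's text; returns false when no valid marker is present.
    public static bool TryParse(string pageText, out int current, out int total)
    {
        current = 0;
        total = 0;
        var match = Marker.Match(pageText);
        if (!match.Success) return false;

        int parsedCurrent = int.Parse(match.Groups[1].Value);
        int parsedTotal = int.Parse(match.Groups[2].Value);
        if (parsedCurrent < 1 || parsedCurrent > parsedTotal) return false;

        current = parsedCurrent;
        total = parsedTotal;
        return true;
    }

    // A page continues the previous document when it is not the first page of its sequence.
    public static bool IsContinuation(string pageText) =>
        TryParse(pageText, out var current, out _) && current > 1;
}
```

A page whose marker reads 2 or higher is treated as a continuation, so it stays in the same segment as the page before it.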

Label Normalization

Strips page-numbering suffixes from labels before comparison, ensuring "Invoice (1/3)" and "Invoice (2/3)" are recognized as the same document.
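
One way such normalization could look is sketched below. The `LabelNormalizer` helper and its pattern are hypothetical and only handle a trailing parenthesized suffix such as "(1/3)" or "(2 of 3)".

```csharp
using System;
using System.Text.RegularExpressions;

public static class LabelNormalizer
{
    // Matches a trailing page-numbering suffix such as " (1/3)" or " (2 of 3)".
    private static readonly Regex PageSuffix = new(
        @"\s*\(\s*\d+\s*(?:/|of)\s*\d+\s*\)\s*$",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Removes the suffix and surrounding whitespace, leaving the base label.
    public static string Normalize(string label) =>
        PageSuffix.Replace(label, string.Empty).Trim();

    // Two labels refer to the same document when their normalized forms match, ignoring case.
    public static bool SameDocument(string a, string b) =>
        string.Equals(Normalize(a), Normalize(b), StringComparison.OrdinalIgnoreCase);
}
```

With this sketch, `SameDocument("Invoice (1/3)", "Invoice (2/3)")` compares "Invoice" with "Invoice" and treats the two pages as one document.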

Page Coverage Validation

Validates that every page is accounted for exactly once with no gaps or overlaps. Falls back gracefully if the model output is incomplete.
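
The coverage rule can be expressed as a short check over 1-based, inclusive page ranges. This is a sketch under stated assumptions: `PageRange` and `PageCoverage` are hypothetical names, and the fallback shown (treating the whole file as one document) is an illustrative choice, since the fallback behavior is not specified here.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical 1-based, inclusive page range; not an LM-Kit.NET type.
public record PageRange(int StartPage, int EndPage);

public static class PageCoverage
{
    // True when the ranges cover pages 1..pageCount exactly once, with no gaps or overlaps.
    public static bool IsComplete(IEnumerable<PageRange> ranges, int pageCount)
    {
        int expectedStart = 1;
        foreach (var range in ranges.OrderBy(r => r.StartPage))
        {
            // A gap or overlap shows up as a start page that is not the next expected page.
            if (range.StartPage != expectedStart || range.EndPage < range.StartPage)
                return false;
            expectedStart = range.EndPage + 1;
        }
        return expectedStart == pageCount + 1;
    }

    // Assumed fallback for this sketch: treat the whole file as a single document.
    public static IReadOnlyList<PageRange> OrSingleDocument(
        IReadOnlyList<PageRange> ranges, int pageCount) =>
        IsComplete(ranges, pageCount) ? ranges : new[] { new PageRange(1, pageCount) };
}
```

Sorting by start page first means the check does not depend on the order in which segments were produced.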

Dynamic Sampling Engine

LM-Kit's proprietary neuro-symbolic inference framework combines neural generation with symbolic validation at every step, delivering 75% fewer errors than pure LLM approaches.

Text & Vision Modalities

Processes both visual page layouts and extracted text simultaneously. The engine is continuously improved across both modalities with every SDK release.

Single-Page Fast Path

Single-page PDFs skip inference entirely and return instantly with 100% confidence. No wasted compute.

From Mailroom to Compliance

Any workflow that handles batched or bundled documents benefits from intelligent splitting.

Mailroom Automation

Incoming scanned mail batches contain mixed documents. Split into individual items before routing to departments.

Accounts Payable

Vendors send multi-page PDFs with invoices, credit notes, and remittance advice bundled together. Separate each for processing.

Insurance Claims

Claims packages contain application forms, supporting evidence, medical reports, and photos. Split before adjudication.

Legal Document Bundles

Court filings, contracts with exhibits, and deposition packages. Separate each legal document for indexing and review.

KYC and Onboarding

Customer onboarding packets combine ID cards, proof of address, bank statements, and signed forms. Split for individual verification.

Healthcare Records

Patient folders contain lab results, prescriptions, referral letters, and consent forms. Split them on-device so patient data never leaves your environment, which supports HIPAA compliance.

Document Splitting Demo

Interactive console application demonstrating neuro-symbolic document boundary detection powered by LM-Kit's fast processing engine.

Splitting Demo

A complete console application that loads a vision model, processes multi-page PDFs using LM-Kit's neuro-symbolic engine, and displays detected document segments with page ranges, labels, and confidence scores.

  • Multiple vision model support (Qwen, Gemma, MiniCPM)
  • Interactive model selection menu
  • Progress tracking during model download
  • Detailed segment output with confidence
$ dotnet run
Select a vision-language model...
Loading qwen3-vl:8b... ████████ 100%
Enter PDF path: scanned_batch.pdf
Analyzing 12 pages...
 
Multiple documents: Yes
Document count: 4
Confidence: 94%
 
Segment 1: Pages 1-3: Invoice #2024-0892
Segment 2: Pages 4-4: Purchase Order
Segment 3: Pages 5-9: Service Contract
Segment 4: Pages 10-12: Bank Statement
 
Completed in 4.82 seconds

Key Classes

The building blocks for neuro-symbolic document splitting.

DocumentSplitting

Main class for detecting logical document boundaries using the neuro-symbolic engine. Requires a vision-capable model. Supports guidance and optional OCR integration.

DocumentSplittingResult

Contains the detected segments, document count, confidence score, and whether multiple documents were found.

DocumentSegment

Represents a single logical document with StartPage, EndPage, PageCount, and a descriptive Label.

Attachment

Represents the input PDF. Provides page-level access and integrates with the vision analysis pipeline.


Stop Sorting Pages Manually

Intelligent document splitting powered by LM-Kit's neuro-symbolic AI engine. Three lines of code. Zero templates. 100% local.