One PDF. Multiple documents. Automatically separated.

Intelligent document splitting powered by LM-Kit's neuro-symbolic AI engine. Vision language models combined with symbolic validation detect where one document ends and another begins inside multi-page PDFs. No templates, no rules, no training required. It runs on LM-Kit's fast document and image processing engine, which is continuously improved across text and vision modalities, and it executes 100% on-device.

Neuro-symbolic engine · Zero templates · Text & vision modalities

Neuro-symbolic engine

Neural vision models combined with symbolic validation layers for reliable results.

Page range detection

Returns exact start/end pages for each logical document.

Automatic labels

Each segment gets a descriptive label: "Invoice", "Contract", "ID Card".

Continuously improved

Engine updated with every release across both text and vision modalities.

3 lines of code
0 templates needed
0 cloud calls

Why splitting matters

The missing step in every document pipeline.

Scanners, copiers, and email attachments routinely bundle unrelated documents into a single PDF: an invoice stapled to a purchase order, an ID card next to a bank statement, a contract followed by its appendices. Before you can classify, extract, or route these documents, you need to know where each one starts and ends.

LM-Kit.NET's DocumentSplitting class solves this using our internal neuro-symbolic engine. Vision language models analyze each page visually, while symbolic AI layers (grammar constraints, fuzzy logic, and rule-based validation) enforce structural correctness on every output. The result: precise page ranges with descriptive labels, faster processing, and significantly fewer errors than pure LLM approaches.

Built on LM-Kit's fast document and image processing engine, the neuro-symbolic architecture is continuously improved across both text and vision modalities. It handles scanned documents, digital PDFs, mixed layouts, different languages, and rotated pages. If a human can see where one document ends and another begins, so can DocumentSplitting.

SplitDocuments.cs

using System;
using LMKit.Extraction;
using LMKit.Model;
using LMKit.Data;

// Load a vision-capable model
var model = LM.LoadFromModelID("qwen3.5:9b");

// Create the splitter
var splitter = new DocumentSplitting(model);

// Optionally provide guidance
splitter.Guidance = "Invoices and purchase orders.";

// Analyze a multi-page PDF
var result = splitter.Split(
    new Attachment("scanned_batch.pdf"));

// Check results
Console.WriteLine(
    $"Found {result.DocumentCount} documents");
Console.WriteLine(
    $"Confidence: {result.Confidence:P0}");

// Iterate detected segments
foreach (var seg in result.Segments) {
    Console.WriteLine(
        $"Pages {seg.StartPage}-{seg.EndPage}");
    Console.WriteLine(
        $"  Type: {seg.Label}");
}

Under the hood

Neuro-symbolic boundary detection.

Page images are processed by LM-Kit's neuro-symbolic engine: a vision language model generates boundary hypotheses while symbolic AI layers validate, correct, and enforce structural integrity on every result.

Why the neuro-symbolic approach excels

Traditional rule-based splitters rely on text patterns, barcodes, or separator pages. They break when documents have inconsistent formatting. Pure LLM approaches hallucinate boundaries and produce structurally invalid outputs.

LM-Kit takes a fundamentally different approach with its Dynamic Sampling framework: a vision language model sees each page as an image and understands the visual layout, while symbolic AI layers (grammar constraints, fuzzy logic, taxonomy matching, and rule-based validation) enforce correctness at every generation step. This neuro-symbolic architecture, built on top of LM-Kit's super-fast document and image processing engine, delivers 75% fewer errors and 2× faster processing compared to pure LLM approaches.

Step 01

Page rendering

LM-Kit's fast image processing engine renders each PDF page for the VLM.

Step 02

Neural analysis

The VLM classifies each page by document type and detects visual transitions.

Step 03

Symbolic validation

Grammar constraints and rule-based validation enforce structural correctness on the output.

Step 04

Coverage & normalization

Page coverage is validated with no gaps or overlaps. Labels are normalized for consistency.
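
The coverage rule in Step 04 can be illustrated outside the SDK. The sketch below is not LM-Kit's implementation: `PageRange` and `CoverageCheck.Covers` are hypothetical names, and it assumes 1-based, inclusive page numbers, as the sample output on this page suggests.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a detected segment (1-based, inclusive pages).
public record PageRange(int StartPage, int EndPage);

public static class CoverageCheck
{
    // Returns true when the ranges tile pages 1..pageCount exactly:
    // no gap, no overlap, and no inverted range.
    public static bool Covers(IEnumerable<PageRange> ranges, int pageCount)
    {
        int expectedStart = 1;
        foreach (var r in ranges.OrderBy(r => r.StartPage))
        {
            if (r.StartPage != expectedStart || r.EndPage < r.StartPage)
                return false; // gap, overlap, or inverted range
            expectedStart = r.EndPage + 1;
        }
        // The last range must end on the final page.
        return expectedStart == pageCount + 1;
    }
}
```

For a 12-page batch split into pages 1-3, 4-4, 5-9, and 10-12, `Covers` returns true. Removing the 4-4 range leaves a gap at page 4, so it returns false.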

Pipeline integration

Split first, then process.

Document splitting is the natural first step in any document intelligence pipeline. LM-Kit's neuro-symbolic engine handles the split, then each document flows to specialized downstream processing.

Step 01

Ingest

Load multi-page PDF from scanner, email, or upload.

Step 02

Split

Neuro-symbolic engine detects boundaries and returns page ranges with labels.

Step 03

Classify

Route each segment by type: invoice, contract, ID, form.

Step 04

Extract

Apply schema-specific extraction to each individual document.
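
The Classify step above amounts to a lookup from segment label to downstream handler. The following is a minimal sketch under stated assumptions: `Segment` is a hypothetical stand-in for the SDK's segment type (it mirrors the `StartPage`, `EndPage`, and `Label` properties shown in the sample), and the queue names are invented for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for one segment of a split result.
public record Segment(int StartPage, int EndPage, string Label);

public static class Router
{
    // Label-to-queue table (queue names are illustrative only).
    // Case-insensitive so "invoice" and "Invoice" route the same way.
    private static readonly Dictionary<string, string> Routes =
        new(StringComparer.OrdinalIgnoreCase)
        {
            ["Invoice"] = "accounts-payable",
            ["Contract"] = "legal",
            ["ID Card"] = "kyc",
        };

    // Unknown labels fall back to a manual-review queue instead of failing.
    public static string Route(Segment segment) =>
        Routes.TryGetValue(segment.Label.Trim(), out var queue)
            ? queue
            : "manual-review";
}
```

Exact-match routing only works when labels are predictable, so a real pipeline would likely pair it with the guidance hint or a dedicated classification step.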

Key capabilities

Built for real-world document batches.

Handles the messy reality of production document processing.

Reliable, accurate, flexible

Powered by LM-Kit's neuro-symbolic engine, DocumentSplitting handles the edge cases that break template-based and pure LLM systems: mixed document types, varying page counts, scanned vs digital content, and documents in multiple languages. The engine is continuously improved with every release across both text and vision modalities.

Confidence Scoring: Each result includes a confidence score so you can flag low-confidence splits for human review
Semantic Guidance: Provide hints about expected document types ("invoices and purchase orders") for higher accuracy
Multi-page Documents: Correctly groups multi-page documents (e.g., a 5-page contract) using page-numbering and continuation markers
Optional OCR Integration: Plug in an OCR engine for scanned documents that benefit from text-level analysis
Async Support: Both synchronous and asynchronous APIs with cancellation token support
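
Confidence scoring is typically used as a gate in front of automation. The sketch below assumes the score is a fraction between 0 and 1 (inferred from the `P0` percent formatting in the sample, not confirmed by this page). `Triage`, `Disposition`, and the 0.85 default threshold are hypothetical.

```csharp
using System;

public enum Disposition { AutoProcess, HumanReview }

public static class Triage
{
    // threshold is an assumed cut-off in [0, 1]; tune it on your own batches.
    public static Disposition Decide(double confidence, double threshold = 0.85)
    {
        if (double.IsNaN(confidence) || confidence < 0 || confidence > 1)
            throw new ArgumentOutOfRangeException(nameof(confidence));

        return confidence >= threshold
            ? Disposition.AutoProcess
            : Disposition.HumanReview;
    }
}
```

With these defaults, the 94% result from the demo output would be processed automatically, while a 60% result would be queued for review.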

Templates: none. No rules or patterns to maintain.
Training data: none. Works out of the box.
Cloud calls: none. Runs 100% on-device.

Interactive console application demonstrating neuro-symbolic document boundary detection powered by LM-Kit's fast processing engine.

Featured demo

Splitting demo

A complete console application that loads a vision model, processes multi-page PDFs using LM-Kit's neuro-symbolic engine, and displays detected document segments with page ranges, labels, and confidence scores.

Multiple vision model support (Qwen, Gemma, MiniCPM)
Interactive model selection menu
Progress tracking during model download
Detailed segment output with confidence

terminal

$ dotnet run
# Select a vision-language model...
# Loading qwen3.5:9b... ████████ 100%
# Enter PDF path: scanned_batch.pdf
# Analyzing 12 pages...

Multiple documents: Yes
Document count:     4
Confidence:         94%

Segment 1: Pages 1-3: Invoice #2024-0892
Segment 2: Pages 4-4: Purchase Order
Segment 3: Pages 5-9: Service Contract
Segment 4: Pages 10-12: Bank Statement

# Completed in 4.82 seconds

API reference

Key classes.

Extraction

PDF toolkit

After detecting boundaries, save each segment as a separate PDF via pdf_split.

OCR

When the input is a stack of scans, an OCR engine can run as part of the splitting pipeline, with tables and seals included.

Demos & docs

Build it. Read it. Try it.

Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.

Install the SDK

Stop sorting pages manually.

Intelligent document splitting powered by LM-Kit's neuro-symbolic AI engine. Three lines of code. Zero templates. 100% local.

Page-number awareness

Label normalization

Page coverage validation

Dynamic Sampling engine

Text & vision modalities

Single-page fast path

Mailroom automation

Accounts payable

Insurance claims

Legal document bundles

KYC and onboarding

Healthcare records

DocumentSplitting

DocumentSplittingResult

DocumentSegment

Attachment

Document classification

Structured extraction

PDF toolkit

OCR

Document splitting

Document splitting walkthrough

Split multi-document files

Build a multi-format document ingestion pipeline
