Neural vision models combined with symbolic validation layers for reliable results.
One PDF. Multiple documents. Automatically separated.
Intelligent document splitting powered by LM-Kit's neuro-symbolic AI engine. Vision language models combined with symbolic validation detect where one document ends and another begins inside multi-page PDFs. No templates, no rules, no training required. Powered by our super-fast document and image processing engine, continuously improved across text and vision modalities. 100% on-device.
Returns exact start/end pages for each logical document.
Each segment gets a descriptive label: "Invoice", "Contract", "ID Card".
Engine updated with every release across both text and vision modalities.
Lines of code: 3
Templates needed: 0
Cloud calls: 0
The missing step in every document pipeline.
Scanners, copiers, and email attachments routinely bundle unrelated documents into a single PDF: an invoice stapled to a purchase order, an ID card next to a bank statement, a contract followed by its appendices. Before you can classify, extract, or route these documents, you need to know where each one starts and ends.
LM-Kit.NET's DocumentSplitting class solves this using our internal neuro-symbolic engine. Vision language models analyze each page visually, while symbolic AI layers (grammar constraints, fuzzy logic, and rule-based validation) enforce structural correctness on every output. The result: precise page ranges with descriptive labels, faster processing, and significantly fewer errors than pure LLM approaches.
Powered by LM-Kit's super-fast document and image processing engine. Our neuro-symbolic architecture is continuously improved across both text and vision modalities, and it handles scanned documents, digital PDFs, mixed layouts, different languages, and rotated pages. If a human can see where one document ends and another begins, so can DocumentSplitting.
Neuro-symbolic boundary detection.
Page images are processed by LM-Kit's neuro-symbolic engine: a vision language model generates boundary hypotheses while symbolic AI layers validate, correct, and enforce structural integrity on every result.
Why the neuro-symbolic approach excels
Traditional rule-based splitters rely on text patterns, barcodes, or separator pages. They break when documents have inconsistent formatting. Pure LLM approaches hallucinate boundaries and produce structurally invalid outputs.
LM-Kit takes a fundamentally different approach with its Dynamic Sampling framework: a vision language model sees each page as an image and understands the visual layout, while symbolic AI layers (grammar constraints, fuzzy logic, taxonomy matching, and rule-based validation) enforce correctness at every generation step. This neuro-symbolic architecture, built on top of LM-Kit's super-fast document and image processing engine, delivers 75% fewer errors and 2× faster processing compared to pure LLM approaches.
Page rendering
LM-Kit's fast image processing engine renders each PDF page for the VLM.
Neural analysis
The VLM classifies each page by document type and detects visual transitions.
Symbolic validation
Grammar constraints and rule-based validation enforce structural correctness on the output.
Coverage & normalization
Page coverage is validated with no gaps or overlaps. Labels are normalized for consistency.
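As a rough illustration of how per-page results can become page ranges, the sketch below merges consecutive pages that share a label into one segment. It is a simplified assumption, not LM-Kit's implementation: the `Segment` type and `group_pages` function are hypothetical names, and label equality alone cannot separate two back-to-back documents of the same type, which is where visual transition detection would be needed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Hypothetical segment record: 1-based, inclusive page range plus a label."""
    start_page: int
    end_page: int
    label: str


def group_pages(page_labels: List[str]) -> List[Segment]:
    """Merge consecutive pages with the same label into one segment."""
    segments: List[Segment] = []
    for page_number, label in enumerate(page_labels, start=1):
        if segments and segments[-1].label == label:
            # Same label as the previous page: extend the open segment.
            segments[-1].end_page = page_number
        else:
            # Label changed: start a new segment at this page.
            segments.append(Segment(page_number, page_number, label))
    return segments


if __name__ == "__main__":
    for seg in group_pages(["Invoice", "Invoice", "Purchase Order", "ID Card"]):
        print(f"{seg.label}: pages {seg.start_page}-{seg.end_page}")
```

Because ranges are built page by page in order, the output of this sketch cannot contain gaps or overlaps; a real model response offers no such guarantee and still needs validation.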
Split first, then process.
Document splitting is the natural first step in any document intelligence pipeline. LM-Kit's neuro-symbolic engine handles the split, then each document flows to specialized downstream processing.
Step 01
Ingest
Load multi-page PDF from scanner, email, or upload.
Step 02
Split
Neuro-symbolic engine detects boundaries and returns page ranges with labels.
Step 03
Classify
Route each segment by type: invoice, contract, ID, form.
Step 04
Extract
Apply schema-specific extraction to each individual document.
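The classify and extract steps can be pictured as a lookup from segment label to handler. The sketch below is only an illustration under assumed names (`HANDLERS`, `process`, and the handler functions are hypothetical and stand in for real extraction code); unknown labels fall through to manual review.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical segment shape: (start_page, end_page, label), pages 1-based inclusive.
Segment = Tuple[int, int, str]


def extract_invoice(start: int, end: int) -> str:
    return f"invoice extraction on pages {start}-{end}"


def extract_contract(start: int, end: int) -> str:
    return f"contract extraction on pages {start}-{end}"


def manual_review(start: int, end: int) -> str:
    return f"manual review of pages {start}-{end}"


# Route table keyed by lower-cased label; extend with one entry per document type.
HANDLERS: Dict[str, Callable[[int, int], str]] = {
    "invoice": extract_invoice,
    "contract": extract_contract,
}


def process(segments: List[Segment]) -> List[str]:
    """Send each segment to the handler for its label, or to manual review."""
    results = []
    for start, end, label in segments:
        handler = HANDLERS.get(label.strip().lower(), manual_review)
        results.append(handler(start, end))
    return results


if __name__ == "__main__":
    for line in process([(1, 2, "Invoice"), (3, 7, "Contract"), (8, 8, "Receipt")]):
        print(line)
```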
Built for real-world document batches.
Handles the messy reality of production document processing.
Reliable, accurate, flexible
Powered by LM-Kit's neuro-symbolic engine, DocumentSplitting handles the edge cases that break template-based and pure LLM systems: mixed document types, varying page counts, scanned versus digital content, and documents in multiple languages. The engine is improved with every release across both text and vision modalities.
- Confidence Scoring: Each result includes a confidence score so you can flag low-confidence splits for human review
- Semantic Guidance: Provide hints about expected document types ("invoices and purchase orders") for higher accuracy
- Multi-page Documents: Correctly groups multi-page documents (e.g., a 5-page contract) using page-numbering and continuation markers
- Optional OCR Integration: Plug in an OCR engine for scanned documents that benefit from text-level analysis
- Async Support: Both synchronous and asynchronous APIs with cancellation token support
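One way to use the confidence score is a simple threshold that separates splits you accept automatically from those sent to a reviewer. The sketch below assumes scores in the range 0 to 1; the `triage` function and the 0.8 threshold are illustrative choices, not values from the SDK, and should be tuned on your own batches.

```python
from typing import Dict, List, Tuple


def triage(results: Dict[str, float], threshold: float = 0.8) -> Tuple[List[str], List[str]]:
    """Partition files into (accepted, needs_review) by split confidence."""
    accepted: List[str] = []
    needs_review: List[str] = []
    for name, confidence in results.items():
        if confidence >= threshold:
            accepted.append(name)
        else:
            needs_review.append(name)
    return accepted, needs_review


if __name__ == "__main__":
    accepted, needs_review = triage({"batch-a.pdf": 0.95, "batch-b.pdf": 0.62})
    print("accepted:", accepted)
    print("needs review:", needs_review)
```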
Templates: no rules or patterns to maintain
Training data: works out of the box
Cloud calls: 100% local processing
Document type: invoices, contracts, IDs, forms...
Capability
Page-number awareness
Detects pagination markers like "Page 2/5" and continuation headers to keep multi-page documents together as a single segment.
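To make the idea concrete, here is a minimal text-only sketch of pagination detection. It assumes markers of the form "Page 2/5" or "Page 2 of 5" in extracted page text; the regular expression and function names are hypothetical, and the actual engine also works from page images rather than text alone.

```python
import re
from typing import Optional, Tuple

# Matches "Page 2/5", "page 2 / 5", "Page 2 of 5" (case-insensitive).
PAGE_MARKER = re.compile(r"\bpage\s+(\d+)\s*(?:/|of)\s*(\d+)\b", re.IGNORECASE)


def find_page_marker(text: str) -> Optional[Tuple[int, int]]:
    """Return (current, total) if the text contains a pagination marker."""
    match = PAGE_MARKER.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def continues_previous(prev: Optional[Tuple[int, int]],
                       curr: Optional[Tuple[int, int]]) -> bool:
    """True when curr is the next page of the same paginated document as prev."""
    if prev is None or curr is None:
        return False
    return curr[1] == prev[1] and curr[0] == prev[0] + 1


if __name__ == "__main__":
    first = find_page_marker("ACME Corp    Page 1/5")
    second = find_page_marker("Terms (continued)    Page 2/5")
    print(first, second, continues_previous(first, second))
```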
Capability
Label normalization
Strips page-numbering suffixes from labels before comparison, ensuring "Invoice (1/3)" and "Invoice (2/3)" are recognized as the same document.
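A normalization step of this kind might look like the following sketch. The suffix pattern and the `normalize_label` and `same_document` helpers are assumptions for illustration; they handle trailing "(1/3)" and "2 of 3" style suffixes and compare labels case-insensitively.

```python
import re

# Trailing page-numbering suffix such as " (1/3)", " [2/3]", or " 2 of 3".
PAGE_SUFFIX = re.compile(r"\s*[\(\[]?\s*\d+\s*(?:/|of)\s*\d+\s*[\)\]]?\s*$", re.IGNORECASE)


def normalize_label(label: str) -> str:
    """Remove a trailing page-numbering suffix and surrounding whitespace."""
    return PAGE_SUFFIX.sub("", label).strip()


def same_document(label_a: str, label_b: str) -> bool:
    """Compare two labels after normalization, ignoring case."""
    return normalize_label(label_a).casefold() == normalize_label(label_b).casefold()


if __name__ == "__main__":
    print(normalize_label("Invoice (1/3)"))
    print(same_document("Invoice (1/3)", "Invoice (2/3)"))
```

Labels that merely end in a number, such as "Form 1099", are left untouched because the pattern requires a "current/total" pair.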
Capability
Page coverage validation
Validates that every page is accounted for exactly once with no gaps or overlaps. Falls back gracefully if the model output is incomplete.
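The check described here can be sketched as a single pass over sorted page ranges. This is an illustrative version, not the SDK's code: `coverage_problems` and `segments_or_fallback` are hypothetical names, and treating the whole file as one document is just one possible fallback.

```python
from typing import List, Tuple

PageRange = Tuple[int, int]  # (start_page, end_page), 1-based and inclusive


def coverage_problems(segments: List[PageRange], page_count: int) -> List[str]:
    """List gaps, overlaps, and out-of-range pages; empty list means full coverage."""
    problems: List[str] = []
    expected = 1  # next page that should start a segment
    for start, end in sorted(segments):
        if end < start:
            problems.append(f"invalid range {start}-{end}")
            continue
        if start > expected:
            problems.append(f"gap: pages {expected}-{start - 1}")
        elif start < expected:
            problems.append(f"overlap: pages {start}-{min(end, expected - 1)}")
        expected = max(expected, end + 1)
    if expected <= page_count:
        problems.append(f"gap: pages {expected}-{page_count}")
    elif expected > page_count + 1:
        problems.append(f"out of range: pages {page_count + 1}-{expected - 1}")
    return problems


def segments_or_fallback(segments: List[PageRange], page_count: int) -> List[PageRange]:
    """Keep valid output; otherwise fall back to one segment covering every page."""
    if coverage_problems(segments, page_count):
        return [(1, page_count)]
    return segments


if __name__ == "__main__":
    print(coverage_problems([(1, 2), (4, 6)], page_count=6))
    print(segments_or_fallback([(1, 2), (4, 6)], page_count=6))
```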
Capability
Dynamic Sampling engine
LM-Kit's proprietary neuro-symbolic inference framework combines neural generation with symbolic validation at every step, delivering 75% fewer errors.
Capability
Text & vision modalities
Processes both visual page layouts and extracted text simultaneously. The engine is continuously improved across both modalities with every SDK release.
Capability
Single-page fast path
Single-page PDFs skip inference entirely and return instantly with 100% confidence. No wasted compute.
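The fast path amounts to a short-circuit before any model call. In the sketch below, `split_pages`, the `run_inference` callback, and the placeholder label "Document" are all assumptions made for illustration; the source does not state which label a single-page result carries.

```python
from typing import Callable, List, Tuple

Segment = Tuple[int, int, str]              # (start_page, end_page, label)
SplitResult = Tuple[List[Segment], float]   # (segments, confidence in [0, 1])


def split_pages(page_count: int,
                run_inference: Callable[[int], SplitResult]) -> SplitResult:
    """Return a trivial result for one page; call the model only for longer files."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if page_count == 1:
        # One page can only be one document, so no inference is needed.
        return [(1, 1, "Document")], 1.0
    return run_inference(page_count)


if __name__ == "__main__":
    def fake_inference(n: int) -> SplitResult:
        # Stand-in for a real model call.
        return [(1, n, "Contract")], 0.9

    print(split_pages(1, fake_inference))
    print(split_pages(3, fake_inference))
```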
From mailroom to compliance.
Any workflow that handles batched or bundled documents benefits from intelligent splitting.
Use case
Mailroom automation
Incoming scanned mail batches contain mixed documents. Split into individual items before routing to departments.
Use case
Accounts payable
Vendors send multi-page PDFs with invoices, credit notes, and remittance advice bundled together. Separate each for processing.
Use case
Insurance claims
Claims packages contain application forms, supporting evidence, medical reports, and photos. Split before adjudication.
Use case
Legal document bundles
Court filings, contracts with exhibits, and deposition packages. Separate each legal document for indexing and review.
Use case
KYC and onboarding
Customer onboarding packets combine ID cards, proof of address, bank statements, and signed forms. Split for individual verification.
Use case
Healthcare records
Patient folders contain lab results, prescriptions, referral letters, and consent forms. Split them entirely on-device so patient data never leaves your infrastructure, which supports HIPAA compliance.
Document splitting demo.
Interactive console application demonstrating neuro-symbolic document boundary detection powered by LM-Kit's fast processing engine.
Splitting demo
A complete console application that loads a vision model, processes multi-page PDFs using LM-Kit's neuro-symbolic engine, and displays detected document segments with page ranges, labels, and confidence scores.
- Multiple vision model support (Qwen, Gemma, MiniCPM)
- Interactive model selection menu
- Progress tracking during model download
- Detailed segment output with confidence
Key classes.
The building blocks for neuro-symbolic document splitting.
Class
DocumentSplitting
Main class for detecting logical document boundaries using the neuro-symbolic engine. Requires a vision-capable model. Supports guidance and optional OCR integration.
View documentation
Type
DocumentSplittingResult
Contains the detected segments, document count, confidence score, and whether multiple documents were found.
View documentation
Type
DocumentSegment
Represents a single logical document with StartPage, EndPage, PageCount, and a descriptive Label.
View documentation
Type
Attachment
Represents the input PDF. Provides page-level access and integrates with the vision analysis pipeline.
View documentation
Splitting plus the rest of Document Intelligence.
Document classification
After splitting, classify each segment. Different document types route to different downstream pipelines.
Structured extraction
Extract typed fields from each split document. Schema per category.
PDF toolkit
After detecting boundaries, save each segment as a separate PDF via pdf_split.
OCR
When the input is a stack of scans, OCR can run as part of the splitting pipeline, including on tables and seals.
Build it. Read it. Try it.
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.
Document splitting
Console demo: split multi-document scans into individual documents.
Open on GitHub →
How-to guide
Split multi-document files
Mailroom-style splitting with confidence scoring.
Read the guide →
How-to guide
Build a multi-format document ingestion pipeline
Ingest PDF, DOCX, HTML, EML in one pipeline; route by type.
Read the guide →
Stop sorting pages manually.
Intelligent document splitting powered by LM-Kit's neuro-symbolic AI engine. Three lines of code. Zero templates. 100% local.