One PDF. Multiple Documents. Automatically Separated.
Intelligent document splitting powered by LM-Kit's neuro-symbolic AI engine. Vision language models combined with symbolic validation detect where one document ends and another begins inside multi-page PDFs. No templates, no rules, no training required. Powered by our super-fast document and image processing engine, continuously improved across text and vision modalities. 100% on-device.
Neuro-Symbolic Engine
Neural vision models combined with symbolic validation layers for reliable results.
Page Range Detection
Returns exact start/end pages for each logical document.
Automatic Labels
Each segment gets a descriptive label: "Invoice", "Contract", "ID Card".
Continuously Improved
Engine updated with every release across both text and vision modalities.
The Missing Step in Every Document Pipeline
Scanners, copiers, and email attachments routinely bundle unrelated documents into a single PDF: an invoice stapled to a purchase order, an ID card next to a bank statement, a contract followed by its appendices. Before you can classify, extract, or route these documents, you need to know where each one starts and ends.
LM-Kit.NET's DocumentSplitting class solves this using our internal neuro-symbolic engine. Vision language models analyze each page visually, while symbolic AI layers (grammar constraints, fuzzy logic, and rule-based validation) enforce structural correctness on every output. The result: precise page ranges with descriptive labels, faster processing, and significantly fewer errors than pure LLM approaches.
Powered by LM-Kit's super-fast document and image processing engine, our neuro-symbolic architecture is continuously improved across both text and vision modalities. It handles scanned documents, digital PDFs, mixed layouts, different languages, and rotated pages. If a human can see where one document ends and another begins, so can DocumentSplitting.
using LMKit.Extraction;
using LMKit.Model;
using LMKit.Data;

// Load a vision-capable model
var model = LM.LoadFromModelID("qwen3-vl:8b");

// Create the splitter
var splitter = new DocumentSplitting(model);

// Optionally provide guidance
splitter.Guidance = "Invoices and purchase orders.";

// Analyze a multi-page PDF
var result = splitter.Split(new Attachment("scanned_batch.pdf"));

// Check results
Console.WriteLine($"Found {result.DocumentCount} documents");
Console.WriteLine($"Confidence: {result.Confidence:P0}");

// Iterate detected segments
foreach (var seg in result.Segments)
{
    Console.WriteLine($"Pages {seg.StartPage}-{seg.EndPage}");
    Console.WriteLine($"  Type: {seg.Label}");
}
Neuro-Symbolic Boundary Detection
Page images are processed by LM-Kit's neuro-symbolic engine: a vision language model generates boundary hypotheses while symbolic AI layers validate, correct, and enforce structural integrity on every result.
Why the Neuro-Symbolic Approach Excels
Traditional rule-based splitters rely on text patterns, barcodes, or separator pages, so they break when documents have inconsistent formatting. Pure LLM approaches can hallucinate boundaries and produce structurally invalid outputs.
LM-Kit takes a fundamentally different approach with its Dynamic Sampling framework: a vision language model sees each page as an image and understands the visual layout, while symbolic AI layers (grammar constraints, fuzzy logic, taxonomy matching, and rule-based validation) enforce correctness at every generation step. This neuro-symbolic architecture, built on top of LM-Kit's super-fast document and image processing engine, delivers 75% fewer errors and 2× faster processing compared to pure LLM approaches.
Page Rendering
LM-Kit's fast image processing engine renders each PDF page for the VLM.
Neural Analysis
The VLM classifies each page by document type and detects visual transitions.
Symbolic Validation
Grammar constraints and rule-based validation enforce structural correctness on the output.
Coverage & Normalization
Page coverage is validated with no gaps or overlaps. Labels are normalized for consistency.
Split First, Then Process
Document splitting is the natural first step in any document intelligence pipeline. LM-Kit's neuro-symbolic engine handles the split, then each document flows to specialized downstream processing.
Ingest
Load multi-page PDF from scanner, email, or upload.
Split
Neuro-symbolic engine detects boundaries and returns page ranges with labels.
Classify
Route each segment by type: invoice, contract, ID, form.
Extract
Apply schema-specific extraction to each individual document.
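As a rough illustration of the Classify step, each labeled segment can be routed to a downstream queue. This is only a sketch under stated assumptions: SplitSegment is a hypothetical stand-in that mirrors the StartPage, EndPage, and Label fields described on this page, and the queue names and label spellings are invented for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a detected segment (not the LM-Kit type).
public sealed record SplitSegment(int StartPage, int EndPage, string Label);

public static class SegmentRouter
{
    // Illustrative label-to-queue mapping; real labels depend on the model output.
    private static readonly Dictionary<string, string> Queues =
        new(StringComparer.OrdinalIgnoreCase)
        {
            ["Invoice"] = "accounts-payable",
            ["Contract"] = "legal",
            ["ID Card"] = "kyc",
        };

    // Returns the queue for a segment, or "manual-review" for unknown labels.
    public static string QueueFor(SplitSegment segment) =>
        Queues.TryGetValue(segment.Label.Trim(), out var queue) ? queue : "manual-review";
}
```

Sending unknown labels to a manual queue rather than dropping them keeps every segment accounted for when the splitter returns a label the mapping has not seen.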
Built for Real-World Document Batches
Handles the messy reality of production document processing.
Reliable, Accurate, Flexible
Powered by LM-Kit's neuro-symbolic engine, DocumentSplitting handles the edge cases that break template-based and pure LLM systems: mixed document types, varying page counts, scanned vs digital content, and documents in multiple languages. The engine is continuously improved with every release across both text and vision modalities.
- Confidence Scoring: Each result includes a confidence score so you can flag low-confidence splits for human review
- Semantic Guidance: Provide hints about expected document types ("invoices and purchase orders") for higher accuracy
- Multi-page Documents: Correctly groups multi-page documents (e.g., a 5-page contract) using page-numbering and continuation markers
- Optional OCR Integration: Plug in an OCR engine for scanned documents that benefit from text-level analysis
- Async Support: Both synchronous and asynchronous APIs with cancellation token support
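The confidence score lends itself to a simple review gate. The sketch below assumes the score is a fraction between 0 and 1 (the sample above formats it as a percentage with P0); the ReviewPolicy class and the 0.80 cut-off are hypothetical and not part of the SDK.

```csharp
using System;

public static class ReviewPolicy
{
    // Hypothetical cut-off; tune it against your own labeled batches.
    public const double DefaultThreshold = 0.80;

    // Returns true when a split should be queued for human review.
    public static bool NeedsHumanReview(double confidence, double threshold = DefaultThreshold)
    {
        if (double.IsNaN(confidence) || confidence < 0.0 || confidence > 1.0)
            throw new ArgumentOutOfRangeException(nameof(confidence), "Expected a value in [0, 1].");

        return confidence < threshold;
    }
}
```

A call such as ReviewPolicy.NeedsHumanReview(result.Confidence) could then decide whether a batch continues automatically or waits for an operator.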
Page-Number Awareness
Detects pagination markers like "Page 2/5" and continuation headers to keep multi-page documents together as a single segment.
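To make the idea concrete, here is a minimal text-level sketch of pagination-marker detection. It is not the engine's implementation, which works on page images; the PaginationMarkers class, its regular expression, and the continuation rule are assumptions chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PaginationMarkers
{
    // Matches markers such as "Page 2/5" or "page 3 of 5".
    private static readonly Regex Marker = new Regex(
        @"\bpage\s+(\d+)\s*(?:/|of)\s*(\d+)\b",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Extracts the current page and total from the first marker found.
    public static bool TryParse(string pageText, out int current, out int total)
    {
        current = 0;
        total = 0;
        Match m = Marker.Match(pageText ?? string.Empty);
        return m.Success
            && int.TryParse(m.Groups[1].Value, out current)
            && int.TryParse(m.Groups[2].Value, out total)
            && current >= 1 && current <= total;
    }

    // A page continues the previous one when it is the next number
    // within the same total page count.
    public static bool IsContinuation(int prevCurrent, int prevTotal, int nextCurrent, int nextTotal) =>
        nextTotal == prevTotal && nextCurrent == prevCurrent + 1;
}
```

Under this rule, "Page 2/5" followed by "Page 3/5" stays in one segment, while "Page 5/5" followed by "Page 1/3" signals a new document.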
Label Normalization
Strips page-numbering suffixes from labels before comparison, ensuring "Invoice (1/3)" and "Invoice (2/3)" are recognized as the same document.
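A minimal sketch of this kind of normalization, assuming suffixes take the parenthesized form shown above, such as "(1/3)" or "(2 of 3)". The LabelNormalizer class is hypothetical, and the engine's actual rules may differ.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LabelNormalizer
{
    // Matches a trailing page marker such as "(1/3)" or "(2 of 3)".
    private static readonly Regex PageSuffix = new Regex(
        @"\s*\(\s*\d+\s*(?:/|of)\s*\d+\s*\)\s*$",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Removes the trailing page marker, if any, and surrounding whitespace.
    public static string Normalize(string label) =>
        PageSuffix.Replace(label ?? string.Empty, string.Empty).Trim();

    // Compares two labels after normalization, ignoring case.
    public static bool SameDocument(string a, string b) =>
        string.Equals(Normalize(a), Normalize(b), StringComparison.OrdinalIgnoreCase);
}
```

Anchoring the pattern to the end of the label and requiring parentheses keeps it from touching labels that merely contain numbers, such as "Form 1040".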
Page Coverage Validation
Validates that every page is accounted for exactly once with no gaps or overlaps. Falls back gracefully if the model output is incomplete.
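The coverage rule can be sketched as a small check over 1-based, inclusive page ranges. PageRange and CoverageCheck are hypothetical names, and the fallback shown (treating the whole file as one document) is an assumption, since this page does not describe how the engine falls back.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a detected segment (1-based, inclusive pages).
public readonly record struct PageRange(int StartPage, int EndPage);

public static class CoverageCheck
{
    // Returns true when the ranges cover pages 1..pageCount exactly once,
    // with no gaps and no overlaps.
    public static bool CoversEveryPageOnce(IEnumerable<PageRange> ranges, int pageCount)
    {
        int nextExpected = 1;
        foreach (var r in ranges.OrderBy(r => r.StartPage))
        {
            if (r.EndPage < r.StartPage) return false;      // inverted range
            if (r.StartPage != nextExpected) return false;  // gap or overlap
            nextExpected = r.EndPage + 1;
        }
        return nextExpected == pageCount + 1;
    }

    // Assumed fallback: if validation fails, treat the file as a single document.
    public static IReadOnlyList<PageRange> OrSingleDocument(
        IReadOnlyList<PageRange> ranges, int pageCount) =>
        CoversEveryPageOnce(ranges, pageCount)
            ? ranges
            : new[] { new PageRange(1, pageCount) };
}
```

Sorting by start page first means the check does not depend on the order in which segments were produced.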
Dynamic Sampling Engine
LM-Kit's proprietary neuro-symbolic inference framework combines neural generation with symbolic validation at every step, delivering 75% fewer errors than pure LLM approaches.
Text & Vision Modalities
Processes both visual page layouts and extracted text simultaneously. The engine is continuously improved across both modalities with every SDK release.
Single-Page Fast Path
Single-page PDFs skip inference entirely and return instantly with 100% confidence. No wasted compute.
From Mailroom to Compliance
Any workflow that handles batched or bundled documents benefits from intelligent splitting.
Mailroom Automation
Incoming scanned mail batches contain mixed documents. Split into individual items before routing to departments.
Accounts Payable
Vendors send multi-page PDFs with invoices, credit notes, and remittance advice bundled together. Separate each for processing.
Insurance Claims
Claims packages contain application forms, supporting evidence, medical reports, and photos. Split before adjudication.
Legal Document Bundles
Court filings, contracts with exhibits, and deposition packages. Separate each legal document for indexing and review.
KYC and Onboarding
Customer onboarding packets combine ID cards, proof of address, bank statements, and signed forms. Split for individual verification.
Healthcare Records
Patient folders contain lab results, prescriptions, referral letters, and consent forms. Split them with local processing so patient data stays on-device, which supports HIPAA compliance.
Document Splitting Demo
Interactive console application demonstrating neuro-symbolic document boundary detection powered by LM-Kit's fast processing engine.
Splitting Demo
A complete console application that loads a vision model, processes multi-page PDFs using LM-Kit's neuro-symbolic engine, and displays detected document segments with page ranges, labels, and confidence scores.
- Multiple vision model support (Qwen, Gemma, MiniCPM)
- Interactive model selection menu
- Progress tracking during model download
- Detailed segment output with confidence
Key Classes
The building blocks for neuro-symbolic document splitting.
DocumentSplitting
Main class for detecting logical document boundaries using the neuro-symbolic engine. Requires a vision-capable model. Supports guidance and optional OCR integration.
DocumentSplittingResult
Contains the detected segments, document count, confidence score, and whether multiple documents were found.
DocumentSegment
Represents a single logical document with StartPage, EndPage, PageCount, and a descriptive Label.
Attachment
Represents the input PDF. Provides page-level access and integrates with the vision analysis pipeline.
Stop Sorting Pages Manually
Intelligent document splitting powered by LM-Kit's neuro-symbolic AI engine. Three lines of code. Zero templates. 100% local.