
One PDF. Multiple documents. Automatically separated.

Intelligent document splitting powered by LM-Kit's neuro-symbolic AI engine. Vision language models combined with symbolic validation detect where one document ends and another begins inside multi-page PDFs. No templates to build, no rules to write, no model training required. It runs 100% on-device on LM-Kit's fast document and image processing engine, which is continuously improved across text and vision modalities.

Neuro-symbolic engine · Zero templates · Text & vision modalities

Neuro-symbolic engine

Neural vision models combined with symbolic validation layers for reliable results.

Page range detection

Returns exact start/end pages for each logical document.

Automatic labels

Each segment gets a descriptive label: "Invoice", "Contract", "ID Card".

Continuously improved

Engine updated with every release across both text and vision modalities.

3 lines of code · 0 templates needed · 0 cloud calls

Why splitting matters

The missing step in every document pipeline.

Scanners, copiers, and email attachments routinely bundle unrelated documents into a single PDF: an invoice stapled to a purchase order, an ID card next to a bank statement, a contract followed by its appendices. Before you can classify, extract, or route these documents, you need to know where each one starts and ends.

LM-Kit.NET's DocumentSplitting class solves this using our internal neuro-symbolic engine. Vision language models analyze each page visually, while symbolic AI layers (grammar constraints, fuzzy logic, and rule-based validation) enforce structural correctness on every output. The result: precise page ranges with descriptive labels, faster processing, and significantly fewer errors than pure LLM approaches.

DocumentSplitting runs on LM-Kit's fast document and image processing engine, and the neuro-symbolic architecture is continuously improved across both text and vision modalities. It handles scanned documents, digital PDFs, mixed layouts, multiple languages, and rotated pages. If a human can see where one document ends and another begins, so can DocumentSplitting.

Under the hood

Neuro-symbolic boundary detection.

Page images are processed by LM-Kit's neuro-symbolic engine: a vision language model generates boundary hypotheses while symbolic AI layers validate, correct, and enforce structural integrity on every result.

Why the neuro-symbolic approach excels

Traditional rule-based splitters rely on text patterns, barcodes, or separator pages. They break when documents have inconsistent formatting. Pure LLM approaches hallucinate boundaries and produce structurally invalid outputs.

LM-Kit takes a fundamentally different approach with its Dynamic Sampling framework: a vision language model sees each page as an image and understands the visual layout, while symbolic AI layers (grammar constraints, fuzzy logic, taxonomy matching, and rule-based validation) enforce correctness at every generation step. This neuro-symbolic architecture, built on top of LM-Kit's super-fast document and image processing engine, delivers 75% fewer errors and 2× faster processing compared to pure LLM approaches.

Step 01

Page rendering

LM-Kit's fast image processing engine renders each PDF page for the VLM.

Step 02

Neural analysis

The VLM classifies each page by document type and detects visual transitions.

Step 03

Symbolic validation

Grammar constraints and rule-based validation enforce structural correctness on the output.

Step 04

Coverage & normalization

Page coverage is validated with no gaps or overlaps. Labels are normalized for consistency.
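The four steps above end with page ranges and labels. As a rough illustration of that last stage only, the sketch below merges per-page labels into segments with start and end pages. It is written in Python for brevity and is not LM-Kit's implementation: the `Segment` type and `group_pages` function are hypothetical names, and the real engine also uses visual transitions, which this simple label comparison ignores.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    label: str
    start_page: int  # 1-based, inclusive
    end_page: int    # 1-based, inclusive


def group_pages(page_labels: List[str]) -> List[Segment]:
    """Merge runs of consecutive pages that share a label into segments."""
    segments: List[Segment] = []
    for page_number, label in enumerate(page_labels, start=1):
        if segments and segments[-1].label == label:
            # Same label as the previous page: extend the current segment.
            segments[-1].end_page = page_number
        else:
            # Label changed: a new logical document starts here.
            segments.append(Segment(label, page_number, page_number))
    return segments


# Hypothetical per-page labels for a six-page PDF.
segments = group_pages(
    ["Invoice", "Invoice", "Contract", "Contract", "Contract", "ID Card"]
)
for s in segments:
    print(f"{s.label}: pages {s.start_page}-{s.end_page}")
```

Note that a label-only grouping like this would merge two back-to-back invoices into one segment, which is exactly the kind of case that needs page-level visual and pagination cues.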

Pipeline integration

Split first, then process.

Document splitting is the natural first step in any document intelligence pipeline. LM-Kit's neuro-symbolic engine handles the split, then each document flows to specialized downstream processing.

Step 01

Ingest

Load multi-page PDF from scanner, email, or upload.

Step 02

Split

Neuro-symbolic engine detects boundaries and returns page ranges with labels.

Step 03

Classify

Route each segment by type: invoice, contract, ID, form.

Step 04

Extract

Apply schema-specific extraction to each individual document.
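The split, classify, and extract flow can be pictured as a small dispatch table. The following Python sketch is illustrative only: the handler functions, the `HANDLERS` table, and the tuple layout for a segment are assumptions made for this example, not part of the LM-Kit.NET API.

```python
from typing import Callable, Dict, List, Tuple

# (label, start_page, end_page); a hypothetical shape for a split result.
Segment = Tuple[str, int, int]


def extract_invoice(start: int, end: int) -> str:
    return f"invoice extraction on pages {start}-{end}"


def extract_contract(start: int, end: int) -> str:
    return f"contract extraction on pages {start}-{end}"


def send_to_manual_review(start: int, end: int) -> str:
    return f"manual review for pages {start}-{end}"


# One downstream handler per document type.
HANDLERS: Dict[str, Callable[[int, int], str]] = {
    "invoice": extract_invoice,
    "contract": extract_contract,
}


def route(segments: List[Segment]) -> List[str]:
    """Send each segment to the handler for its type, or to manual review."""
    results = []
    for label, start, end in segments:
        handler = HANDLERS.get(label.lower(), send_to_manual_review)
        results.append(handler(start, end))
    return results


routed = route([("Invoice", 1, 2), ("Contract", 3, 5), ("Memo", 6, 6)])
```

Unknown labels fall through to manual review here, so a new document type never gets silently dropped.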

Key capabilities

Built for real-world document batches.

Handles the messy reality of production document processing.

Reliable, accurate, flexible

Powered by LM-Kit's neuro-symbolic engine, DocumentSplitting handles the edge cases that break template-based and pure LLM systems: mixed document types, varying page counts, scanned vs digital content, and documents in multiple languages. The engine is continuously improved with every release across both text and vision modalities.

  • Confidence Scoring: Each result includes a confidence score so you can flag low-confidence splits for human review
  • Semantic Guidance: Provide hints about expected document types ("invoices and purchase orders") for higher accuracy
  • Multi-page Documents: Correctly groups multi-page documents (e.g., a 5-page contract) using page-numbering and continuation markers
  • Optional OCR Integration: Plug in an OCR engine for scanned documents that benefit from text-level analysis
  • Async Support: Both synchronous and asynchronous APIs with cancellation token support
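To show how confidence scores might feed a human-review queue, here is a minimal Python sketch. It assumes a score between 0.0 and 1.0 attached to each segment; the `SplitSegment` type, the `partition_for_review` function, and the 0.8 threshold are hypothetical and should be tuned against your own documents.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SplitSegment:
    label: str
    start_page: int
    end_page: int
    confidence: float  # assumed to be in [0.0, 1.0]


def partition_for_review(
    segments: List[SplitSegment], threshold: float = 0.8
) -> Tuple[List[SplitSegment], List[SplitSegment]]:
    """Return (accepted, needs_review) based on a confidence threshold."""
    accepted = [s for s in segments if s.confidence >= threshold]
    needs_review = [s for s in segments if s.confidence < threshold]
    return accepted, needs_review


batch = [
    SplitSegment("Invoice", 1, 2, 0.97),
    SplitSegment("Contract", 3, 7, 0.62),
    SplitSegment("ID Card", 8, 8, 0.91),
]
accepted, needs_review = partition_for_review(batch)
```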

  • Templates: 0. No rules or patterns to maintain
  • Training data: 0. Works out of the box
  • Cloud calls: 0. 100% local processing
  • Document type: Any. Invoices, contracts, IDs, forms...

Capability

Page-number awareness

Detects pagination markers like "Page 2/5" and continuation headers to keep multi-page documents together as a single segment.
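As an illustration of the idea, the Python sketch below finds markers such as "Page 2/5" or "page 3 of 5" in page text and checks whether one page continues the previous one. The regular expression and both function names are assumptions for this example; the engine's actual detection logic is not documented here.

```python
import re
from typing import Optional, Tuple

# Matches "Page 2/5" and "page 3 of 5", case-insensitively.
PAGE_MARKER = re.compile(r"\bpage\s+(\d+)\s*(?:/|of)\s*(\d+)\b", re.IGNORECASE)


def parse_page_marker(text: str) -> Optional[Tuple[int, int]]:
    """Return (current, total) if the text holds a plausible page marker."""
    match = PAGE_MARKER.search(text)
    if not match:
        return None
    current, total = int(match.group(1)), int(match.group(2))
    if current < 1 or current > total:
        return None  # e.g. "Page 7/5" is not a usable marker
    return current, total


def continues_previous(
    prev: Optional[Tuple[int, int]], curr: Optional[Tuple[int, int]]
) -> bool:
    """True when curr is the next page of the same paginated document."""
    if prev is None or curr is None:
        return False
    return curr[1] == prev[1] and curr[0] == prev[0] + 1
```

For example, `continues_previous((2, 5), (3, 5))` is true, while a jump from `(5, 5)` to `(1, 3)` signals that a new document has started.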

Capability

Label normalization

Strips page-numbering suffixes from labels before comparison, ensuring "Invoice (1/3)" and "Invoice (2/3)" are recognized as the same document.
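A minimal Python sketch of this kind of normalization follows. It only handles parenthesized "(n/m)" suffixes, and the pattern and function name are hypothetical rather than taken from the SDK.

```python
import re

# Trailing page-numbering suffix such as " (1/3)".
_PAGE_SUFFIX = re.compile(r"\s*\(\s*\d+\s*/\s*\d+\s*\)\s*$")


def normalize_label(label: str) -> str:
    """Remove a trailing "(n/m)" suffix so labels can be compared."""
    return _PAGE_SUFFIX.sub("", label).strip()


same_document = normalize_label("Invoice (1/3)") == normalize_label("Invoice (2/3)")
```

Labels without a suffix, such as "Contract" or "Form 1099", pass through unchanged.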

Capability

Page coverage validation

Validates that every page is accounted for exactly once with no gaps or overlaps. Falls back gracefully if the model output is incomplete.
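The check itself is simple to express. The Python sketch below reports gaps and overlaps in a list of page ranges; the function names are hypothetical, and the fallback shown (treating the whole file as one segment) is an assumption for illustration, since the actual fallback behavior is not specified above.

```python
from typing import List, Tuple

PageRange = Tuple[int, int]  # (start_page, end_page), 1-based and inclusive


def find_coverage_problems(ranges: List[PageRange], page_count: int) -> List[str]:
    """Return a list of problems; an empty list means exact coverage."""
    problems: List[str] = []
    next_expected = 1
    for start, end in sorted(ranges):
        if start < 1 or end > page_count or start > end:
            problems.append(f"invalid range {start}-{end}")
            continue
        if start > next_expected:
            problems.append(f"gap: pages {next_expected}-{start - 1}")
        elif start < next_expected:
            problems.append(f"overlap: pages {start}-{min(end, next_expected - 1)}")
        next_expected = max(next_expected, end + 1)
    if next_expected <= page_count:
        problems.append(f"gap: pages {next_expected}-{page_count}")
    return problems


def ranges_or_fallback(ranges: List[PageRange], page_count: int) -> List[PageRange]:
    """Assumed fallback: keep the file as one segment if coverage is broken."""
    if find_coverage_problems(ranges, page_count):
        return [(1, page_count)]
    return sorted(ranges)
```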

Capability

Dynamic Sampling engine

LM-Kit's proprietary neuro-symbolic inference framework combines neural generation with symbolic validation at every step, delivering 75% fewer errors.

Capability

Text & vision modalities

Processes both visual page layouts and extracted text simultaneously. The engine is continuously improved across both modalities with every SDK release.

Capability

Single-page fast path

Single-page PDFs skip inference entirely and return instantly with 100% confidence. No wasted compute.
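The fast path amounts to an early return before any model call. In the Python sketch below, `split_with_fast_path`, the dictionary layout, and the placeholder label "Document" are all assumptions for illustration; how the SDK labels a single-page result is not stated above.

```python
from typing import Callable, Dict, List

Segment = Dict[str, object]


def split_with_fast_path(
    page_count: int, run_inference: Callable[[int], List[Segment]]
) -> List[Segment]:
    """Skip inference for single-page inputs; otherwise delegate to the model."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if page_count == 1:
        # One page can only be one document, so the split is certain.
        return [{"label": "Document", "start_page": 1, "end_page": 1, "confidence": 1.0}]
    return run_inference(page_count)


inference_calls: List[int] = []


def fake_inference(page_count: int) -> List[Segment]:
    """Stand-in for the model call; records that it was invoked."""
    inference_calls.append(page_count)
    return [{"label": "Invoice", "start_page": 1, "end_page": page_count, "confidence": 0.9}]


single = split_with_fast_path(1, fake_inference)
multi = split_with_fast_path(4, fake_inference)
```

After both calls, `inference_calls` contains only the four-page request, which shows the single-page input never reached the model.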

Use cases

From mailroom to compliance.

Any workflow that handles batched or bundled documents benefits from intelligent splitting.

Use case

Mailroom automation

Incoming scanned mail batches contain mixed documents. Split into individual items before routing to departments.

Use case

Accounts payable

Vendors send multi-page PDFs with invoices, credit notes, and remittance advice bundled together. Separate each for processing.

Use case

Insurance claims

Claims packages contain application forms, supporting evidence, medical reports, and photos. Split before adjudication.

Use case

Legal document bundles

Court filings, contracts with exhibits, and deposition packages. Separate each legal document for indexing and review.

Use case

KYC and onboarding

Customer onboarding packets combine ID cards, proof of address, bank statements, and signed forms. Split for individual verification.

Use case

Healthcare records

Patient folders contain lab results, prescriptions, referral letters, and consent forms. Split them locally so patient data never leaves your infrastructure, which supports HIPAA compliance.

Try it now

Document splitting demo.

Interactive console application demonstrating neuro-symbolic document boundary detection powered by LM-Kit's fast processing engine.

Featured demo

Splitting demo

A complete console application that loads a vision model, processes multi-page PDFs using LM-Kit's neuro-symbolic engine, and displays detected document segments with page ranges, labels, and confidence scores.

  • Multiple vision model support (Qwen, Gemma, MiniCPM)
  • Interactive model selection menu
  • Progress tracking during model download
  • Detailed segment output with confidence

Related capabilities

Splitting plus the rest of Document Intelligence.

Document classification

After splitting, classify each segment. Different document types route to different downstream pipelines.


Structured extraction

Extract typed fields from each split document. Schema per category.


PDF toolkit

After detecting boundaries, save each segment as a separate PDF via pdf_split.


OCR

When the input is a stack of scans, an OCR engine can be plugged into the splitting pipeline. OCR covers tables and seals as well.


Install the SDK

Stop sorting pages manually.

Intelligent document splitting powered by LM-Kit's neuro-symbolic AI engine. Three lines of code. Zero templates. 100% local.
