Solutions · Document Intelligence · Layout understanding

Structure, without a language model.

Underneath every accurate conversion, extraction, classification, and RAG ingestion in LM-Kit.NET sits a single foundation: a deterministic, multi-layer layout-understanding pipeline written from first principles. Connected-component analysis, baseline-alignment scoring, column-aware line construction, paragraph detection, reading-order reconstruction, and layout-aware search. No LLM, no probabilistic guessing, no surprises. Predictable output, auditable decisions, observable at every stage.

40+ public types · 6 reading-order output modes · 6 spatial search modes

Deterministic

Same input, same output. Every decision is traceable and reproducible. No model drift between releases.

Original research

Custom equations for column geometry, proprietary scoring functions for line cohesion and reading-order recovery, bespoke clustering models. Calibrated against curated document corpora through controlled experimentation, not borrowed from a textbook.

Foundational

Consumed by conversion, extraction, RAG, classification, summarisation, OCR, and search throughout LM-Kit.NET.

Why deterministic layout

Some answers cannot be hallucinated.

A language model can describe a page. It cannot tell you the exact bounding box of a footnote, identify whether two columns of numbers share a baseline, or guarantee a stable reading order across reruns. Those are signal-processing and geometric questions. Solving them deterministically gives downstream consumers (conversion, extraction, RAG, classification, validation) a foundation that does not change under their feet.

Stable across reruns

No sampling, no temperature, no token randomness. Identical input produces byte-identical output. Critical for diff-based document workflows.

Auditable

Every paragraph, line, and column decision is the result of named algorithms with named thresholds. Stage-level callbacks expose the pipeline at every refinement pass.

Compute-light

Pure CPU, no model loading, no inference cost. Sub-second per page on commodity hardware. Runs on machines too small for any LLM.

Composable

Each layer is independent. Use line detection alone. Use paragraph grouping alone. Use the search engine alone. Or take the full pipeline.

Observable

Subscribe to per-stage callbacks (seam detection, line construction, paragraph refinement) for diagnostics and debugging. No black boxes.

Continuously refined

Built and tuned over years against real corpora: invoices, contracts, scientific papers, multi-column reports, scanned forms. Empirical thresholds, mathematically grounded.

The pipeline

Seven layers, one structured output.

Each layer feeds the next. Raw text and image data enter at the top; a fully structured PageElement tree (lines, paragraphs, regions, reading order, layer tags) emerges at the bottom.

  01 Preprocessing: rotation, skew, units, diacritics
  02 Column inference: cooperating strategies vote
  03 Line construction: adaptive tolerances per column
  04 Flow seams: density valleys, row consensus
  05 Paragraph grouping: spacing, indentation, signals
  06 Refinement loop: corrective passes to convergence
  07 Reading order: 2D sort, layer tagging

Layer 1

Preprocessing & artifact filtering

Tiny diacritics and punctuation flagged with up-scale factors. Page rotation, skew, and units (points / pixels) normalised. Fragmented characters merged. Source order preserved as a fallback.

Layer 2

Column structure inference

Multiple cooperating strategies infer column structure from page-level geometry, signal density, and statistical clustering. Each strategy votes; consensus wins. Fallbacks handle the cases each one alone misses.

Layer 3

Column-aware line construction

Per-column line construction with adaptive tolerances driven by page-local geometry. Word membership validated by a custom scoring function rather than hard thresholds. Anomalous gaps detected and respected as boundaries.

Layer 4

Flow seams & corridor detection

Mid-page voids analysed as column corridors. Density valleys surface intra-paragraph splits the initial geometry missed. Row-level consensus catches tabular content disguised as prose.

Layer 5

Paragraph grouping

Lines clustered into paragraphs using line spacing, indentation, hyphenation continuation, sentence-start signals, and font-size transitions. Initial grouping intentionally permissive; refinement passes correct it.

Layer 6

Iterative refinement loop

A convergence loop of corrective passes. Earlier passes are intentionally permissive; later passes tighten the structure. Each pass targets one specific failure mode the initial grouping is known to produce. Iteration continues until the structure stabilises.

Layer 7

Reading order & layer tagging

Final 2D sort recovers reading order across columns and rows. Each paragraph carries a layer tag (MainBody, Header, Sidebar, Footer) for downstream filtering.

Engineered from first principles

Equations we derived. Algorithms we engineered.

Layout understanding is, at root, a signal-processing and geometric-inference problem. We treat it like one. Custom-derived equations and proprietary scoring functions sit alongside well-understood techniques, every coefficient calibrated against curated document corpora, every threshold regression-tested. Methodology over folklore. Reproducible over mysterious. No probabilistic black boxes; every decision traces back to a coefficient, a tolerance, or a published method.

Fuzzy decision membership

Boolean thresholds discard information. Membership functions assign each candidate a continuous score over multiple weighted criteria, then combine those scores under named rule sets. Decisions that would be cliff-edges under hard thresholds behave like gradients in practice.
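As a minimal sketch of the idea (the criteria, weights, and cut-off ratios below are invented for illustration, not LM-Kit's proprietary rule sets): several soft criteria each map to a [0, 1] membership value, and a weighted combination replaces a single boolean test.

```csharp
using System;

static class MembershipSketch
{
    // Hypothetical criteria and weights; the real scoring functions differ.
    // gapRatio:      horizontal gap divided by the typical inter-word gap
    // baselineDelta: vertical baseline offset as a fraction of line height
    // fontRatio:     candidate font size divided by the line's font size
    public static double LineMembership(double gapRatio, double baselineDelta, double fontRatio)
    {
        // Each criterion becomes a soft [0, 1] membership value.
        double gapScore      = Math.Clamp(1.0 - gapRatio / 2.5, 0.0, 1.0);
        double baselineScore = Math.Clamp(1.0 - baselineDelta / 0.4, 0.0, 1.0);
        double fontScore     = Math.Clamp(1.0 - Math.Abs(1.0 - fontRatio), 0.0, 1.0);

        // Weighted combination instead of a single hard threshold.
        return 0.5 * gapScore + 0.3 * baselineScore + 0.2 * fontScore;
    }
}
```

A borderline word with a slightly odd baseline but a normal gap and font still scores well; under a boolean baseline test it would have been rejected outright.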

Robust statistical inference

Order-statistic estimators rather than means, so single outliers cannot dominate page-level metrics. Tolerances expressed as ratios over the natural unit of each page, making the pipeline scale-invariant across font sizes, DPIs, and document genres.
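A small illustration of both ideas, with an invented 0.35 ratio standing in for a calibrated one: the median ignores a single banner-sized glyph that would wreck a mean, and the tolerance is expressed relative to the page's own typical line height rather than as an absolute constant.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class RobustStats
{
    // Median instead of mean: one outlier glyph cannot drag the page metric.
    public static double MedianHeight(IReadOnlyList<double> heights)
    {
        var s = heights.OrderBy(h => h).ToArray();
        int n = s.Length;
        return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    // Tolerance as a ratio of the natural unit: scale-invariant across
    // font sizes and DPIs. The 0.35 ratio here is illustrative only.
    public static bool SameLineBand(double dy, IReadOnlyList<double> heights)
        => Math.Abs(dy) < 0.35 * MedianHeight(heights);
}
```

With heights {10, 10, 11, 10, 120}, the median stays at 10 while the mean jumps past 32, so a mean-based tolerance would merge lines that clearly do not belong together.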

Custom geometric equations

Proprietary equations for column inference, line cohesion, paragraph segmentation, and reading-order recovery. Each equation derived for the noise patterns real documents produce, not the textbook ideal case. Coefficients fit by controlled experimentation against curated corpora.

Layout-density signal processing

Projection-profile inference and density analysis applied to the page surface. Surfaces column corridors, intra-paragraph voids, sparse-vs-dense regions. Signal-processing primitives translated into layout primitives.
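A toy version of the projection-profile primitive (bin count and box representation are simplified assumptions): word boxes are projected onto horizontal bins, and bins with zero coverage between populated regions are the density valleys that suggest column corridors.

```csharp
using System;
using System.Collections.Generic;

static class DensityProfile
{
    // Count how many word boxes cover each horizontal bin across the page.
    // Empty bins between populated regions form valleys: candidate gutters.
    public static int[] VerticalProfile(IEnumerable<(double Left, double Right)> words,
                                        double pageWidth, int bins)
    {
        var profile = new int[bins];
        foreach (var (left, right) in words)
        {
            int from = Math.Max(0, (int)(left / pageWidth * bins));
            int to   = Math.Min(bins - 1, (int)(right / pageWidth * bins));
            for (int b = from; b <= to; b++) profile[b]++;
        }
        return profile;
    }
}
```

Two word boxes on a 100-unit page, spanning 0–40 and 60–100, leave the middle bins empty: the valley between them is the column gutter the real pipeline would refine further.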

Linguistic feature extraction

Lightweight deterministic NLP cooperating with geometry: sentence-onset signals, continuation cues, list-marker recognition, header signatures, script-direction inference. Each signal contributes to the membership scores upstream.

Calibrated approximate matching

Edit-distance and token-aware retrieval calibrated to the noise patterns OCR introduces. Tolerance bounds set empirically against real-world matches and misses, not by inspection.
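The core of the idea in miniature, with an invented 20% error ratio standing in for the empirically calibrated bounds: plain Levenshtein distance plus a tolerance expressed relative to the query length.

```csharp
using System;

static class FuzzySketch
{
    // Standard dynamic-programming Levenshtein distance.
    public static int EditDistance(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + cost);
            }
        return d[a.Length, b.Length];
    }

    // Length-relative tolerance; the 0.2 ratio is illustrative only.
    public static bool FuzzyMatch(string query, string candidate, double maxErrorRatio = 0.2)
        => EditDistance(query, candidate) <= (int)(query.Length * maxErrorRatio);
}
```

An OCR substitution like "Inv0ice" for "Invoice" is one edit away and matches; an unrelated word at the same length does not.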

Public surface

The types you reach for.

Layout understanding is exposed as a small set of primitive types you can compose freely. Use one, use them all.

Hub

PageElement

The page-level analysis object. Holds text elements, dimensions, rotation, skew. Methods: GetText(mode), DetectLines(), DetectParagraphs(), InferTextOutputMode(), ToJson(), FromJson(), Clone().

Granular

TextElement / LineElement / ParagraphElement

Three levels of granularity. Each carries bounds, text direction, font metrics, style flags. Paragraphs add layer-id, gap metrics, and structural flags (heading, list, quote, table-like).

Search

LayoutSearchEngine

Six search modes (text, regex, fuzzy, region, near, between). Returns TextMatch with bounds, score, and constituent elements. Single-page or cross-page.

Output

TextOutputMode

Six output styles: RawLines, GridAligned (financial / tabular), ParagraphFlow (articles), Structured (mixed RAG-ready), Auto, Markdown.

Internal

LayoutAnalysis / LogicalLayoutAnalysis

Static utility classes exposing the geometric and linguistic primitives the pipeline composes. Useful when you want to bypass the high-level API and assemble custom structural analysis from lower-level signals.

Image

ConnectedComponentLabeler / TextBlockComposer

Image-side block recovery. Identifies coherent text blocks from raw raster pixels through density inference and proximity clustering. The bridge between OCR-style image input and the structural pipeline.

Layout-aware search

Six modes, one search engine.

LayoutSearchEngine turns reconstructed pages into queryable surfaces. Every match returns a TextMatch with bounds, relevance score, page index, and the constituent text elements that produced it.

FindText

Exact substring

Case-sensitive or whole-word options. Fastest path for known phrases.

FindRegex

Pattern matching

Full .NET regex. Pull every monetary amount, every invoice number, every date in one call.

FindFuzzy

Edit-distance search

Edit-distance retrieval with token-aware mode. Tolerates OCR errors and transpositions.

FindInRegion

Rectangular bounds

Pull all text inside a given Rectangle. Useful for form-field extraction at known coordinates.

FindNear

Proximity radius

Find text within radius of an anchor point. Useful for label-value pairs (find amount near "Total").

FindBetween

Anchor extraction

Pull the content between two anchors (between "Terms" and "Signature"). Document-section extraction in one call.
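A hedged sketch of composing several modes on one page. The engine and method names come from the list above; the constructor shape, parameter names, and TextMatch property names shown here are assumptions, so check the API reference for exact signatures.

```csharp
using System;
using LMKit.Data;
using LMKit.Document.Layout;

var doc    = new Attachment(@"C:\inputs\contract.pdf");
var page   = doc.PageElements[0];
var engine = new LayoutSearchEngine(page);   // constructor shape assumed

// Pattern search: every date on the page in one call.
var dates = engine.FindRegex(@"\d{2}/\d{2}/\d{4}");

// Fuzzy search: tolerant of OCR substitutions and transpositions.
var party = engine.FindFuzzy("Acme Corporation");

// Anchor extraction: the clause text between two known headings.
var clause = engine.FindBetween("Terms", "Signature");

foreach (var match in dates)
    Console.WriteLine($"{match.Bounds}  score={match.Score}");   // property names assumed
```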

Real working code

From bytes to structured output.

Open any document and inspect lines, paragraphs, and structured text in reading order across the page.

Detect.cs
using LMKit.Data;
using LMKit.Document.Layout;

// Open any document. Pages expose layout-analysed PageElement objects.
var doc  = new Attachment(@"C:\inputs\report.pdf");
var page = doc.PageElements[0];

// Reading-order text in any of six output modes.
string markdown   = page.GetText(TextOutputMode.Markdown);
string tabular    = page.GetText(TextOutputMode.GridAligned);
string structured = page.GetText(TextOutputMode.Structured);

// Or directly inspect the structure.
List<LineElement>      lines = page.DetectLines();
List<ParagraphElement> paras = page.DetectParagraphs();

foreach (var p in paras)
{
    Console.WriteLine($"{p.LayerId,-12} {p.Bounds}  flags={p.Flags}");
    // MainBody     [..]  flags=None
    // Header       [..]  flags=IsHeading
    // MainBody     [..]  flags=IsList
}

Across the toolkit

The foundation under everything else.

Layout analysis is not a stand-alone product. It is the layer that makes every higher-level capability accurate, deterministic, and debuggable. Each consumer below relies on it.

Document to Markdown

The TextExtraction strategy emits Markdown directly from PageElement.GetText(TextOutputMode.Markdown). Headings, lists, tables, and columns are all preserved.

Structured extraction

TextExtraction uses paragraph segmentation and reading order so the model receives clean structured input. Lower hallucination rate, higher field-level accuracy.

RAG ingestion

DocumentRag, RagChat, and PdfChat chunk along paragraph boundaries instead of arbitrary token windows. Retrieval matches semantic units.

Classification

Categorization uses layout cues (heading hierarchy, indentation, list density) as classification signals before any model runs.

OCR consumers

LMKitOcr and VlmOcr emit positioned text elements that flow back into layout analysis for downstream structure.

Search-highlight

SearchHighlightEngine uses match bounds from the layout pipeline to render visible highlights in PDFs and images.

Decision routing

Agents inspect layout flags (headings present, table-like, dirty-layout) to decide whether a document needs OCR, splitting, or human review.

Post-validation

After an LLM emits structured fields, layout-aware search verifies that values appear at expected coordinates. Catches hallucinations that pure text matching misses.
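A hedged sketch of that validation step, using the documented search modes; the exact signatures, the radius parameter, and the TextMatch property names are assumptions rather than confirmed API.

```csharp
using System.Linq;
using LMKit.Data;
using LMKit.Document.Layout;

// After an LLM returns { "total": "1,284.00" }, confirm the value actually
// appears on the page near the "Total" label before trusting the field.
var doc    = new Attachment(@"C:\inputs\invoice.pdf");
var page   = doc.PageElements[0];
var engine = new LayoutSearchEngine(page);

var label = engine.FindText("Total").FirstOrDefault();

bool verified =
    label != null &&
    engine.FindNear(label.Bounds, radius: 150)        // parameter name assumed
          .Any(m => m.Text.Contains("1,284.00"));     // property name assumed

// A hallucinated total fails this spatial check even when the digits
// happen to occur elsewhere in the raw text.
```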

Feeding LLMs

Whatever the input format, layout produces a normalised, reading-ordered, structured intermediate. Models see clean prose and tables instead of zigzag PDF byte order.

Continuous R&D

Years of empirical tuning.

The pipeline is the product of years of iteration on real corpora: invoices, contracts, scientific papers, multi-column reports, scanned forms, financial statements, faxes, screenshots. Every threshold has a story behind it.

Calibrated parameter set

Dozens of named parameters spanning fuzzy membership rules, statistical tolerances, and equation coefficients. Each parameter carries a documented rationale, an experimental basis, and a regression test that pins it. None chosen by inspection.

Iterative refinement loop

The corrective-pass loop is itself a product of this tuning: each pass was added to target one specific failure mode observed in the initial grouping, and iteration continues until the structure stabilises.

Multi-script aware

Text rotation (0°, 90°, 180°, 270°), skew correction, right-to-left support, mixed directionality, diacritic geometry, and emoji classification all handled inside the pipeline rather than bolted on afterwards.

Stage-level observability

Subscribe to per-stage callbacks for diagnostics, regression testing, and tuning. The full pipeline is observable: every seam detected, every line built, every paragraph refined, every final boundary surfaced for inspection.

Methodologically grounded

Estimator choices, scoring functions, and clustering models selected for the failure modes real documents exhibit. Order-statistic robustness, scale-invariant tolerances, fuzzy decision membership. Each design choice motivated by an observed noise pattern, not by aesthetic preference.

Cached and serialisable

PageElement.ToJson() / FromJson() persist the analysis. Re-open without re-parsing. Critical for large-archive workflows where the same page is queried hundreds of times.
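A minimal sketch of the cache round-trip. ToJson() and FromJson() are listed on PageElement above; treating FromJson as a static factory, and the file handling around it, are assumptions for illustration.

```csharp
using System.IO;
using LMKit.Data;
using LMKit.Document.Layout;

// Analyse once, persist the result, and re-open without re-parsing.
var doc  = new Attachment(@"C:\inputs\report.pdf");
var page = doc.PageElements[0];

File.WriteAllText(@"C:\cache\report-p0.json", page.ToJson());

// Later, possibly in another process: restore the full analysis from cache
// and query it directly, skipping the layout pipeline entirely.
PageElement cached = PageElement.FromJson(File.ReadAllText(@"C:\cache\report-p0.json"));
string markdown = cached.GetText(TextOutputMode.Markdown);
```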

Related capabilities

The consumers of this foundation.

Document to Markdown

The TextExtraction strategy is built directly on TextOutputMode.Markdown. Output is layout-aware out of the box.

Markdown converter

OCR

OCR engines emit positioned text that flows into the layout pipeline. Same primitives, raster input.

OCR page

Structured extraction

Layout-aware paragraphs and reading order make extraction more accurate. Post-validation uses spatial search to verify field positions.

Extraction page

PDF toolkit

SearchHighlightEngine uses layout-search bounds to render visible highlights in marked-up PDFs.


Structure first. Then everything else.

Get Community Edition · Download