Solutions · Document Intelligence · Layout understanding

Structure, without a language model.

Underneath every accurate conversion, extraction, classification, and RAG ingestion in LM-Kit.NET sits a single foundation: a deterministic, multi-layer layout-understanding pipeline written from first principles. Connected-component analysis, baseline-alignment scoring, column-aware line construction, paragraph detection, reading-order reconstruction, and layout-aware search. No LLM, no probabilistic guessing, no surprises. Predictable output, auditable decisions, observable at every stage.

40+ public types · 6 reading-order output modes · 6 spatial search modes

Deterministic

Same input, same output. Every decision is traceable and reproducible. No model drift between releases.

Original research

Custom equations for column geometry, proprietary scoring functions for line cohesion and reading-order recovery, bespoke clustering models. Calibrated against curated document corpora through controlled experimentation, not borrowed from a textbook.

Foundational

Consumed by conversion, extraction, RAG, classification, summarisation, OCR, and search throughout LM-Kit.NET.

Why deterministic layout

Some answers cannot be hallucinated.

A language model can describe a page. It cannot tell you the exact bounding box of a footnote, identify whether two columns of numbers share a baseline, or guarantee a stable reading order across reruns. Those are signal-processing and geometric questions. Solving them deterministically gives downstream consumers (conversion, extraction, RAG, classification, validation) a foundation that does not change under their feet.

Stable across reruns

No sampling, no temperature, no token randomness. Identical input produces byte-identical output. Critical for diff-based document workflows.

Auditable

Every paragraph, line, and column decision is the result of named algorithms with named thresholds. Stage-level callbacks expose the pipeline at every refinement pass.

Compute-light

Pure CPU, no model loading, no inference cost. Sub-second per page on commodity hardware. Runs on machines too small for any LLM.

Composable

Each layer is independent. Use line detection alone. Use paragraph grouping alone. Use the search engine alone. Or take the full pipeline.

Observable

Subscribe to per-stage callbacks (seam detection, line construction, paragraph refinement) for diagnostics and debugging. No black boxes.

Continuously refined

Built and tuned over years against real corpora: invoices, contracts, scientific papers, multi-column reports, scanned forms. Empirical thresholds, mathematically grounded.

The pipeline

Seven layers, one structured output.

Each layer feeds the next. Raw text and image data enter at the top; a fully structured PageElement tree (lines, paragraphs, regions, reading order, layer tags) emerges at the bottom.

  01 Preprocessing: rotation, skew, units, diacritics
  02 Column inference: cooperating strategies vote
  03 Line construction: adaptive tolerances per column
  04 Flow seams: density valleys, row consensus
  05 Paragraph grouping: spacing, indentation, signals
  06 Refinement loop: corrective passes to convergence
  07 Reading order: 2D sort, layer tagging

Layer 1

Preprocessing & artifact filtering

Tiny diacritics and punctuation flagged with up-scale factors. Page rotation, skew, and units (points / pixels) normalised. Fragmented characters merged. Source order preserved as a fallback.

Layer 2

Column structure inference

Multiple cooperating strategies infer column structure from page-level geometry, signal density, and statistical clustering. Each strategy votes; consensus wins. Fallbacks handle the cases each one alone misses.

Layer 3

Column-aware line construction

Per-column line construction with adaptive tolerances driven by page-local geometry. Word membership validated by a custom scoring function rather than hard thresholds. Anomalous gaps detected and respected as boundaries.

Layer 4

Flow seams & corridor detection

Mid-page voids analysed as column corridors. Density valleys surface intra-paragraph splits the initial geometry missed. Row-level consensus catches tabular content disguised as prose.

Layer 5

Paragraph grouping

Lines clustered into paragraphs using line spacing, indentation, hyphenation continuation, sentence-start signals, and font-size transitions. Initial grouping intentionally permissive; refinement passes correct it.

Layer 6

Iterative refinement loop

A convergence loop of corrective passes. Earlier passes are intentionally permissive; later passes tighten the structure. Each pass targets one specific failure mode the initial grouping is known to produce. Iteration continues until the structure stabilises.

Layer 7

Reading order & layer tagging

Final 2D sort recovers reading order across columns and rows. Each paragraph carries a layer tag (MainBody, Header, Sidebar, Footer) for downstream filtering.

Engineered from first principles

Equations we derived. Algorithms we engineered.

Layout understanding is, at root, a signal-processing and geometric-inference problem. We treat it like one. Custom-derived equations and proprietary scoring functions sit alongside well-understood techniques, every coefficient calibrated against curated document corpora, every threshold regression-tested. Methodology over folklore. Reproducible over mysterious. No probabilistic black boxes; every decision traces back to a coefficient, a tolerance, or a published method.

Fuzzy decision membership

Boolean thresholds discard information. Membership functions assign each candidate a continuous score over multiple weighted criteria, then combine those scores under named rule sets. Decisions that would be cliff-edges under hard thresholds behave like gradients in practice.
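As a minimal sketch of the idea (the criteria, weights, and cut-off ratios below are invented for illustration, not LM-Kit's proprietary rule sets): several soft criteria each map to a [0, 1] membership value, and a weighted combination replaces a single boolean test.

```csharp
using System;

static class MembershipSketch
{
    // Hypothetical criteria and weights; the real scoring functions differ.
    // gapRatio:      horizontal gap divided by the typical inter-word gap
    // baselineDelta: vertical baseline offset as a fraction of line height
    // fontRatio:     candidate font size divided by the line's font size
    public static double LineMembership(double gapRatio, double baselineDelta, double fontRatio)
    {
        // Each criterion becomes a soft [0, 1] membership value.
        double gapScore      = Math.Clamp(1.0 - gapRatio / 2.5, 0.0, 1.0);
        double baselineScore = Math.Clamp(1.0 - baselineDelta / 0.4, 0.0, 1.0);
        double fontScore     = Math.Clamp(1.0 - Math.Abs(1.0 - fontRatio), 0.0, 1.0);

        // Weighted combination instead of a single hard threshold.
        return 0.5 * gapScore + 0.3 * baselineScore + 0.2 * fontScore;
    }
}
```

A borderline word with a slightly odd baseline but a normal gap and font still scores well; under a boolean baseline test it would have been rejected outright.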

Robust statistical inference

Order-statistic estimators rather than means, so single outliers cannot dominate page-level metrics. Tolerances expressed as ratios over the natural unit of each page, making the pipeline scale-invariant across font sizes, DPIs, and document genres.
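A small illustration of both ideas, with an invented 0.35 ratio standing in for a calibrated one: the median ignores a single banner-sized glyph that would wreck a mean, and the tolerance is expressed relative to the page's own typical line height rather than as an absolute constant.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class RobustStats
{
    // Median instead of mean: one outlier glyph cannot drag the page metric.
    public static double MedianHeight(IReadOnlyList<double> heights)
    {
        var s = heights.OrderBy(h => h).ToArray();
        int n = s.Length;
        return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    // Tolerance as a ratio of the natural unit: scale-invariant across
    // font sizes and DPIs. The 0.35 ratio here is illustrative only.
    public static bool SameLineBand(double dy, IReadOnlyList<double> heights)
        => Math.Abs(dy) < 0.35 * MedianHeight(heights);
}
```

With heights {10, 10, 11, 10, 120}, the median stays at 10 while the mean jumps past 32, so a mean-based tolerance would merge lines that clearly do not belong together.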

Custom geometric equations

Proprietary equations for column inference, line cohesion, paragraph segmentation, and reading-order recovery. Each equation derived for the noise patterns real documents produce, not the textbook ideal case. Coefficients fit by controlled experimentation against curated corpora.

Layout-density signal processing

Projection-profile inference and density analysis applied to the page surface. Surfaces column corridors, intra-paragraph voids, sparse-vs-dense regions. Signal-processing primitives translated into layout primitives.
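A toy version of the projection-profile primitive (bin count and box representation are simplified assumptions): word boxes are projected onto horizontal bins, and bins with zero coverage between populated regions are the density valleys that suggest column corridors.

```csharp
using System;
using System.Collections.Generic;

static class DensityProfile
{
    // Count how many word boxes cover each horizontal bin across the page.
    // Empty bins between populated regions form valleys: candidate gutters.
    public static int[] VerticalProfile(IEnumerable<(double Left, double Right)> words,
                                        double pageWidth, int bins)
    {
        var profile = new int[bins];
        foreach (var (left, right) in words)
        {
            int from = Math.Max(0, (int)(left / pageWidth * bins));
            int to   = Math.Min(bins - 1, (int)(right / pageWidth * bins));
            for (int b = from; b <= to; b++) profile[b]++;
        }
        return profile;
    }
}
```

Two word boxes on a 100-unit page, spanning 0–40 and 60–100, leave the middle bins empty: the valley between them is the column gutter the real pipeline would refine further.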

Linguistic feature extraction

Lightweight deterministic NLP cooperating with geometry: sentence-onset signals, continuation cues, list-marker recognition, header signatures, script-direction inference. Each signal contributes to the membership scores upstream.

Calibrated approximate matching

Edit-distance and token-aware retrieval calibrated to the noise patterns OCR introduces. Tolerance bounds set empirically against real-world matches and misses, not by inspection.
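The core of the idea in miniature, with an invented 20% error ratio standing in for the empirically calibrated bounds: plain Levenshtein distance plus a tolerance expressed relative to the query length.

```csharp
using System;

static class FuzzySketch
{
    // Standard dynamic-programming Levenshtein distance.
    public static int EditDistance(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + cost);
            }
        return d[a.Length, b.Length];
    }

    // Length-relative tolerance; the 0.2 ratio is illustrative only.
    public static bool FuzzyMatch(string query, string candidate, double maxErrorRatio = 0.2)
        => EditDistance(query, candidate) <= (int)(query.Length * maxErrorRatio);
}
```

An OCR substitution like "Inv0ice" for "Invoice" is one edit away and matches; an unrelated word at the same length does not.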

Public surface

The types you reach for.

Layout understanding is exposed as a small set of primitive types you can compose freely. Use one, use them all.

Hub

PageElement

The page-level analysis object. Holds text elements, dimensions, rotation, skew. Methods: GetText(mode), DetectLines(), DetectParagraphs(), InferTextOutputMode(), ToJson(), FromJson(), Clone().

Granular

TextElement / LineElement / ParagraphElement

Three levels of granularity. Each carries bounds, text direction, font metrics, style flags. Paragraphs add layer-id, gap metrics, and structural flags (heading, list, quote, table-like).

Search

LayoutSearchEngine

Six search modes (text, regex, fuzzy, region, near, between). Returns TextMatch with bounds, score, and constituent elements. Single-page or cross-page.

Output

TextOutputMode

Six output styles: RawLines, GridAligned (financial / tabular), ParagraphFlow (articles), Structured (mixed RAG-ready), Auto, Markdown.

Internal

LayoutAnalysis / LogicalLayoutAnalysis

Static utility classes exposing the geometric and linguistic primitives the pipeline composes. Useful when you want to bypass the high-level API and assemble custom structural analysis from lower-level signals.

Image

ConnectedComponentLabeler / TextBlockComposer

Image-side block recovery. Identifies coherent text blocks from raw raster pixels through density inference and proximity clustering. The bridge between OCR-style image input and the structural pipeline.

Layout-aware search

Six modes, one search engine.

LayoutSearchEngine turns reconstructed pages into queryable surfaces. Every match returns a TextMatch with bounds, relevance score, page index, and the constituent text elements that produced it.

FindText

Exact substring

Case-sensitive or whole-word options. Fastest path for known phrases.

FindRegex

Pattern matching

Full .NET regex. Pull every monetary amount, every invoice number, every date in one call.

FindFuzzy

Edit-distance search

Edit-distance retrieval with token-aware mode. Tolerates OCR errors and transpositions.

FindInRegion

Rectangular bounds

Pull all text inside a given Rectangle. Useful for form-field extraction at known coordinates.

FindNear

Proximity radius

Find text within radius of an anchor point. Useful for label-value pairs (find amount near "Total").

FindBetween

Anchor extraction

Pull the content between two anchors (between "Terms" and "Signature"). Document-section extraction in one call.
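A hedged sketch of composing several modes on one page. The engine and method names come from the list above; the constructor shape, parameter names, and TextMatch property names shown here are assumptions, so check the API reference for exact signatures.

```csharp
using System;
using LMKit.Data;
using LMKit.Document.Layout;

var doc    = new Attachment(@"C:\inputs\contract.pdf");
var page   = doc.PageElements[0];
var engine = new LayoutSearchEngine(page);   // constructor shape assumed

// Pattern search: every date on the page in one call.
var dates = engine.FindRegex(@"\d{2}/\d{2}/\d{4}");

// Fuzzy search: tolerant of OCR substitutions and transpositions.
var party = engine.FindFuzzy("Acme Corporation");

// Anchor extraction: the clause text between two known headings.
var clause = engine.FindBetween("Terms", "Signature");

foreach (var match in dates)
    Console.WriteLine($"{match.Bounds}  score={match.Score}");   // property names assumed
```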

Real working code

From bytes to structured output.

Open any document and inspect lines, paragraphs, and structured text in reading order across the page.

Detect.cs
using LMKit.Data;
using LMKit.Document.Layout;

// Open any document. Pages expose layout-analysed PageElement objects.
var doc  = new Attachment(@"C:\inputs\report.pdf");
var page = doc.PageElements[0];

// Reading-order text in any of six output modes.
string markdown   = page.GetText(TextOutputMode.Markdown);
string tabular    = page.GetText(TextOutputMode.GridAligned);
string structured = page.GetText(TextOutputMode.Structured);

// Or directly inspect the structure.
List<LineElement>      lines = page.DetectLines();
List<ParagraphElement> paras = page.DetectParagraphs();

foreach (var p in paras)
{
    Console.WriteLine($"{p.LayerId,-12} {p.Bounds}  flags={p.Flags}");
    // MainBody     [..]  flags=None
    // Header       [..]  flags=IsHeading
    // MainBody     [..]  flags=IsList
}

Across the toolkit

The foundation under everything else.

Layout analysis is not a stand-alone product. It is the layer that makes every higher-level capability accurate, deterministic, and debuggable. Each consumer below relies on it.

Document to Markdown

The TextExtraction strategy emits Markdown directly from PageElement.GetText(TextOutputMode.Markdown). Headings, lists, tables, and columns are all preserved.

Structured extraction

TextExtraction uses paragraph segmentation and reading order so the model receives clean structured input. Lower hallucination rate, higher field-level accuracy.

RAG ingestion

DocumentRag, RagChat, and PdfChat chunk along paragraph boundaries instead of arbitrary token windows. Retrieval matches semantic units.

Classification

Categorization uses layout cues (heading hierarchy, indentation, list density) as classification signals before any model runs.

OCR consumers

LMKitOcr and VlmOcr emit positioned text elements that flow back into layout analysis for downstream structure.

Search-highlight

SearchHighlightEngine uses match bounds from the layout pipeline to render visible highlights in PDFs and images.

Decision routing

Agents inspect layout flags (headings present, table-like, dirty-layout) to decide whether a document needs OCR, splitting, or human review.

Post-validation

After an LLM emits structured fields, layout-aware search verifies that values appear at expected coordinates. Catches hallucinations that pure text matching misses.
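A hedged sketch of that validation step, using the documented search modes; the exact signatures, the radius parameter, and the TextMatch property names are assumptions rather than confirmed API.

```csharp
using System.Linq;
using LMKit.Data;
using LMKit.Document.Layout;

// After an LLM returns { "total": "1,284.00" }, confirm the value actually
// appears on the page near the "Total" label before trusting the field.
var doc    = new Attachment(@"C:\inputs\invoice.pdf");
var page   = doc.PageElements[0];
var engine = new LayoutSearchEngine(page);

var label = engine.FindText("Total").FirstOrDefault();

bool verified =
    label != null &&
    engine.FindNear(label.Bounds, radius: 150)        // parameter name assumed
          .Any(m => m.Text.Contains("1,284.00"));     // property name assumed

// A hallucinated total fails this spatial check even when the digits
// happen to occur elsewhere in the raw text.
```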

Feeding LLMs

Whatever the input format, layout produces a normalised, reading-ordered, structured intermediate. Models see clean prose and tables instead of zigzag PDF byte order.

Continuous R&D

Years of empirical tuning.

The pipeline is the product of years of iteration on real corpora: invoices, contracts, scientific papers, multi-column reports, scanned forms, financial statements, faxes, screenshots. Every threshold has a story behind it.

Calibrated parameter set

Dozens of named parameters spanning fuzzy membership rules, statistical tolerances, and equation coefficients. Each parameter carries a documented rationale, an experimental basis, and a regression test that pins it. None chosen by inspection.

Iterative refinement loop

The corrective-pass loop is itself a product of this tuning: each pass was added to target one specific failure mode observed in the initial grouping, and iteration continues until the structure stabilises.

Multi-script aware

Text rotation (0°, 90°, 180°, 270°), skew correction, right-to-left support, mixed directionality, diacritic geometry, and emoji classification all handled inside the pipeline rather than bolted on afterwards.

Stage-level observability

Subscribe to per-stage callbacks for diagnostics, regression testing, and tuning. The full pipeline is observable: every seam detected, every line built, every paragraph refined, every final boundary surfaced for inspection.

Methodologically grounded

Estimator choices, scoring functions, and clustering models selected for the failure modes real documents exhibit. Order-statistic robustness, scale-invariant tolerances, fuzzy decision membership. Each design choice motivated by an observed noise pattern, not by aesthetic preference.

Cached and serialisable

PageElement.ToJson() / FromJson() persist the analysis. Re-open without re-parsing. Critical for large-archive workflows where the same page is queried hundreds of times.
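A minimal sketch of the cache round-trip. ToJson() and FromJson() are listed on PageElement above; treating FromJson as a static factory, and the file handling around it, are assumptions for illustration.

```csharp
using System.IO;
using LMKit.Data;
using LMKit.Document.Layout;

// Analyse once, persist the result, and re-open without re-parsing.
var doc  = new Attachment(@"C:\inputs\report.pdf");
var page = doc.PageElements[0];

File.WriteAllText(@"C:\cache\report-p0.json", page.ToJson());

// Later, possibly in another process: restore the full analysis from cache
// and query it directly, skipping the layout pipeline entirely.
PageElement cached = PageElement.FromJson(File.ReadAllText(@"C:\cache\report-p0.json"));
string markdown = cached.GetText(TextOutputMode.Markdown);
```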

Related capabilities

The consumers of this foundation.

Document to Markdown

The TextExtraction strategy is built directly on TextOutputMode.Markdown. Output is layout-aware out of the box.

Markdown converter

OCR

OCR engines emit positioned text that flows into the layout pipeline. Same primitives, raster input.

OCR page

Structured extraction

Layout-aware paragraphs and reading order make extraction more accurate. Post-validation uses spatial search to verify field positions.

Extraction page

PDF toolkit

SearchHighlightEngine uses layout-search bounds to render visible highlights in marked-up PDFs.


Structure first. Then everything else.

Get Community Edition · Download