Deterministic
Same input, same output. Every decision is traceable and reproducible. No model drift between releases.
Underneath every accurate conversion, extraction, classification, and RAG ingestion in LM-Kit.NET sits a single foundation: a deterministic, multi-layer layout-understanding pipeline written from first principles. Connected-component analysis, baseline-alignment scoring, column-aware line construction, paragraph detection, reading-order reconstruction, and layout-aware search. No LLM, no probabilistic guessing, no surprises. Predictable output, auditable decisions, observable at every stage.
Custom equations for column geometry, proprietary scoring functions for line cohesion and reading-order recovery, bespoke clustering models. Calibrated against curated document corpora through controlled experimentation, not borrowed from a textbook.
Consumed by conversion, extraction, RAG, classification, summarisation, OCR, and search throughout LM-Kit.NET.
A language model can describe a page. It cannot tell you the exact bounding box of a footnote, identify whether two columns of numbers share a baseline, or guarantee a stable reading order across reruns. Those are signal-processing and geometric questions. Solving them deterministically gives downstream consumers (conversion, extraction, RAG, classification, validation) a foundation that does not change under their feet.
No sampling, no temperature, no token randomness. Identical input produces byte-identical output. Critical for diff-based document workflows.
Every paragraph, line, and column decision is the result of named algorithms with named thresholds. Stage-level callbacks expose the pipeline at every refinement pass.
Pure CPU, no model loading, no inference cost. Sub-second per page on commodity hardware. Runs on machines too small for any LLM.
Each layer is independent. Use line detection alone. Use paragraph grouping alone. Use the search engine alone. Or take the full pipeline.
Subscribe to per-stage callbacks (seam detection, line construction, paragraph refinement) for diagnostics and debugging. No black boxes.
Built and tuned over years against real corpora: invoices, contracts, scientific papers, multi-column reports, scanned forms. Empirical thresholds, mathematically grounded.
Each layer feeds the next. Raw text and image data enter at the top; a fully structured PageElement tree (lines, paragraphs, regions, reading order, layer tags) emerges at the bottom.
Layer 1
Tiny diacritics and punctuation are flagged with up-scale factors. Page rotation, skew, and unit (points / pixels) normalised. Fragmented characters merged. Source order preserved as fallback.
Layer 2
Multiple cooperating strategies infer column structure from page-level geometry, signal density, and statistical clustering. Each strategy votes; consensus wins. Fallbacks handle the cases each one alone misses.
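The voting step can be pictured with a minimal sketch. This is illustrative only, not LM-Kit's actual internals: each strategy proposes a column count, the majority wins, and ties break toward the simpler (fewer-column) reading of the page.

```csharp
using System.Linq;

static class ColumnConsensus
{
    // Each strategy has already inspected the page and proposed a count.
    public static int Vote(int[] proposals) =>
        proposals
            .GroupBy(p => p)
            .OrderByDescending(g => g.Count()) // most votes first
            .ThenBy(g => g.Key)                // tie-break: fewer columns
            .First().Key;
}

// ColumnConsensus.Vote(new[] { 2, 2, 1 }) == 2
```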
Layer 3
Per-column line construction with adaptive tolerances driven by page-local geometry. Word membership validated by a custom scoring function rather than hard thresholds. Anomalous gaps detected and respected as boundaries.
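A stripped-down sketch of the idea, with LM-Kit's actual scoring function and tolerances being proprietary: words sorted within a column join the line whose baseline sits within an adaptive tolerance; anything outside it starts a new line.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

record Word(double X, double Baseline, double Width);

static class LineBuilder
{
    // baselineTol would be derived from page-local geometry upstream.
    public static List<List<Word>> Build(IEnumerable<Word> words, double baselineTol)
    {
        var lines = new List<List<Word>>();
        foreach (var w in words.OrderBy(w => w.Baseline).ThenBy(w => w.X))
        {
            // Reuse the last line whose baseline is close enough.
            var line = lines.LastOrDefault(l =>
                Math.Abs(l[^1].Baseline - w.Baseline) <= baselineTol);
            if (line == null) { line = new List<Word>(); lines.Add(line); }
            line.Add(w);
        }
        return lines;
    }
}
```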
Layer 4
Mid-page voids analysed as column corridors. Density valleys surface intra-paragraph splits the initial geometry missed. Row-level consensus catches tabular content disguised as prose.
Layer 5
Lines clustered into paragraphs using line spacing, indentation, hyphenation continuation, sentence-start signals, and font-size transitions. Initial grouping intentionally permissive; refinement passes correct it.
Layer 6
A convergence loop of corrective passes. Earlier passes are intentionally permissive; later passes refine. Each pass targets one specific failure mode the initial grouping is known to produce. Iteration continues until the structure stabilises.
Layer 7
Final 2D sort recovers reading order across columns and rows. Each paragraph carries a layer tag (MainBody, Header, Sidebar, Footer) for downstream filtering.
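The final sort can be sketched like this (column detection itself happens upstream; the band boundaries here are assumed given): once each block knows its column, reading order is a 2D sort of (column, top edge, left edge).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

record Block(double Left, double Top, double Width, string Text);

static class ReadingOrder
{
    // columnLefts: left edge of each column band, ascending.
    public static IEnumerable<Block> Sort(IEnumerable<Block> blocks, double[] columnLefts)
    {
        int ColumnOf(Block b)
        {
            double centre = b.Left + b.Width / 2;
            int col = Array.FindLastIndex(columnLefts, left => centre >= left);
            return Math.Max(col, 0);
        }
        return blocks.OrderBy(ColumnOf).ThenBy(b => b.Top).ThenBy(b => b.Left);
    }
}
```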
Layout understanding is, at root, a signal-processing and geometric-inference problem. We treat it like one. Custom-derived equations and proprietary scoring functions sit alongside well-understood techniques, every coefficient calibrated against curated document corpora, every threshold regression-tested. Methodology over folklore. Reproducible over mysterious. No probabilistic black boxes; every decision traces back to a coefficient, a tolerance, or a published method.
Boolean thresholds discard information. Membership functions assign each candidate a continuous score over multiple weighted criteria, then combine those scores under named-rule sets. Decisions look like cliff-edges in code; they behave like gradients in practice.
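A minimal sketch of the principle, with the criteria and weights below invented for illustration: each candidate receives a continuous score over several weighted criteria instead of passing or failing a single boolean test.

```csharp
using System;

static class Membership
{
    // All criteria are normalised to 0..1 by the caller.
    public static double Score(double baselineAlign, double gapFit, double fontSimilarity)
        => Math.Clamp(0.5 * baselineAlign + 0.3 * gapFit + 0.2 * fontSimilarity, 0.0, 1.0);
}
```

A decision threshold can still exist at the end (say, score >= 0.6 joins), but a near-miss on one criterion can be rescued by strength on another, which is exactly the gradient behaviour described above.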
Order-statistic estimators rather than means, so single outliers cannot dominate page-level metrics. Tolerances expressed as ratios over the natural unit of each page, making the pipeline scale-invariant across font sizes, DPIs, and document genres.
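The principle in miniature, with the 0.35 ratio invented for illustration: use the median (an order statistic) rather than the mean, and express the tolerance as a ratio of a page-local unit so the same coefficient works at any font size or DPI.

```csharp
using System.Linq;

static class RobustTolerance
{
    public static double Median(double[] xs)
    {
        var s = xs.OrderBy(x => x).ToArray();
        int n = s.Length;
        return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    // One outlier (a dropped cap, a watermark glyph) barely moves this,
    // where a mean-based tolerance would be dragged off target.
    public static double LineGapTolerance(double[] wordHeights)
        => 0.35 * Median(wordHeights);
}
```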
Proprietary equations for column inference, line cohesion, paragraph segmentation, and reading-order recovery. Each equation derived for the noise patterns real documents produce, not the textbook ideal case. Coefficients fit by controlled experimentation against curated corpora.
Projection-profile inference and density analysis applied to the page surface. Surfaces column corridors, intra-paragraph voids, sparse-vs-dense regions. Signal-processing primitives translated into layout primitives.
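A vertical projection profile is simple to sketch (bin count and interpretation here are illustrative): accumulate word-box coverage into horizontal bins; sustained near-zero valleys between dense bands are column-corridor candidates.

```csharp
using System;
using System.Collections.Generic;

static class ProjectionProfile
{
    public static int[] Vertical(IEnumerable<(double Left, double Width)> boxes,
                                 double pageWidth, int bins = 200)
    {
        var profile = new int[bins];
        foreach (var (left, width) in boxes)
        {
            int from = Math.Max(0, (int)(left / pageWidth * bins));
            int to   = Math.Min(bins - 1, (int)((left + width) / pageWidth * bins));
            for (int i = from; i <= to; i++) profile[i]++;
        }
        return profile; // zero-runs inside the text area suggest column gaps
    }
}
```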
Lightweight deterministic NLP cooperating with geometry: sentence-onset signals, continuation cues, list-marker recognition, header signatures, script-direction inference. Each signal contributes to the membership scores upstream.
Edit-distance and token-aware retrieval calibrated to the noise patterns OCR introduces. Tolerance bounds set empirically against real-world matches and misses, not by inspection.
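The underlying primitive is the classic two-row dynamic-programming edit distance; the token-aware mode and empirical tolerance bounds described above layer on top of something like this.

```csharp
using System;
using System.Linq;

static class EditDistance
{
    public static int Levenshtein(string a, string b)
    {
        var prev = Enumerable.Range(0, b.Length + 1).ToArray();
        var cur = new int[b.Length + 1];
        for (int i = 1; i <= a.Length; i++)
        {
            cur[0] = i;
            for (int j = 1; j <= b.Length; j++)
                cur[j] = Math.Min(Math.Min(prev[j] + 1, cur[j - 1] + 1),
                                  prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
            (prev, cur) = (cur, prev); // reuse the two rows
        }
        return prev[b.Length];
    }
}

// Levenshtein("lnvoice", "Invoice") == 1: a single OCR confusion
// (l for I) stays well inside a small edit-distance budget.
```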
Layout understanding is exposed as a small set of primitive types you can compose freely. Use one, use them all.
Hub
PageElement: The page-level analysis object. Holds text elements, dimensions, rotation, skew. Methods: GetText(mode), DetectLines(), DetectParagraphs(), InferTextOutputMode(), ToJson(), FromJson(), Clone().
Granular
TextElement / LineElement / ParagraphElement: Three levels of granularity. Each carries bounds, text direction, font metrics, style flags. Paragraphs add layer-id, gap metrics, and structural flags (heading, list, quote, table-like).
Search
LayoutSearchEngine: Six search modes (text, regex, fuzzy, region, near, between). Returns TextMatch with bounds, score, and constituent elements. Single-page or cross-page.
Output
TextOutputMode: Six output styles. RawLines, GridAligned (financial / tabular), ParagraphFlow (articles), Structured (mixed RAG-ready), Auto, Markdown.
Internal
LayoutAnalysis / LogicalLayoutAnalysis: Static utility classes exposing the geometric and linguistic primitives the pipeline composes. Useful when you want to bypass the high-level API and assemble custom structural analysis from lower-level signals.
Image
ConnectedComponentLabeler / TextBlockComposer: Image-side block recovery. Identifies coherent text blocks from raw raster pixels through density inference and proximity clustering. The bridge between OCR-style image input and the structural pipeline.
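The textbook core of this kind of image-side recovery is connected-component labelling. A minimal flood-fill sketch follows; the production labeler adds the density and proximity logic described above.

```csharp
using System.Collections.Generic;

static class Components
{
    // 4-connected labelling over a binary raster: ink[r, c] == true
    // means a foreground pixel. Returns 0 for background, 1.. per blob.
    public static int[,] Label(bool[,] ink)
    {
        int h = ink.GetLength(0), w = ink.GetLength(1), next = 0;
        var labels = new int[h, w];
        var queue = new Queue<(int R, int C)>();
        for (int r = 0; r < h; r++)
        for (int c = 0; c < w; c++)
        {
            if (!ink[r, c] || labels[r, c] != 0) continue;
            labels[r, c] = ++next;      // new component found
            queue.Enqueue((r, c));
            while (queue.Count > 0)     // flood-fill its extent
            {
                var (cr, cc) = queue.Dequeue();
                foreach (var (nr, nc) in new[] { (cr - 1, cc), (cr + 1, cc),
                                                 (cr, cc - 1), (cr, cc + 1) })
                    if (nr >= 0 && nr < h && nc >= 0 && nc < w
                        && ink[nr, nc] && labels[nr, nc] == 0)
                    {
                        labels[nr, nc] = next;
                        queue.Enqueue((nr, nc));
                    }
            }
        }
        return labels;
    }
}
```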
LayoutSearchEngine turns reconstructed pages into queryable surfaces. Every match returns a TextMatch with bounds, relevance score, page index, and the constituent text elements that produced it.
FindText
Case-sensitive or whole-word options. Fastest path for known phrases.
FindRegex
Full .NET regex. Pull every monetary amount, every invoice number, every date in one call.
FindFuzzy
Edit-distance retrieval with token-aware mode. Tolerates OCR errors and transpositions.
FindInRegion
Pull all text inside a given Rectangle. Useful for form-field extraction at known coordinates.
FindNear
Find text within radius of an anchor point. Useful for label-value pairs (find amount near "Total").
FindBetween
Pull the content between two anchors (between "Terms" and "Signature"). Document-section extraction in one call.
Open any document and inspect lines, paragraphs, and structured text in reading order across the page.
using LMKit.Data;
using LMKit.Document.Layout;

// Open any document. Pages expose layout-analysed PageElement objects.
var doc = new Attachment(@"C:\inputs\report.pdf");
var page = doc.PageElements[0];

// Reading-order text in any of six output modes.
string markdown = page.GetText(TextOutputMode.Markdown);
string tabular = page.GetText(TextOutputMode.GridAligned);
string structured = page.GetText(TextOutputMode.Structured);

// Or directly inspect the structure.
List<LineElement> lines = page.DetectLines();
List<ParagraphElement> paras = page.DetectParagraphs();

foreach (var p in paras)
{
    Console.WriteLine($"{p.LayerId,-12} {p.Bounds} flags={p.Flags}");
    // MainBody   [..] flags=None
    // Header     [..] flags=IsHeading
    // MainBody   [..] flags=IsList
}
Run regex, anchor-bounded, proximity, and fuzzy searches against the layout graph rather than raw text.
using LMKit.Document.Search;

var engine = new LayoutSearchEngine();

// Pull every monetary amount across the whole document.
var amounts = engine.FindRegex(doc.PageElements, @"\$[\d,]+(\.\d{2})?");

// Pull a section bounded by two known anchors.
var indemn = engine.FindBetween(page, "Indemnification", "Limitation of Liability");

// Find amounts near the word "Total".
var totalLabel = engine.FindText(page, "Total").First();
var totalValue = engine.FindNear(page, @"\$[\d,.]+", new ProximityOptions
{
    AnchorBounds = totalLabel.Bounds,
    RadiusPoints = 120
});

// Tolerant search over OCR output.
var fuzzy = engine.FindFuzzy(page, "Invoice Number", new FuzzySearchOptions
{
    MaxEditDistance = 2,
    TokenAware = true
});
Serialize the deterministic layout result to JSON, then reload it later without re-parsing the PDF.
// Layout analysis is deterministic. Cache the JSON result and reload it instantly.
string cached = page.ToJson();
File.WriteAllText(@"C:\cache\report-page-1.json", cached);

// Months later: no PDF parser, no image decoder, no recomputation.
PageElement reloaded = PageElement.FromJson(File.ReadAllText(@"C:\cache\report-page-1.json"));
List<ParagraphElement> paras = reloaded.DetectParagraphs();
Layout analysis is not a stand-alone product. It is the layer that makes every higher-level capability accurate, deterministic, and debuggable. Each consumer below relies on it.
The TextExtraction strategy emits Markdown directly from PageElement.GetText(Markdown). Headings, lists, tables, columns all preserved.
TextExtraction uses paragraph segmentation and reading order so the model receives clean structured input. Lower hallucination rate, higher field-level accuracy.
DocumentRag, RagChat, and PdfChat chunk along paragraph boundaries instead of arbitrary token windows. Retrieval matches semantic units.
Categorization uses layout cues (heading hierarchy, indentation, list density) as classification signals before any model runs.
LMKitOcr and VlmOcr emit positioned text elements that flow back into layout analysis for downstream structure.
SearchHighlightEngine uses match bounds from the layout pipeline to render visible highlights in PDFs and images.
Agents inspect layout flags (headings present, table-like, dirty-layout) to decide whether a document needs OCR, splitting, or human review.
After an LLM emits structured fields, layout-aware search verifies that values appear at expected coordinates. Catches hallucinations that pure text matching misses.
Whatever the input format, layout produces a normalised, reading-order, structured intermediate. Models see clean prose and tables instead of zigzag PDF byte order.
The pipeline is the product of years of iteration on real corpora: invoices, contracts, scientific papers, multi-column reports, scanned forms, financial statements, faxes, screenshots. Every threshold has a story behind it.
Dozens of named parameters spanning fuzzy membership rules, statistical tolerances, and equation coefficients. Each parameter carries a documented rationale, an experimental basis, and a regression test that pins it. None chosen by inspection.
A convergence loop of corrective passes. Earlier passes are intentionally permissive; later passes refine. Each targets one specific failure mode the initial grouping is known to produce. Iteration continues until the structure stabilises.
Text rotation (0, 90, 180, 270), skew correction, right-to-left support, mixed directionality, diacritic geometry, and emoji classification all handled inside the pipeline rather than bolted on afterwards.
Subscribe to per-stage callbacks for diagnostics, regression testing, and tuning. The full pipeline is observable: every seam detected, every line built, every paragraph refined, every final boundary surfaced for inspection.
Estimator choices, scoring functions, and clustering models selected for the failure modes real documents exhibit. Order-statistic robustness, scale-invariant tolerances, fuzzy decision membership. Each design choice motivated by an observed noise pattern, not by aesthetic preference.
PageElement.ToJson() / FromJson() persist the analysis. Re-open without re-parsing. Critical for large-archive workflows where the same page is queried hundreds of times.
The TextExtraction strategy is built directly on TextOutputMode.Markdown. Output is layout-aware out of the box.
OCR engines emit positioned text that flows into the layout pipeline. Same primitives, raster input.
Layout-aware paragraphs and reading order make extraction more accurate. Post-validation uses spatial search to verify field positions.
SearchHighlightEngine uses layout-search bounds to render visible highlights in marked-up PDFs.
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.
How-to guide: Layout-aware search across hierarchical document trees. Read the guide →
How-to guide: Recover reading order, columns, and headings without LLM hallucinations. Read the guide →
How-to guide: Find passages by structure (heading, table, footnote), not just text. Read the guide →