LM-Kit ships a first-class document-to-Markdown engine
built specifically for LLM consumption. DocumentToMarkdown
handles PDF, DOCX, PPTX, XLSX, EML, MBOX, HTML, and any common image
format. Three interchangeable strategies (text-layer extraction,
vision-language OCR, hybrid per-page selection) cover every input
shape. Nineteen configuration knobs control output. A calibrated
per-page certainty score tells you exactly which
pages to ship and which to review. State-of-the-art accuracy
on public document-conversion benchmarks.
TextExtraction - Embedded text-layer parsing. Fastest path. Sub-second per page on modern desktops. Optional traditional OCR enrichment for embedded raster images.
Hybrid - Per-page auto-select. Text layer where it is clean and image-free; vision OCR where the page demands it. Recommended default.
VlmOcr - Rasterise every page, transcribe with a vision-language model. Best on scans, handwriting, complex layouts. Default model: lightonocr-2:1b.
A naive Markdown converter handles digital PDFs and stops there. Real
corpora are messier: 200-page reports half-digital and half-scanned,
DOCX with embedded raster images, contracts with footnotes that look
like headings, financial tables that span columns, scientific papers
with formulas, screenshots of legacy applications, EML attachments
buried three levels deep. Each one breaks a one-liner. The
DocumentToMarkdown engine handles them as one API.
File path, byte[], Stream, ImageBuffer, Uri, or pre-built Attachment. Sync and async variants. Direct-to-disk shortcuts.
EML, MBOX, HTML, DOCX route through dedicated converters tuned for each format. PDF, images, PPTX, XLSX, TXT route through the page-by-page strategy pipeline.
Every page and every document carries a confidence score in [0, 1] blended from 30+ signal families. Ship with confidence, review with intent.
PageStarting fires before each page; subscribers can set e.Cancel = true to abort. PageCompleted reports diagnostics including elapsed, strategy used, token count.
Smart-punctuation normalisation, TOC reconstruction, table fusion, heading-demotion filters, numbered-section promotion, abbreviation capitalisation. The Markdown is clean.
The output drops straight into DocumentRag, any embedding pipeline, or a SimpleSearch index. Markdown is your normalised intermediate.
Strategy is a per-call enum. Hybrid auto-resolves per page. Each strategy exposes its own configuration knobs.
TextExtraction
Reads the embedded text layer of digital PDFs and Office documents directly. No model, no inference, sub-second per page. Optional OcrEngine extends with traditional OCR for embedded raster images and full-page fallback when a page is text-sparse. Configurable embedded-image OCR parallelism (1-12 concurrent calls).
Best for: digital reports, contracts, technical documentation, software-generated PDFs.
VlmOcr
Rasterises every page and transcribes through a vision-language model. Layout-aware: tables, formulas, charts, multi-column structure preserved. Configurable image detail (Minimal / Low / Standard / High / Maximal pixel budget) and per-page completion-token cap. Default model: lightonocr-2:1b, lazy-loaded.
Best for: scans, photographs, screenshots, hand-annotated forms, layout-heavy financial reports.
Hybrid
Inspects each page. Picks TextExtraction when the text layer is clean and the page is image-free. Picks VlmOcr when the page is scanned, image-heavy, or has missing text. Image-only attachments always route to VLM. Recommended default for mixed corpora.
Best for: heterogeneous archives, third-party document drops, RAG ingestion pipelines that cannot pre-classify.
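The headline rule behind Hybrid's per-page choice can be pictured as a small routing function. This is an illustrative sketch only; the actual resolver weighs many more signals than the three shown here, and ResolveStrategy is a hypothetical name, not part of the LM-Kit API.

```csharp
using System;

// Illustrative sketch of Hybrid's per-page routing (not the engine's real heuristic).
string ResolveStrategy(bool hasTextLayer, bool textLayerClean, bool imageHeavy)
{
    // No usable text layer (scans, photos, image-only attachments): vision OCR.
    if (!hasTextLayer) return "VlmOcr";

    // Clean, image-free text layer: the fast deterministic path.
    if (textLayerClean && !imageHeavy) return "TextExtraction";

    // Garbled layer or heavy imagery: the page demands the vision model.
    return "VlmOcr";
}

Console.WriteLine(ResolveStrategy(true, true, false));   // TextExtraction
Console.WriteLine(ResolveStrategy(false, false, false)); // VlmOcr
```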
Every DocumentToMarkdownPageResult exposes a
Certainty in [0, 1]. The aggregate
DocumentToMarkdownResult.Certainty is a word-count-weighted
average across pages. Both are calibrated against a curated benchmark
corpus with non-negative least-squares regression and 5-fold
cross-validation, which makes the score interpretable and the weights
non-negative (each signal can only increase confidence).
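The word-count-weighted aggregation described above can be sketched as a pure function. AggregateCertainty is an illustrative name for the calculation, not the library API: a page with many words moves the document score more than a near-empty page.

```csharp
using System;

// Word-count-weighted average of per-page certainty scores.
double AggregateCertainty((double Certainty, int WordCount)[] pages)
{
    double weighted = 0, totalWords = 0;
    foreach (var (certainty, words) in pages)
    {
        weighted += certainty * words;
        totalWords += words;
    }
    return totalWords > 0 ? weighted / totalWords : 0;
}

// Hypothetical two-page document: a large clean page and a small noisy one.
var doc = new[] { (0.90, 900), (0.50, 100) };
Console.WriteLine($"{AggregateCertainty(doc):F2}"); // 0.86: dominated by the large page
```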
Token naturalness (garbled-token ratio), repetition-free score, mixed-script detection, character-run repetitions. Catches OCR-style noise that LLMs hallucinate around.
Fraction of words injected by the OCR pipeline versus extracted from the native text layer. High native ratio means high confidence.
Table shape (nested, ragged, sparse, colspan / rowspan), heading hierarchy, layout coverage (Markdown words vs source words), paragraph cohesion, reading order.
Character entropy in the natural-prose window (3.5-5.5 bits/char), symbol ratio (0.10-0.20 for natural prose), sentence closure, numeric-token ratio, mean token length.
Truncation risk (token budget within 95% of the cap), the vision model's own quality score. Detects when the VLM ran out of room.
Fraction of post-emission refinement stages that fired, normalised Markdown-length delta. High refinement intensity flags lower confidence.
Practical interpretation.
Certainty >= 0.85: ship as-is.
0.70 - 0.85: spot-check tables and headings.
< 0.70: re-run with a stronger strategy or route to human review.
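The thresholds above translate directly into a routing helper. Route and its return labels are illustrative names for this sketch, not part of the LM-Kit API; result.Certainty is the value you would feed it.

```csharp
using System;

// Route a converted page or document by its calibrated certainty,
// using the interpretation thresholds above.
string Route(double certainty) => certainty switch
{
    >= 0.85 => "ship",        // ship as-is
    >= 0.70 => "spot-check",  // spot-check tables and headings
    _       => "review"       // re-run with a stronger strategy or human review
};

Console.WriteLine(Route(0.92)); // ship
Console.WriteLine(Route(0.75)); // spot-check
Console.WriteLine(Route(0.40)); // review
```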
DocumentToMarkdownOptions exposes the full surface. Defaults
are LLM-ready out of the box, so most callers set only one or two
properties.
Strategy - Pick TextExtraction, VlmOcr, or Hybrid. Default: Hybrid.
PageRange - String like "1-5, 7, 9-12". Default: all pages.
OcrEngine - Optional traditional OCR for embedded raster images and text-sparse pages. Pass LMKitOcr or any custom OcrEngine.
OcrImageParallelism - Concurrent OCR calls per page's embedded images. Range [1, 12]. Default: 4.
VlmImageDetail - Pixel budget per page: Minimal, Low, Standard, High, Maximal. Default: High.
VlmMaximumCompletionTokens - Cap completion tokens per page. -1 means unlimited. Default: 3072.
VlmStripImageMarkup - Remove Markdown image references (![...]()) from VLM output. Default: true.
VlmStripStyleAttributes - Strip inline style="..." attributes from output. Default: true.
IncludePageSeparators - Insert a separator between pages. Default: true.
PageSeparatorFormat - Template with {pageNumber} placeholder. Default: "\n\n---\n\n<!-- Page {pageNumber} -->\n\n".
EmitFrontMatter - Prepend YAML front matter (source, pages, strategy, converted_at, elapsed_ms). Default: false.
NormalizeWhitespace - Collapse consecutive blank lines to one. Default: true.
PreferMarkdownTablesForNonNested - Rewrite non-nested HTML tables to GitHub-flavoured pipe syntax; nested or span-using tables stay as HTML. Default: false.
IncludeTables (DOCX) - Preserve DOCX tables. Default: true.
IncludeImages (DOCX) - Preserve image references. Default: true.
IncludeHyperlinks (DOCX) - Preserve hyperlinks. Default: true.
PreserveLineBreaks (DOCX) - Preserve intra-paragraph line breaks. Default: true.
IncludeEmptyParagraphs (DOCX) - Preserve blank paragraphs. Default: false.
EmlStripQuotes - Strip quoted reply content from EML / MBOX bodies. Default: false.
Raw text-layer extraction or VLM output is rarely shippable as-is. The refinement pipeline runs every page through 30+ ordered stages that fix the small things at scale.
Quote-spacing normalisation, leader-dot tightening, smart-punctuation. The output reads like a writer typed it.
Multi-line TOC entries are paired, numeric ranges joined, leader dots cleaned. TOCs survive the conversion.
Adjacent tables merged when they share columns. Header rows merged. Row-duplicates removed. Trailing footnote markers stripped.
Six false-positive filters (preposition-start, chart-legend, hyphen-end, short-numbered-first, overlapping-headings, etc.) demote bogus headings to body text.
Real numbered sections (1.2 Introduction) promoted to headings even when source typography missed them.
Cell alignment fixes, abbreviation capitalisation, space normalisation. Tables read naturally.
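One of these cleanups, the blank-line collapse that NormalizeWhitespace describes, can be sketched in a few lines. This is illustrative of the kind of work a refinement stage does, not the engine's actual implementation, and CollapseBlankLines is a hypothetical name.

```csharp
using System;
using System.Text.RegularExpressions;

// Collapse runs of blank lines down to a single blank line,
// the behaviour the NormalizeWhitespace option describes.
string CollapseBlankLines(string markdown) =>
    Regex.Replace(markdown, @"(\r?\n){3,}", "\n\n");

var noisy = "# Title\n\n\n\n\nFirst paragraph.\n\n\nSecond paragraph.";
Console.WriteLine(CollapseBlankLines(noisy));
```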
Pick a tab to see the zero-config quick path, then the diagnostics pattern that streams per-page progress and routes low-confidence pages to human review.
Zero configuration: new DocumentToMarkdown(), call
ConvertAsync, read the result. The engine picks the
Hybrid strategy and emits Markdown plus calibrated certainty.
using LMKit.Document.Conversion;

// Zero config: Hybrid strategy, default vision model, defaults everywhere.
var converter = new DocumentToMarkdown();
var result = await converter.ConvertAsync(@"C:\inputs\report.pdf");

Console.WriteLine(result.Markdown);                          // final Markdown
Console.WriteLine($"certainty: {result.Certainty:F2}");
Console.WriteLine($"strategy: {result.EffectiveStrategy}");
Console.WriteLine($"elapsed: {result.Elapsed.TotalSeconds:F1}s");
Subscribe to PageStarting and PageCompleted
for live progress and per-page certainty. Route low-certainty pages
to a review queue without blocking the conversion.
// Stream per-page progress and route low-confidence pages to review.
var reviewQueue = new List<int>();

converter.PageStarting += (_, e) =>
{
    log.Info($"page {e.PageNumber}/{e.PageCount} via {e.PlannedStrategy}");
};

converter.PageCompleted += (_, e) =>
{
    if (e.PageResult is null)
    {
        log.Warn($"page {e.PageNumber} failed");
        return;
    }

    var p = e.PageResult;
    log.Info($"page {p.PageNumber}: {p.StrategyUsed}, " +
             $"{p.Elapsed.TotalMilliseconds:F0}ms, " +
             $"certainty {p.Certainty:F2}, " +
             $"{p.GeneratedTokenCount} tokens");

    if (p.Certainty < 0.70)
        reviewQueue.Add(p.PageNumber);
};

var result = await converter.ConvertAsync(@"C:\inputs\mixed.pdf");
Three production-grade configurations covering RAG ingestion, page-range processing with traditional OCR, and a custom VLM at maximum fidelity.
Front matter, normalised whitespace, Markdown tables, high-detail
VLM rendering. The pattern that ships clean Markdown straight into
a DocumentRag index. Pages under 70% certainty go to
the human review queue.
// RAG-tuned: front matter on, page separators visible, tables as Markdown.
var options = new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.Hybrid,
    EmitFrontMatter = true,
    PreferMarkdownTablesForNonNested = true,
    NormalizeWhitespace = true,
    VlmImageDetail = ImageDetail.High,
    VlmMaximumCompletionTokens = 4096,
};

var rag = new DocumentRag(model, embedder);

foreach (var path in Directory.EnumerateFiles(@"C:\corpus", "*.*"))
{
    var r = await converter.ConvertAsync(path, options);

    if (r.Certainty < 0.70)
    {
        humanQueue.Add(path);
        continue;
    }

    await rag.ImportDocumentAsync(r.Markdown, metadata: new() { Name = path });
}
Pin a page range, route raster images through the traditional OCR engine in parallel, write straight to disk. The path for annual reports and other long documents where you only need the front section.
// First 50 pages only. Add traditional OCR for embedded raster images.
var options = new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.TextExtraction,
    PageRange = "1-50",
    OcrEngine = new LMKitOcr(),
    OcrImageParallelism = 8,
};

await converter.ConvertToFileAsync(
    @"C:\inputs\annual-report.pdf",
    @"C:\out\annual-report-pages-1-50.md",
    options);
Maximal VLM detail, unlimited completion tokens, figure references preserved. The right configuration for scientific papers where equations and diagrams matter as much as the prose.
// Custom vision model with maxed-out detail for high-fidelity scientific papers.
var vlm = VisionLanguageModel.LoadFromModelID("paddleocr-vl:0.9b");
var converter = new DocumentToMarkdown(vlm);

var options = new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.VlmOcr,
    VlmImageDetail = ImageDetail.Maximal,
    VlmMaximumCompletionTokens = -1,  // no cap
    VlmStripImageMarkup = false,      // keep figure references
};

var r = await converter.ConvertAsync(@"C:\papers\nature-paper.pdf", options);
Typical format-by-format point tools: no OCR for scanned PDFs, no vision model for image-only pages, broken reading order on multi-column layouts, no certainty signal, no per-page diagnostics.
Cloud conversion APIs: strong on accuracy, but per-page billing and data egress. No local pipeline path, no way to keep the conversion on a developer laptop or in an air-gapped environment.
First-class .NET engine, ten input formats, three strategies, nineteen knobs, calibrated certainty score, 30+ refinement stages, 100% local, SOTA accuracy on public benchmarks. Plugs straight into the rest of the LM-Kit stack with no glue.
Universal entry point for vector databases. Markdown is the normalised intermediate before chunking and embedding.
Convert legacy PDF / DOCX libraries into Markdown for static-site generators, wikis, doc portals.
Build training corpora from heterogeneous internal documents. Front matter preserves provenance.
Compliance reviewers read clean Markdown, not raw PDFs. Certainty score routes ambiguous pages to a human queue.
EML / MBOX archives convert to per-message Markdown with headers and bodies preserved. Drop into RAG over inboxes.
Papers with formulas, tables, and charts convert via VLM OCR. VlmImageDetail.Maximal for highest fidelity.
The deterministic foundation under the TextExtraction strategy. Paragraph detection, reading-order recovery, six output modes, layout-aware search.
The OCR backend that powers the VlmOcr strategy: a native engine for classical OCR, vision-language models for layout-aware transcription.
Deskew, smart binarize, despeckle, autocrop. Preprocessing that makes both strategies more accurate.
Markdown-converted documents drop straight into the RAG pipeline. No glue.
Markdown is the most useful target for LLMs. The full conversion catalogue covers DOCX, HTML, PDF, EML and more.
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.