Any document. One Markdown.

LM-Kit ships a first-class document-to-Markdown engine built specifically for LLM consumption. DocumentToMarkdown handles PDF, DOCX, PPTX, XLSX, EML, MBOX, HTML, and any common image format. Three interchangeable strategies (text-layer extraction, vision-language OCR, hybrid per-page selection) cover every input shape. Nineteen configuration knobs control output. A calibrated per-page certainty score tells you exactly which pages to ship and which to review. State-of-the-art accuracy on public document-conversion benchmarks.

10+ input formats · 3 strategies · SOTA benchmark accuracy

TextExtraction

Embedded text-layer parsing. Fastest path. Sub-second per page on modern desktops. Optional traditional OCR enrichment for embedded raster images.

Hybrid

Per-page auto-select. Text layer where it is clean and image-free; vision OCR where the page demands it. Recommended default.

VlmOcr

Rasterise every page, transcribe with a vision-language model. Best on scans, handwriting, complex layouts. Default model: lightonocr-2:1b.

Why a first-class engine

PDF-to-Markdown is not a one-liner.

A naive Markdown converter handles digital PDFs and stops there. Real corpora are messier: 200-page reports half-digital and half-scanned, DOCX with embedded raster images, contracts with footnotes that look like headings, financial tables that span columns, scientific papers with formulas, screenshots of legacy applications, EML attachments buried three levels deep. Each one breaks a one-liner. The DocumentToMarkdown engine handles them as one API.

Six input overloads

File path, byte[], Stream, ImageBuffer, Uri, or pre-built Attachment. Sync and async variants. Direct-to-disk shortcuts.

Format-specific dispatch

EML, MBOX, HTML, DOCX route through dedicated converters tuned for each format. PDF, images, PPTX, XLSX, TXT route through the page-by-page strategy pipeline.

Calibrated certainty

Every page and every document carries a confidence score in [0, 1] blended from 30+ signal families. Ship with confidence, review with intent.

Cancellable progress

PageStarting fires before each page; subscribers can set e.Cancel = true to abort. PageCompleted reports diagnostics including elapsed time, strategy used, and token count.
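The cancel-before-each-page pattern can be sketched without the engine. This is a minimal stand-in, not the LM-Kit API: it uses .NET's built-in CancelEventArgs, and the `pageStarting` delegate, the page loop, and the three-page budget are all hypothetical. The source confirms only that PageStarting exposes a settable Cancel flag.

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;

// Hypothetical stand-in for the engine's PageStarting event (assumption:
// the real event args expose a settable Cancel flag, as the text describes).
EventHandler<CancelEventArgs>? pageStarting = null;

var processed = new List<int>();
int pageBudget = 3; // illustrative limit: stop after three pages

// Subscriber: request cancellation once the budget is used up.
pageStarting += (sender, e) =>
{
    if (processed.Count >= pageBudget) e.Cancel = true;
};

// Simulated page loop: raise the event before each page and honour Cancel.
for (int page = 1; page <= 10; page++)
{
    var args = new CancelEventArgs();
    pageStarting?.Invoke(null, args);
    if (args.Cancel) break;
    processed.Add(page); // a real engine would convert the page here
}

Console.WriteLine($"pages processed: {string.Join(", ", processed)}");
```

The subscriber decides and the loop obeys, so cancellation logic stays out of the conversion code.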

30+ refinement stages

Smart-punctuation normalisation, TOC reconstruction, table fusion, heading-demotion filters, numbered-section promotion, abbreviation capitalisation. The Markdown is clean.

Plugs into RAG

The output drops straight into DocumentRag, any embedding pipeline, or a SimpleSearch index. Markdown is your normalised intermediate.

Three strategies, one decision

Match the path to the page.

Strategy is a per-call enum. Hybrid auto-resolves per page. Each strategy exposes its own configuration knobs.

Calibrated certainty score

Know which pages to trust.

Every DocumentToMarkdownPageResult exposes a Certainty in [0, 1]. The aggregate DocumentToMarkdownResult.Certainty is a word-count-weighted average across pages. Both are calibrated against a curated benchmark corpus with non-negative least-squares regression and 5-fold cross-validation, which makes the score interpretable and the weights non-negative (each signal can only increase confidence).
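A word-count-weighted average is easy to check by hand. The sketch below assumes each page contributes a word count and a certainty; the tuple shape and the `AggregateCertainty` helper are illustrative, not the library's types, and the zero-word fallback is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Word-count-weighted mean of per-page certainties.
// The tuple shape is illustrative, not the library's result type.
static double AggregateCertainty(IReadOnlyList<(int Words, double Certainty)> pageScores)
{
    long totalWords = pageScores.Sum(p => (long)p.Words);
    if (totalWords == 0) return 0.0; // assumption: a document with no words scores 0
    return pageScores.Sum(p => p.Words * p.Certainty) / totalWords;
}

var samplePages = new List<(int Words, double Certainty)>
{
    (400, 0.90), // dense digital page
    (100, 0.50), // sparse scanned page
};

double aggregate = AggregateCertainty(samplePages);
Console.WriteLine($"aggregate certainty: {aggregate:F2}");
```

With these two pages the weighted result is (400 × 0.90 + 100 × 0.50) / 500 = 0.82, whereas a plain mean would give 0.70. Long pages dominate, so a single weak cover page does not sink an otherwise clean report.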

Text quality signals

Token naturalness (garbled-token ratio), repetition-free score, mixed-script detection, character-run repetitions. Catches OCR-style noise that LLMs hallucinate around.

OCR confidence

Fraction of words injected by the OCR pipeline versus extracted from the native text layer. High native ratio means high confidence.

Structural integrity

Table shape (nested, ragged, sparse, colspan / rowspan), heading hierarchy, layout coverage (Markdown words vs source words), paragraph cohesion, reading order.

Content shape

Character entropy in the natural-prose window (3.5-5.5 bits/char), symbol ratio (0.10-0.20 for natural prose), sentence closure, numeric-token ratio, mean token length.
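Two of these signals are simple to compute. The sketch below shows Shannon entropy over characters and one plausible definition of symbol ratio; how the engine actually measures either is not documented here, so both helpers are assumptions for illustration.

```csharp
using System;
using System.Linq;

// Shannon entropy of the character distribution, in bits per character.
static double CharEntropy(string text)
{
    if (text.Length == 0) return 0.0;
    return text.GroupBy(c => c)
               .Select(g => (double)g.Count() / text.Length)
               .Sum(p => -p * Math.Log2(p));
}

// Assumed definition: share of characters that are neither letters,
// digits, nor whitespace.
static double SymbolRatio(string text)
{
    if (text.Length == 0) return 0.0;
    int symbols = text.Count(c => !char.IsLetterOrDigit(c) && !char.IsWhiteSpace(c));
    return (double)symbols / text.Length;
}

var prose = "The quarterly report shows revenue growth across all regions.";
double entropy = CharEntropy(prose);
bool inProseWindow = entropy >= 3.5 && entropy <= 5.5; // window quoted in the text

Console.WriteLine($"entropy: {entropy:F2} bits/char, in prose window: {inProseWindow}");
Console.WriteLine($"symbol ratio: {SymbolRatio(prose):F2}");
```

A string of one repeated character scores 0 bits, and four distinct characters used equally score exactly 2 bits, which gives a feel for where the scale starts.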

VLM-specific

Truncation risk (completion length at or above 95% of the token cap) and the vision model's own quality score. The truncation signal detects when the VLM ran out of room.

Refinement footprint

Fraction of post-emission refinement stages that fired, normalised Markdown-length delta. High refinement intensity flags lower confidence.

Practical interpretation. Certainty >= 0.85: ship as-is. 0.70 - 0.85: spot-check tables and headings. < 0.70: re-run with a stronger strategy or route to human review.
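The thresholds above translate into a small routing function. The action names are hypothetical labels, and treating exactly 0.85 as "ship" and exactly 0.70 as "spot-check" is one reading of the stated bands.

```csharp
using System;

// Map a certainty score to the action suggested by the thresholds above.
// The returned labels are illustrative, not library values.
static string Route(double certainty)
{
    if (certainty >= 0.85) return "ship";
    if (certainty >= 0.70) return "spot-check";
    return "review";
}

foreach (var score in new[] { 0.92, 0.78, 0.41 })
    Console.WriteLine($"{score:F2} -> {Route(score)}");
```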

Configuration matrix

Nineteen knobs for nineteen workloads.

DocumentToMarkdownOptions exposes the full surface. Defaults are LLM-ready out of the box, so most callers set only one or two properties.

Strategy & pagination

Strategy

Pick TextExtraction, VlmOcr, or Hybrid. Default: Hybrid.

PageRange

String like "1-5, 7, 9-12". Default: all pages.
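To make the page-range syntax concrete, here is a minimal sketch of how a string such as "1-5, 7, 9-12" expands into page numbers. It illustrates the notation only; the `ParsePageRange` helper is hypothetical and is not the library's parser, which may validate input differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Expand a comma-separated list of pages and inclusive ranges.
// Illustrative only: no validation of reversed or malformed ranges.
static SortedSet<int> ParsePageRange(string spec)
{
    var result = new SortedSet<int>();
    foreach (var part in spec.Split(',', StringSplitOptions.RemoveEmptyEntries))
    {
        var bounds = part.Split('-');
        int first = int.Parse(bounds[0].Trim());
        int last = bounds.Length > 1 ? int.Parse(bounds[1].Trim()) : first;
        for (int p = first; p <= last; p++) result.Add(p);
    }
    return result;
}

Console.WriteLine(string.Join(" ", ParsePageRange("1-5, 7, 9-12")));
// prints: 1 2 3 4 5 7 9 10 11 12
```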

Traditional OCR (TextExtraction path)

OcrEngine

Optional traditional OCR for embedded raster images and text-sparse pages. Pass LMKitOcr or any custom OcrEngine.

OcrImageParallelism

Number of concurrent OCR calls across a page's embedded images. Range [1, 12]. Default: 4.

Vision-language OCR (VlmOcr / Hybrid path)

VlmImageDetail

Pixel budget per page: Minimal, Low, Standard, High, Maximal. Default: High.

VlmMaximumCompletionTokens

Cap completion tokens per page. -1 means unlimited. Default: 3072.

VlmStripImageMarkup

Remove Markdown image references (![...]()) from VLM output. Default: true.

VlmStripStyleAttributes

Strip inline style="..." attributes from output. Default: true.
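The two strip options describe simple text transformations. The regular expressions below are an assumed approximation of what such stripping could look like, not the engine's implementation, and both helper names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Remove Markdown image references of the form ![alt](target).
static string StripImageMarkup(string markdown) =>
    Regex.Replace(markdown, @"!\[[^\]]*\]\([^)]*\)", string.Empty);

// Remove inline style="..." (or style='...') attributes from embedded HTML.
static string StripStyleAttributes(string markup) =>
    Regex.Replace(markup, @"\s+style\s*=\s*(""[^""]*""|'[^']*')", string.Empty,
                  RegexOptions.IgnoreCase);

var raw = "Intro ![chart](fig1.png) text\n<td style=\"color:red\">42</td>";
var cleaned = StripStyleAttributes(StripImageMarkup(raw));

Console.WriteLine(cleaned);
```

The image reference disappears and the table cell keeps its content but loses its styling. The leftover double space after "Intro" is the kind of residue a whitespace pass would tidy afterwards.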

Output shaping

IncludePageSeparators

Insert separator between pages. Default: true.

PageSeparatorFormat

Template with {pageNumber} placeholder. Default: "\n\n---\n\n<!-- Page {pageNumber} -->\n\n".

EmitFrontMatter

Prepend YAML front-matter (source, pages, strategy, converted_at, elapsed_ms). Default: false.

NormalizeWhitespace

Collapse consecutive blank lines to one. Default: true.

PreferMarkdownTablesForNonNested

Rewrite non-nested HTML tables to GitHub-flavoured pipe syntax. Nested or span-using tables stay as HTML. Default: false.
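Page separators and whitespace normalisation are easiest to understand on a tiny input. The sketch below applies the default separator template quoted above and then collapses blank-line runs. It rests on two assumptions: the separator goes before every page after the first, and {pageNumber} is the number of the page that follows. The helper names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Default template quoted from the option description above.
const string separatorFormat = "\n\n---\n\n<!-- Page {pageNumber} -->\n\n";

static string JoinPages(IReadOnlyList<string> pageMarkdown, string format)
{
    var sb = new StringBuilder();
    for (int i = 0; i < pageMarkdown.Count; i++)
    {
        // Assumption: the separator precedes every page after the first.
        if (i > 0) sb.Append(format.Replace("{pageNumber}", (i + 1).ToString()));
        sb.Append(pageMarkdown[i]);
    }
    return sb.ToString();
}

// Collapse runs of blank lines to a single blank line.
static string CollapseBlankLines(string text) =>
    Regex.Replace(text, @"\n{3,}", "\n\n");

var joined = JoinPages(new[] { "# Title\n\n\n\nBody", "Second page" }, separatorFormat);
var normalised = CollapseBlankLines(joined);

Console.WriteLine(normalised);
```

The HTML comment in the separator is invisible when rendered but gives a chunker a reliable page boundary to split on.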

DOCX-specific

IncludeTables

Preserve DOCX tables. Default: true.

IncludeImages

Preserve image references. Default: true.

IncludeHyperlinks

Preserve hyperlinks. Default: true.

PreserveLineBreaks

Preserve intra-paragraph line breaks. Default: true.

IncludeEmptyParagraphs

Preserve blank paragraphs. Default: false.

Email-specific

EmlStripQuotes

Strip quoted reply content from EML / MBOX bodies. Default: false.
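Quoted reply content in plain-text email usually follows the ">" prefix convention. The sketch below drops such lines; it is a deliberately naive illustration with a hypothetical helper, and the engine's own heuristics are not documented here and are likely broader.

```csharp
using System;
using System.Linq;

// Drop lines that start with '>' (the common plain-text quoting convention).
static string StripQuotedReplies(string body)
{
    var kept = body.Split('\n')
                   .Where(line => !line.TrimStart().StartsWith('>'));
    return string.Join("\n", kept).TrimEnd();
}

var message = "Thanks, approved.\n\n> Can you approve the budget?\n> Regards";
var stripped = StripQuotedReplies(message);

Console.WriteLine(stripped);
```

Stripping quotes keeps each message's own contribution and avoids indexing the same thread text once per reply.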

30+ Markdown refinement stages

The Markdown is clean.

Raw text-layer extraction or VLM output is rarely shippable as-is. The refinement pipeline runs every page through 30+ ordered stages that fix the small things at scale.

Smart punctuation

Quote-spacing normalisation, leader-dot tightening, smart-punctuation normalisation. The output reads like a writer typed it.

Table-of-contents repair

Multi-line TOC entries are paired, numeric ranges joined, leader dots cleaned. TOCs survive the conversion.

Table fusion

Adjacent tables merged when they share columns. Header rows merged. Row-duplicates removed. Trailing footnote markers stripped.

Heading-demotion filters

Six false-positive filters (preposition-start, chart-legend, hyphen-end, short-numbered-first, overlapping-headings, etc.) demote bogus headings to body text.

Numbered-section promotion

Real numbered sections (1.2 Introduction) promoted to headings even when source typography missed them.
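A promotion rule of this kind can be approximated with one pattern. The heuristic below is an assumption for illustration, not the engine's rule: a dotted section number, a capitalised title, and no full stop anywhere in the title. The mapping from numbering depth to heading level is also assumed.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Promote lines like "1.2 Introduction" to Markdown headings.
// Assumed mapping: "1" -> ##, "1.2" -> ###, capped at six hashes.
static string PromoteNumberedSection(string line)
{
    var m = Regex.Match(line, @"^(\d+(?:\.\d+)*)\s+([A-Z][^.]*)$");
    if (!m.Success) return line;
    int depth = m.Groups[1].Value.Count(ch => ch == '.') + 1;
    return new string('#', Math.Min(depth + 1, 6)) + " " + line;
}

Console.WriteLine(PromoteNumberedSection("1.2 Introduction"));
Console.WriteLine(PromoteNumberedSection("2 apples were eaten."));
```

The first line becomes a heading. The second stays body text because its title starts in lower case and ends with a full stop, which is the kind of false positive the demotion filters guard against.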

Cell-content cleanup

Cell alignment fixes, abbreviation capitalisation, space normalisation. Tables read naturally.

Conversion in three lines

From a path to clean Markdown.

The zero-config quick path is shown here. A diagnostics pattern builds on it by streaming per-page progress and routing low-confidence pages to human review.

Zero configuration: new DocumentToMarkdown(), call ConvertAsync, read the result. The engine picks the Hybrid strategy and emits Markdown plus calibrated certainty.

QuickConvert.cs
using System;
using LMKit.Document.Conversion;

// Zero config: Hybrid strategy, default vision model, defaults everywhere.
var converter = new DocumentToMarkdown();
var result    = await converter.ConvertAsync(@"C:\inputs\report.pdf");

Console.WriteLine(result.Markdown);            // final Markdown
Console.WriteLine($"certainty: {result.Certainty:F2}"); // aggregate score in [0, 1]
Console.WriteLine($"strategy: {result.EffectiveStrategy}");
Console.WriteLine($"elapsed:  {result.Elapsed.TotalSeconds:F1}s");

Real configurations

Tune for the workload.

Three production-grade configurations covering RAG ingestion, page-range processing with traditional OCR, and a custom VLM at maximum fidelity.

Front matter, normalised whitespace, Markdown tables, high-detail VLM rendering. The pattern that ships clean Markdown straight into a DocumentRag index. Documents whose aggregate certainty falls below 0.70 go to the human review queue.

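RagIngestion.cs
using System.IO;
using LMKit.Document.Conversion;
// Also add the using directive for the namespace that contains DocumentRag.

// Assumed to exist in the host application: `model` and `embedder` (loaded models)
// and `humanQueue` (any collection of paths awaiting manual review).
var converter = new DocumentToMarkdown();

// RAG-tuned: front matter on, tables as Markdown, page separators left at their default (on).
var options = new DocumentToMarkdownOptions
{
    Strategy                          = DocumentToMarkdownStrategy.Hybrid,
    EmitFrontMatter                   = true,
    PreferMarkdownTablesForNonNested  = true,
    NormalizeWhitespace               = true,
    VlmImageDetail                    = ImageDetail.High,
    VlmMaximumCompletionTokens        = 4096,
};

var rag = new DocumentRag(model, embedder);

foreach (var path in Directory.EnumerateFiles(@"C:\corpus", "*.*"))
{
    var r = await converter.ConvertAsync(path, options);

    // Document-level gate: low aggregate certainty goes to human review, not the index.
    if (r.Certainty < 0.70) { humanQueue.Add(path); continue; }
    await rag.ImportDocumentAsync(r.Markdown, metadata: new() { Name = path });
}

Versus the alternatives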

Markdown converters are not equal.

CLI converters

Format-by-format. No OCR for scanned PDFs. No vision model for image-only pages. Reading order broken on multi-column. No certainty signal. No per-page diagnostics.

Cloud document services

Strong on accuracy but per-page billing and data egress. No local pipeline path. No way to keep the conversion on a developer laptop or in an air-gapped environment.

DocumentToMarkdown

First-class .NET engine, 10+ input formats, three strategies, nineteen knobs, calibrated certainty score, 30+ refinement stages, 100% local, SOTA accuracy on public benchmarks. Plugs straight into the rest of the LM-Kit stack with no glue.

Where DocumentToMarkdown ships

Real corpora, real volumes.

RAG ingestion

Universal entry point for vector databases. Markdown is the normalised intermediate before chunking and embedding.

Knowledge-base import

Convert legacy PDF / DOCX libraries into Markdown for static-site generators, wikis, doc portals.

LLM fine-tuning data

Build training corpora from heterogeneous internal documents. Front matter preserves provenance.

Document review

Compliance reviewers read clean Markdown, not raw PDFs. Certainty score routes ambiguous pages to a human queue.

Email archive ingestion

EML / MBOX archives convert to per-message Markdown with headers and bodies preserved. Drop into RAG over inboxes.

Scientific publishing

Papers with formulas, tables, and charts convert via VLM OCR. Set VlmImageDetail to Maximal for highest fidelity.

Related capabilities

Markdown plus the rest of Document Intelligence.

Layout understanding

The deterministic foundation under the TextExtraction strategy. Paragraph detection, reading-order recovery, six output modes, layout-aware search.

OCR

The OCR backend that powers the VlmOcr strategy. Native engine for classical OCR, vision-language models for layout-aware transcription.

Image processing

Deskew, smart binarize, despeckle, autocrop. Preprocessing that makes the OCR-backed strategies more accurate.

Document RAG

Markdown-converted documents drop straight into the RAG pipeline. No glue.

Document conversion

Markdown is the most useful target for LLMs. The full conversion catalogue covers DOCX, HTML, PDF, EML and more.

Format zoo. Markdown out.
