LM-Kit ships a first-class document-to-Markdown engine
built specifically for LLM consumption. DocumentToMarkdown
handles PDF, DOCX, PPTX, XLSX, EML, MBOX, HTML, and any common image
format. Three interchangeable strategies (text-layer extraction,
vision-language OCR, hybrid per-page selection) cover every input
shape. Nineteen configuration knobs control output. A calibrated
per-page certainty score tells you exactly which
pages to ship and which to review. State-of-the-art accuracy
on public document-conversion benchmarks.
TextExtraction - Embedded text-layer parsing. Fastest path. Sub-second per page on modern desktops. Optional traditional OCR enrichment for embedded raster images.
Hybrid - Per-page auto-select. Text layer where it is clean and image-free; vision OCR where the page demands it. Recommended default.
VlmOcr - Rasterise every page, transcribe with a vision-language model. Best on scans, handwriting, complex layouts. Default model: lightonocr-2:1b.
A naive Markdown converter handles digital PDFs and stops there. Real
corpora are messier: 200-page reports half-digital and half-scanned,
DOCX with embedded raster images, contracts with footnotes that look
like headings, financial tables that span columns, scientific papers
with formulas, screenshots of legacy applications, EML attachments
buried three levels deep. Each one breaks a one-liner. The
DocumentToMarkdown engine handles them as one API.
File path, byte[], Stream, ImageBuffer, Uri, or pre-built Attachment. Sync and async variants. Direct-to-disk shortcuts.
EML, MBOX, HTML, DOCX route through dedicated converters tuned for each format. PDF, images, PPTX, XLSX, TXT route through the page-by-page strategy pipeline.
Every page and every document carries a confidence score in [0, 1] blended from 30+ signal families. Ship with confidence, review with intent.
PageStarting fires before each page; subscribers can set e.Cancel = true to abort. PageCompleted reports diagnostics including elapsed, strategy used, token count.
Smart-punctuation normalisation, TOC reconstruction, table fusion, heading-demotion filters, numbered-section promotion, abbreviation capitalisation. The Markdown is clean.
The output drops straight into DocumentRag, any embedding pipeline, or a SimpleSearch index. Markdown is your normalised intermediate.
Strategy is a per-call enum. Hybrid auto-resolves per page. Each strategy exposes its own configuration knobs.
TextExtraction
Reads the embedded text layer of digital PDFs and Office documents directly. No model, no inference, sub-second per page. Optional OcrEngine extends with traditional OCR for embedded raster images and full-page fallback when a page is text-sparse. Configurable embedded-image OCR parallelism (1-12 concurrent calls).
Best for: digital reports, contracts, technical documentation, software-generated PDFs.
VlmOcr
Rasterises every page and transcribes through a vision-language model. Layout-aware: tables, formulas, charts, multi-column structure preserved. Configurable image detail (Minimal / Low / Standard / High / Maximal pixel budget) and per-page completion-token cap. Default model: lightonocr-2:1b, lazy-loaded.
Best for: scans, photographs, screenshots, hand-annotated forms, layout-heavy financial reports.
Hybrid
Inspects each page. Picks TextExtraction when the text layer is clean and the page is image-free. Picks VlmOcr when the page is scanned, image-heavy, or has missing text. Image-only attachments always route to VLM. Recommended default for mixed corpora.
Best for: heterogeneous archives, third-party document drops, RAG ingestion pipelines that cannot pre-classify.
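The headline rule behind Hybrid's per-page choice can be pictured as a small routing function. This is an illustrative sketch only; the actual resolver weighs many more signals than the three shown here, and ResolveStrategy is a hypothetical name, not part of the LM-Kit API.

```csharp
using System;

// Illustrative sketch of Hybrid's per-page routing (not the engine's real heuristic).
string ResolveStrategy(bool hasTextLayer, bool textLayerClean, bool imageHeavy)
{
    // No usable text layer (scans, photos, image-only attachments): vision OCR.
    if (!hasTextLayer) return "VlmOcr";

    // Clean, image-free text layer: the fast deterministic path.
    if (textLayerClean && !imageHeavy) return "TextExtraction";

    // Garbled layer or heavy imagery: the page demands the vision model.
    return "VlmOcr";
}

Console.WriteLine(ResolveStrategy(true, true, false));   // TextExtraction
Console.WriteLine(ResolveStrategy(false, false, false)); // VlmOcr
```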
Every DocumentToMarkdownPageResult exposes a
Certainty in [0, 1]. The aggregate
DocumentToMarkdownResult.Certainty is a word-count-weighted
average across pages. Both are calibrated against a curated benchmark
corpus with non-negative least-squares regression and 5-fold
cross-validation, which makes the score interpretable and the weights
non-negative (each signal can only increase confidence).
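The word-count-weighted aggregation described above can be sketched as a pure function. AggregateCertainty is an illustrative name for the calculation, not the library API: a page with many words moves the document score more than a near-empty page.

```csharp
using System;

// Word-count-weighted average of per-page certainty scores.
double AggregateCertainty((double Certainty, int WordCount)[] pages)
{
    double weighted = 0, totalWords = 0;
    foreach (var (certainty, words) in pages)
    {
        weighted += certainty * words;
        totalWords += words;
    }
    return totalWords > 0 ? weighted / totalWords : 0;
}

// Hypothetical two-page document: a large clean page and a small noisy one.
var doc = new[] { (0.90, 900), (0.50, 100) };
Console.WriteLine($"{AggregateCertainty(doc):F2}"); // 0.86: dominated by the large page
```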
Token naturalness (garbled-token ratio), repetition-free score, mixed-script detection, character-run repetitions. Catches OCR-style noise that LLMs hallucinate around.
Fraction of words injected by the OCR pipeline versus extracted from the native text layer. High native ratio means high confidence.
Table shape (nested, ragged, sparse, colspan / rowspan), heading hierarchy, layout coverage (Markdown words vs source words), paragraph cohesion, reading order.
Character entropy in the natural-prose window (3.5-5.5 bits/char), symbol ratio (0.10-0.20 for natural prose), sentence closure, numeric-token ratio, mean token length.
Truncation risk (token budget within 95% of the cap), the vision model's own quality score. Detects when the VLM ran out of room.
Fraction of post-emission refinement stages that fired, normalised Markdown-length delta. High refinement intensity flags lower confidence.
Practical interpretation.
Certainty >= 0.85: ship as-is.
0.70 - 0.85: spot-check tables and headings.
< 0.70: re-run with a stronger strategy or route to human review.
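The thresholds above translate directly into a routing helper. Route and its return labels are illustrative names for this sketch, not part of the LM-Kit API; result.Certainty is the value you would feed it.

```csharp
using System;

// Route a converted page or document by its calibrated certainty,
// using the interpretation thresholds above.
string Route(double certainty) => certainty switch
{
    >= 0.85 => "ship",        // ship as-is
    >= 0.70 => "spot-check",  // spot-check tables and headings
    _       => "review"       // re-run with a stronger strategy or human review
};

Console.WriteLine(Route(0.92)); // ship
Console.WriteLine(Route(0.75)); // spot-check
Console.WriteLine(Route(0.40)); // review
```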
DocumentToMarkdownOptions exposes the full surface. Defaults
are LLM-ready out of the box, so most callers set only one or two
properties.
Strategy - Pick TextExtraction, VlmOcr, or Hybrid. Default: Hybrid.
PageRange - String like "1-5, 7, 9-12". Default: all pages.
OcrEngine - Optional traditional OCR for embedded raster images and text-sparse pages. Pass LMKitOcr or any custom OcrEngine.
OcrImageParallelism - Concurrent OCR calls per page's embedded images. Range [1, 12]. Default: 4.
VlmImageDetail - Pixel budget per page: Minimal, Low, Standard, High, Maximal. Default: High.
VlmMaximumCompletionTokens - Cap completion tokens per page. -1 means unlimited. Default: 3072.
VlmStripImageMarkup - Remove Markdown image references (![...]()) from VLM output. Default: true.
VlmStripStyleAttributes - Strip inline style="..." attributes from output. Default: true.
IncludePageSeparators - Insert a separator between pages. Default: true.
PageSeparatorFormat - Template with {pageNumber} placeholder. Default: "\n\n---\n\n<!-- Page {pageNumber} -->\n\n".
EmitFrontMatter - Prepend YAML front matter (source, pages, strategy, converted_at, elapsed_ms). Default: false.
NormalizeWhitespace - Collapse consecutive blank lines to one. Default: true.
PreferMarkdownTablesForNonNested - Rewrite non-nested HTML tables to GitHub-flavoured pipe syntax; nested or span-using tables stay as HTML. Default: false.
IncludeTables (DOCX) - Preserve DOCX tables. Default: true.
IncludeImages (DOCX) - Preserve image references. Default: true.
IncludeHyperlinks (DOCX) - Preserve hyperlinks. Default: true.
PreserveLineBreaks (DOCX) - Preserve intra-paragraph line breaks. Default: true.
IncludeEmptyParagraphs (DOCX) - Preserve blank paragraphs. Default: false.
EmlStripQuotes - Strip quoted reply content from EML / MBOX bodies. Default: false.
Raw text-layer extraction or VLM output is rarely shippable as-is. The refinement pipeline runs every page through 30+ ordered stages that fix the small things at scale.
Quote-spacing normalisation, leader-dot tightening, smart-punctuation. The output reads like a writer typed it.
Multi-line TOC entries are paired, numeric ranges joined, leader dots cleaned. TOCs survive the conversion.
Adjacent tables merged when they share columns. Header rows merged. Row-duplicates removed. Trailing footnote markers stripped.
Six false-positive filters (preposition-start, chart-legend, hyphen-end, short-numbered-first, overlapping-headings, etc.) demote bogus headings to body text.
Real numbered sections (1.2 Introduction) promoted to headings even when source typography missed them.
Cell alignment fixes, abbreviation capitalisation, space normalisation. Tables read naturally.
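One of these cleanups, the blank-line collapse that NormalizeWhitespace describes, can be sketched in a few lines. This is illustrative of the kind of work a refinement stage does, not the engine's actual implementation, and CollapseBlankLines is a hypothetical name.

```csharp
using System;
using System.Text.RegularExpressions;

// Collapse runs of blank lines down to a single blank line,
// the behaviour the NormalizeWhitespace option describes.
string CollapseBlankLines(string markdown) =>
    Regex.Replace(markdown, @"(\r?\n){3,}", "\n\n");

var noisy = "# Title\n\n\n\n\nFirst paragraph.\n\n\nSecond paragraph.";
Console.WriteLine(CollapseBlankLines(noisy));
```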
Pick a tab to see the zero-config quick path, then the diagnostics pattern that streams per-page progress and routes low-confidence pages to human review.
Zero configuration: new DocumentToMarkdown(), call
ConvertAsync, read the result. The engine picks the
Hybrid strategy and emits Markdown plus calibrated certainty.
using LMKit.Document.Conversion;

// Zero config: Hybrid strategy, default vision model, defaults everywhere.
var converter = new DocumentToMarkdown();
var result = await converter.ConvertAsync(@"C:\inputs\report.pdf");

Console.WriteLine(result.Markdown);                          // final Markdown
Console.WriteLine($"certainty: {result.Certainty:F2}");
Console.WriteLine($"strategy: {result.EffectiveStrategy}");
Console.WriteLine($"elapsed: {result.Elapsed.TotalSeconds:F1}s");
Subscribe to PageStarting and PageCompleted
for live progress and per-page certainty. Route low-certainty pages
to a review queue without blocking the conversion.
// Stream per-page progress and route low-confidence pages to review.
var reviewQueue = new List<int>();

converter.PageStarting += (_, e) =>
{
    log.Info($"page {e.PageNumber}/{e.PageCount} via {e.PlannedStrategy}");
};

converter.PageCompleted += (_, e) =>
{
    if (e.PageResult is null)
    {
        log.Warn($"page {e.PageNumber} failed");
        return;
    }

    var p = e.PageResult;
    log.Info($"page {p.PageNumber}: {p.StrategyUsed}, " +
             $"{p.Elapsed.TotalMilliseconds:F0}ms, " +
             $"certainty {p.Certainty:F2}, " +
             $"{p.GeneratedTokenCount} tokens");

    if (p.Certainty < 0.70)
        reviewQueue.Add(p.PageNumber);
};

var result = await converter.ConvertAsync(@"C:\inputs\mixed.pdf");
Three production-grade configurations covering RAG ingestion, page-range processing with traditional OCR, and a custom VLM at maximum fidelity.
Front matter, normalised whitespace, Markdown tables, high-detail
VLM rendering. The pattern that ships clean Markdown straight into
a DocumentRag index. Pages under 70% certainty go to
the human review queue.
// RAG-tuned: front matter on, page separators visible, tables as Markdown.
var options = new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.Hybrid,
    EmitFrontMatter = true,
    PreferMarkdownTablesForNonNested = true,
    NormalizeWhitespace = true,
    VlmImageDetail = ImageDetail.High,
    VlmMaximumCompletionTokens = 4096,
};

var rag = new DocumentRag(model, embedder);

foreach (var path in Directory.EnumerateFiles(@"C:\corpus", "*.*"))
{
    var r = await converter.ConvertAsync(path, options);

    if (r.Certainty < 0.70)
    {
        humanQueue.Add(path);
        continue;
    }

    await rag.ImportDocumentAsync(r.Markdown, metadata: new() { Name = path });
}
Pin a page range, route raster images through the traditional OCR engine in parallel, write straight to disk. The path for annual reports and other long documents where you only need the front section.
// First 50 pages only. Add traditional OCR for embedded raster images.
var options = new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.TextExtraction,
    PageRange = "1-50",
    OcrEngine = new LMKitOcr(),
    OcrImageParallelism = 8,
};

await converter.ConvertToFileAsync(
    @"C:\inputs\annual-report.pdf",
    @"C:\out\annual-report-pages-1-50.md",
    options);
Maximal VLM detail, unlimited completion tokens, figure references preserved. The right configuration for scientific papers where equations and diagrams matter as much as the prose.
// Custom vision model with maxed-out detail for high-fidelity scientific papers.
var vlm = VisionLanguageModel.LoadFromModelID("paddleocr-vl:0.9b");
var converter = new DocumentToMarkdown(vlm);

var options = new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.VlmOcr,
    VlmImageDetail = ImageDetail.Maximal,
    VlmMaximumCompletionTokens = -1,  // no cap
    VlmStripImageMarkup = false,      // keep figure references
};

var r = await converter.ConvertAsync(@"C:\papers\nature-paper.pdf", options);
Typical format-by-format point tools: no OCR for scanned PDFs, no vision model for image-only pages, broken reading order on multi-column layouts, no certainty signal, no per-page diagnostics.
Cloud conversion APIs: strong on accuracy, but per-page billing and data egress. No local pipeline path, no way to keep the conversion on a developer laptop or in an air-gapped environment.
First-class .NET engine, ten input formats, three strategies, nineteen knobs, calibrated certainty score, 30+ refinement stages, 100% local, SOTA accuracy on public benchmarks. Plugs straight into the rest of the LM-Kit stack with no glue.
Universal entry point for vector databases. Markdown is the normalised intermediate before chunking and embedding.
Convert legacy PDF / DOCX libraries into Markdown for static-site generators, wikis, doc portals.
Build training corpora from heterogeneous internal documents. Front matter preserves provenance.
Compliance reviewers read clean Markdown, not raw PDFs. Certainty score routes ambiguous pages to a human queue.
EML / MBOX archives convert to per-message Markdown with headers and bodies preserved. Drop into RAG over inboxes.
Papers with formulas, tables, and charts convert via VLM OCR. VlmImageDetail.Maximal for highest fidelity.
The deterministic foundation under the TextExtraction strategy. Paragraph detection, reading-order recovery, six output modes, layout-aware search.
The OCR backend that powers the VlmOcr strategy: a native engine for classical OCR, vision-language models for layout-aware transcription.
Deskew, smart binarize, despeckle, autocrop. Preprocessing that makes both strategies more accurate.
Markdown-converted documents drop straight into the RAG pipeline. No glue.
Markdown is the most useful target for LLMs. The full conversion catalogue covers DOCX, HTML, PDF, EML and more.
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.