Most .NET teams pull in three or four libraries to handle PDFs:
one to parse, one to render, one to write, one to OCR. LM-Kit ships
a complete PDF toolkit in a single NuGet package: PdfDocument
for read / write / render, PdfSearchableMaker for
OCR-stamped searchable PDFs, a search-highlight engine for locating
matches visually, and 15+ built-in agent tools for headless automation.
PdfDocument
Parse, render, manipulate. File, stream, or byte-array input.
PdfSearchableMaker
Stamp an invisible OCR text layer onto scanned PDFs. Searchable, copyable, indexable.
SearchHighlightEngine
Locate text and produce a marked-up PDF with visible highlights. Drives find-in-document UIs.
Every operation below is exposed both as a high-level .NET API and as a
built-in agent tool (pdf_*) so an agent can perform it
autonomously.
Read
Open PDFs from file, stream, or byte[]. Inspect metadata (title, author, permissions). Iterate pages.
Render
Render any page at any DPI. Drives thumbnails, page previews, and vision-model inputs.
Merge
pdf_merge tool plus direct API. Concatenate any number of PDFs into one. Preserves bookmarks and metadata.
Split
pdf_split for fixed-page splits. Pair with DocumentSplitter for semantic boundary detection (multi-document scans).
Search
pdf_search finds matches by phrase or regex. Returns page numbers and per-match positions.
Highlight
SearchHighlightEngine returns a marked-up PDF with visible highlights at every match. Drives in-app find-and-show UIs.
Searchable
PdfSearchableMaker runs OCR and embeds an invisible text layer. The result looks identical and is indexable / copyable.
Unlock
pdf_unlock opens password-protected PDFs given the password. Useful for legitimate access to protected archives.
Pages
Rotate, delete, flatten annotations, set orientation. Inspect page count and per-page metadata via pdf_pages.
Extract
pdf_extract pulls text and embedded images out of any PDF. Pair with EmbeddedImageOcr for OCR over images embedded in text-layer PDFs.
Metadata
pdf_metadata reads title, author, subject, keywords, creation date, encryption status, page count.
Build
ImageToPdf wraps one or many images into a PDF. Pair with ImageToSearchablePdf to add OCR text in one pass.
Archive
PdfGenerationOptions.Version = PdfA1b emits ISO 19005-1 archival PDFs. Supports PDF/A-1B, 2B, and 3B with full XMP metadata and an OCR text layer in the same pass.
TIFF
ImageToSearchablePdf.ConvertAsync ingests multipage TIFFs straight from scanners and fax servers and emits one searchable PDF/A. Per-page OCR runs in parallel.
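To make the Search entry above concrete, here is a minimal sketch of what "matches by phrase or regex, with page numbers and per-match positions" can look like. It is not the LM-Kit implementation and does not call the toolkit: it assumes the page text has already been extracted into one string per page, and it reports character offsets rather than on-page rectangles. The names PageMatch and PageTextSearch are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical result shape: 1-based page number plus character offset and length.
public readonly record struct PageMatch(int Page, int Index, int Length, string Value);

public static class PageTextSearch
{
    // Scans already-extracted page text; pages[0] is treated as page 1.
    public static List<PageMatch> Find(
        IReadOnlyList<string> pages, string pattern, bool isRegex = false)
    {
        // A plain phrase is escaped so characters such as '.' match literally.
        var regex = new Regex(
            isRegex ? pattern : Regex.Escape(pattern),
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

        var results = new List<PageMatch>();
        for (int i = 0; i < pages.Count; i++)
        {
            foreach (Match m in regex.Matches(pages[i]))
            {
                results.Add(new PageMatch(i + 1, m.Index, m.Length, m.Value));
            }
        }
        return results;
    }
}
```

Escaping the phrase by default keeps a literal query such as "4.2" from being read as a pattern; callers opt in to regex semantics explicitly with isRegex: true.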
Convert a folder of scanned PDFs into searchable PDFs with an invisible text layer rendered by OCR.
    using System.IO;
    using LMKit.Document.Pdf;
    using LMKit.Extraction.Ocr;

    // Turn a folder of scanned PDFs into searchable PDFs.
    var ocr = new LMKitOcr(); // CPU-only, fast
    var maker = new PdfSearchableMaker(ocr);

    foreach (var path in Directory.EnumerateFiles(@"C:\scans", "*.pdf"))
    {
        await maker.MakeSearchableAsync(path, $@"C:\out\{Path.GetFileName(path)}");
    }

    // Output PDFs look identical and are now full-text indexable.
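For unattended batches, it can help to create the output folder up front and to keep going when a single file fails. The sketch below shows one way to structure that loop using only standard .NET APIs. The BatchConvert class and its convert delegate are hypothetical: the delegate stands in for whatever single-file call you use, and nothing here is part of the LM-Kit API.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public static class BatchConvert
{
    // convert(inputPath, outputPath) is a placeholder for the single-file conversion call.
    // Returns the input paths that failed so the caller can retry or report them.
    public static async Task<List<string>> RunAsync(
        string inputDir, string outputDir, Func<string, string, Task> convert)
    {
        Directory.CreateDirectory(outputDir); // no-op if the folder already exists

        var failed = new List<string>();
        foreach (var path in Directory.EnumerateFiles(inputDir, "*.pdf"))
        {
            var target = Path.Combine(outputDir, Path.GetFileName(path));
            try
            {
                await convert(path, target);
            }
            catch (Exception ex)
            {
                // Record the failure and continue with the remaining files.
                Console.Error.WriteLine($"{path}: {ex.Message}");
                failed.Add(path);
            }
        }
        return failed;
    }
}
```

Assuming the single-file call returns a Task, as the awaited call in the sample suggests, it could be passed in as (src, dst) => maker.MakeSearchableAsync(src, dst).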
Convert a multipage TIFF straight from a scanner or fax server into an ISO 19005-1 PDF/A-1B archive with a searchable OCR text layer in one call.
    using LMKit.Document.Conversion;
    using LMKit.Document.Pdf;
    using LMKit.Extraction.Ocr;

    // On-device OCR + PDF/A-1B archival in one call.
    var ocr = new LMKitOcr();
    var options = new PdfGenerationOptions
    {
        Version = PdfGenerationOptions.PdfVersion.PdfA1b,
        MaxDegreeOfParallelism = 4,
        EnableOrientationDetection = true,
    };

    await ImageToSearchablePdf.ConvertAsync(
        @"C:\fax\inbox\case-2026-0142.tif",
        ocr,
        @"C:\archive\case-2026-0142.pdf",
        options);

    // Result: ISO 19005-1 (PDF/A-1B) compliant, OCR-searchable, audit-ready.
Run a search over an existing PDF and emit a new copy with every match visually highlighted for reviewers.
    using System;
    using LMKit.Document.Pdf;
    using LMKit.TextAnalysis;

    var engine = new SearchHighlightEngine(@"C:\contracts\msa.pdf");

    // Find every mention and produce a highlighted PDF for the reviewer.
    SearchHighlightResult r = await engine.HighlightAsync(
        query: "indemnification",
        output: @"C:\out\msa-highlighted.pdf");

    Console.WriteLine($"Found {r.Matches.Count} matches across {r.Pages.Count} pages");
Merge several PDFs into a single bundle, then OCR the embedded images on every page of a PDF in place.
    using LMKit.Document.Pdf;
    using LMKit.Extraction.Ocr; // LMKitOcr

    // Merge a stack of one-pagers into a single book.
    var book = PdfDocument.Merge(
        @"C:\out\book.pdf",
        @"C:\pages\01.pdf",
        @"C:\pages\02.pdf",
        @"C:\pages\03.pdf");

    // Extract every embedded image, OCR the ones that need it.
    using var doc = new PdfDocument(@"C:\reports\annual.pdf");
    var ocr = new EmbeddedImageOcr(new LMKitOcr());

    foreach (var page in doc.Pages)
    {
        await ocr.RunAsync(page); // updates page text in-place
    }
The same toolkit is registered as built-in agent tools so an LLM can drive
it. Available tools include pdf_extract, pdf_merge,
pdf_split, pdf_search,
pdf_search_highlight, pdf_to_image,
pdf_unlock, pdf_metadata, pdf_pages,
image_to_pdf, eml_to_pdf, plus the conversion
family (markdown_to_pdf, markdown_to_docx,
markdown_to_html) and OCR (ocr_recognize).
Register them on any agent and let it run document workflows end-to-end.
Searchable-PDF generation runs on top of LMKitOcr or VlmOcr. Pick the engine to match accuracy / speed needs.
Markdown to PDF, HTML to Markdown, image to PDF, and the full conversion catalogue.
When a single PDF holds multiple logical documents, semantic splitting separates them by content boundary.
All pdf_* tools registered out of the box. Compose with ToolPermissionPolicy for safe agent execution.
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.
Demos
Each demo is a console sample on GitHub, paired with a step-by-step doc page: prerequisites, setup, code path, expected output.
Title, author, permissions, page sizes, encryption status via PdfInfo.
Combine an ordered list of PDFs into a single output with PdfMerger.MergeFiles.
Slice one PDF into N smaller PDFs along caller-defined ranges with PdfSplitter.SplitToFiles.
Render each page to PNG / JPEG / WebP / BMP / TIFF / TGA / PNM via PdfRenderer.RenderPagesToFolder.
Pack every selected page into one multi-page TIFF via PdfRenderer.RenderPagesToMultipageTiff.
Auto-fix sideways scans or apply a uniform rotation with PdfEditor.ApplyToFile + PageEdit.
Layout-aware keyword search returning page index, snippet, and bounding-box rectangles.
Take an image-only scanned PDF and produce a searchable PDF via PdfSearchableMaker.ConvertToFile.
Convert scanned multipage TIFFs into searchable PDF/A-1B (ISO 19005-1) archives via ImageToSearchablePdf.ConvertAsync + PdfGenerationOptions.Version = PdfA1b.
Inspect, render, search, split, and edit password-protected PDFs end to end with the password flowing through every PDF class.
How-to guides
PdfToImage vs PdfRenderer, full format matrix, async + cancellation + progress, encrypted PDFs.
PDF + DOCX + HTML + EML through one pipeline.
API reference
API reference for the PDF toolkit namespace.