Many .NET teams pull in three or four libraries to handle PDFs:
one to parse, one to render, one to write, one to OCR. LM-Kit ships
a complete PDF toolkit in a single NuGet package: PdfDocument
for reading, writing, and rendering; PdfSearchableMaker for
OCR-stamped searchable PDFs; a search-highlight engine for visually
locating matches; and 15+ built-in agent tools for headless automation.
PdfDocument
Parse, render, manipulate. File, stream, or byte-array input.
PdfSearchableMaker
Stamp an invisible OCR text layer onto scanned PDFs. Searchable, copyable, indexable.
SearchHighlightEngine
Locate text and produce a marked-up PDF with visible highlights. Drives find-in-document UIs.
Every operation below is exposed both as a high-level .NET API and as a
built-in agent tool (pdf_*) so an agent can perform it
autonomously.
Read
Open PDFs from file, stream, or byte[]. Inspect metadata (title, author, permissions). Iterate pages.
Render
Render any page at any DPI. Drives thumbnails, page previews, and vision-model inputs.
Merge
pdf_merge tool plus direct API. Concatenate any number of PDFs into one. Preserves bookmarks and metadata.
Split
pdf_split for fixed-page splits. Pair with DocumentSplitter for semantic boundary detection (multi-document scans).
Search
pdf_search finds matches by phrase or regex. Returns page numbers and per-match positions.
Highlight
SearchHighlightEngine returns a marked-up PDF with visible highlights at every match. Drives in-app find-and-show UIs.
Searchable
PdfSearchableMaker runs OCR and embeds an invisible text layer. The result looks identical to the original and is indexable and copyable.
Unlock
pdf_unlock opens password-protected PDFs given the password. Useful for legitimate access to protected archives.
Pages
Rotate, delete, flatten annotations, set orientation. Inspect page count and per-page metadata via pdf_pages.
Extract
pdf_extract pulls text and embedded images out of any PDF. Pair with EmbeddedImageOcr for OCR over images embedded in text-layer PDFs.
Metadata
pdf_metadata reads title, author, subject, keywords, creation date, encryption status, page count.
Build
ImageToPdf wraps one or many images into a PDF. Pair with ImageToSearchablePdf to add OCR text in one pass.
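The Search entry above mentions results that carry page numbers and per-match positions. As a rough illustration of that result shape only, here is a minimal sketch that runs a phrase or regex over per-page text using the .NET base class library. `PageMatch` and `PageSearch` are hypothetical names, not LM-Kit types, and the character offset stands in for whatever position format pdf_search actually returns.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical result type: 1-based page number, character offset within
// that page's text, and the matched text itself.
public readonly record struct PageMatch(int Page, int Offset, string Text);

public static class PageSearch
{
    // pageTexts[i] holds the already-extracted text of page i + 1.
    public static List<PageMatch> Find(
        IReadOnlyList<string> pageTexts, string pattern, bool isRegex)
    {
        // A plain phrase is escaped so characters like '.' match literally.
        var regex = new Regex(
            isRegex ? pattern : Regex.Escape(pattern),
            RegexOptions.IgnoreCase);

        var results = new List<PageMatch>();
        for (int i = 0; i < pageTexts.Count; i++)
        {
            foreach (Match m in regex.Matches(pageTexts[i]))
            {
                results.Add(new PageMatch(i + 1, m.Index, m.Value));
            }
        }
        return results;
    }
}
```

Calling `PageSearch.Find(pages, "indemnification", isRegex: false)` on a three-page input where only pages 1 and 3 contain the word yields two entries, one per page, each with the offset at which the word starts.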
Convert a folder of scanned PDFs into searchable PDFs with an invisible text layer generated by OCR.
```csharp
using LMKit.Document.Pdf;
using LMKit.Extraction.Ocr;

// Turn a folder of scanned PDFs into searchable PDFs.
var ocr = new LMKitOcr(); // CPU-only, fast
var maker = new PdfSearchableMaker(ocr);

foreach (var path in Directory.EnumerateFiles(@"C:\scans", "*.pdf"))
{
    await maker.MakeSearchableAsync(path, $@"C:\out\{Path.GetFileName(path)}");
}
// Output PDFs look identical and are now full-text indexable.
```
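For larger folders it can help to make the batch loop restartable. The following is a minimal sketch, assuming nothing about LM-Kit beyond the call shown in the sample above: the conversion step is passed in as a delegate, and `SearchableBatch` is a hypothetical helper name. It creates the output folder, skips files that already have an output, and keeps going when a single file fails.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public static class SearchableBatch
{
    // Hypothetical helper, not part of LM-Kit. `convert` turns one input PDF
    // into one output PDF, for example by wrapping MakeSearchableAsync.
    // Returns the input paths that failed to convert.
    public static async Task<IReadOnlyList<string>> RunAsync(
        string inputDir,
        string outputDir,
        Func<string, string, Task> convert)
    {
        Directory.CreateDirectory(outputDir); // no-op if it already exists
        var failed = new List<string>();

        foreach (var path in Directory.EnumerateFiles(inputDir, "*.pdf"))
        {
            var target = Path.Combine(outputDir, Path.GetFileName(path));
            if (File.Exists(target))
                continue; // already converted on a previous run

            try
            {
                await convert(path, target);
            }
            catch (Exception ex)
            {
                // Record the failure and continue with the remaining files.
                failed.Add(path);
                Console.Error.WriteLine($"Skipped {path}: {ex.Message}");
            }
        }

        return failed;
    }
}
```

With the `maker` from the sample above, a call might look like `await SearchableBatch.RunAsync(@"C:\scans", @"C:\out", (src, dst) => maker.MakeSearchableAsync(src, dst));`, assuming MakeSearchableAsync returns a Task, as the `await` in that sample suggests.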
Run a search over an existing PDF and emit a new copy with every match visually highlighted for reviewers.
```csharp
using LMKit.Document.Pdf;
using LMKit.TextAnalysis;

var engine = new SearchHighlightEngine(@"C:\contracts\msa.pdf");

// Find every mention and produce a highlighted PDF for the reviewer.
SearchHighlightResult r = await engine.HighlightAsync(
    query: "indemnification",
    output: @"C:\out\msa-highlighted.pdf");

Console.WriteLine($"Found {r.Matches.Count} matches across {r.Pages.Count} pages");
```
Merge several PDFs into a single bundle and OCR the embedded images on every page in place.
```csharp
using LMKit.Document.Pdf;
using LMKit.Extraction.Ocr; // LMKitOcr, as imported in the searchable-PDF sample

// Merge a stack of one-pagers into a single book.
var book = PdfDocument.Merge(
    @"C:\out\book.pdf",
    @"C:\pages\01.pdf",
    @"C:\pages\02.pdf",
    @"C:\pages\03.pdf");

// Extract every embedded image, OCR the ones that need it.
using var doc = new PdfDocument(@"C:\reports\annual.pdf");
var ocr = new EmbeddedImageOcr(new LMKitOcr());

foreach (var page in doc.Pages)
{
    await ocr.RunAsync(page); // updates page text in-place
}
```
The same toolkit is registered as built-in agent tools so an LLM can drive
it. Available tools include pdf_extract, pdf_merge,
pdf_split, pdf_search,
pdf_search_highlight, pdf_to_image,
pdf_unlock, pdf_metadata, pdf_pages,
image_to_pdf, eml_to_pdf, plus the conversion
family (markdown_to_pdf, markdown_to_docx,
markdown_to_html) and OCR (ocr_recognize).
Register them on any agent and let it run document workflows end-to-end.
Searchable-PDF generation runs on top of LMKitOcr or VlmOcr. Pick the engine to match accuracy / speed needs.
Markdown to PDF, HTML to Markdown, image to PDF, and the full conversion catalogue.
When a single PDF holds multiple logical documents, semantic splitting separates them by content boundary.
All pdf_* tools registered out of the box. Compose with ToolPermissionPolicy for safe agent execution.
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.
Console demo: full-document processing pipeline with built-in tools. Open on GitHub →
How-to guide: PDF + DOCX + HTML + EML through one pipeline. Read the guide →
API reference: the PDF toolkit namespace. Open the reference →