LayoutSearchEngine
The single entry point. One new, all modes.
A foundational engine that turns any page (parsed PDF, OCR'd scan, layout tree from a VLM) into a searchable coordinate-aware index. Exact, regex, and fuzzy matching. Region, proximity, and between-anchor queries. Single or multi-page. Designed to be extremely fast and to slot into any workflow or agent tool. Always on-device.
LayoutSearchEngine: the single entry point. One new, all modes.
TextMatch: text, snippet, score, page, and bounding box. Ready for redaction or highlighting.
PageElement: the same layout tree produced by PDF parsing, OCR, and VLM layout analysis.
SearchHighlightEngine: render matches as annotated images for review UIs and audit trails.
string.Contains tells you a match exists. A real
document workflow needs to know where: which page,
which paragraph, what bounding box, with what confidence, in
what surrounding context. Document Search answers those
questions in microseconds, across millions of pages, without
leaving the process.
01 · Layout-aware
Every match returns the page index plus the bounding box of the matched text. Drop the box straight into a redaction step, a viewer highlight, or an audit annotation.
02 · Multi-modal input
The engine reads PageElement trees. They come from native PDF parsing, traditional OCR, vision-language model OCR, layout extraction, or any custom source. One search API for all input modalities.
03 · Engineered for speed
Tokenisation, normalisation, and matching are tuned to run inline inside agent tools and chat loops. No background indexer to manage, no I/O round-trip, no warm-up.
04 · Multi-page native
Every method has a single-page and an IEnumerable<PageElement> overload. Search across thousands of pages, get back ordered matches with cross-page locations.
05 · Snippet + score
Every TextMatch carries the matched text, a configurable context window snippet, and a normalised relevance score. Sort, filter, surface to the user, with no extra plumbing.
06 · 100% local
No cloud search index, no per-call cost, no quota. The engine works inside an in-process method call. Air-gapped, regulated, and offline-first workloads are first-class.
The right mode depends on the question. Same engine, same result type, same coordinate guarantees. Pick a tab for the signature you need.
FindText performs substring matching with optional
case-insensitivity and whole-word boundaries. The default is
OrdinalIgnoreCase; flip to Ordinal for
case-sensitive matching. Use it for invoice numbers, contract
keywords, SKUs.
using LMKit.Document.Search;

var engine = new LayoutSearchEngine();

// Default: case-insensitive substring search.
List<TextMatch> hits = engine.FindText(page, "Invoice");

// Whole-word, case-sensitive search.
hits = engine.FindText(page, "NDA", new TextSearchOptions
{
    Comparison = StringComparison.Ordinal,
    WholeWord = true,
    MaxResults = 50,
    ContextChars = 60,
});

foreach (var m in hits)
{
    Console.WriteLine($"page {m.PageIndex} {m.Bounds} {m.Snippet}");
}
FindRegex accepts any .NET regex. Useful for
structured patterns: dates, monetary amounts, identifiers,
case numbers, IBANs. The engine resolves every match against
the layout tree so even multi-character regex matches return
a single contiguous bounding box.
// Find every monetary amount on every page in the document.
var money = engine.FindRegex(
    pages,
    @"\$\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?");

// Find ISO dates (YYYY-MM-DD) anywhere in the contract.
var dates = engine.FindRegex(
    contract,
    @"\b\d{4}-\d{2}-\d{2}\b",
    new RegexSearchOptions { MaxResults = 1000 });
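Because these are plain .NET regexes, the two patterns can be sanity-checked with System.Text.RegularExpressions alone before being handed to the engine. A self-contained sketch (no LMKit types involved; the sample strings are made up):

```csharp
using System;
using System.Text.RegularExpressions;

// The same two patterns, sanity-checked against sample strings.
var money   = new Regex(@"\$\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?");
var isoDate = new Regex(@"\b\d{4}-\d{2}-\d{2}\b");

Console.WriteLine(money.Match("Total due: $1,234.56").Value);    // $1,234.56
Console.WriteLine(isoDate.Match("Signed on 2024-03-15.").Value); // 2024-03-15
```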
FindFuzzy tolerates OCR errors, typos, and
layout-induced whitespace drift. Set MinScore
to control strictness and MaxEditDistance to
bound the search. Returns a normalised similarity score
with every match so you can sort and threshold.
// "Pharmaceuticals" with up to 2 OCR-style typos.
var hits = engine.FindFuzzy(page, "Pharmaceuticals", new FuzzySearchOptions
{
    MaxEditDistance = 2,
    MinScore = 0.80,
    TokenAware = true,
    MaxResults = 100,
});

foreach (var m in hits.OrderByDescending(m => m.Score))
{
    Console.WriteLine($"score {m.Score:F2} text \"{m.Text}\"");
}
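To build intuition for what a MinScore of 0.80 roughly demands, here is one common normalisation of edit distance into a [0, 1] score: 1 minus the distance divided by the longer string's length. This is a hypothetical sketch of the idea, not LM-Kit's documented scoring formula:

```csharp
using System;

// Plain Levenshtein distance plus one common normalisation:
//   score = 1 - distance / length of the longer string.
// Hypothetical sketch - LM-Kit's exact scoring formula is not documented here.
static int EditDistance(string a, string b)
{
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
    for (int j = 0; j <= b.Length; j++) d[0, j] = j;
    for (int i = 1; i <= a.Length; i++)
        for (int j = 1; j <= b.Length; j++)
            d[i, j] = Math.Min(
                Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
    return d[a.Length, b.Length];
}

static double Score(string query, string candidate) =>
    1.0 - (double)EditDistance(query, candidate)
        / Math.Max(query.Length, candidate.Length);

// "m" misread as "rn" is a classic OCR confusion: distance 2, score 0.875.
Console.WriteLine(Score("Pharmaceuticals", "Pharrnaceuticals")); // 0.875
```

With this normalisation, MinScore = 0.80 on a 15-character query tolerates up to three single-character edits.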
Beyond the three base modes, the engine exposes three structural operators that turn search into a layout query language. Pick a tab.
FindInRegion returns every text element inside a
pixel-coordinate rectangle. Useful for invoice line items,
form fields with known layout, table-cell extraction, and
bounding-box-driven redaction.
// Pull every text element from the top-right corner of every page
// (where the page number lives on this corpus).
var pageNumberBox = new Rectangle(x: 540, y: 20, width: 60, height: 30);

var hits = engine.FindInRegion(pages, pageNumberBox, new RegionSearchOptions
{
    MaxResults = 2000,
});
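The core of any region query is a rectangle hit-test. The sketch below shows that logic with System.Drawing.Rectangle and made-up element boxes; whether FindInRegion requires full containment or any overlap is an engine detail not specified on this page:

```csharp
using System;
using System.Drawing;

// Region hit-testing sketch: keep elements whose bounding box overlaps
// the query rectangle. (Containment vs. overlap is an engine detail
// not shown here; this sketch uses simple overlap.)
var query = new Rectangle(x: 540, y: 20, width: 60, height: 30);

var candidates = new[]
{
    new Rectangle(550, 25, 30, 15),   // page number - inside the box
    new Rectangle(100, 400, 200, 40), // body paragraph - far away
};

foreach (var b in candidates)
    Console.WriteLine($"{b} overlaps query: {query.IntersectsWith(b)}");
```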
FindNear finds matches within a proximity radius
of an anchor query. The radius is expressed as a fraction of
page size, so the same call works across A4, US Letter, and
scanned variants without rewriting coordinates.
// Find every dollar amount within 5% of the page diagonal from the
// word "Total" - across every page in the document.
var hits = engine.FindNear(pages, "Total", new ProximityOptions
{
    Radius = 0.05, // 5% of page size
    MatchPattern = @"\$[\d,\.]+",
    MaxResults = 200,
});
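To see why a fractional radius survives changes of paper size and scan resolution, here is the conversion to pixels for two concrete pages, assuming the fraction applies to the page diagonal (as the comment above suggests; the 150 DPI page sizes are illustrative):

```csharp
using System;

// Convert a fractional Radius into pixels for a concrete page, assuming
// the fraction applies to the page diagonal. Page sizes are examples:
// A4 (1240 x 1754 px) and US Letter (1275 x 1650 px) scanned at 150 DPI.
static double RadiusInPixels(double fraction, double widthPx, double heightPx) =>
    fraction * Math.Sqrt(widthPx * widthPx + heightPx * heightPx);

Console.WriteLine($"A4:     {RadiusInPixels(0.05, 1240, 1754):F0} px"); // 107 px
Console.WriteLine($"Letter: {RadiusInPixels(0.05, 1275, 1650):F0} px"); // 104 px
```

The same Radius = 0.05 lands within a few pixels on both page formats, which is exactly why the call needs no per-format coordinate rewriting.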
FindBetween returns the text that sits between
two anchors. Perfect for extracting the body of a section,
the contents of a labeled box, or any "from this heading to
the next" pattern, with layout preserved.
// Pull the body of the "Indemnification" clause out of a contract.
var clause = engine.FindBetween(
    contract,
    startQuery: "Indemnification",
    endQuery: "Termination",
    new BetweenOptions { IncludeAnchors = false });

Console.WriteLine(clause[0].Text);
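Conceptually, between-anchor extraction is a slice from the end of the start anchor to the beginning of the end anchor. The real engine resolves both anchors against the layout tree and keeps coordinates; this flat-string sketch (plain C#, no LMKit types, made-up contract text) only illustrates the slicing logic behind IncludeAnchors = false:

```csharp
using System;

// Flat-string sketch of between-anchor extraction (IncludeAnchors = false).
// The real engine works on the layout tree and returns coordinates;
// this only shows the slicing idea on plain text.
static string Between(string text, string start, string end)
{
    int s = text.IndexOf(start, StringComparison.OrdinalIgnoreCase);
    if (s < 0) return "";
    s += start.Length;
    int e = text.IndexOf(end, s, StringComparison.OrdinalIgnoreCase);
    string body = e < 0 ? text[s..] : text[s..e];
    return body.Trim(' ', '.', ':');
}

var contract = "Indemnification. Each party shall indemnify the other. " +
               "Termination. Either party may terminate on 30 days notice.";
Console.WriteLine(Between(contract, "Indemnification", "Termination"));
// -> Each party shall indemnify the other
```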
Every match carries pixel-precise coordinates. The companion
SearchHighlightEngine turns a list of matches into
a rendered annotated image, ready for a review UI, an audit
trail, or a customer-facing explanation of where an answer came
from.
using LMKit.Document.Search;

// Find PII in a scanned page, render highlights, save the annotated image.
var hits = engine.FindRegex(page, @"\b\d{3}-\d{2}-\d{4}\b"); // SSN-like

var result = await SearchHighlightEngine.HighlightAsync(
    page, hits,
    new SearchHighlightOptions
    {
        Appearance = new HighlightAppearance
        {
            FillColor = Color.FromArgb(80, Color.Yellow),
            StrokeColor = Color.OrangeRed,
            StrokeWidth = 2,
        },
    });

await result.SaveAsync(@"D:\out\redacted-preview.png");
Exact, regex, and fuzzy answer the question "where does this string appear?". They do not answer "where does this concept appear?". For paraphrases, synonyms, multilingual queries, and intent-driven retrieval, the same SDK exposes an embedding + vector-search layer that composes directly with everything above. Document Search is where most workflows start; RAG & Knowledge is where they grow.
Layer 1
Turn any text passage (or image) into a high-dimensional vector with the Embedder class. Text and image vectors share a space via Nomic Embed Vision so cross-modal queries work out of the box.
Layer 2
Store vectors in the built-in FileSystemVectorStore, an in-memory store, or a Qdrant connector via the IVectorStore contract. Cosine, dot, and Euclidean similarity are first-class.
Layer 3
HybridRetrievalStrategy fuses dense embeddings with lexical BM25 (Bm25RetrievalStrategy) and reranks via RagReranker. Recall from BM25, precision from vectors, ranking from a dedicated reranker model.
Layer 4
When the user's wording is far from the corpus, HydeRetriever (Hypothetical Document Embeddings) and MultiQueryRetriever generate alternative queries with the LLM, retrieve for each, and merge results.
Layer 5
DocumentRag wraps chunking (TextChunking, HtmlChunking, MarkdownChunking), embedding, retrieval, and source attribution into a single high-level engine. Hand it a folder, get cited answers.
Layer 6
PdfChat and RagChat compose every layer above into a multi-turn conversation primitive. Streaming answers, multi-turn memory, source-attributed citations, all in one class.
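To make "recall from BM25, precision from vectors" concrete, here is reciprocal-rank fusion (RRF), a standard scheme for merging a lexical ranking with a dense-vector ranking, in plain C#. This is illustrative only and is not claimed to be HybridRetrievalStrategy's actual formula; the document IDs are made up:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Reciprocal-rank fusion (RRF): merge two rankings with
//   score(doc) = sum over rankings of 1 / (k + rank),  k = 60 by convention.
// Illustrative only - not claimed to be HybridRetrievalStrategy's formula.
static Dictionary<string, double> Fuse(List<string> lexical, List<string> dense, int k = 60)
{
    var scores = new Dictionary<string, double>();
    foreach (var ranking in new[] { lexical, dense })
        for (int rank = 1; rank <= ranking.Count; rank++)
            scores[ranking[rank - 1]] =
                scores.GetValueOrDefault(ranking[rank - 1]) + 1.0 / (k + rank);
    return scores;
}

var fused = Fuse(
    lexical: new List<string> { "docA", "docB", "docC" },  // BM25 order
    dense:   new List<string> { "docB", "docD", "docA" }); // vector order

// docB wins: it ranks high in both lists.
foreach (var (id, score) in fused.OrderByDescending(p => p.Value))
    Console.WriteLine($"{id} {score:F4}");
```

A document that appears near the top of both lists beats one that tops only a single list, which is the behaviour a hybrid strategy is after.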
Mix freely.
Layout-aware search and semantic retrieval are not alternatives,
they are complementary. A common pattern: use
LayoutSearchEngine to locate anchors with bounding
boxes, then expand to DocumentRag to retrieve
conceptually similar passages, then ground the answer with both
coordinate and citation provenance. Every layer above can run
inside the same process, on-device, with no external service.
The same engine is wired across the SDK. Wherever
PageElement shows up, search shows up next to it.
OCR & VLM-OCR
Both LMKitOcr and VlmOcr return PageElement trees. Feed them directly into LayoutSearchEngine; coordinates from OCR become coordinates in the match.
Layout Understanding
Layout analysis tags paragraphs, headings, tables, footnotes. Combine with region or between-anchor search to query by structural intent ("find the price inside the first table").
Document RAG
RAG retrieves a passage. LayoutSearchEngine locates that passage in the source page with a bounding box. The result: every answer can show you exactly where it came from.
Agents
Search runs in microseconds and is safe to expose as an agent tool. An agent can locate clauses, scan for PII, find numbers near labels, all as in-process tool calls with no external service.
PII & redaction
Regex and fuzzy modes find sensitive content; bounding boxes drive the redaction step. SearchHighlightEngine can render before-and-after evidence for compliance review.
Extraction
Find an anchor ("Invoice number"), search near it ("alphanumeric token within 5% of the page"), feed the result back into a schema-constrained extractor. Robust against template drift.
Scan thousands of pages for sensitive terms, get back every hit with page number and bounding box, produce an audit-ready PDF with highlighted evidence. Works on air-gapped legal review machines.
Find labels by anchor query, find values by proximity. No fragile template needed; the same code handles invoices from a hundred vendors with different layouts.
Pull the body of a clause with FindBetween, scan for prohibited language with FindRegex, render highlighted versions for legal review. Works offline on a workstation.
When the RAG model cites a passage, search locates it on the original page. The viewer renders an annotated image so users can see the provenance.
Expose search as a tool. The agent can find clauses, count occurrences across pages, locate anchors, all in microseconds, all without sending document content over the network.
Run thousands of pages through OCR + search + highlight to produce dashboards of detected entities, dates, and amounts. Per-page latency stays in the millisecond range.
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.
How-to guide: load a document, run text / regex / fuzzy queries, and get coordinates, end to end. Read the guide →
How-to guide: combine layout extraction with structural queries (find inside a region, between anchors). Read the guide →
How-to guide: OCR a scanned page into a PageElement tree, then run any search query on the result. Read the guide →
API reference: the search engine's full surface (FindText, FindRegex, FindFuzzy, FindInRegion, FindNear, FindBetween). Open the reference →
The seven pillars of LM-Kit.NET, plus the local runtime they share. The highlighted card is where you are now.
01 · AI Agents
ReAct planning, supervisors, parallel and pipeline orchestrators, persistent memory, MCP clients, custom tools.
02 · Document Intelligence
PDF text and table extraction, on-device OCR reaching SOTA benchmark scores, structured field extraction with grammar-constrained generation.
03 · Vision & Multimodal
Image understanding, classification, labeling, multimodal chat, image embeddings, VLM-OCR, background removal. Same conversation surface as LLMs.
04 · RAG & Knowledge
Built-in vector store, Qdrant connector, embeddings, hybrid retrieval, document chunking, source citations.
05 · Text Analysis
Built-in classifiers and an extractor that emits typed C# objects via grammar-constrained sampling. Sentiment, keywords, language detection.
06 · Speech & Audio
A growing local speech-to-text stack: hallucination suppression, Voice Activity Detection, real-time translation, streaming output, 100+ languages.
07 · Text Generation
Single-turn, multi-turn, and stateless conversation primitives. Translate, correct, rewrite, summarise. Prompt templates, streaming, grammar-constrained outputs.
The foundation
Every capability above runs on this runtime.
Foundation
The runtime all seven pillars sit on. The LM-Kit.NET NuGet ships the complete inference system: open-weight LLMs, vision-language models, embeddings, on-device speech-to-text, OCR, and classifiers, accelerated on CPU (AVX2), CUDA 12/13, Vulkan, or Metal. One package, zero cloud calls, predictable latency, full data and technology sovereignty.
Three modes, six operators, bounding-box coordinates, microsecond latency, zero cloud calls. Embed it in the agent tool, the chat loop, the redaction pipeline, the RAG explainer.