Layout-aware document search, built into the SDK.

A foundational engine that turns any page (parsed PDF, OCR'd scan, layout tree from a VLM) into a searchable, coordinate-aware index. Exact, regex, and fuzzy matching. Region, proximity, and between-anchor queries. Single or multi-page. Designed to be extremely fast and to slot into any workflow or agent tool. Always on-device.

3 search modes · 6 operators · Bounding-box accurate

Engine

LayoutSearchEngine

The single entry point. One instance, created with new, serves every mode.

Result

TextMatch

Text, snippet, score, page, bounding box. Ready for redaction or highlighting.

Input

PageElement

The same layout tree produced by PDF parsing, OCR, and VLM layout analysis.

Companion

SearchHighlightEngine

Render matches as annotated images for review UIs and audit trails.

Why a dedicated search engine

Find isn’t enough. Locate is.

string.Contains tells you a match exists. A real document workflow needs to know where: which page, which paragraph, what bounding box, with what confidence, in what surrounding context. Document Search answers those questions in microseconds per page, scales to millions of pages, and never leaves the process.

01 · Layout-aware

Coordinates, not character offsets

Every match returns the page index plus the bounding box of the matched text. Drop the box straight into a redaction step, a viewer highlight, or an audit annotation.

02 · Multi-modal input

PDF, OCR, VLM, all one tree

The engine reads PageElement trees. They come from native PDF parsing, traditional OCR, vision-language model OCR, layout extraction, or any custom source. One search API for all input modalities.

03 · Engineered for speed

Microseconds per page

Tokenisation, normalisation, and matching are tuned to run inline inside agent tools and chat loops. No background indexer to manage, no I/O round-trip, no warm-up.

04 · Multi-page native

Single call across the whole document

Every method has a single-page and an IEnumerable<PageElement> overload. Search across thousands of pages, get back ordered matches with cross-page locations.

05 · Snippet + score

Ready to render

Every TextMatch carries the matched text, a configurable context window snippet, and a normalised relevance score. Sort, filter, surface to the user, with no extra plumbing.

06 · 100% local

Documents never leave the process

No cloud search index, no per-call cost, no quota. The engine runs as an in-process method call. Air-gapped, regulated, and offline-first workloads are first-class.
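
As a rough illustration of the context-window snippet described under card 05, the sketch below cuts a fixed number of characters around a match. It is not the SDK's implementation: SnippetSketch, Build, and the "..." cut markers are hypothetical names and conventions chosen for this example. Only the idea of a configurable context window comes from this page.

```csharp
using System;

// Hypothetical helper, not part of LM-Kit.NET: cuts a context snippet around a match.
public static class SnippetSketch
{
    // Returns the matched text plus up to contextChars characters on each side.
    // A "..." marker is added on each side where surrounding text was cut off.
    public static string Build(string text, int matchStart, int matchLength, int contextChars)
    {
        if (text is null) throw new ArgumentNullException(nameof(text));
        if (matchStart < 0 || matchLength < 0 || matchStart + matchLength > text.Length)
            throw new ArgumentOutOfRangeException(nameof(matchStart));
        if (contextChars < 0) throw new ArgumentOutOfRangeException(nameof(contextChars));

        // Clamp the window to the bounds of the source text.
        int start = Math.Max(0, matchStart - contextChars);
        int end = Math.Min(text.Length, matchStart + matchLength + contextChars);

        string prefix = start > 0 ? "..." : "";
        string suffix = end < text.Length ? "..." : "";
        return prefix + text.Substring(start, end - start) + suffix;
    }
}
```

For example, SnippetSketch.Build("Total due on Invoice 4711 by March", 13, 7, 4) yields "... on Invoice 471...", while a match that spans the whole text comes back without markers.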

Three search modes

Exact, regex, or fuzzy.

The right mode depends on the question. Same engine, same result type, same coordinate guarantees.

FindText performs substring matching, case-insensitive by default (StringComparison.OrdinalIgnoreCase), with optional whole-word boundaries. Set Comparison to StringComparison.Ordinal for case-sensitive matching. Use it for invoice numbers, contract keywords, SKUs.

ExactText.cs
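using System;
using System.Collections.Generic;
using LMKit.Document.Search;

// `page` is a PageElement produced earlier by PDF parsing, OCR, or VLM layout analysis.
var engine = new LayoutSearchEngine();

// Default: case-insensitive substring search.
List<TextMatch> hits = engine.FindText(page, "Invoice");

// Whole-word, case-sensitive search, capped at 50 results,
// with a 60-character context window for each snippet.
hits = engine.FindText(page, "NDA", new TextSearchOptions
{
    Comparison   = StringComparison.Ordinal,
    WholeWord    = true,
    MaxResults   = 50,
    ContextChars = 60,
});

// Each match carries its page index, bounding box, and context snippet.
foreach (var m in hits)
{
    Console.WriteLine($"page {m.PageIndex}  {m.Bounds}  {m.Snippet}");
}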
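
Fuzzy matching is typically what absorbs OCR noise and typos. This page does not document the engine's fuzzy algorithm or its method signature, so the following is only a stand-alone sketch of one common approach, Levenshtein edit distance. FuzzySketch, Distance, and IsFuzzyMatch are hypothetical names, not LM-Kit.NET APIs.

```csharp
using System;

// Hypothetical illustration, not the engine's algorithm: edit-distance fuzzy matching.
public static class FuzzySketch
{
    // Levenshtein distance: the minimum number of single-character insertions,
    // deletions, or substitutions needed to turn a into b.
    public static int Distance(string a, string b)
    {
        // Two rolling rows of the dynamic-programming table.
        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1),
                    previous[j - 1] + cost);
            }
            // The row just computed becomes the previous row for the next character.
            (previous, current) = (current, previous);
        }
        return previous[b.Length];
    }

    // True when the word is within maxEdits of the query, ignoring case.
    public static bool IsFuzzyMatch(string word, string query, int maxEdits) =>
        Distance(word.ToUpperInvariant(), query.ToUpperInvariant()) <= maxEdits;
}
```

With this sketch, an OCR misread such as "lnvoice" is one edit away from "Invoice", so IsFuzzyMatch("lnvoice", "Invoice", 1) returns true, while an unrelated word such as "Receipt" stays outside a two-edit budget.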

Advanced operators

Region, proximity, and between.

Beyond the three base modes, the engine exposes three structural operators (region, proximity, and between-anchor) that turn search into a layout query language.

FindInRegion returns every text element inside a pixel-coordinate rectangle. Useful for invoice line items, form fields with known layout, table-cell extraction, and bounding-box-driven redaction.

InRegion.cs
// `engine` is the LayoutSearchEngine created earlier; `pages` is an IEnumerable<PageElement>.
// Pull every text element from the top-right corner of every page
// (where the page number lives on this corpus).
var pageNumberBox = new Rectangle(x: 540, y: 20, width: 60, height: 30);

var hits = engine.FindInRegion(pages, pageNumberBox, new RegionSearchOptions
{
    MaxResults = 2000,
});
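
To make the geometry of a region query concrete, here is a minimal stand-alone sketch of the underlying box test. Box, TextElement, and RegionSketch are hypothetical stand-ins rather than LM-Kit.NET types, and this page does not say whether FindInRegion requires full containment or only overlap, so the sketch exposes both as an assumption.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins, not LM-Kit.NET types: an axis-aligned box in page
// pixels and a text element that carries one.
public readonly record struct Box(double X, double Y, double Width, double Height)
{
    public double Right => X + Width;
    public double Bottom => Y + Height;

    // True when other lies entirely inside this box (edges may touch).
    public bool Contains(Box other) =>
        other.X >= X && other.Y >= Y && other.Right <= Right && other.Bottom <= Bottom;

    // True when the two boxes share any area.
    public bool Intersects(Box other) =>
        other.X < Right && X < other.Right && other.Y < Bottom && Y < other.Bottom;
}

public sealed record TextElement(string Text, int PageIndex, Box Bounds);

public static class RegionSketch
{
    // Keeps the elements whose bounds fall inside the region, in input order.
    // Set allowPartial to also keep elements that only overlap the region.
    public static List<TextElement> InRegion(
        IEnumerable<TextElement> elements, Box region, bool allowPartial = false) =>
        elements
            .Where(e => allowPartial ? region.Intersects(e.Bounds) : region.Contains(e.Bounds))
            .ToList();
}
```

Strict containment suits fixed form fields, where a stray overlapping element is usually noise. Overlap is more forgiving when OCR boxes drift by a few pixels.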

Annotated output

Render matches as evidence.

Every match carries pixel-precise coordinates. The companion SearchHighlightEngine turns a list of matches into a rendered annotated image, ready for a review UI, an audit trail, or a customer-facing explanation of where an answer came from.

HighlightMatches.cs
using LMKit.Document.Search;

// Find PII in a scanned page, render highlights, save the annotated image.
var hits = engine.FindRegex(page, @"\b\d{3}-\d{2}-\d{4}\b"); // SSN-like

var result = await SearchHighlightEngine.HighlightAsync(
    page,
    hits,
    new SearchHighlightOptions
    {
        Appearance = new HighlightAppearance
        {
            FillColor   = Color.FromArgb(80, Color.Yellow),
            StrokeColor = Color.OrangeRed,
            StrokeWidth = 2,
        },
    });

await result.SaveAsync(@"D:\out\redacted-preview.png");

When you need to go further

Lexical search is one layer. Semantic search is the next.

Exact, regex, and fuzzy answer the question "where does this string appear?". They do not answer "where does this concept appear?". For paraphrases, synonyms, multilingual queries, and intent-driven retrieval, the same SDK exposes an embedding + vector-search layer that composes directly with everything above. Document Search is where most workflows start; RAG & Knowledge is where they grow.

Mix freely. Layout-aware search and semantic retrieval are not alternatives; they are complementary. A common pattern: use LayoutSearchEngine to locate anchors with bounding boxes, then expand to DocumentRag to retrieve conceptually similar passages, then ground the answer with both coordinate and citation provenance. Every layer above can run inside the same process, on-device, with no external service.

Where it ships

Workloads that lean on layout-aware search.

Compliance & e-discovery

Scan thousands of pages for sensitive terms, get back every hit with page number and bounding box, produce an audit-ready PDF with highlighted evidence. Works on air-gapped legal review machines.

Invoice and form processing

Find labels by anchor query, find values by proximity. No fragile template needed; the same code handles invoices from a hundred vendors with different layouts.

Contract review

Pull the body of a clause with FindBetween, scan for prohibited language with FindRegex, render highlighted versions for legal review. Works offline on a workstation.

RAG explanation layer

When the RAG model cites a passage, search locates it on the original page. The viewer renders an annotated image so users can see the provenance.

Agent toolbox

Expose search as a tool. The agent can find clauses, count occurrences across pages, locate anchors, all in microseconds, all without sending document content over the network.

Document QA pipelines

Run thousands of pages through OCR + search + highlight to produce dashboards of detected entities, dates, and amounts. Per-page latency stays in the millisecond range.
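
Two of the workloads above, invoice processing and contract review, lean on anchor-based operators. The sketch below shows the general idea of "value next to a label" and "text between two anchors" on plain data. Word and AnchorSketch are hypothetical names, and the logic is an assumption about how such operators can work, not the behavior of the SDK's proximity operator or FindBetween.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in, not an LM-Kit.NET type: one word with the centre of its box.
public sealed record Word(string Text, double CenterX, double CenterY);

public static class AnchorSketch
{
    // Proximity: the closest word to the right of the anchor on roughly the same
    // line (vertical offset within lineTolerance), or null when there is none.
    public static Word? NearestRightOf(IEnumerable<Word> words, Word anchor, double lineTolerance) =>
        words
            .Where(w => w.CenterX > anchor.CenterX
                     && Math.Abs(w.CenterY - anchor.CenterY) <= lineTolerance)
            .OrderBy(w => w.CenterX - anchor.CenterX)
            .FirstOrDefault();

    // Between: the words that follow the first startAnchor and precede the next
    // endAnchor in reading order. Empty when either anchor is missing.
    public static List<Word> Between(IReadOnlyList<Word> readingOrder, string startAnchor, string endAnchor)
    {
        int start = IndexOf(readingOrder, startAnchor, 0);
        if (start < 0) return new List<Word>();
        int end = IndexOf(readingOrder, endAnchor, start + 1);
        if (end < 0) return new List<Word>();
        return readingOrder.Skip(start + 1).Take(end - start - 1).ToList();
    }

    // Case-insensitive search for a word's text, starting at index from.
    private static int IndexOf(IReadOnlyList<Word> words, string text, int from)
    {
        for (int i = from; i < words.Count; i++)
            if (string.Equals(words[i].Text, text, StringComparison.OrdinalIgnoreCase))
                return i;
        return -1;
    }
}
```

Because the lookup keys on the label's position rather than on fixed page coordinates, the same call can follow a "Total:" label wherever a given vendor's layout happens to place it.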

LM-Kit.NET pillars

Seven pillars, one foundation.

The seven pillars of LM-Kit.NET, plus the local runtime they share.

The foundation

Every capability above runs on this runtime.

Foundation

Local Inference

The runtime all seven pillars sit on. The LM-Kit.NET NuGet ships the complete inference system: open-weight LLMs, vision-language models, embeddings, on-device speech-to-text, OCR and classifiers, accelerated on CPU, AVX2, CUDA 12/13, Vulkan or Metal. One package, zero cloud calls, predictable latency, full data and technology sovereignty.

Document search without an indexer.

Three modes, six operators, bounding-box coordinates, microsecond latency, zero cloud calls. Embed it in the agent tool, the chat loop, the redaction pipeline, the RAG explainer.
