
The retrieval engine under everything.

DocumentRag is the foundation under every document-aware capability in LM-Kit.NET: PDF chat, RAG chat, multi-format ingestion, document Q&A. It pairs an adaptive page-by-page processing pipeline with advanced query strategies, vision-grounded retrieval, layout-aware chunking, pluggable vector stores, and full source attribution. Use the turn-key wrappers when defaults are right; reach for DocumentRag directly when production needs more.

3 processing modes · 5 query strategies · 4 vector store backends
Stage 01

Document import

PDF, DOCX, PPTX, XLSX, HTML, EML, MBOX, images, raw text. Per-document metadata.

Stage 02 · Adaptive

Per-page processing

Auto-selects text extraction, vision OCR, or document understanding per page.

Stage 03

Layout-aware chunking

Chunks aligned to paragraphs, headings, table boundaries. Not arbitrary token windows.

Stage 04

Strategic retrieval

Multi-query expansion, hypothetical answer, contextual recall, MMR, reranking.

Why DocumentRag

RAG that knows about documents.

Generic RAG treats every document as a flat string, chunks it by token count, embeds the chunks, and hopes for the best. Real corpora punish that approach: scanned pages have no text layer, financial reports hide values in tables, contracts have hierarchical structure, scientific papers use multi-column layouts. DocumentRag is built for those documents specifically: page-level adaptive processing, layout-aware segmentation, vision grounding when the layout demands it, and source attribution down to page and passage.

Format-aware ingestion

PDF, DOCX, PPTX, XLSX, HTML, EML, MBOX, images. Each format routes through a converter tuned for it. The pipeline does not assume PDF.

Per-page strategy

A 200-page mixed PDF gets text extraction on the digital pages, vision OCR on the scanned ones, and document understanding on the layout-heavy ones. One ingestion call, three strategies.

Layout-aware chunking

Chunk boundaries align to paragraphs, headings, list items, and table edges, not to arbitrary token positions. Retrieval matches semantic units, not random spans.

Strategic retrieval

Multi-query expansion, hypothetical-answer retrieval, contextual recall, MMR diversification, reranking. Pick a strategy per workload; mix and match in custom pipelines.

Vision-grounded queries

Inject the original page image into the model's context for the matched chunks. The LLM reads both the text and the visual layout, critical for charts and complex tables.

Source attribution by design

Every result carries the document name, page number, source URI, similarity score, and the constituent passages. Compliance, audit trails, and citations come free.

Three ways in

Pick the right abstraction.

The retrieval stack ships at three abstraction levels. Use the highest one whose defaults fit your workload; drop down only when you need control.

Highest level

PdfChat & RagChat

Turn-key conversational Q&A. Multi-turn memory, automatic strategy selection, source attribution preserved. Best when you want to ship a chat surface in an afternoon.

Mid level

DocumentRag

The full pipeline with full control: processing modes, query strategies, chunking, vector stores, lifecycle, events. The layer this page documents; the foundation the wrappers build on.

Lowest level

RagEngine

Generic text-based retrieval over arbitrary content (not document-shaped). Useful when the input is already pre-processed text and you want raw indexing and search.

Processing modes

Adaptive page-by-page processing.

Configure how each page is analysed. Auto routes intelligently per page based on content; the explicit modes pin the strategy when you want predictable behaviour. The enum is PageProcessingMode.

Mode

Auto

Inspects each page and routes it to TextExtraction or DocumentUnderstanding based on its content. One ingestion call handles mixed corpora without configuration.

PageProcessingMode.Auto

Mode

TextExtraction

Extracts text directly from the document structure. Uses the configured OCR engine for pages that require it (scanned images, image-based PDFs). Fastest path; lowest resource usage; ideal for clean digital documents.

PageProcessingMode.TextExtraction

Mode

DocumentUnderstanding

Uses a vision parser (VLM) to analyse page images visually. Produces Markdown-formatted output that preserves document structure including tables, headings, and layout. Best for complex layouts, forms, and mixed content.

PageProcessingMode.DocumentUnderstanding
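
Pinning a mode is a one-property change. A minimal sketch that reuses the surface from the full example further down this page; the model ID and store path are placeholders:

PinProcessingMode.cs (sketch)
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Data.Storage;

// Placeholder model ID and store path; constructor and properties
// mirror the BasicIngestAndQuery example below.
var embedModel = LM.LoadFromModelID("embeddinggemma-300m");
var store      = new FileSystemVectorStore("./embeddings");

var rag = new DocumentRag(embedModel, store)
{
    // Every page goes through the vision parser and comes back as
    // structure-preserving Markdown; no per-page routing.
    ProcessingMode = PageProcessingMode.DocumentUnderstanding,
};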

Query strategies

Five ways to find the right passage.

Naive cosine similarity over a single embedded query is rarely the best retrieval strategy. DocumentRag ships with several advanced strategies, switchable per call. Pick the one that matches the question shape; combine for hybrid pipelines.

Strategy

Direct similarity

Embed the query, search the store, return the top-k passages. Fastest; works well when the question vocabulary closely mirrors the corpus.

Strategy

Multi-query expansion

The model rewrites the query into several semantically equivalent variants. Each variant is searched independently; results are merged and de-duplicated. Recovers passages that direct similarity misses due to vocabulary drift.

Strategy

Hypothetical answer

The model generates a plausible answer to the query, then searches for passages similar to that answer. Particularly strong on questions where the answer vocabulary is far from the question vocabulary.

Strategy

Contextual

Expand each query with conversation history before searching. The retrieval respects multi-turn context: a follow-up question retrieves passages relevant to the preceding turn, not just the literal text.

Strategy

MMR diversification

Maximal Marginal Relevance balances similarity with diversity. Prevents top-k from collapsing to near-duplicates; surfaces multiple distinct sources when the corpus has redundancy.

Refinement

Reranking

Run a second pass on the top-k candidates with a more expensive scorer. Significantly improves precision on the final ordering; layered on top of any base strategy.
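
The per-call selection surface is not spelled out on this page. Purely as an illustrative sketch, continuing from the rag and chat instances in the full example below: assume a hypothetical RagQueryStrategy enum and strategy parameter. Only QueryPartitionsAsync itself appears in that example; the real names live in the API reference.

StrategyPerCall.cs (sketch)
// Hypothetical: RagQueryStrategy and the 'strategy' parameter are
// assumed names, not confirmed LM-Kit.NET API. The documented intent:
// switch retrieval strategy per call, per workload.
var broad = await rag.QueryPartitionsAsync(
    "What drove the change in operating margin?",
    chat,
    strategy: RagQueryStrategy.MultiQueryExpansion);

var diverse = await rag.QueryPartitionsAsync(
    "List every risk factor mentioned across the filings.",
    chat,
    strategy: RagQueryStrategy.MmrDiversification);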

Vision-grounded retrieval

When the layout is the answer.

Some questions cannot be answered from text alone. The chart on page 14 that shows quarterly revenue. The table on page 22 with three column spans. The diagram in a technical manual. DocumentRag can inject the original page renderings alongside the matched text passages so the language model reads both.

Charts

"What was the revenue trend across Q1 to Q4?" pulls the bar-chart page into context. The model reads the chart, not a paraphrase of it.

Complex tables

Multi-row headers, merged cells, footnoted values. The visual layout carries information that flat text extraction loses.

Diagrams & schematics

Technical documents, architecture diagrams, flowcharts. The model gets the picture, not just the labels.

Mixed content pages

Pages where text and image must be read together. Magazine layouts, financial dashboards, product brochures.
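
No configuration surface for vision grounding is shown on this page. Purely as an illustrative sketch, continuing from the rag and chat instances in the full example below, assume a hypothetical per-query flag; the real option name lives in the API reference:

VisionGroundedQuery.cs (sketch)
// Hypothetical: 'injectPageImages' is an assumed parameter name, not
// confirmed API. It stands in for the documented behaviour: matched
// chunks enter the context together with their original page renderings.
var result = await rag.QueryPartitionsAsync(
    "What was the revenue trend across Q1 to Q4?",
    chat,
    injectPageImages: true);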

Chunking & attribution

Two foundations most stacks skip.

Layout-aware

Chunking that respects structure

Chunks align to paragraphs, headings, list items, and table edges. The deterministic layout-understanding pipeline drives boundaries; MaxChunkSize caps length without splitting structural units mid-sentence. Markdown chunking for vision output; text chunking for text-extraction output.
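
MaxChunkSize is the knob shown on this page. A minimal sketch, reusing embedModel and store from the full example below, with the value taken from that example:

ChunkSizeCap.cs (sketch)
// Cap chunk length; boundaries still align to paragraphs, headings,
// list items, and table edges rather than raw positions.
var rag = new DocumentRag(embedModel, store)
{
    MaxChunkSize = 512,
};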

Attributed

Source attribution by construction

Every retrieved partition carries a DocumentReference: document name, page number, source URI, similarity score, custom metadata. The query API returns both the response and the references that fed it. Compliance, audit, and citation requirements covered without retrofitting.
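
Turning references into citations is a few lines of string building. A sketch, continuing from the result produced in the full example below; SimilarityScore, Name, and PageNumber all appear there:

CitationsFooter.cs (sketch)
// Emit a Markdown "Sources" footer from the attributed references.
var sb = new System.Text.StringBuilder("**Sources**\n");
foreach (var r in result.SourceReferences)
{
    sb.AppendLine($"- {r.Name}, p. {r.PageNumber} ({r.SimilarityScore:P0} match)");
}
Console.WriteLine(sb.ToString());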

Vector storage

Pluggable store architecture.

Swap backends without rewriting code. The same DocumentRag pipeline runs across every store type.

From prototype to production

The IVectorStore interface decouples retrieval logic from storage implementation. Prototype with in-memory storage, iterate with file-based persistence, deploy with the built-in high-performance store, scale out to a third-party connector, or write your own backend for a proprietary system. Application code stays the same.

  • Zero setup for local development
  • Durable, portable file-based storage
  • Built-in high-performance index for production
  • Connectors for external vector databases
  • Custom backends via IVectorStore
  • Incremental updates without index rebuild
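
Because the pipeline depends only on the contract, the backend is a constructor argument. A minimal sketch; FileSystemVectorStore comes from the example below, and typing the variable as IVectorStore assumes the constructor accepts the interface, which is what the decoupling described above implies:

StoreSwap.cs (sketch)
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Data.Storage;

// Any IVectorStore implementation (built-in, connector, or custom)
// slots in the same way; retrieval logic never sees storage details.
var embedModel = LM.LoadFromModelID("embeddinggemma-300m");
IVectorStore store = new FileSystemVectorStore("./embeddings");
var rag = new DocumentRag(embedModel, store);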
In-memory

Fast prototyping, live testing, instant feedback

No setup
Built-in vector DB

File-based, handles millions of vectors, no server needed

Recommended
External connector

Distributed scaling for hundreds of millions of vectors

Production
Custom backend

Implement IVectorStore for proprietary systems

Flexible
Lifecycle & events

Observable at every phase.

Document ingestion is not an opaque call. DocumentRag emits progress events for every phase the pipeline goes through, with enough metadata to drive UIs, dashboards, and telemetry without polling.

Phase

PageProcessingStarted

Fires before each page enters the processing pipeline. Carries page index, total page count, planned strategy, source name. Use it to render a per-page progress bar.

Phase

PageProcessingCompleted

Fires after each page completes. Carries elapsed time, strategy actually used (Auto resolves to TextExtraction or DocumentUnderstanding here), generated token count, optional warnings.

Phase

EmbeddingStarted

Fires when the chunk-and-embed phase begins for the document. Useful for switching the UI from "processing" to "indexing" status.

Phase

EmbeddingCompleted

Fires when all chunks have been embedded and persisted. Indicates the document is queryable. Carries final partition count.
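
Wiring the events to console progress takes a handler each. A sketch against the rag instance from the example below; the four event names are documented above, while the argument property names are assumptions based on the metadata each event is described as carrying:

ProgressEvents.cs (sketch)
// Event names are documented; the e.* property names (PageIndex,
// PageCount, Strategy, Elapsed, PartitionCount) are assumed, not
// confirmed API.
rag.PageProcessingStarted += (sender, e) =>
    Console.WriteLine($"page {e.PageIndex + 1}/{e.PageCount}: {e.Strategy}");

rag.PageProcessingCompleted += (sender, e) =>
    Console.WriteLine($"page {e.PageIndex + 1}/{e.PageCount} done in {e.Elapsed}");

rag.EmbeddingStarted += (sender, e) =>
    Console.WriteLine("indexing...");

rag.EmbeddingCompleted += (sender, e) =>
    Console.WriteLine($"queryable: {e.PartitionCount} partitions");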

Lifecycle

Import / delete by ID

Documents carry an explicit ID for lifecycle management. DeleteDocumentAsync removes a document and all its partitions. ImportDocumentAsync on the same ID re-indexes.

Lifecycle

Cancellation tokens

Every async method accepts a CancellationToken. Long-running ingestions cancel cleanly at the next page boundary, releasing resources without leaving partial state behind.
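
Re-indexing and cancellation, sketched against the call shapes of the example below; the cancellationToken parameter name and passing the ID string to DeleteDocumentAsync are assumptions:

LifecycleById.cs (sketch)
// Cancel automatically after five minutes; ingestion stops cleanly
// at the next page boundary.
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(5));

// Importing on an existing ID re-indexes the document in place.
await rag.ImportDocumentAsync(
    Attachment.FromFile("financial-report-v2.pdf"),
    new("fin-q4-2024") { Name = "Q4 financial report (rev. 2)" },
    dataSource: "finance",
    cancellationToken: cts.Token);

// Removal drops the document and every partition derived from it.
// Passing the ID string directly is an assumption about the overload.
await rag.DeleteDocumentAsync("fin-q4-2024");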

Real pipelines

A working example, end to end.

Ingest a PDF into a file-based vector store, then query with source attribution including page numbers.

BasicIngestAndQuery.cs
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Data.Storage;
using LMKit.TextGeneration;

// Models. Chat and embedding models load independently and can differ in size.
var chatModel  = LM.LoadFromModelID("qwen3.5:4b");
var embedModel = LM.LoadFromModelID("embeddinggemma-300m");

// File-based vector store. Persists across runs, no server.
var store = new FileSystemVectorStore("./embeddings");

var rag = new DocumentRag(embedModel, store)
{
    ProcessingMode = PageProcessingMode.Auto,
    MaxChunkSize   = 512,
};

// Import with an explicit ID for lifecycle management.
await rag.ImportDocumentAsync(
    Attachment.FromFile("financial-report.pdf"),
    new("fin-q4-2024") { Name = "Q4 financial report" },
    dataSource: "finance");

// Query with source attribution.
var chat   = new SingleTurnConversation(chatModel);
var result = await rag.QueryPartitionsAsync("What was Q4 revenue?", chat);

Console.WriteLine(result.Response.Completion);
foreach (var r in result.SourceReferences)
{
    Console.WriteLine($"  [{r.SimilarityScore:P0}] {r.Name}, page {r.PageNumber}");
}
Where DocumentRag ships

Real workloads.

Enterprise document search

Index a corpus of contracts, reports, and policies. Answer freeform questions with citations to document name and page number. Source attribution is the audit trail.

Compliance & audit Q&A

Lifecycle by ID: documents are added, updated, deleted as the policy archive evolves. Every answer cites its sources; deleted documents drop from results immediately.

Financial & scientific reports

Vision-grounded retrieval reads the charts and tables that text-only RAG would miss. Auto mode handles the mixed-format pages without configuration.

Customer support knowledge base

Ingest manuals, troubleshooting guides, release notes. Multi-query expansion recovers passages when the user vocabulary differs from the documentation.

Email archive recall

Index full inboxes (EML, MBOX) into the same RAG pipeline as document corpora. Retrieve threads with date and sender attribution per result.

Custom RAG architectures

Drop DocumentRag into an existing retrieval stack. Configure custom vector stores, embedding models, and query strategies; the rest of your architecture stays in place.

Versus the alternatives

Three familiar shapes.

Cloud RAG services

Per-query billing. Documents leave the perimeter. Limited control over chunking, ranking, and processing strategy. Outages become your outages. Cost grows with scale.

Hand-rolled local RAG

Months of plumbing: PDF parsing, OCR, chunking, embedding, indexing, retrieval, reranking, attribution. Each layer is a project. Each layer is a maintenance burden.

DocumentRag

Adaptive processing, layout-aware chunking, advanced query strategies, vision-grounded retrieval, lifecycle by ID, source attribution, pluggable storage. One class, full control, 100% local. The foundation under PDF chat, RAG chat, and the rest of LM-Kit's document stack.

Related capabilities

RAG plus the foundations.

Layout understanding

The deterministic layout pipeline that drives chunk boundaries. Reading order, paragraphs, headings, tables, all reconstructed before chunking.

Layout understanding

Document to Markdown

Convert mixed-format documents to LLM-ready Markdown before indexing. Universal entry point for any RAG pipeline.

Markdown converter

OCR

The OCR backend that powers Auto-mode and DocumentUnderstanding-mode pages. Native engine plus vision-language OCR.

OCR page

Document chat & Q&A

The turn-key conversational wrapper on top of DocumentRag. PdfChat handles session, cache, and retrieval policy.

Chat with PDF

Built-in vector database

The default high-performance store. File-based, no server, millions of vectors per index, incremental updates.

Vector database

Email processing

Index full inboxes (EML, MBOX) into the same RAG pipeline. Source attribution per email.

Email processing

The retrieval engine. Local. Yours.

Get Community Edition · Download