Document import
PDF, DOCX, PPTX, XLSX, HTML, EML, MBOX, images, raw text. Per-document metadata.
DocumentRag is the foundation under every document-aware
capability in LM-Kit.NET: PDF chat, RAG chat, multi-format ingestion,
document Q&A. It pairs an adaptive page-by-page
processing pipeline with advanced query strategies,
vision-grounded retrieval, layout-aware chunking,
pluggable vector stores, and full source attribution. Use the
turn-key wrappers when the defaults fit; reach for
DocumentRag directly when production needs more control.
Auto-selects text extraction, vision OCR, or document understanding per page.
Chunks aligned to paragraphs, headings, table boundaries. Not arbitrary token windows.
Multi-query expansion, hypothetical answer, contextual recall, MMR, reranking.
Generic RAG treats every document as a flat string, chunks it by token
count, embeds the chunks, and hopes for the best. Real corpora punish
that approach: scanned pages have no text layer, financial reports hide
values in tables, contracts have hierarchical structure, scientific
papers use multi-column layouts. DocumentRag is built for
those documents specifically: page-level adaptive processing,
layout-aware segmentation, vision grounding when the layout demands it,
and source attribution down to page and passage.
PDF, DOCX, PPTX, XLSX, HTML, EML, MBOX, images. Each format routes through a converter tuned for it. The pipeline does not assume PDF.
A 200-page mixed PDF gets text extraction on the digital pages, vision OCR on the scanned ones, and document understanding on the layout-heavy ones. One ingestion call, three strategies.
Chunk boundaries align to paragraphs, headings, list items, and table edges, not to arbitrary token positions. Retrieval matches semantic units, not random spans.
Multi-query expansion, hypothetical-answer retrieval, contextual recall, MMR diversification, reranking. Pick a strategy per workload; mix and match in custom pipelines.
Inject the original page image into the model's context for the matched chunks. The LLM reads both the text and the visual layout, critical for charts and complex tables.
Every result carries the document name, page number, source URI, similarity score, and the constituent passages. Compliance, audit trails, and citations come free.
The retrieval stack ships at three abstraction levels. Use the highest one whose defaults fit your workload, drop down only when you need control.
Highest level
PdfChat & RagChat: Turn-key conversational Q&A. Multi-turn memory, automatic strategy selection, source attribution preserved. Best when you want to ship a chat surface in an afternoon.
Mid level
DocumentRag: Full pipeline control. Configurable processing modes, vector stores, chunking, query strategies. Document lifecycle by ID. The page you are reading.
Lowest level
RagEngine: Generic text-based retrieval over arbitrary content (not document-shaped). Useful when the input is already pre-processed text and you want raw indexing and search.
Configure how each page is analysed. Auto routes intelligently
per page based on content; the explicit modes pin the strategy when
you want predictable behaviour. The enum is PageProcessingMode.
Default · Auto
Analyses each page and selects the optimal strategy. Uses vision parsing for image-heavy pages when a vision parser is configured; falls back to text extraction otherwise. Zero configuration; per-page optimisation; handles mixed document types.
PageProcessingMode.Auto
Mode
Extracts text directly from the document structure. Uses the configured OCR engine for pages that require it (scanned images, image-based PDFs). Fastest path; lowest resource usage; ideal for clean digital documents.
PageProcessingMode.TextExtraction
Mode
Uses a vision parser (VLM) to analyse page images visually. Produces Markdown-formatted output that preserves document structure including tables, headings, and layout. Best for complex layouts, forms, and mixed content.
PageProcessingMode.DocumentUnderstanding
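When the corpus is uniform, pinning a mode is a one-property change. A minimal sketch, reusing the embedding model and store set up in the quick-start example on this page:

```csharp
// Pin the fast path for a corpus of clean, digital-born PDFs.
// embedModel and store are assumed configured as in the quick-start example.
var rag = new DocumentRag(embedModel, store)
{
    // Auto is the default; TextExtraction skips per-page analysis entirely.
    ProcessingMode = PageProcessingMode.TextExtraction,
    MaxChunkSize = 512,
};
```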
Naive cosine similarity over a single embedded query is rarely the
best retrieval strategy. DocumentRag ships with several
advanced strategies, switchable per call. Pick the one that matches
the question shape; combine for hybrid pipelines.
Strategy · Semantic search
Embed the query, search the store, return the top-k passages. Fastest; works well when the question vocabulary closely mirrors the corpus.
Strategy · Multi-query expansion
The model rewrites the query into several semantically equivalent variants. Each variant is searched independently; results are merged and de-duplicated. Recovers passages that direct similarity misses due to vocabulary drift.
Strategy · Hypothetical-answer retrieval
The model generates a plausible answer to the query, then searches for passages similar to that answer. Particularly strong on questions where the answer vocabulary is far from the question vocabulary.
Strategy · Contextual recall
Expand each query with conversation history before searching. The retrieval respects multi-turn context: a follow-up question retrieves passages relevant to the preceding turn, not just the literal text.
Strategy · MMR diversification
Maximal Marginal Relevance balances similarity with diversity. Prevents top-k from collapsing to near-duplicates; surfaces multiple distinct sources when the corpus has redundancy.
Refinement · Reranking
Run a second pass on the top-k candidates with a more expensive scorer. Significantly improves precision on the final ordering; layered on top of any base strategy.
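For intuition, the MMR objective (balance similarity to the query against redundancy with already-selected passages) can be sketched in a few lines of plain C#. This is an illustrative re-implementation over pre-computed embeddings, not DocumentRag's internal code; the helper names and the 0.7 lambda weight are arbitrary:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Cosine similarity between two dense vectors.
static double Cos(double[] a, double[] b)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
    return dot / (Math.Sqrt(na) * Math.Sqrt(nb) + 1e-12);
}

// Greedy MMR: pick the candidate maximising
//   lambda * sim(query, doc) - (1 - lambda) * max sim(doc, alreadySelected).
static List<int> Mmr(double[] querySim, double[][] docs, int k, double lambda = 0.7)
{
    var selected = new List<int>();
    var remaining = Enumerable.Range(0, docs.Length).ToList();
    while (selected.Count < k && remaining.Count > 0)
    {
        int best = -1;
        double bestScore = double.NegativeInfinity;
        foreach (int i in remaining)
        {
            double redundancy = selected.Count == 0
                ? 0
                : selected.Max(j => Cos(docs[i], docs[j]));
            double score = lambda * querySim[i] - (1 - lambda) * redundancy;
            if (score > bestScore) { bestScore = score; best = i; }
        }
        selected.Add(best);
        remaining.Remove(best);
    }
    return selected;
}
```

With lambda near 1 the selection degenerates to plain top-k; lowering it trades raw similarity for coverage of distinct sources.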
Some questions cannot be answered from text alone. The chart on page 14
that shows quarterly revenue. The table on page 22 with three column
spans. The diagram in a technical manual. DocumentRag can
inject the original page renderings alongside the matched text passages
so the language model reads both.
"What was the revenue trend across Q1 to Q4?" pulls the bar-chart page into context. The model reads the chart, not a paraphrase of it.
Multi-row headers, merged cells, footnoted values. The visual layout carries information that flat text extraction loses.
Technical documents, architecture diagrams, flowcharts. The model gets the picture, not just the labels.
Pages where text and image must be read together. Magazine layouts, financial dashboards, product brochures.
Layout-aware
Chunks align to paragraphs, headings, list items, and table edges. The deterministic layout-understanding pipeline drives boundaries; MaxChunkSize caps length without splitting structural units mid-sentence. Markdown chunking for vision output; text chunking for text-extraction output.
Attributed
Every retrieved partition carries a DocumentReference: document name, page number, source URI, similarity score, custom metadata. The query API returns both the response and the references that fed it. Compliance, audit, and citation requirements covered without retrofitting.
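Reading the attribution back is a loop over the returned references. Name, PageNumber, and SimilarityScore appear in the quick-start example on this page; treat any further DocumentReference members as something to confirm against the API reference:

```csharp
// rag and chat are assumed configured as in the quick-start example.
var result = await rag.QueryPartitionsAsync("Which clause governs termination?", chat);

Console.WriteLine(result.Response.Completion);
foreach (var r in result.SourceReferences)
{
    // Each reference identifies the document and page that fed the answer.
    Console.WriteLine($"[{r.SimilarityScore:P0}] {r.Name}, page {r.PageNumber}");
}
```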
Swap backends without rewriting code. The same DocumentRag pipeline runs across every store type.
The IVectorStore interface decouples retrieval logic
from storage implementation. Prototype with in-memory storage,
iterate with file-based persistence, deploy with the built-in
high-performance store, scale out to a third-party connector,
or write your own backend for a proprietary system. Application
code stays the same.
IVectorStore
Fast prototyping, live testing, instant feedback
No setup
File-based, handles millions of vectors, no server needed
Recommended
Distributed scaling for hundreds of millions of vectors
Production
Implement IVectorStore for proprietary systems
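Because the pipeline is typed against the interface, swapping backends is a declaration change, not a rewrite. A sketch (the custom store name is hypothetical, not an LM-Kit type):

```csharp
// Prototype against the built-in file-based store...
IVectorStore store = new FileSystemVectorStore("./embeddings");

// ...later, swap in a proprietary backend by implementing IVectorStore.
// MyCorporateStore is a hypothetical example class.
// IVectorStore store = new MyCorporateStore(connectionString);

var rag = new DocumentRag(embedModel, store); // application code is unchanged
```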
Document ingestion is not an opaque call. DocumentRag emits
progress events for every phase the pipeline goes through, with enough
metadata to drive UIs, dashboards, and telemetry without polling.
Phase · PageProcessingStarted
Fires before each page enters the processing pipeline. Carries page index, total page count, planned strategy, source name. Use it to render a per-page progress bar.
Phase · PageProcessingCompleted
Fires after each page completes. Carries elapsed time, strategy actually used (Auto resolves to TextExtraction or DocumentUnderstanding here), generated token count, optional warnings.
Phase · EmbeddingStarted
Fires when the chunk-and-embed phase begins for the document. Useful for switching the UI from "processing" to "indexing" status.
Phase · EmbeddingCompleted
Fires when all chunks have been embedded and persisted. Indicates the document is queryable. Carries final partition count.
Lifecycle
Documents carry an explicit ID for lifecycle management. DeleteDocumentAsync removes a document and all its partitions. ImportDocumentAsync on the same ID re-indexes.
Lifecycle
Every async method accepts a CancellationToken. Long-running ingestions cancel cleanly at the next page boundary, releasing resources without leaving partial state behind.
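Put together, a re-index plus cleanup pass might look like the following sketch; the cancellationToken parameter name is an assumption, and rag is configured as in the quick-start example:

```csharp
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(10));

// Importing on an existing ID re-indexes the document in place.
await rag.ImportDocumentAsync(
    Attachment.FromFile("policy-v2.pdf"),
    new("policy-001") { Name = "Policy, revision 2" },
    cancellationToken: cts.Token);

// Deleting removes the document and every partition derived from it.
await rag.DeleteDocumentAsync("policy-001");
```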
Ingest a PDF into a file-based vector store, then query with source attribution including page numbers.
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Data.Storage;
using LMKit.TextGeneration;

// Models. The chat model can be smaller than the embedding model and vice versa.
var chatModel = LM.LoadFromModelID("qwen3.5:4b");
var embedModel = LM.LoadFromModelID("embeddinggemma-300m");

// File-based vector store. Persists across runs, no server.
var store = new FileSystemVectorStore("./embeddings");

var rag = new DocumentRag(embedModel, store)
{
    ProcessingMode = PageProcessingMode.Auto,
    MaxChunkSize = 512,
};

// Import with an explicit ID for lifecycle management.
await rag.ImportDocumentAsync(
    Attachment.FromFile("financial-report.pdf"),
    new("fin-q4-2024") { Name = "Q4 financial report" },
    dataSource: "finance");

// Query with source attribution.
var chat = new SingleTurnConversation(chatModel);
var result = await rag.QueryPartitionsAsync("What was Q4 revenue?", chat);

Console.WriteLine(result.Response.Completion);
foreach (var r in result.SourceReferences)
{
    Console.WriteLine($"  [{r.SimilarityScore:P0}] {r.Name}, page {r.PageNumber}");
}
Plug a vision-language model into the parser to read charts and tables alongside the surrounding text.
// Inject page renderings into context for charts, tables, and complex layouts.
var visionModel = VisionLanguageModel.LoadFromModelID("qwen3.5:4b");

var rag = new DocumentRag(embedModel, store)
{
    ProcessingMode = PageProcessingMode.Auto,
    VisionParser = new VlmOcr(visionModel),
};

await rag.ImportDocumentAsync(
    Attachment.FromFile("annual-report.pdf"),
    new("ann-2024") { Name = "Annual report 2024" });

// Pass the matched page images alongside the text passages.
var visionChat = new MultiTurnConversation(visionModel);
var result = await rag.QueryPartitionsAsync(
    "Describe the revenue trend on the Q1-to-Q4 chart.",
    visionChat,
    includePageRenderingsInContext: true);
// The vision model now reads both the surrounding text AND the chart image.
Subscribe to lifecycle events to drive a progress UI per page, per phase, and per strategy choice.
// Drive a progress UI from the lifecycle events.
rag.Progress += (sender, e) =>
{
    switch (e.Phase)
    {
        case DocumentImportPhase.PageProcessingStarted:
            ui.SetStatus($"Page {e.PageIndex + 1}/{e.TotalPages} via {e.PlannedStrategy}");
            break;
        case DocumentImportPhase.PageProcessingCompleted:
            log.Info($"Page {e.PageIndex + 1}: {e.StrategyUsed}, {e.Elapsed.TotalMilliseconds:F0}ms");
            break;
        case DocumentImportPhase.EmbeddingStarted:
            ui.SetStatus("Indexing...");
            break;
        case DocumentImportPhase.EmbeddingCompleted:
            ui.SetStatus("Ready");
            break;
    }
};

await rag.ImportDocumentAsync(attachment, metadata, dataSource: "finance", ct);
Index a corpus of contracts, reports, and policies. Answer freeform questions with citations to document name and page number. Source attribution is the audit trail.
Lifecycle by ID: documents are added, updated, deleted as the policy archive evolves. Every answer cites its sources; deleted documents drop from results immediately.
Vision-grounded retrieval reads the charts and tables that text-only RAG would miss. Auto mode handles the mixed-format pages without configuration.
Ingest manuals, troubleshooting guides, release notes. Multi-query expansion recovers passages when the user vocabulary differs from the documentation.
Index full inboxes (EML, MBOX) into the same RAG pipeline as document corpora. Retrieve threads with date and sender attribution per result.
Drop DocumentRag into an existing retrieval stack. Configure custom vector stores, embedding models, and query strategies; the rest of your architecture stays.
Per-query billing. Documents leave the perimeter. Limited control over chunking, ranking, and processing strategy. Outages become your outages. Cost grows with scale.
Months of plumbing: PDF parsing, OCR, chunking, embedding, indexing, retrieval, reranking, attribution. Each layer is a project. Each layer is a maintenance burden.
Adaptive processing, layout-aware chunking, advanced query strategies, vision-grounded retrieval, lifecycle by ID, source attribution, pluggable storage. One class, full control, 100% local. The foundation under PDF chat, RAG chat, and the rest of LM-Kit's document stack.
Class
Main retrieval engine. Document-shaped ingestion, adaptive processing, lifecycle by ID, query strategies, source attribution.
View documentation
Enum
Per-page strategy: Auto, TextExtraction, DocumentUnderstanding. Controls how each page is analysed.
View documentation
Type
Source attribution for every result. Document name, page number, source URI, similarity score, custom metadata.
View documentation
Interface
Pluggable storage contract. Implement for proprietary backends; use the built-in implementations for typical workloads.
View documentation
Enum
Lifecycle phases emitted via the Progress event: PageProcessingStarted, PageProcessingCompleted, EmbeddingStarted, EmbeddingCompleted.
View documentation
Class
Turn-key conversational wrapper on DocumentRag. Multi-turn memory, automatic strategy selection, ready-to-ship chat surface.
View documentation
The deterministic layout pipeline that drives chunk boundaries. Reading order, paragraphs, headings, tables, all reconstructed before chunking.
Convert mixed-format documents to LLM-ready Markdown before indexing. Universal entry point for any RAG pipeline.
The OCR backend that powers Auto-mode and DocumentUnderstanding-mode pages. Native engine plus vision-language OCR.
The turn-key conversational wrapper on top of DocumentRag. PdfChat handles session, cache, and retrieval policy.
The default high-performance store. File-based, no server, millions of vectors per index, incremental updates.
Index full inboxes (EML, MBOX) into the same RAG pipeline. Source attribution per email.
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.
Console demo: ingest, embed, retrieve, ground answers in citations.
Open on GitHub →
How-to guide
End-to-end RAG architecture for production .NET applications.
Read the guide →
How-to guide
Long-lived index with reindex, evict, and version policies.
Read the guide →
How-to guide
Hybrid retrieval + rerank for precision-critical workloads.
Read the guide →