Build Custom Document RAG Pipelines
Full control over document ingestion, processing, and retrieval. Adaptive pipeline that auto-selects OCR, Vision Language Models, or direct text extraction per page. Pluggable vector stores, source attribution, and document lifecycle management. 100% local processing.
Full Control Over Document Retrieval
DocumentRag is the lower-level engine behind LM-Kit's document intelligence
capabilities. While PdfChat provides a turnkey conversational interface,
DocumentRag gives you complete control over the document processing pipeline,
from import through embedding to retrieval.
Configure processing modes per document, integrate your own vector stores, manage document lifecycles with explicit IDs, and build custom retrieval workflows that fit your exact requirements.
When to use DocumentRag: Choose DocumentRag when you need fine-grained control over chunking strategies, custom vector store backends, explicit document lifecycle management, or integration into existing RAG architectures.
PdfChat
High-Level Agent
Turnkey document Q&A with conversational interface, tool calling, and memory. Best for rapid prototyping.
DocumentRag
RAG Engine
Full pipeline control with configurable processing modes, vector stores, and document lifecycle management.
RagEngine
Base Engine
Generic text-based RAG for any content. DocumentRag extends this with document-specific features.
Adaptive Page-by-Page Processing
Configure how each page is analyzed and text is extracted. Auto mode intelligently selects the best strategy per page.
Auto: Analyzes each page and automatically selects the optimal strategy. Uses VLM for image-heavy pages when VisionParser is configured, otherwise falls back to text extraction.
- Zero configuration required
- Per-page optimization
- Handles mixed document types
TextExtraction: Extracts text directly from the document structure. Uses the configured OcrEngine for pages that require OCR (scanned images, image-based PDFs).
- Fastest processing speed
- Lower resource usage
- Best for clean digital documents
DocumentUnderstanding: Uses VisionParser (VLM) to analyze page images visually. Produces markdown-formatted output that preserves document structure including tables, headings, and layout.
- Best for complex layouts
- Preserves document structure
- Tables, forms, mixed content
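A minimal sketch of switching modes per document type, reusing the embedModel and visionModel handles from the pipeline example below. It assumes ProcessingMode and VisionParser can be reassigned between imports and that ImportDocumentAsync accepts (attachment, metadata, collection) as shown on this page; verify both against the API reference.

```csharp
// Fast path for clean digital PDFs: direct text extraction
var docRag = new DocumentRag(embedModel)
{
    ProcessingMode = PageProcessingMode.TextExtraction
};

var invoice = Attachment.FromFile("digital-invoice.pdf");
await docRag.ImportDocumentAsync(
    invoice,
    new DocumentRag.DocumentMetadata(invoice, id: "invoice-001"),
    "invoices");

// Switch to VLM-based understanding for a scanned, layout-heavy document
docRag.ProcessingMode = PageProcessingMode.DocumentUnderstanding;
docRag.VisionParser = new VlmOcr(visionModel);

var form = Attachment.FromFile("scanned-form.pdf");
await docRag.ImportDocumentAsync(
    form,
    new DocumentRag.DocumentMetadata(form, id: "form-001"),
    "forms");
```

Leaving ProcessingMode at Auto avoids this branching entirely; explicit modes are useful when you already know the corpus is uniform.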
Complete Document RAG Toolkit
Everything you need to build production-grade document retrieval pipelines.
Document Lifecycle Management
Import documents with explicit IDs for lifecycle management. Delete or update documents using their ID. Track documents across sessions with metadata.
Source Attribution
Every partition includes document name, page number, and custom metadata. Full traceability for compliance, audit, and citation requirements.
Configurable Chunking
Control chunk sizes with the MaxChunkSize property. Use TextChunking for standard text and MarkdownChunking for VLM output. Layout-aware segmentation.
OCR & VLM Integration
Configure OcrEngine for traditional text extraction from images. Configure VisionParser for VLM-based document understanding with layout preservation.
Progress Events
Subscribe to Progress events for real-time status updates. Track page processing, embedding generation, and completion phases.
Semantic Search
FindMatchingPartitions returns ranked results by similarity. QueryPartitions generates responses with DocumentReference objects for source attribution.
Pluggable Vector Store Architecture
Swap backends without rewriting code. The same DocumentRag logic works across all storage types.
From Prototype to Production
DocumentRag integrates seamlessly with LM-Kit's unified vector storage system. Whether you're building a desktop application with local-only storage or a distributed AI platform with cloud infrastructure, the same code works everywhere. Just change the backend.
The IVectorStore interface provides a standardized contract that decouples
your document RAG logic from storage implementation. This means you can prototype with
in-memory storage, develop with file-based persistence, and deploy to production with
Qdrant or your custom backend, all without changing your application code.
- Zero setup for local development
- Durable, portable file-based storage
- HNSW indexing with Qdrant for scale
- Custom backends via IVectorStore
- Incremental updates without rebuild
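The swap can be sketched as a one-line change at construction time. This assumes the embedModel-only constructor defaults to in-memory storage (as the pipeline example below notes); the Qdrant connector type name here is illustrative, not confirmed.

```csharp
// Prototype: in-memory storage (assumed default when no store is passed)
var ragDev = new DocumentRag(embedModel);

// Develop: durable, portable file-based storage
IVectorStore fileStore = new FileSystemVectorStore("./embeddings");
var ragLocal = new DocumentRag(embedModel, fileStore);

// Production: Qdrant backend (type name is hypothetical; check the
// LMKit.Data.Storage namespace for the actual connector)
// IVectorStore qdrant = new QdrantEmbeddingStore(new Uri("http://localhost:6333"));
// var ragProd = new DocumentRag(embedModel, qdrant);
```

Every ImportDocumentAsync, FindMatchingPartitionsAsync, and QueryPartitionsAsync call stays identical across all three; only the construction line changes.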
In-Memory
Fast prototyping, live testing, instant feedback
No Setup
Built-in Vector DB
File-based, handles millions of vectors, no server needed
Recommended
Qdrant
HNSW indexing, distributed scaling, cloud-ready
Production
Custom Backend
Implement IVectorStore for proprietary systems
Flexible
Build a Document RAG Pipeline
Complete example showing document import, retrieval, and response generation with source attribution.
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Data.Storage;
using LMKit.TextGeneration;

// Load models
var chatModel = LM.LoadFromModelID("qwen3:4b");
var embedModel = LM.LoadFromModelID("embeddinggemma-300m");
var visionModel = LM.LoadFromModelID("qwen3-vl:4b");

// Configure vector store (optional - uses in-memory by default)
var vectorStore = new FileSystemVectorStore("./embeddings");

// Create DocumentRag with full configuration
var docRag = new DocumentRag(embedModel, vectorStore)
{
    ProcessingMode = PageProcessingMode.Auto,
    MaxChunkSize = 512,
    OcrEngine = new OcrEngine(),
    VisionParser = new VlmOcr(visionModel)
};

// Subscribe to progress events
docRag.Progress += (sender, e) =>
{
    if (e.Phase == DocumentImportPhase.PageProcessingStarted)
        Console.WriteLine($"Processing page {e.PageIndex + 1}/{e.TotalPages}");
};

// Import document with explicit ID for lifecycle management
var attachment = Attachment.FromFile("financial-report.pdf");
var metadata = new DocumentRag.DocumentMetadata(
    attachment,
    id: "fin-report-2024-q4",
    sourceUri: "https://internal.company.com/reports/q4-2024.pdf");

await docRag.ImportDocumentAsync(attachment, metadata, "financial-docs");

// Find relevant passages
var partitions = await docRag.FindMatchingPartitionsAsync(
    "What was the Q4 revenue?",
    topK: 5,
    minScore: 0.5f);

// Generate response with source attribution
var conversation = new SingleTurnConversation(chatModel);
var result = await docRag.QueryPartitionsAsync(
    "What was the Q4 revenue?",
    partitions,
    conversation);

Console.WriteLine(result.Response.Completion);

foreach (var reference in result.SourceReferences)
{
    Console.WriteLine($"  Source: {reference.Name}, Page {reference.PageNumber}");
    Console.WriteLine($"  Score: {reference.SimilarityScore:P1}");
}

// Later: Delete document by ID
await docRag.DeleteDocumentAsync("fin-report-2024-q4", "financial-docs");
When to Use DocumentRag
DocumentRag excels when you need granular control over document processing.
Document Management Systems
Build enterprise search with explicit document IDs, version tracking, and integration with existing document repositories.
Compliance & Audit Systems
Source attribution with page numbers and metadata provides audit trails. Custom metadata supports compliance tagging.
Custom RAG Architectures
Integrate DocumentRag into existing RAG pipelines. Use custom vector stores, embedding models, and retrieval strategies.
Complex Document Processing
Fine-tune processing modes per document type. Use VLM for scanned documents, text extraction for digital PDFs.
Embedding Cache Systems
Use IVectorStore for persistent embedding storage. Subsequent document loads are instant when cached.
Batch Document Processing
Import multiple documents into the same DataSource. Query across all documents with unified retrieval.
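A batch import can be sketched as a loop over a folder, reusing the docRag instance from the pipeline example above. The ID-from-file-name scheme is ours, not the library's, and the sketch assumes sourceUri is an optional DocumentMetadata argument; confirm the constructor overloads in the API reference.

```csharp
// Import every PDF in a folder into one collection ("contracts")
foreach (var path in Directory.GetFiles("./contracts", "*.pdf"))
{
    var attachment = Attachment.FromFile(path);
    var metadata = new DocumentRag.DocumentMetadata(
        attachment,
        id: Path.GetFileNameWithoutExtension(path)); // e.g. "msa-acme-2024"
    await docRag.ImportDocumentAsync(attachment, metadata, "contracts");
}

// One query now retrieves across every imported document
var hits = await docRag.FindMatchingPartitionsAsync(
    "termination clauses",
    topK: 5,
    minScore: 0.5f);
```

Because each document keeps its explicit ID, any single file can later be updated or removed with DeleteDocumentAsync without rebuilding the collection.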
Key Classes & Methods
Core components for building document RAG pipelines.
DocumentRag
Main class for document-centric RAG. Extends RagEngine with document-specific processing, OCR/VLM integration, and source attribution.
View Documentation
DocumentMetadata
Metadata container for documents. Includes ID for lifecycle management, name, source URI, and custom metadata fields.
View Documentation
DocumentReference
Source reference from retrieval results. Provides document name, page number, source URI, excerpt, and similarity score.
View Documentation
IVectorStore
Interface for embedding storage. Implement for custom backends or use FileSystemVectorStore, Qdrant connector, etc.
View Documentation
VlmOcr
Vision Language Model-based document parser. Analyzes page images visually and produces markdown-formatted output.
View Documentation
PageProcessingMode
Enum for processing modes: Auto, TextExtraction, DocumentUnderstanding. Controls how each page is analyzed.
View Documentation
Ready to Build Custom Document RAG?
Full control over document processing, retrieval, and response generation. 100% local, 100% your infrastructure.