Document RAG Engine

Build Custom Document RAG Pipelines.

Full control over document ingestion, processing, and retrieval. Adaptive pipeline that auto-selects OCR, Vision Language Models, or direct text extraction per page. Pluggable vector stores, source attribution, and document lifecycle management. 100% local processing.

Adaptive Processing · Custom Vector Stores · Source Attribution

  • Document Import: PDF, Office, HTML, Markdown, images with metadata
  • Adaptive Processing: auto-selects Text, OCR, or VLM per page
  • Chunking & Embedding: smart text chunking with layout awareness
  • Semantic Retrieval: similarity search with source attribution

3 processing modes · 100% local · ∞ scalable

Full Control Over Document Retrieval

DocumentRag is the lower-level engine behind LM-Kit's document intelligence capabilities. While PdfChat provides a turnkey conversational interface, DocumentRag gives you complete control over the document processing pipeline, from import through embedding to retrieval.

Configure processing modes per document, integrate your own vector stores, manage document lifecycles with explicit IDs, and build custom retrieval workflows that fit your exact requirements.

When to use DocumentRag: Choose DocumentRag when you need fine-grained control over chunking strategies, custom vector store backends, explicit document lifecycle management, or integration into existing RAG architectures.

  • PdfChat (high-level agent): Turnkey document Q&A with a conversational interface, tool calling, and memory. Best for rapid prototyping.
  • DocumentRag (RAG engine): Full pipeline control with configurable processing modes, vector stores, and document lifecycle management.
  • RagEngine (base engine): Generic text-based RAG for any content. DocumentRag extends this with document-specific features.

Adaptive Page-by-Page Processing

Configure how each page is analyzed and how its text is extracted. Auto mode selects the best strategy for each page.

TextExtraction
PageProcessingMode.TextExtraction

Extracts text directly from the document structure. Uses the configured OcrEngine for pages that require OCR (scanned images, image-based PDFs).

  • Fastest processing speed
  • Lower resource usage
  • Best for clean digital documents

DocumentUnderstanding
PageProcessingMode.DocumentUnderstanding

Uses VisionParser (VLM) to analyze page images visually. Produces markdown-formatted output that preserves document structure including tables, headings, and layout.

  • Best for complex layouts
  • Preserves document structure
  • Tables, forms, mixed content

Complete Document RAG Toolkit

Everything you need to build production-grade document retrieval pipelines.

Document Lifecycle Management

Import documents with explicit IDs for lifecycle management. Delete or update documents using their ID. Track documents across sessions with metadata.
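
The ID-based lifecycle described above can be illustrated with a minimal sketch. This is not LM-Kit's API: the dictionary stands in for a vector store, and the Upsert and Delete helpers are hypothetical names chosen for the example. It only shows why an explicit ID makes updates and deletions straightforward.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a vector store: document ID -> that document's chunks.
var partitionsByDocId = new Dictionary<string, List<string>>();

// Import or update: re-importing under the same ID replaces the old chunks.
void Upsert(string docId, IEnumerable<string> chunks) =>
    partitionsByDocId[docId] = chunks.ToList();

// Remove every chunk of the document; returns false if the ID is unknown.
bool Delete(string docId) => partitionsByDocId.Remove(docId);

Upsert("fin-report-2024-q4", new[] { "Revenue grew.", "Costs fell." });
Upsert("fin-report-2024-q4", new[] { "Revenue grew 12%." });   // update in place
Console.WriteLine(partitionsByDocId["fin-report-2024-q4"].Count);   // 1
Console.WriteLine(Delete("fin-report-2024-q4"));                    // True
Console.WriteLine(partitionsByDocId.Count);                         // 0
```

Because every chunk is reachable through its document ID, an update never leaves stale chunks behind and a delete needs no search over the stored content.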

Source Attribution

Every partition includes document name, page number, and custom metadata. Full traceability for compliance, audit, and citation requirements.

Configurable Chunking

Control chunk sizes with the MaxChunkSize property. Use TextChunking for standard text and MarkdownChunking for VLM output. Segmentation is layout-aware.
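
To make the size limit concrete, here is a minimal sketch of size-bounded chunking. It is not LM-Kit's TextChunking or MarkdownChunking implementation. The ChunkByParagraph helper is hypothetical, and it measures size in whitespace-separated words, which is an assumption because the unit of MaxChunkSize is not stated above.

```csharp
using System;
using System.Collections.Generic;

// Greedy chunker: packs whole paragraphs into chunks of at most maxWords words.
// A paragraph longer than maxWords is split on word boundaries.
List<string> ChunkByParagraph(string text, int maxWords)
{
    if (maxWords <= 0) throw new ArgumentOutOfRangeException(nameof(maxWords));

    var chunks = new List<string>();
    var current = new List<string>();   // words of the chunk being built

    void Flush()
    {
        if (current.Count == 0) return;
        chunks.Add(string.Join(" ", current));
        current.Clear();
    }

    foreach (var paragraph in text.Split("\n\n", StringSplitOptions.RemoveEmptyEntries))
    {
        var words = paragraph.Split(
            new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

        // Start a new chunk if this paragraph would overflow the current one.
        if (current.Count > 0 && current.Count + words.Length > maxWords) Flush();

        foreach (var word in words)
        {
            if (current.Count == maxWords) Flush();   // split an oversized paragraph
            current.Add(word);
        }
    }

    Flush();
    return chunks;
}

var sample = "alpha beta gamma\n\ndelta epsilon\n\nzeta eta theta iota kappa";
foreach (var piece in ChunkByParagraph(sample, maxWords: 4))
    Console.WriteLine(piece);
```

Smaller limits give more precise matches but less context per chunk, so the limit is usually tuned against the embedding model's input size and the kind of questions asked.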

OCR & VLM Integration

Configure OcrEngine for traditional text extraction from images. Configure VisionParser for VLM-based document understanding with layout preservation.

Progress Events

Subscribe to Progress events for real-time status updates. Track page processing, embedding generation, and completion phases.

Semantic Search

FindMatchingPartitions returns results ranked by similarity. QueryPartitions generates a response and returns DocumentReference objects for source attribution.
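
Similarity ranking of this kind can be sketched as follows. This is an illustration of the general technique (cosine similarity with a score threshold and a top-K cut), not LM-Kit's internal code. The Cosine and TopMatches helpers and the two-dimensional vectors are hypothetical, since real embeddings have hundreds of dimensions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Cosine similarity between two vectors of equal length.
double Cosine(float[] a, float[] b)
{
    if (a.Length != b.Length) throw new ArgumentException("Vectors must have the same length.");
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += (double)a[i] * b[i];
        normA += (double)a[i] * a[i];
        normB += (double)b[i] * b[i];
    }
    if (normA == 0 || normB == 0) return 0;   // a zero vector matches nothing
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Score every candidate, drop weak matches, and keep the topK best.
List<(string Text, double Score)> TopMatches(
    float[] query,
    IEnumerable<(string Text, float[] Vector)> candidates,
    int topK,
    double minScore) =>
    candidates
        .Select(p => (p.Text, Score: Cosine(query, p.Vector)))
        .Where(m => m.Score >= minScore)
        .OrderByDescending(m => m.Score)
        .Take(topK)
        .ToList();

var corpus = new (string Text, float[] Vector)[]
{
    ("Q4 revenue was up.", new[] { 1f, 0f }),
    ("Office relocation notice.", new[] { 0f, 1f }),
    ("Revenue by region.", new[] { 1f, 1f }),
};

foreach (var hit in TopMatches(new[] { 1f, 0f }, corpus, topK: 2, minScore: 0.5))
    Console.WriteLine($"{hit.Score:F2}  {hit.Text}");
```

The threshold filters out unrelated passages before the top-K cut, so a query with few good matches returns fewer than K results rather than padding the list with noise.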

Pluggable Vector Store Architecture

Swap backends without rewriting code. The same DocumentRag logic works across all storage types.

From Prototype to Production

DocumentRag integrates seamlessly with LM-Kit's unified vector storage system. Whether you're building a desktop application with local-only storage or a distributed AI platform with cloud infrastructure, the same code works everywhere. Just change the backend.

The IVectorStore interface provides a standardized contract that decouples your document RAG logic from storage implementation. This means you can prototype with in-memory storage, develop with file-based persistence, and deploy to production with Qdrant or your custom backend, all without changing your application code.

  • Zero setup for local development
  • Durable, portable file-based storage
  • HNSW indexing with Qdrant for scale
  • Custom backends via IVectorStore
  • Incremental updates without rebuild

  • In-Memory (no setup): fast prototyping, live testing, instant feedback
  • Qdrant (production): HNSW indexing, distributed scaling, cloud-ready
  • Custom Backend (flexible): implement IVectorStore for proprietary systems

Storage       Persistence   Scale    Infrastructure
In-Memory     Temporary     Low      None
Built-in DB   File-based    Medium   None
Qdrant        Durable       High     Qdrant instance
Custom        Custom        Varies   Your infra

Build a Document RAG Pipeline

Complete example showing document import, retrieval, and response generation with source attribution.

DocumentRagPipeline.cs
using System;
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Data.Storage;
using LMKit.TextGeneration;

// Load models
var chatModel = LM.LoadFromModelID("qwen3:4b");
var embedModel = LM.LoadFromModelID("embeddinggemma-300m");
var visionModel = LM.LoadFromModelID("qwen3-vl:4b");

// Configure vector store (optional - uses in-memory by default)
var vectorStore = new FileSystemVectorStore("./embeddings");

// Create DocumentRag with full configuration
var docRag = new DocumentRag(embedModel, vectorStore)
{
    ProcessingMode = PageProcessingMode.Auto,
    MaxChunkSize = 512,
    OcrEngine = new OcrEngine(),
    VisionParser = new VlmOcr(visionModel)
};

// Subscribe to progress events
docRag.Progress += (sender, e) =>
{
    if (e.Phase == DocumentImportPhase.PageProcessingStarted)
        Console.WriteLine($"Processing page {e.PageIndex + 1}/{e.TotalPages}");
};

// Import document with explicit ID for lifecycle management
var attachment = Attachment.FromFile("financial-report.pdf");
var metadata = new DocumentRag.DocumentMetadata(
    attachment,
    id: "fin-report-2024-q4",
    sourceUri: "https://internal.company.com/reports/q4-2024.pdf");

await docRag.ImportDocumentAsync(attachment, metadata, "financial-docs");

// Find relevant passages
var question = "What was the Q4 revenue?";
var partitions = await docRag.FindMatchingPartitionsAsync(
    question,
    topK: 5,
    minScore: 0.5f);

// Generate response with source attribution
var conversation = new SingleTurnConversation(chatModel);
var result = await docRag.QueryPartitionsAsync(
    question,
    partitions,
    conversation);

Console.WriteLine(result.Response.Completion);
foreach (var reference in result.SourceReferences)
{
    Console.WriteLine($"  Source: {reference.Name}, Page {reference.PageNumber}");
    Console.WriteLine($"  Score: {reference.SimilarityScore:P1}");
}

// Later: Delete document by ID
await docRag.DeleteDocumentAsync("fin-report-2024-q4", "financial-docs");

When to Use DocumentRag

DocumentRag excels when you need granular control over document processing.

Document Management Systems

Build enterprise search with explicit document IDs, version tracking, and integration with existing document repositories.

Compliance & Audit Systems

Source attribution with page numbers and metadata provides audit trails. Custom metadata supports compliance tagging.

Custom RAG Architectures

Integrate DocumentRag into existing RAG pipelines. Use custom vector stores, embedding models, and retrieval strategies.

Complex Document Processing

Fine-tune processing modes per document type. Use VLM for scanned documents, text extraction for digital PDFs.

Embedding Cache Systems

Use IVectorStore for persistent embedding storage. Once a document's embeddings are cached, subsequent loads reuse them instead of recomputing.
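
The caching idea can be sketched independently of any storage backend. The example below is an assumption-laden illustration, not LM-Kit's mechanism: the in-memory dictionary stands in for a persistent store, FakeEmbed is a placeholder for a real embedding model, and keying on a SHA-256 hash of the content is one common choice among several. It requires .NET 5 or later for SHA256.HashData and Convert.ToHexString.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical cache: content hash -> embedding vector.
var cache = new Dictionary<string, float[]>();
int embedCalls = 0;

// Placeholder embedder: a real pipeline would call an embedding model here.
float[] FakeEmbed(string text)
{
    embedCalls++;
    return new[] { (float)text.Length };
}

// Key the cache on a hash of the content so unchanged text is never re-embedded.
string ContentKey(string text) =>
    Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(text)));

float[] GetOrEmbed(string text)
{
    var key = ContentKey(text);
    if (!cache.TryGetValue(key, out var vector))
    {
        vector = FakeEmbed(text);   // cache miss: compute once and store
        cache[key] = vector;
    }
    return vector;
}

GetOrEmbed("Q4 revenue summary");
GetOrEmbed("Q4 revenue summary");   // served from the cache
Console.WriteLine(embedCalls);      // 1
```

Hashing the content rather than the file name means a renamed but unchanged document still hits the cache, while an edited document is re-embedded.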

Batch Document Processing

Import multiple documents into the same DataSource. Query across all documents with unified retrieval.

Key Classes & Methods

Core components for building document RAG pipelines.

DocumentRag

Main class for document-centric RAG. Extends RagEngine with document-specific processing, OCR/VLM integration, and source attribution.

DocumentMetadata

Metadata container for documents. Includes ID for lifecycle management, name, source URI, and custom metadata fields.

DocumentReference

Source reference from retrieval results. Provides document name, page number, source URI, excerpt, and similarity score.

IVectorStore

Interface for embedding storage. Implement for custom backends or use FileSystemVectorStore, Qdrant connector, etc.

VlmOcr

Vision Language Model-based document parser. Analyzes page images visually and produces markdown-formatted output.

PageProcessingMode

Enum for processing modes: Auto, TextExtraction, DocumentUnderstanding. Controls how each page is analyzed.


Ready to Build Custom Document RAG?

Full control over document processing, retrieval, and response generation. 100% local, 100% your infrastructure.