Image understanding
Caption, describe, answer visual questions, reason about what's in the frame.
LM-Kit.NET runs the complete vision and multimodal surface on your hardware: vision-language models, image classification, image labeling, multimodal chat, image embeddings, VLM-driven OCR, background removal. One SDK, one inference engine, zero cloud calls.
Single-label classification, multi-label tagging, custom categories driven by a VLM.
Drop an Attachment into a conversation. Multi-turn, streaming, tool-augmented.
Cross-modal vectors aligned with text. VLM OCR with SOTA scores on public benchmarks.
Each one is a single class, a single line to load, and a single call to run. Mix any of them with the rest of the LM-Kit.NET surface (agents, RAG, document intelligence).
01 · Image understanding
Caption, describe, visual question answering, scene reasoning. Drop an image into a SingleTurnConversation and ask anything.
02 · Image classification
Single-label classification with custom categories. Zero-shot via VLM prompt, or fine-tuned via LoRA. Grammar-constrained output for deterministic labels.
03 · Image labeling
Multi-label tagging. Extract a set of relevant tags from an image, with confidence scores. Useful for asset catalogs, content moderation, indexing.
04 · Multimodal chat
Multimodal conversations. Multiple images, multi-turn, streaming tokens. Same MultiTurnConversation API as text-only chat.
05 · Image embeddings
Cross-modal vectors aligned with the text embedder. Text-to-image and image-to-image retrieval, multimodal RAG, duplicate detection.
06 · VLM-OCR
Vision-language-model OCR with structured intents: text, Markdown, tables, formulas, charts, coordinates, seals. SOTA accuracy on document-OCR benchmarks.
07 · Background removal & preprocessing
U2-Net and ModNet for background removal. Plus a full preprocessing toolkit: deskew, crop, resize, color correction. Built-in agent tools wrap each one.
The same conversation surface that powers every LLM in LM-Kit.NET also powers every VLM. No special code path, no separate API.
```csharp
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Graphics;

// 1. Load a vision-language model. Same call as any LLM.
var vlm = LM.LoadFromModelID("qwen3-vl:8b");

// 2. Wrap in a conversation. Same primitives as text-only.
var chat = new SingleTurnConversation(vlm);

// 3. Submit a prompt with one or more image attachments.
string answer = await chat.SubmitAsync(
    "Describe this image and list any visible defects.",
    Attachment.FromFile("part-photo.jpg"));

Console.WriteLine(answer);
```
Same code path for multi-turn chat, function calls, grammar-constrained outputs, RAG queries. The Attachment type accepts files, byte arrays, streams, and base64 strings. 5-minute quickstart.
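The multi-turn path looks the same. A minimal sketch, assuming the MultiTurnConversation constructor and SubmitAsync overload mirror the single-turn example above; the byte-array Attachment constructor shown here is an assumption (the page states Attachment accepts byte arrays, but does not show the exact overload):

```csharp
using System.IO;
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Graphics;

// Same model handle as the single-turn example.
var vlm = LM.LoadFromModelID("qwen3-vl:8b");

// Multi-turn: the conversation keeps history across submissions.
var chat = new MultiTurnConversation(vlm);

// First turn: attach an image loaded from a byte array.
// (Hypothetical constructor; the page only shows Attachment.FromFile.)
byte[] screenshot = File.ReadAllBytes("dashboard.png");
string first = await chat.SubmitAsync(
    "What metric is trending down in this dashboard?",
    new Attachment(screenshot));

// Follow-up turn: no need to re-attach; the image stays in context.
string second = await chat.SubmitAsync(
    "Suggest two likely causes for that trend.");
```

The point is that nothing changes versus text-only chat: history, streaming, and tool calls work the same once an image is in context.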
Today's vision stack. The catalog grows with every release; swap a model ID and the rest of your code keeps working.
VLM
General vision-language model. Strong at captioning, document Q&A, visual reasoning. Sizes from 2B to 72B. Model IDs qwen2-vl:2b and qwen3.5:9b.
VLM
Google's open vision-language model. Good multilingual coverage. Available in 1B, 4B, 12B, 27B sizes via the Gemma 3 model family.
VLM
Lightweight VLM optimised for OCR, document understanding, screenshot reasoning, function calling. Model ID glm-4.6v-flash.
OCR
LM-Kit's first-party OCR engine. SOTA benchmark scores on public document datasets. Fast on a single core, with optimised multi-threading.
OCR
VLM-based OCR with seven structured intents. Model ID paddleocr-vl:0.9b.
Embeddings
Image vectors aligned with nomic-embed-text. Cross-modal search: text query, image result and vice versa. Model ID nomic-embed-vision.
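Once text and image vectors live in the same space, retrieval is ordinary vector math. A sketch of the ranking step: the vectors would come from the vision embedder (nomic-embed-vision) and its paired text embedder, whose exact API is not shown on this page, so dummy vectors stand in here:

```csharp
using System;
using System.Linq;

// Text query vector and image gallery vectors; in practice these come
// from the aligned text and vision embedders mentioned above.
float[] query = { 0.1f, 0.9f };
float[][] gallery =
{
    new float[] { 0.9f, 0.1f },
    new float[] { 0.2f, 0.8f },
};

// Rank image vectors against the text query and keep the best match.
var best = gallery
    .Select((v, i) => (Index: i, Score: Cosine(query, v)))
    .OrderByDescending(t => t.Score)
    .First();
Console.WriteLine($"best match: image {best.Index} (score {best.Score:F2})");
// prints: best match: image 1 (score 0.99)

// Cosine similarity over two aligned vectors.
static float Cosine(float[] a, float[] b)
{
    float dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(na) * MathF.Sqrt(nb));
}
```

The same ranking drives both directions (text query, image result and vice versa), which is what makes duplicate detection and multimodal RAG fall out of one index.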
Auto-label millions of media assets with consistent tags. No cloud quota, no per-call cost. Re-tag with new taxonomies on demand.
Manufacturing defect detection, classification, label OCR. Photos and screenshots never leave the factory floor.
Multi-label image flagging with custom categories. Fine-tune to your policy without sending user content to a third party.
Generate descriptive alt text on the user's device. Privacy-respecting accessibility for screen readers and translation.
Agents that read screenshots, parse dashboards, reason about diagrams, operate visual workflows. Vision plus tools in one SDK.
Knowledge bases that mix text and images. Retrieve a diagram by question, retrieve a paragraph by uploaded figure, ground answers in both modalities.
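A zero-shot moderation-style classifier can be sketched with the same conversation surface as the captioning example. The SDK's grammar-constrained sampling API is not shown on this page, so this sketch constrains the output by prompt only; treat the prompt wording and file name as assumptions:

```csharp
using System;
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Graphics;

var vlm = LM.LoadFromModelID("qwen3-vl:8b");
var chat = new SingleTurnConversation(vlm);

// Zero-shot, custom categories: ask for exactly one label from a fixed set.
// Production code would pin this down with grammar-constrained sampling;
// the plain-prompt version below is an illustrative approximation.
string[] categories = { "safe", "violent", "adult", "spam" };
string label = await chat.SubmitAsync(
    $"Classify this image. Answer with exactly one word from: {string.Join(", ", categories)}.",
    Attachment.FromFile("user-upload.jpg"));
```

Swapping the category array re-targets the classifier to a new taxonomy without retraining, which is what makes on-demand re-tagging cheap.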
Vision is its own pillar, but it threads through the rest of the stack. Here is where you will find it in practice.
Documents
Vision-driven OCR with layout awareness, table extraction, formula recognition. The flagship vision use case for document workflows.
Agents
Drop an Attachment into any tool input. Agents that read screenshots, parse charts, reason about diagrams.
RAG & Knowledge
Index figures, schematics, screenshots. Retrieve text by image, retrieve images by text. Vision-grounded answers.
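The document OCR path can be sketched with the same primitives. paddleocr-vl:0.9b is the model ID given in the catalog above; the dedicated structured-intent API is not shown on this page, so the Markdown-intent prompt wording and file name below are assumptions:

```csharp
using System;
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Graphics;

// Load the VLM-OCR model listed in the catalog above.
var ocr = LM.LoadFromModelID("paddleocr-vl:0.9b");
var chat = new SingleTurnConversation(ocr);

// Ask for one of the structured intents (here: Markdown with tables).
string markdown = await chat.SubmitAsync(
    "Transcribe this page to Markdown, preserving tables and formulas.",
    Attachment.FromFile("invoice-scan.png"));
Console.WriteLine(markdown);
```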
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.
Console demo: drop one or more images into a chat, stream the answer. Open on GitHub →
Console demo: extract structured text with bounding boxes from images. Open on GitHub →
Console demo: cross-modal retrieval with Nomic Embed Vision. Open on GitHub →
How-to guide: load a VLM, attach an image, ask questions, constrain output. Read the guide →
The seven pillars of LM-Kit.NET, plus the local runtime they share. Highlighted card is where you are now.
01 · AI Agents
ReAct planning, supervisors, parallel and pipeline orchestrators, persistent memory, MCP clients, custom tools.
02 · Document Intelligence
PDF text and table extraction, on-device OCR reaching SOTA benchmark scores, structured field extraction with grammar-constrained generation.
03 · Vision & Multimodal (you are here)
Image understanding, classification, labeling, multimodal chat, image embeddings, VLM-OCR, background removal. Same conversation surface as LLMs.
04 · RAG & Knowledge
Built-in vector store, Qdrant connector, embeddings, hybrid retrieval, document chunking, source citations.
05 · Text Analysis
Built-in classifiers and an extractor that emits typed C# objects via grammar-constrained sampling. Sentiment, keywords, language detection.
06 · Speech & Audio
A growing local speech-to-text stack: hallucination suppression, Voice Activity Detection, real-time translation, streaming output, 100+ languages.
07 · Text Generation
Single-turn, multi-turn, and stateless conversation primitives. Translate, correct, rewrite, summarise. Prompt templates, streaming, grammar-constrained outputs.
The foundation
Every capability above runs on this runtime.
Foundation
The runtime all seven pillars sit on. The LM-Kit.NET NuGet ships the complete inference system: open-weight LLMs, vision-language models, embeddings, on-device speech-to-text, OCR and classifiers, accelerated on CPU, AVX2, CUDA 12/13, Vulkan or Metal. One package, zero cloud calls, predictable latency, full data and technology sovereignty.
Image understanding, classification, labeling, multimodal chat, embeddings, OCR. One SDK. Five backends. Zero per-call cost.