Image understanding
Caption, describe, answer visual questions, reason about what's in the frame.
LM-Kit.NET runs the complete vision and multimodal surface on your hardware: vision-language models, image classification, image labeling, multimodal chat, image embeddings, VLM-driven OCR, background removal. One SDK, one inference engine, zero cloud calls.
Single-label classification, multi-label tagging, custom categories driven by a VLM.
Drop an Attachment into a conversation. Multi-turn, streaming, tool-augmented.
Cross-modal vectors aligned with text. VLM OCR with SOTA scores on public benchmarks.
Each one is a single class, a single line to load, and a single call to run. Mix any of them with the rest of the LM-Kit.NET surface (agents, RAG, document intelligence).
01 · Image understanding
Caption, describe, visual question answering, scene reasoning. Drop an image into a SingleTurnConversation and ask anything.
02 · Image classification
Single-label classification with custom categories. Zero-shot via VLM prompt, or fine-tuned via LoRA. Grammar-constrained output for deterministic labels.
03 · Image labeling
Multi-label tagging. Extract a set of relevant tags from an image, with confidence scores. Useful for asset catalogs, content moderation, indexing.
04 · Multimodal chat
Multimodal conversations. Multiple images, multi-turn, streaming tokens. Same MultiTurnConversation API as text-only chat.
05 · Image embeddings
Cross-modal vectors aligned with the text embedder. Text-to-image and image-to-image retrieval, multimodal RAG, duplicate detection.
06 · VLM-OCR
Vision-language-model OCR with structured intents: text, Markdown, tables, formulas, charts, coordinates, seals. SOTA accuracy on document-OCR benchmarks.
07 · Background removal & preprocessing
U2-Net and ModNet for background removal. Plus a full preprocessing toolkit: deskew, crop, resize, color correction. Built-in agent tools wrap each one.
The same conversation surface that powers every LLM in LM-Kit.NET also powers every VLM. No special code path, no separate API.
```csharp
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Graphics;

// 1. Load a vision-language model. Same call as any LLM.
var vlm = LM.LoadFromModelID("qwen3-vl:8b");

// 2. Wrap in a conversation. Same primitives as text-only.
var chat = new SingleTurnConversation(vlm);

// 3. Submit a prompt with one or more image attachments.
string answer = await chat.SubmitAsync(
    "Describe this image and list any visible defects.",
    Attachment.FromFile("part-photo.jpg"));

Console.WriteLine(answer);
```
Same code path for multi-turn chat, function calls, grammar-constrained outputs, RAG queries. The Attachment type accepts files, byte arrays, streams, and base64 strings. 5-minute quickstart.
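The multi-turn path looks the same. A minimal sketch, assuming the MultiTurnConversation constructor and SubmitAsync overload mirror the single-turn example above; the byte-array Attachment constructor shown here is an assumption (the page states Attachment accepts byte arrays, but does not show the exact overload):

```csharp
using System.IO;
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Graphics;

// Same model handle as the single-turn example.
var vlm = LM.LoadFromModelID("qwen3-vl:8b");

// Multi-turn: the conversation keeps history across submissions.
var chat = new MultiTurnConversation(vlm);

// First turn: attach an image loaded from a byte array.
// (Hypothetical constructor; the page only shows Attachment.FromFile.)
byte[] screenshot = File.ReadAllBytes("dashboard.png");
string first = await chat.SubmitAsync(
    "What metric is trending down in this dashboard?",
    new Attachment(screenshot));

// Follow-up turn: no need to re-attach; the image stays in context.
string second = await chat.SubmitAsync(
    "Suggest two likely causes for that trend.");
```

The point is that nothing changes versus text-only chat: history, streaming, and tool calls work the same once an image is in context.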
Today's vision stack. The catalog grows with every release; swap a model ID and the rest of your code keeps working.
VLM
General vision-language model. Strong at captioning, document Q&A, visual reasoning. Sizes from 2B to 72B. Model IDs qwen2-vl:2b and qwen3.5:9b.
VLM
Google's open vision-language model. Good multilingual coverage. Available in 1B, 4B, 12B, 27B sizes via the Gemma 3 model family.
VLM
Lightweight VLM optimised for OCR, document understanding, screenshot reasoning, function calling. Model ID glm-4.6v-flash.
OCR
LM-Kit's first-party OCR engine. SOTA benchmark scores on public document datasets. Fast on a single core, with optimised multi-threading.
OCR
VLM-based OCR with seven structured intents. Model ID paddleocr-vl:0.9b.
Embeddings
Image vectors aligned with nomic-embed-text. Cross-modal search: text query, image result and vice versa. Model ID nomic-embed-vision.
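Once text and image vectors live in the same space, retrieval is ordinary vector math. A sketch of the ranking step: the vectors would come from the vision embedder (nomic-embed-vision) and its paired text embedder, whose exact API is not shown on this page, so dummy vectors stand in here:

```csharp
using System;
using System.Linq;

// Text query vector and image gallery vectors; in practice these come
// from the aligned text and vision embedders mentioned above.
float[] query = { 0.1f, 0.9f };
float[][] gallery =
{
    new float[] { 0.9f, 0.1f },
    new float[] { 0.2f, 0.8f },
};

// Rank image vectors against the text query and keep the best match.
var best = gallery
    .Select((v, i) => (Index: i, Score: Cosine(query, v)))
    .OrderByDescending(t => t.Score)
    .First();
Console.WriteLine($"best match: image {best.Index} (score {best.Score:F2})");
// prints: best match: image 1 (score 0.99)

// Cosine similarity over two aligned vectors.
static float Cosine(float[] a, float[] b)
{
    float dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(na) * MathF.Sqrt(nb));
}
```

The same ranking drives both directions (text query, image result and vice versa), which is what makes duplicate detection and multimodal RAG fall out of one index.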
Auto-label millions of media assets with consistent tags. No cloud quota, no per-call cost. Re-tag with new taxonomies on demand.
Manufacturing defect detection, classification, label OCR. Photos and screenshots never leave the factory floor.
Multi-label image flagging with custom categories. Fine-tune to your policy without sending user content to a third party.
Generate descriptive alt text on the user's device. Privacy-respecting accessibility for screen readers and translation.
Agents that read screenshots, parse dashboards, reason about diagrams, operate visual workflows. Vision plus tools in one SDK.
Knowledge bases that mix text and images. Retrieve a diagram by question, retrieve a paragraph by uploaded figure, ground answers in both modalities.
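A zero-shot moderation-style classifier can be sketched with the same conversation surface as the captioning example. The SDK's grammar-constrained sampling API is not shown on this page, so this sketch constrains the output by prompt only; treat the prompt wording and file name as assumptions:

```csharp
using System;
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Graphics;

var vlm = LM.LoadFromModelID("qwen3-vl:8b");
var chat = new SingleTurnConversation(vlm);

// Zero-shot, custom categories: ask for exactly one label from a fixed set.
// Production code would pin this down with grammar-constrained sampling;
// the plain-prompt version below is an illustrative approximation.
string[] categories = { "safe", "violent", "adult", "spam" };
string label = await chat.SubmitAsync(
    $"Classify this image. Answer with exactly one word from: {string.Join(", ", categories)}.",
    Attachment.FromFile("user-upload.jpg"));
```

Swapping the category array re-targets the classifier to a new taxonomy without retraining, which is what makes on-demand re-tagging cheap.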
Vision is its own pillar, but it threads through the rest of the stack. Here is where you will find it in practice.
Documents
Vision-driven OCR with layout awareness, table extraction, formula recognition. The flagship vision use case for document workflows.
Agents
Drop an Attachment into any tool input. Agents that read screenshots, parse charts, reason about diagrams.
RAG & Knowledge
Index figures, schematics, screenshots. Retrieve text by image, retrieve images by text. Vision-grounded answers.
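The document OCR path can be sketched with the same primitives. paddleocr-vl:0.9b is the model ID given in the catalog above; the dedicated structured-intent API is not shown on this page, so the Markdown-intent prompt wording and file name below are assumptions:

```csharp
using System;
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Graphics;

// Load the VLM-OCR model listed in the catalog above.
var ocr = LM.LoadFromModelID("paddleocr-vl:0.9b");
var chat = new SingleTurnConversation(ocr);

// Ask for one of the structured intents (here: Markdown with tables).
string markdown = await chat.SubmitAsync(
    "Transcribe this page to Markdown, preserving tables and formulas.",
    Attachment.FromFile("invoice-scan.png"));
Console.WriteLine(markdown);
```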
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.
Console demo: drop one or more images into a chat, stream the answer. Open on GitHub →
Console demo: extract structured text with bounding boxes from images. Open on GitHub →
Console demo: cross-modal retrieval with Nomic Embed Vision. Open on GitHub →
How-to guide: load a VLM, attach an image, ask questions, constrain output. Read the guide →
The seven pillars of LM-Kit.NET, plus the local runtime they share. Highlighted card is where you are now.
01 · AI Agents
ReAct planning, supervisors, parallel and pipeline orchestrators, persistent memory, MCP clients, custom tools.
02 · Document Intelligence
PDF text and table extraction, on-device OCR reaching SOTA benchmark scores, structured field extraction with grammar-constrained generation.
03 · Vision & Multimodal (you are here)
Image understanding, classification, labeling, multimodal chat, image embeddings, VLM-OCR, background removal. Same conversation surface as LLMs.
04 · RAG & Knowledge
Built-in vector store, Qdrant connector, embeddings, hybrid retrieval, document chunking, source citations.
05 · Text Analysis
Built-in classifiers and an extractor that emits typed C# objects via grammar-constrained sampling. Sentiment, keywords, language detection.
06 · Speech & Audio
A growing local speech-to-text stack: hallucination suppression, Voice Activity Detection, real-time translation, streaming output, 100+ languages.
07 · Text Generation
Single-turn, multi-turn, and stateless conversation primitives. Translate, correct, rewrite, summarise. Prompt templates, streaming, grammar-constrained outputs.
The foundation
Every capability above runs on this runtime.
Foundation
The runtime all seven pillars sit on. The LM-Kit.NET NuGet ships the complete inference system: open-weight LLMs, vision-language models, embeddings, on-device speech-to-text, OCR and classifiers, accelerated on CPU, AVX2, CUDA 12/13, Vulkan or Metal. One package, zero cloud calls, predictable latency, full data and technology sovereignty.
Image understanding, classification, labeling, multimodal chat, embeddings, OCR. One SDK. Five backends. Zero per-call cost.