Solutions · Vision & Multimodal

The full local vision stack for .NET.

LM-Kit.NET runs the complete vision and multimodal surface on your hardware: vision-language models, image classification, image labeling, multimodal chat, image embeddings, VLM-driven OCR, background removal. One SDK, one inference engine, zero cloud calls.

VLMs: Qwen2-VL, Gemma3-VL, GLM-V
OCR: LMKit OCR, PaddleOCR-VL
Embeddings: Nomic Embed Vision
Capability

Image understanding

Caption, describe, answer visual questions, reason about what's in the frame.

Capability

Image classification & labeling

Single-label classification, multi-label tagging, custom categories driven by a VLM.

Capability

Chat with image

Drop an Attachment into a conversation. Multi-turn, streaming, tool-augmented.

Capability

Image embeddings & OCR

Cross-modal vectors aligned with text. VLM OCR with SOTA scores on public benchmarks.

Capabilities

Seven dedicated vision capabilities.

Each one is a single class, a single line to load, and a single call to run. Mix any of them with the rest of the LM-Kit.NET surface (agents, RAG, document intelligence).
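
To make that pattern concrete, here is a hedged sketch of custom-category labeling driven through the generic conversation surface shown in "How it works" below. The category list and prompt wording are illustrative only, and LM-Kit's dedicated classification classes may expose a more direct call.

LabelingSketch.cs
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Graphics;

// Hedged sketch: multi-label tagging with a custom taxonomy, driven through
// the generic VLM conversation surface. The categories and prompt wording
// are illustrative, not part of the LM-Kit API.
var vlm = LM.LoadFromModelID("qwen3-vl:8b");
var chat = new SingleTurnConversation(vlm);

string[] categories = { "scratch", "dent", "discoloration", "none" };

// Ask the model to answer only with labels drawn from our taxonomy.
string labels = await chat.SubmitAsync(
    $"Tag this image using only these labels: {string.Join(", ", categories)}. " +
    "Reply with a comma-separated list.",
    Attachment.FromFile("part-photo.jpg"));

Console.WriteLine(labels); // e.g. "scratch, discoloration"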

How it works

Load a VLM, attach an image, get a result.

The same conversation surface that powers every LLM in LM-Kit.NET also powers every VLM. No special code path, no separate API.

VisionExample.cs
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Graphics;

// 1. Load a vision-language model. Same call as any LLM.
var vlm = LM.LoadFromModelID("qwen3-vl:8b");

// 2. Wrap in a conversation. Same primitives as text-only.
var chat = new SingleTurnConversation(vlm);

// 3. Submit a prompt with one or more image attachments.
string answer = await chat.SubmitAsync(
    "Describe this image and list any visible defects.",
    Attachment.FromFile("part-photo.jpg"));

Console.WriteLine(answer);

Same code path for multi-turn chat, function calls, grammar-constrained outputs, RAG queries. The Attachment type accepts files, byte arrays, streams, and base64 strings. 5-minute quickstart.
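
As a hedged sketch of that multi-turn path: the snippet below assumes a MultiTurnConversation class mirrors the single-turn surface above, and that an image attached in the first turn stays in the conversation context; both are assumptions, not confirmed API.

MultiTurnSketch.cs
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Graphics;

// Hedged sketch: multi-turn chat over an image. We assume MultiTurnConversation
// mirrors the SingleTurnConversation surface shown above.
var vlm = LM.LoadFromModelID("qwen3-vl:8b");
var chat = new MultiTurnConversation(vlm);

// Turn 1: ground the conversation in an image.
string first = await chat.SubmitAsync(
    "What product is shown in this photo?",
    Attachment.FromFile("shelf.jpg"));

// Turn 2: follow up without re-attaching; we assume the image remains in context.
string second = await chat.SubmitAsync("Is its label damaged or misprinted?");

Console.WriteLine(first);
Console.WriteLine(second);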

Models

Vision models that run locally.

Today's vision stack. The catalog grows with every release; swap a model ID and the rest of your code keeps working.
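
A minimal illustration of that swap, using model IDs from the cards below; only the ID string changes, the surrounding conversation code does not.

ModelSwapSketch.cs
using LMKit.Model;
using LMKit.TextGeneration;

// Swapping models is a one-string change; the conversation code is untouched.
// Both IDs below come from the catalog cards in this section.
var vlm = LM.LoadFromModelID("qwen2-vl:2b");      // general-purpose VLM
// var vlm = LM.LoadFromModelID("glm-4.6v-flash"); // OCR-leaning VLM, same code

var chat = new SingleTurnConversation(vlm);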

VLM

Qwen2-VL

General vision-language model. Strong at captioning, document Q&A, and visual reasoning. Sizes from 2B to 72B, starting at model ID qwen2-vl:2b.

VLM

Gemma 3 Vision

Google's open vision-language model. Good multilingual coverage. Vision-capable in 4B, 12B and 27B sizes via the Gemma 3 model family (the 1B variant is text-only).

VLM

GLM-V 4.6 Flash

Lightweight VLM optimised for OCR, document understanding, screenshot reasoning, function calling. Model ID glm-4.6v-flash.

OCR

LMKit OCR

LM-Kit's first-party OCR engine. SOTA benchmark scores on public document datasets. Fast on a single core, with optimised multi-threading when more are available.

OCR

PaddleOCR-VL

VLM-based OCR with seven structured intents. Model ID paddleocr-vl:0.9b.

Embeddings

Nomic Embed Vision

Image vectors aligned with nomic-embed-text. Cross-modal search: text query to image result, and vice versa. Model ID nomic-embed-vision.
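
A hedged sketch of what "cross-modal vectors aligned with text" means in practice. The Embedder class name, its constructor, and the GetEmbeddings overloads below are assumptions, not confirmed LM-Kit surface; the cosine similarity itself is plain C#.

CrossModalSketch.cs
using LMKit.Model;
using LMKit.Graphics;
using LMKit.Embeddings; // assumed namespace

// Hedged sketch: because image and text vectors share one space, relevance
// between a text query and an image reduces to a cosine similarity.
var model = LM.LoadFromModelID("nomic-embed-vision");
var embedder = new Embedder(model);                                              // assumption
float[] image = embedder.GetEmbeddings(Attachment.FromFile("diagram.png"));      // assumption
float[] query = embedder.GetEmbeddings("wiring diagram for the pump controller"); // assumption

Console.WriteLine($"similarity: {Cosine(image, query):F3}");

// Cosine similarity over the shared space: no SDK dependency.
static float Cosine(float[] a, float[] b)
{
    float dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
    return dot / (MathF.Sqrt(na) * MathF.Sqrt(nb));
}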

Browse the full catalog

Use cases

What you build with local vision.

Asset tagging & DAM

Auto-label millions of media assets with consistent tags. No cloud quota, no per-call cost. Re-tag with new taxonomies on demand.

Visual quality control

Manufacturing defect detection, classification, label OCR. Photos and screenshots never leave the factory floor.

Content moderation

Multi-label image flagging with custom categories. Fine-tune to your policy without sending user content to a third party.

Accessibility & alt text

Generate descriptive alt text on the user's device. Privacy-respecting accessibility for screen readers and translation.

Visual agents

Agents that read screenshots, parse dashboards, reason about diagrams, operate visual workflows. Vision plus tools in one SDK.

Multimodal RAG

Knowledge bases that mix text and images. Retrieve a diagram by question, retrieve a paragraph by uploaded figure, ground answers in both modalities.
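
A hedged sketch of the retrieval half of that loop: index image vectors, query with text, rank by similarity. It reuses the assumed Embedder surface from the embeddings sketch above and repeats the same Cosine helper for self-containment; the in-memory index is plain C#, standing in for a real vector store.

MultimodalRagSketch.cs
using LMKit.Model;
using LMKit.Graphics;
using LMKit.Embeddings; // assumed namespace, as in the embeddings sketch

// Hedged sketch: retrieve a diagram by question. Embedder calls are
// assumptions; indexing and ranking are plain C#. A production system would
// persist vectors instead of re-embedding on every run.
var embedder = new Embedder(LM.LoadFromModelID("nomic-embed-vision")); // assumption

var index = new List<(string Path, float[] Vec)>();
foreach (var path in Directory.EnumerateFiles("figures", "*.png"))
    index.Add((path, embedder.GetEmbeddings(Attachment.FromFile(path)))); // assumption

float[] query = embedder.GetEmbeddings("architecture diagram of the ingest pipeline"); // assumption
var best = index.OrderByDescending(e => Cosine(e.Vec, query)).First();
Console.WriteLine($"closest figure: {best.Path}");

static float Cosine(float[] a, float[] b)
{
    float dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
    return dot / (MathF.Sqrt(na) * MathF.Sqrt(nb));
}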

LM-Kit.NET pillars

Seven pillars, one foundation.

The seven pillars of LM-Kit.NET, plus the local runtime they share. The highlighted card is where you are now.

The foundation

Every capability above runs on this runtime.

Foundation

Local Inference

The runtime all seven pillars sit on. The LM-Kit.NET NuGet ships the complete inference system: open-weight LLMs, vision-language models, embeddings, on-device speech-to-text, OCR and classifiers, accelerated on CPU, AVX2, CUDA 12/13, Vulkan or Metal. One package, zero cloud calls, predictable latency, full data and technology sovereignty.

Explore the foundation

Ship vision without the cloud.

Image understanding, classification, labeling, multimodal chat, embeddings, OCR. One SDK. Five backends. Zero per-call cost.

Start in 5 minutes

Download LM-Kit.NET