Image understanding, on-device.

Caption, describe, classify, and answer questions about any image using vision-language models that run on your hardware. Same conversation surface as text-only chat. No upload, no API key, no per-call billing.

Captioning · Visual Q&A · Scene reasoning
Model

qwen3-vl:8b

Default VLM. Strong general image understanding.

Model

glm-4.6v-flash

Lightweight, fast, function-calling capable.

Model

Gemma 4 Vision

Open-weight VLM, multilingual.

What it does

Four tasks, one prompt away.

01

Captioning

Generate accurate captions for any image. Short sentences for asset libraries, long-form for accessibility, structured for catalogs; a prompt-only sketch follows these cards.

02

Visual Q&A

Ask a question about an image and get an answer grounded in what the model sees. Counts objects, identifies attributes, reads signage.

03

Scene reasoning

Multi-step reasoning over visual content: diagrams, charts, layouts, screenshots. Pair with reasoning models for harder tasks; a chart example continues the How-it-works snippet below.

04

Structured output

Constrain output to a JSON schema or BNF grammar. Get typed objects from images with deterministic shape, no post-processing.
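
The three caption styles in card 01 need no separate API; only the prompt changes. A minimal sketch reusing the calls from the How-it-works example below; the model ID, file name, and prompt wording are illustrative, not prescribed.

CaptionStyles.cs
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Graphics;

var vlm  = LM.LoadFromModelID("qwen3-vl:8b");
var chat = new SingleTurnConversation(vlm);

// Short sentence for an asset library.
var shortCaption = await chat.SubmitAsync(
    "Caption this image in at most eight words.",
    Attachment.FromFile("asset.jpg"));

// Long-form description for accessibility alt text.
var altText = await chat.SubmitAsync(
    "Describe this image in two to three sentences for a screen-reader user.",
    Attachment.FromFile("asset.jpg"));

// Structured one-line caption for a catalog entry.
var catalogLine = await chat.SubmitAsync(
    "Caption this image as 'subject; setting; style' on one line.",
    Attachment.FromFile("asset.jpg"));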

How it works

One conversation, one attachment.

ImageUnderstanding.cs
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Graphics;

var vlm  = LM.LoadFromModelID("qwen3-vl:8b");
var chat = new SingleTurnConversation(vlm);

// 1. Caption an image.
var caption = await chat.SubmitAsync(
    "Caption this image in one sentence.",
    Attachment.FromFile("product.jpg"));

// 2. Visual Q&A on a second image.
var answer = await chat.SubmitAsync(
    "How many people are visible? Reply with a number only.",
    Attachment.FromFile("crowd.jpg"));

// 3. Structured output via grammar-constrained generation.
var grammar = Grammar.FromJsonSchema("""
    { "type": "object",
      "properties": {
        "subject":   { "type": "string" },
        "tags":      { "type": "array", "items": { "type": "string" } },
        "is_safe":   { "type": "boolean" } } }
    """);
var json = await chat.SubmitAsync(
    "Describe this image as JSON.",
    Attachment.FromFile("photo.jpg"),
    grammar);
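
Scene reasoning (card 03) uses the same call shape; the multi-step ask lives entirely in the prompt. A minimal continuation of the example above; the file name and question are illustrative.

// 4. Scene reasoning over a chart screenshot.
var trend = await chat.SubmitAsync(
    "Which series in this chart grows fastest, and roughly by what factor?",
    Attachment.FromFile("revenue-chart.png"));

Because the grammar pins the output to the schema, the result deserializes straight into a typed object with System.Text.Json. A standalone sketch with an illustrative payload; the record name and sample values are assumptions, not LM-Kit API.

ParseReport.cs
using System.Text.Json;

// Illustrative payload matching the schema from the grammar-constrained call.
var jsonText = """{ "subject": "red bicycle", "tags": ["outdoor", "vehicle"], "is_safe": true }""";

var report = JsonSerializer.Deserialize<ImageReport>(jsonText,
    new JsonSerializerOptions { PropertyNameCaseInsensitive = true });
Console.WriteLine($"{report!.Subject}: safe={report.Is_Safe}");

// Mirrors the JSON schema above.
record ImageReport(string Subject, string[] Tags, bool Is_Safe);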

Use cases

Where image understanding belongs.

Accessibility

Generate alt text on the user's device for screen readers, captions for video frames, descriptive narration.

Media catalogs

Bulk-caption asset libraries, generate descriptions for stock images, build searchable indexes from photo archives. A batch sketch follows these cards.

Customer support

Read screenshots customers attach to tickets. Triage automatically before a human looks.

Compliance & safety

Describe visual content for review queues without sending the original image to a third-party endpoint.
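
For the catalog and archive cases, the single-image call scales to a plain loop. A minimal sketch reusing the documented API; the folder name and prompt are placeholders, and the string conversion of the result is an assumption to adapt to the actual return type.

BulkCaption.cs
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Graphics;

var vlm  = LM.LoadFromModelID("qwen3-vl:8b");
var chat = new SingleTurnConversation(vlm);

// Caption every JPEG in a folder and write a searchable index.
var lines = new List<string>();
foreach (var path in Directory.EnumerateFiles("assets", "*.jpg"))
{
    var caption = await chat.SubmitAsync(
        "Caption this image in one sentence.",
        Attachment.FromFile(path));

    // Assumption: the result renders its text via string interpolation;
    // adjust if SubmitAsync returns a richer result object.
    lines.Add($"{path}\t{caption}");
}
File.WriteAllLines("captions.tsv", lines);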

LM-Kit.NET pillars

Seven pillars, one foundation.

The seven pillars of LM-Kit.NET, plus the local runtime they share.

The foundation

Every capability above runs on this runtime.

Foundation

Local Inference

The runtime all seven pillars sit on. The LM-Kit.NET NuGet ships the complete inference system: open-weight LLMs, vision-language models, embeddings, on-device speech-to-text, OCR, and classifiers, accelerated on CPU via AVX2 and on GPU via CUDA 12/13, Vulkan, or Metal. One package, zero cloud calls, predictable latency, full data and technology sovereignty.

Explore the foundation

Caption, ask, or constrain.

Start in 5 minutes
Back to Vision hub