Image understanding, on-device.

Caption, describe, classify, and answer questions about any image using vision-language models that run on your hardware. Same conversation surface as text-only chat. No upload, no API key, no per-call billing.

Captioning · Visual Q&A · Scene reasoning
Model

qwen3-vl:8b

Default VLM. Strong general image understanding.

Model

glm-4.6v-flash

Lightweight, fast, function-calling capable.

Model

Gemma 4 Vision

Open-weight VLM, multilingual.

What it does

Four tasks, one prompt away.

01

Captioning

Generate accurate captions for any image. Short sentences for asset libraries, long-form for accessibility, structured for catalogs; a prompt-only sketch follows these cards.

02

Visual Q&A

Ask a question about an image and get an answer grounded in what the model sees. Counts objects, identifies attributes, reads signage.

03

Scene reasoning

Multi-step reasoning over visual content: diagrams, charts, layouts, screenshots. Pair with reasoning models for harder tasks; a chart example continues the How-it-works snippet below.

04

Structured output

Constrain output to a JSON schema or BNF grammar. Get typed objects from images with deterministic shape, no post-processing.
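
The three caption styles in card 01 need no separate API; only the prompt changes. A minimal sketch reusing the calls from the How-it-works example below; the model ID, file name, and prompt wording are illustrative, not prescribed.

CaptionStyles.cs
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Graphics;

var vlm  = LM.LoadFromModelID("qwen3-vl:8b");
var chat = new SingleTurnConversation(vlm);

// Short sentence for an asset library.
var shortCaption = await chat.SubmitAsync(
    "Caption this image in at most eight words.",
    Attachment.FromFile("asset.jpg"));

// Long-form description for accessibility alt text.
var altText = await chat.SubmitAsync(
    "Describe this image in two to three sentences for a screen-reader user.",
    Attachment.FromFile("asset.jpg"));

// Structured one-line caption for a catalog entry.
var catalogLine = await chat.SubmitAsync(
    "Caption this image as 'subject; setting; style' on one line.",
    Attachment.FromFile("asset.jpg"));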

How it works

One conversation, one attachment.

ImageUnderstanding.cs
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Graphics;

var vlm  = LM.LoadFromModelID("qwen3-vl:8b");
var chat = new SingleTurnConversation(vlm);

// 1. Caption an image.
var caption = await chat.SubmitAsync(
    "Caption this image in one sentence.",
    Attachment.FromFile("product.jpg"));

// 2. Visual Q&A on a second image.
var answer = await chat.SubmitAsync(
    "How many people are visible? Reply with a number only.",
    Attachment.FromFile("crowd.jpg"));

// 3. Structured output via grammar-constrained generation.
var grammar = Grammar.FromJsonSchema("""
    { "type": "object",
      "properties": {
        "subject":   { "type": "string" },
        "tags":      { "type": "array", "items": { "type": "string" } },
        "is_safe":   { "type": "boolean" } } }
    """);
var json = await chat.SubmitAsync(
    "Describe this image as JSON.",
    Attachment.FromFile("photo.jpg"),
    grammar);
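
Scene reasoning (card 03) uses the same call shape; the multi-step ask lives entirely in the prompt. A minimal continuation of the example above; the file name and question are illustrative.

// 4. Scene reasoning over a chart screenshot.
var trend = await chat.SubmitAsync(
    "Which series in this chart grows fastest, and roughly by what factor?",
    Attachment.FromFile("revenue-chart.png"));

Because the grammar pins the output to the schema, the result deserializes straight into a typed object with System.Text.Json. A standalone sketch with an illustrative payload; the record name and sample values are assumptions, not LM-Kit API.

ParseReport.cs
using System.Text.Json;

// Illustrative payload matching the schema from the grammar-constrained call.
var jsonText = """{ "subject": "red bicycle", "tags": ["outdoor", "vehicle"], "is_safe": true }""";

var report = JsonSerializer.Deserialize<ImageReport>(jsonText,
    new JsonSerializerOptions { PropertyNameCaseInsensitive = true });
Console.WriteLine($"{report!.Subject}: safe={report.Is_Safe}");

// Mirrors the JSON schema above.
record ImageReport(string Subject, string[] Tags, bool Is_Safe);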

Use cases

Where image understanding belongs.

Accessibility

Generate alt text on the user's device for screen readers, captions for video frames, descriptive narration.

Media catalogs

Bulk-caption asset libraries, generate descriptions for stock images, build searchable indexes from photo archives. A batch sketch follows these cards.

Customer support

Read screenshots customers attach to tickets. Triage automatically before a human looks.

Compliance & safety

Describe visual content for review queues without sending the original image to a third-party endpoint.
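
For the catalog and archive cases, the single-image call scales to a plain loop. A minimal sketch reusing the documented API; the folder name and prompt are placeholders, and the string conversion of the result is an assumption to adapt to the actual return type.

BulkCaption.cs
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Graphics;

var vlm  = LM.LoadFromModelID("qwen3-vl:8b");
var chat = new SingleTurnConversation(vlm);

// Caption every JPEG in a folder and write a searchable index.
var lines = new List<string>();
foreach (var path in Directory.EnumerateFiles("assets", "*.jpg"))
{
    var caption = await chat.SubmitAsync(
        "Caption this image in one sentence.",
        Attachment.FromFile(path));

    // Assumption: the result renders its text via string interpolation;
    // adjust if SubmitAsync returns a richer result object.
    lines.Add($"{path}\t{caption}");
}
File.WriteAllLines("captions.tsv", lines);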

LM-Kit.NET pillars

Seven pillars, one foundation.

The seven pillars of LM-Kit.NET, plus the local runtime they share.

The foundation

Every capability above runs on this runtime.

Foundation

Local Inference

The runtime all seven pillars sit on. The LM-Kit.NET NuGet ships the complete inference system: open-weight LLMs, vision-language models, embeddings, on-device speech-to-text, OCR, and classifiers, accelerated on CPU via AVX2 and on GPU via CUDA 12/13, Vulkan, or Metal. One package, zero cloud calls, predictable latency, full data and technology sovereignty.

Explore the foundation

Caption, ask, or constrain.

Start in 5 minutes
Back to Vision hub