Chat with Image

Chat with images, multi-turn, streaming.

Attach one or more images to a conversation, ask follow-up questions, stream the response, and call tools, all through the same MultiTurnConversation primitive that drives every text-only chat in LM-Kit.NET. The model holds visual context across turns.

Multi-turn · Streaming tokens · Multiple images per turn
Class

MultiTurnConversation

History-aware chat with image attachments.

Type

Attachment

File, byte array, stream, or base64.
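The sample further down this page constructs attachments with Attachment.FromFile. A sketch of the other sources listed above might look like this; note that every factory name other than FromFile is an assumption here, not confirmed API, so check the LM-Kit.NET reference for the exact overloads:

```csharp
using System;
using System.IO;
using LMKit.Graphics;

// FromFile appears in the page's sample; the byte-array, stream, and
// base64 variants below are hypothetical names for illustration only.
var bytes = File.ReadAllBytes("photo-1.jpg");

var fromFile   = Attachment.FromFile("photo-1.jpg");
var fromBytes  = Attachment.FromBytes(bytes);                          // hypothetical
var fromStream = Attachment.FromStream(File.OpenRead("photo-1.jpg")); // hypothetical
var fromBase64 = Attachment.FromBase64(Convert.ToBase64String(bytes)); // hypothetical
```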

Feature

Streaming

Tokens stream in real time for responsive UI.

Features

Everything text chat does, plus images.

01

Multiple images per turn

Attach several images to a single message. The VLM reasons across them: "compare these two", "which one is closer to spec".
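A minimal sketch of a two-image turn, assuming SubmitAsync accepts more than one attachment (the sample further down shows only the single-attachment form, so verify the exact signature against the API reference):

```csharp
using LMKit.Graphics;
using LMKit.Model;
using LMKit.TextGeneration;

var vlm = LM.LoadFromModelID("qwen3-vl:8b");
var chat = new MultiTurnConversation(vlm);

// Hypothetical multi-attachment call: both images land in one message,
// so the VLM can compare them directly.
var answer = await chat.SubmitAsync(
    "Compare these two parts. Which one is closer to spec?",
    Attachment.FromFile("part-a.jpg"),
    Attachment.FromFile("part-b.jpg"));
```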

02

History across turns

The model remembers images from earlier turns. Ask "and what about the second photo I sent?" three messages later.
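Because MultiTurnConversation carries the full history, a later turn can refer back to earlier attachments with no extra plumbing. A sketch using the same calls as the sample below:

```csharp
using LMKit.Graphics;
using LMKit.Model;
using LMKit.TextGeneration;

var vlm = LM.LoadFromModelID("qwen3-vl:8b");
var chat = new MultiTurnConversation(vlm);

// Turns 1 and 2 attach photos.
await chat.SubmitAsync("Here is the first photo.", Attachment.FromFile("photo-1.jpg"));
await chat.SubmitAsync("And here is the second.", Attachment.FromFile("photo-2.jpg"));

// Turn 3 sends no attachment at all; "the second photo" resolves
// from conversation history.
var recall = await chat.SubmitAsync("What about the second photo I sent?");
```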

03

Token streaming

Stream tokens as the model generates. Build responsive chat UIs without batching the whole answer.

04

Tool calling

VLMs that support function calling (like GLM-V Flash) can call your tools mid-conversation with image context.
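The exact tool-registration API is not shown on this page, so the sketch below is purely illustrative: RegisterTool, its parameters, and the PartsDb helper are all hypothetical names standing in for whatever LM-Kit.NET's function-calling surface actually provides.

```csharp
using LMKit.Graphics;
using LMKit.Model;
using LMKit.TextGeneration;

var vlm = LM.LoadFromModelID("glm-v-flash"); // model ID is an assumption
var chat = new MultiTurnConversation(vlm);

// Hypothetical registration call: expose one function the model may
// invoke mid-conversation. Consult the LM-Kit.NET function-calling
// docs for the real registration mechanism.
chat.RegisterTool(
    name: "lookup_part",
    description: "Look up a part number in the parts database.",
    handler: (string partNumber) => PartsDb.Find(partNumber)); // PartsDb is hypothetical

// The VLM can now decide to call lookup_part while reasoning about the image.
var result = await chat.SubmitAsync(
    "Identify the part in this photo and pull its spec sheet.",
    Attachment.FromFile("photo-3.jpg"));
```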

How it works

One conversation, any number of images.

ChatWithImage.cs
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Graphics;

var vlm = LM.LoadFromModelID("qwen3-vl:8b");
var chat = new MultiTurnConversation(vlm)
{
    SystemMessage = "You are a careful visual inspector."
};

// Turn 1: send an image.
var first = await chat.SubmitAsync(
    "Describe this part and flag any defects.",
    Attachment.FromFile("photo-1.jpg"));

// Turn 2: send a second image, reference the first.
var second = await chat.SubmitAsync(
    "This one is from the same batch. Compare to the first.",
    Attachment.FromFile("photo-2.jpg"));

// Turn 3: stream the verdict.
await foreach (var token in chat.StreamAsync(
        "Verdict: which one ships, which one goes back?"))
{
    Console.Write(token.Text);
}
Use cases

Where chat-with-image changes the UX.

Customer support

User drops a screenshot, the assistant explains what to do. Multi-turn back-and-forth without ever leaving the device.

Field inspection

Technicians snap photos of equipment, the on-device assistant reasons across them, drafts a report. Works in low-connectivity sites.

Visual code review

Paste a screenshot of a UI bug; the assistant walks the user through the fix while preserving conversation history.

Healthcare intake

Patient uploads photos of symptoms; on-device VLM drafts triage notes. PHI never crosses the network boundary.

LM-Kit.NET pillars

Seven pillars, one foundation.

The seven pillars of LM-Kit.NET, plus the local runtime they share.

The foundation

Every capability above runs on this runtime.

Foundation

Local Inference

The runtime all seven pillars sit on. The LM-Kit.NET NuGet ships the complete inference system: open-weight LLMs, vision-language models, embeddings, on-device speech-to-text, OCR, and classifiers, accelerated on CPU (AVX2), CUDA 12/13, Vulkan, or Metal. One package, zero cloud calls, predictable latency, full data and technology sovereignty.

Explore the foundation

Image-aware chat, same primitives.

Start in 5 minutes · Back to Vision hub