Chat with Image

Chat with images, multi-turn, streaming.

Attach one or more images to a conversation, ask follow-up questions, stream the response, and call tools, all through the same MultiTurnConversation primitive that drives every text-only chat in LM-Kit.NET. The model holds visual context across turns.

Multi-turn · Streaming tokens · Multiple images per turn
Class

MultiTurnConversation

History-aware chat with image attachments.

Type

Attachment

File, byte array, stream, or base64.
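The sample further down this page constructs attachments with Attachment.FromFile. A sketch of the other sources listed above might look like this; note that every factory name other than FromFile is an assumption here, not confirmed API, so check the LM-Kit.NET reference for the exact overloads:

```csharp
using System;
using System.IO;
using LMKit.Graphics;

// FromFile appears in the page's sample; the byte-array, stream, and
// base64 variants below are hypothetical names for illustration only.
var bytes = File.ReadAllBytes("photo-1.jpg");

var fromFile   = Attachment.FromFile("photo-1.jpg");
var fromBytes  = Attachment.FromBytes(bytes);                          // hypothetical
var fromStream = Attachment.FromStream(File.OpenRead("photo-1.jpg")); // hypothetical
var fromBase64 = Attachment.FromBase64(Convert.ToBase64String(bytes)); // hypothetical
```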

Feature

Streaming

Tokens stream in real time for responsive UI.

Features

Everything text chat does, plus images.

01

Multiple images per turn

Attach several images to a single message. The VLM reasons across them: "compare these two", "which one is closer to spec".
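A minimal sketch of a two-image turn, assuming SubmitAsync accepts more than one attachment (the sample further down shows only the single-attachment form, so verify the exact signature against the API reference):

```csharp
using LMKit.Graphics;
using LMKit.Model;
using LMKit.TextGeneration;

var vlm = LM.LoadFromModelID("qwen3-vl:8b");
var chat = new MultiTurnConversation(vlm);

// Hypothetical multi-attachment call: both images land in one message,
// so the VLM can compare them directly.
var answer = await chat.SubmitAsync(
    "Compare these two parts. Which one is closer to spec?",
    Attachment.FromFile("part-a.jpg"),
    Attachment.FromFile("part-b.jpg"));
```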

02

History across turns

The model remembers images from earlier turns. Ask "and what about the second photo I sent?" three messages later.
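Because MultiTurnConversation carries the full history, a later turn can refer back to earlier attachments with no extra plumbing. A sketch using the same calls as the sample below:

```csharp
using LMKit.Graphics;
using LMKit.Model;
using LMKit.TextGeneration;

var vlm = LM.LoadFromModelID("qwen3-vl:8b");
var chat = new MultiTurnConversation(vlm);

// Turns 1 and 2 attach photos.
await chat.SubmitAsync("Here is the first photo.", Attachment.FromFile("photo-1.jpg"));
await chat.SubmitAsync("And here is the second.", Attachment.FromFile("photo-2.jpg"));

// Turn 3 sends no attachment at all; "the second photo" resolves
// from conversation history.
var recall = await chat.SubmitAsync("What about the second photo I sent?");
```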

03

Token streaming

Stream tokens as the model generates. Build responsive chat UIs without batching the whole answer.

04

Tool calling

VLMs that support function calling (like GLM-V Flash) can call your tools mid-conversation with image context.
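The exact tool-registration API is not shown on this page, so the sketch below is purely illustrative: RegisterTool, its parameters, and the PartsDb helper are all hypothetical names standing in for whatever LM-Kit.NET's function-calling surface actually provides.

```csharp
using LMKit.Graphics;
using LMKit.Model;
using LMKit.TextGeneration;

var vlm = LM.LoadFromModelID("glm-v-flash"); // model ID is an assumption
var chat = new MultiTurnConversation(vlm);

// Hypothetical registration call: expose one function the model may
// invoke mid-conversation. Consult the LM-Kit.NET function-calling
// docs for the real registration mechanism.
chat.RegisterTool(
    name: "lookup_part",
    description: "Look up a part number in the parts database.",
    handler: (string partNumber) => PartsDb.Find(partNumber)); // PartsDb is hypothetical

// The VLM can now decide to call lookup_part while reasoning about the image.
var result = await chat.SubmitAsync(
    "Identify the part in this photo and pull its spec sheet.",
    Attachment.FromFile("photo-3.jpg"));
```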

How it works

One conversation, any number of images.

ChatWithImage.cs
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Graphics;

var vlm = LM.LoadFromModelID("qwen3-vl:8b");
var chat = new MultiTurnConversation(vlm)
{
    SystemMessage = "You are a careful visual inspector."
};

// Turn 1: send an image.
var first = await chat.SubmitAsync(
    "Describe this part and flag any defects.",
    Attachment.FromFile("photo-1.jpg"));

// Turn 2: send a second image, reference the first.
var second = await chat.SubmitAsync(
    "This one is from the same batch. Compare to the first.",
    Attachment.FromFile("photo-2.jpg"));

// Turn 3: stream the verdict.
await foreach (var token in chat.StreamAsync(
        "Verdict: which one ships, which one goes back?"))
{
    Console.Write(token.Text);
}
Use cases

Where chat-with-image changes the UX.

Customer support

User drops a screenshot, the assistant explains what to do. Multi-turn back-and-forth without ever leaving the device.

Field inspection

Technicians snap photos of equipment, the on-device assistant reasons across them, drafts a report. Works in low-connectivity sites.

Visual code review

Paste a screenshot of a UI bug; the assistant walks the user through the fix while preserving conversation history.

Healthcare intake

Patient uploads photos of symptoms; on-device VLM drafts triage notes. PHI never crosses the network boundary.

LM-Kit.NET pillars

Seven pillars, one foundation.

The seven pillars of LM-Kit.NET, plus the local runtime they share.

The foundation

Every capability above runs on this runtime.

Foundation

Local Inference

The runtime all seven pillars sit on. The LM-Kit.NET NuGet ships the complete inference system: open-weight LLMs, vision-language models, embeddings, on-device speech-to-text, OCR, and classifiers, accelerated on CPU (AVX2), CUDA 12/13, Vulkan, or Metal. One package, zero cloud calls, predictable latency, full data and technology sovereignty.

Explore the foundation

Image-aware chat, same primitives.

Start in 5 minutes · Back to Vision hub