qwen3.5:9b
Default VLM. Strong general image understanding.
Caption, describe, classify, and answer questions about any image using vision-language models that run on your hardware. Same conversation surface as text-only chat. No upload, no API key, no per-call billing.
glm-4.6v-flash
Lightweight, fast, function-calling capable.
Open-weight VLM, multilingual.
01 · Image Captioning
Generate accurate captions for any image. Short sentences for asset libraries, long-form for accessibility, structured for catalogs.
02 · Visual Q&A
Ask a question about an image and get an answer grounded in what the model sees. Counts objects, identifies attributes, reads signage.
03 · Visual Reasoning
Multi-step reasoning over visual content. Diagrams, charts, layouts, screenshots. Pair with reasoning models for harder tasks.
04 · Structured Output
Constrain output to a JSON schema or BNF grammar. Get typed objects from images with deterministic shape, no post-processing.
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Graphics;

var vlm = LM.LoadFromModelID("qwen3-vl:8b");
var chat = new SingleTurnConversation(vlm);

// 1. Caption an image.
var caption = await chat.SubmitAsync(
    "Caption this image in one sentence.",
    Attachment.FromFile("product.jpg"));

// 2. Visual Q&A on a second image.
var answer = await chat.SubmitAsync(
    "How many people are visible? Reply with a number only.",
    Attachment.FromFile("crowd.jpg"));

// 3. Structured output via grammar-constrained generation.
var grammar = Grammar.FromJsonSchema("""
{
  "type": "object",
  "properties": {
    "subject": { "type": "string" },
    "tags": { "type": "array", "items": { "type": "string" } },
    "is_safe": { "type": "boolean" }
  }
}
""");

var json = await chat.SubmitAsync(
    "Describe this image as JSON.",
    Attachment.FromFile("photo.jpg"),
    grammar);
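Because the grammar constrains generation, the JSON string returned by the last call always matches the schema and can be bound straight to a typed C# object. A minimal sketch using System.Text.Json; the record name and the sample payload below are illustrative, not part of the LM-Kit API:

```csharp
using System.Text.Json;

// Sample payload in the shape the schema above guarantees (illustrative).
var json = """{"subject":"mountain lake at dawn","tags":["nature","water","landscape"],"is_safe":true}""";

// SnakeCaseLower maps is_safe -> IsSafe, etc. (requires .NET 8+).
var options = new JsonSerializerOptions { PropertyNamingPolicy = JsonNamingPolicy.SnakeCaseLower };
var result = JsonSerializer.Deserialize<ImageDescription>(json, options)!;

Console.WriteLine($"{result.Subject}: {string.Join(", ", result.Tags)} (safe: {result.IsSafe})");

// Mirrors the JSON schema in the example above.
public record ImageDescription(string Subject, string[] Tags, bool IsSafe);
```

From here the result flows through the rest of the application as an ordinary object, with no string parsing or post-processing.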
Generate alt text on the user's device for screen readers, captions for video frames, descriptive narration.
Bulk-caption asset libraries, generate descriptions for stock images, build searchable indexes from photo archives.
Read screenshots customers attach to tickets. Triage automatically before a human looks.
Describe visual content for review queues without sending the original image to a third-party endpoint.
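The ticket-triage use case above can be sketched with the same primitives as the code example on this page. This is a hedged sketch, not a definitive recipe: the model ID, triage schema, prompt, and file name are illustrative assumptions.

```csharp
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Graphics;

// Hypothetical triage schema: constraining generation means every
// answer is machine-parseable, so triage can run unattended.
var triageGrammar = Grammar.FromJsonSchema("""
{
  "type": "object",
  "properties": {
    "category": { "type": "string", "enum": ["billing", "bug", "how_to", "other"] },
    "urgent": { "type": "boolean" }
  }
}
""");

var vlm = LM.LoadFromModelID("qwen3.5:9b");
var chat = new SingleTurnConversation(vlm);

// The screenshot is read locally; only the typed verdict leaves this code.
var verdict = await chat.SubmitAsync(
    "Classify this support screenshot as JSON.",
    Attachment.FromFile("ticket-screenshot.png"),
    triageGrammar);
```

Routing the parsed verdict into a queue is ordinary application code; no image ever leaves the machine.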
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.
Drop an image into a chat, ask questions, stream the response.
Open on GitHub →
How-to guide
Caption, describe, and answer visual questions with a VLM.
Read the guide →
How-to guide
Vision-grounded retrieval over images and scanned pages.
Read the guide →
API reference
The conversation primitive used for one-shot VLM prompts.
Open the reference →
The seven pillars of LM-Kit.NET, plus the local runtime they share. The highlighted card is where you are now.
01 · AI Agents
ReAct planning, supervisors, parallel and pipeline orchestrators, persistent memory, MCP clients, custom tools.
02 · Document Intelligence
PDF text and table extraction, on-device OCR reaching SOTA benchmark scores, structured field extraction with grammar-constrained generation.
03 · Vision & Multimodal
Image understanding, classification, labeling, multimodal chat, image embeddings, VLM-OCR, background removal. Same conversation surface as LLMs.
04 · RAG & Knowledge
Built-in vector store, Qdrant connector, embeddings, hybrid retrieval, document chunking, source citations.
05 · Text Analysis
Built-in classifiers and an extractor that emits typed C# objects via grammar-constrained sampling. Sentiment, keywords, language detection.
06 · Speech & Audio
A growing local speech-to-text stack: hallucination suppression, Voice Activity Detection, real-time translation, streaming output, 100+ languages.
07 · Text Generation
Single-turn, multi-turn, and stateless conversation primitives. Translate, correct, rewrite, summarise. Prompt templates, streaming, grammar-constrained outputs.
The foundation
Every capability above runs on this runtime.
Foundation
The runtime all seven pillars sit on. The LM-Kit.NET NuGet package ships the complete inference system: open-weight LLMs, vision-language models, embeddings, on-device speech-to-text, OCR, and classifiers, accelerated on CPU (AVX2), CUDA 12/13, Vulkan, or Metal. One package, zero cloud calls, predictable latency, full data and technology sovereignty.