VLM + Grammar
Vision-language model output constrained to a finite label set.
Pick a label from a set you define. Zero-shot via VLM prompting, grammar-constrained to your exact label list. No data upload, no training run, no per-call billing. Move to LoRA fine-tuning when your dataset grows.
Train an adapter once, hot-swap at runtime.
Token logprobs surface as per-label confidence.
A grammar constrains the VLM to emit one of your labels (and nothing else). The model picks the best match for the input image; the output is parseable, deterministic, and exactly one label from your set, every time.
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.Graphics;

var vlm = LM.LoadFromModelID("qwen3-vl:8b");
var chat = new SingleTurnConversation(vlm);

// 1. Define the allowed label set as a BNF grammar.
var labels = new[] { "defect", "clean", "borderline" };
var grammar = Grammar.FromAllowedValues(labels);

// 2. Submit the image and let the model pick a label.
var result = await chat.SubmitAsync(
    "Classify this part photo as one of: defect, clean, borderline.",
    Attachment.FromFile("part-photo.jpg"),
    grammar);

Console.WriteLine(result); // "defect", "clean", or "borderline"
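Because decoding never leaves the label set, the token logprob behind the emitted label (the per-label confidence mentioned above) doubles as a score you can threshold, for example to route low-confidence images to a human reviewer.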
Zero-shot
Give the VLM the label set in the prompt + a grammar. Works well when categories are visually distinct and well-named. The quickest path from idea to a running classifier.
Few-shot
Include 2-4 reference images per label in the prompt context (see the sketch after this list). Pushes accuracy without training infrastructure. Great for nuanced categories.
LoRA
Train a LoRA on labeled examples; hot-swap at runtime. Best accuracy when you have a domain-specific taxonomy and a few hundred images per class.
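A minimal sketch of the few-shot path, reusing the chat and grammar setup from the snippet above. The multi-attachment call and the reference file names are illustrative assumptions, not confirmed LM-Kit.NET API; check the SubmitAsync and Attachment overloads in the API reference.

// Hypothetical few-shot sketch; reuses vlm and chat from the snippet above.
// ASSUMPTION: SubmitAsync accepts several attachments in one call.
var labels = new[] { "defect", "clean", "borderline" };
var grammar = Grammar.FromAllowedValues(labels);

var attachments = new[]
{
    Attachment.FromFile("ref-defect.jpg"),   // known defect example
    Attachment.FromFile("ref-clean.jpg"),    // known clean example
    Attachment.FromFile("part-photo.jpg"),   // the image to classify
};

var result = await chat.SubmitAsync(
    "Image 1 shows a defect; image 2 is clean. " +
    "Classify image 3 as one of: defect, clean, borderline.",
    attachments,
    grammar);

Console.WriteLine(result); // still constrained to the label set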
Manufacturing line photos, weld inspection, surface condition. Air-gapped factory floors stay air-gapped.
"Is this an invoice, a contract, a receipt, or an ID?" Route mailroom scans to the right downstream pipeline.
Flag user-uploaded content by category. Keep moderation policy on your own server, not a third-party endpoint.
Pre-classify medical imagery for a queue. PHI never leaves the hospital network.
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.
Console demo: classify text or images into custom categories.
Open on GitHub →

Demo
Classify scanned documents with a VLM and grammar-constrained labels.
Open on GitHub →

How-to guide
Define a taxonomy, constrain the model output, ship deterministic labels.
Read the guide →

API reference
Grammar API for grammar-constrained generation.
Open the reference →

The seven pillars of LM-Kit.NET, plus the local runtime they share. The highlighted card is where you are now.
01 · AI Agents
ReAct planning, supervisors, parallel and pipeline orchestrators, persistent memory, MCP clients, custom tools.
02 · Document Intelligence
PDF text and table extraction, on-device OCR reaching SOTA benchmark scores, structured field extraction with grammar-constrained generation.
03 · Vision & Multimodal
Image understanding, classification, labeling, multimodal chat, image embeddings, VLM-OCR, background removal. Same conversation surface as LLMs.
04 · RAG & Knowledge
Built-in vector store, Qdrant connector, embeddings, hybrid retrieval, document chunking, source citations.
05 · Text Analysis
Built-in classifiers and an extractor that emits typed C# objects via grammar-constrained sampling. Sentiment, keywords, language detection.
06 · Speech & Audio
A growing local speech-to-text stack: hallucination suppression, Voice Activity Detection, real-time translation, streaming output, 100+ languages.
07 · Text Generation
Single-turn, multi-turn, and stateless conversation primitives. Translate, correct, rewrite, summarise. Prompt templates, streaming, grammar-constrained outputs.
The foundation
Every capability above runs on this runtime.
Foundation
The runtime all seven pillars sit on. The LM-Kit.NET NuGet package ships the complete inference system: open-weight LLMs, vision-language models, embeddings, on-device speech-to-text, OCR, and classifiers, accelerated on CPU (AVX2), CUDA 12/13, Vulkan, or Metal. One package, zero cloud calls, predictable latency, full data and technology sovereignty.