👁️‍🗨️LM-Kit Goes Multimodal: Introducing Vision Support and Our 2025 Roadmap

Introduction

Over the past year, we’ve been hard at work expanding LM-Kit to empower developers with cutting-edge AI capabilities. We are now excited to announce that LM-Kit has officially gone multimodal, thanks to the addition of support for Vision Language Models (VLMs). This milestone paves the way for delivering a multi-agent orchestration system, which will be one of our main focuses in the coming year.

Our journey in 2024 revolved around building a state-of-the-art inference system, fully native to the .NET ecosystem, that can run thousands of local language models (for example, Phi-4, Llama 3.3, Mistral, Gemma 2, and Qwen 2.5) on any device. This system provides robust features such as:

  • Dynamic Sampling: Leverage our proprietary method for on-the-fly sampling adaptation, described in detail on our blog.
  • Advanced Function Calling: Easily integrate your own functions and workflows using our innovative function calling approach, showcased here.

In addition, LM-Kit comes with a growing library of prebuilt AI agents, providing out-of-the-box solutions for various use cases—and we’re continuously expanding this list to cover even more domains.

By becoming multimodal, LM-Kit now seamlessly integrates visual and textual inputs, opening up endless possibilities for real-world applications. Looking ahead to 2025, our priority is to harness these capabilities toward an orchestration layer that manages multiple agents simultaneously—further expanding the scope of what you can achieve with LM-Kit.

Major Milestone: Vision Support in LM-Kit

New .lmk Model Architecture

We’ve introduced a new model format (.lmk) to streamline how models are packaged and deployed with LM-Kit. This format is essentially a zip file that contains:

  1. Metadata
  2. Base model tensors
  3. Tensors for other modalities (such as vision)
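
Since the format is described as essentially a zip file, a quick way to picture the layout is to list an archive's entries. The sketch below is illustrative only: it builds a tiny in-memory stand-in rather than reading a real model, the entry names (metadata.json, base_model.bin, vision.bin) are hypothetical, and whether a real .lmk file opens with generic zip tooling is an assumption that has not been verified here. It uses only Python's standard zipfile module, not the LM-Kit API.

```python
import io
import json
import zipfile


def describe_archive(source):
    """Return (entry name, uncompressed size in bytes) for each archive entry."""
    with zipfile.ZipFile(source) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


def build_demo_archive():
    """Build a small in-memory stand-in archive.

    The entry names are hypothetical and only mirror the three parts
    listed above: metadata, base model tensors, other-modality tensors.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("metadata.json", json.dumps({"modalities": ["text", "vision"]}))
        archive.writestr("base_model.bin", b"\x00" * 16)  # placeholder bytes, not real tensors
        archive.writestr("vision.bin", b"\x00" * 8)       # placeholder bytes, not real tensors
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name, size in describe_archive(build_demo_archive()):
        print(f"{name}: {size} bytes")
```

If the assumption holds, passing the path of a downloaded model to describe_archive would show which components it bundles without loading any weights.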

No Changes Required

One of the main benefits of this .lmk format is that no modifications are needed in your existing LM-Kit applications. The .lmk files are fully compatible with the current LM-Kit API. As soon as you load an .lmk file that includes a vision model, your AI agents gain instant image interpretation and reasoning capabilities.

Single-File Deployment

By consolidating all necessary components—metadata, base model, and any specialized modules like vision—into a single .lmk file, shipping and version control become much simpler. You just need to deploy one file, rather than juggling multiple files and configurations.

Future-Ready

While this announcement focuses on vision, the .lmk format is designed to handle everything in LM-Kit, including additional models, audio, and any other future capabilities. It’s a universal container for all of LM-Kit’s current and upcoming functionality.

Prebuilt Vision-Enabled Agents and Expansion Plans

For this initial release, we’ve added vision support to our prebuilt conversational AI agents. Over the next few weeks, we’ll extend the same capability to other types of agents in LM-Kit, further paving the way for multimodal AI agent orchestration—where text and images (and soon audio) are processed together for richer, more context-aware interactions.

Where to Get the Models

If you’re eager to experiment with the new .lmk files, you can find them in our Hugging Face repository. We’ve tested various configurations, and the results have gone far beyond our expectations.

See Vision in Action: Real-World Examples of LM-Kit’s Vision Support

Doesn’t this look like magic? This demo is running entirely on my laptop, performing inference with a free, Apache 2.0-licensed model.

Try It Yourself

For those ready to dive in, check out our tutorial here:

“The Multi-Turn Chat with Vision demo extends the capabilities of the LM-Kit.NET SDK by adding support for visual attachments in a multi-turn conversational flow. This sample shows how to integrate both Large Language Models (LLMs) and Small Language Models (SLMs) into a .NET application. By supporting models of various sizes, you can run the chatbot on devices ranging from powerful servers to smaller edge devices. The demo produces image-driven insights in addition to maintaining text-based conversational context across multiple exchanges.”

If you’re looking for a more hands-on approach, you can explore the source code directly on GitHub:

LM-Kit: The Four-in-One AI Agent Runtime

LM-Kit isn’t just an inference tool; it’s a unique suite for creating and managing intelligent agents. Each of its four components tackles a key part of the AI journey:

  1. Multimodal Language Model Inference
    Effortlessly run LLMs, SLMs, and VLMs on-device—processing text and images with free, open-source models.

  2. Prebuilt Agent Customization
    Leverage existing AI agents, then tailor them to your specific needs for quick and efficient deployment.

  3. Agent Creation from Scratch
    Build specialized AI agents entirely on your own terms, with flexible design and capabilities.

  4. Multi-Agent Orchestration
    Coordinate entire fleets of AI agents, forming sophisticated collaborative solutions.

From stand-alone inference to orchestrating complex agent networks, LM-Kit empowers you to craft cutting-edge AI applications—securely, efficiently, and without leaving your device.

Our Vision for Generative AI in 2025

From Language Models to Agentic Solutions

Historically, we built software by coding a rigid set of predefined rules. As AI evolves from standalone language models to agentic solutions, our focus has shifted toward orchestrating autonomous AI “agents” that function more like small teams of humans. Instead of hardcoding every procedure, we now define objectives, instructions, roles, permissions, and organizational structures.

Because these agents learn from accumulated knowledge and experiences—much as people do—we can no longer rely on purely deterministic methods. This change demands new software architectures designed to handle flexibility, autonomy, and decision-making at scale, marking a transition from traditional rule-based development to a more dynamic, agent-based approach.

LM-Kit’s mission is to help developers embrace this new paradigm by providing a comprehensive AI Agent Runtime framework, enabling them to build, deploy, and manage these autonomous AI agents more effectively.

2024 in Review

In 2024, LM-Kit’s research and development focused on the question: “What can we really do with LLMs?” To address this, we developed a local inference system that supports text and vision modalities, utilizing Large Language Models (LLMs), Small Language Models (SLMs), and Vision Language Models (VLMs). Additionally, we created out-of-the-box AI agents tailored to various industry-specific use cases, showcasing the practical applications and versatility of these models.

Looking Forward to 2025

Moving into 2025, we will:

  • Continue to enhance our inference system to ensure maximum performance and flexibility.
  • Invest heavily in AI Agent Orchestration features, enriching our cognitive framework for more collaborative and autonomous agent interactions.
  • Introduce audio/voice support, further expanding our multimodal approach and offering even more ways for agents to communicate.

Get Started Today

We’re thrilled to see what our developer community will build with these new vision features in LM-Kit. Whether you’re creating a simple chatbot or orchestrating a multi-agent system, our goal is to make AI development as open, flexible, and powerful as possible.

Ready to start? Download the latest version of LM-Kit, explore our Hugging Face repository for .lmk models, and check out our tutorials and sample code to integrate vision support into your apps right away.

Thank you for being part of our journey. We can’t wait to see what you build with LM-Kit’s new multimodal AI capabilities—and we’re excited to bring you even more advanced features in 2025!
