Tensor overrides
Regex-pattern device placement. Pin each weight to CPU, GPU 0, GPU 1, or any combination.
A 30B mixture-of-experts model does not need a GPU pool to run.
With LM.TensorOverride you place individual tensors
where you want them: dense layers on a fast GPU, MoE experts on
CPU, attention on a second GPU. With FavorDistributedInference
you split a single workload across every available device. The
result is large-model inference on the hardware your team
actually has.
Run a 30B+ MoE on a 16 GB GPU by routing experts to CPU and keeping the hot path on the accelerator.
One LM, many devices. Tensor computation splits automatically across all available GPUs.
The interesting models are getting larger. Mixture-of-experts designs in particular ship 30B+ parameters but only activate a fraction per token. On a single consumer GPU the dense path fits and the experts do not. Tensor overrides let you place hot weights on the fast device and cold weights elsewhere, instead of refusing to run the model at all.
Match weights by regex pattern (layer indices, role keywords, parameter shape). Each match maps to a target device. Patterns are evaluated in declaration order; the first match wins.
Route the experts (sparse, only some active per token) to CPU; keep the dense backbone on GPU. The hot path stays accelerated; the cold path no longer ties up VRAM.
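In its smallest form the map needs only two rules. Here is a pared-down sketch of the expert-offload example shown in full later on this page, using the same LM.LoadOptions and DeviceTarget API:

// Two rules: cold experts to CPU, everything else to GPU 0.
// Order matters: the expert rule must precede the catch-all fallback.
var options = new LM.LoadOptions
{
    TensorOverrides =
    {
        { @"\.experts\.", DeviceTarget.Cpu },  // cold path: sparse expert weights
        { @".*", DeviceTarget.Gpu(0) }         // hot path: dense backbone
    }
};
var model = LM.LoadFromModelID("glm4.7-flash", options);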
Large dense models split layer-by-layer across two or more GPUs. FavorDistributedInference orchestrates the cross-device dataflow.
The CUDA, Vulkan, Metal, and AVX2 backends all participate. Mix and match: GPU 0 on CUDA for dense layers, CPU on AVX2 for experts.
Same model file. Same weights. Same output. Only the device map changes between configurations.
Hardware-bound, not random. Once placement is set, throughput is reproducible across runs and across machines with similar topology.
Single flag. The runtime detects every available GPU and splits tensors across them automatically. Caller code stays exactly the same; one LM, N devices.
using LMKit.Global;
using LMKit.Model;

// Single-flag distributed inference. The runtime splits tensors across
// every available GPU automatically.
Configuration.FavorDistributedInference = true;

var model = LM.LoadFromModelID("glm4.7-flash");

// One LM. Computation orchestrated across N GPUs. Caller code unchanged.
var chat = new MultiTurnConversation(model);
var reply = await chat.SubmitAsync("Walk me through MoE expert routing.");
Run a Mixture-of-Experts model on a single 16 GB card by keeping the dense backbone on GPU and pushing the expert tensors to CPU. Same weights, roughly half the VRAM footprint.
// MoE on a single 16 GB GPU: keep the dense backbone on GPU,
// push the experts to CPU. Same weights, half the VRAM.
var options = new LM.LoadOptions
{
    TensorOverrides =
    {
        // Patterns are processed in order. First match wins.
        { @"\.experts\.", DeviceTarget.Cpu },
        { @"\.attn_(q|k|v|o)", DeviceTarget.Gpu(0) },
        { @"^output", DeviceTarget.Gpu(0) },
        { @".*", DeviceTarget.Gpu(0) } // fallback
    }
};

var model = LM.LoadFromModelID("glm4.7-flash", options);
Speculative decoding compounds with a model split. The small draft model sits on GPU 0; the full model spreads across GPU 0 and GPU 1 by layer index. Both speed-ups stack.
// Speculative decoding pair: small draft on GPU 0, full model split across
// GPU 0 and GPU 1. Latency wins compound with hardware utilisation.
var draft = LM.LoadFromModelID("qwen3.5:0.8b");

var fullOptions = new LM.LoadOptions
{
    TensorOverrides =
    {
        { @"\.layers\.([0-9]|1[0-5])\.", DeviceTarget.Gpu(0) },      // layers 0-15
        { @"\.layers\.(1[6-9]|[2-9][0-9])\.", DeviceTarget.Gpu(1) }, // layers 16+
        { @".*", DeviceTarget.Gpu(0) }
    }
};
var full = LM.LoadFromModelID("qwen3.5:27b", fullOptions);

var chat = new MultiTurnConversation(full)
{
    SpeculativeDecoding = new SpeculativeOptions
    {
        DraftModel = draft,
        SpeculationDepth = 5
    }
};
Run a 30B+ mixture-of-experts model on a 16 GB GPU by routing the experts to CPU and the hot backbone to the accelerator.
Split a dense large model layer-by-layer across two consumer-grade GPUs. The combined VRAM holds the model; the dataflow runs across both.
Fast GPU plus integrated GPU plus CPU. Place attention layers on the fast GPU, the FFN on the second, the vocabulary tensors on CPU. Use everything you have; see the sketch after this list.
Pin specific tensors to specific NUMA nodes for predictable cross-socket throughput on multi-CPU servers.
Frequently-touched tensors on faster memory, rarely-touched ones on cheaper memory. Cost-per-token improves without changing the model.
Switch from a 7B to a 14B or 30B model with no other code change. The placement map absorbs the size jump; the rest of the SDK stays the same.
Quantise to fit, then place to fit better. Q4 plus tensor overrides puts very large models on very small machines.
Speculative decoding pairs naturally with hardware-aware placement. Draft on small device, full on split GPUs.
More sessions per node when idle conversations free their layer slice. Active sessions claim the released VRAM.
Pick the right model size for the placement plan. The catalogue exposes parameter count and quantisation per variant.
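For the mixed-device scenario above, a sketch along these lines is plausible, reusing the TensorOverrides API from the examples earlier on this page. The FFN and embedding patterns and the larger model ID are illustrative assumptions, not catalogue entries:

// Three-tier placement: attention on the fast GPU, FFN on the integrated
// GPU, vocabulary tensors on CPU. Patterns here are illustrative.
var placement = new LM.LoadOptions
{
    TensorOverrides =
    {
        { @"\.attn_(q|k|v|o)", DeviceTarget.Gpu(0) }, // fast discrete GPU
        { @"\.ffn_", DeviceTarget.Gpu(1) },           // integrated GPU
        { @"(embd|^output)", DeviceTarget.Cpu },      // vocabulary tensors
        { @".*", DeviceTarget.Gpu(0) }                // fallback
    }
};
var model = LM.LoadFromModelID("qwen3.5:27b", placement);

// The same placement map absorbs a size jump: swap the model ID
// (hypothetical larger variant) and keep everything else.
// var larger = LM.LoadFromModelID("qwen3.5:32b", placement);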
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.