Big models. Commodity hardware.

A 30B mixture-of-experts model does not need a GPU pool to run. With LM.TensorOverride you place individual tensors where you want them: dense layers on a fast GPU, MoE experts on CPU, attention on a second GPU. With FavorDistributedInference you split a single workload across every available device. The result is large-model inference on the hardware your team actually has.

Per-tensor placement · MoE expert offload · Distributed across GPUs

Tensor overrides

Regex-pattern device placement. Pin each weight to CPU, GPU 0, GPU 1, or any combination.

MoE expert offload

Run a 35B MoE on a 16 GB GPU by routing experts to CPU and keeping the hot path on the accelerator.

Distributed inference

One LM, many devices. Tensor computation splits automatically across all available GPUs.

Why hardware-aware placement

Models grow. VRAM does not.

The interesting models are getting larger. Mixture-of-experts designs in particular ship 30B+ parameters but only activate a fraction per token. On a single consumer GPU the dense path fits and the experts do not. Tensor overrides let you place hot weights on the fast device and cold weights elsewhere, instead of refusing to run the model at all.

Per-tensor control

Match weights by regex pattern (layer indices, role keywords, parameter shape). Each match maps to a target device. Patterns are applied in declaration order.
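
A minimal sketch of what such a placement map can look like. The types below (DeviceTarget, TensorPlacement, PlacementPlan) are illustrative stand-ins, not the LM.TensorOverride surface itself; only the regex-to-device mapping and the declaration-order rule come from this page.

using System.Collections.Generic;

// Illustrative only: a regex -> device placement map, evaluated in declaration order
// (first match wins in this sketch). These types are stand-ins, not LM-Kit API.
enum DeviceTarget { Cpu, Gpu0, Gpu1 }

record TensorPlacement(string Pattern, DeviceTarget Device);

static class PlacementPlan
{
    public static readonly List<TensorPlacement> Overrides = new()
    {
        new(@"blk\.\d+\.attn_.*",       DeviceTarget.Gpu0), // attention weights on the fast GPU
        new(@"blk\.\d+\.ffn_.*_exps.*", DeviceTarget.Cpu),  // MoE expert tensors offloaded to CPU
        new(@"output\.weight",          DeviceTarget.Gpu1), // output head on the second GPU
        new(@".*",                      DeviceTarget.Gpu1), // everything else on the second GPU
    };
}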

MoE expert offload

Route the experts (sparse, only some active per token) to CPU; keep the dense backbone on GPU. The hot path stays accelerated; the cold path does not block VRAM.
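
To make the split concrete: GGUF exports of MoE models (llama.cpp-style naming) typically store expert weights under names like blk.N.ffn_gate_exps.weight, so a single pattern catches all of them. Another illustrative sketch rather than the literal LM-Kit call; the naming convention may differ per model.

using System.Text.RegularExpressions;

// Illustrative only: route expert tensors to CPU, keep the dense backbone on the GPU.
// The blk.N.ffn_*_exps naming follows GGUF/llama.cpp-style MoE exports and may vary.
static class ExpertOffload
{
    static readonly Regex ExpertTensor =
        new(@"^blk\.\d+\.ffn_(gate|up|down)_exps\.", RegexOptions.Compiled);

    // e.g. TargetDevice("blk.12.ffn_up_exps.weight") -> "CPU"
    //      TargetDevice("blk.12.attn_q.weight")      -> "GPU0"
    public static string TargetDevice(string tensorName) =>
        ExpertTensor.IsMatch(tensorName) ? "CPU" : "GPU0";
}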

Layer split across GPUs

Split a large dense model layer-by-layer across two or more GPUs. FavorDistributedInference orchestrates the cross-device dataflow.

Backend-aware

CUDA, Vulkan, Metal, AVX2 backends all participate. Mix and match: GPU 0 on CUDA for dense, CPU on AVX2 for experts.

No retraining

Same model file. Same weights. Same output. Only the device map changes between configurations.

Predictable throughput

Hardware-bound, not random. Once placement is set, throughput is reproducible across runs and across machines with similar topology.

Three patterns

Place the weights, run the model.

Single flag. The runtime detects every available GPU and splits tensors across them automatically. Caller code stays exactly the same; one LM, N devices.

DistributeAcrossGpus.cs
using LMKit.Global;
using LMKit.Model;
using LMKit.TextGeneration;

// Single-flag distributed inference. The runtime splits tensors across
// every available GPU automatically.
Configuration.FavorDistributedInference = true;

var model = LM.LoadFromModelID("glm4.7-flash");

// One LM. Computation orchestrated across N GPUs. Caller code unchanged.
var chat = new MultiTurnConversation(model);
var reply = await chat.SubmitAsync("Walk me through MoE expert routing.");

Where placement pays off

Hardware budgets that work.

MoE on a single GPU

Run a 30B+ mixture-of-experts model on a 16 GB GPU by routing the experts to CPU and the hot backbone to the accelerator.

Two consumer GPUs, one big model

Split a dense large model layer-by-layer across two consumer-grade GPUs. The combined VRAM holds the model; the dataflow runs across both.

Workstation with mixed accelerators

Fast GPU plus integrated GPU plus CPU. Place attention layers on the fast GPU, FFN layers on the second, and the vocabulary tensors on CPU. Use everything you have.

Server with NUMA topology

Pin specific tensors to specific NUMA nodes for predictable cross-socket throughput on multi-CPU servers.

Hot-cold separation

Frequently-touched tensors on faster memory, infrequent ones on cheaper memory. Cost-per-token improves without changing the model.

Bigger models without retraining

Switch from a 7B to a 14B or 30B model with no other code change. The placement map absorbs the size jump; the rest of the SDK stays the same.

Related capabilities

Placement plus the rest.

Quantization

Quantize to fit, then place to fit better. Q4 plus tensor overrides puts very large models on very small machines.

Sampling controls

Speculative decoding pairs naturally with hardware-aware placement: draft with a small model on one device, verify with the full model split across GPUs.

Context hibernation

More sessions per node when idle conversations free their layer slice. Active sessions claim the released VRAM.

Model catalog

Pick the right model size for the placement plan. The catalog exposes parameter count and quantization per variant.

Bigger models. Same hardware.
