Tensor overrides
Regex-pattern device placement. Pin each weight to CPU, GPU 0, GPU 1, or any combination.
A 30B mixture-of-experts model does not need a GPU pool to run.
With LM.TensorOverride you place individual tensors
where you want them: dense layers on a fast GPU, MoE experts on
CPU, attention on a second GPU. With FavorDistributedInference
you split a single workload across every available device. The
result is large-model inference on the hardware your team
actually has.
Run a 30B+ MoE on a 16 GB GPU by routing experts to CPU and keeping the hot path on the accelerator.
One LM, many devices. Tensor computation splits automatically across all available GPUs.
The interesting models are getting larger. Mixture-of-experts designs in particular ship 30B+ parameters but only activate a fraction per token. On a single consumer GPU the dense path fits and the experts do not. Tensor overrides let you place hot weights on the fast device and cold weights elsewhere, instead of refusing to run the model at all.
Match weights by regex pattern (layer indices, role keywords, parameter shape). Each match maps to a target device. Patterns are evaluated in declaration order; the first match wins.
Route the experts (sparse, only some active per token) to CPU; keep the dense backbone on GPU. The hot path stays accelerated; the cold path no longer ties up VRAM.
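In its smallest form the map needs only two rules. Here is a pared-down sketch of the expert-offload example shown in full later on this page, using the same LM.LoadOptions and DeviceTarget API:

// Two rules: cold experts to CPU, everything else to GPU 0.
// Order matters: the expert rule must precede the catch-all fallback.
var options = new LM.LoadOptions
{
    TensorOverrides =
    {
        { @"\.experts\.", DeviceTarget.Cpu },  // cold path: sparse expert weights
        { @".*", DeviceTarget.Gpu(0) }         // hot path: dense backbone
    }
};
var model = LM.LoadFromModelID("glm4.7-flash", options);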
Large dense models split layer-by-layer across two or more GPUs. FavorDistributedInference orchestrates the cross-device dataflow.
The CUDA, Vulkan, Metal, and AVX2 backends all participate. Mix and match: GPU 0 on CUDA for dense layers, CPU on AVX2 for experts.
Same model file. Same weights. Same output. Only the device map changes between configurations.
Hardware-bound, not random. Once placement is set, throughput is reproducible across runs and across machines with similar topology.
Single flag. The runtime detects every available GPU and splits tensors across them automatically. Caller code stays exactly the same; one LM, N devices.
using LMKit.Global;
using LMKit.Model;

// Single-flag distributed inference. The runtime splits tensors across
// every available GPU automatically.
Configuration.FavorDistributedInference = true;

var model = LM.LoadFromModelID("glm4.7-flash");

// One LM. Computation orchestrated across N GPUs. Caller code unchanged.
var chat = new MultiTurnConversation(model);
var reply = await chat.SubmitAsync("Walk me through MoE expert routing.");
Run a Mixture-of-Experts model on a single 16 GB card by keeping the dense backbone on GPU and pushing the expert tensors to CPU. Same weights, roughly half the VRAM footprint.
// MoE on a single 16 GB GPU: keep the dense backbone on GPU,
// push the experts to CPU. Same weights, half the VRAM.
var options = new LM.LoadOptions
{
    TensorOverrides =
    {
        // Patterns are processed in order. First match wins.
        { @"\.experts\.", DeviceTarget.Cpu },
        { @"\.attn_(q|k|v|o)", DeviceTarget.Gpu(0) },
        { @"^output", DeviceTarget.Gpu(0) },
        { @".*", DeviceTarget.Gpu(0) } // fallback
    }
};

var model = LM.LoadFromModelID("glm4.7-flash", options);
Speculative decoding compounds with a model split. The small draft model sits on GPU 0; the full model spreads across GPU 0 and GPU 1 by layer index. Both speed-ups stack.
// Speculative decoding pair: small draft on GPU 0, full model split across
// GPU 0 and GPU 1. Latency wins compound with hardware utilisation.
var draft = LM.LoadFromModelID("qwen3.5:0.8b");

var fullOptions = new LM.LoadOptions
{
    TensorOverrides =
    {
        { @"\.layers\.([0-9]|1[0-5])\.", DeviceTarget.Gpu(0) },      // layers 0-15
        { @"\.layers\.(1[6-9]|[2-9][0-9])\.", DeviceTarget.Gpu(1) }, // layers 16+
        { @".*", DeviceTarget.Gpu(0) }
    }
};
var full = LM.LoadFromModelID("qwen3.5:27b", fullOptions);

var chat = new MultiTurnConversation(full)
{
    SpeculativeDecoding = new SpeculativeOptions
    {
        DraftModel = draft,
        SpeculationDepth = 5
    }
};
Run a 30B+ mixture-of-experts model on a 16 GB GPU by routing the experts to CPU and the hot backbone to the accelerator.
Split a dense large model layer-by-layer across two consumer-grade GPUs. The combined VRAM holds the model; the dataflow runs across both.
Fast GPU plus integrated GPU plus CPU. Place attention layers on the fast GPU, the FFN on the second, the vocabulary tensors on CPU. Use everything you have; see the sketch after this list.
Pin specific tensors to specific NUMA nodes for predictable cross-socket throughput on multi-CPU servers.
Frequently-touched tensors on faster memory, rarely-touched ones on cheaper memory. Cost-per-token improves without changing the model.
Switch from a 7B to a 14B or 30B model with no other code change. The placement map absorbs the size jump; the rest of the SDK stays the same.
Quantise to fit, then place to fit better. Q4 plus tensor overrides puts very large models on very small machines.
Speculative decoding pairs naturally with hardware-aware placement. Draft on small device, full on split GPUs.
More sessions per node when idle conversations free their layer slice. Active sessions claim the released VRAM.
Pick the right model size for the placement plan. The catalogue exposes parameter count and quantisation per variant.
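For the mixed-device scenario above, a sketch along these lines is plausible, reusing the TensorOverrides API from the examples earlier on this page. The FFN and embedding patterns and the larger model ID are illustrative assumptions, not catalogue entries:

// Three-tier placement: attention on the fast GPU, FFN on the integrated
// GPU, vocabulary tensors on CPU. Patterns here are illustrative.
var placement = new LM.LoadOptions
{
    TensorOverrides =
    {
        { @"\.attn_(q|k|v|o)", DeviceTarget.Gpu(0) }, // fast discrete GPU
        { @"\.ffn_", DeviceTarget.Gpu(1) },           // integrated GPU
        { @"(embd|^output)", DeviceTarget.Cpu },      // vocabulary tensors
        { @".*", DeviceTarget.Gpu(0) }                // fallback
    }
};
var model = LM.LoadFromModelID("qwen3.5:27b", placement);

// The same placement map absorbs a size jump: swap the model ID
// (hypothetical larger variant) and keep everything else.
// var larger = LM.LoadFromModelID("qwen3.5:32b", placement);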
Working console demos on GitHub, step-by-step how-to guides on the docs site, and the API reference for the classes used on this page.