Platform · Runtime

On-device engines.

Three inference backends — MLX for Apple Silicon, GGUF via llama.cpp for broad model coverage, and Core ML for vision and audio — selected automatically per device.

Overview

Nimbus8 doesn't ship its own inference engine. Instead it wraps three proven, open-source runtimes and picks the right one for each model and device combination. The goal is simple: the highest tokens-per-second rate your hardware can sustain, with zero configuration on your part.

All three backends run entirely on-device. No network call is made during inference — ever.

MLX

Apple's own machine-learning framework for Apple Silicon. MLX is the preferred backend on devices with M-series or recent A-series chips. It uses unified memory, so model weights live in a single allocation shared by the CPU and GPU rather than being copied between them; larger models fit in less RAM than you'd expect.

Nimbus8 loads MLX-format weights directly from Hugging Face repos. Quantized variants (4-bit, 8-bit) are supported out of the box.
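To see why those quantized variants matter, note that weight memory scales linearly with bits per weight. A minimal sketch of the arithmetic (illustrative only; real loaders add KV-cache and runtime overhead on top of the weights):

```rust
/// Rough weight-memory estimate for a quantized model.
/// Illustrative arithmetic, not Nimbus8's actual loader logic.
fn weight_bytes(params: u64, bits_per_weight: u64) -> u64 {
    params * bits_per_weight / 8
}

fn main() {
    // A 3B-parameter model at 4-bit quantization: ~1.5 GB of weights,
    // versus ~6 GB at full 16-bit precision.
    let gb = weight_bytes(3_000_000_000, 4) as f64 / 1e9;
    println!("{gb:.1} GB"); // → 1.5 GB
}
```

That factor-of-four saving is what makes 3B-class models comfortable on phone-class memory budgets.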

GGUF via llama.cpp

The workhorse. llama.cpp supports the widest range of model architectures and quantization formats. If a model exists in GGUF format on Hugging Face, Nimbus8 can probably run it.

The Rust core bridges to llama.cpp through a thin FFI layer. Thread count, batch size, and context length are tuned automatically based on the device's chip and available memory.
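The exact heuristics aren't documented here, but the shape of that tuning step can be sketched as follows. All names and thresholds below are hypothetical, not the real Nimbus8 code:

```rust
/// Hypothetical knobs handed to the llama.cpp FFI layer.
struct LlamaConfig {
    threads: usize,
    batch_size: usize,
    ctx_len: usize,
}

/// Illustrative heuristic: pin threads to the performance cores and
/// scale batch size and context length with free memory. The 6 GB
/// threshold is invented for the example.
fn tune(performance_cores: usize, free_mem_gb: f64) -> LlamaConfig {
    let roomy = free_mem_gb >= 6.0;
    LlamaConfig {
        threads: performance_cores.max(1),
        batch_size: if roomy { 512 } else { 256 },
        ctx_len: if roomy { 8192 } else { 4096 },
    }
}

fn main() {
    let cfg = tune(6, 8.0); // e.g. an M-series device with headroom
    println!(
        "threads={} batch={} ctx={}",
        cfg.threads, cfg.batch_size, cfg.ctx_len
    );
}
```

The point of automating this is that good llama.cpp settings differ wildly between, say, an iPhone and an M-series Mac, and mis-set thread counts or context lengths are the most common cause of poor throughput.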

Core ML

Used primarily for vision (Stratus) and audio (Overture) workloads. Core ML models run on the Neural Engine, freeing the GPU for text inference. This is how Nimbus8 can run OCR and chat simultaneously without one starving the other.

Whisper (speech-to-text) and the image diffusion pipeline both use Core ML paths built on Apple's ml-stable-diffusion project and whisper.cpp's Core ML support.

Automatic selection

When you load a model, the runtime checks the model's format, the device's chip generation, and available memory. It picks the backend that will give you the best throughput without risking an out-of-memory kill. You never have to choose — but if you're curious, the model detail screen shows which engine is active.
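The decision reduces to a small amount of dispatch logic. A sketch of that shape, where the format names, the headroom multiplier, and the fallback order are assumptions rather than the actual Nimbus8 rules:

```rust
#[derive(Debug, PartialEq)]
enum Backend {
    Mlx,
    Gguf,
    CoreMl,
}

#[derive(Clone, Copy)]
enum ModelFormat {
    MlxWeights,
    Gguf,
    MlPackage,
}

/// Illustrative selection: Core ML packages go to the Neural Engine,
/// MLX weights prefer MLX on Apple Silicon when there is memory
/// headroom, and GGUF via llama.cpp is the broad fallback.
fn select_backend(
    format: ModelFormat,
    apple_silicon: bool,
    free_mem_gb: f64,
    model_gb: f64,
) -> Backend {
    match format {
        ModelFormat::MlPackage => Backend::CoreMl,
        ModelFormat::MlxWeights if apple_silicon && free_mem_gb > model_gb * 1.2 => Backend::Mlx,
        _ => Backend::Gguf,
    }
}

fn main() {
    let b = select_backend(ModelFormat::MlxWeights, true, 8.0, 1.5);
    println!("{b:?}"); // → Mlx
}
```

The memory guard is the important part: falling back to GGUF is always safe, while forcing a too-large model onto a preferred backend risks the out-of-memory kill the runtime is trying to avoid.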

Performance

Token rates vary by model size, quantization, and device. As a rough guide: a 3B-parameter model at Q4 quantization streams at 30–50 tokens/s on an iPhone 15 Pro, while larger models (7B–8B) run at 10–20 tokens/s on the same hardware. The runtime measures actual throughput on first load and stores it in the model registry for future reference.
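That first-load measurement is just tokens generated over wall-clock time. A minimal sketch of the arithmetic (the registry itself is not shown):

```rust
/// Throughput in tokens per second from a timed generation run.
fn tokens_per_second(tokens: u32, elapsed_ms: u64) -> f64 {
    tokens as f64 * 1000.0 / elapsed_ms as f64
}

fn main() {
    // 128 tokens streamed in 3.2 s works out to 40 tokens/s,
    // squarely inside the quoted 30–50 range for a 3B model at Q4.
    println!("{} tokens/s", tokens_per_second(128, 3200));
}
```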