Nimbus8 ships with two on-device LLM runtimes: MLX (Apple's array framework, optimized for Apple silicon) and llama.cpp (the pragmatist's choice, via GGUF files). We're often asked which one is "better." The honest answer is: neither. It depends on the model, the quantization, and the phone.
The test rig
iPhone 15 Pro (A17 Pro, 8 GB RAM), iOS 18.3, airplane mode on, screen at minimum brightness, battery above 70%. Each run warms up for 30 seconds, then measures sustained tokens/second over a 500-token generation against a fixed 1024-token context. Peak memory footprint is inferred by polling os_proc_available_memory throughout the run.
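The warm-up-then-measure loop can be sketched as follows. This is an illustration, not Nimbus8 code: generate_tokens is a hypothetical stand-in for either runtime's decode loop against the pre-filled context, and the warm-up/token counts mirror the rig described above.

```python
import time

def sustained_tok_per_s(generate_tokens, warmup_s=30.0, n_tokens=500):
    """Warm up, then measure sustained decode throughput in tokens/second.

    `generate_tokens(n)` is a hypothetical callable that decodes n tokens
    against a fixed, pre-filled context and returns when done.
    """
    # Warm-up: keep decoding until the clock expires, so caches, compiled
    # Metal shaders, and thermal state settle before the measured run.
    deadline = time.monotonic() + warmup_s
    while time.monotonic() < deadline:
        generate_tokens(50)

    # Measured run: one fixed-length generation, wall-clock timed.
    start = time.monotonic()
    generate_tokens(n_tokens)
    elapsed = time.monotonic() - start
    return n_tokens / elapsed
```

On-device, the same shape applies; only the clock source and the decode call differ.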
Models tested: Llama 3.2 3B Instruct, Qwen 2.5 3B/7B Instruct, Gemma 3 2B/4B, Mistral 7B Instruct v0.3, Phi-3.5 Mini, and Phi-3.5 Vision.
The headline numbers
For the 3B-class models, MLX wins tokens/second by roughly 18–25%. Llama 3.2 3B at Q4: MLX sustains ~62 tok/s, llama.cpp with full Metal offload ~50 tok/s. Qwen 2.5 3B tracks similarly. MLX's edge here comes from its unified-memory design and fused Metal kernels, which keep the GPU better fed during decode.
For 7B-class models, the gap narrows sharply, and GGUF has the better memory profile. llama.cpp's Q4_K_M quantization is more compact than the closest MLX equivalent, and on a phone at Q4 you're usually memory-bound first, compute-bound second.
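The memory-bound point falls out of back-of-envelope arithmetic on weight sizes. The bits-per-weight figures below are approximations (Q4_K_M averages somewhat more than 4 bits per weight because of quantization scales and higher-precision layers), and the function is an illustration, not Nimbus8 code.

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate resident size of quantized weights, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

# A 7B model at ~4.5 effective bits (roughly Q4_K_M territory) already
# claims ~3.7 GiB of an 8 GiB phone before KV cache and activations.
print(round(weight_gib(7e9, 4.5), 1))  # 3.7
print(round(weight_gib(3e9, 4.5), 1))  # 1.6
```

At 7B, every fraction of a bit per weight the quantizer saves is hundreds of megabytes of headroom, which is why the more compact GGUF quants matter more than a few tok/s.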
For vision models (Phi-3.5 Vision), GGUF is the only practical path today. MLX's vision support is improving fast, but as of this writing, llama.cpp's multimodal projector handling is more mature and much better debugged.
Where each runtime genuinely wins
MLX wins when:
- You're running a 2–4B parameter model at Q4 or Q8.
- You care about tokens/second more than RSS ceiling.
- The model has a well-maintained MLX port (Gemma, Llama, Qwen).
GGUF wins when:
- The model is 7B+ and you're fighting for memory headroom.
- You want exotic quantization (IQ2, IQ3, imatrix-specific).
- The model is multimodal and the vision projector isn't in MLX yet.
- You need broad model coverage: Hugging Face's GGUF catalog dwarfs the MLX catalog by an order of magnitude.
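Condensed into code, the two lists above amount to a rule of thumb roughly like this. It is a sketch of the reasoning, not the actual selection logic; the function and its parameters are invented for illustration.

```python
def rule_of_thumb(params_b: float, multimodal: bool, quant: str,
                  has_mlx_port: bool) -> str:
    """Pick a runtime the way the lists above suggest. Illustrative only."""
    if multimodal:
        return "gguf"   # vision projector handling is more mature in llama.cpp
    if quant.startswith("IQ"):
        return "gguf"   # exotic imatrix quants only exist as GGUF
    if params_b >= 7:
        return "gguf"   # memory headroom wins at 7B and up
    if 2 <= params_b <= 4 and has_mlx_port:
        return "mlx"    # 2-4B with a maintained port: MLX's throughput win
    return "gguf"       # broadest coverage as the default

print(rule_of_thumb(3, False, "Q4", True))      # mlx
print(rule_of_thumb(7, False, "Q4_K_M", True))  # gguf
```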
The capability manifest picks for you
You don't have to memorize any of this. The capability manifest flags each model with preferredRuntime and fallbackRuntime, and Nimbus8 routes automatically. When we ship a new model to the catalog, we run exactly this benchmark and pick the runtime that produced better numbers on each device tier. The table updates with the app.
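A manifest entry and its fallback routing might look roughly like this. The field names preferredRuntime and fallbackRuntime come from the post; everything else (the types, the route function, the model identifier) is an invented sketch.

```python
from dataclasses import dataclass

@dataclass
class ManifestEntry:
    model: str
    preferredRuntime: str  # winner of the on-device benchmark for this tier
    fallbackRuntime: str   # used when the preferred runtime can't serve it

def route(entry: ManifestEntry, available: set[str]) -> str:
    """Return the runtime to use, honoring preferred-then-fallback order."""
    for runtime in (entry.preferredRuntime, entry.fallbackRuntime):
        if runtime in available:
            return runtime
    raise RuntimeError(f"no runtime available for {entry.model}")

entry = ManifestEntry("llama-3.2-3b-instruct", "mlx", "gguf")
print(route(entry, {"mlx", "gguf"}))  # mlx
print(route(entry, {"gguf"}))         # gguf
```

Because the manifest ships with the app, a routing change (say, MLX gaining vision support) needs no code change on the caller's side; the same route call just starts returning a different answer.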
What's coming
MLX's vision story is catching up fast. We expect parity on Phi-3.5 Vision by mid-year and a meaningful throughput win on Gemma 3 Vision shortly after. When that happens, the manifest updates and your phone starts using MLX for those too, automatically.
Until then: pick a model, don't think about the runtime, and enjoy the fact that your phone is doing all of this without a data center in the loop.
Tagged Models · Published Mar 28, 2026