Nimbus8 Blog · Runtime

Picking a model that actually runs on your iPhone.

Apr 18, 2026 · 7 min read

The biggest mistake a mobile LLM app can make is shipping with a "bigger is better" model picker. On an iPhone, the right model is the one that fits — not the one with the most parameters.

Nimbus8 has opinions about this, and those opinions live in a small JSON file we call the capability manifest. It describes every open model we support — architecture, context length, vision/OCR/PDF flags, required runtime, quantization, and, crucially, how much RAM it actually needs on device.
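As a sketch, a single manifest entry might look and decode like this. The field names below (`peak_ram_bytes`, `on_device_fit`, and friends) are illustrative assumptions, not Nimbus8's published schema:

```python
import json
from dataclasses import dataclass

# Hypothetical shape of one capability-manifest entry; the field names
# are illustrative assumptions, not Nimbus8's actual schema.
@dataclass
class ManifestEntry:
    id: str
    architecture: str        # e.g. "llama"
    context_length: int      # max tokens
    vision: bool
    ocr: bool
    pdf: bool
    runtime: str             # which runtime must load this model
    quantization: str        # e.g. "Q4_K_M"
    peak_ram_bytes: int      # measured peak RSS on device
    on_device_fit: float     # 0-1 fit score for the current device tier

raw = """
{
  "id": "llama-3.2-3b-instruct",
  "architecture": "llama",
  "context_length": 131072,
  "vision": false,
  "ocr": false,
  "pdf": false,
  "runtime": "metal",
  "quantization": "Q4_K_M",
  "peak_ram_bytes": 1825361100,
  "on_device_fit": 0.92
}
"""

entry = ManifestEntry(**json.loads(raw))
```

The point of a flat, declarative file like this is that the picker never has to load a model to know whether it fits; the expensive measurement happens once, offline.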

Why "fits" wins

On a phone, the variable that matters most isn't raw parameter count — it's whether the model loads without the OS evicting your app. A 70B-parameter model streamed over the cellular network sounds impressive in a benchmark, but in practice it delivers two seconds of latency per token and fails the moment the tunnel flickers. A 3B-parameter model running on the Neural Engine at 60 tokens/second is the conversation you actually want.

The capability manifest makes this tradeoff explicit. Each entry has an onDeviceFit score derived from measured tokens/second, warmup time, and peak RSS on the current device tier. Models that can't keep the interface responsive are hidden by default. You can flip a switch in Settings to "show incompatible models" — we trust you — but the default path is to only offer picks that will make the phone feel like a phone.
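A minimal sketch of how a score like that could combine those signals. The weights and thresholds here are made-up assumptions for illustration, not Nimbus8's actual formula:

```python
# Toy fit score from the three measured signals the manifest records.
# Weights and thresholds are assumptions, not Nimbus8's real scoring.
def fit_score(tokens_per_second, warmup_seconds, peak_rss_bytes, ram_budget_bytes):
    if peak_rss_bytes >= ram_budget_bytes:
        return 0.0                                     # won't load: hidden by default
    speed = min(tokens_per_second / 30.0, 1.0)         # ~30 tok/s reads as fluent
    warmth = min(2.0 / max(warmup_seconds, 0.1), 1.0)  # <= 2 s warmup feels instant
    headroom = 1.0 - peak_rss_bytes / ram_budget_bytes
    return 0.5 * speed + 0.2 * warmth + 0.3 * headroom

budget = 3_000_000_000  # ~3 GB app-private budget on a baseline phone

local_3b = fit_score(60, 1.2, 1_700_000_000, budget)  # comfortably above threshold
big_7b_q8 = fit_score(9, 6.0, 3_400_000_000, budget)  # over budget -> 0.0, hidden
```

Any monotone combination of the same signals works; the essential design choice is that "won't load" is a hard zero, not a low score, so oversized models disappear rather than rank last.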

Memory budgets, in practice

iOS gives each app a memory high-water mark that depends on the device and the current system load. On a baseline iPhone 15, that's roughly 3 GB of app-private RAM before you come under memory pressure. A Q4-quantized 3B weighs in at around 1.7 GB, leaving comfortable headroom for the runtime, the text layout engine, attachments, and a prompt cache. A Q8 7B eats most of the budget: it works on a Pro, but gets killed under memory pressure on an SE.

We keep a simple table of these budgets per device tier and prune the catalog to only show what the specific phone can actually run. You don't have to care about it; it just works.
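A toy version of that pruning step. The tier names and byte counts are assumptions for illustration, not Nimbus8's real table:

```python
# Hypothetical per-tier budget table; the tiers and numbers are
# illustrative assumptions, not Nimbus8's actual values.
RAM_BUDGET_BYTES = {
    "se": 1_800_000_000,        # smaller phones
    "baseline": 3_000_000_000,  # e.g. a baseline iPhone 15
    "pro": 5_000_000_000,       # Pro-tier phones
}

def prune(catalog, tier):
    """Keep only models whose measured peak RSS fits the device's budget."""
    budget = RAM_BUDGET_BYTES[tier]
    return [m for m in catalog if m["peak_ram_bytes"] < budget]

catalog = [
    {"name": "Llama 3.2 3B (Q4_K_M)", "peak_ram_bytes": 1_700_000_000},
    {"name": "Qwen 2.5 Coder 7B (Q8)", "peak_ram_bytes": 3_400_000_000},
]

on_baseline = [m["name"] for m in prune(catalog, "baseline")]
# only the 3B survives on a baseline phone; both fit on a Pro
```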

Why quantized 3B beats streamed 70B

Three reasons, in rough order of importance:

  • Latency to first token. A local 3B starts emitting in under 300 ms. A cloud 70B takes whatever TLS, DNS, and the provider's queue decide to give you. On-device always wins the first-impression fight.
  • Offline by default. Subway, plane, parking garage, country without LTE. The local model doesn't care.
  • Privacy by construction. The prompt literally never leaves the device. No "may be used to improve our services" asterisk.

A 70B still wins for dense technical tasks that benefit from its extra capacity. Nimbus8 doesn't pretend otherwise: for those, you open Cirrus with a BYO-key cloud provider if you want to, clearly labeled as network. But for chat, summarization, outlines, and most of the day? The phone is fine. The phone is, in fact, great.

Where to start

If you're installing Nimbus8 for the first time on a modern iPhone, we suggest:

  • Llama 3.2 3B Instruct (Q4_K_M) for everyday chat.
  • Qwen 2.5 Coder 7B on Pro-tier phones for Cirrus.
  • Gemma 3 2B if you want the smallest responsive chat model and the best battery life.

All three show up at the top of the picker, routed to the right runtime, pre-filtered for your device. No benchmark spreadsheet required.

