How the catalog works
The catalog is a curated JSON manifest shipped with every Nimbus8 release. Each entry includes the Hugging Face repo, filename, quantization level, expected memory footprint, and a set of capability flags (vision, tool-use, thinking, multilingual). When you browse models inside the app, you're reading this manifest — not hitting an API.
Device-fit gating runs before any model appears in your list. Nimbus8 reads your chip generation (A15, A16, A17 Pro, M1, etc.) and available RAM, then filters the manifest to models that will actually load and run. If a model needs 6 GB of RAM and your device has 4 GB available, it won't show up — no failed downloads, no OOM crashes.
Auto-tiering goes a step further. Within each module, models are ranked into tiers — Recommended, Capable, and Lightweight — based on measured tok/s, quality benchmarks, and memory headroom for your specific device. The top-tier pick is pre-selected when you first open a module, but you can always switch.
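The gating-then-tiering flow can be sketched in a few lines. This is an illustrative Python sketch, not Nimbus8's actual implementation; the field names (`min_ram_gb`, `tier_score`) are hypothetical stand-ins for whatever the real manifest schema uses.

```python
def fit_models(manifest, available_ram_gb):
    """Device-fit gating: keep only entries whose expected footprint fits."""
    return [m for m in manifest if m["min_ram_gb"] <= available_ram_gb]

def tier(models):
    """Auto-tiering: rank fitting models by a combined speed/quality score."""
    ranked = sorted(models, key=lambda m: m["tier_score"], reverse=True)
    labels = ["Recommended", "Capable", "Lightweight"]
    return [(labels[min(i, 2)], m["name"]) for i, m in enumerate(ranked)]

manifest = [
    {"name": "Qwen3 4B",     "min_ram_gb": 6, "tier_score": 90},
    {"name": "Llama 3.2 1B", "min_ram_gb": 3, "tier_score": 70},
    {"name": "Mistral 7B",   "min_ram_gb": 8, "tier_score": 85},
]

fitting = fit_models(manifest, available_ram_gb=6)
print(tier(fitting))  # Mistral 7B never appears on a 6 GB device
```

Because gating runs before tiering, an over-budget model is filtered out entirely rather than ranked low.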
Chat models (Gale)
Gale's chat models are the heart of Nimbus8. The current catalog includes Llama 3.2 1B and 3B (MLX 4-bit, Q4_K_M GGUF), Qwen3 0.6B/1.7B/4B (MLX 4-bit), Gemma 3 1B/4B (MLX 4-bit), Mistral 7B-Instruct-v0.3 (Q4_K_M GGUF), and Phi-4-mini 3.8B (MLX 4-bit). All instruct-tuned variants are tested with the chat template expected by each model family.
On an iPhone 15 Pro (A17 Pro, 8 GB), the 3B–4B class models deliver 18–35 tok/s with a 4096-token context window. The 1B models push 45–60 tok/s and fit comfortably on 6 GB devices. Llama 3.2 3B Q4_K_M uses roughly 2.1 GB of RAM at inference; Qwen3 4B MLX 4-bit sits around 2.6 GB. Context windows vary — Qwen3 supports up to 32k natively, though Nimbus8 caps effective context at 8192 tokens on 8 GB devices to leave headroom for the OS.
MLX 4-bit quantizations are preferred on Apple Silicon because they leverage the unified memory architecture and the GPU's matrix multiply units directly. GGUF Q4_K_M via llama.cpp is the fallback for models without MLX weights or for older A15/A16 chips where llama.cpp's NEON kernels are better tuned. The manifest marks which backend each model entry targets, so there's no ambiguity at download time.
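Backend dispatch then reduces to reading the manifest. A minimal sketch, assuming a per-entry `backend` field (the field name is hypothetical) and the A15/A16 fallback rule described above:

```python
def backend_for(entry, chip):
    """Pick the inference backend for a manifest entry on a given chip.
    Older A15/A16 chips fall back to llama.cpp even when MLX weights exist."""
    if entry["backend"] == "mlx" and chip not in ("A15", "A16"):
        return "mlx-swift"
    return "llama.cpp"
```

The point of pinning the backend in the manifest is that this decision is made once, at curation time, rather than guessed per-download.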
Code models (Cirrus)
Cirrus uses code-specialized models for inline diff generation, file creation, and repository-aware edits. The catalog includes Qwen3-Coder 4B (MLX 4-bit), CodeGemma 2B and 7B (Q4_K_M GGUF), and DeepSeek-Coder 1.3B and 6.7B (Q4_K_M GGUF). Each model is tested against a suite of diff-generation tasks — creating files, editing functions, and multi-file refactors — to verify that structured output is reliable.
Tool-calling support is a key differentiator. Qwen3-Coder and DeepSeek-Coder both support the function-calling format that Cirrus uses to invoke file operations, run searches, and generate diffs. CodeGemma works well for pure code completion but doesn't reliably follow tool-call schemas, so it's flagged as "no tool-use" in the manifest and used only for inline suggestions.
Inline diff generation requires the model to emit structured output — a JSON patch or a before/after block — that the Rust diff engine can parse. Models that hallucinate line numbers or produce malformed patches are excluded from the catalog. Every code model entry includes a "diff reliability" score from 0–100 based on automated testing against 200 reference edits.
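A diff-reliability score of this kind could be computed by counting how many model outputs parse into a well-formed patch. The sketch below is illustrative only: the `{"file", "before", "after"}` shape is a hypothetical patch format, not Cirrus's actual schema.

```python
import json

def reliability_score(model_outputs, total):
    """Percentage (0-100) of outputs that parse as JSON patches
    containing the required fields."""
    ok = 0
    for raw in model_outputs:
        try:
            patch = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output counts as a failure
        if isinstance(patch, dict) and {"file", "before", "after"} <= patch.keys():
            ok += 1
    return round(100 * ok / total)
```

Running this over the 200 reference edits mentioned above would yield the per-model score stored in the catalog.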
Vision models (Stratus, Gale vision)
Vision-language models power both Stratus (camera-to-ask OCR) and Gale's multimodal input. The catalog includes Qwen3-VL 3B (MLX 4-bit), LLaVA 1.6 7B (Q4_K_M GGUF), and Gemma-3-VIT 4B (MLX 4-bit). These models accept image tokens alongside text and can describe photos, read documents, extract tables, and answer questions about visual content.
Multimodal routing is handled by the capability manifest. When you attach an image to a Gale message, the runtime checks whether the active model has the vision flag. If it does, the image is resized, tokenized, and prepended to the prompt. If it doesn't, Nimbus8 offers to switch to a vision-capable model or falls back to Apple Vision framework OCR for text extraction.
Memory requirements are higher for VLMs. Qwen3-VL 3B needs roughly 3.2 GB at inference with a single 768×768 image. LLaVA 1.6 7B requires 5.1 GB and is gated to 8 GB devices only. Image resolution is automatically downscaled to fit within the model's maximum patch count — typically 1024 patches for 3B models and 2048 for 7B — to keep memory predictable.
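Downscaling to a patch budget can be computed directly from the image dimensions. The sketch below assumes a 32-pixel patch size purely for illustration; the actual patch size depends on the model's vision encoder.

```python
import math

def downscale_to_patch_budget(w, h, patch=32, max_patches=1024):
    """Uniformly scale (w, h) down until the image fits the patch budget."""
    def patches(w, h):
        return math.ceil(w / patch) * math.ceil(h / patch)
    while patches(w, h) > max_patches:
        scale = math.sqrt(max_patches / patches(w, h))
        w, h = int(w * scale), int(h * scale)
    return w, h
```

For example, a 2048×2048 input at a 1024-patch budget is halved to 1024×1024, while anything already under budget passes through unchanged.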
Image models (Mirage)
Mirage generates images on-device using Apple's Core ML pipeline on the Neural Engine. The catalog includes Stable Diffusion 1.5 (palettized, ~1.7 GB), SDXL Turbo (palettized, ~2.9 GB), and PixArt-α 512 (Core ML, ~2.2 GB). All models are converted to Core ML's .mlmodelc format with 6-bit palettization where supported, which cuts model size by roughly 60% compared to float16.
SD 1.5 palettized is the default pick for 6 GB devices. It generates 512×512 images in 20 steps in about 8 seconds on A17 Pro, using approximately 2.4 GB of peak memory. SDXL Turbo produces 512×512 images in just 4 steps (~3 seconds) but needs 3.8 GB of peak memory, so it's gated to 8 GB devices. PixArt-α offers a different aesthetic — cleaner compositions, better text rendering — at a memory cost between the two.
The Neural Engine is the primary compute target. Core ML's scheduler routes the UNet and VAE to the Neural Engine automatically, with fallback to the GPU for operations the ANE doesn't support. Nimbus8's memory guard checks available memory before every generation and refuses to start if headroom is below 400 MB, preventing mid-generation OOM kills that would lose the user's work.
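The memory-guard check itself is a one-line comparison. The 400 MB floor comes from the text above; how available memory and the peak estimate are measured is platform-specific and not shown here.

```python
SAFETY_FLOOR_MB = 400

def can_start_generation(available_mb, peak_estimate_mb):
    """Refuse to start if generation would push free memory
    below the safety floor."""
    return available_mb - peak_estimate_mb >= SAFETY_FLOOR_MB
```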
Speech models (Overture)
Overture's speech-to-text runs on Whisper models in GGML format via whisper.cpp. The catalog includes Whisper tiny (39M params, ~75 MB), base (74M, ~142 MB), small (244M, ~466 MB), and medium (769M, ~1.5 GB). Each model is tested for word error rate across English, Spanish, French, German, Japanese, and Mandarin test sets.
Whisper tiny and base are the default picks for 4 GB and 6 GB devices respectively. They run in real-time or faster on all supported chips — tiny processes a 60-second clip in about 2 seconds on A17 Pro. Small is the sweet spot for accuracy on 8 GB devices, with a word error rate roughly 40% lower than base on noisy audio. Medium is available on 8 GB devices but gated behind a manual toggle because it uses 2.1 GB at inference.
Language support covers 99 languages via Whisper's built-in language detection. You can also pass a language hint to skip detection and improve accuracy for known-language audio. Word-level timestamps are extracted from the cross-attention weights and exposed as structured data — Overture uses them for word-by-word playback highlighting and SRT/VTT export.
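SRT export from word-level timestamps is mostly a formatting exercise. A minimal sketch, assuming timestamps arrive as `(text, start_seconds, end_seconds)` tuples (an illustrative shape, not Overture's actual data model):

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def to_srt(words):
    """words: list of (text, start_s, end_s) -> SRT document string."""
    blocks = []
    for i, (text, start, end) in enumerate(words, 1):
        blocks.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)
```

Real exports group words into caption lines rather than emitting one block per word, but the timestamp arithmetic is the same.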
TTS models (Overture/Chinook)
Text-to-speech in Nimbus8 is powered by Kokoro-82M v1.0, a lightweight neural TTS model that runs through ONNX Runtime on the Neural Engine. At 82 million parameters and roughly 320 MB on disk, it's small enough to keep loaded alongside a chat model without memory pressure on 8 GB devices.
Kokoro ships with 54 curated voices across 10 languages: English (American and British), Spanish, French, German, Italian, Portuguese, Japanese, Korean, and Mandarin. Each voice is a style vector — a small embedding that conditions the decoder — so switching voices doesn't require loading a separate model. Voice previews are available in the Chinook settings screen.
Output is 24 kHz mono WAV, playable directly through AVAudioPlayer or exportable as a file. On A17 Pro, Kokoro synthesizes roughly 150 words per second — well above real-time. Latency from tap to first audio is typically under 200 ms for short utterances. The ONNX Runtime backend is configured to use the Neural Engine as the primary execution provider, with CPU fallback for unsupported ops.
Embedding models (Mist)
Mist's hybrid search relies on dense vector embeddings for the semantic retrieval lane. The catalog includes BGE-small-en-v1.5 (33M params, ~130 MB ONNX), MiniLM-L6-v2 (22M params, ~90 MB ONNX), and EmbeddingGemma-300M (~1.1 GB ONNX). BGE and MiniLM produce 384-dimensional vectors; EmbeddingGemma produces 768-dimensional ones. All are normalized to unit length for cosine similarity.
BGE-small-en is the default. It embeds a 512-token passage in about 12 ms on A17 Pro and produces high-quality English embeddings that score competitively with models 10× its size on MTEB retrieval benchmarks. MiniLM-L6-v2 is slightly faster (~8 ms per passage) but scores a few points lower on semantic similarity tasks. EmbeddingGemma-300M is the multilingual option — it handles 30+ languages well but needs 1.4 GB at inference, so it's gated to 8 GB devices.
All embedding models run through ONNX Runtime with the CoreML execution provider, which routes the transformer layers to the Neural Engine. The embedding pipeline tokenizes, pads, runs inference, and mean-pools in a single synchronous call. Batch embedding (used during knowledge pack ingest) processes up to 32 passages per batch to maximize Neural Engine throughput.
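The pool-and-normalize step at the end of that pipeline looks roughly like this. A pure-Python sketch, assuming the ONNX session returns token-level hidden states as a list of per-token vectors:

```python
import math

def mean_pool(hidden):
    """hidden: [tokens][dims] -> one vector averaged over tokens."""
    dims = len(hidden[0])
    return [sum(tok[d] for tok in hidden) / len(hidden) for d in range(dims)]

def normalize(v):
    """Scale to unit length so cosine similarity is a plain dot product."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """For unit vectors, cosine similarity reduces to the dot product."""
    return sum(x * y for x, y in zip(a, b))
```

Unit-normalizing at embed time is what lets Mist's retrieval lane rank candidates with dot products alone.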
Translation models (Zephyr)
Zephyr doesn't use dedicated translation models. Instead, it routes through the same instruct LLMs available in Gale — Llama 3.2, Qwen3, Gemma 3, Phi-4-mini — with a translation-specific system prompt. The prompt instructs the model to translate from a source language to a target language, preserving formatting, and to output only the translation without commentary.
This approach means translation quality scales with model size. Qwen3 4B produces noticeably better translations than Llama 3.2 1B, especially for languages with complex morphology (Japanese, Korean, Arabic). The catalog marks each chat model's supported translation pairs based on testing — a model is listed for a language pair only if it scores above a quality threshold on a held-out test set of 500 sentence pairs.
For users who need higher-quality translation and are willing to use more memory, the catalog includes a note recommending Qwen3 4B or Gemma 3 4B as the best balance of quality and speed for multilingual work. The 1B models are adequate for simple sentences and common language pairs (English↔Spanish, English↔French) but struggle with idiomatic expressions and technical vocabulary.
Bring your own model
The catalog is curated, but Nimbus8 doesn't lock you in. You can download any compatible model from Hugging Face using the built-in browser — search by name, filter by format (GGUF, MLX, ONNX, Core ML), pick a quantization variant, and download. The app's resumable download manager handles large files gracefully, with HTTP Range resume and LFS redirect support.
Device-fit gating still applies to BYOM models. Before loading, Nimbus8 estimates the model's memory footprint from its file size and quantization level, then checks against available RAM. If the model is too large, you'll see a clear warning with the estimated requirement and your device's available headroom. You can override the warning, but the app will refuse to load if available memory drops below the 400 MB safety floor.
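The footprint estimate and floor check could be sketched as follows. The 400 MB floor comes from the text; the 1.15 overhead multiplier (for KV cache and activations on top of the weights) is an assumption for illustration, not Nimbus8's actual heuristic.

```python
SAFETY_FLOOR_GB = 0.4  # 400 MB floor from the memory guard

def estimated_footprint_gb(file_size_gb, overhead=1.15):
    """Rough inference footprint: weights at roughly file size,
    plus overhead for KV cache and activations."""
    return file_size_gb * overhead

def loadable(file_size_gb, available_ram_gb):
    """Allow the load only if the estimate leaves the safety floor intact."""
    return available_ram_gb - estimated_footprint_gb(file_size_gb) >= SAFETY_FLOOR_GB
```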
Supported formats are GGUF (via llama.cpp), MLX (via mlx-swift), ONNX (via ONNX Runtime), and Core ML (.mlmodelc). The format determines which backend handles inference. GGUF models work on all supported devices; MLX models require Apple Silicon (A14+); ONNX models use the CoreML execution provider for Neural Engine acceleration; Core ML models run natively. If you download a model in an unsupported format, the app will tell you why it can't load and suggest alternatives.