Architecture overview
Nimbus8 runs open models entirely on-device using three inference runtimes, selected automatically per device and model format. MLX is the primary runtime for Apple Silicon — it powers chat, code, agent, and translation workloads through the MLXLLM Swift package linked directly into the iOS target. MLX models are loaded in-process with no server, no socket, and no IPC overhead. llama.cpp handles GGUF-quantized models via the Rust core's inference_client module, which speaks the OpenAI-compatible HTTP protocol against a local llama-server process or any compatible endpoint on the user's LAN. Core ML drives the Neural Engine for diffusion (Mirage) and speech (Overture) workloads through Apple's ml-stable-diffusion package and ONNX Runtime.
Runtime selection is not user-facing — it's determined by the model's format and the device's hardware profile. An MLX model downloaded from Hugging Face routes through LocalMLXProvider; a GGUF file routes through RemoteOpenAIProvider pointed at a local endpoint; a Core ML pipeline routes through Apple's own framework APIs. The user sees one unified interface regardless of which engine is doing the work underneath.
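The format-driven selection rule above can be sketched in a few lines. This is an illustrative Rust sketch, not the actual routing code — the enum and function names (ModelFormat, Runtime, select_runtime) are assumptions:

```rust
// Hypothetical sketch of format-based runtime routing; names are
// illustrative, not the actual Nimbus8 API.
#[derive(Debug, PartialEq)]
enum ModelFormat { MlxSafetensors, Gguf, CoreMlPipeline }

#[derive(Debug, PartialEq)]
enum Runtime { LocalMlx, RemoteOpenAiCompatible, CoreMl }

fn select_runtime(format: &ModelFormat) -> Runtime {
    match format {
        // MLX weights load in-process on Apple Silicon
        ModelFormat::MlxSafetensors => Runtime::LocalMlx,
        // GGUF goes through the OpenAI-compatible HTTP client
        ModelFormat::Gguf => Runtime::RemoteOpenAiCompatible,
        // Core ML pipelines use Apple's framework APIs directly
        ModelFormat::CoreMlPipeline => Runtime::CoreMl,
    }
}
```

Because the mapping is total over the format enum, there is no "unsupported format" error path at selection time — unsupported files are rejected earlier, at download classification.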
The entire stack is split across two languages. Swift owns the UI layer, the module surfaces, session management, and the InferenceManager orchestration. Rust owns the performance-critical foundation: the FFI bridge, download manager, diff engine, Hugging Face client, search index, speech-to-text, and text-to-speech. The two halves communicate through a C ABI defined in rust-core/src/ffi.rs, with a shared tokio runtime (RUNTIME) that persists across FFI calls so async work doesn't pay per-call thread setup costs.
The module system
Nimbus8 ships eight modules, all named after weather phenomena: Gale (chat), Cirrus (code), Ashe (agents), Mist (search), Mirage (image and video generation), Overture (audio — transcription, dictation, TTS, music generation), Stratus (vision and OCR), and Zephyr (translation). Each module is defined as a case of the NimbusModule enum, which carries a rawValue, displayName, SF Symbol icon, and human-readable description. Legacy module names ("echo", "chinook", "lens") are mapped to their merged successors in a custom init(from:) decoder so persisted data survives renames.
Every module shares the same on-device inference runtime. A model installed for Gale can be borrowed by Ashe, Cirrus, Zephyr, or Stratus — the ModelRouter.moduleAccepts(_:mountedFor:) function encodes the cross-module borrowing rules. Ashe reuses any LLM mounted under Gale. Cirrus accepts both coding-specialized models and generic Gale chat models. Zephyr borrows Gale's instruct LLM for translation. Stratus borrows Gale's vision-capable LLM for deep reads. This means a user who downloads a single model can use it across multiple modules without duplicating files on disk.
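The borrowing rules described above are small enough to sketch directly. This is a Rust stand-in for the Swift ModelRouter.moduleAccepts(_:mountedFor:) function, encoding only the rules the text states:

```rust
// Illustrative sketch (not the Swift ModelRouter API) of the
// cross-module borrowing rules described above.
#[derive(Clone, Copy, PartialEq)]
enum Module { Gale, Cirrus, Ashe, Mist, Mirage, Overture, Stratus, Zephyr }

// Does `module` accept a model that is mounted for `mounted_for`?
fn module_accepts(module: Module, mounted_for: Module) -> bool {
    use Module::*;
    // A module always accepts its own mounts
    if module == mounted_for { return true; }
    match module {
        // Ashe, Cirrus, Zephyr, and Stratus all borrow Gale's LLMs
        Ashe | Cirrus | Zephyr | Stratus => mounted_for == Gale,
        // Mist, Mirage, Overture, and Gale take no cross-module borrows
        _ => false,
    }
}
```

In practice this means a single downloaded chat LLM lights up four module dropdowns at once without duplicating files on disk.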
Module surfaces are SwiftUI views under Nimbus/Features/, each with its own ViewModel and any module-specific stores. Shared infrastructure — inference, model management, sessions, theming, hardware scanning — lives in Nimbus/Core/. The session system uses ModuleKind to tag each session with its originating module, and ModuleSessionAdapters to serialize module-specific payloads (Mirage generation parameters, Overture audio settings, Cirrus workspace state) alongside the generic chat history.
Capability manifest
Every model in Nimbus — curated catalog entries and user-downloaded Hugging Face models alike — resolves to a ModelCapabilities struct. This struct is the single source of truth for what a model can do and how the UI should render its controls. It carries a CapabilityKind (e.g. .textGeneration, .imageTextToText, .automaticSpeechRecognition, .textToImage, .textToSpeech, .featureExtraction), a list of target modules, a ParamTable of runtime-tunable knobs, and a set of CapabilityFeature flags.
Feature flags control optional behaviors the module UI hides when unsupported: .vision (accepts image inputs), .pdfNative (accepts PDF bytes directly), .toolCalling (OpenAI-style function calls), .code (show in Cirrus), .streaming (token streaming), .negativePrompt and .imgToImg (diffusion), .wordTimestamps and .translate (STT), .ssml and .multiVoice (TTS), .reranking and .multilingual (embedders). The ParamTable stores each parameter under a stable ParamKey string that matches the OpenAI-compatible wire name — temperature, top_p, top_k, min_p, repeat_penalty, max_tokens, steps, guidance, seed — so the settings UI renders controls generically from the schema without per-model view code.
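A ParamTable of this shape can be sketched as a map from wire names to parameter schemas. The struct fields and every numeric default below are illustrative assumptions, not Nimbus's actual values — the point is only that keys match the OpenAI-compatible wire names so the UI can render controls generically:

```rust
use std::collections::BTreeMap;

// Hypothetical ParamTable sketch keyed by OpenAI-compatible wire names.
// Field names and all defaults/ranges are illustrative assumptions.
#[derive(Debug, Clone, Copy)]
struct ParamSpec { min: f64, max: f64, default: f64 }

fn default_text_gen_params() -> BTreeMap<&'static str, ParamSpec> {
    let mut t = BTreeMap::new();
    // Keys match the wire names so the settings UI renders generically
    t.insert("temperature",    ParamSpec { min: 0.0, max: 2.0,    default: 0.7 });
    t.insert("top_p",          ParamSpec { min: 0.0, max: 1.0,    default: 0.95 });
    t.insert("repeat_penalty", ParamSpec { min: 0.5, max: 2.0,    default: 1.1 });
    t.insert("max_tokens",     ParamSpec { min: 1.0, max: 8192.0, default: 1024.0 });
    t
}
```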
Resolution happens in CapabilityRegistry. For curated catalog entries, capabilities are hand-tuned per model family with known context limits, vision support, and validated sampler/size presets. For ad-hoc Hugging Face downloads, the registry infers capabilities from the repo's pipeline_tag, tags, library name, and filename patterns. The inference is conservative — it shows fewer controls rather than lying about backend support. Vision-capable models are tracked in a hand-maintained visionCapableIds set; context limits are encoded per model id with a 4,096-token fallback for unknown entries.
Model routing
When a model finishes downloading, ModelRouter answers the question: "which module dropdown does this belong in?" For curated catalog entries, routing is a direct switch on ModelMetadata.category — .stt and .tts route to Overture, .image to Mirage, .coding to Cirrus, and everything else to Gale. For ad-hoc Hugging Face downloads, ModelRouter.classify(repoId:filename:tags:libraryName:pipelineTag:) runs a cascade of heuristics: HF pipeline tags first (the most reliable signal), then repo-id keyword matching, then tag and library-name checks. Each classification carries a Confidence level (.high, .medium, .low) and a human-readable reason so the UI can prompt the user to confirm borderline cases.
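The cascade order — pipeline tag first, then repo-id keywords, then a low-confidence fallback — can be sketched as follows. The keyword lists, module strings, and function shape are assumptions for illustration, not the actual classify implementation:

```rust
// Illustrative sketch of the classification cascade for ad-hoc
// Hugging Face downloads; keyword lists and names are assumptions.
#[derive(Debug, PartialEq)]
enum Confidence { High, Medium, Low }

#[derive(Debug, PartialEq)]
struct Classification { module: &'static str, confidence: Confidence, reason: String }

fn classify(repo_id: &str, pipeline_tag: Option<&str>) -> Classification {
    // 1. HF pipeline tag: the most reliable signal
    if let Some(tag) = pipeline_tag {
        let module = match tag {
            "automatic-speech-recognition" | "text-to-speech" => Some("Overture"),
            "text-to-image" => Some("Mirage"),
            "feature-extraction" | "sentence-similarity" => Some("Mist"),
            "text-generation" => Some("Gale"),
            _ => None,
        };
        if let Some(m) = module {
            return Classification { module: m, confidence: Confidence::High,
                reason: format!("pipeline_tag `{}`", tag) };
        }
    }
    // 2. Repo-id keyword matching
    let id = repo_id.to_lowercase();
    if id.contains("whisper") {
        return Classification { module: "Overture", confidence: Confidence::Medium,
            reason: "repo id mentions whisper".into() };
    }
    if id.contains("coder") || id.contains("code") {
        return Classification { module: "Cirrus", confidence: Confidence::Medium,
            reason: "repo id suggests a coding model".into() };
    }
    // 3. Low-confidence fallback so the UI can ask the user to confirm
    Classification { module: "Gale", confidence: Confidence::Low,
        reason: "no strong signal; defaulting to chat".into() }
}
```

The returned reason string is what lets the UI explain *why* it wants to file a borderline model somewhere, rather than silently guessing.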
MountedModelStore is the persistence layer for installed models. It's a lightweight @Observable singleton backed by UserDefaults JSON. Each Entry records the repoId, filename, module, and installedAt timestamp. The store tracks the user's active pick per module (activeIdByModule) and the last-used model per module (lastUsedIdByModule) so navigating back to a module restores the model the user last worked with — no manual re-selection needed. First-mount-wins logic ensures the first model ever installed for a module becomes the active pick automatically.
CapabilityRegistry completes the picture by resolving which modules a model's capabilities qualify it for. A coding-tuned LLM lands in Gale, Cirrus, and Ashe simultaneously. A Whisper model lands only in Overture. An embedding model lands only in Mist. The modules array on ModelCapabilities is the authoritative list, and MountedModelStore.compatibleModels(for:) uses ModelRouter.moduleAccepts to include cross-module borrows when checking whether a module has something it can run.
Device fit
Nimbus tiers every device into one of four capability levels — Ultra, Pro, Standard, and Lite — based on chip generation, physical RAM, and a hand-tuned memory budget. The DeviceCapability singleton reads the machine identifier via utsname, looks it up in a comprehensive DeviceSpec table covering every iPhone from the iPhone 11 (A13, 4 GB, 900 MB budget) through the iPhone 17 Pro Max (A19 Pro, 12 GB, 6,000 MB budget), plus iPads and Macs. The budget represents the maximum model weight + KV cache the device can sustain without risking an OOM kill, accounting for iOS overhead and the increased-memory-limit entitlement.
The DeviceCapability.fit(for:) method evaluates a ModelShape (weights MB, KV cache MB, minimum GPU family) against the device's live state. It returns one of five verdicts: .ok, .tight (fits the tier but not right now — available memory is low), .tooHeavy (exceeds the tier budget), .tooOld (GPU family below the model's minimum), or .noDisk (insufficient free storage). The GPU family is detected via Metal's supportsFamily API, probing from .apple10 (A19/M5 class) down to .apple6 (A13 class).
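The ordering of those checks matters: hardware incompatibility trumps disk, which trumps the tier budget, which trumps transient memory pressure. A minimal Rust sketch of that decision ladder, with assumed field names and units:

```rust
// Minimal sketch of the fit check described above; struct fields and
// the ordering of checks are assumptions for illustration.
#[derive(Debug, PartialEq)]
enum Fit { Ok, Tight, TooHeavy, TooOld, NoDisk }

#[derive(Clone, Copy)]
struct Device { budget_mb: u64, available_mb: u64, free_disk_mb: u64, gpu_family: u8 }

#[derive(Clone, Copy)]
struct ModelShape { weights_mb: u64, kv_cache_mb: u64, min_gpu_family: u8, disk_mb: u64 }

fn fit(device: &Device, model: &ModelShape) -> Fit {
    let needed = model.weights_mb + model.kv_cache_mb;
    // Hard incompatibilities first: GPU family, then storage
    if device.gpu_family < model.min_gpu_family { return Fit::TooOld; }
    if device.free_disk_mb < model.disk_mb { return Fit::NoDisk; }
    // Tier budget is a static ceiling; available memory is live state
    if needed > device.budget_mb { return Fit::TooHeavy; }
    if needed > device.available_mb { return Fit::Tight; }
    Fit::Ok
}
```

.tight is the only verdict that can clear itself: closing other apps or waiting out memory pressure turns it into .ok without any change to the model or device tier.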
AdaptiveEngine monitors thermal state and battery level in real time via ProcessInfo.thermalStateDidChangeNotification and UIDevice.batteryStateDidChangeNotification. When the device enters .serious or .critical thermal state, or battery drops to critical, the engine switches to Efficient Mode — halving context window and batch size, reducing thread count, and forcing quantization. When charging, it switches to Performance Mode with doubled batch sizes and quantization disabled. The AdaptiveSettings struct encodes the concrete parameters (context window, batch size, generation threads, streaming rate) for each tier, and the engine re-scans via CapabilityScanner on every condition change.
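The Efficient Mode derivation — halve context and batch, reduce threads — is simple enough to sketch. This is not the AdaptiveSettings implementation; fields and numbers are illustrative:

```rust
// Sketch of the Efficient Mode parameter derivation described above
// (halve context window and batch size, reduce thread count);
// field names and values are illustrative assumptions.
#[derive(Debug, PartialEq, Clone, Copy)]
struct AdaptiveSettings { context_window: u32, batch_size: u32, threads: u32 }

fn efficient_mode(base: AdaptiveSettings) -> AdaptiveSettings {
    AdaptiveSettings {
        context_window: base.context_window / 2,
        batch_size: base.batch_size / 2,
        threads: (base.threads / 2).max(1), // never drop to zero threads
    }
}
```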
Inference pipeline
Every chat turn flows through a three-layer pipeline: InferenceManager → InferenceProvider → concrete provider. InferenceManager is a @MainActor singleton that owns the user's provider mode selection (.localMLX or .customEndpoint), the default model id, and the custom endpoint URL and API key. Its resolve() method returns the appropriate InferenceProvider implementation. There is no Nimbus cloud provider — the app has no backend. All inference runs on-device or on the user's own LAN.
LocalMLXProvider runs models in-process via the MLXLLM Swift package on Apple Silicon. RemoteOpenAIProvider speaks the OpenAI-compatible chat completions API over HTTP, targeting llama.cpp, Ollama, LM Studio, vLLM, or any spec-compliant server. Custom endpoints enforce HTTPS unless the host is loopback (127.0.0.1, ::1, localhost) — sending conversation content over cleartext on an open LAN is rejected at validation time. API keys are stored in the iOS Keychain with kSecAttrAccessibleAfterFirstUnlockThisDeviceOnly and never logged.
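The HTTPS-unless-loopback rule reduces to a small predicate. A hedged Rust sketch (the real validation lives in Swift; this helper is hypothetical):

```rust
// Sketch of the HTTPS-unless-loopback rule described above
// (hypothetical helper, not the actual Swift validation code).
fn endpoint_allowed(scheme: &str, host: &str) -> bool {
    if scheme == "https" { return true; }
    // Plain HTTP is only tolerated for loopback targets, where
    // conversation content never leaves the device
    scheme == "http" && matches!(host, "127.0.0.1" | "::1" | "localhost")
}
```

Anything else — plain HTTP to a LAN address, or an unknown scheme — is rejected at validation time, before any request is made.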
The InferenceProvider protocol defines a single chat(_:onToken:) method that accepts a ChatCompletionRequest and streams tokens via a callback. The request carries all OpenAI-standard parameters (temperature, top_p, max_tokens, stop, presence/frequency penalty, seed) plus widely-honored extensions (top_k, min_p, repeat_penalty) that local backends like LM Studio and Ollama support. Multimodal input is handled through InferenceChatMessage.Content, which encodes as either a bare string (text-only) or an array of typed ContentPart values (text + image URLs with base64-embedded bytes). The result includes the full text, finish reason, and prompt/completion token counts.
Rust core
The Rust core (nimbus-rust-core) is the native foundation library, compiled for iOS via cargo build --target aarch64-apple-ios and linked into the Swift target through a C ABI defined in ffi.rs. It exposes platform-neutral primitives that would be impractical or slow in pure Swift: workspace_engine (local project + file state on disk), hf_client (Hugging Face Hub search and metadata), download_manager (resumable model downloads via hf-hub with HTTP Range resume and LFS redirect handling), model_manager (installed model paths and load handles), model_registry (persistent index.json with SHA256 TOFU verification), and diff_engine (Myers diff algorithm for Cirrus code diffs and PR generation).
Audio and search capabilities are feature-gated. The stt module wraps whisper.cpp for on-device speech-to-text with word-level timestamps, language detection, and translation. The tts module wraps Kokoro-82M via ONNX Runtime for text-to-speech with 54 voices across 10 languages. The search module provides BGE/MiniLM embedding via ONNX Runtime for Mist's dense retrieval path. The packs module handles knowledge pack archive parsing (tar+gzip with Wikimedia Enterprise HTML NDJSON format).
A shared tokio runtime (RUNTIME, initialized via lazy_static) persists across all FFI calls so async Rust work doesn't pay the 1–2ms thread-pool setup cost on every Swift→Rust round-trip. Global state lives in NIMBUS_STATE, an Arc<RwLock<NimbusState>> that tracks loaded models and active projects. The telemetry module provides an event buffer that the Swift layer can flush — but in practice, Nimbus ships with telemetry disabled and the network off-switch as the default.
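The "initialize once, reuse across FFI calls" pattern behind RUNTIME and NIMBUS_STATE can be shown with standard-library types alone. The real core uses lazy_static plus a tokio Runtime; this std-only sketch substitutes OnceLock and a stand-in state struct to keep it self-contained:

```rust
use std::sync::{Mutex, OnceLock};

// std-only sketch of the shared-global pattern: the real core wraps a
// tokio Runtime and an Arc<RwLock<NimbusState>>; this stand-in shows
// the same initialize-once, reuse-everywhere shape.
#[derive(Default)]
struct NimbusState { loaded_models: Vec<String> }

static STATE: OnceLock<Mutex<NimbusState>> = OnceLock::new();

fn state() -> &'static Mutex<NimbusState> {
    // First caller pays initialization; every later FFI call reuses it
    STATE.get_or_init(|| Mutex::new(NimbusState::default()))
}

fn load_model(id: &str) -> usize {
    let mut s = state().lock().unwrap();
    s.loaded_models.push(id.to_string());
    s.loaded_models.len() // number of models tracked so far
}
```

The payoff is the same as with the shared tokio runtime: state and worker threads outlive any single Swift→Rust round-trip, so repeated calls stay cheap.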
Ashe runtime
Ashe is the on-device agent runtime, implemented as a standalone Rust crate (nimbus-ashe) with its own turn loop, tool registry, memory system, and audit log. The core abstraction is the turn: one user prompt's worth of work. The TurnEngine composes an initial message list from the system prompt, hot memory entries, relevant skills (FTS-indexed and safety-scanned), recent session history, and the new user message. It then enters a loop: call the model, parse tool calls (supporting both native OpenAI function-calling and inline XML/sentinel formats via the normalize parser), dispatch tools, feed results back, and repeat until the model emits a final assistant message with no tool calls, the fuel budget is exhausted, or the caller cancels.
Fuel metering bounds every turn with three limits: max tokens, wall-time duration, and max tool calls. The FuelBudget struct defines four presets — interactive() (4,000 tokens, 60s, 12 tool calls), background() (1,500 tokens, 25s, 4 tool calls for BGAppRefreshTask), continued() (8,000 tokens, 5 min, 24 tool calls for BGContinuedProcessingTask), and overnight() (16,000 tokens, 10 min, 48 tool calls for charging + WiFi). The FuelMeter uses atomic counters shared via Arc so cloned meters track the same budget. Exhaustion is a clean termination, not a crash.
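The shared-atomics design can be sketched as follows. Field names are assumptions and the wall-time limit is omitted for brevity; the interactive() numbers mirror the preset above:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch of fuel metering with atomic counters shared across clones.
// Field names are assumptions; wall-time tracking omitted for brevity.
#[derive(Clone)]
struct FuelMeter {
    tokens_used: Arc<AtomicU64>,
    calls_used: Arc<AtomicU64>,
    max_tokens: u64,
    max_tool_calls: u64,
}

impl FuelMeter {
    fn interactive() -> Self {
        FuelMeter {
            tokens_used: Arc::new(AtomicU64::new(0)),
            calls_used: Arc::new(AtomicU64::new(0)),
            max_tokens: 4_000,   // interactive() preset from the text
            max_tool_calls: 12,
        }
    }
    fn record(&self, tokens: u64, tool_calls: u64) {
        self.tokens_used.fetch_add(tokens, Ordering::Relaxed);
        self.calls_used.fetch_add(tool_calls, Ordering::Relaxed);
    }
    // Exhaustion is a clean termination condition, not an error
    fn exhausted(&self) -> bool {
        self.tokens_used.load(Ordering::Relaxed) >= self.max_tokens
            || self.calls_used.load(Ordering::Relaxed) >= self.max_tool_calls
    }
}
```

Because the counters live behind Arc, a meter cloned into a tool dispatcher and the turn loop's own copy always agree on how much fuel remains.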
The loop guard detects when the model is ping-ponging the same tool call by fingerprinting each call with SHA256(name + normalized_args) and tracking frequency in a sliding window (default: 8 calls). Three identical calls trigger a throttle (a nudge message injected into the conversation); five trigger a break (the turn ends). The safety scanner sweeps any text injected into the system prompt for role-hijack prefixes, invisible Unicode (bidi overrides, zero-width characters), credential exfiltration patterns (curl/wget with $API_KEY), and SSRF probes (cloud metadata endpoints, localhost URLs). The audit log records every step with a Merkle hash chain. After a successful turn, optional skill distillation asks the model to extract a reusable skill from the transcript, and periodic memory extraction (every 5 user turns by default) scans for durable facts to persist to hot memory — both safety-scanned before storage.
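The fingerprint-and-count scheme of the loop guard can be sketched compactly. Here std's DefaultHasher stands in for SHA256, and the window size and throttle/break thresholds mirror the defaults above:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::VecDeque;
use std::hash::{Hash, Hasher};

// Sketch of the loop guard's fingerprint-and-count scheme; std's
// DefaultHasher stands in for SHA256. Window size 8, throttle at 3
// identical calls, break at 5, mirroring the defaults in the text.
struct LoopGuard { window: VecDeque<u64>, capacity: usize }

#[derive(Debug, PartialEq)]
enum Verdict { Proceed, Throttle, Break }

impl LoopGuard {
    fn new() -> Self { LoopGuard { window: VecDeque::new(), capacity: 8 } }

    fn check(&mut self, tool_name: &str, normalized_args: &str) -> Verdict {
        let mut h = DefaultHasher::new();
        (tool_name, normalized_args).hash(&mut h);
        let fp = h.finish();
        if self.window.len() == self.capacity { self.window.pop_front(); }
        self.window.push_back(fp);
        // Count occurrences of this fingerprint in the sliding window
        let repeats = self.window.iter().filter(|&&f| f == fp).count();
        match repeats {
            0..=2 => Verdict::Proceed,  // below the throttle threshold
            3..=4 => Verdict::Throttle, // inject a nudge message
            _ => Verdict::Break,        // end the turn
        }
    }
}
```

Normalizing the arguments before hashing is what makes trivially reordered JSON keys count as the same call; a call with genuinely different arguments gets a fresh fingerprint and a clean slate.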
Mist engine
MistEngine is the hybrid search engine, implemented in Rust and exposed to Swift through the Ashe FFI bridge. It combines BM25 keyword search (backed by tantivy via the Bm25Index) with dense vector retrieval (backed by SQLite with ONNX Runtime embeddings via the DocumentStore), fused through Reciprocal Rank Fusion. The indexing pipeline chunks documents via the Chunker, embeds chunks through a DenseEmbedder trait (wired to BGE or MiniLM on device), upserts the full text into BM25, and stores chunk embeddings in the document store.
Search runs both retrieval paths in parallel: BM25 fetches the top-K keyword matches, the embedder produces a query vector for dense search, and rrf_fuse merges the two ranked lists using the Cormack/Clarke/Buettcher algorithm with k=60. The formula score(d) = Σ_r 1/(k + rank_r(d)), summed over the rankers r, is score-agnostic — BM25's log-TF-IDF scores and cosine similarities live in different numerical spaces, so RRF uses only rank ordering, with no calibration needed. An optional CrossEncoder reranker (bge-reranker-base on device) rescores the fused top-K for final precision. Source filtering (note, file, web, pack:*) is applied before fusion so filtered items don't consume result slots.
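The fusion step itself is only a few lines. A simplified Rust sketch of the formula (the real rrf_fuse signature differs; document ids and the function shape here are illustrative):

```rust
use std::collections::HashMap;

// Minimal RRF sketch implementing score(d) = sum over rankers of
// 1/(k + rank(d)), with k = 60. Item type and function shape are
// illustrative, not the actual rrf_fuse signature.
fn rrf_fuse(ranked_lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for list in ranked_lists {
        for (rank, doc) in list.iter().enumerate() {
            // Only rank order matters: BM25 scores and cosine
            // similarities never need to be calibrated to each other
            *scores.entry(*doc).or_insert(0.0) += 1.0 / (k + (rank + 1) as f64);
        }
    }
    let mut fused: Vec<(String, f64)> =
        scores.into_iter().map(|(d, s)| (d.to_string(), s)).collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}
```

A document that appears near the top of both lists accumulates two large reciprocal terms and rises above documents that rank highly in only one.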
Knowledge packs are installable offline content bundles (Wikipedia, Wiktionary, Wikivoyage) that feed the search index. The install pipeline streams a tar+gzip archive through a parser on a blocking thread, sends batches through a bounded channel (capacity 4, for backpressure), and indexes them asynchronously via batch_index — one embedder call and one BM25 commit per batch. A full Simple Wikipedia (~200k articles) indexes in under 10 minutes on iPhone 15 Pro. Uninstall is a single prefix scan that removes every document indexed under a pack's source tag, plus the manifest row. The Chunker supports configurable strategies — sentence-boundary splitting with overlap is the default, tuned for the 512-token context window of the BGE embedding model.
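The bounded-channel backpressure pattern in that install pipeline can be shown with std's sync_channel standing in for the async channel the real engine uses. Everything here is a self-contained sketch, not the packs code:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Sketch of the bounded-channel backpressure pattern described above;
// std's sync_channel (capacity 4) stands in for the async channel in
// the real pipeline, and "indexing" is reduced to counting documents.
fn index_pack(batches: Vec<Vec<String>>) -> usize {
    // Capacity 4: the parser blocks when the indexer falls behind,
    // so memory use stays bounded regardless of archive size
    let (tx, rx) = sync_channel::<Vec<String>>(4);
    let parser = thread::spawn(move || {
        for batch in batches {
            tx.send(batch).expect("indexer hung up");
        }
        // tx dropped here closes the channel and ends the consumer loop
    });
    let mut indexed = 0;
    for batch in rx {
        // One embedder call and one BM25 commit per batch in the real engine
        indexed += batch.len();
    }
    parser.join().unwrap();
    indexed
}
```

The capacity-4 bound is what keeps a multi-gigabyte archive from ballooning memory: at most four parsed batches are ever in flight between the parser thread and the indexer.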
Theming
Every color in Nimbus flows from a single NimbusPalette struct — a comprehensive set of tokens that mirrors the Material 3 color system: surface hierarchy (background, surface, surfaceDim, surfaceBright, five surfaceContainer levels), primary/secondary/tertiary color families with dim, container, fixed, and on-* variants, error colors, and outline/outlineVariant. The struct also provides computed properties like assistantBubbleGradient (primary → primaryDim) and tonalSeparator. This token-based approach means adding a new palette is a single new NimbusPalette instance — no view code changes.
Nimbus ships four themes, selectable via the NimbusTheme enum: Vanilla Wood (the brand default — warm paper tones with #feffd6 background and quiet ink), Light (clean white surfaces with cool-gray accent), Dark (true-black OLED background with soft tonal steps), and Space Gray (cooler graphite with blue-leaning surfaces). A fifth option, Device default, resolves to Light or Dark based on the iOS color scheme at runtime. Each theme maps to a preferredColorScheme and a palette(forSystem:) resolver.
The active palette lives in NimbusDesign.palette, a nonisolated(unsafe) static that ThemeManager updates on the main thread whenever the user switches themes. Views read it through NimbusDesign.palette.xxx and pick up new values on their next body evaluation — ThemeManager forces a root re-render on switch. Non-SwiftUI consumers (the syntax highlighter, for example) read NimbusDesign.theme to pick a matching highlight.js theme. All eight modules inherit the shared palette — there are no per-module color overrides, which is how Nimbus maintains visual coherence across Gale, Cirrus, Mirage, and every other surface.