Image understanding
Captions, Q&A, and visual reasoning on-device. Ask what's in a photo, what a diagram means, or how a screenshot is laid out — no frame ever leaves the phone.
Stratus is Nimbus8's vision module. Point your camera at a document or scene and ask anything. OCR, captioning, and visual Q&A all run on-device with Core ML and MLX.
Apple's Vision framework text recognizer reads printed and handwritten text with per-line confidence scoring. Latin, CJK, Cyrillic, Arabic — whatever your camera catches.
Point the camera, ask once, answers in under a second. Stratus pre-runs OCR and a quick caption the moment the shutter fires — your question has context before you finish typing.
Stratus is Nimbus8's vision module — a camera-first surface for asking questions about the physical world. Point your iPhone at a sign, a page, a whiteboard, or a screenshot, and Stratus will read the text, describe the scene, and answer follow-up questions. Everything happens on-device.
Under the hood, Stratus is a thin layer over three engines: Apple's Vision framework for text recognition, Core ML for fast on-device image encoders, and MLX for running open vision-language models like LLaVA and Qwen-VL. Stratus is currently in beta — the capture loop and OCR are solid, VLM coverage is still expanding.
Stratus's OCR path uses Apple's VNRecognizeTextRequest — the same engine that powers Live Text across iOS — with the "accurate" recognition level and automatic language correction enabled. Each recognized line comes back with a bounding box and a confidence score; Stratus renders the top-confidence pass into the reader strip and keeps alternate candidates in memory for follow-up questions.
Because recognition runs on the Neural Engine, a full page of text resolves in well under a second on iPhone 15 Pro and newer. Languages supported out of the box include English, Spanish, French, German, Italian, Portuguese, Chinese (Simplified and Traditional), Japanese, Korean, and several others — the list tracks Apple's Vision framework capabilities per iOS release.
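The OCR pass described above can be sketched in a few lines of Swift. This is a minimal illustration, not Stratus's actual source: the `ReaderLine` type and `recognizeText` function are hypothetical names, but the Vision calls — `VNRecognizeTextRequest` with the accurate recognition level and language correction — are the real API.

```swift
import Vision
import CoreGraphics

// Illustrative type for one recognized line in the reader strip.
struct ReaderLine {
    let text: String
    let confidence: Float     // per-line confidence, 0.0–1.0
    let boundingBox: CGRect   // normalized image coordinates
}

func recognizeText(in image: CGImage) throws -> [ReaderLine] {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate   // the "accurate" level Stratus uses
    request.usesLanguageCorrection = true  // automatic language correction

    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])         // runs on the Neural Engine when available

    return (request.results ?? []).compactMap { observation in
        // Top candidate goes to the reader strip; alternates stay available
        // via topCandidates(n) for follow-up questions.
        guard let top = observation.topCandidates(1).first else { return nil }
        return ReaderLine(text: top.string,
                          confidence: top.confidence,
                          boundingBox: observation.boundingBox)
    }
}
```

The same request object also exposes `supportedRecognitionLanguages()`, which is how the available-language list tracks the iOS release.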
For scene understanding, Stratus routes the captured image to an open vision-language model running locally via MLX. Default picks include LLaVA-1.5-7B and Qwen-VL 2B — the first for descriptive captioning and open-ended Q&A, the second for fast triage on smaller devices.
Every frame is encoded once and cached per capture. Follow-up questions reuse the same image embedding, so the second, third, and fourth question cost a fraction of the first. A capability manifest marks which models can accept images directly; text-only models get the OCR pass plus the caption as prompt context instead.
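The encode-once-per-capture behavior amounts to a cache keyed by capture ID. A minimal sketch, with `encode` standing in for the real Core ML / MLX image encoder (the class and its names are illustrative, not Stratus's actual types):

```swift
import Foundation

// Per-capture embedding cache: the expensive encoder runs once per frame,
// and every follow-up question reuses the stored embedding.
final class CaptureCache {
    private var embeddings: [UUID: [Float]] = [:]
    private(set) var encoderRuns = 0   // exposed here only to show the effect

    func embedding(for captureID: UUID, frame: [Float]) -> [Float] {
        if let cached = embeddings[captureID] { return cached }
        encoderRuns += 1
        let encoded = encode(frame)    // expensive: once per capture
        embeddings[captureID] = encoded
        return encoded
    }

    private func encode(_ frame: [Float]) -> [Float] {
        // Placeholder for the image encoder; real output is a model embedding.
        frame.map { $0 * 0.5 }
    }
}
```

The first question pays for the encoder pass; the second, third, and fourth are dictionary lookups.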
The camera is the composer. When you tap the shutter, Stratus does three things in parallel before you can type: it runs VNRecognizeTextRequest on the frame and writes the recognized lines into the reader strip, it encodes the image for the selected vision model, and it drafts a quick caption. By the time you finish typing "what does this say" or "what is this," the model already has the image encoded and the OCR transcript ready. First-token latency on iPhone 15 Pro is typically under half a second.
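The shutter-time fan-out maps naturally onto Swift concurrency's `async let`. A hedged sketch — the `Frame`, `CaptureContext`, and three helper functions are placeholders for the real Vision, Core ML, and MLX calls:

```swift
struct Frame { let pixels: [UInt8] }

// Everything the model needs before the user finishes typing.
struct CaptureContext {
    let ocrLines: [String]
    let imageEmbedding: [Float]
    let caption: String
}

// Stand-ins for the three engines named above.
func runOCR(on frame: Frame) async -> [String] { ["hello"] }
func encodeImage(_ frame: Frame) async -> [Float] { [0.1, 0.2] }
func quickCaption(_ frame: Frame) async -> String { "a sign" }

// All three tasks start the moment the shutter fires and run concurrently;
// the await at the end joins whichever finishes last.
func prepareCapture(_ frame: Frame) async -> CaptureContext {
    async let lines = runOCR(on: frame)
    async let embedding = encodeImage(frame)
    async let caption = quickCaption(frame)
    return await CaptureContext(ocrLines: lines,
                                imageEmbedding: embedding,
                                caption: caption)
}
```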
Stratus ships with a curated picker of open vision models, filtered by whether they actually run on your device. The current catalog centers on LLaVA-1.5-7B and Qwen-VL 2B, with new MLX and Core ML ports landing through beta.
The picker flags anything that exceeds your device's measured memory budget and sorts by measured first-token latency on your chip — not by raw parameter count.
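The picker's policy is simple enough to state as code. A sketch under stated assumptions — the `VisionModel` fields and the measured numbers are illustrative, not the real catalog data:

```swift
// One catalog entry, with on-device measurements rather than spec-sheet numbers.
struct VisionModel {
    let name: String
    let peakMemoryMB: Int          // measured peak memory on this device
    let firstTokenLatencyMS: Int   // measured first-token latency on this chip
}

// Drop anything over the device's memory budget, then sort by measured
// latency — not by parameter count.
func pickerOrder(catalog: [VisionModel], memoryBudgetMB: Int) -> [VisionModel] {
    catalog
        .filter { $0.peakMemoryMB <= memoryBudgetMB }
        .sorted { $0.firstTokenLatencyMS < $1.firstTokenLatencyMS }
}
```

Sorting by measured latency matters because a smaller model with a slow runtime port can still lose to a larger one with a good Neural Engine path.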
Every captured frame. Every OCR transcript. Every caption. Every follow-up question and answer. None of it is uploaded, logged, or synced. The camera feed goes to the Neural Engine and back — it never touches a server.
The network is reached only when you install or update a vision model from Hugging Face. There is no telemetry and no account. See the privacy policy for the full data flow.
Does Stratus work offline?
Yes. Once your chosen vision model is downloaded, capture, OCR, captioning, and Q&A all run in airplane mode. The network is only required to install new models.
How accurate is the OCR?
On clean, well-lit printed text it's near-perfect — the same engine behind iOS Live Text. Handwriting and low-contrast scenes degrade gracefully, with per-line confidence scores surfaced in the reader strip so you know which lines to double-check.
Can I ask about images I already have?
Yes. Any image from the Photos picker, Files app, or a screenshot goes through the same capture-to-ask pipeline. The camera is the default entry point because it's the fastest loop, not the only one.
Which model should I start with?
On iPhone 15 Pro or newer, start with LLaVA-1.5-7B for general use. On older devices or when you want snappier replies, Qwen-VL 2B is the better fit. The model picker flags what your device can run and sorts by measured speed on your chip.
OCR and the capture loop are production-quality, but vision-language model coverage on Hugging Face is still uneven for iOS runtimes — new MLX and Core ML ports appear weekly. We're shipping Stratus now so the capture-to-ask flow gets real use; the model catalog will keep expanding through beta.