Image understanding
Captions, Q&A, and visual reasoning on-device. Ask what's in a photo, what a diagram means, or how a screenshot is laid out — no frame ever leaves the phone.
Stratus is Nimbus8's vision module. Point your camera at a document or scene and ask anything. OCR, captioning, and visual Q&A all run on-device with Core ML and MLX.
Apple's Vision framework text recognizer reads printed and handwritten text with per-line confidence scoring. Latin, CJK, Cyrillic, Arabic — whatever your camera catches.
Point the camera, ask once, answers in under a second. Stratus pre-runs OCR and a quick caption the moment the shutter fires — your question has context before you finish typing.
Stratus is Nimbus8's vision module — a camera-first surface for asking questions about the physical world. Point your iPhone at a sign, a page, a whiteboard, or a screenshot, and Stratus will read the text, describe the scene, and answer follow-up questions. Everything happens on-device.
Under the hood, Stratus is a thin layer over three engines: Apple's Vision framework for text recognition, Core ML for fast on-device image encoders, and MLX for running open vision-language models like LLaVA and Qwen-VL. Stratus is currently in beta — the capture loop and OCR are solid, VLM coverage is still expanding.
Stratus's OCR path uses Apple's VNRecognizeTextRequest — the same engine that powers Live Text across iOS — with the "accurate" recognition level and automatic language correction enabled. Each recognized line comes back with a bounding box and a confidence score; Stratus renders the top-confidence pass into the reader strip and keeps alternate candidates in memory for follow-up questions.
Because recognition runs on the Neural Engine, a full page of text resolves in well under a second on iPhone 15 Pro and newer. Languages supported out of the box include English, Spanish, French, German, Italian, Portuguese, Chinese (Simplified and Traditional), Japanese, Korean, and several others — the list tracks Apple's Vision framework capabilities per iOS release.
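The OCR pass described above can be sketched in a few lines of Swift. This is a minimal illustration, not Stratus's actual source: the `ReaderLine` type and `recognizeText` function are hypothetical names, but the Vision calls — `VNRecognizeTextRequest` with the accurate recognition level and language correction — are the real API.

```swift
import Vision
import CoreGraphics

// Illustrative type for one recognized line in the reader strip.
struct ReaderLine {
    let text: String
    let confidence: Float     // per-line confidence, 0.0–1.0
    let boundingBox: CGRect   // normalized image coordinates
}

func recognizeText(in image: CGImage) throws -> [ReaderLine] {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate   // the "accurate" level Stratus uses
    request.usesLanguageCorrection = true  // automatic language correction

    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])         // runs on the Neural Engine when available

    return (request.results ?? []).compactMap { observation in
        // Top candidate goes to the reader strip; alternates stay available
        // via topCandidates(n) for follow-up questions.
        guard let top = observation.topCandidates(1).first else { return nil }
        return ReaderLine(text: top.string,
                          confidence: top.confidence,
                          boundingBox: observation.boundingBox)
    }
}
```

The same request object also exposes `supportedRecognitionLanguages()`, which is how the available-language list tracks the iOS release.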
For scene understanding, Stratus routes the captured image to an open vision-language model running locally via MLX. Default picks include LLaVA-1.5-7B and Qwen-VL 2B — the first for descriptive captioning and open-ended Q&A, the second for fast triage on smaller devices.
Every frame is encoded once and cached per capture. Follow-up questions reuse the same image embedding, so the second, third, and fourth question cost a fraction of the first. A capability manifest marks which models can accept images directly; text-only models get the OCR pass plus the caption as prompt context instead.
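The encode-once-per-capture behavior amounts to a cache keyed by capture ID. A minimal sketch, with `encode` standing in for the real Core ML / MLX image encoder (the class and its names are illustrative, not Stratus's actual types):

```swift
import Foundation

// Per-capture embedding cache: the expensive encoder runs once per frame,
// and every follow-up question reuses the stored embedding.
final class CaptureCache {
    private var embeddings: [UUID: [Float]] = [:]
    private(set) var encoderRuns = 0   // exposed here only to show the effect

    func embedding(for captureID: UUID, frame: [Float]) -> [Float] {
        if let cached = embeddings[captureID] { return cached }
        encoderRuns += 1
        let encoded = encode(frame)    // expensive: once per capture
        embeddings[captureID] = encoded
        return encoded
    }

    private func encode(_ frame: [Float]) -> [Float] {
        // Placeholder for the image encoder; real output is a model embedding.
        frame.map { $0 * 0.5 }
    }
}
```

The first question pays for the encoder pass; the second, third, and fourth are dictionary lookups.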
The camera is the composer. When you tap the shutter, Stratus does three things in parallel before you can type: it runs VNRecognizeTextRequest on the frame and writes the recognized lines into the reader strip, it encodes the image for the selected vision model, and it drafts a quick caption. By the time you finish typing "what does this say" or "what is this," the model already has the image encoded and the OCR transcript ready. First-token latency on iPhone 15 Pro is typically under half a second.
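The shutter-time fan-out maps naturally onto Swift concurrency's `async let`. A hedged sketch — the `Frame`, `CaptureContext`, and three helper functions are placeholders for the real Vision, Core ML, and MLX calls:

```swift
struct Frame { let pixels: [UInt8] }

// Everything the model needs before the user finishes typing.
struct CaptureContext {
    let ocrLines: [String]
    let imageEmbedding: [Float]
    let caption: String
}

// Stand-ins for the three engines named above.
func runOCR(on frame: Frame) async -> [String] { ["hello"] }
func encodeImage(_ frame: Frame) async -> [Float] { [0.1, 0.2] }
func quickCaption(_ frame: Frame) async -> String { "a sign" }

// All three tasks start the moment the shutter fires and run concurrently;
// the await at the end joins whichever finishes last.
func prepareCapture(_ frame: Frame) async -> CaptureContext {
    async let lines = runOCR(on: frame)
    async let embedding = encodeImage(frame)
    async let caption = quickCaption(frame)
    return await CaptureContext(ocrLines: lines,
                                imageEmbedding: embedding,
                                caption: caption)
}
```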
Stratus ships with a curated picker of open vision models, filtered by whether they actually run on your device. The current catalog centers on LLaVA-1.5-7B and Qwen-VL 2B, with new MLX and Core ML ports landing through beta.
The picker flags anything that exceeds your device's measured memory budget and sorts by measured first-token latency on your chip — not by raw parameter count.
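The picker's policy is simple enough to state as code. A sketch under stated assumptions — the `VisionModel` fields and the measured numbers are illustrative, not the real catalog data:

```swift
// One catalog entry, with on-device measurements rather than spec-sheet numbers.
struct VisionModel {
    let name: String
    let peakMemoryMB: Int          // measured peak memory on this device
    let firstTokenLatencyMS: Int   // measured first-token latency on this chip
}

// Drop anything over the device's memory budget, then sort by measured
// latency — not by parameter count.
func pickerOrder(catalog: [VisionModel], memoryBudgetMB: Int) -> [VisionModel] {
    catalog
        .filter { $0.peakMemoryMB <= memoryBudgetMB }
        .sorted { $0.firstTokenLatencyMS < $1.firstTokenLatencyMS }
}
```

Sorting by measured latency matters because a smaller model with a slow runtime port can still lose to a larger one with a good Neural Engine path.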
Every captured frame. Every OCR transcript. Every caption. Every follow-up question and answer. None of it is uploaded, logged, or synced. The camera feed goes to the Neural Engine and back — it never touches a server.
The network is reached only when you install or update a vision model from Hugging Face. There is no telemetry and no account. See the privacy policy for the full data flow.
Does Stratus work offline?
Yes. Once your chosen vision model is downloaded, capture, OCR, captioning, and Q&A all run in airplane mode. The network is only required to install new models.
How accurate is the OCR?
On clean, well-lit printed text it's near-perfect — the same engine behind iOS Live Text. Handwriting and low-contrast scenes degrade gracefully, with per-line confidence scores surfaced in the reader strip so you know which lines to double-check.
Can I ask about images I already have?
Yes. Any image from the Photos picker, Files app, or a screenshot goes through the same capture-to-ask pipeline. The camera is the default entry point because it's the fastest loop, not the only one.
Which model should I start with?
On iPhone 15 Pro or newer, start with LLaVA-1.5-7B for general use. On older devices or when you want snappier replies, Qwen-VL 2B is the better fit. The model picker flags what your device can run and sorts by measured speed on your chip.
OCR and the capture loop are production-quality, but vision-language model coverage on Hugging Face is still uneven for iOS runtimes — new MLX and Core ML ports appear weekly. We're shipping Stratus now so the capture-to-ask flow gets real use; the model catalog will keep expanding through beta.