Mentra Live Build Notes | Christopher Mullins

ROLEProduct Design · Engineering · Solo Build

STAGETestFlight · shipping to testers

STACKSwift · SwiftUI · Apple Vision · VideoToolbox · LC3 · RTMP · Instruments

YEAR2026

01 — The Outcome

The problem, stated as an outcome.

A blind or low-vision person holding an object, standing in a room, or walking down a hallway wants one thing: a fast, trustworthy, spoken answer to “what is this / what’s around me?” — hands-free, private, and without ceremony.

Everything in this project is downstream of that sentence. The Mentra Live glasses provide a head-mounted camera; an iPhone or Mac provides the compute and the voice. That reframing — from capability to outcome — is the single most important design decision in the project, and it cascades into every subsystem below.

The product is a standalone assistive companion, not a platform demo. It feels like a calm, trustworthy narrator that speaks only when it has something useful and honest to say. Crucially, it delivers its core value with zero Mentra cloud, zero app store, zero SDK, and zero account — the glasses are treated as a camera and a speaker, nothing more.

Mentra Live smart glasses — camera, microphone, and speaker, no display — Primary user context — Mentra Live worn · mentraglass.com/live

02 — Five Principles

Constraints that a change can violate.

These aren’t aspirations. They’re constraints that a change can violate, and violating them is treated as a bug. They are the project’s definition of good.

`AUDIO-FIRST`

The spoken answer comes above all else. No work may precede or delay first audio. UI publishing, history persistence, memory accumulation, place recognition, diagnostics — none of it runs on the path between “the answer is ready” and “audio starts playing.” An os_signpost “assist” interval emits audio-armed → first-audio → first-sentence → record-kicked → answer events so the ordering is checkable in Instruments, not just hoped for. For someone standing in a doorway waiting to know if the path is clear, latency is the product.

`ON-DEVICE FIRST`

Apple Vision is the default for OCR, scene, faces, and places. Cloud vision (OpenAI / Gemini / Anthropic / custom HTTP) is opt-in and used to enrich, never as a hard dependency for core reading and safety output. The app is fully useful with no keys and no network beyond the local Wi-Fi link to the glasses. It guarantees the floor.

`PRIVACY BY DEFAULT`

Face embeddings, place feature-prints, and area memory live in the app’s Application Support directory and never leave the device unless the user explicitly enables a cloud provider for a given task. Recognition is limited to people the user personally enrolls — this is deliberately not a surveillance or identity-lookup tool. Making privacy structural (data simply isn’t sent) rather than configurable (a toggle you could get wrong) is the engineering-for-outcomes move for a user base disproportionately harmed by data leakage.

`HONEST UNCERTAINTY`

The app never guarantees safety or a clear path, and never asserts a person’s identity with certainty. Conservative distance thresholds and hedged phrasing (“this looks like it might be…”) are intentional design, not placeholders. Empty or failed results return an actionable hint (“try improving lighting or moving the camera”) instead of silence or invention. A confident wrong answer to a blind user is worse than no answer — they can’t visually cross-check it. Wrong-name-at-high-confidence is the gating quality metric, not accuracy.

`EYES-FREE BY CONSTRUCTION`

Every outcome is designed to be heard, not seen. Spoken output uses body-relative directions (“on your left”, “ahead” — never “left of the photo”), short TTS-friendly sentences, and leads with safety. Visual UI exists for setup and sighted helpers; it’s a convenience, not the product.

03 — Architecture

Transport, then policy, then experience.

One shared SwiftUI codebase ships as an iPhone app (iOS 17+) and a Mac desktop app (macOS 14+), distributed to testers via TestFlight. Roughly 25,500 lines of Swift across 129 source files, with 52 test files (~5,900 lines) — a deliberately high test-to-source ratio concentrated on the pure, deterministic pieces where correctness is checkable without hardware.

The directory tree is organized by transport, then policy, then experience — and the policy layer is where the engineering value concentrates. Capture/ decides which source to use for which task. Obstacle/ handles pacing and fallback. Reading/ owns stability and dedupe. Vision/ResponseRules governs what gets said. The hard part of this product was never talking to the camera; it was deciding, moment to moment, what is worth saying and how fast it can be said honestly.

AssistOrchestrator.runAssist(task:) is the state machine every capture flows through: trigger → capturing → analyzing → speaking, with re-entrancy guarded by an explicit synchronous isRunning flag so a button-press + voice double-trigger can never start two captures. It’s the enforcement point for the audio-first contract: interim cue, pre-armed audio session, speak, and only then fire-and-forget history, memory, and diagnostics.

Sight Assist History — describe scene at a staircase; safety-first spoken response before layout detail — System architecture — transport → policy → experience

04 — Signature Problems

Choosing the design that serves the user over the one that’s technically tidy.

Each of these was a decision point where the technically clean answer and the user-serving answer diverged. Notes on how the second one won.

The hardware is slow and honest about it

Direct measurement of the Mentra Live camera server revealed constraints no cleverness removes: /api/latest-photo lags a capture by 15–19 seconds; the gallery listing shows it at ~3.7 seconds. Overlapping capture triggers queue and compound in firmware. There’s no non-persisting capture path — every photo writes a gallery file. Rather than paper over these facts with a fake-fast UI, the app tells the truth (“scanning about every 5 seconds”) and re-architects around the constraint. The measured facts are pinned in code comments so nobody optimizes against a limit that isn’t real.

Live streaming, and an honest fallback

A ~3.7-second still-capture floor makes walk mode miserable. The solution: a dependency-free, in-app RTMP ingest server (Network.framework + VideoToolbox, no third-party libraries) that asks the glasses to publish 960×540 H.264 over shared Wi-Fi. Apple Vision samples the newest frame ~2×/second. Hazards get spoken within a second or two instead of a 5-second cycle. The engineering maturity is in the honest fallback: streaming drains battery faster (stated up front), the firmware refuses to stream below 15% battery, and on any failure the app speaks one honest line and drops back to snapshot polling.

A software fast shutter for a slow, head-mounted camera

Captures are task-aware. Read-type tasks (label, date, price) demand the full-resolution still; describe-type tasks are served from a live stream frame where 960×540 is plenty. Because a head-mounted camera with slow auto-exposure produces motion blur, stream captures pick the sharpest of the last ~1 second of frames using variance-of-Laplacian scoring. A clearly blurred read-task photo triggers one coached retake (“hold still a moment — taking another photo”); if both stay blurry, the answer opens honestly (“the photo is a bit blurry — here’s what I can make out”). The blur threshold was calibrated against real head-camera frames, not guessed.

Progressive TTS shaves the last half-second

Buffered TTS waits for the whole audio file before playing (~7 seconds in the worst case). The app instead streams synthesized PCM chunks so audio starts on the first chunk — measured ~400–530 ms sooner for OpenAI tts-1 and ~110–460 ms sooner for Deepgram Aura, with buffered playback as the universal fallback. Cloud answers also stream sentence-by-sentence, so time-to-first-speech is ~1.3 seconds (OpenAI) or ~0.8 seconds (Anthropic) versus ~7 seconds fully buffered. Measurement was within-stream, first-chunk-vs-whole-clip, because cross-call comparison is network-noisy and flips run to run. Adopt only what you can prove helps.

Continuous OCR, made useful instead of maddening

Reading text “as it comes into view” is trivial to prototype and infuriating in practice: pan across a room and it spews fragments; hold a label in view and it re-reads forever. Two pure, heavily unit-tested policies make it livable. A stability gate requires a text block to appear in roughly the same place across several consecutive samples and be large and centered enough to be intentional before it’s spoken. A dedupe tracker speaks each block once, then fuzzily suppresses it for a ~30-second recency window; it re-arms when the text leaves and returns. When nothing readable is in view, the app stays silent — no “no text found” spam.

Area memory — face recognition, but for places

Captures are folded, fire-and-forget, into per-area memory: the app recognizes the current area from a whole-image feature-print (or creates a new one), then merges landmarks, hazards, anchor layout, labels, and text with frequency-and-recency weighting. Over repeated visits an area accumulates a stabilized, queryable summary the user can name. Growth is bounded by construction — 50 areas with LRU eviction, capped landmarks and hazards per area — so the on-disk store can’t grow without limit. The accumulation adds zero milliseconds to the speak path because it runs after speech begins.

05 — Honesty as Practice

Enforced in code and in docs, not just requested of the model.

For a product whose entire value is trustworthy spoken answers to someone who can’t verify them, honesty can’t be a prompt request. It has to be structural. Three places where it lives in the code, not the aspirations.

Spoken output is governed by a single source of truth, Vision/ResponseRules.swift: no platitudes, no meta-commentary, no filler openers, concrete over vague, honest uncertainty preserved, verbosity budgets. It’s injected into every cloud prompt and enforced by a deterministic output sanitizer before speech — belt and suspenders, because you can’t trust a model to always follow a prompt.

The spatial description builder enforces a strict safety ordering: immediate hazards prefixed “stop.” and spoken first, then anchor and layout, then near hazards, then landmarks, ending with uncertainty. It’s a pure function with dedicated tests asserting the ordering holds.

The docs hold themselves accountable. The backlog once openly recorded that two advertised headline features — “who’s this?” and area memory — were fully built but never wired to a call site, and flagged that the intent doc therefore overstated what worked. Both are now fixed and the drift resolved, but the practice is the point: “doc-says-it-works but code-doesn’t-run-it” is treated as a first-class bug class. Every metric in the intent doc is tagged [code] (enforced) or [target] (aspirational, not yet instrumented), so nobody mistakes an intention for a guarantee.

Sight Assist — Describe scene with captured shelf photo and spoken answer listing helmets, books, mugs, and plants — Describe scene — captured frame and honest spoken answer

06 — Development Method

Parallel agents, one integration branch.

The project is built with an unusual but deliberate workflow: features and fixes are dispatched to parallel background agents, each in its own git worktree on a sight-assist/<n> branch, branched from an integration branch. File ownership is partitioned per agent when dispatching (e.g. Obstacle/ plus CameraServerClient versus Speech/ versus the vision prompts versus KnownPeople) so agents don’t collide.

Verification is scripted and non-negotiable: ./scripts/test-mac.sh runs the full XCTest suite (~335 tests) plus a generic iOS-simulator build. The Xcode project itself is generated from project.yml via xcodegen, so merge conflicts in project.pbxproj and entitlements are resolved by taking either side and regenerating — a deliberate choice that removes an entire class of painful, meaningless merge conflicts.

A hard-won governance rule lives in this workflow: background agents must not commit to base or touch credentials without explicit per-action confirmation, after one agent overstepped by committing unrequested work and autonomously configuring signing. The parallelism is a productivity multiplier; the guardrail is what keeps it trustworthy. The method mirrors the product: parallel exploration for speed, a strict verification gate and honest integration for trust.

07 — Shipping Reality

What’s live, and where the honest gap is.

TestFlight build 9 is the current shipped iOS build. The cumulative feature stack: full voice and tap command palette, chat-style history, full-resolution Wi-Fi capture, walk mode (live stream with snapshot fallback), reading mode, Known People, area memory, status-LED signaling, motion-informed lock-screen walk mode, and corrected glasses-audio routing.

Release is one command: scripts/release-ios.sh generates secrets, bumps the build number, archives, exports, and uploads via the App Store Connect API in one shot.

The honest gap: on-device iPhone validation of the whole feature stack end-to-end remains the big unfinished verification. The Simulator can’t exercise BLE, Wi-Fi capture, or glasses audio, so the deepest confidence still requires the physical device — and the docs say so plainly rather than claiming coverage the tests can’t provide.

08 — Reflection

Engineering for outcomes, in six stances.

Read as a whole, Sight Assist is a case study in a specific engineering stance. The takeaways aren’t lessons learned; they’re the framing that made the project possible in the first place.

Start from the outcome, not the hardware

“Shortest honest path from trigger to useful speech” generated the architecture. The glasses’ capabilities only constrained it. Starting from the sentence, not the SDK, is the reframe that made every downstream decision easier.

Make the values structural

Privacy is enforced by data never being sent, not by a toggle. Audio-first is enforced by signpost ordering, not by good intentions. Honesty is enforced by a sanitizer, not by a prompt alone. A value in a system prompt is a hope; a value in the type system is a guarantee.

Measure before optimizing, and only adopt what you can prove

Hardware latencies were probed directly and pinned in code comments. Progressive-TTS gains were measured within-stream because cross-call numbers lie. The measurement discipline is itself an engineering-for-outcomes choice.

Concentrate testing where correctness is checkable

The pure policy layer — stability gate, dedupe, spatial ordering, recognition thresholds, memory caps — carries the test weight. The hardware seam is validated on-device and honestly marked as the remaining gap. Test the parts where a test can tell the truth.

Degrade honestly

Every failure mode — blur, stream rejection, low battery, unreachable camera, no match — has a designed, spoken, truthful response instead of a silent failure or a confident guess. Failure paths are first-class product surfaces, not afterthoughts.

Hold the docs to the same standard as the code

[code] versus [target]. An evidence-based backlog that admits built-but-unwired features. An intent doc that’s a contract to reconcile against, not marketing. For the person wearing the glasses, none of this discipline is visible. What’s visible is that the app speaks quickly, tells the truth, keeps their data on the device, and says “I’m not sure” instead of getting it confidently wrong. That is the outcome the entire engineering effort exists to produce.

Overview ← Project overview All work →

Sight Assist — A calm, honest voice for blind and low-vision users of Mentra Live glasses