Whispero | Projects

Overview

Whispero is a macOS menu bar app for real-time voice dictation — entirely local, no cloud, no subscription. It runs a whisper.cpp inference server as a managed subprocess, captures audio at the OS level, and injects transcribed text directly into whatever app is in focus.

The interesting part isn't the UI — it's the system orchestration underneath. The app coordinates four independent concerns: a long-running ML inference process, a global keyboard event tap, a low-latency audio capture pipeline, and clipboard-safe text injection. These have to work together reliably without stepping on each other, across any app the user is in.

Architecture

WhisperServer — ML process management

Launches and monitors a whisper-server subprocess (whisper.cpp). Handles health checks via HTTP polling, port conflict resolution, automatic restarts on crash, and graceful termination on app quit. The app stays idle at near-zero CPU until a recording is triggered.
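The health-check polling and port-conflict handling described above might look roughly like the sketch below. It is an illustration under assumptions, not Whispero's actual code: `waitUntilHealthy`, `firstFreePort`, and the injected `probe` and `isFree` closures are hypothetical names. A real probe would issue an HTTP request to the local server, and a real port check would try to bind a socket.

```swift
import Foundation

/// Polls `probe` until it reports healthy or `maxAttempts` is exhausted.
/// Returns the 1-based attempt number that succeeded, or nil if none did.
/// (Hypothetical helper; the probe is injected so the loop stays testable.)
func waitUntilHealthy(maxAttempts: Int,
                      interval: TimeInterval,
                      probe: () -> Bool) -> Int? {
    guard maxAttempts > 0 else { return nil }
    for attempt in 1...maxAttempts {
        if probe() { return attempt }
        // Only sleep when another attempt is still coming.
        if attempt < maxAttempts { Thread.sleep(forTimeInterval: interval) }
    }
    return nil
}

/// Picks the first port in `candidates` that `isFree` accepts,
/// which is one simple way to step around a port conflict.
func firstFreePort(in candidates: ClosedRange<Int>,
                   isFree: (Int) -> Bool) -> Int? {
    candidates.first(where: isFree)
}
```

A supervisor could call `firstFreePort` before launching the subprocess and `waitUntilHealthy` right after, treating a `nil` result from either as a reason to restart or report an error.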

KeyboardMonitor + ShortcutManager — Global event tap

Uses a CGEventTap to intercept keyboard events system-wide regardless of which app is active. Implements a state machine (idle → recordingHold / recordingToggle) with hold-to-talk and tap-to-toggle modes. Events are suppressed so the hotkey doesn't leak into other apps during recording.
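As a rough illustration of the state machine, the transitions can be written as a pure function. The state names come from the description above. `HotkeyMode`, `HotkeyEvent`, and `transition` are assumed names for this sketch, and the exact transition rules (for example, ignoring key repeats while held) are a guess at sensible behaviour rather than the app's confirmed logic.

```swift
/// Which gesture the user has configured (assumed names).
enum HotkeyMode { case holdToTalk, tapToToggle }

/// Hotkey events as they would arrive from the event tap (assumed names).
enum HotkeyEvent { case keyDown, keyUp }

/// States named in the description: idle -> recordingHold / recordingToggle.
enum RecorderState { case idle, recordingHold, recordingToggle }

/// Pure transition function: given the current state, a hotkey event,
/// and the configured mode, returns the next state.
func transition(from state: RecorderState,
                on event: HotkeyEvent,
                mode: HotkeyMode) -> RecorderState {
    switch (state, event, mode) {
    case (.idle, .keyDown, .holdToTalk):   return .recordingHold   // press starts
    case (.recordingHold, .keyUp, _):      return .idle            // release stops
    case (.idle, .keyDown, .tapToToggle):  return .recordingToggle // first tap starts
    case (.recordingToggle, .keyDown, _):  return .idle            // second tap stops
    default:                               return state            // repeats, stray key-ups
    }
}
```

Keeping the transitions free of side effects means the event tap callback only has to map a raw key event to a `HotkeyEvent`, apply `transition`, and swallow the event (in an active CGEventTap, by returning nil from the callback) so the hotkey never reaches the frontmost app.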

Recorder — Audio capture pipeline

Captures 16 kHz mono PCM audio and exports it to a RAM-backed temporary WAV file. Handles microphone disconnection, permission revocation at runtime, and audio session conflicts gracefully. Buffers are released immediately after the POST request to minimise memory pressure.
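For reference, the WAV export step amounts to prefixing the raw samples with a 44-byte RIFF header. The sketch below assumes 16-bit little-endian samples (the bit depth is not stated above) and uses a hypothetical `makeWAV` helper. The actual capture path through AVFoundation is not shown.

```swift
import Foundation

/// Wraps raw 16-bit little-endian PCM samples in a canonical 44-byte WAV header.
/// Defaults match the 16 kHz mono format described above; 16-bit depth is an assumption.
func makeWAV(pcm: Data, sampleRate: UInt32 = 16_000, channels: UInt16 = 1) -> Data {
    let bitsPerSample: UInt16 = 16
    let blockAlign = channels * bitsPerSample / 8          // bytes per sample frame
    let byteRate = sampleRate * UInt32(blockAlign)         // bytes per second

    var wav = Data()
    // Appends any fixed-width integer in little-endian byte order.
    func append<T: FixedWidthInteger>(_ value: T) {
        wav.append(contentsOf: withUnsafeBytes(of: value.littleEndian) { Array($0) })
    }

    wav.append(Data("RIFF".utf8))
    append(UInt32(36 + pcm.count))   // size of everything after this field
    wav.append(Data("WAVE".utf8))
    wav.append(Data("fmt ".utf8))
    append(UInt32(16))               // fmt chunk size for plain PCM
    append(UInt16(1))                // audio format 1 = uncompressed PCM
    append(channels)
    append(sampleRate)
    append(byteRate)
    append(blockAlign)
    append(bitsPerSample)
    wav.append(Data("data".utf8))
    append(UInt32(pcm.count))        // size of the sample data
    wav.append(pcm)
    return wav
}
```

One second of audio at these settings is 32,000 bytes of samples, so the resulting file is 32,044 bytes.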

TranscriptionService — Inference API

Sends the WAV file to the local whisper-server via POST /inference as multipart/form-data. Handles timeouts, 5xx errors, and network failures, attempting one automatic server restart before notifying the user.
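A minimal sketch of the request construction and the restart-once policy follows. The form field name `file`, the filename `audio.wav`, the 30-second timeout, and the helper names `multipartBody`, `inferenceRequest`, and `withOneRestart` are assumptions for illustration. The real service may use different values, and whether it retries the request after restarting is not stated above.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking   // URLRequest lives here on non-Apple platforms
#endif

/// Builds a single-part multipart/form-data body carrying the WAV bytes.
/// The field name "file" is an assumption about what the server expects.
func multipartBody(wav: Data, filename: String, boundary: String) -> Data {
    var body = Data()
    body.append(Data("--\(boundary)\r\n".utf8))
    body.append(Data("Content-Disposition: form-data; name=\"file\"; filename=\"\(filename)\"\r\n".utf8))
    body.append(Data("Content-Type: audio/wav\r\n\r\n".utf8))
    body.append(wav)
    body.append(Data("\r\n--\(boundary)--\r\n".utf8))
    return body
}

/// Builds the POST /inference request against a local base URL.
func inferenceRequest(wav: Data, baseURL: URL, timeout: TimeInterval = 30) -> URLRequest {
    let boundary = "Boundary-\(UUID().uuidString)"
    var request = URLRequest(url: baseURL.appendingPathComponent("inference"))
    request.httpMethod = "POST"
    request.timeoutInterval = timeout
    request.setValue("multipart/form-data; boundary=\(boundary)",
                     forHTTPHeaderField: "Content-Type")
    request.httpBody = multipartBody(wav: wav, filename: "audio.wav", boundary: boundary)
    return request
}

/// Runs `attempt`; if it throws, calls `restart` once and tries a second time.
/// A second failure propagates to the caller, which can then notify the user.
func withOneRestart<T>(_ attempt: () throws -> T, restart: () -> Void) throws -> T {
    do {
        return try attempt()
    } catch {
        restart()
        return try attempt()
    }
}
```

Wrapping the send in `withOneRestart` keeps the recovery policy in one place: the caller sees either a transcription or a single error after the server has already been given one chance to come back.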

Paster — Clipboard-safe text injection

Backs up the current clipboard, sets the transcribed text, posts a Cmd+V keyboard event to the frontmost app, waits 250 ms, then restores the original clipboard. The user's clipboard content is never permanently overwritten.
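The backup, paste, and restore sequence maps naturally onto Swift's `defer`. The sketch below abstracts the clipboard behind a hypothetical `Clipboard` protocol with a single string value so it stays self-contained. The real pasteboard holds multiple typed items, and the Cmd+V event posting is replaced here by an injected `sendPaste` closure.

```swift
import Foundation

/// Minimal clipboard abstraction (hypothetical; a stand-in for the system pasteboard).
protocol Clipboard: AnyObject {
    var contents: String? { get set }
}

/// In-memory implementation, useful for tests.
final class InMemoryClipboard: Clipboard {
    var contents: String?
    init(_ contents: String? = nil) { self.contents = contents }
}

/// Temporarily places `text` on the clipboard, triggers a paste, waits for the
/// target app to read it, then restores whatever was there before.
func pasteSafely(_ text: String,
                 using clipboard: Clipboard,
                 settle: TimeInterval = 0.25,
                 sendPaste: () throws -> Void) rethrows {
    let backup = clipboard.contents
    // Restores the original contents on every exit path, including a thrown error.
    defer { clipboard.contents = backup }

    clipboard.contents = text
    try sendPaste()
    // Give the frontmost app time to consume the paste before restoring.
    Thread.sleep(forTimeInterval: settle)
}
```

The `defer` covers the case where the paste itself fails. It does not run if the process is killed mid-operation, so surviving a crash would need an additional mechanism that this sketch does not attempt.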

Stack

Layer	Technology
ML Inference	whisper.cpp (local server, runs entirely on-device)
App Runtime	macOS 13+, Apple Silicon optimised
Audio	AVFoundation — 16kHz mono PCM capture
Input	CGEventTap — global keyboard interception
Inference API	HTTP — multipart/form-data POST to local server
Testing	XCTest — unit tests across all major components
Build	Swift Package Manager

What makes it non-trivial

- Everything runs locally: no API keys, no network dependency, no data leaving the machine
- The whisper-server subprocess must be managed across the full app lifecycle: startup, health checks, crashes, port conflicts, and clean shutdown
- Global keyboard interception works across all apps, including those with their own event handling (terminals, editors, browsers)
- Clipboard safety is critical: the user's clipboard must be restored exactly, even if the paste fails or the app crashes mid-operation
- The app targets near-zero CPU at idle and under 100 MB of resident memory; the ML inference cost is entirely offloaded to the subprocess
- Unit tests cover AppState, Hotkey, Recorder, TranscriptionService, Paster, ModelManager, and WhisperServer, each tested in isolation