Whispero
Local voice dictation — Whisper inference server + OS-level audio and input orchestration
Private repository
App demo
Trigger recording via hotkey, transcribe locally, inject text into the active app.
Setup & settings
First-launch setup, permissions flow, model selection, and preferences.
Overview
Whispero is a macOS menu bar app for real-time voice dictation — entirely local, no cloud, no subscription. It runs a whisper.cpp inference server as a managed subprocess, captures audio at the OS level, and injects transcribed text directly into whatever app is in focus.
The interesting part isn't the UI — it's the system orchestration underneath. The app coordinates four independent concerns: a long-running ML inference process, a global keyboard event tap, a low-latency audio capture pipeline, and clipboard-safe text injection. These have to work together reliably without stepping on each other, across any app the user is in.
Architecture
WhisperServer — ML process management
Launches and monitors a whisper-server subprocess (whisper.cpp). Handles health checks via HTTP polling, port conflict resolution, automatic restarts on crash, and graceful termination on app quit. The app stays idle at near-zero CPU until a recording is triggered.
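The startup health check and port conflict handling can be sketched as below. This is a minimal illustration, not Whispero's actual code: `waitUntilHealthy`, `firstFreePort`, and `ServerStartResult` are hypothetical names, and the probe and port check are injected as closures so the logic runs without a real server. In the app, the probe would presumably wrap an HTTP request to the whisper-server port.

```swift
import Foundation

/// Outcome of waiting for the subprocess to become reachable (hypothetical type).
enum ServerStartResult: Equatable {
    case ready(afterAttempts: Int)
    case timedOut
}

/// Polls `probe` until it reports healthy or `maxAttempts` is used up.
/// `probe` is an assumption standing in for an HTTP health request.
func waitUntilHealthy(maxAttempts: Int,
                      interval: TimeInterval,
                      probe: () -> Bool) -> ServerStartResult {
    guard maxAttempts > 0 else { return .timedOut }
    for attempt in 1...maxAttempts {
        if probe() { return .ready(afterAttempts: attempt) }
        // Sleep between attempts, but not after the final failed one.
        if attempt < maxAttempts { Thread.sleep(forTimeInterval: interval) }
    }
    return .timedOut
}

/// Returns the first candidate port accepted by `isFree`, or nil if all are taken.
func firstFreePort(in candidates: ClosedRange<UInt16>,
                   isFree: (UInt16) -> Bool) -> UInt16? {
    candidates.first(where: isFree)
}
```

Injecting the probe keeps the polling loop unit-testable in isolation, which fits the document's emphasis on testing each component separately.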
KeyboardMonitor + ShortcutManager — Global event tap
Uses a CGEventTap to intercept keyboard events system-wide regardless of which app is active. Implements a state machine (idle → recordingHold / recordingToggle) with hold-to-talk and tap-to-toggle modes. Events are suppressed so the hotkey doesn't leak into other apps during recording.
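The state machine above can be sketched as a pure transition function. The state names `idle`, `recordingHold`, and `recordingToggle` come from the description; the `HotkeyEvent` cases, the `Transition` type, and the rule for key auto-repeat are assumptions made for illustration.

```swift
/// States named in the description above.
enum DictationState: Equatable {
    case idle
    case recordingHold
    case recordingToggle
}

/// Hypothetical, already-classified hotkey events (not raw key codes).
enum HotkeyEvent {
    case holdKeyDown
    case holdKeyUp
    case toggleKeyTap
    case other
}

/// Next state plus whether the event should be swallowed by the tap.
struct Transition: Equatable {
    let next: DictationState
    let suppressEvent: Bool
}

func transition(from state: DictationState, on event: HotkeyEvent) -> Transition {
    switch (state, event) {
    case (.idle, .holdKeyDown):
        return Transition(next: .recordingHold, suppressEvent: true)
    case (.recordingHold, .holdKeyDown):
        // Assumed handling of key auto-repeat: stay put, still swallow the event.
        return Transition(next: .recordingHold, suppressEvent: true)
    case (.recordingHold, .holdKeyUp):
        return Transition(next: .idle, suppressEvent: true)
    case (.idle, .toggleKeyTap):
        return Transition(next: .recordingToggle, suppressEvent: true)
    case (.recordingToggle, .toggleKeyTap):
        return Transition(next: .idle, suppressEvent: true)
    default:
        // Not a hotkey transition: keep the state and let the event through.
        return Transition(next: state, suppressEvent: false)
    }
}
```

In a CGEventTap callback, `suppressEvent` would map to returning `nil` instead of the event, which is how an active tap drops an event before it reaches the focused app.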
Recorder — Audio capture pipeline
Captures 16kHz mono PCM audio and exports to a temporary WAV file backed by RAM. Handles microphone disconnection, permission revocation at runtime, and audio session conflicts gracefully. Buffers are released immediately after the POST request to minimise memory pressure.
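As a rough illustration of the export step, the sketch below wraps raw samples in a canonical 44-byte RIFF/WAVE header entirely in memory. The 16 kHz mono settings come from the description; the 16-bit sample depth and the names `makeWAV` and `littleEndianBytes` are assumptions, and the real Recorder may well rely on AVFoundation's own file writing instead.

```swift
import Foundation

/// Little-endian byte representation of a fixed-width integer.
func littleEndianBytes<T: FixedWidthInteger>(_ value: T) -> [UInt8] {
    withUnsafeBytes(of: value.littleEndian) { Array($0) }
}

/// Builds a PCM WAV file in memory (assumes 16-bit samples).
func makeWAV(samples: [Int16],
             sampleRate: UInt32 = 16_000,
             channels: UInt16 = 1) -> Data {
    let bitsPerSample: UInt16 = 16
    let blockAlign = channels * bitsPerSample / 8          // bytes per frame
    let byteRate = sampleRate * UInt32(blockAlign)         // bytes per second
    let dataSize = UInt32(samples.count * MemoryLayout<Int16>.size)

    var wav = Data()
    wav.append(contentsOf: Array("RIFF".utf8))
    wav.append(contentsOf: littleEndianBytes(36 + dataSize)) // file size minus 8
    wav.append(contentsOf: Array("WAVE".utf8))
    wav.append(contentsOf: Array("fmt ".utf8))
    wav.append(contentsOf: littleEndianBytes(UInt32(16)))    // fmt chunk length
    wav.append(contentsOf: littleEndianBytes(UInt16(1)))     // format tag 1 = PCM
    wav.append(contentsOf: littleEndianBytes(channels))
    wav.append(contentsOf: littleEndianBytes(sampleRate))
    wav.append(contentsOf: littleEndianBytes(byteRate))
    wav.append(contentsOf: littleEndianBytes(blockAlign))
    wav.append(contentsOf: littleEndianBytes(bitsPerSample))
    wav.append(contentsOf: Array("data".utf8))
    wav.append(contentsOf: littleEndianBytes(dataSize))
    for sample in samples {
        wav.append(contentsOf: littleEndianBytes(sample))
    }
    return wav
}
```

At 16 kHz mono with 16-bit samples, each second of speech is 32,000 bytes, so a ten-second dictation is roughly 320 KB before the header.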
TranscriptionService — Inference API
Sends the WAV file to the local whisper-server via POST /inference as multipart/form-data. Handles timeouts, 5xx errors, and network failures — with one automatic server restart before notifying the user.
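The request body and the single-restart retry might look roughly like this. Treat it as a sketch under assumptions: the form field name `file`, the filename `audio.wav`, and the names `makeMultipartBody` and `transcribeWithOneRestart` are illustrative, and the network call and the restart are injected as closures rather than implemented.

```swift
import Foundation

/// A multipart/form-data payload and the matching Content-Type header value.
struct MultipartBody {
    let contentType: String
    let data: Data
}

/// Wraps WAV bytes in a single-part multipart/form-data body.
/// The field name "file" is an assumption about what the server expects.
func makeMultipartBody(wav: Data,
                       boundary: String = "Boundary-\(UUID().uuidString)") -> MultipartBody {
    var body = Data()
    body.append(Data("--\(boundary)\r\n".utf8))
    body.append(Data("Content-Disposition: form-data; name=\"file\"; filename=\"audio.wav\"\r\n".utf8))
    body.append(Data("Content-Type: audio/wav\r\n\r\n".utf8))
    body.append(wav)
    body.append(Data("\r\n--\(boundary)--\r\n".utf8))
    return MultipartBody(contentType: "multipart/form-data; boundary=\(boundary)",
                         data: body)
}

/// Runs `attempt`; on any failure restarts the server once and tries again.
/// A second failure propagates so the caller can notify the user.
func transcribeWithOneRestart<T>(attempt: () throws -> T,
                                 restartServer: () throws -> Void) throws -> T {
    do {
        return try attempt()
    } catch {
        try restartServer()
        return try attempt()
    }
}
```

Capping the retry at one restart keeps the worst-case delay bounded: the user waits for at most two request timeouts plus one server startup before seeing an error.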
Paster — Clipboard-safe text injection
Backs up the current clipboard, sets transcribed text, posts a Cmd+V keyboard event to the frontmost app, waits 250ms, then restores the original clipboard. The user's clipboard content is never permanently overwritten.
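The backup, paste, and restore sequence can be sketched with the clipboard hidden behind a small protocol. The 250 ms wait comes from the description; the `Clipboard` protocol, `InMemoryClipboard`, and `pasteTranscript` are hypothetical names, and the string-only clipboard is a simplification, since a real pasteboard can hold several item types (images, rich text) that would all need to be saved and restored.

```swift
import Foundation

/// Minimal clipboard abstraction (hypothetical; the app would back this with the system pasteboard).
protocol Clipboard: AnyObject {
    var contents: String? { get set }
}

/// In-memory stand-in so the flow can run without touching the real clipboard.
final class InMemoryClipboard: Clipboard {
    var contents: String?
    init(_ contents: String?) { self.contents = contents }
}

/// Temporarily places `text` on the clipboard, triggers a paste, then restores the backup.
/// `sendPasteKeystroke` is an assumption standing in for posting Cmd+V to the frontmost app.
func pasteTranscript(_ text: String,
                     clipboard: Clipboard,
                     settleDelay: TimeInterval = 0.25,
                     sendPasteKeystroke: () throws -> Void) rethrows {
    let backup = clipboard.contents
    // `defer` restores the original contents even if the keystroke throws.
    defer { clipboard.contents = backup }

    clipboard.contents = text
    try sendPasteKeystroke()
    // Give the target app time to read the clipboard before it is restored.
    Thread.sleep(forTimeInterval: settleDelay)
}
```

The `defer` covers a failed paste, but not a crash of the app itself between the overwrite and the restore; surviving that case would need the backup to be persisted outside the process, which this sketch does not attempt.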
Stack
| Layer | Technology |
|---|---|
| ML Inference | whisper.cpp (local server, runs entirely on-device) |
| App Runtime | macOS 13+, Apple Silicon optimised |
| Audio | AVFoundation — 16kHz mono PCM capture |
| Input | CGEventTap — global keyboard interception |
| Inference API | HTTP — multipart/form-data POST to local server |
| Testing | XCTest — unit tests across all major components |
| Build | Swift Package Manager |
What makes it non-trivial
- Everything runs locally: no API keys, no network dependency, no data leaving the machine
- The whisper-server subprocess must be managed across the full app lifecycle: startup, health checks, crashes, port conflicts, and clean shutdown
- Global keyboard interception works across all apps, including those with their own event handling (terminals, editors, browsers)
- Clipboard safety is critical: the user's clipboard must be restored exactly, even if the paste fails or the app crashes mid-operation
- The app targets near-zero CPU at idle and <100MB resident memory, with the ML inference cost entirely offloaded to the subprocess
- Unit tests cover every major component in isolation: AppState, Hotkey, Recorder, TranscriptionService, Paster, ModelManager, WhisperServer