
Whispero

Local voice dictation — Whisper inference server + OS-level audio and input orchestration

Private repository

App demo

Trigger recording via hotkey, transcribe locally, inject text into the active app.

Setup & settings

First-launch setup, permissions flow, model selection, and preferences.

Overview

Whispero is a macOS menu bar app for real-time voice dictation — entirely local, no cloud, no subscription. It runs a whisper.cpp inference server as a managed subprocess, captures audio at the OS level, and injects transcribed text directly into whatever app is in focus.

The interesting part isn't the UI — it's the system orchestration underneath. The app coordinates four independent concerns: a long-running ML inference process, a global keyboard event tap, a low-latency audio capture pipeline, and clipboard-safe text injection. These have to work together reliably without stepping on each other, across any app the user is in.

Architecture

WhisperServer — ML process management

Launches and monitors a whisper-server subprocess (whisper.cpp). Handles health checks via HTTP polling, port conflict resolution, automatic restarts on crash, and graceful termination on app quit. The app stays idle at near-zero CPU until a recording is triggered.
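
The health-check and port-conflict steps can be sketched as two small helpers. This is a minimal illustration rather than Whispero's actual code: the function names, the "try the next port" strategy, and the injected `isFree`, `probe`, and `pause` closures are all assumptions. A real app would back `probe` with an HTTP request to the server and `pause` with a short sleep.

```swift
// Hypothetical helper: pick the first port in a small range that is not in use.
// `isFree` is injected so the policy can be exercised without opening sockets.
func firstFreePort(startingAt base: Int, attempts: Int, isFree: (Int) -> Bool) -> Int? {
    for port in base..<(base + max(attempts, 0)) where isFree(port) {
        return port
    }
    return nil
}

// Hypothetical helper: poll a health probe up to `maxAttempts` times.
// Returns true as soon as the probe succeeds, false if every attempt fails.
// `pause` runs between attempts only, never after the final one.
func waitUntilHealthy(maxAttempts: Int, probe: () -> Bool, pause: () -> Void) -> Bool {
    guard maxAttempts > 0 else { return false }
    for attempt in 1...maxAttempts {
        if probe() { return true }
        if attempt < maxAttempts { pause() }
    }
    return false
}
```

Injecting the probes keeps the startup policy separate from process and network code, so it can be unit tested without launching a subprocess.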

KeyboardMonitor + ShortcutManager — Global event tap

Uses a CGEventTap to intercept keyboard events system-wide regardless of which app is active. Implements a state machine (idle → recordingHold / recordingToggle) with hold-to-talk and tap-to-toggle modes. Events are suppressed so the hotkey doesn't leak into other apps during recording.
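
The state machine described above might look roughly like the sketch below. The state names come from the description; the event, action, and mode types are hypothetical, and it assumes the mode is a user setting rather than something inferred from how long the key is held.

```swift
enum HotkeyMode { case holdToTalk, tapToToggle }
enum HotkeyState: Equatable { case idle, recordingHold, recordingToggle }
enum HotkeyEvent { case keyDown, keyUp }
enum HotkeyAction: Equatable { case ignore, startRecording, stopRecording }

struct HotkeyStateMachine {
    let mode: HotkeyMode
    private(set) var state: HotkeyState = .idle

    init(mode: HotkeyMode) { self.mode = mode }

    // Advance the machine for one hotkey event and report what the caller should do.
    mutating func handle(_ event: HotkeyEvent) -> HotkeyAction {
        switch (state, event, mode) {
        case (.idle, .keyDown, .holdToTalk):
            state = .recordingHold
            return .startRecording
        case (.idle, .keyDown, .tapToToggle):
            state = .recordingToggle
            return .startRecording
        case (.recordingHold, .keyUp, _):
            // Hold-to-talk: releasing the key ends the recording.
            state = .idle
            return .stopRecording
        case (.recordingToggle, .keyDown, _):
            // Tap-to-toggle: the second press ends the recording.
            state = .idle
            return .stopRecording
        default:
            // Key repeats and unmatched key-ups do not change state.
            return .ignore
        }
    }
}
```

Keeping the transitions in a value type like this leaves the event tap callback thin: it only translates raw key events into `HotkeyEvent` values and acts on the returned action.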

Recorder — Audio capture pipeline

Captures 16kHz mono PCM audio and exports to a temporary WAV file backed by RAM. Handles microphone disconnection, permission revocation at runtime, and audio session conflicts gracefully. Buffers are released immediately after the POST request to minimise memory pressure.
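
For illustration, wrapping raw PCM samples in a WAV container entirely in memory could be done as follows. This sketch assumes 16-bit samples (the description only specifies 16 kHz mono PCM), and `makeWAV` and `littleEndianBytes` are hypothetical names, not the app's API.

```swift
import Foundation

// Serialise a fixed-width integer as little-endian bytes, as the WAV format requires.
func littleEndianBytes<T: FixedWidthInteger>(_ value: T) -> [UInt8] {
    withUnsafeBytes(of: value.littleEndian) { Array($0) }
}

// Prepend a canonical 44-byte RIFF/WAVE header to raw PCM data.
func makeWAV(pcm: Data,
             sampleRate: UInt32 = 16_000,
             channels: UInt16 = 1,
             bitsPerSample: UInt16 = 16) -> Data {
    let blockAlign = channels * (bitsPerSample / 8)       // bytes per sample frame
    let byteRate = sampleRate * UInt32(blockAlign)        // bytes per second
    var wav = Data()
    wav.append(contentsOf: Array("RIFF".utf8))
    wav.append(contentsOf: littleEndianBytes(UInt32(36 + pcm.count)))  // size of the rest of the file
    wav.append(contentsOf: Array("WAVE".utf8))
    wav.append(contentsOf: Array("fmt ".utf8))
    wav.append(contentsOf: littleEndianBytes(UInt32(16)))  // fmt chunk size for PCM
    wav.append(contentsOf: littleEndianBytes(UInt16(1)))   // format tag 1 = uncompressed PCM
    wav.append(contentsOf: littleEndianBytes(channels))
    wav.append(contentsOf: littleEndianBytes(sampleRate))
    wav.append(contentsOf: littleEndianBytes(byteRate))
    wav.append(contentsOf: littleEndianBytes(blockAlign))
    wav.append(contentsOf: littleEndianBytes(bitsPerSample))
    wav.append(contentsOf: Array("data".utf8))
    wav.append(contentsOf: littleEndianBytes(UInt32(pcm.count)))
    wav.append(pcm)
    return wav
}
```

Building the container as a `Data` value means the recording never has to touch disk before it is sent to the server.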

TranscriptionService — Inference API

Sends the WAV file to the local whisper-server via POST /inference as multipart/form-data. Handles timeouts, 5xx errors, and network failures, attempting one automatic server restart before notifying the user.
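
As a rough sketch, the multipart/form-data body for such a request could be assembled like this. The helper name, the default filename, and the `file` field name are assumptions here (the field name follows the whisper.cpp server example as I understand it, so check it against the server build in use). The request's Content-Type header would also need to carry the same boundary, for example `multipart/form-data; boundary=<boundary>`.

```swift
import Foundation

// Hypothetical helper: wrap WAV bytes in a single-part multipart/form-data body.
func makeMultipartBody(wav: Data, boundary: String, filename: String = "audio.wav") -> Data {
    var body = Data()
    func append(_ text: String) { body.append(Data(text.utf8)) }

    append("--\(boundary)\r\n")
    // "file" is the assumed form field name expected by the server.
    append("Content-Disposition: form-data; name=\"file\"; filename=\"\(filename)\"\r\n")
    append("Content-Type: audio/wav\r\n\r\n")
    body.append(wav)                      // raw audio bytes, not text-encoded
    append("\r\n--\(boundary)--\r\n")     // closing boundary ends the body
    return body
}
```

The boundary should be a string that cannot occur in the payload; a freshly generated UUID is a common choice.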

Paster — Clipboard-safe text injection

Backs up the current clipboard, sets transcribed text, posts a Cmd+V keyboard event to the frontmost app, waits 250ms, then restores the original clipboard. The user's clipboard content is never permanently overwritten.
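
The backup-and-restore sequence can be sketched with a `defer` block so that the restore runs even when the paste step throws. Everything named here is hypothetical: the `Clipboard` protocol stands in for the system pasteboard, it models only string contents (a real clipboard can hold images and other types), and `settle` stands in for the 250ms wait.

```swift
// Hypothetical abstraction over the system clipboard, reduced to string contents.
protocol Clipboard: AnyObject {
    var contents: String? { get set }
}

// In-memory stand-in, useful for tests.
final class InMemoryClipboard: Clipboard {
    var contents: String?
    init(_ contents: String?) { self.contents = contents }
}

// Put `text` on the clipboard, trigger a paste, then restore the previous contents.
func pasteSafely(_ text: String,
                 using clipboard: Clipboard,
                 sendPaste: () throws -> Void,
                 settle: () -> Void) throws {
    let backup = clipboard.contents
    // Runs on every exit path, including when sendPaste throws.
    defer { clipboard.contents = backup }

    clipboard.contents = text
    try sendPaste()   // e.g. post a Cmd+V key event to the frontmost app
    settle()          // give the target app time to read the clipboard
}
```

Note that `defer` only covers errors thrown inside the process; surviving a crash mid-operation would need some additional mechanism that this sketch does not attempt.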

Stack

Layer           Technology
ML Inference    whisper.cpp (local server, runs entirely on-device)
App Runtime     macOS 13+, Apple Silicon optimised
Audio           AVFoundation: 16 kHz mono PCM capture
Input           CGEventTap: global keyboard interception
Inference API   HTTP: multipart/form-data POST to local server
Testing         XCTest: unit tests across all major components
Build           Swift Package Manager

What makes it non-trivial

  • Everything runs locally — no API keys, no network dependency, no data leaving the machine
  • The whisper-server subprocess must be managed across the full app lifecycle — startup, health checks, crashes, port conflicts, and clean shutdown
  • Global keyboard interception works across all apps including those with their own event handling (terminals, editors, browsers)
  • Clipboard safety is critical — the user's clipboard must be restored exactly, even if the paste fails or the app crashes mid-operation
  • The app targets near-zero CPU at idle and <100MB resident memory — the ML inference cost is entirely offloaded to the subprocess
  • Full test coverage: AppState, Hotkey, Recorder, TranscriptionService, Paster, ModelManager, WhisperServer — tested in isolation