How I Built a Private, On-Device Speech-to-Text App with Whisper

Most transcription apps upload your audio to a server to process it. If you are recording anything sensitive (a doctor’s visit, an interview, a private voice note, a business call), that is a hard no. So I built Private Transcribe: speech-to-text powered by Whisper, running entirely on your device. Your audio never leaves your phone. “Your voice, your privacy.” Here is how it came together, and why the privacy promise shaped every technical decision.

The problem with “free” transcription

Online transcription is rarely free in the way that matters. You pay with your audio. It gets uploaded, processed on someone’s servers, and frequently retained, sometimes to improve their models. For casual voice memos, maybe you do not care. For anything confidential, you absolutely should, because once audio leaves your device you have lost control of it forever.

The promise I wanted to make, and keep, was simple: your audio never leaves your phone, and the app never uploads your audio or transcriptions to any server. You can only make that promise credibly if the app genuinely has no server in the loop. So it does not. The absence of a backend is not a limitation here; it is the entire product.

Whisper on a phone

The thing that makes this possible is Whisper, the open speech-recognition model that is remarkably good across languages. Private Transcribe handles 99+ languages through multilingual models, which is a feature you simply could not offer affordably if every minute of audio had to round-trip through your own servers. On-device, supporting another language adds no per-minute cost.

The engineering challenge, as with any on-device AI, is fitting a capable model onto a phone without melting it. Whisper comes in several sizes, and the size you choose is a direct trade-off among speed, accuracy, and storage. Rather than guessing for the user, I expose that choice as a quality tier:

Tiny (75 MB): fastest, great for quick notes where rough accuracy is fine.
Base (142 MB): a balanced default that suits most people.
Small (466 MB): noticeably higher quality for important recordings.
Medium (1.5 GB): professional-grade accuracy for when it really matters.

Someone on an older phone who wants speed picks Tiny. Someone transcribing a crucial interview on a newer device picks Medium and waits a little longer for a better result. Giving users the dial, with a sane default already selected, beats pretending one size fits everyone. It also respects that people know their own situation better than I do.
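
To make the "dial with a sane default" idea concrete, here is a minimal sketch in Python (chosen for brevity, not because the app is written in it). The sizes come from the tier list above. The names `ModelTier`, `selectable_tiers`, and `resolve_tier`, and the rule of falling back to the largest tier that fits in free storage, are assumptions for illustration, not the app's actual logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelTier:
    name: str
    size_mb: int
    note: str


# Sizes are the ones listed above; 1.5 GB is written as 1500 MB for simplicity.
TIERS = [
    ModelTier("tiny", 75, "fastest, rough accuracy"),
    ModelTier("base", 142, "balanced default"),
    ModelTier("small", 466, "higher quality"),
    ModelTier("medium", 1500, "best accuracy, slowest"),
]
DEFAULT_TIER = "base"


def selectable_tiers(free_storage_mb: int) -> list[ModelTier]:
    """Tiers whose download fits in the free storage reported by the device."""
    return [t for t in TIERS if t.size_mb <= free_storage_mb]


def resolve_tier(user_choice: Optional[str], free_storage_mb: int) -> Optional[ModelTier]:
    """Honor the user's pick when it fits; otherwise try the default,
    then the largest tier that fits. Returns None if nothing fits."""
    fitting = selectable_tiers(free_storage_mb)
    by_name = {t.name: t for t in fitting}
    for name in (user_choice, DEFAULT_TIER):
        if name in by_name:
            return by_name[name]
    return fitting[-1] if fitting else None
```

With plenty of space and no explicit choice, `resolve_tier(None, 10_000)` returns the Base tier. If the user asks for Medium with only 100 MB free, it falls back to Tiny rather than failing.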

The UX around the model

The model is the engine, but the app is everything around it, and that wrapper is where most of the product work actually lives:

Tap to record with real-time progress, so transcription feels responsive instead of like staring into a black box wondering if anything is happening.
Local storage of every transcription, with search, so you can actually find that note from three weeks ago instead of scrolling forever.
Sharing for the moments when you do want to send a transcript out, on your terms, by deliberate choice, never automatically.
A clean dark interface that is comfortable for reading long blocks of text.

A surprising amount of “AI app” work is exactly this: making the model’s output findable, usable, and trustworthy. The intelligence is table stakes. The product is the wrapper around it, and a brilliant model behind a frustrating interface is a brilliant model nobody uses.
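
As a rough illustration of "local storage with search", the sketch below uses Python's built-in `sqlite3` module. The table layout and the helper names (`open_store`, `save_transcript`, `search`) are hypothetical, and a plain `LIKE` match stands in for whatever search the app really uses. A full-text index would be the natural upgrade for large libraries.

```python
import sqlite3
import time


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the on-device transcript database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transcripts ("
        " id INTEGER PRIMARY KEY,"
        " created_at REAL NOT NULL,"
        " text TEXT NOT NULL)"
    )
    return conn


def save_transcript(conn: sqlite3.Connection, text: str) -> int:
    """Persist one transcript locally and return its row id."""
    cur = conn.execute(
        "INSERT INTO transcripts (created_at, text) VALUES (?, ?)",
        (time.time(), text),
    )
    conn.commit()
    return cur.lastrowid


def search(conn: sqlite3.Connection, query: str) -> list[tuple[int, str]]:
    """Case-insensitive (ASCII) substring search, newest first."""
    # Escape LIKE wildcards so the user's text is matched literally.
    escaped = query.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT id, text FROM transcripts"
        " WHERE text LIKE ? ESCAPE '\\'"
        " ORDER BY created_at DESC, id DESC",
        (f"%{escaped}%",),
    )
    return rows.fetchall()
```

Nothing here touches the network: the database is a single file on the device, which is what keeps search compatible with the privacy promise.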

Handling the unglamorous edges

Real recordings are messy. People pause, they ramble, they record for forty minutes. A toy demo transcribes ten seconds of clean speech. A real app has to handle long audio without running the phone out of memory, show sensible progress on a file that takes a while, and never lose a transcription because the app got backgrounded. None of that is glamorous, and all of it is the difference between a demo and something people rely on.
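
One common way to keep memory bounded on long recordings is to feed the model one fixed-size window at a time and report progress after each. The sketch below assumes that approach. Whisper works on 16 kHz audio in 30-second windows, but `read_samples`, `transcribe_chunk`, and `on_progress` are hypothetical callbacks standing in for the audio reader, the model call, and the UI or persistence layer.

```python
from typing import Callable, Iterator, Sequence

SAMPLE_RATE = 16_000  # Whisper models expect 16 kHz mono audio
CHUNK_SECONDS = 30    # Whisper's native window length


def iter_chunks(total_samples: int, chunk_samples: int) -> Iterator[tuple[int, int]]:
    """Yield (start, end) sample offsets that cover the recording once, in order."""
    for start in range(0, total_samples, chunk_samples):
        yield start, min(start + chunk_samples, total_samples)


def transcribe_long(
    read_samples: Callable[[int, int], Sequence[float]],
    total_samples: int,
    transcribe_chunk: Callable[[Sequence[float]], str],
    on_progress: Callable[[float, str], None],
) -> str:
    """Transcribe a long recording chunk by chunk.

    Only one chunk of audio is held at a time, and on_progress receives the
    fraction done plus the text so far, so the caller can update the UI and
    save a partial transcript before the app is backgrounded.
    """
    chunk_samples = SAMPLE_RATE * CHUNK_SECONDS
    parts: list[str] = []
    for start, end in iter_chunks(total_samples, chunk_samples):
        samples = read_samples(start, end)
        parts.append(transcribe_chunk(samples))
        on_progress(end / total_samples, " ".join(parts))
    return " ".join(parts)
```

Hard cuts every 30 seconds can split a word across two chunks. A production version would likely cut on silence or overlap adjacent windows, which this sketch leaves out.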

Where vibe coding fit

The AI assistant was a genuine force multiplier on the surrounding app: the recording screen, the progress UI, the local database with search, the model download and storage management, the share flow. All of that is well-trodden territory it could scaffold quickly, which freed my attention for the parts that mattered.

What stayed firmly on me was the careful work: running Whisper efficiently on-device, handling longer recordings without choking on memory, and, most importantly, making sure the privacy promise was actually true end to end, with no sneaky analytics call quietly shipping audio off the device. When your entire pitch is privacy, the privacy has to be real, and that is precisely the kind of thing you verify by reading every line yourself, not by trusting generated code. I dug into that division of labor in my vibe coding toolkit.
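
Reading every line is the real check, but an automated guard can back it up. The sketch below is one hypothetical way to do that in a Python test suite: it makes any attempt to open a socket fail loudly while the code under test runs. `no_network` and `NetworkAccessError` are names invented for this example. The guard only catches traffic that goes through Python's `socket` module, so on a phone the equivalent check would be running with networking disabled or inspecting the device's traffic.

```python
import socket
from contextlib import contextmanager


class NetworkAccessError(RuntimeError):
    """Raised when code inside no_network() tries to open a socket."""


@contextmanager
def no_network():
    """Make any attempt to create a socket raise for the duration of the block."""
    original = socket.socket

    def blocked(*args, **kwargs):
        raise NetworkAccessError("network access attempted during an offline-only test")

    socket.socket = blocked
    try:
        yield
    finally:
        # Always restore the real socket class, even if the block raised.
        socket.socket = original
```

Wrapping a full record-transcribe-save run in `with no_network():` turns a stray analytics call into a test failure instead of a silent upload.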

Why on-device was worth the extra work

Building this on-device was harder than wiring up a cloud API would have been. There was no shortcut, no one else’s server to quietly offload the difficult part onto. But that difficulty is exactly the moat. Anyone can wrap a transcription API in a weekend, and a thousand people already have, which is why those apps all blur together. Almost nobody ships something that genuinely keeps your audio on your phone, because it is more work and there is no usage to bill for at the end of it. The hard path turned out to be the defensible one, and the privacy promise that came with it is the precise reason a user would choose this app over the dozen cloud-based ones above it in the search results.

Lessons

Privacy is a feature you can build a whole product around, but only if you mean it architecturally. “On-device” is a promise you have to earn in the code, not a sticker you slap on the listing.
Expose the real trade-off. Speed versus accuracy is genuine and personal, so let users choose their model tier instead of deciding for them.
The wrapper is the product. Search, storage, sharing, and a calm interface are what turn a model into something people open every day.

Private Transcribe is part of a clear pattern in what I build: take an AI capability people assume requires the cloud, and prove it can run privately in your pocket. If that idea interests you, I did the same thing for chat in how I built an offline AI chat app, and I wrote about the strategy behind it in why I build small apps.
