Build Stories on Chinese Man

How I Built a Private, On-Device Speech-to-Text App with Whisper

Sun, 21 Jun 2026 00:00:00 +0000

Most transcription apps upload your audio to a server to process it. If you are recording anything sensitive, such as a doctor’s visit, an interview, a private voice note, or a business call, that is a hard no. So I built Private Transcribe: speech-to-text powered by Whisper, running entirely on your device. Your audio never leaves your phone. “Your voice, your privacy.” Here is how it came together, and why the privacy promise shaped every technical decision.

The problem with “free” transcription

Online transcription is rarely free in the way that matters. You pay with your audio. It gets uploaded, processed on someone’s servers, and frequently retained, sometimes to improve their models. For casual voice memos, maybe you do not care. For anything confidential, you absolutely should, because once audio leaves your device you have lost control of it forever.

The promise I wanted to make, and keep, was simple: your audio never leaves your phone, and the app never uploads your audio or transcriptions to any server. You can only make that promise credibly if the app genuinely has no server in the loop. So it does not. The absence of a backend is not a limitation here; it is the entire product.

Whisper on a phone

The thing that makes this possible is Whisper, the open speech-recognition model that is remarkably good across languages. Private Transcribe handles 99+ languages through multilingual models, which is a feature you simply could not offer affordably if every minute of audio had to round-trip through your own servers. On-device, languages are free.

The engineering challenge, as with any on-device AI, is fitting a capable model onto a phone without melting it. Whisper comes in several sizes, and the size you choose is a direct trade-off among speed, accuracy, and storage. Rather than guessing for the user, I expose that choice as a quality tier:

Tiny (75 MB): fastest, great for quick notes where rough accuracy is fine.
Base (142 MB): a balanced default that suits most people.
Small (466 MB): noticeably higher quality for important recordings.
Medium (1.5 GB): professional-grade accuracy for when it really matters.

Someone on an older phone who wants speed picks Tiny. Someone transcribing a crucial interview on a newer device picks Medium and waits a little longer for a better result. Giving users the dial, with a sane default already selected, beats pretending one size fits everyone. It also respects that people know their own situation better than I do.
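To make the tier idea concrete, here is a minimal Python sketch of tier selection, for illustration only and not the code that ships in the app. The sizes come from the list above; the `ModelTier` type, the `pick_tier` helper, and the 2x free-space headroom rule are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelTier:
    name: str
    size_mb: int  # download size from the tier list above


# Ordered smallest to largest. 1.5 GB is written as 1500 MB for simplicity.
TIERS = [
    ModelTier("tiny", 75),
    ModelTier("base", 142),
    ModelTier("small", 466),
    ModelTier("medium", 1500),
]
DEFAULT_TIER = "base"


def tiers_that_fit(free_mb: int, headroom: float = 2.0) -> List[ModelTier]:
    """Tiers whose download fits with spare room (the 2x rule is an assumption)."""
    return [t for t in TIERS if t.size_mb * headroom <= free_mb]


def pick_tier(free_mb: int, user_choice: Optional[str] = None) -> Optional[ModelTier]:
    """Prefer the user's choice, then the default, then the largest tier that fits."""
    options = tiers_that_fit(free_mb)
    by_name = {t.name: t for t in options}
    if user_choice in by_name:
        return by_name[user_choice]
    if DEFAULT_TIER in by_name:
        return by_name[DEFAULT_TIER]
    return options[-1] if options else None
```

With plenty of free space and no explicit choice, `pick_tier` returns the Base tier. On a nearly full phone it falls back to Tiny, or to `None` so the UI can ask the user to free some space first.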

The UX around the model

The model is the engine, but the app is everything around it, and that wrapper is where most of the product work actually lives:

Tap to record with real-time progress, so transcription feels responsive instead of like staring into a black box wondering if anything is happening.
Local storage of every transcription, with search, so you can actually find that note from three weeks ago instead of scrolling forever.
Sharing for the moments when you do want to send a transcript out, on your terms, by deliberate choice, never automatically.
A clean dark interface that is comfortable for reading long blocks of text.

A surprising amount of “AI app” work is exactly this: making the model’s output findable, usable, and trustworthy. The intelligence is table stakes. The product is the wrapper around it, and a brilliant model behind a frustrating interface is a brilliant model nobody uses.
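As a rough illustration of the "local storage with search" piece, here is a minimal Python sketch using SQLite. It is not the app's actual schema: the `transcripts` table, its columns, and the `open_store`, `save`, and `search` helpers are hypothetical names chosen for the example, and a plain `LIKE` substring match stands in for whatever search the real app uses.

```python
import sqlite3
from typing import List, Tuple


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the local transcript database. Nothing leaves the device."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS transcripts (
               id INTEGER PRIMARY KEY,
               created_at TEXT NOT NULL DEFAULT (datetime('now')),
               body TEXT NOT NULL
           )"""
    )
    return db


def save(db: sqlite3.Connection, body: str) -> int:
    """Store one transcription and return its row id."""
    cur = db.execute("INSERT INTO transcripts (body) VALUES (?)", (body,))
    db.commit()
    return cur.lastrowid


def search(db: sqlite3.Connection, query: str) -> List[Tuple[int, str]]:
    """Substring search, newest first."""
    # Escape LIKE wildcards so a query such as "100%" is matched literally.
    escaped = query.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = db.execute(
        "SELECT id, body FROM transcripts "
        "WHERE body LIKE ? ESCAPE '\\' ORDER BY id DESC",
        (f"%{escaped}%",),
    )
    return rows.fetchall()
```

SQLite's `LIKE` is case-insensitive for ASCII text by default, which is usually what you want when hunting for a note from three weeks ago.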

Handling the unglamorous edges

Real recordings are messy. People pause, they ramble, they record for forty minutes. A toy demo transcribes ten seconds of clean speech. A real app has to handle long audio without running the phone out of memory, show sensible progress on a file that takes a while, and never lose a transcription because the app got backgrounded. None of that is glamorous, and all of it is the difference between a demo and something people rely on.
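One way to handle those edges is to process a long recording in fixed windows and checkpoint after each one. Whisper models take 16 kHz audio in 30-second windows, so the Python sketch below walks a recording window by window. It is illustrative only: `transcribe_chunk`, `save_partial`, and `on_progress` are hypothetical callbacks standing in for the model call, the local database write, and the progress UI.

```python
from typing import Callable, Iterator, List, Tuple

SAMPLE_RATE = 16_000   # Whisper models expect 16 kHz mono audio
WINDOW_SECONDS = 30    # Whisper's native input window
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS


def chunk_bounds(total_samples: int, window: int = WINDOW_SAMPLES) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) sample offsets so only one window is in memory at a time."""
    for start in range(0, total_samples, window):
        yield start, min(start + window, total_samples)


def transcribe_long(
    total_samples: int,
    transcribe_chunk: Callable[[int, int], str],  # hypothetical: runs the model on one window
    save_partial: Callable[[str], None],          # hypothetical: persists text so far
    on_progress: Callable[[float], None],         # hypothetical: updates the progress bar
) -> str:
    parts: List[str] = []
    for start, end in chunk_bounds(total_samples):
        parts.append(transcribe_chunk(start, end))
        # Checkpoint after every window: if the app is backgrounded or killed,
        # everything transcribed so far is already on disk.
        save_partial(" ".join(parts))
        on_progress(end / total_samples)
    return " ".join(parts)
```

A forty-minute recording becomes 80 windows here, and memory use stays flat regardless of length. Cutting at hard 30-second boundaries can split a word in half, so a production pipeline would typically overlap windows or cut on silence; that refinement is left out of the sketch.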

Where vibe coding fit

The AI assistant was a genuine force multiplier on the surrounding app: the recording screen, the progress UI, the local database with search, the model download and storage management, the share flow. All of that is well-trodden territory it could scaffold quickly, which freed my attention for the parts that mattered.

What stayed firmly on me was the careful work: running Whisper efficiently on-device, handling longer recordings without choking on memory, and, most importantly, making sure the privacy promise was actually true end to end, with no sneaky analytics call quietly shipping audio off the device. When your entire pitch is privacy, the privacy has to be real, and that is precisely the kind of thing you verify by reading every line yourself, not by trusting generated code. I dug into that division of labor in my vibe coding toolkit.

Why on-device was worth the extra work

Building this on-device was harder than wiring up a cloud API would have been. There was no shortcut, no someone-else’s-server to quietly offload the difficult part onto. But that difficulty is exactly the moat. Anyone can wrap a transcription API in a weekend, and a thousand people already have, which is why those apps all blur together. Almost nobody ships something that genuinely keeps your audio on your phone, because it is more work and there is no usage to bill for at the end of it. The hard path turned out to be the defensible one, and the privacy promise that came with it is the precise reason a user would choose this app over the dozen cloud-based ones above it in the search results.

Lessons

Privacy is a feature you can build a whole product around, but only if you mean it architecturally. “On-device” is a promise you have to earn in the code, not a sticker you slap on the listing.
Expose the real trade-off. Speed versus accuracy is genuine and personal, so let users choose their model tier instead of deciding for them.
The wrapper is the product. Search, storage, sharing, and a calm interface are what turn a model into something people open every day.

Private Transcribe is part of a clear pattern in what I build: take an AI capability people assume requires the cloud, and prove it can run privately in your pocket. If that idea interests you, I did the same thing for chat in how I built an offline AI chat app, and I wrote about the strategy behind it in why I build small apps.

Building Capybara Crossing: What Shipping a Casual Arcade Game Taught Me

Thu, 18 Jun 2026 00:00:00 +0000

After a run of utility apps, I wanted to build something that was pure fun, something my niece could pick up in two seconds without a tutorial. The “hop across endless lanes of traffic” formula is a classic for a reason, so I put a capybara in it and made Capybara Crossing: hop across roads, train tracks, and rivers, collect coins, and try not to die. “Hop to survive.” Here is what building it taught me, because a casual game is a very different discipline from a tool.

Why a casual game

Utility apps solve a problem. Casual games create a feeling. They are completely different design challenges, and I wanted the reps in the second one. The bar is also deceptively high. The gameplay is simple, but simple has nowhere to hide. If the hop does not feel good, there is no feature list to distract from it, no settings screen to get lost in. The core loop is the entire product, naked and exposed.

The pitch is tiny, and that is the point: you are a capybara, the world is an endless gauntlet of cars, trucks, buses, trains, and rivers, and you hop forward as far as you can. Cute character, instant understanding, one-thumb controls. A five-year-old and a fifty-year-old both get it in the first second.

Game feel is the entire product

In a utility app, “it works” is enough. In a game, “it works” is the floor, and feel is the ceiling. I spent a wildly disproportionate amount of time on things a spec sheet would never list:

The hop. Its timing, the little arc, the snap onto the next tile, the tiny squash when you land. This single animation is most of whether the game feels good or cheap.
Readability. You have to instantly see where it is safe to land. Lane spacing, hazard timing, and color all serve clarity, because a death that feels unfair makes people quit, while a death that feels like their own fault makes them try again.
The difficulty ramp. Easy enough to start, relentless enough that “one more try” becomes irresistible. The curve has to feel fair the whole way up.

You cannot spec your way to good feel. You build it, play it a hundred times, change one number, and play it a hundred more. That tuning loop is where most of the real work lives, and there is no shortcut through it.

The eagle: solving the “standing still” problem

Endless hoppers have a classic exploit. The player just stops moving and stays safe forever, which kills all the tension. My fix is a threat baked into the design: stay still too long and an eagle swoops down and takes you. Keep moving to survive.

It is a tiny mechanic with an outsized effect. It removes the safe option, keeps tension constant, and quietly converts the game from “avoid hazards when you feel like moving” into “always be moving, manage the danger as you go.” The best game design is often one small rule that invisibly forces the behavior you want, instead of a tutorial telling players how they should act. The eagle never explains itself. It just teaches you, once, and you never stand still again.
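The rule itself is tiny in code, too. Here is a minimal Python sketch of an idle timer that triggers the eagle, written for illustration rather than taken from the game. The `IdleEagle` class and the three-second limit are assumptions; the real delay is exactly the kind of number you only find by playtesting.

```python
from dataclasses import dataclass


@dataclass
class IdleEagle:
    limit_seconds: float = 3.0  # hypothetical value, meant to be tuned by playtesting
    idle: float = 0.0           # seconds since the player last hopped

    def on_hop(self) -> None:
        """Any hop resets the timer."""
        self.idle = 0.0

    def update(self, dt: float) -> bool:
        """Advance by one frame of dt seconds; return True when the eagle should swoop."""
        self.idle += dt
        return self.idle >= self.limit_seconds
```

The game loop calls `update(dt)` every frame and `on_hop()` on every input, so there is no safe tile anywhere in the world, only a clock that restarts each time you move.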

Coins, characters, and the reason to come back

A high score is a reason to play once. Progression is a reason to come back. The loop is simple:

Collect coins as you hop, which gives every run a second purpose beyond distance.
Spend them to unlock characters, six of them, including a Pirate, a Ninja, and a Space Capy.

Cosmetic unlocks are perfect for a casual game. They give players goals without adding rules to learn, and “I just want the Space Capy” turns out to be a surprisingly strong retention hook. It also keeps the game fair, because everything is earnable by playing rather than paying, which builds goodwill instead of resentment.
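As a sketch of how small that loop can be, here is an illustrative Python version of a coin wallet with cosmetic unlocks and a local save. Only the three characters named above appear, and their prices are invented for the example. The `Progress` class and its JSON save format are assumptions, not the game's real save system.

```python
import json
from dataclasses import dataclass, field
from typing import Set

# Hypothetical prices; the real game's costs are not stated in this post.
PRICES = {"Pirate": 100, "Ninja": 250, "Space Capy": 500}


@dataclass
class Progress:
    coins: int = 0
    unlocked: Set[str] = field(default_factory=set)

    def unlock(self, name: str) -> bool:
        """Spend coins on a character. Returns False if unknown, owned, or too expensive."""
        price = PRICES.get(name)
        if price is None or name in self.unlocked or self.coins < price:
            return False
        self.coins -= price
        self.unlocked.add(name)
        return True

    def to_json(self) -> str:
        """Serialize for local saving; no server is involved."""
        return json.dumps({"coins": self.coins, "unlocked": sorted(self.unlocked)})

    @classmethod
    def from_json(cls, text: str) -> "Progress":
        data = json.loads(text)
        return cls(coins=data["coins"], unlocked=set(data["unlocked"]))
```

Because unlocks are purely cosmetic, none of this touches the gameplay code, which is part of why it adds goals without adding rules.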

Where vibe coding helped, and where it did not

AI assistance was excellent for the scaffolding: the core game loop, procedural lane spawning, collision detection, the coin and save system, and the unlock screens. That got me from nothing to a playable prototype fast, which is exactly when a game project is most fragile and most likely to be abandoned. Getting to “I can play this” quickly kept the momentum alive.
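For a sense of what that scaffolding looks like, here is a minimal Python sketch of seeded lane spawning and a one-dimensional overlap check. It is an illustration under assumptions, not the game's code: the lane kinds mirror the hazards named in this post, while `spawn_lanes`, `overlaps`, the `Hazard` type, and the 0.6 hazard weight are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List

HAZARD_LANES = ["road", "rail", "river"]


@dataclass
class Hazard:
    x: float      # left edge, in tile units
    width: float  # length, in tile units


def spawn_lanes(count: int, seed: int, hazard_weight: float = 0.6) -> List[str]:
    """Generate `count` lanes. The first is always safe grass; the rest are random."""
    rng = random.Random(seed)  # seeded so a run can be reproduced while tuning
    lanes: List[str] = []
    for i in range(count):
        if i == 0 or rng.random() >= hazard_weight:
            lanes.append("grass")
        else:
            lanes.append(rng.choice(HAZARD_LANES))
    return lanes


def overlaps(player_x: float, hazard: Hazard, player_width: float = 1.0) -> bool:
    """True when the player's tile span intersects the hazard's span on the same lane."""
    return player_x < hazard.x + hazard.width and hazard.x < player_x + player_width
```

Seeding the generator is what makes the tuning loop practical: you can replay the same stretch of lanes after changing one number and feel the difference.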

What the AI absolutely could not do was tell me whether the game felt good. Tuning the hop, spacing the hazards so they are hard but fair, pacing the difficulty so it pulls you forward without frustrating you, all of that is taste and playtesting, and it stays firmly human. AI gets you a working game. Only playing it, over and over, turns it into a fun game.

What polish really means

The biggest surprise was how much the tiny, invisible details mattered. A casual game lives or dies on a hundred things no player could ever name: the exact delay before the eagle appears, the weight of the landing, the half-second of feedback after a coin, the way the camera nudges forward. None of it shows up in a feature list, and all of it is the difference between a game that feels cheap and one that feels good in the hand. That obsession with feel is a muscle, and once you build it on a game where feel is the entire product, you start noticing the same missing polish everywhere else you ship, in the tools you thought were already done.

Lessons from shipping it

The first 30 seconds decide everything. Casual players judge instantly. If the first session is not fun, there is no second session, so the opening has to land.
Juice matters. Small feedback (the sounds, the little animations, the satisfying coin pop) makes simple gameplay feel good. Juice is not decoration; it is the feel.
Keep it offline and free. No connection required, local progress saving, no paywalls standing between the player and fun. Friction is the enemy of casual.
Constraints breed charm. A capybara and one clean mechanic beat a sprawling design I would never have finished. The limits made the game; they did not hold it back.

Building a game after a string of tools was the best thing I did for my craft this year. It forces you to care about feel, not just function, and that lesson follows you back into everything else you build, including the serious apps where “it works” was secretly never quite enough either.

How I Built an AI Chat App That Runs Entirely On Your Phone

Sun, 14 Jun 2026 00:00:00 +0000

Every mainstream AI chat app sends your conversations to someone’s server. I wanted the opposite: an app where the model runs on your phone, your messages never leave the device, and there is no login and no subscription. That became Personal LLM, and getting a language model to run well on a phone turned out to be the most interesting engineering problem I have taken on as a solo builder.

The itch

I kept hitting the same three frustrations with mainstream AI apps:

Privacy. Everything you type goes to a server and, often, into training data.
Connectivity. On a plane or with bad signal, they are useless.
Cost. Another $10 to $20 a month for something I use in unpredictable bursts.

On-device models had quietly gotten good enough that I started to wonder: what if none of that were true? What if the AI just lived on your phone, the way a calculator does? “Your AI, your device,” private, offline, and actually powerful. The idea would not let go of me, so I built it.

The hard part: a language model on a phone

This is where it stops being a UI project and becomes a systems problem. Phones are not servers, and the constraints are brutal enough that they drive every single decision.

Memory. A phone might have 4 to 8 GB of RAM, shared with the OS and every other app. A model that loads fine on a laptop will simply get killed on a phone. That alone rules out most models.
Model size on disk. Nobody will tolerate a 20 GB download. The realistic sweet spot is a few hundred MB to a few GB.
Speed. Generation speed, in tokens per second, has to be high enough to feel like a conversation, not a fax machine slowly printing.
Heat and battery. Run the processor flat out and the phone gets hot and the battery drains, so the work has to be efficient, not just possible.

The unlock is small, quantized models. Quantization shrinks a model by storing its weights at lower precision, for example 4-bit integers instead of 16-bit floats, which trades a little quality for a large reduction in size and memory (roughly 4x in that example). I built around Qwen 3, from tiny 0.6B variants up to larger ones, and GLM-Edge, including vision-capable variants, with downloads ranging from roughly 500 MB to 4 GB. The user picks a model that fits their device, downloads it once, and from then on it runs fully offline.
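The arithmetic behind that is simple enough to sketch. The Python below estimates weight storage as parameters times bits per weight, then shows a toy symmetric int8 round trip. It illustrates the general idea only and is not the quantization scheme used by the models in the app. The `fits_in_ram` rule, with its 50 percent usable fraction and 0.5 GB overhead, is a made-up heuristic for the example.

```python
from typing import List, Tuple


def weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes). Real files are somewhat larger
    because of metadata, per-block scales, and layers kept at higher precision."""
    return params_billions * bits_per_weight / 8


def fits_in_ram(params_billions: float, bits: float, device_ram_gb: float,
                usable_fraction: float = 0.5, overhead_gb: float = 0.5) -> bool:
    """Hypothetical rule of thumb: weights plus runtime overhead must fit in the
    share of RAM the OS is likely to let one app keep."""
    return weight_size_gb(params_billions, bits) + overhead_gb <= device_ram_gb * usable_fraction


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Toy symmetric quantization: map floats onto integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    return [round(w / scale) for w in weights], scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate floats; the error per weight is at most half a step."""
    return [v * scale for v in q]
```

By this estimate a 0.6B model needs about 1.2 GB of weights at 16 bits and about 0.3 GB at 4 bits, which is the difference between a download nobody finishes and one that fits comfortably on a phone.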

The trade-off you have to make peace with, and be honest with users about, is this: a 1B model on a phone is not a frontier model in the cloud. It will not write your dissertation. But for quick questions, drafting, summarizing, brainstorming, rewriting, and anything on a plane, it is genuinely useful, and it is yours. Setting that expectation clearly inside the app matters more than overpromising and disappointing people on their first message.

Giving users the dials, with sane defaults

Different people want different behavior, so I exposed the controls that matter: temperature, top-k, and top-p for shaping how creative or focused the responses are, plus toggles for a “thinking” mode versus a “fast” mode depending on whether you want careful reasoning or a quick reply. The risk with exposing model internals is overwhelming a casual user, so the rule was: every control has a good default, and you never need to touch any of them to get a solid answer. Power users get the dials. Everyone else gets something that just works out of the box.
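For readers who have not met these dials before, here is a minimal pure-Python sketch of how temperature, top-k, and top-p typically combine when picking the next token. It shows the standard technique, not the app's inference code, and the default values are assumptions chosen for the example.

```python
import math
import random
from typing import List, Optional


def sample_token(logits: List[float], temperature: float = 0.7, top_k: int = 40,
                 top_p: float = 0.9, rng: Optional[random.Random] = None) -> int:
    """Return the index of the sampled token. Defaults here are illustrative."""
    if rng is None:
        rng = random.Random()
    if temperature <= 0:
        # Treat zero temperature as greedy decoding: always take the top token.
        return max(range(len(logits)), key=logits.__getitem__)

    # Temperature: below 1 sharpens the distribution, above 1 flattens it.
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # subtract the max for stability
    total = sum(weights)
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the limit).
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for prob, index in ranked:
        kept.append((prob, index))
        cumulative += prob
        if cumulative >= top_p:
            break

    # random.choices renormalizes the surviving weights for us.
    return rng.choices([i for _, i in kept], weights=[p for p, _ in kept], k=1)[0]
```

Lower temperature and tighter top-k or top-p give more focused replies; loosening them gives more varied ones. That is the whole mental model a power user needs, and a casual user never has to see it.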

Where vibe coding carried me

I am one person. I could not have shipped this typing every line. The AI assistant handled the parts that are tedious but well understood:

The chat UI: message bubbles, streaming token rendering, scroll behavior, the copy button, the empty state.
The model download manager: progress, resume after interruption, storage checks, and deletion to free space.
The settings layer: surfacing temperature, top-k, top-p, and the mode toggles in a way that a normal person can ignore safely.

What I had to own personally was the hard 20 percent: wiring up on-device inference, managing memory so the app does not get killed mid-generation, and tuning the defaults so a non-technical person gets a good answer without touching a single slider. That split, where the AI does the boilerplate and you own the load-bearing parts, is the whole game. I wrote about it in what vibe coding actually is.
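The download manager from the list above is a good example of tedious but well-understood code. Here is a minimal Python sketch of its three core pieces: a storage check, a resume point expressed as an HTTP `Range` header, and appending chunks to a partial file. It is illustrative only, and the helper names, the `.part` file convention, and the 200 MB safety margin are assumptions. The network request is left out on purpose; `chunks` is whatever iterable of bytes your HTTP client yields.

```python
import os
import shutil
from typing import Callable, Dict, Iterable


def has_room(dest_dir: str, needed_bytes: int,
             margin_bytes: int = 200 * 1024 * 1024) -> bool:
    """Storage check before downloading (the safety margin is an assumption)."""
    return shutil.disk_usage(dest_dir).free >= needed_bytes + margin_bytes


def resume_headers(partial_path: str) -> Dict[str, str]:
    """Ask the server for only the bytes we do not have yet."""
    done = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    return {"Range": f"bytes={done}-"} if done else {}


def write_chunks(partial_path: str, final_path: str, chunks: Iterable[bytes],
                 total_bytes: int, on_progress: Callable[[float], None]) -> bool:
    """Append chunks to the partial file; promote it to the final name when complete."""
    done = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    with open(partial_path, "ab") as f:
        for chunk in chunks:
            f.write(chunk)
            done += len(chunk)
            on_progress(done / total_bytes)
    if done == total_bytes:
        os.replace(partial_path, final_path)  # atomic rename: never a half-written model
        return True
    return False  # interrupted; a later call can resume from resume_headers()
```

A server that supports range requests answers a `Range` header with `206 Partial Content`. One that ignores it sends the whole file back with `200`, in which case the partial file should be discarded before writing rather than appended to.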

Product decisions that mattered

One-time download, then offline forever. The only moment of friction is the first model download. After that, zero network and zero waiting, ever.
No account. An account is a privacy promise you can break. No account means there is nothing to leak, nothing to breach, and nothing to log in to.
Sensible defaults, optional depth. Casual users get a model that just answers. Power users get the sampling controls. Neither group is punished for the other.
Free. With no server costs to cover, there is no subscription to justify, which also means no billing, no payment processing, and no churn to manage.

What I would tell another builder

On-device AI feels like magic to users precisely because everyone assumes it is impossible. “You can run that on a phone?” That gap between expectation and reality is a wonderful place to build a product, because the perceived difficulty is doing your marketing for you. But respect the constraints. Pick models per device tier, be honest about what a small model can and cannot do, and spend your scarce hand-written effort on memory and performance, not on the parts an AI can scaffold in an afternoon.

The result is an app I actually use, on planes, on the subway, and any time I would rather not hand my private thoughts to a server. That is the kind of product worth shipping: one you reach for yourself. Building privacy-first apps turned into a whole theme for me, which I dug into again with an offline speech-to-text app.