Every mainstream AI chat app sends your conversations to someone’s server. I wanted the opposite: an app where the model runs on your phone, your messages never leave the device, and there is no login and no subscription. That became Personal LLM, and getting a language model to run well on a phone turned out to be the most interesting engineering problem I have taken on as a solo builder.
The itch
I kept hitting the same three frustrations with mainstream AI apps:
- Privacy. Everything you type goes to a server and, often, into training data.
- Connectivity. On a plane or with bad signal, they are useless.
- Cost. Another $10 to $20 a month for something I use in unpredictable bursts.
On-device models had quietly gotten good enough that I started to wonder: what if none of that were true? What if the AI just lived on your phone, the way a calculator does? The pitch wrote itself: “Your AI, your device.” Private, offline, and actually powerful. The idea would not let go of me, so I built it.
The hard part: a language model on a phone
This is where it stops being a UI project and becomes a systems problem. Phones are not servers, and the constraints are brutal enough that they drive every single decision.
- Memory. A phone might have 4 to 8GB of RAM, shared with the OS and every other app. A model that loads fine on a laptop can push the app past its memory limit on a phone, and the OS will simply kill it. That alone rules out most models.
- Model size on disk. Nobody will tolerate a 20GB download. The realistic sweet spot is a few hundred MB to a few GB.
- Speed. Generation speed, measured in tokens per second, has to feel like a conversation, not like a fax machine slowly printing.
- Heat and battery. Run the processor flat out and the phone gets hot and the battery drains, so the work has to be efficient, not just possible.
The unlock is small, quantized models. Quantization shrinks a model by storing its weights at lower precision, for example 4-bit integers in place of 16-bit floats, which trades a little quality for a large reduction in size and memory. I built around Qwen 3, from tiny 0.6B variants up to larger ones, and GLM-Edge, including vision-capable variants, with downloads ranging from roughly 500MB to 4GB. The user picks a model that fits their device, downloads it once, and from then on it runs fully offline.
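To make the size argument concrete, here is a rough back-of-the-envelope sketch in Python. It is not how the app or any particular model format does it: real quantized files use per-block scales and carry extra metadata, so actual download sizes differ. The function names are mine, and the 8-bit round trip is only the simplest possible illustration of "store small integers plus a scale."

```python
def weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def quantize_int8(weights):
    """Symmetric 8-bit quantization: one float scale plus integers in [-127, 127]."""
    # Fall back to a scale of 1.0 when every weight is zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return scale, [round(w / scale) for w in weights]


def dequantize(scale, quantized):
    """Recover approximate float weights from the scale and the integers."""
    return [scale * q for q in quantized]


if __name__ == "__main__":
    # A 0.6B-parameter model at 16 bits versus 4 bits per weight.
    print(weight_size_gb(0.6e9, 16))  # about 1.2 GB
    print(weight_size_gb(0.6e9, 4))   # about 0.3 GB

    scale, q = quantize_int8([0.5, -1.27, 0.01])
    print(q, dequantize(scale, q))
```

The "little quality" being traded away is the rounding error: each recovered weight can be off by up to half a quantization step.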
The trade-off you have to make peace with, and be honest with users about, is this: a 1B model on a phone is not a frontier model in the cloud. It will not write your dissertation. But for quick questions, drafting, summarizing, brainstorming, rewriting, and anything on a plane, it is genuinely useful, and it is yours. Setting that expectation clearly inside the app matters more than overpromising and disappointing people on their first message.
Giving users the dials, with sane defaults
Different people want different behavior, so I exposed the controls that matter: temperature, top-k, and top-p for shaping how creative or focused the responses are, plus toggles for a “thinking” mode versus a “fast” mode depending on whether you want careful reasoning or a quick reply. The risk with exposing model internals is overwhelming a casual user, so the rule was: every control has a good default, and you never need to touch any of them to get a solid answer. Power users get the dials. Everyone else gets something that just works out of the box.
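For readers who have not met these dials before, here is a minimal sketch of what they do to a single next-token choice. It is written in plain Python for clarity and is not the app's inference code. The function name and the default values (0.7, 40, 0.9) are illustrative assumptions, not the defaults I shipped.

```python
import math
import random


def sample_token(logits, temperature=0.7, top_k=40, top_p=0.9, rng=random):
    """Pick one token id from raw logits using temperature, top-k, and top-p."""
    # Temperature 0 (or below) means "always take the single best token".
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature: below 1 sharpens the distribution, above 1 flattens it.
    scaled = [x / temperature for x in logits]

    # Top-k: keep only the k highest-scoring tokens (0 disables the filter).
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Softmax over the survivors, shifted by the max for numerical stability.
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept_ids, kept_probs, cumulative = [], [], 0.0
    for token_id, p in zip(order, probs):
        kept_ids.append(token_id)
        kept_probs.append(p)
        cumulative += p
        if cumulative >= top_p:
            break

    # Draw one token from what is left, weighted by probability.
    return rng.choices(kept_ids, weights=kept_probs, k=1)[0]
```

Lower temperature, smaller top-k, and smaller top-p all push toward focused, repeatable answers. Raising them lets less likely tokens through, which reads as more creative and less predictable.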
Where vibe coding carried me
I am one person. I could not have shipped this by typing every line myself. The AI assistant handled the parts that are tedious but well understood:
- The chat UI: message bubbles, streaming token rendering, scroll behavior, the copy button, the empty state.
- The model download manager: progress, resume after interruption, storage checks, and deletion to free space.
- The settings layer: surfacing temperature, top-k, top-p, and the mode toggles in a way that a normal person can ignore safely.
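As an example of why the download manager counts as well understood, its core decision logic fits in one small function. This is a sketch in Python rather than the app's actual code: `DownloadPlan`, `plan_download`, and the 200MB safety margin are hypothetical, and resuming relies on the server honoring standard HTTP `Range` requests.

```python
from dataclasses import dataclass, field


@dataclass
class DownloadPlan:
    action: str                                   # "done", "start", "resume", or "insufficient_storage"
    headers: dict = field(default_factory=dict)   # extra HTTP headers for the request
    remaining_bytes: int = 0


def plan_download(total_bytes, bytes_on_disk, free_bytes,
                  safety_margin=200 * 1024 * 1024):
    """Decide how to fetch a model file given what is already on disk."""
    # A partial file larger than the real one is corrupt: start over.
    if bytes_on_disk > total_bytes:
        bytes_on_disk = 0

    remaining = total_bytes - bytes_on_disk
    if remaining == 0:
        return DownloadPlan("done")

    # Storage check: refuse early instead of failing halfway through.
    if remaining + safety_margin > free_bytes:
        return DownloadPlan("insufficient_storage", remaining_bytes=remaining)

    # Resume after interruption: ask only for the bytes we do not have yet.
    if bytes_on_disk > 0:
        headers = {"Range": f"bytes={bytes_on_disk}-"}
        return DownloadPlan("resume", headers, remaining)

    return DownloadPlan("start", remaining_bytes=remaining)


def progress_percent(total_bytes, bytes_on_disk):
    """Progress for the UI, clamped to the 0 to 100 range."""
    return max(0, min(100, round(100 * bytes_on_disk / total_bytes)))
```

Keeping the decision separate from the actual network call makes it easy to test every branch without downloading anything.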
What I had to own personally was the hard 20 percent: wiring up on-device inference, managing memory so the app does not get killed mid-generation, and tuning the defaults so a non-technical person gets a good answer without touching a single slider. That split, where the AI does the boilerplate and you own the load-bearing parts, is the whole game. I wrote more about it in my piece on what vibe coding actually is.
Product decisions that mattered
- One-time download, then offline forever. The only moment of friction is the first model download. After that, zero network and zero waiting, ever.
- No account. An account is a privacy promise you can break. No account means there is nothing to leak, nothing to breach, and nothing to log in to.
- Sensible defaults, optional depth. Casual users get a model that just answers. Power users get the sampling controls. Neither group is punished for the other.
- Free. With no server costs to cover, there is no subscription to justify, which also means no billing, no payment processing, and no churn to manage.
What I would tell another builder
On-device AI feels like magic to users precisely because everyone assumes it is impossible. “You can run that on a phone?” That gap between expectation and reality is a wonderful place to build a product, because the perceived difficulty is doing your marketing for you. But respect the constraints. Pick models per device tier, be honest about what a small model can and cannot do, and spend your scarce hand-written effort on memory and performance, not on the parts an AI can scaffold in an afternoon.
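"Pick models per device tier" can be as simple as the sketch below. Everything in it is a placeholder: the catalog entries, the runtime memory estimates, and the 40 percent RAM budget are made-up numbers for illustration, not measurements from Personal LLM. In a real app you would measure peak memory per model on actual devices.

```python
from typing import NamedTuple, Optional


class ModelOption(NamedTuple):
    name: str
    download_gb: float      # size of the one-time download
    est_runtime_gb: float   # rough peak RAM while generating (placeholder values)


# Hypothetical catalog; names and numbers are illustrative only.
CATALOG = [
    ModelOption("small", 0.5, 0.8),
    ModelOption("medium", 1.5, 2.0),
    ModelOption("large", 4.0, 4.5),
]


def pick_model(total_ram_gb: float,
               catalog=CATALOG,
               ram_fraction: float = 0.4) -> Optional[ModelOption]:
    """Return the largest model whose estimated footprint fits the RAM budget."""
    # The OS and other apps share the RAM, so budget only a fraction of it.
    budget_gb = total_ram_gb * ram_fraction
    fitting = [m for m in catalog if m.est_runtime_gb <= budget_gb]
    if not fitting:
        return None  # better to say "not supported" than to crash mid-generation
    return max(fitting, key=lambda m: m.est_runtime_gb)


if __name__ == "__main__":
    for ram in (1, 4, 8, 12):
        choice = pick_model(ram)
        print(ram, "GB ->", choice.name if choice else "no model fits")
```

The point of the `None` branch is the honesty part: if nothing fits, the app should say so up front.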
The result is an app I actually use, on planes, on the subway, and any time I would rather not hand my private thoughts to a server. That is the kind of product worth shipping: one you reach for yourself. Building privacy-first apps turned into a whole theme for me, which I dug into again with an offline speech-to-text app.