AI Runs on Your Phone Now: Gemma 4 Edge, BitNet and LiteRT-LM
I normally run Ollama on a PC with two GPUs installed. I’ve written plenty of local-AI articles, and I’ve grown completely used to a life of running LLMs on my home GPUs.
But across 2025–2026, phone-oriented AI models and frameworks were announced one after another: Google’s “Gemma 4 Edge,” Microsoft’s “BitNet b1.58,” and Google’s inference framework “LiteRT-LM.” What they all share is a direction: running AI with no GPU, using only the CPU/NPU of a phone or laptop.
To put the conclusion up front: phone AI (2B–4B) is plenty for a quick question or a translation, but serious use still needs a GPU setup. In this article I compare the three technologies — Gemma 4 Edge, BitNet, and LiteRT-LM — and lay out how to split work between them and a GPU setup.
- 1. What is an edge LLM?
- 2. How quantization works: why it runs on a phone
- 3. The three big topics of April 2026
- 4. Comparison table: edge LLM vs GPU setup
- 5. Value scatter plot data
- 6. So, do you not need a GPU?
- 7. How to try it right now
- 8. Which phones can run it? Sorting out the minimum specs
- 9. Summary
- 10. Sources referenced
What is an edge LLM?
First, some terminology. An “edge LLM" is a large language model that runs directly on the CPU or NPU (neural processing unit) of a device in your hand (phone, tablet, laptop), rather than on a cloud GPU server.
Traditional local AI meant “putting a GPU in your home gaming PC and running it there." An edge LLM goes one step further, with the concept of running in the phone in your pocket.
Edge LLMs have two big advantages.
- Your data never leaves: nothing is sent to the cloud, so privacy is fully protected. You can summarize confidential documents or ask personal questions without a second thought
- No internet needed: it works even where there’s no signal. AI you can use on a plane, in the mountains, or during a disaster
The downside is just as clear: the model size is small. Because it has to fit in a phone’s memory (6–8GB), the current ceiling is 2B–4B parameters. Compared with the 14B–32B models that run on a GPU setup, there’s a gap in output quality and depth of reasoning.
How quantization works: why it runs on a phone
You can’t understand edge LLMs without “quantization." This is the core technology that makes phone AI possible, so let me go a little deeper.
Shrinking the model by lowering the precision of the parameters
An LLM’s parameters (weights) are originally trained in FP32 (32-bit floating point). Each parameter uses 32 bits = 4 bytes, so a 2B model is 2,000,000,000 × 4 bytes = about 8GB. That won’t fit in a phone’s memory.
Quantization is the technique of progressively lowering this precision to compress the size.
| Format | Bits | Size per parameter | Est. size of a 2B model | Impact on quality |
|---|---|---|---|---|
| FP32 (at training) | 32bit | 4 bytes | about 8GB | Baseline (100%) |
| FP16 / BF16 | 16bit | 2 bytes | about 4GB | Almost no degradation |
| INT8 | 8bit | 1 byte | about 2GB | Slight degradation |
| INT4 (GPTQ/AWQ) | 4bit | 0.5 bytes | about 1GB | Noticeable degradation |
| 1.58bit (BitNet) | 1.58bit | about 0.2 bytes | about 0.4GB | Compensated by dedicated design |
FP32’s 8GB compresses down to 0.4GB with quantization — a full 1/20th. That fits easily into an 8GB phone’s memory.
[Figure: 2B model size by quantization level]
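The sizes in the table above follow from one multiplication, so they are easy to reproduce for other parameter counts. The following is a minimal sketch rather than an exact size calculator: the function name `model_size_gb` and the `FORMATS` dictionary are illustrative names of my own, and the result covers weights only (real model files also carry embeddings, metadata, and sometimes mixed-precision layers).

```python
# Weight-only size estimate: parameters x bits per parameter, converted to GB.
# Treat the output as a rough lower bound, not the exact download size.

FORMATS = {
    "FP32": 32,
    "FP16 / BF16": 16,
    "INT8": 8,
    "INT4": 4,
    "1.58bit (BitNet)": 1.58,
}


def model_size_gb(n_params: float, bits_per_param: float) -> float:
    """Return the weight size in GB, using 1 GB = 10**9 bytes."""
    return n_params * bits_per_param / 8 / 1e9


for name, bits in FORMATS.items():
    # 2e9 parameters corresponds to the 2B model used in the table.
    print(f"{name:<18} {model_size_gb(2e9, bits):5.2f} GB")
```

Swapping `2e9` for `4e9` gives the same picture for a 4B model, which is a quick way to sanity-check whether a given quantization level could fit in a particular phone's RAM.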
NPU vs GPU vs CPU: how much does compute efficiency differ?
Edge LLMs run on three kinds of processor: CPU, GPU, and NPU. Their efficiency at AI inference, measured in TOPS/W (trillions of operations per second per watt), differs completely.
| Processor | Examples | AI compute performance | Power efficiency (TOPS/W) | Good at |
|---|---|---|---|---|
| CPU | Apple M2, Snapdragon 8 Gen 3 | 1–5 TOPS | about 0.5–1 TOPS/W | General compute, integer math |
| GPU (mobile) | Adreno 750, Apple GPU | 5–15 TOPS | about 1–3 TOPS/W | Parallel floating-point math |
| NPU | Apple Neural Engine, Hexagon | 15–45 TOPS | about 5–15 TOPS/W | Specialized for matrix math |
| GPU (desktop) | RTX 3090 (reference) | 285 TOPS (INT8) | about 0.8 TOPS/W | Large-scale parallel processing |
What I want you to notice is the NPU’s power efficiency: going by the table, it is roughly 5× that of a mobile GPU and around 10× that of a CPU. An NPU is nowhere near a desktop GPU in raw compute performance, but in AI compute efficiency per watt it’s far superior. That’s the reason AI can run on a battery-powered phone.
The RTX 3090’s 285 TOPS is in another league, but it draws a full 350W. A phone battery (about 15Wh) would be empty in a few minutes at that rate. The NPU is modest at 15 TOPS, but it runs on 1–3W, so you can keep using it for hours on a phone.
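The battery arithmetic behind that comparison can be checked in a few lines. This is a rough sketch using only the figures quoted above (285 TOPS, 350W, a 15Wh battery, 1–3W for the NPU); the helper names are my own, and it assumes the processor is the only thing drawing power, which ignores the screen, radio, and the rest of the phone.

```python
def tops_per_watt(tops: float, watts: float) -> float:
    """Power efficiency: compute throughput divided by power draw."""
    return tops / watts


def runtime_minutes(battery_wh: float, draw_watts: float) -> float:
    """Minutes until a battery of the given capacity is drained at a constant draw."""
    return battery_wh / draw_watts * 60


BATTERY_WH = 15  # approximate phone battery capacity quoted in the text

print(f"RTX 3090 efficiency: {tops_per_watt(285, 350):.2f} TOPS/W")
print(f"15Wh at 350W lasts:  {runtime_minutes(BATTERY_WH, 350):.1f} minutes")
print(f"15Wh at 3W lasts:    {runtime_minutes(BATTERY_WH, 3) / 60:.1f} hours")
print(f"15Wh at 1W lasts:    {runtime_minutes(BATTERY_WH, 1) / 60:.1f} hours")
```

At 350W the battery lasts under three minutes, while at 1–3W it lasts 5 to 15 hours of continuous inference under these idealized assumptions.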
Memory bandwidth determines token-generation speed
When running AI on a phone, memory bandwidth is often the bottleneck more than the CPU or NPU compute speed. That’s because LLM inference reads the model’s entire set of parameters from memory each time it generates a token.
The approximate formula for token-generation speed is as follows.
tok/s ≈ memory bandwidth (GB/s) ÷ model size (GB)
Wider memory bandwidth is faster, and a smaller model is faster. Let’s plug in some concrete numbers.
| Device | Memory bandwidth | Model | Model size | Est. tok/s | Reported measured value (reference) |
|---|---|---|---|---|---|
| iPhone 15 Pro (A17 Pro) | about 50 GB/s | Gemma 4 E4B (INT4) | about 2.5GB | about 20 | 10–15 |
| Pixel 9 Pro (Tensor G4) | about 44 GB/s | Gemma 4 E2B (INT4) | about 1.3GB | about 34 | 15–20 |
| Mac M2 (16GB) | 100 GB/s | BitNet 2B (1.58bit) | 0.4GB | about 250 | 45 (capped because it runs on the CPU) |
| RTX 3090 | 936 GB/s | Qwen3 8B (INT4) | about 5GB | about 187 | about 50 (overhead applies) |
[Figure: Estimated token generation speed by device]
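The "Est. tok/s" column in the table is just the formula applied to each row. Here is a small sketch that reproduces it; the function name `est_tokens_per_second` and the `DEVICES` list are illustrative, and the bandwidth and size values are the approximate figures from the table, not measurements of mine.

```python
def est_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper-bound estimate: every token requires reading all weights once."""
    return bandwidth_gb_s / model_size_gb


# (device, memory bandwidth in GB/s, model size in GB), taken from the table
DEVICES = [
    ("iPhone 15 Pro + Gemma 4 E4B (INT4)", 50, 2.5),
    ("Pixel 9 Pro + Gemma 4 E2B (INT4)", 44, 1.3),
    ("Mac M2 + BitNet 2B (1.58bit)", 100, 0.4),
    ("RTX 3090 + Qwen3 8B (INT4)", 936, 5.0),
]

for name, bandwidth, size in DEVICES:
    print(f"{name:<36} about {est_tokens_per_second(bandwidth, size):.0f} tok/s")
```

The estimate is a ceiling set by memory bandwidth alone. As the reported values in the table show, real speeds land well below it once compute limits and runtime overhead come into play.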
The three big topics of April 2026
1. Gemma 4 E2B / E4B (Google)
Google announced this edge-oriented LLM in April 2026. The “E” in the name stands for Effective: E2B is a 2B-parameter model and E4B a 4B-parameter model.
Specs and features:
- E2B (2B parameters): conversational speed on iPhone 14 Pro and later. Estimated 15–20 tok/s
- E4B (4B parameters): conversational speed on iPhone 15 Pro and later. Estimated 10–15 tok/s
- Multimodal: handles not just text but images and audio. You can show a photo taken with the phone’s camera and ask “what’s this?"
- Fully offline: once you download the model, no internet connection is needed afterward
- Core ML conversion support: runs on iOS’s native inference engine, so it pairs well with Apple devices
Above all, it’s incredibly easy to try. Just install the Google AI Edge Gallery app that Google released (both iOS and Android). Pick Gemma 4 E2B or E4B on the model-selection screen and you can start chatting right away. No installing Ollama, no terminal work.
Compared with the steps to run Ollama on an RTX 3090 (install Ollama → download the model → type commands), the barrier is overwhelmingly lower. Since you only install a phone app, you can recommend it even to family members who aren’t tech-savvy.
2. BitNet b1.58 2B4T (Microsoft)
A model developed by Microsoft that’s seriously aiming for “no GPU required." It takes a technically fascinating approach.
How 1-bit quantization works:
A normal LLM represents parameters as 16-bit or 4-bit numbers. BitNet cuts this to the extreme, representing each weight with only the three values −1, 0, +1. Three states carry log2(3) ≈ 1.58 bits of information, which is where the “1.58bit” name comes from. Multiplying by a weight becomes unnecessary and the weight matrix products reduce to addition and subtraction alone, so no GPU floating-point units are required.
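To make the "addition and subtraction alone" point concrete, here is a toy sketch in plain Python. It assumes the absmean scheme as I understand it from the BitNet b1.58 paper (divide by the mean absolute weight, round, clip to −1, 0, +1); the function names are my own, and this is not how the official bitnet.cpp kernels are written, which pack weights tightly and also quantize activations.

```python
import math


def quantize_ternary(weights):
    """Map float weights to {-1, 0, +1} plus one shared scale (absmean, assumed)."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0  # avoid divide-by-zero
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale


def ternary_dot(q, scale, x):
    """Dot product that only adds or subtracts inputs, then rescales once."""
    acc = 0.0
    for qi, xi in zip(q, x):
        if qi == 1:
            acc += xi
        elif qi == -1:
            acc -= xi
        # qi == 0 contributes nothing, so the input is skipped entirely
    return acc * scale


weights = [0.9, -0.8, 0.05, 0.4, -1.2]
inputs = [1.0, 2.0, 3.0, 4.0, 5.0]

q, scale = quantize_ternary(weights)
print("ternary weights:", q)
print("approx dot:", ternary_dot(q, scale, inputs))
print("exact dot: ", sum(w * x for w, x in zip(weights, inputs)))
print("bits per ternary weight:", math.log2(3))
```

The approximate and exact dot products differ noticeably on a toy vector like this, which is why BitNet trains the model with ternary weights from the start instead of quantizing an existing FP16 model after the fact.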
Performance in numbers:
- Model size: on the official evaluation metric, the non-embedding memory is about 0.4GB. Even the full footprint when you actually download and run it stays around 1.2GB, so it can run on a phone with 4GB of RAM
- Speed on Apple M2 CPU: 45 tok/s. That’s quite fast; user reports describe it as “replies come right back"
- Speedup on x86 CPUs: achieves a 2.37–6.17× speedup over the conventional approach
- GitHub stars: passed 25,000 not long after release and kept climbing. Interest in the developer community is very high
The future roadmap is spelled out too, describing a plan to run models on the order of 100B parameters on a CPU via 1-bit quantization. If a 100B model could run on the CPU of an ordinary laptop, that would truly be a game changer.
That said, for now it’s only a 2B model, and output quality outside English is still developing. Even the official description says non-English languages are limited; while it posts good numbers on English-language benchmarks, a common assessment is that the same-size Gemma 4 E2B is better at grasping nuance and generating natural prose in other languages.
3. LiteRT-LM (Google)
An LLM inference framework for edge devices that Google announced in April 2026. Whereas Gemma 4 Edge is the model itself, LiteRT-LM is the engine that runs the model. It runs across Android, iOS, Web, desktop, and even Raspberry Pi, and it’s the same foundation Google uses to run Gemini Nano in Chrome and Pixel products.
LiteRT-LM’s aim is clear: it’s a tool for app developers to build LLM features into phone apps. It’s meant for uses like “adding offline AI translation to a translation app" or “adding an AI summary feature to a notes app."
It’s not something we end users touch directly. Even so, LiteRT-LM’s arrival matters. If it becomes easier for developers to embed LLMs, “phone apps with AI features" should multiply rapidly from here — and as offline AI that sends nothing to the cloud.
Comparison table: edge LLM vs GPU setup
Here’s the main event. Let’s put phone edge LLMs and home-GPU Ollama side by side.
| Item | Gemma 4 E2B | Gemma 4 E4B | BitNet 2B | Ollama 8B (GPU) | Ollama 27B (GPU) |
|---|---|---|---|---|---|
| Model size | 2B | 4B | 2B | 8B | 27B |
| Required hardware | iPhone 14 Pro+ | iPhone 15 Pro+ | M2 Mac / general PC | RTX 3060 | RTX 3090 |
| Speed (tok/s) | 15–20 | 10–15 | 45 (M2) / x86 depends on environment | about 50 | about 25–30 |
| Quality | Decent | Practical | Developing | Practical | Quite good |
| Multimodal | Yes | Yes | No | Depends on model | Depends on model |
| Offline | Full | Full | Full | Full | Full |
| Setup difficulty | Just install an app | Just install an app | Terminal work needed | Install Ollama | Install Ollama |
| Extra cost | ¥0 (phone you own) | ¥0 (phone you own) | ¥0 (PC you own) | GPU from ¥30,000 | GPU from ¥70,000 |
How to read this table: From left to right, the required investment rises but the AI gets smarter. Edge LLMs’ 2B–4B models dominate on ease, but on quality and reasoning power they’re clearly outdone by the 8B-and-up GPU models.
Value scatter plot data
A scatter plot of extra cost (horizontal axis) against AI practicality score (vertical axis).
[Scatter plot data]
| Setup | Extra cost (¥) | AI practicality score (out of 100) |
|---|---|---|
| iPhone + Gemma 4 E2B | 0 | 25 |
| iPhone + Gemma 4 E4B | 0 | 35 |
| Mac M4 Pro + BitNet 2B | 0 | 30 |
| Mac M4 Pro + Ollama 14B | 0 | 60 |
| PC + RTX 3060 + Ollama 8B | 30,000 | 45 |
| PC + RTX 5060 Ti 16GB + Ollama 14B | 90,000 | 65 |
| PC + RTX 3090 + Ollama 32B | 70,000 (used) | 85 |
| PC + RTX 5090 + Ollama 32B | 400,000 | 95 |
- X axis: extra cost (¥). ¥0 if you run it on a phone or PC you already own
- Y axis: AI practicality score. An overall rating that factors in output quality and generation speed (out of 100)
- Scoring basis: overall evaluation of naturalness of language, comprehension of questions, depth of reasoning, and generation speed
So, do you not need a GPU?
You still do.
Between edge LLMs’ 2B–4B models and the 14B–32B models that run on a GPU, there’s a clear wall in intelligence. Concretely, differences like these show up.
| Use case | Edge LLM (2B–4B) | GPU setup (14B–32B) |
|---|---|---|
| Answering short questions | Practical | Comfortable |
| Proofreading text | OK for simple fixes | Context-aware corrections possible |
| Summarizing long text | Possible for short passages | Handles multi-page A4 documents too |
| Code generation | Simple snippets at most | Can generate function-level code |
| Complex reasoning / analysis | Tough | Reasonably practical at 14B and up |
| Multi-turn conversation | Forgets context after a few turns | Relatively stable even in long chats |
Between a 2B model and a 32B model, there’s a 16× difference in parameter count. That’s simply like “the size of the brain" being 16× different, and it affects everything — knowledge, reasoning power, and grasp of linguistic nuance.
That said, depending on how you use it, there are situations where an edge LLM is enough.
- Asking “just give me the gist of this English email" while you’re out
- Checking text for typos in an offline environment
- Showing a photo and asking “what’s the name of this plant?" (Gemma 4 Edge’s multimodal)
- Quickly checking a simple translation
For these “quick questions" and “light tasks," a phone’s edge AI can handle it practically.
My conclusion is “split the work.” When out, use the phone’s edge AI; at home, Ollama on an RTX 3090. I think that’s the most sensible approach as of 2026. The era where GPUs become unnecessary is still some way off, but the era where “you can touch AI even without a GPU” has definitely arrived.
How to try it right now
iPhone / Android users (¥0, done in 5 minutes)
- Install the Google AI Edge Gallery app from the App Store / Google Play
- Open the app and pick Gemma 4 E2B from the model list
- Once the model finishes downloading, type a question in the chat screen
That’s all there is to it. No PC knowledge required. You need Wi-Fi for the download, but once it’s on your device you can use it offline.
If you have an iPhone 15 Pro or later, try E4B (the 4B model) too. From what benchmark reports show, it seems a notch smarter than E2B.
Mac users (¥0, done in 10 minutes)
- Install Ollama from ollama.com
- Open a terminal and type `ollama run gemma4`
- The model downloads and the chat begins
On a Mac with M1 or later and enough unified memory (16GB or more as a rough guide), Ollama runs even 14B-class models comfortably. Apple Silicon’s unified memory is actually quite well suited to local AI.
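Once Ollama is running, you are not limited to the terminal chat: it also serves a local HTTP API, so a script can send prompts to the same model. Below is a minimal sketch using only the Python standard library. It assumes Ollama is listening on its default port 11434 and that the `gemma4` model tag from the step above has already been pulled; `build_request` and `ask` are helper names of my own.

```python
import json
import urllib.request

# Default local endpoint of the Ollama server (assumes an unmodified install).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text (requires Ollama to be running)."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server up, `print(ask("gemma4", "Summarize this email in one sentence: ..."))` would print the model's reply. Nothing leaves the machine, since the request only goes to `localhost`.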
Windows PC users (a used GPU if value matters)
If you want to use local AI seriously, buying a used RTX 3060 12GB (¥20,000–30,000) and installing Ollama is the best value as of April 2026.
- Buy an RTX 3060 12GB from a used shop (¥20,000–30,000)
- Install it in your PC and install the driver
- Install Ollama from ollama.com
- Type `ollama run qwen3:8b` in a terminal
With 12GB of VRAM, 8B models run with room to spare, and some 14B models work too in quantized form. Getting a “practical local AI setup" in this price range is a benefit unique to used GPUs.
How to get started with local AI is covered by budget tier in “Running a local AI chatbot at home: a budget-by-budget guide." If you’re unsure about picking a GPU, use that as a reference too.
Which phones can run it? Sorting out the minimum specs
“Will it run on the phone I have?" is a natural concern. Based on Google AI Edge Gallery’s requirements and the model sizes, I’ve organized the support situation by phone.
Android
Google AI Edge Gallery requirements: Android 12 or later. Practically, 6GB of RAM or more is the guideline
| RAM | Runnable model | Experience | Example device (SIM-free) | Price range |
|---|---|---|---|---|
| 4GB | Won’t run | The OS uses 3GB. The remaining 1GB is not enough | Moto G Play, Redmi, and other budget phones | ¥10,000–20,000 |
| 6GB | Gemma 4 E2B (1.3GB) | Runs but slow. OK for simple questions | Pixel 7a, OPPO Reno 9A | ¥20,000–40,000 |
| 8GB | Gemma 4 E4B (4B) | Practical. Text processing, Q&A | Pixel 8a, Galaxy A55 | ¥30,000–50,000 |
| 12GB+ | Gemma 4 E4B (4B) | Comfortable. Handles long text too | Pixel 9, Galaxy S24 | ¥60,000+ |
How to read this table: If you have a phone with 8GB of RAM or more, E4B (the 4B model) runs at a practical speed. If you’re thinking of “a second phone for AI," something like the Pixel 8a (about ¥50,000) is the minimum line. SIM-free versions are available on Amazon and Rakuten.
A 4GB phone effectively won’t run it. Even if you “want to try AI on a cheap phone," aim for at least 6GB, ideally 8GB of RAM.
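The reasoning behind the Android table can be written as a simple fit check. This is only a sketch under stated assumptions: the 3GB OS reservation comes from the 4GB row above and is applied to every RAM size, the INT4 model sizes are the approximate figures used earlier in this article, and the 1GB of runtime headroom (KV cache, app, buffers) is a guess of mine, not a documented requirement of AI Edge Gallery.

```python
OS_RESERVED_GB = 3.0       # from the table: "The OS uses 3GB"
RUNTIME_HEADROOM_GB = 1.0  # assumption: KV cache, app, and runtime buffers

# Approximate INT4 sizes quoted earlier in the article
MODELS_GB = {
    "Gemma 4 E2B": 1.3,
    "Gemma 4 E4B": 2.5,
}


def fits(ram_gb: float, model_gb: float) -> bool:
    """True if the model plus headroom fits in the RAM left over by the OS."""
    return ram_gb - OS_RESERVED_GB >= model_gb + RUNTIME_HEADROOM_GB


def runnable_models(ram_gb: float) -> list:
    """List the models that pass the fit check for a phone with this much RAM."""
    return [name for name, size in MODELS_GB.items() if fits(ram_gb, size)]


for ram in (4, 6, 8, 12):
    print(f"{ram}GB RAM: {runnable_models(ram) or 'nothing fits'}")
```

Under these assumptions the check lines up with the table: nothing fits in 4GB, only E2B fits in 6GB, and E4B becomes available from 8GB.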
iPhone
| Model | Chip | RAM | Runnable model | Notes |
|---|---|---|---|---|
| iPhone 13 and earlier | A15 and earlier | 4–6GB | E2B is borderline | Can’t call it comfortable |
| iPhone 14 Pro/Pro Max | A16 | 6GB | E2B (practical) | Core ML support |
| iPhone 15 Pro/Pro Max | A17 Pro | 8GB | E4B (practical) | High-performance NPU |
| iPhone 16 Pro/Pro Max | A18 Pro | 8GB | E4B (comfortable) | The most comfortable |
For iPhone, iPhone 15 Pro and later is the practical line for E4B. E2B runs even on the iPhone 14 Pro.
Is “a second phone for AI" realistic?
Honestly, if the phone you’re using as your main is a model from within the last 2–3 years, try it on that first. There’s no need to go out and buy a second one.
Still, a “dedicated sub-device for AI" might suit people like these.
- You don’t want to spend your main phone’s battery on AI (AI inference eats battery)
- You want an offline-only AI device (travel, business trips, disaster prep)
- Your main is an old phone (4GB RAM or less) and you’re considering a replacement
In that case, the Pixel 8a (SIM-free / about ¥50,000 / 8GB RAM) is probably the best-balanced option. Being Google’s own, it pairs well with AI Edge Gallery, and it gets long-term update support.
Summary
April 2026 looks likely to be a turning point for edge LLMs.
- Gemma 4 Edge: AI runs by just installing a phone app. Multimodal, and well suited to a first local-AI experience
- BitNet b1.58: a 2B model that fits into a small footprint via 1-bit quantization. Fast at 45 tok/s on the M2 CPU. A path toward a 100B model is in view
- LiteRT-LM: an inference framework for app developers. Groundwork for a rapid rise in AI-featured phone apps to come
Phone AI is shifting from “toy" to “practical tool." Judging by user reports, for short questions, simple translation, and offline text checking, a phone alone already seems to suffice.
But for serious, heavy use, a GPU setup is still stronger. The quality and reasoning power of 14B–32B models are on another level from 2B–4B models. “Phone AI when out, GPU AI at home" is, in my view, the optimal answer as of 2026.
Watching the pace at which edge LLMs are evolving, an era where phones run 10B-and-up models may arrive in 2–3 years. When it does, how will the place of a GPU setup change? I’ll keep following it.
Sources referenced
- Gemma 4: Byte for byte, the most capable open models (Google official blog)
- Gemma 4 model card (Google AI for Developers)
- microsoft/bitnet-b1.58-2B-4T (Hugging Face model card)
- microsoft/BitNet (official 1-bit LLM inference framework, GitHub)
- google-ai-edge/LiteRT-LM (GitHub)
- LiteRT-LM Overview (Google AI for Developers)