AI Runs on Your Phone Now: Gemma 4 Edge, BitNet and LiteRT-LM

July 5, 2026

I normally run Ollama on a PC with two GPUs installed. I’ve written plenty of local-AI articles, and I’ve grown completely used to a life of running LLMs on my home GPUs.

But across 2025–2026, phone-oriented AI models and frameworks were announced one after another: Google’s “Gemma 4 Edge,” Microsoft’s “BitNet b1.58,” and Google’s inference framework “LiteRT-LM.” What they all share is a direction: running AI with no GPU, using only the CPU/NPU of a phone or laptop.

To put the conclusion up front: phone AI (2B–4B) is plenty for a quick question or a translation, but serious use still needs a GPU setup. In this article I compare the three technologies — Gemma 4 Edge, BitNet, and LiteRT-LM — and lay out how to split work between them and a GPU setup.

Contents

1. What is an edge LLM?
2. How quantization works: why it runs on a phone
3. The three big topics of April 2026
4. Comparison table: edge LLM vs GPU setup
5. Value scatter plot data
- 5.1. [Scatter plot data]
6. So, do you not need a GPU?
7. How to try it right now
8. Which phones can run it? Sorting out the minimum specs
9. Summary
- 9.1. Gear featured in this article
10. Sources referenced

What is an edge LLM?

First, some terminology. An “edge LLM” is a large language model that runs directly on the CPU or NPU (neural processing unit) of a device in your hand (phone, tablet, laptop), rather than on a cloud GPU server.

Traditional local AI meant “putting a GPU in your home gaming PC and running it there.” An edge LLM goes one step further, with the concept of running in the phone in your pocket.

Edge LLMs have two big advantages.

Your data never leaves: nothing is sent to the cloud, so privacy is fully protected. You can summarize confidential documents or ask personal questions without a second thought
No internet needed: it works even where there’s no signal. AI you can use on a plane, in the mountains, or during a disaster

The downside is just as clear: the model size is small. Because it has to fit in a phone’s memory (6–8GB), the current ceiling is 2B–4B parameters. Compared with the 14B–32B models that run on a GPU setup, there’s a gap in output quality and depth of reasoning.

How quantization works: why it runs on a phone

You can’t understand edge LLMs without “quantization.” This is the core technology that makes phone AI possible, so let me go a little deeper.

Shrinking the model by lowering the precision of the parameters

An LLM’s parameters (weights) are originally trained in FP32 (32-bit floating point). Each parameter uses 32 bits = 4 bytes, so a 2B model is 2,000,000,000 × 4 bytes = about 8GB. That won’t fit in a phone’s memory.

Quantization is the technique of progressively lowering this precision to compress the size.

Format	Bits	Size per parameter	Est. size of a 2B model	Impact on quality
FP32 (at training)	32bit	4 bytes	about 8GB	Baseline (100%)
FP16 / BF16	16bit	2 bytes	about 4GB	Almost no degradation
INT8	8bit	1 byte	about 2GB	Slight degradation
INT4 (GPTQ/AWQ)	4bit	0.5 bytes	about 1GB	Noticeable degradation
1.58bit (BitNet)	1.58bit	about 0.2 bytes	about 0.4GB	Compensated by dedicated design

FP32’s 8GB compresses down to 0.4GB with quantization — a full 1/20th. That fits easily into an 8GB phone’s memory.
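The arithmetic behind the table can be reproduced in a few lines. This is a rough sketch that counts weights only: the function name `model_size_gb` is my own, and real model files also carry embeddings, metadata, and mixed-precision layers, so treat the results as ballpark figures rather than download sizes.

```python
# Weight-only size estimate: parameters x bits per parameter / 8 bits per byte.
# Illustrative only; actual files on disk will differ somewhat.

PARAMS = 2_000_000_000  # the 2B model used in the table above

BITS_PER_PARAM = {
    "FP32": 32,
    "FP16 / BF16": 16,
    "INT8": 8,
    "INT4": 4,
    "1.58bit (BitNet)": 1.58,
}


def model_size_gb(params: int, bits: float) -> float:
    """Return the weight size in GB, taking 1 GB as 10**9 bytes."""
    return params * bits / 8 / 1e9


if __name__ == "__main__":
    for name, bits in BITS_PER_PARAM.items():
        print(f"{name:18s} {model_size_gb(PARAMS, bits):5.2f} GB")
```

Swapping `PARAMS` for 4 billion or 8 billion gives a quick feel for why 4B is about the ceiling for a 6–8GB phone.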

[Figure: 2B model size by quantization level (FP32 8 GB, FP16 4 GB, INT8 2 GB, INT4 1 GB, BitNet 1.58bit 0.4 GB)]

In a nutshell: Quantization is the technique of “lowering the numeric resolution to dramatically cut file size.” It’s like converting a photo from PNG (high quality, large) to JPEG (slightly degraded, small). BitNet’s 1.58bit is like JPEG compression taken to the extreme, but because the model is designed around that compression from the start, the quality falls off gently.

NPU vs GPU vs CPU: how much does compute efficiency differ?

Edge LLMs run on three kinds of processor: CPU, GPU, and NPU. Their efficiency at AI inference (TOPS/W = trillions of operations per second, per watt of power draw) differs completely.

Processor	Examples	AI compute performance	Power efficiency (TOPS/W)	Good at
CPU	Apple M2, Snapdragon 8 Gen 3	1–5 TOPS	about 0.5–1 TOPS/W	General compute, integer math
GPU (mobile)	Adreno 750, Apple GPU	5–15 TOPS	about 1–3 TOPS/W	Parallel floating-point math
NPU	Apple Neural Engine, Hexagon	15–45 TOPS	about 5–15 TOPS/W	Specialized for matrix math
GPU (desktop)	RTX 3090 (reference)	285 TOPS (INT8)	about 0.8 TOPS/W	Large-scale parallel processing

What I want you to notice is that the NPU’s power efficiency is roughly 5× that of a mobile GPU and 10× or more that of a CPU, going by the figures above. An NPU is nowhere near a desktop GPU in raw compute performance, but in AI compute efficiency per watt it’s far superior. That’s the reason AI can run on a battery-powered phone.

The RTX 3090’s 285 TOPS is in another league, but it draws a full 350W. A phone battery (about 15Wh) would be empty in a few minutes at that rate. The NPU is modest at 15 TOPS, but it runs on 1–3W, so you can keep using it for hours on a phone.
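To make the battery comparison concrete, here is a small back-of-the-envelope sketch. It assumes a constant power draw and ignores conversion losses and everything else the phone is doing (screen, radio), so the real numbers would be somewhat worse. The function name `runtime_minutes` is just for illustration.

```python
def runtime_minutes(battery_wh: float, draw_watts: float) -> float:
    """Minutes a battery lasts at a constant power draw (idealized)."""
    return battery_wh / draw_watts * 60


BATTERY_WH = 15.0  # phone battery capacity quoted above

if __name__ == "__main__":
    cases = [
        ("Desktop GPU at 350 W", 350.0),
        ("NPU at 3 W", 3.0),
        ("NPU at 1 W", 1.0),
    ]
    for label, watts in cases:
        minutes = runtime_minutes(BATTERY_WH, watts)
        print(f"{label:22s} {minutes:7.1f} min ({minutes / 60:.1f} h)")
```

At 350W the 15Wh battery lasts under three minutes, while at 1–3W the same battery covers 5 to 15 hours of inference on paper.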

Memory bandwidth determines token-generation speed

When running AI on a phone, memory bandwidth is often the bottleneck more than the CPU or NPU compute speed. That’s because LLM inference reads the model’s entire set of parameters from memory each time it generates a token.

The approximate formula for token-generation speed is as follows.

tok/s ≈ memory bandwidth (GB/s) ÷ model size (GB)

Wider memory bandwidth is faster, and a smaller model is faster. Let’s plug in some concrete numbers.

Device	Memory bandwidth	Model	Model size	Est. tok/s	Reported measured value (reference)
iPhone 15 Pro (A17 Pro)	about 50 GB/s	Gemma 4 E4B (INT4)	about 2.5GB	about 20	10–15
Pixel 9 Pro (Tensor G4)	about 44 GB/s	Gemma 4 E2B (INT4)	about 1.3GB	about 34	15–20
Mac M2 (16GB)	100 GB/s	BitNet 2B (1.58bit)	0.4GB	about 250	45 (※ capped due to CPU execution)
RTX 3090	936 GB/s	Qwen3 8B (INT4)	about 5GB	about 187	about 50 (※ overhead applies)

Note: This formula is only a theoretical upper bound. In practice, CPU/NPU compute speed, cache efficiency, and software overhead bring you down to about 30–70% of it, and lower still when compute rather than memory is the limit, as in the Mac M2 and RTX 3090 rows. Even so, the basic law holds: “wide memory bandwidth + a small model = fast.” BitNet hitting 45 tok/s on the M2 is thanks to its tiny 0.4GB model size.
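The estimate column in the table can be recomputed with a few lines of code. This is a sketch of the rule of thumb only: `estimate_tok_per_s` and its `efficiency` parameter are names I made up, and the 0.3 and 0.7 factors are simply the rough efficiency range mentioned in the note, not measured values.

```python
def estimate_tok_per_s(bandwidth_gb_s: float, model_size_gb: float,
                       efficiency: float = 1.0) -> float:
    """Upper-bound tokens/s: every token requires reading all weights once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb * efficiency


# (device + model, memory bandwidth in GB/s, model size in GB) from the table
ROWS = [
    ("iPhone 15 Pro + Gemma 4 E4B (INT4)", 50.0, 2.5),
    ("Pixel 9 Pro + Gemma 4 E2B (INT4)", 44.0, 1.3),
    ("Mac M2 + BitNet 2B (1.58bit)", 100.0, 0.4),
    ("RTX 3090 + Qwen3 8B (INT4)", 936.0, 5.0),
]

if __name__ == "__main__":
    for label, bandwidth, size in ROWS:
        upper = estimate_tok_per_s(bandwidth, size)
        low = estimate_tok_per_s(bandwidth, size, efficiency=0.3)
        high = estimate_tok_per_s(bandwidth, size, efficiency=0.7)
        print(f"{label:36s} bound {upper:6.1f}  "
              f"at 30-70%: {low:5.1f} to {high:5.1f} tok/s")
```

Comparing the printed range with the reported values in the table shows which rows are memory-bound (the phones) and which are held back by something else.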

[Figure: Estimated token generation speed by device (iPhone 15 Pro + E4B 12 tok/s, Pixel 9 Pro + E2B 18 tok/s, Mac M2 + BitNet 2B 45 tok/s, RTX 3090 + Qwen3 8B 50 tok/s)]

The three big topics of April 2026

1. Gemma 4 E2B / E4B (Google)

An Edge-only LLM that Google announced in April 2026. The “E" in the name stands for Effective. E2B is a 2B-parameter model, E4B a 4B-parameter model.

Specs and features:

E2B (2B parameters): conversational speed on iPhone 14 Pro and later. Estimated 15–20 tok/s
E4B (4B parameters): conversational speed on iPhone 15 Pro and later. Estimated 10–15 tok/s
Multimodal: handles not just text but images and audio. You can show a photo taken with the phone’s camera and ask “what’s this?”
Fully offline: once you download the model, no internet connection is needed afterward
Core ML conversion support: runs on iOS’s native inference engine, so it pairs well with Apple devices

Above all, it’s incredibly easy to try. Just install the Google AI Edge Gallery app that Google released (both iOS and Android). Pick Gemma 4 E2B or E4B on the model-selection screen and you can start chatting right away. No installing Ollama, no terminal work.

Compared with the steps to run Ollama on an RTX 3090 (install Ollama → download the model → type commands), the barrier is overwhelmingly lower. Since you only install a phone app, you can recommend it even to family members who aren’t tech-savvy.

Key point: Gemma 4 Edge’s biggest strength is that “anyone can try it immediately.” For someone who’s interested in AI but bad at PC configuration, it can be their first local-AI experience.

2. BitNet b1.58 2B4T (Microsoft)

A model developed by Microsoft that’s seriously aiming for “no GPU required.” It takes a technically fascinating approach.

How 1-bit quantization works:

A normal LLM represents parameters as 16-bit or 4-bit numbers. BitNet cuts this to the extreme, representing each weight with only the three values −1, 0, +1. Three possible values carry log2(3) ≈ 1.58 bits of information, hence “1.58bit.” Multiplying by a weight becomes unnecessary and the matrix math runs on addition and subtraction alone, so no GPU floating-point units are required.
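Here is a toy sketch of the idea in plain Python. It uses absmean-style rounding (scale by the mean absolute weight, round, clip to −1/0/+1), which is my understanding of the scheme described for BitNet b1.58; please treat that, and the helper names `ternarize` and `ternary_dot`, as assumptions rather than Microsoft’s actual implementation, which uses optimized kernels and quantized activations.

```python
import math


def ternarize(weights: list[float], eps: float = 1e-8) -> tuple[list[int], float]:
    """Quantize weights to {-1, 0, +1}; also return the scale used."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def ternary_dot(quantized: list[int], inputs: list[float]) -> float:
    """Dot product with ternary weights using only adds and subtracts."""
    acc = 0.0
    for w, x in zip(quantized, inputs):
        if w == 1:
            acc += x
        elif w == -1:
            acc -= x
        # w == 0: this input is skipped entirely
    return acc


if __name__ == "__main__":
    print(f"bits per ternary weight: {math.log2(3):.3f}")
    weights = [0.9, -0.05, -1.2, 0.4]   # made-up example values
    inputs = [2.0, 5.0, 1.0, 3.0]
    q, scale = ternarize(weights)
    print("ternary weights:", q)
    print("add/sub-only result, rescaled:", ternary_dot(q, inputs) * scale)
    print("full-precision result:", sum(w * x for w, x in zip(weights, inputs)))
```

The rescaled result differs noticeably from the full-precision one in this toy case. That gap is why squashing an ordinary model to three values after the fact works poorly, and why BitNet is trained with ternary weights from the start.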

Performance in numbers:

Model size: on the official evaluation metric, the non-embedding memory is about 0.4GB. Even the full footprint when you actually download and run it stays around 1.2GB, so it can run on a phone with 4GB of RAM
Speed on Apple M2 CPU: 45 tok/s. That’s quite fast; user reports describe it as “replies come right back"
Speedup on x86 CPUs: achieves a 2.37–6.17× speedup over the conventional approach
GitHub stars: passed 25,000 not long after release and kept climbing. Interest in the developer community is very high

The future roadmap is spelled out too, describing a plan to run models on the order of 100B parameters on a CPU via 1-bit quantization. If a 100B model could run on the CPU of an ordinary laptop, that would truly be a game changer.

That said, for now it’s only a 2B model, and output quality outside English is still developing. Even the official description says non-English languages are limited; while it posts good numbers on English-language benchmarks, a common assessment is that the same-size Gemma 4 E2B is better at grasping nuance and generating natural prose in other languages.

Note: BitNet’s 45 tok/s figure is a measurement on the M2 CPU. On the x86 CPU of a typical Windows PC, speed varies a lot with the CPU generation and memory bandwidth, so you won’t necessarily get the same experience.

3. LiteRT-LM (Google)

An LLM inference framework for edge devices that Google announced in April 2026. Whereas Gemma 4 Edge is the model itself, LiteRT-LM is the engine that runs the model. It runs across Android, iOS, Web, desktop, and even Raspberry Pi, and it’s the same foundation Google uses to run Gemini Nano in Chrome and Pixel products.

LiteRT-LM’s aim is clear: it’s a tool for app developers to build LLM features into phone apps. It’s meant for uses like “adding offline AI translation to a translation app” or “adding an AI summary feature to a notes app.”

It’s not something we end users touch directly. Even so, LiteRT-LM’s arrival matters. If it becomes easier for developers to embed LLMs, “phone apps with AI features” should multiply rapidly from here, and as offline AI that sends nothing to the cloud.

Comparison table: edge LLM vs GPU setup

Here’s the main event. Let’s put phone edge LLMs and home-GPU Ollama side by side.

Item	Gemma 4 E2B	Gemma 4 E4B	BitNet 2B	Ollama 8B (GPU)	Ollama 27B (GPU)
Model size	2B	4B	2B	8B	27B
Required hardware	iPhone 14 Pro+	iPhone 15 Pro+	M2 Mac / general PC	RTX 3060	RTX 3090
Speed (tok/s)	15–20	10–15	45 (M2) / x86 depends on environment	about 50	about 25–30
Quality	Decent	Practical	Developing	Practical	Quite good
Multimodal	Yes	Yes	No	Depends on model	Depends on model
Offline	Full	Full	Full	Full	Full
Setup difficulty	Just install an app	Just install an app	Terminal work needed	Install Ollama	Install Ollama
Extra cost	¥0 (phone you own)	¥0 (phone you own)	¥0 (PC you own)	GPU from ¥30,000	GPU from ¥70,000

How to read this table: From left to right, the required investment rises but the AI gets smarter. Edge LLMs’ 2B–4B models dominate on ease, but on quality and reasoning power they’re clearly outdone by the 8B-and-up GPU models.

Value scatter plot data

A scatter plot of extra cost (horizontal axis) against AI practicality score (vertical axis).

[Scatter plot data]

| Setup | Extra cost (¥) | AI practicality score (out of 100) |
|------|-----------------|---------------------------|
| iPhone + Gemma 4 E2B | 0 | 25 |
| iPhone + Gemma 4 E4B | 0 | 35 |
| Mac M4 Pro + BitNet 2B | 0 | 30 |
| Mac M4 Pro + Ollama 14B | 0 | 60 |
| PC + RTX 3060 + Ollama 8B | 30,000 | 45 |
| PC + RTX 5060 Ti 16GB + Ollama 14B | 90,000 | 65 |
| PC + RTX 3090 + Ollama 32B | 70,000 (used) | 85 |
| PC + RTX 5090 + Ollama 32B | 400,000 | 95 |

X axis: extra cost (¥). ¥0 if you run it on a phone or PC you already own
Y axis: AI practicality score. An overall rating that factors in output quality and generation speed (out of 100)
Scoring basis: overall evaluation of naturalness of language, comprehension of questions, depth of reasoning, and generation speed

How to read this graph: The closer to the top-left, the better, since that is the “you get a smart AI with no extra cost” position. The top-right is the “you gain smarts in proportion to what you invest in a GPU” zone. What’s worth noting is that the edge-LLM crowd clusters in the bottom-left, staying at scores of just 25–35 despite ¥0 cost. By contrast, the Mac M4 Pro + Ollama 14B is ¥0 cost with a score of 60, which is very good value for someone who already has a Mac.
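One way to read “value” off this data without drawing the plot is to keep only the setups that no other setup beats on both cost and score (a Pareto front). The sketch below uses the table’s numbers as-is; the scores are my subjective ratings, so the result is only as good as they are, and `pareto_front` is a helper name I chose for illustration.

```python
# (setup, extra cost in yen, practicality score) copied from the table above
SETUPS = [
    ("iPhone + Gemma 4 E2B", 0, 25),
    ("iPhone + Gemma 4 E4B", 0, 35),
    ("Mac M4 Pro + BitNet 2B", 0, 30),
    ("Mac M4 Pro + Ollama 14B", 0, 60),
    ("PC + RTX 3060 + Ollama 8B", 30_000, 45),
    ("PC + RTX 5060 Ti 16GB + Ollama 14B", 90_000, 65),
    ("PC + RTX 3090 + Ollama 32B", 70_000, 85),
    ("PC + RTX 5090 + Ollama 32B", 400_000, 95),
]


def pareto_front(setups):
    """Return setups that no other setup beats on both cost and score."""
    front = []
    for name, cost, score in setups:
        dominated = any(
            other_cost <= cost and other_score >= score
            and (other_cost < cost or other_score > score)
            for _, other_cost, other_score in setups
        )
        if not dominated:
            front.append((name, cost, score))
    return sorted(front, key=lambda row: row[1])


if __name__ == "__main__":
    for name, cost, score in pareto_front(SETUPS):
        print(f"{name:30s} cost {cost:>7,} yen  score {score}")
```

With these numbers, three setups survive: the Mac M4 Pro + Ollama 14B at ¥0, the used RTX 3090, and the RTX 5090. The phone setups drop out only because this table treats an existing Mac as free too.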

So, do you not need a GPU?

You still do.

Between edge LLMs’ 2B–4B models and the 14B–32B models that run on a GPU, there’s a clear wall in intelligence. Concretely, differences like these show up.

Use case	Edge LLM (2B–4B)	GPU setup (14B–32B)
Answering short questions	Practical	Comfortable
Proofreading text	OK for simple fixes	Context-aware corrections possible
Summarizing long text	Possible for short passages	Handles multi-page A4 documents too
Code generation	Simple snippets at most	Can generate function-level code
Complex reasoning / analysis	Tough	Reasonably practical at 14B and up
Multi-turn conversation	Forgets context after a few turns	Relatively stable even in long chats

Between a 2B model and a 32B model, there’s a 16× difference in parameter count. That’s simply like “the size of the brain” being 16× different, and it affects everything: knowledge, reasoning power, and grasp of linguistic nuance.

That said, depending on how you use it, there are situations where an edge LLM is enough.

Asking “just give me the gist of this English email” while you’re out
Checking text for typos in an offline environment
Showing a photo and asking “what’s the name of this plant?” (Gemma 4 Edge’s multimodal)
Quickly checking a simple translation

For these “quick questions” and “light tasks,” a phone’s edge AI can handle it practically.

My conclusion is “split the work.” When out, the phone’s edge AI; at home, Ollama on an RTX 3090. I think that’s the most sensible way to use it as of 2026. The era where GPUs become unnecessary is still off, but the era where “you can touch AI even without a GPU” has definitely arrived.

How to try it right now

iPhone / Android users (¥0, done in 5 minutes)

Install the Google AI Edge Gallery app from the App Store / Google Play
Open the app and pick Gemma 4 E2B from the model list
Once the model finishes downloading, type a question in the chat screen

That’s all there is to it. No PC knowledge required. You need Wi-Fi for the download, but once it’s on your device you can use it offline.

If you have an iPhone 15 Pro or later, try E4B (the 4B model) too. From what benchmark reports show, it seems a notch smarter than E2B.

Mac users (¥0, done in 10 minutes)

Install Ollama from ollama.com
Open a terminal and type ollama run gemma4
The model downloads and the chat begins

On a Mac with M1 or later and enough unified memory (roughly 16GB or more), Ollama runs even 14B-class models comfortably. Apple Silicon’s unified memory is actually quite well suited to local AI.
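Once Ollama is running, you can also talk to it from a script instead of the terminal chat. The sketch below calls Ollama’s local REST endpoint (`/api/generate` on port 11434, its default) using only the standard library. The model tag `gemma4` is the one used in the steps above, and the helper names `build_request` and `ask` are mine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt: str, model: str = "gemma4") -> urllib.request.Request:
    """Build a non-streaming generate request for a locally running Ollama."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "gemma4") -> str:
    """Send the prompt and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=120) as resp:
        return json.loads(resp.read())["response"]


# Example (requires Ollama to be running and the model already pulled):
# print(ask("Give me the gist of this email in one sentence: ..."))
```

Everything stays on your own machine, so the privacy advantage of local AI carries over to scripts as well.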

Windows PC users (a used GPU if value matters)

If you want to use local AI seriously, buying a used RTX 3060 12GB (¥20,000–30,000) and installing Ollama is the best value as of April 2026.

Buy an RTX 3060 12GB from a used shop (¥20,000–30,000)
Install it in your PC and install the driver
Install Ollama from ollama.com
Type ollama run qwen3:8b in a terminal

With 12GB of VRAM, 8B models run with room to spare, and some 14B models work too in quantized form. Getting a “practical local AI setup” in this price range is a benefit unique to used GPUs.

Want the details?
How to get started with local AI is covered by budget tier in “Running a local AI chatbot at home: a budget-by-budget guide.” If you’re unsure about picking a GPU, use that as a reference too.

Which phones can run it? Sorting out the minimum specs

“Will it run on the phone I have?” is a natural concern. Based on Google AI Edge Gallery’s requirements and the model sizes, I’ve organized the support situation by phone.

Android

Google AI Edge Gallery requirements: Android 12 or later. Practically, 6GB of RAM or more is the guideline

RAM	Runnable model	Experience	Example device (SIM-free)	Price range
4GB	Won’t run	The OS uses 3GB. The remaining 1GB is not enough	Moto G Play, Redmi, and other budget phones	¥10,000–20,000
6GB	Gemma 4 E2B (1.3GB)	Runs but slow. OK for simple questions	Pixel 7a, OPPO Reno 9A	¥20,000–40,000
8GB	Gemma 4 E4B (about 2.5GB)	Practical. Text processing, Q&A	Pixel 8a, Galaxy A55	¥30,000–50,000
12GB+	Gemma 4 E4B (about 2.5GB)	Comfortable. Handles long text too	Pixel 9, Galaxy S24	¥60,000+

How to read this table: If you have a phone with 8GB of RAM or more, E4B (the 4B model) runs at a practical speed. If you’re thinking of “a second phone for AI,” something like the Pixel 8a (about ¥50,000) is the minimum line. SIM-free versions are available on Amazon and Rakuten.

A 4GB phone effectively won’t run it. Even if you “want to try AI on a cheap phone,” aim for at least 6GB, ideally 8GB of RAM.

Note: Installed RAM and actually usable RAM are two different things. Even with 8GB installed, the OS and background apps use 3–4GB, so only about 4–5GB is available for the AI model. Try it with as many other apps closed as possible.
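The installed-versus-usable distinction can be turned into a quick check. In this sketch the model sizes are the INT4 figures quoted earlier in the article, the 3.5GB OS share is the midpoint of the 3–4GB mentioned above, and the 1GB of headroom for the runtime and conversation context is purely my assumption. `runnable_models` is a made-up helper name.

```python
# INT4 model sizes in GB, as quoted earlier in the article
MODELS_GB = {"Gemma 4 E2B": 1.3, "Gemma 4 E4B": 2.5}


def runnable_models(installed_ram_gb: float,
                    os_use_gb: float = 3.5,
                    headroom_gb: float = 1.0) -> list[str]:
    """List models whose weights plus headroom fit in the RAM left after the OS."""
    usable = installed_ram_gb - os_use_gb
    return [name for name, size in MODELS_GB.items()
            if size + headroom_gb <= usable]


if __name__ == "__main__":
    for ram in (4, 6, 8, 12):
        fits = runnable_models(ram) or ["nothing fits"]
        print(f"{ram:2d} GB RAM -> {', '.join(fits)}")
```

With these assumptions the result lines up with the Android table: nothing on 4GB, E2B only on 6GB, and E4B from 8GB up. Closing background apps effectively lowers `os_use_gb`, which is why it helps.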

iPhone

Model	Chip	RAM	Runnable model	Notes
iPhone 13 and earlier	A15 and earlier	4–6GB	E2B is borderline	Can’t call it comfortable
iPhone 14 Pro/Pro Max	A16	6GB	E2B (practical)	Core ML support
iPhone 15 Pro/Pro Max	A17 Pro	8GB	E4B (practical)	High-performance NPU
iPhone 16 Pro/Pro Max	A18 Pro	8GB	E4B (comfortable)	The most comfortable

For iPhone, iPhone 15 Pro and later is the practical line for E4B. E2B runs even on the iPhone 14 Pro.

Is “a second phone for AI” realistic?

Honestly, if the phone you’re using as your main is a model from within the last 2–3 years, try it on that first. There’s no need to go out and buy a second one.

Still, a “dedicated sub-device for AI” might suit people like these.

You don’t want to spend your main phone’s battery on AI (AI inference eats battery)
You want an offline-only AI device (travel, business trips, disaster prep)
Your main is an old phone (4GB RAM or less) and you’re considering a replacement

In that case, the Pixel 8a (SIM-free / about ¥50,000 / 8GB RAM) is probably the best-balanced option. Being Google’s own, it pairs well with AI Edge Gallery, and it gets long-term update support.

Summary

April 2026 looks likely to be a turning point for edge LLMs.

Gemma 4 Edge: AI runs by just installing a phone app. Multimodal, and well suited to a first local-AI experience
BitNet b1.58: a 2B model that fits into a small footprint via 1-bit quantization. Fast at 45 tok/s on the M2 CPU. A path toward a 100B model is in view
LiteRT-LM: an inference framework for app developers. Groundwork for a rapid rise in AI-featured phone apps to come

Phone AI is shifting from “toy” to “practical tool.” Judging by user reports, for short questions, simple translation, and offline text checking, a phone alone already seems to suffice.

But for serious, heavy use, a GPU setup is still stronger. The quality and reasoning power of 14B–32B models are on another level from 2B–4B models. “Phone AI when out, GPU AI at home” is, in my view, the optimal answer as of 2026.

Watching the pace at which edge LLMs are evolving, an era where phones run 10B-and-up models may arrive in 2–3 years. When it does, how will the place of a GPU setup change? I’ll keep following it.