I Want to Generate AI Images: Where’s the Sweet Spot for Value?

July 5, 2026

I’ve installed ComfyUI, an app that makes AI image generation easy to run on a PC, and I’ve been having fun making AI images: blog thumbnails, social-media assets, quick visualizations of an idea. Cloud services are fine too, but the appeal of local AI is that a quick, casual generation is always within reach.

In this article I’ve broken down, by budget, “what kind of images you can make, and how fast.”

* The specs and prices in this article are as of May 2026.

- 0.1. Measured on the author’s machine (RTX 3090 24GB)
1. The upside of generating images at home
2. The tool we’ll use: ComfyUI
3. By budget: what your GPU can make
4. Value graph by GPU
- 4.1. AI image generation performance vs price
- 4.2. What the graph tells us
5. How to choose
6. Choose by “what you want to make”
7. Summary: image generation has a low barrier to “just trying it”
- 7.1. GPUs mentioned in this article

Measured on the author’s machine (RTX 3090 24GB)

Model	Resolution	Steps	Generation time
SD 1.5	512×512	20	8.0s
SDXL (Animagine)	1024×1024	20	26.0s

Test setup: RTX 3090 (24GB) / ComfyUI / Linux / measured May 2026
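
To put those two measurements in perspective, here is a minimal Python sketch that converts them into seconds per step and images per hour. The function names are made up for illustration, and it assumes the GPU generates back to back with no model loading or idle time, so read the hourly figures as an upper bound rather than a benchmark.

```python
def seconds_per_step(total_seconds: float, steps: int) -> float:
    """Average time spent on one sampling step."""
    return total_seconds / steps


def images_per_hour(seconds_per_image: float) -> float:
    """Upper bound on hourly output, assuming nonstop generation."""
    return 3600.0 / seconds_per_image


# (generation time in seconds, steps) taken from the table above
measurements = {
    "SD 1.5 512x512": (8.0, 20),
    "SDXL 1024x1024": (26.0, 20),
}

for name, (secs, steps) in measurements.items():
    print(
        f"{name}: {seconds_per_step(secs, steps):.2f} s/step, "
        f"about {images_per_hour(secs):.0f} images/hour"
    )
```

On these numbers SD 1.5 works out to 0.40 s/step and SDXL to 1.30 s/step.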

The upside of generating images at home

Cloud (Midjourney, etc.)	Local (ComfyUI, etc.)
Around ¥1,500–6,000/month (depending on plan)	Upfront cost only
Limits on how many images you can make	Unlimited
The service decides which models you get	Use any model or LoRA you like
Your prompts are sent to a server	Fully local
Commercial use may be restricted	Free to use, depending on the model’s license
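
One way to read the cost row is as a break-even point. The sketch below is a rough illustration only: it assumes a ¥100k GPU as the example purchase, uses the ¥1,500–6,000/month range from the table, and ignores electricity and the rest of the PC. The function name is hypothetical.

```python
def break_even_months(gpu_price_yen: float, monthly_fee_yen: float) -> float:
    """Months of subscription fees that add up to the GPU's price."""
    return gpu_price_yen / monthly_fee_yen


GPU_PRICE_YEN = 100_000  # assumed example purchase

for fee in (1_500, 6_000):
    months = break_even_months(GPU_PRICE_YEN, fee)
    print(f"Yen {fee}/month plan: break-even after about {months:.0f} months")
```

So on a cheap plan the GPU takes years to pay for itself, while on an expensive plan it takes well under two years.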

The tool we’ll use: ComfyUI

As of 2026, the most widely used tool for local image generation is ComfyUI.

A node-based workflow, so you can see the flow of processing visually
Supports the major models — Stable Diffusion, FLUX, SDXL, and more
A rich ecosystem of extensions: ControlNet, LoRA, upscalers, and so on
An NVIDIA GPU + CUDA is the most stable (AMD ROCm is partially supported)

To install, just download and unpack it from the official ComfyUI site. No Python knowledge required.

Deep dive: why VRAM is the key to image generation (how Latent Diffusion works)

The Stable Diffusion and FLUX models ComfyUI uses are based on a technique called the Latent Diffusion Model (LDM).

Processing a 512×512 image pixel by pixel would mean computing over roughly 260,000 pixels, but an LDM first compresses the image into a 64×64 latent space with a VAE (Variational Autoencoder) and denoises there. That is 4,096 positions instead of about 262,000, so the grid being processed is about 1/64 the size of pixel space. That’s why it runs at practical speeds even on a local GPU.

The flow looks like this.
1. Vectorize the text with a CLIP model → 2. Repeatedly denoise in latent space (U-Net / DiT) → 3. Restore a pixel image from the latent space with the VAE decoder

The step that puts the most load on VRAM is the denoising in step 2. As you raise the resolution, the latent-space size grows in proportion to the pixel count, so at 1024×1024 (SDXL’s standard), which has 4x the pixels of 512×512, the VRAM used around the latent space is about 4x as well.
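
The 1/64 and 4x figures can be checked with a few lines of Python. This is just arithmetic on grid sizes, assuming the VAE shrinks each side by a factor of 8 (which is what 512 to 64 implies); the helper names are made up for this example.

```python
VAE_DOWNSCALE = 8  # assumed: 512 px per side becomes 64 latent cells per side


def latent_size(width: int, height: int, factor: int = VAE_DOWNSCALE) -> tuple[int, int]:
    """Latent grid size for a given image resolution."""
    return width // factor, height // factor


def positions(width: int, height: int) -> int:
    """Number of spatial positions in a grid."""
    return width * height


lw, lh = latent_size(512, 512)
print(f"512x512 image -> {lw}x{lh} latent")
print("pixel positions / latent positions =", positions(512, 512) // positions(lw, lh))

big = positions(*latent_size(1024, 1024))
small = positions(*latent_size(512, 512))
print("1024x1024 latent vs 512x512 latent =", big // small, "x")
```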

By budget: what your GPU can make

¥60–70k range (RTX 5060 8GB / RTX 5060 Ti 8GB)

[kimono_product id="15770"]

What 8GB can do:

Model	Can it run?	Rough time per image	Quality
FLUX.1 Schnell (FP8)	◎	10–20s	High. Good at rendering text, too
SD 1.5	◎	3–8s	The staple. Tons of LoRAs
SDXL	△	30–60s	Runs, but slow. Combining with LoRA is rough
FLUX.1 Dev	×	Not enough VRAM	—

What you can do:

Easily generate high-quality images with FLUX Schnell
Freely tweak styles — anime, photorealistic, and more — with SD 1.5 + LoRA
Create blog thumbnail images
Mass-produce images for social posts

What you can’t do:

Complex SDXL workflows (ControlNet + LoRA at the same time)
High-quality FLUX Dev generation
Direct high-resolution (2K+) generation

Deep dive: what is FP8 quantization?

An AI model’s “weights” are traditionally stored in FP32 (32-bit floating point). Convert them to FP16 (16-bit) and VRAM use is halved; FP8 (8-bit) brings it down to a quarter of the FP32 size.

Model-size example for FLUX.1 (12 billion parameters):
BF16 (standard distribution): ~24GB → FP8: ~12GB → GGUF 4-bit: ~6–7GB

FP8 loses some precision, but in image generation the difference usually falls within a range the human eye can’t distinguish. FLUX Schnell can run on just 8GB of VRAM thanks to this quantization plus ComfyUI’s automatic offloading.
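
The model sizes above follow from one multiplication: parameters times bits per weight. Here is a small sketch of that arithmetic for a 12-billion-parameter model. It counts weights only, and the function name is made up for this example; real 4-bit files tend to come out a little larger than the raw figure because they also store scaling data, which fits the ~6–7GB range quoted above.

```python
BITS_PER_WEIGHT = {"FP32": 32, "BF16": 16, "FP8": 8, "4-bit": 4}


def weight_size_gb(num_params: float, bits: int) -> float:
    """Raw size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


FLUX_PARAMS = 12e9  # 12 billion parameters, as stated above

for fmt, bits in BITS_PER_WEIGHT.items():
    print(f"{fmt:5s}: {weight_size_gb(FLUX_PARAMS, bits):.0f} GB")
```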

FLUX Schnell running on 8GB is revolutionary. If you just want to try AI image generation, it’s plenty. But it’s not enough to unlock SDXL’s full potential.

Images per ¥10k (based on FLUX Schnell): Unlimited (upfront cost only, so cost-efficiency improves the more you use it)

Value: ★★★☆☆ (fine as a taster)

¥100k range (RTX 5070 12GB)

What 12GB can do:

Model	Can it run?	Rough time per image	Quality
FLUX.1 Schnell (FP8)	◎	5–10s	Fast
SD 1.5	◎	2–5s	Comfortable
SDXL	◎	10–20s	Comfortable. Can combine LoRA too
SDXL + ControlNet	○	15–30s	Lets you specify composition
FLUX.1 Dev	△	Runs, but barely	FP8 required

What you can do:

SDXL runs comfortably → you can reliably produce high-quality images
Generate images with a specified composition or pose using ControlNet
Finely control style with LoRA
Batch processing (continuous generation) is practical too

Deep dive: how LoRA works (why a tiny file can change the style)

The SDXL base model has about 3.5 billion parameters (roughly a 7GB file). When teaching it a new art style or character, retraining every parameter is impractical.

LoRA (Low-Rank Adaptation) is a technique that adds only a “low-rank difference matrix” to the model’s weight matrices. Without directly touching the huge original matrix, you can change the style with a small adapter of a few million to a few tens of millions of parameters (about 1% of the original or less).

A LoRA file is usually around 10–200MB. On VRAM it only adds a few hundred MB on top of the base model, so with 12GB you can use an SDXL base model + multiple LoRAs at once.
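
To see why the adapter is so small, compare the parameter count of one full weight matrix with its low-rank replacement. In the sketch below, a d_out × d_in matrix is adapted by two thin matrices of rank r, which costs r × (d_in + d_out) parameters. The layer size (1280) and rank (16) are illustrative assumptions, not values taken from any particular LoRA file.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters in one dense weight matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)


D_OUT, D_IN, RANK = 1280, 1280, 16  # assumed example layer and rank

full = full_params(D_OUT, D_IN)
lora = lora_params(D_OUT, D_IN, RANK)
print(f"full matrix : {full:,} parameters")
print(f"LoRA adapter: {lora:,} parameters ({lora / full:.1%} of the layer)")
```

A LoRA usually touches only some of the model’s layers, so the share of the whole model ends up smaller than this per-layer figure.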

The sweet spot for image generation. 12GB is the minimum line where SDXL runs comfortably. From here you start to get the feeling of “I can make what I want to make.”

Value: ★★★★☆ (the best balance if image generation is your main use)

¥90–100k range, 16GB (RTX 5060 Ti 16GB / RX 9070)

[kimono_product id="15760"]

What 16GB can do:

Model	Can it run?	Rough time per image	Quality
SDXL + ControlNet + LoRA	◎	15–25s	Complex workflows OK
FLUX.1 Dev	○	30–60s	Runs. Top-class quality
SD 3.5	◎	15–25s	A new-generation model
High-res upscaling	◎	+10–30s	Up to 2K–4K

Deep dive: how to estimate VRAM use

VRAM use during image generation can be roughly estimated as follows.

The model itself (for FP16: number of parameters x 2 bytes)
+ latent-space buffers (proportional to resolution)
+ additional modules like LoRA / ControlNet
+ the peak during VAE decode

Worked example — generating 1024×1024 with SDXL:
U-Net itself: ~5.1GB (FP16) + CLIP: ~1.3GB + VAE: ~0.3GB + latent-space buffer: ~2GB
= about 8.7GB total (minimal configuration, no LoRA or ControlNet)

Adding ControlNet costs another 1.5–2.5GB, and each LoRA adds 0.1–0.3GB. At 12GB, one ControlNet is the limit, but at 16GB you get room to use ControlNet + multiple LoRAs at the same time.
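
The estimate above can be written as a tiny calculator. This is a back-of-the-envelope sketch, not a measurement: the base figures are the SDXL numbers from the worked example, and the per-module costs are the midpoints of the ranges quoted above (2.0GB per ControlNet, 0.2GB per LoRA). The function and constant names are made up for this example.

```python
# SDXL at 1024x1024, figures from the worked example above (GB)
SDXL_BASE_GB = {
    "unet_fp16": 5.1,
    "clip": 1.3,
    "vae": 0.3,
    "latent_buffers": 2.0,
}


def estimate_vram_gb(
    controlnets: int = 0,
    loras: int = 0,
    controlnet_gb: float = 2.0,  # assumed midpoint of 1.5-2.5GB
    lora_gb: float = 0.2,        # assumed midpoint of 0.1-0.3GB
) -> float:
    """Rough VRAM estimate: base model plus add-on modules."""
    return sum(SDXL_BASE_GB.values()) + controlnets * controlnet_gb + loras * lora_gb


for cn, lo in [(0, 0), (1, 0), (1, 3), (2, 3)]:
    print(f"{cn} ControlNet, {lo} LoRA: about {estimate_vram_gb(cn, lo):.1f} GB")
```

With these midpoints, one ControlNet plus three LoRAs lands a little over 11GB, which is why 12GB feels tight and 16GB feels roomy.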

RTX 5060 Ti 16GB vs RTX 5070 12GB:

Comparison	RTX 5060 Ti 16GB (¥90k)	RTX 5070 12GB (¥100k)
VRAM	16GB	12GB
SDXL speed	A bit slow (128-bit bus)	Fast
FLUX Dev	Runs	Barely
Complex workflows	Comfortable	Barely

More VRAM headroom, or more speed? If you want to try lots of models or build complex workflows, go 16GB of VRAM; if you want to simply generate fast and in bulk, 12GB suits you better.

A caution about the AMD RX 9070 (16GB / ~¥100k)
Its price per GB of VRAM is the cheapest, but its compatibility with ComfyUI falls well short of NVIDIA’s. On Windows it can be unstable in places, and some custom nodes won’t work. For image generation, NVIDIA is recommended.

Value: ★★★★★ (the cheapest tier per GB of VRAM)

¥160k range (RTX 5070 Ti 16GB)

[kimono_product id="15762"]

Same VRAM as the 5060 Ti 16GB, but the higher GPU performance makes generation 1.5–2x faster.

Comparison	RTX 5060 Ti 16GB	RTX 5070 Ti 16GB
SDXL, one image	15–25s	8–15s
FLUX Dev, one image	30–60s	20–35s

For people who generate in bulk, or who iterate on workflows a lot, the speed difference starts to matter. It’s also strong for doubling up with VR or a local LLM.

Value: ★★★★☆ (ideal if you’re planning to double up on uses)

¥180k and up (RX 7900 XTX 24GB / RTX 5090 32GB)

What 24GB and above can do:

Model	24GB	32GB
FLUX.1 Dev	◎ Comfortable	◎ With room to spare
SDXL complex workflows	◎	◎
Video generation (Wan 2.1, etc.)	△ Offloading needed	○
Ultra-high resolution (4K+)	◎	◎

Video generation is still tough on a consumer GPU, but with 24GB you reach a state where there’s almost nothing you can’t do.

Value graph by GPU

AI image generation performance vs price

How to read this graph: The horizontal axis is price (in ¥10k), the vertical axis is an overall AI-image-generation performance score. The closer to the top-left, the better the value. Point size represents VRAM capacity.

GPU	Price (¥10k)	Image-gen score	VRAM	Notes
RTX 5060 Ti 8GB	7	35	8GB
RTX 5060	6	30	8GB
RX 9070	8	40	16GB	* AMD = iffy ComfyUI compatibility
RTX 5060 Ti 16GB	9	55	16GB
RTX 5070	10	65	12GB	★ The image-generation sweet spot
RTX 5070 Ti	16	80	16GB
RX 9070 XT	9	45	16GB	* AMD
RX 7900 XTX	18	85	24GB	* Linux recommended
RTX 5080	20	90	16GB
RTX 5090	40	98	32GB

* How the image-generation score is calculated:

SDXL generation speed: 40%
Range of supported models (VRAM-dependent): 35%
Capacity for complex workflows: 25%
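
As a sketch of how a weighted score like this is put together: each sub-score is multiplied by its weight and the results are summed. The weights below are the ones listed above; the sub-scores in the example are invented purely to show the arithmetic and are not the inputs used for the table.

```python
WEIGHTS = {
    "sdxl_speed": 0.40,
    "model_range": 0.35,
    "complex_workflows": 0.25,
}


def image_gen_score(sub_scores: dict[str, float]) -> float:
    """Weighted sum of sub-scores, each on a 0-100 scale."""
    return sum(WEIGHTS[key] * sub_scores[key] for key in WEIGHTS)


# Hypothetical sub-scores, for illustration only
example = {"sdxl_speed": 80, "model_range": 50, "complex_workflows": 60}
print(f"overall score: {image_gen_score(example):.1f}")
```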

What the graph tells us

The RTX 5070 (¥100k) is the value king for image generation. SDXL runs comfortably at 12GB, and the speed is plenty
The RTX 5060 Ti 16GB (¥90k) is for the VRAM-first crowd. It reaches FLUX Dev, but it’s slower than the RTX 5070
AMD (the RX 9070 line) looks like good value for its score, but the score is discounted by ComfyUI compatibility issues. On Linux it’s effectively a bit higher
The RTX 5080 and up are for “mass production.” There’s no difference in the quality of a single image, but the speed gap kicks in when generating in bulk
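
One simple way to turn “closer to the top-left” into a number is score divided by price. The sketch below does that with the prices and scores copied from the table above. Treating value as a plain ratio is an assumption made for this example; it ignores the compatibility caveats in the Notes column.

```python
# (price in units of 10k yen, image-gen score), copied from the table above
GPUS = {
    "RTX 5060": (6, 30),
    "RTX 5060 Ti 8GB": (7, 35),
    "RX 9070": (8, 40),
    "RTX 5060 Ti 16GB": (9, 55),
    "RX 9070 XT": (9, 45),
    "RTX 5070": (10, 65),
    "RTX 5070 Ti": (16, 80),
    "RX 7900 XTX": (18, 85),
    "RTX 5080": (20, 90),
    "RTX 5090": (40, 98),
}


def score_per_10k_yen(price: float, score: float) -> float:
    """Score points bought per 10k yen spent."""
    return score / price


ranked = sorted(GPUS, key=lambda name: score_per_10k_yen(*GPUS[name]), reverse=True)

for name in ranked:
    price, score = GPUS[name]
    print(f"{name:18s} {score_per_10k_yen(price, score):.2f}")
```

By this measure the RTX 5070 comes out on top and the RTX 5060 Ti 16GB second, which matches the reading of the graph above.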

How to choose

Case 1: For a hobby (tens to hundreds of images a month)

→ RTX 5070 (12GB / ¥100k)

SDXL is comfortable and FLUX Schnell is fast too. You can use LoRA and ControlNet. You’ll fully enjoy the “image generation is fun" side of it. At a few hundred images a month, generation speed won’t be a bottleneck.

Case 2: Practical use for a blog or social media (tens of images a week)

→ RTX 5060 Ti 16GB (¥90k) or RTX 5070 (¥100k)

It comes down to how you view the ¥10k difference. If you want to try lots of models and play with FLUX Dev, go 5060 Ti 16GB. If speed matters and SDXL is your main, go 5070. Either is a good call.

Case 3: Commercial use, high-volume generation (hundreds a day and up)

→ RTX 5070 Ti (16GB / ¥160k)

16GB of VRAM + a fast GPU. Even building complex workflows and running batch jobs, you have headroom. Generation speed is 1.5–2x the 5060 Ti, so at high volume you recoup the price difference.

Case 4: Wanting to dabble in AI video generation too

→ RX 7900 XTX (24GB / ¥180k) * Linux recommended
→ If you can wait, hold out for the rumored RTX 5080 Ti (24GB?)

Video generation lives and dies by VRAM. At 16GB, offloading is mandatory and it’s not practical. 24GB is the minimum line.

Case 5: Wanting to do a local LLM (Ollama) too

Hands-on: I run image generation and a local LLM at the same time on a dual-GPU setup — an RTX 3090 (24GB) and an RTX 3060 (12GB).

→ RTX 5060 Ti 16GB (¥90k)

16GB pays off in both image generation and LLMs. “Covering two uses for ¥90k” is unbeatable value.

Choose by “what you want to make”

What you want to do	VRAM needed	Recommended GPU	Budget
Blog thumbnails	8GB	RTX 5060 Ti 8GB	¥70k
Images for social posts	8–12GB	RTX 5070	¥100k
Style control with LoRA	12GB+	RTX 5070	¥100k
Composition control with ControlNet	12–16GB	RTX 5070 / 5060 Ti 16GB	¥90–100k
FLUX Dev’s top quality	16GB+	RTX 5070 Ti	¥160k
Commercial illustration work	16GB+	RTX 5070 Ti	¥160k
AI video generation	24GB+	RX 7900 XTX	¥180k

Deep dive: why ControlNet is heavy (the cost of “conditional generation”)

ControlNet extracts a “feature map” from a pose image or depth map and injects it into the U-Net denoising process. Because it runs an additional network that duplicates the encoder part of the base model’s U-Net, VRAM use goes up substantially (about +1.5–2.5GB with SDXL).

If memory is tight in ComfyUI, the following help.
1. Tiled VAE Decode: split the image into 512×512 tiles for decoding (drastically cuts the VRAM peak)
2. Use an FP8-quantized model: ControlNet itself is also available in FP8
3. The --lowvram option: process in stages, trading speed for lower VRAM use
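
To make the tiling idea in item 1 concrete, here is a minimal sketch of how an image area can be cut into 512×512 boxes that are then processed one at a time. It only shows the splitting step. It is not ComfyUI’s implementation, and real tiled decoders typically overlap and blend the tiles to hide seams, which this example leaves out. The function name is made up.

```python
def tile_boxes(width: int, height: int, tile: int = 512) -> list[tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) boxes that cover the image without overlap."""
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            right = min(left + tile, width)
            bottom = min(top + tile, height)
            boxes.append((left, top, right, bottom))
    return boxes


for size in (1024, 2048):
    boxes = tile_boxes(size, size)
    print(f"{size}x{size}: {len(boxes)} tiles, first box {boxes[0]}")
```

Because only one tile is decoded at a time, the peak memory depends on the tile size rather than on the full image size.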

[kimono_heatmap title="AI image generation support by GPU" note="As of May 2026. ◎=Comfortable ○=Works △=Limited ×=Not enough VRAM"]
VRAM|FLUX Schnell|SDXL|FLUX Dev|Video
8GB|◎ FP8|△ Slow|×|×
12GB|◎|◎|△ FP8 required|×
16GB|◎|◎|○|△
24GB|◎|◎|◎|○
[/kimono_heatmap]

Summary: image generation has a low barrier to “just trying it”

Of all the corners of local AI, AI image generation is the “most visually fun” genre. Type in some text and a picture appears in a few seconds to a few tens of seconds. Once you experience it, you’re hooked.

In 2026, with FLUX Schnell running even on an 8GB GPU, the barrier to entry is lower than it has ever been.

And an image you generate can be turned into a 3D model to view in VR, or physically printed on a 3D printer — as a first step connecting the virtual and the real, AI image generation is just the right entry point.

The specs and prices in this article are as of May 2026. Generation times vary by model, settings, and resolution.

GPUs mentioned in this article

[kimono_product id="15760"]

[kimono_product id="15762"]

[kimono_product id="15761"]
