ComfyUI for Beginners: From Install to Your First Image (2026)

July 5, 2026

I’ve put a GPU in my PC and generate AI images locally. Blog thumbnails, social-media assets, visualizing an idea — the uses vary.

ComfyUI used to require setting up an environment by hand: “install Python, clone with Git, pip install, match the CUDA version…” That was a fairly high hurdle for anyone without programming experience.

That changed completely when the Desktop edition was released in October 2024. Now you just download and run an installer — no programming knowledge required at all. In this article I’ll walk through the Desktop edition, from install to generating your first image.

Right now, with the Desktop edition you can get started in just five steps.

What is ComfyUI?

ComfyUI is a node-based AI image generation tool.

It acts as a frontend for running text-to-image AI models such as Stable Diffusion, FLUX, and SD3 on your local PC. It’s open source and free to use. Commercial use of ComfyUI itself costs nothing, but whether you may use the generated images commercially depends on the license of the model you use.

Its defining feature is that you connect blocks called “nodes” with lines to build up the flow of processing. It looks complex at first glance, but this design makes customizations such as “I want to swap the model,” “I want to fix just part of the image,” or “I want to add upscaling” easy to do.

A similar tool is Automatic1111 (WebUI); here’s how they differ.

Comparison	ComfyUI	Automatic1111 (WebUI)
UI style	Node-based (connect with lines)	Form-input style
Customizability	Very high	Moderate
Beginner-friendliness	A bit hard to grasp	Intuitive
Speed of support for new models	Fast (supports FLUX and others quickly)	A bit slow
Memory efficiency	Good	Average
Development status in 2026	Active (Desktop edition released)	Updated less often than ComfyUI

Latent Diffusion — the principle behind ComfyUI’s image generation

The Stable Diffusion and FLUX models used in ComfyUI are built on a mechanism called the Latent Diffusion Model. Processing a normal image directly (say, 1024×1024 pixels, about 1 million pixels) would take an enormous amount of computation, so Latent Diffusion first compresses the image into a “latent space” of about 128×128 and works there. That is roughly 1/64 as many positions as in pixel space, which is why it can generate images at practical speeds even on a local GPU.
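The 1/64 figure can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the compression shrinks each side by a factor of 8, which is what the 1024 → 128 example implies; the function names are made up for illustration.

```python
# Assumption: each side is shrunk by a factor of 8 (1024 -> 128),
# which matches the 128x128 figure quoted in the text.
DOWNSCALE = 8

def latent_side(pixels: int, downscale: int = DOWNSCALE) -> int:
    """Side length of the latent grid for a given image side length."""
    return pixels // downscale

def spatial_ratio(width: int, height: int, downscale: int = DOWNSCALE) -> float:
    """How many times fewer spatial positions the latent has than the image."""
    pixel_positions = width * height
    latent_positions = latent_side(width, downscale) * latent_side(height, downscale)
    return pixel_positions / latent_positions

print(latent_side(1024))          # 128
print(1024 * 1024)                # 1048576 pixels, "about 1 million"
print(spatial_ratio(1024, 1024))  # 64.0
```

This counts positions only. The number of channels stored per position differs between pixel space and latent space, so treat 1/64 as a rough guide rather than an exact compute ratio.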

The flow is “random noise → gradually remove noise → an image emerges.” The number of denoising repetitions corresponds to the KSampler’s “Steps” parameter.
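To get a feel for what “Steps” means, here is a toy loop that starts from random numbers and refines them a fixed number of times. It is only an illustration of iterative refinement, not a diffusion sampler: a real sampler uses a neural network to predict the noise at each step, whereas this sketch cheats by already knowing the target. All names are hypothetical.

```python
import random

def toy_denoise(target, steps, seed=0):
    """Toy stand-in for a sampler: start from noise, refine `steps` times.

    A real sampler asks a neural network to predict the noise at each step;
    here the "prediction" is simply the known target, for illustration only.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]      # pure random noise
    history = [list(x)]
    for step in range(steps):
        remaining = steps - step                   # steps left, including this one
        # Remove an equal share of the remaining "noise" on each step.
        x = [xi + (ti - xi) / remaining for xi, ti in zip(x, target)]
        history.append(list(x))
    return x, history

target = [0.2, 0.8, 0.5]
result, history = toy_denoise(target, steps=20)
print(len(history) - 1)   # number of refinement steps performed
```

Each pass through the loop corresponds to one step, so doubling `steps` doubles the work in this toy, just as it does for the real KSampler.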

As of 2026, my honest take is that if you’re starting fresh, ComfyUI is a strong choice. Support for new models is fast and development is active. With the arrival of the Desktop edition, the beginner’s wall is nearly gone too.

Installation

Until 2024, you needed manual Python / Git / CUDA setup, and someone with no programming experience could get stuck at the environment-building stage. With the Desktop edition, you just run an installer. It supports Windows / macOS, and it takes about 10–30 minutes including the model download.

Desktop edition (Windows / Mac)

1. Go to comfy.org and download the Desktop edition from “Download.”
2. Run the installer. No special settings are needed; basically just keep clicking “Next.”
3. When ComfyUI launches, a template-selection screen appears.
4. Select the “Text to Image (FLUX)” template. The recommended model (FLUX Schnell, etc.) starts downloading automatically.
5. Once the download finishes, the prompt input field becomes usable.

The model download size varies by the model you choose. For FLUX Schnell it’s about 23GB (FP16). On a fiber connection, roughly 10–15 minutes.
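If you want to estimate the wait for your own connection, the arithmetic is simple. This sketch assumes decimal gigabytes and a sustained speed in megabits per second; the speeds in the loop are hypothetical examples, not measurements.

```python
def download_minutes(size_gb: float, speed_mbps: float) -> float:
    """Minutes to download size_gb (decimal gigabytes) at speed_mbps (megabits/s)."""
    size_megabits = size_gb * 1000 * 8      # GB -> MB -> megabits
    return size_megabits / speed_mbps / 60

for mbps in (100, 300, 1000):   # hypothetical sustained speeds
    print(f"{mbps} Mbps: {download_minutes(23, mbps):.1f} min")
```

By this estimate, a 23GB download at a sustained 300 Mbps takes a little over 10 minutes, which is in line with the 10–15 minute figure above. Real-world speeds are usually lower than the advertised line speed.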

Note: The Desktop edition bundles a Python environment and CUDA runtime in addition to ComfyUI itself. So even if you already have a Python environment, you don’t need to worry about conflicts.

Linux (the traditional method)

For Linux users, here’s the traditional method as well.

# Clone the repository
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

# Install dependencies (venv recommended)
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Launch
python main.py

Once it launches, you can reach it in the browser at http://127.0.0.1:8188. Place model files manually in the models/checkpoints/ folder.
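Before launching, it can help to confirm that a model file really is where ComfyUI looks for it. The following is a small sketch, assuming the repository was cloned into a folder named ComfyUI as in the commands above; the list of file extensions is my assumption about common checkpoint formats.

```python
from pathlib import Path

# Extensions commonly used for checkpoint files (an assumption of this sketch).
MODEL_EXTENSIONS = {".safetensors", ".ckpt"}

def list_checkpoints(comfy_root):
    """Return the model file names found in <comfy_root>/models/checkpoints."""
    folder = Path(comfy_root) / "models" / "checkpoints"
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.iterdir()
                  if p.is_file() and p.suffix.lower() in MODEL_EXTENSIONS)

if __name__ == "__main__":
    found = list_checkpoints("ComfyUI")   # path of the cloned repository
    print(found or "No checkpoints found in models/checkpoints/")
```

An empty result here is the usual reason behind a “Checkpoint not found” message.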

Generating your first image

Once the Desktop edition is installed, let’s generate an image right away.

Steps

1. Launch ComfyUI.
2. Select “Text to Image (FLUX)” from the templates (it appears automatically on first launch; after that, you can pick it from the menu).
3. Enter English text in the CLIP Text Encode node (the prompt input field).
4. Press the “Queue Prompt” button (or Ctrl+Enter).
5. Within a few seconds to a few tens of seconds, an image is generated and shown in the Save Image node.

Example prompts

Start with something simple.

a cat sitting on a wooden desk, soft lighting, photorealistic, 4k

a Japanese garden in autumn, golden leaves, pond reflection, cinematic

a futuristic city at night, neon lights, rain, cyberpunk style

Write prompts in English as a rule. Some models support other languages, but English tends to give more stable results that match your intent.

Rough generation times

The first generation takes longer because the model has to load. From the second time on it’s fast, since the model stays in memory.

Model	Resolution	First generation	Second time on	Notes
FLUX Schnell	512×512	15–20s	3–5s	Desktop edition default
FLUX Schnell	1024×1024	25–35s	8–12s	A practical resolution
SDXL	1024×1024	20–30s	10–15s	At 20 steps
SD 1.5	512×512	10–15s	2–4s	The lightest

* Generation times are rough figures for an RTX 3090 (24GB VRAM) at 20 steps. They vary greatly by GPU.

Model comparison: which one to use

ComfyUI lets you switch between several AI models. Here’s a comparison of the major models as of April 2026.

Model	File size	VRAM needed	Time (1024×1024)	Quality	Characteristics
FLUX Schnell	~23GB (FP16)	8GB+	8–12s	◎	Fast and high-quality. Desktop edition default. Start here
FLUX Dev	~23GB	12GB+	25–35s	◎◎	Top-class quality. Good at rendering text. But heavy
SDXL	~6.5GB	6GB+	10–15s	○	Lightweight with a rich set of custom models (LoRA, etc.)
SD 1.5	~4GB	4GB+	3–5s	△	Old but the lightest. Runs even on low VRAM

* Generation times are for an RTX 3090 at 20 steps (as of April 2026)

FLUX Schnell is plenty to start with. The Desktop edition’s template selects it automatically, so there’s no need to change it. Try SDXL once you reach the stage of “I want to generate in a specific style” or “I want to use LoRA.”

Basic node concepts

When you open the ComfyUI screen, you see a diagram of square blocks (nodes) connected by lines. It looks complex at first, but the basics are just five nodes.

The five basic nodes

1. Checkpoint Loader (load the model)
The node that loads the AI model file. It’s the starting point that decides “which model draws the picture.”

2. CLIP Text Encode (prompt conversion)
Converts the text you enter (the prompt) into numerical data the AI model can understand. You use two of them: one for the positive prompt (what you want to draw) and one for the negative prompt (what you don’t want to draw).

3. KSampler (the heart of image generation)
The “drawing engine,” so to speak, that gradually generates the image from a lump of noise. It has several settings, but these are the first three to learn.

Key KSampler parameters

Steps: The number of times denoising is repeated. The standard is 20; more steps add fineness, but past 30 the quality gains diminish while compute time keeps rising linearly. In practice, a range of 15–25 is plenty.

CFG Scale (Classifier-Free Guidance): A value that controls fidelity to the prompt. For SDXL, around 7.0 is standard. Values closer to 1.0 weaken the prompt’s influence and leave more to “the AI’s free interpretation”; above 15.0 the prompt pulls too hard, causing oversaturated colors or broken outlines. FLUX models have the guidance mechanism built in, so it’s common to leave CFG fixed at 1.0.

Sampler: The denoising calculation method. Options include “euler,” “dpmpp_2m,” and “dpmpp_sde,” each with a different balance of speed and quality. If unsure, “euler” (fast and stable) or “dpmpp_2m” (higher quality, a bit slower) will do fine.

4. VAE Decode (convert to an image)
What KSampler generates is “latent space” data that humans can’t view directly. VAE Decode converts it into an image we can see.

5. Save Image (save the image)
The node that saves and displays the finished image.
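The KSampler starting values described above can be written down as a small helper. This is a sketch using only the numbers from this article; applying 7.0 to every non-FLUX model is my simplification (the text gives that value for SDXL), and the function names are hypothetical.

```python
# Starting values taken from the parameter notes in this article; treat them
# as defaults to adjust, not as official presets.
def ksampler_settings(model_family: str, steps: int = 20) -> dict:
    """Return a starting set of KSampler values for a model family."""
    family = model_family.lower()
    cfg = 1.0 if family.startswith("flux") else 7.0   # FLUX: left fixed at 1.0
    return {"steps": steps, "cfg": cfg, "sampler_name": "euler"}

def relative_time(steps: int, baseline_steps: int = 20) -> float:
    """Compute time grows linearly with steps, so this is just the ratio."""
    return steps / baseline_steps

print(ksampler_settings("SDXL"))
print(ksampler_settings("FLUX Schnell"))
print(relative_time(30))   # 1.5, i.e. 1.5x the time of a 20-step run
```

The second function makes the trade-off concrete: going from 20 to 30 steps costs 50% more time, for gains that are already tapering off.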

The data flow

ComfyUI basic workflow

Checkpoint Loader [model] → KSampler [latent image] → VAE Decode [image] → Save Image

CLIP Text Encode (positive / negative) → connects to KSampler

The lines between nodes are color-coded.

Line color	Data it carries	Meaning
Purple	MODEL	The AI model itself
Yellow	CLIP	The text-processing part
Orange	CONDITIONING	The prompt’s information
Pink	LATENT	Latent-space image data
Blue	IMAGE	An image humans can see
Red	VAE	The image-conversion engine

As long as you use a template, you don’t need to touch this wiring yourself. Still, understanding “so this is how it works” makes it easier to guess the cause when an error occurs.
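The same wiring can also be written as data. ComfyUI accepts a workflow as JSON over HTTP, and the sketch below builds the basic graph in that form. Some caveats: it follows the checkpoint-style graph used by SD 1.5 and SDXL (the FLUX template is wired differently), `ckpt_name` is a placeholder you must replace with a real file name, and the graph includes one node not counted among the five above, Empty Latent Image, which supplies the blank latent canvas that KSampler starts from.

```python
import json
import urllib.request

def build_workflow(positive, negative="", ckpt_name="model.safetensors",
                   steps=20, cfg=7.0, seed=0, size=1024):
    """Build the basic text-to-image graph in ComfyUI's API (JSON) format.

    Each link is [source_node_id, output_index]. ckpt_name is a placeholder:
    it must match a file in models/checkpoints/.
    """
    return {
        "1": {"class_type": "CheckpointLoaderSimple",        # outputs: MODEL, CLIP, VAE
              "inputs": {"ckpt_name": ckpt_name}},
        "2": {"class_type": "CLIPTextEncode",                # positive prompt
              "inputs": {"text": positive, "clip": ["1", 1]}},
        "3": {"class_type": "CLIPTextEncode",                # negative prompt
              "inputs": {"text": negative, "clip": ["1", 1]}},
        "4": {"class_type": "EmptyLatentImage",              # blank latent canvas
              "inputs": {"width": size, "height": size, "batch_size": 1}},
        "5": {"class_type": "KSampler",
              "inputs": {"model": ["1", 0], "positive": ["2", 0], "negative": ["3", 0],
                         "latent_image": ["4", 0], "seed": seed, "steps": steps,
                         "cfg": cfg, "sampler_name": "euler", "scheduler": "normal",
                         "denoise": 1.0}},
        "6": {"class_type": "VAEDecode",
              "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
        "7": {"class_type": "SaveImage",
              "inputs": {"images": ["6", 0], "filename_prefix": "ComfyUI"}},
    }

def queue_prompt(workflow, server="http://127.0.0.1:8188"):
    """POST the graph to a running ComfyUI server's /prompt endpoint."""
    data = json.dumps({"prompt": workflow}).encode("utf-8")
    request = urllib.request.Request(f"{server}/prompt", data=data,
                                     headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())

workflow = build_workflow("a cat sitting on a wooden desk, soft lighting")
print(json.dumps(workflow["5"], indent=2))
# queue_prompt(workflow)  # uncomment with ComfyUI running
```

Reading the `inputs` of each node against the color table makes the links easy to follow: `["1", 0]` is the MODEL output of the Checkpoint Loader, `["1", 1]` its CLIP output, and `["1", 2]` its VAE output.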

Recommended settings by VRAM

The models and settings you can use change with your GPU’s VRAM capacity.

VRAM	Recommended model	Launch argument	Feel
4GB	SD 1.5	`--lowvram`	Slow, but it runs. Good for a first taste
6–8GB	SDXL / FLUX Schnell (FP8)	None	Practical. Being able to run FLUX Schnell at all is a big plus
12GB	FLUX Schnell / SDXL + ControlNet	None	Comfortable. Room for LoRA too
16GB	FLUX Dev (FP16)	None	Comfortable. Nearly every model runs
24GB	FLUX Dev (full precision)	`--highvram`	Top quality. The model stays resident, so switching is fast

--lowvram and --highvram are arguments passed when launching ComfyUI. On the Desktop edition you set them from the settings screen; on the Linux edition you specify them like python main.py --lowvram.

--lowvram offloads part of the model to main memory when VRAM is short. It runs slower, but lets you run a model that otherwise wouldn’t fit. Conversely, --highvram keeps the whole model resident in VRAM for speed, and pays off on GPUs with 24GB or more.
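The table above can be turned into a small lookup. This is a sketch under a few assumptions of mine: each row is treated as a minimum VRAM threshold, in-between sizes (such as 10GB) fall into the tier below, and anything under 4GB falls back to `--lowvram`.

```python
# (minimum VRAM in GB, recommended model, launch argument), from the table above.
TIERS = [
    (24, "FLUX Dev (full precision)", "--highvram"),
    (16, "FLUX Dev (FP16)", None),
    (12, "FLUX Schnell / SDXL + ControlNet", None),
    (6, "SDXL / FLUX Schnell (FP8)", None),
    (4, "SD 1.5", "--lowvram"),
]

def recommend(vram_gb: float):
    """Pick the highest tier whose minimum VRAM fits; None if below 4GB."""
    for minimum, model, argument in TIERS:
        if vram_gb >= minimum:
            return model, argument
    return None

def launch_command(vram_gb: float) -> str:
    """Build the Linux launch command for the recommended tier."""
    choice = recommend(vram_gb)
    argument = choice[1] if choice else "--lowvram"   # assumed fallback below 4GB
    return "python main.py" + (f" {argument}" if argument else "")

print(recommend(8))          # ('SDXL / FLUX Schnell (FP8)', None)
print(launch_command(4))     # python main.py --lowvram
print(launch_command(24))    # python main.py --highvram
```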

What VRAM is spent on

VRAM use during image generation breaks into three main parts.

1. The model weights: about 12GB for FLUX Schnell (FP8), about 5GB for SDXL (FP16). This takes up most of the VRAM.
2. Intermediate compute data: the tensors KSampler holds during generation. It grows in proportion to resolution — about 1–2GB at 1024×1024.
3. Additional models like ControlNet: using ControlNet adds +1–2GB. LoRA adds very little (a few tens of MB).

--lowvram offloads “1. the model weights” to CPU memory (system RAM) and transfers only the part needed for computation to the GPU. Because data has to move between the CPU and GPU, generation becomes roughly 2–5x slower.
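Adding up the three parts gives a quick budget. The sketch below uses only the rough figures from this section (1–2GB of intermediate data at 1024×1024, 1–2GB per ControlNet) and ignores LoRA; the function name is hypothetical and the result is a ballpark, not a measurement.

```python
def estimate_vram_gb(weights_gb: float, controlnets: int = 0):
    """Return a (low, high) VRAM estimate in GB for 1024x1024 generation.

    Uses the rough figures from this section: 1-2GB of intermediate data,
    plus 1-2GB per ControlNet. LoRA is ignored because it adds only tens of MB.
    """
    low = weights_gb + 1 + controlnets * 1
    high = weights_gb + 2 + controlnets * 2
    return low, high

print(estimate_vram_gb(12))                  # FLUX Schnell (FP8): (13, 14)
print(estimate_vram_gb(5, controlnets=1))    # SDXL (FP16) + ControlNet: (7, 9)
```

When the estimate exceeds your card’s VRAM, the weights have to be offloaded to system RAM to run at all, with the slowdown that comes with it.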

Hands-on: On my setup (RTX 3090, 24GB), I use --highvram to keep FLUX Dev resident at full precision. There’s no reloading when switching models, which makes iterating a real pleasure.

Common errors and fixes

Once you start using ComfyUI, there are a few errors nearly everyone hits at least once. Here’s how to handle them.

Error / symptom	Cause	Fix
CUDA out of memory	Not enough VRAM	Add the `--lowvram` argument. Or lower the resolution (1024→768→512)
A node shows up in red	A wiring mistake between nodes, or a custom node isn’t installed	Check the connections. Install missing nodes via Manager → Install Missing Custom Nodes
Checkpoint not found	The model file isn’t in the right place	Put the model file in the `models/checkpoints/` folder. The Desktop edition can download it directly from the Manager
The result is pure black	The VAE failed to load, or a mismatch between an FP8 model and the VAE	Load the VAE explicitly with a separate Load VAE node and connect it to VAE Decode. If you’re using an FP8 model, download a matching VAE separately and load that
Generation is abnormally slow	It’s computing on the CPU, not the GPU	Check that the NVIDIA driver is installed correctly. The Desktop edition usually auto-detects it
Pressing Queue Prompt does nothing	A node connection is incomplete	Check that every node’s input ports are connected. If a port is unconnected, it won’t run

As long as you use the Desktop edition, errors stemming from environment setup barely happen. The most common one is CUDA out of memory (not enough VRAM). Lowering the resolution or trying --lowvram is your first move.
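If you drive generation from a script, the “lower the resolution” advice can be automated. This is a minimal sketch: `generate` is a hypothetical callable standing in for whatever actually produces the image, and detecting out-of-memory errors by their message text is an assumption of the example.

```python
def generate_with_fallback(generate, resolutions=(1024, 768, 512)):
    """Try each resolution in turn until one does not run out of memory.

    `generate` is a hypothetical callable taking a side length in pixels.
    Out-of-memory errors are detected by message text, an assumption of
    this sketch.
    """
    last_error = None
    for size in resolutions:
        try:
            return size, generate(size)
        except RuntimeError as error:
            if "out of memory" not in str(error).lower():
                raise                      # unrelated error: do not hide it
            last_error = error
    raise RuntimeError("All resolutions ran out of memory; try --lowvram") from last_error

# Stand-in generator that only "fits" at 768 pixels or below.
def fake_generate(size):
    if size > 768:
        raise RuntimeError("CUDA out of memory")
    return f"image {size}x{size}"

print(generate_with_fallback(fake_generate))   # (768, 'image 768x768')
```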

ComfyUI Manager: install extensions easily

One of ComfyUI’s strengths is an extension-management tool called ComfyUI Manager. It comes preinstalled in the Desktop edition.

With the Manager, the following are all done within the GUI.

Install, update, and remove custom nodes
Search for and download model files
Auto-detect and install missing nodes (handy when you load someone else’s workflow)
Update ComfyUI itself

When you load a workflow someone else made (a .json file) into ComfyUI, it may show up in red because the custom nodes it uses are missing. In that case, just press “Install Missing Custom Nodes” in the Manager, and the required nodes install in one go.
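You can also peek inside a workflow file before loading it to see which node types it uses. The sketch below is illustrative only: `KNOWN_TYPES` is a small hand-written set, not ComfyUI’s real list of built-in nodes, and `SomeCustomNode` is a made-up name. The Manager does this check properly against your actual installation.

```python
# Node types from the basic workflow; an illustrative, incomplete set,
# not ComfyUI's real list of built-in nodes.
KNOWN_TYPES = {"CheckpointLoaderSimple", "CLIPTextEncode", "EmptyLatentImage",
               "KSampler", "VAEDecode", "SaveImage"}

def node_types(workflow: dict) -> set:
    """Collect node type names from a workflow in UI or API format."""
    if isinstance(workflow.get("nodes"), list):          # UI export: {"nodes": [...]}
        return {node["type"] for node in workflow["nodes"] if "type" in node}
    return {node["class_type"] for node in workflow.values()   # API format
            if isinstance(node, dict) and "class_type" in node}

def unknown_nodes(workflow: dict, known=KNOWN_TYPES) -> list:
    """Node types in the workflow that are not in the known set."""
    return sorted(node_types(workflow) - set(known))

example = {"nodes": [{"id": 1, "type": "KSampler"},
                     {"id": 2, "type": "SomeCustomNode"}]}   # hypothetical node name
print(unknown_nodes(example))   # ['SomeCustomNode']
```

To run it on a real file, load the .json with `json.load` and pass the resulting dictionary to `unknown_nodes`.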

Next steps

Once you can generate your first image, here are the features to try next.

Feature	What it does	Difficulty
ControlNet	Generate an image with a specified pose or composition. Lets you say “in this pose” or “in this composition”	★★☆
LoRA	Add a model trained on a specific style, character, or art direction. Powerful in combination with SDXL	★★☆
img2img	Transform the style or content based on an existing image	★☆☆
Inpainting	Select just part of an image and redraw it. Lets you “fix only the face” or “change only the background”	★★☆
Upscaling	Increase the resolution of a generated image. 512×512→2048×2048, etc.	★☆☆
Video generation	Generate short videos with AnimateDiff, Wan 2.1, etc. (16GB+ VRAM recommended)	★★★

My personal recommendation is to try img2img and upscaling first. Both have templates ready and are usable as an extension of the basic Text to Image.

I’d suggest getting into ControlNet and LoRA once you have a sense of “the direction of the images I want to make.” That way it’s easier to feel “ah, this is what I wanted to do.”

ControlNet: the technique for “specifying” composition

In Text to Image you specify composition and pose via the prompt, but fine control has its limits. ControlNet is a technique that extracts structural information such as outlines (Canny), depth (Depth), or pose (OpenPose) from an input image, and generates a new image while keeping that structure.

For example, you can detect Canny edges from a hand-drawn rough sketch and generate an image that follows those edges. VRAM use is about +1–2GB. Download a ControlNet model from the ComfyUI Manager, add two or three nodes, and you’re set.

Measured on the author’s machine (RTX 3090 24GB)

Generation time by model (measured on RTX 3090)

Model (resolution)	Time (sec)
SD 1.5 (512×512)	8.0
SDXL (1024×1024)	26.0

* ComfyUI / Linux / measured May 2026. 20 steps.

Summary

With the arrival of the Desktop edition, 2026’s ComfyUI has made “your first image five minutes after install” a reality.

Just run the installer and pick a template. No Python, Git, or CUDA knowledge required.

Once you’ve generated your first image, try changing the prompt, swapping the model, or specifying composition with ControlNet. As you get used to ComfyUI’s node-based design, there’s a joy in building up “a workflow that’s all your own.”

For how to choose a GPU, I’ve put together recommendations by VRAM capacity in a separate article, so if you’re considering a purchase, take a look at that too.

The specs and generation times in this article are a rough guide on an RTX 3090 environment and vary greatly by GPU. Prices and compatibility are as of April 2026.

Gear used for testing