ComfyUI for Beginners: From Install to Your First Image (2026)
I’ve put a GPU in my PC and generate AI images locally. Blog thumbnails, social-media assets, visualizing an idea — the uses vary.
The old ComfyUI required setting up an environment yourself: “install Python, clone with Git, pip install, match the CUDA version…” That was a fairly high hurdle for anyone without programming experience.
That changed completely when the Desktop edition was released in October 2024. Now you just download and run an installer — no programming knowledge required at all. In this article I’ll walk through the Desktop edition, from install to generating your first image.
Right now, with the Desktop edition you can get started in just five steps.
What is ComfyUI?
ComfyUI is a node-based AI image generation tool.
It acts as a frontend for running text-to-image AI models such as Stable Diffusion, FLUX, and SD3 on your local PC. It’s open source and free. ComfyUI itself can be used commercially at no cost, but whether you may use the generated images commercially depends on the license of the model you use.
Its defining feature is that you connect blocks called “nodes” with lines to build up the flow of processing. It looks complex at first glance, but this design makes customizations like “I want to swap the model,” “I want to fix just part of the image,” or “I want to add upscaling” easy to do.
A similar tool is Automatic1111 (WebUI); here’s how they differ.
| Comparison | ComfyUI | Automatic1111 (WebUI) |
|---|---|---|
| UI style | Node-based (connect with lines) | Form-input style |
| Customizability | Very high | Moderate |
| Beginner-friendliness | A bit hard to grasp | Intuitive |
| Speed of support for new models | Fast (supports FLUX and others quickly) | A bit slow |
| Memory efficiency | Good | Average |
| Development status in 2026 | Active (Desktop edition released) | Updated less often than ComfyUI |
The Stable Diffusion and FLUX models used in ComfyUI run on a mechanism called the Latent Diffusion Model. Computing a normal image (say, 1024×1024 pixels, about 1 million pixels) as-is would need an enormous amount of processing, so Latent Diffusion first compresses the image into a “latent space” of about 128×128, one eighth the size on each side. That leaves about 1/64 as many positions to compute as in pixel space, which is why it can generate images at practical speeds even on a local GPU.
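The 1/64 figure can be checked with a little arithmetic. The sketch below assumes the 8x per-side downsampling implied by the 1024 → 128 figures above, and counts grid positions only; it ignores channel counts, which differ between pixel space and latent space.

```python
def latent_side(pixel_side: int, downscale: int = 8) -> int:
    """Side length of the latent grid for a given image side (assumes 8x downscaling)."""
    return pixel_side // downscale


def spatial_reduction(pixel_side: int, downscale: int = 8) -> float:
    """Ratio of latent grid positions to pixel positions."""
    side = latent_side(pixel_side, downscale)
    return (side * side) / (pixel_side * pixel_side)


if __name__ == "__main__":
    print(latent_side(1024))        # 128
    print(spatial_reduction(1024))  # 0.015625, i.e. 1/64
```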
The flow is “random noise → gradually remove noise → an image emerges.” How many times denoising is repeated corresponds to the KSampler’s “Steps” parameter.
As of 2026, my honest take is that if you’re starting fresh, ComfyUI is a strong choice. Support for new models is fast and development is active. With the arrival of the Desktop edition, the beginner’s wall is nearly gone too.
Installation
Until 2024, you needed manual Python / Git / CUDA setup, and someone with no programming experience could get stuck at the environment-building stage. With the Desktop edition, you just run an installer. It supports Windows / macOS, and it takes about 10–30 minutes including the model download.
Desktop edition (Windows / Mac)
- Go to comfy.org and download the Desktop edition from “Download”
- Run the installer. No special settings are needed; basically just keep clicking “Next”
- When ComfyUI launches, a template-selection screen appears
- Select the “Text to Image (FLUX)” template. The automatic download of the recommended model (FLUX Schnell, etc.) begins
- Once the download finishes, the prompt input field becomes usable
The model download size varies by the model you choose. For FLUX Schnell it’s about 23GB (FP16). On a fiber connection, roughly 10–15 minutes.
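To adapt the download estimate to your own connection, the arithmetic is simple. This is a rough sketch that assumes a sustained speed and ignores protocol overhead; the example speeds are placeholders, not measurements. By this arithmetic, 10–15 minutes for 23GB corresponds to roughly 200–300 Mbps.

```python
def download_minutes(size_gb: float, speed_mbps: float) -> float:
    """Minutes to download size_gb gigabytes at speed_mbps megabits per second."""
    size_megabits = size_gb * 1000 * 8  # 1 GB = 1000 MB, 1 byte = 8 bits
    return size_megabits / speed_mbps / 60


if __name__ == "__main__":
    for mbps in (100, 250, 1000):  # assumed example speeds
        print(f"{mbps} Mbps: {download_minutes(23, mbps):.1f} min")
```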
Note: The Desktop edition bundles a Python environment and CUDA runtime in addition to ComfyUI itself. So even if you already have a Python environment, you don’t need to worry about conflicts.
Linux (the traditional method)
For Linux users, here’s the traditional method as well.
```bash
# Clone the repository
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

# Install dependencies (venv recommended)
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Launch
python main.py
```
Once it launches, you can reach it in the browser at http://127.0.0.1:8188. Place model files manually in the models/checkpoints/ folder.
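If a model doesn’t appear in the loader, a quick way to confirm what is actually in the folder is to list it. This is a minimal sketch; the function name and the set of file extensions are assumptions for illustration, not part of ComfyUI.

```python
from pathlib import Path

# Extensions commonly used for checkpoint files (an assumption, not an exhaustive list).
MODEL_EXTENSIONS = {".safetensors", ".ckpt"}


def list_checkpoints(comfy_root: str) -> list[str]:
    """Return the model file names found in <comfy_root>/models/checkpoints, sorted."""
    folder = Path(comfy_root) / "models" / "checkpoints"
    if not folder.is_dir():
        return []
    return sorted(
        p.name for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in MODEL_EXTENSIONS
    )


if __name__ == "__main__":
    # "ComfyUI" is the directory created by the git clone step.
    print(list_checkpoints("ComfyUI"))
```

An empty list means either the folder is missing or no model files have been placed in it yet.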
Generating your first image
Once the Desktop edition is installed, let’s generate an image right away.
Steps
- Launch ComfyUI
- Select “Text to Image (FLUX)” from the templates (it shows automatically on first launch; from the second time on, you can pick it from the menu)
- Enter text in English in the CLIP Text Encode node (the prompt input field)
- Press the “Queue Prompt” button, labeled “Run” in recent versions of the interface (or press Ctrl+Enter)
- In a few seconds to a few tens of seconds an image is generated and shown in the Save Image node
Example prompts
Start with something simple.
a cat sitting on a wooden desk, soft lighting, photorealistic, 4k
a Japanese garden in autumn, golden leaves, pond reflection, cinematic
a futuristic city at night, neon lights, rain, cyberpunk style
Prompts are basically in English. Some models support other languages, but English tends to give more stable, on-intent results.
Rough generation times
The first generation takes longer because the model has to load. From the second time on it’s fast, since the model stays in memory.
| Model | Resolution | First generation | Second time on | Notes |
|---|---|---|---|---|
| FLUX Schnell | 512×512 | 15–20s | 3–5s | Desktop edition default |
| FLUX Schnell | 1024×1024 | 25–35s | 8–12s | A practical resolution |
| SDXL | 1024×1024 | 20–30s | 10–15s | At 20 steps |
| SD 1.5 | 512×512 | 10–15s | 2–4s | The lightest |
* Generation times are a rough guide on an RTX 3090. 20 steps, 24GB VRAM environment. They vary greatly by GPU.
Model comparison: which one to use
ComfyUI lets you switch between several AI models. Here’s a comparison of the major models as of April 2026.
| Model | File size | VRAM needed | Time (1024×1024) | Quality | Characteristics |
|---|---|---|---|---|---|
| FLUX Schnell | ~23GB (FP16) | 8GB+ | 8–12s | ◎ | Fast and high-quality. Desktop edition default. Start here |
| FLUX Dev | ~23GB | 12GB+ | 25–35s | ◎◎ | Top-class quality. Good at rendering text. But heavy |
| SDXL | ~6.5GB | 6GB+ | 10–15s | ○ | Lightweight with a rich set of custom models (LoRA, etc.) |
| SD 1.5 | ~4GB | 4GB+ | 3–5s | △ | Old but the lightest. Runs even on low VRAM |
* Generation times are for an RTX 3090 at 20 steps (as of April 2026)
FLUX Schnell is plenty to start with. The Desktop edition’s template selects it automatically, so there’s no need to change it. Try SDXL once you reach the stage of “I want to generate in a specific style” or “I want to use LoRA.”
Basic node concepts
When you open the ComfyUI screen, you see a diagram of square blocks (nodes) connected by lines. It looks complex at first, but the basics are just five nodes.
The five basic nodes
1. Checkpoint Loader (load the model)
The node that loads the AI model file. It’s the starting point that decides “which model draws the picture.”
2. CLIP Text Encode (prompt conversion)
Converts the text you enter (the prompt) into numerical data the AI model can understand. You use two: a positive prompt (what you want to draw) and a negative prompt (what you don’t want to draw).
3. KSampler (the heart of image generation)
The “drawing engine,” so to speak, that gradually generates the image from a lump of noise. It has several settings, but the first three to learn are these.
Steps: The number of times denoising is repeated. The standard is 20; more steps add fineness, but past 30 the quality gains diminish while compute time keeps rising linearly. In practice, a range of 15–25 is plenty.
CFG Scale (Classifier-Free Guidance): A value that controls fidelity to the prompt. For SDXL, around 7.0 is standard. Closer to 1.0 weakens the prompt’s influence, giving “the AI’s free interpretation”; above 15.0 the prompt pulls too hard, causing color saturation or broken outlines. FLUX models have the guidance mechanism built in, so it’s common to run them with CFG fixed at 1.0.
Sampler: The denoising calculation method. There’s “euler,” “dpmpp_2m,” “dpmpp_sde,” and others, each with a different balance of speed and quality. If unsure, “euler” (fast and stable) or “dpmpp_2m” (higher quality, a bit slower) will do fine.
4. VAE Decode (convert to an image)
What KSampler generated is “latent space” data invisible to humans. VAE Decode converts it into an image we can see.
5. Save Image (save the image)
The node that saves and displays the finished image.
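The CFG Scale setting described under KSampler corresponds to the standard classifier-free guidance formula, sketched here on plain lists of numbers. The function name and the toy values are illustrative only; a real sampler applies the same arithmetic to large tensors at every step.

```python
def apply_cfg(uncond: list[float], cond: list[float], scale: float) -> list[float]:
    """Classifier-free guidance: uncond + scale * (cond - uncond), element-wise.

    uncond is the model's prediction for the negative (or empty) prompt,
    cond is its prediction for the positive prompt.
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


if __name__ == "__main__":
    uncond = [0.0, 1.0]
    cond = [1.0, 3.0]
    print(apply_cfg(uncond, cond, 1.0))  # [1.0, 3.0]: exactly the positive-prompt prediction
    print(apply_cfg(uncond, cond, 7.0))  # [7.0, 15.0]: the difference is amplified 7x
```

At 1.0 the result is just the positive-prompt prediction, so nothing is amplified. Larger values push the result further along the positive-minus-negative direction, which is one way to picture why very high settings overshoot.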
The data flow
Checkpoint Loader → [model] → KSampler → [latent image] → VAE Decode → [image] → Save Image
CLIP Text Encode (positive / negative) → KSampler
The lines between nodes are color-coded.
| Line color | Data it carries | Meaning |
|---|---|---|
| Purple | MODEL | The AI model itself |
| Yellow | CLIP | The text-processing part |
| Orange | CONDITIONING | The prompt’s information |
| Pink | LATENT | Latent-space image data |
| Blue | IMAGE | An image humans can see |
| Red | VAE | The image-conversion engine |
As long as you use a template, you don’t need to touch this wiring yourself. Still, understanding “so this is how it works” makes it easier to guess the cause when an error occurs.
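The wiring can also be written down as data. The sketch below expresses the basic graph as a Python dict in the style of ComfyUI’s API-format workflow JSON, where each link is written as [source node id, output index]. The node ids, the checkpoint file name, the prompts, and the parameter values are placeholders, and the input names are assumptions based on the stock nodes that may differ between versions. It also includes an Empty Latent Image node, which supplies the starting latent for KSampler and is not one of the five nodes described earlier.

```python
# Placeholder graph: node ids, file name, prompts, and parameter values are illustrative only.
WORKFLOW = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "example_model.safetensors"}},
    "2": {"class_type": "CLIPTextEncode",  # positive prompt
          "inputs": {"text": "a cat sitting on a wooden desk", "clip": ["1", 1]}},
    "3": {"class_type": "CLIPTextEncode",  # negative prompt
          "inputs": {"text": "blurry", "clip": ["1", 1]}},
    "4": {"class_type": "EmptyLatentImage",
          "inputs": {"width": 1024, "height": 1024, "batch_size": 1}},
    "5": {"class_type": "KSampler",
          "inputs": {"model": ["1", 0], "positive": ["2", 0], "negative": ["3", 0],
                     "latent_image": ["4", 0], "seed": 0, "steps": 20, "cfg": 7.0,
                     "sampler_name": "euler", "scheduler": "normal", "denoise": 1.0}},
    "6": {"class_type": "VAEDecode",
          "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
    "7": {"class_type": "SaveImage",
          "inputs": {"images": ["6", 0], "filename_prefix": "ComfyUI"}},
}


def links(workflow: dict) -> list[tuple[str, str, str]]:
    """List every connection as (source node id, target node id, target input name)."""
    found = []
    for node_id, node in workflow.items():
        for name, value in node["inputs"].items():
            # A link is a two-item list: [source node id, output index].
            if isinstance(value, list) and len(value) == 2:
                found.append((value[0], node_id, name))
    return found


if __name__ == "__main__":
    for source, target, name in links(WORKFLOW):
        print(f"{WORKFLOW[source]['class_type']} -> {WORKFLOW[target]['class_type']}.{name}")
```

Reading the printed links top to bottom traces the same path as the colored lines: model and CLIP out of the loader, conditioning into KSampler, latent into VAE Decode, image into Save Image.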
Recommended settings by VRAM
The models and settings you can use change with your GPU’s VRAM capacity.
| VRAM | Recommended model | Launch argument | Feel |
|---|---|---|---|
| 4GB | SD 1.5 | --lowvram | Slow but runs. For a taster |
| 6–8GB | SDXL / FLUX Schnell (FP8) | None | Practical. FLUX Schnell running is a big deal |
| 12GB | FLUX Schnell / SDXL + ControlNet | None | Comfortable. Room for LoRA too |
| 16GB | FLUX Dev (FP16) | None | Comfortable. Nearly every model runs |
| 24GB | FLUX Dev (full precision) | --highvram | Top quality. The model stays resident, so switching is fast |
--lowvram and --highvram are arguments passed when launching ComfyUI. On the Desktop edition you set them from the settings screen; on the Linux edition you specify them like python main.py --lowvram.
--lowvram is a feature that offloads part of the model to main memory when VRAM is short — it runs slower, but lets you force a model that otherwise wouldn’t run. Conversely, --highvram keeps the whole model resident in VRAM for speed, and shows its effect on GPUs with 24GB or more.
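The VRAM table can be turned into a small lookup. This is only a sketch of the recommendations in this article: the function name is made up for illustration, and treating in-between sizes (10GB, for example) as the next row down is an assumption.

```python
# Rows copied from the VRAM table: (minimum VRAM in GB, recommended model, launch argument).
RECOMMENDATIONS = [
    (24, "FLUX Dev (full precision)", "--highvram"),
    (16, "FLUX Dev (FP16)", None),
    (12, "FLUX Schnell / SDXL + ControlNet", None),
    (6, "SDXL / FLUX Schnell (FP8)", None),
    (4, "SD 1.5", "--lowvram"),
]


def recommend(vram_gb: float) -> tuple[str, list[str]]:
    """Return (recommended model, launch command) for a given amount of VRAM."""
    for minimum, model, flag in RECOMMENDATIONS:
        if vram_gb >= minimum:
            command = ["python", "main.py"] + ([flag] if flag else [])
            return model, command
    raise ValueError("Less than 4GB of VRAM is outside the table")


if __name__ == "__main__":
    for vram in (4, 8, 12, 24):
        model, command = recommend(vram)
        print(f"{vram}GB: {model} -> {' '.join(command)}")
```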
VRAM use during image generation breaks into three main parts.
1. The model weights: about 12GB for FLUX Schnell (FP8), about 5GB for SDXL (FP16). This takes up most of the VRAM.
2. Intermediate compute data: the tensors KSampler holds during generation. It grows in proportion to resolution — about 1–2GB at 1024×1024.
3. Additional models like ControlNet: using ControlNet adds +1–2GB. LoRA adds very little (a few tens of MB).
--lowvram offloads the model weights (item 1 above) to CPU memory (system RAM) and transfers only the part needed for computation to the GPU. Because of the GPU-CPU data transfer, generation runs roughly 2–5x slower.
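Adding up the three components gives a rough budget you can compare against your card. The sketch below uses the approximate figures quoted above as inputs; the function names are illustrative, LoRA is left out because it adds so little, and real usage varies by model and settings.

```python
def estimate_vram_gb(weights_gb: float, intermediate_gb: float = 1.5,
                     controlnet_gb: float = 0.0) -> float:
    """Sum of model weights, intermediate data, and optional ControlNet (rough estimates)."""
    return weights_gb + intermediate_gb + controlnet_gb


def fits(total_gb: float, vram_gb: float) -> bool:
    """True if the estimate fits in VRAM without offloading."""
    return total_gb <= vram_gb


if __name__ == "__main__":
    sdxl = estimate_vram_gb(weights_gb=5)                      # SDXL (FP16), no extras
    flux = estimate_vram_gb(weights_gb=12, controlnet_gb=1.5)  # FLUX Schnell (FP8) + ControlNet
    for name, total in (("SDXL", sdxl), ("FLUX Schnell + ControlNet", flux)):
        print(f"{name}: ~{total:.1f}GB, fits in 12GB: {fits(total, 12)}")
```

When the estimate exceeds your VRAM, that is the situation where part of the weights has to live in system RAM, at the cost in speed described above.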
With 24GB of VRAM, you can use --highvram to keep FLUX Dev resident at full precision. There’s no reloading when switching models, which makes iterating a real pleasure.
Common errors and fixes
Once you start using ComfyUI, there are a few errors nearly everyone hits at least once. Here’s how to handle them.
| Error / symptom | Cause | Fix |
|---|---|---|
| CUDA out of memory | Not enough VRAM | Add the --lowvram argument. Or lower the resolution (1024→768→512) |
| A node shows up in red | A wiring mistake between nodes, or a custom node isn’t installed | Check the connections. Install missing nodes via Manager → Install Missing Custom Nodes |
| Checkpoint not found | The model file isn’t in the right place | Put the model file in the models/checkpoints/ folder. The Desktop edition can download it directly from the Manager |
| The result is pure black | The VAE failed to load, or a mismatch between an FP8 model and the VAE | Specify the VAE explicitly: add a Load VAE node and connect it to VAE Decode. If you’re using an FP8 model, download a matching VAE separately and select it there |
| Generation is abnormally slow | It’s computing on the CPU, not the GPU | Check that the NVIDIA driver is installed correctly. The Desktop edition usually auto-detects it |
| Pressing Queue Prompt does nothing | A node connection is incomplete | Check that every node’s input ports are connected. If a port is unconnected, it won’t run |
As long as you use the Desktop edition, errors stemming from environment setup barely happen. The most common one is CUDA out of memory (not enough VRAM). Lowering the resolution or trying --lowvram is your first move.
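The “lower the resolution” advice amounts to a simple retry loop. In the sketch below, `generate` is a hypothetical stand-in for whatever runs your workflow, and it is assumed to raise MemoryError when VRAM runs out; a real out-of-memory error may be a different exception type.

```python
from typing import Callable

FALLBACK_SIZES = (1024, 768, 512)  # the step-down order suggested in the error table


def generate_with_fallback(generate: Callable[[int], str]) -> tuple[int, str]:
    """Try each size in turn until `generate` succeeds.

    `generate` is a hypothetical callable taking a square image size and
    returning a result; it is assumed to raise MemoryError on out-of-memory.
    """
    for size in FALLBACK_SIZES:
        try:
            return size, generate(size)
        except MemoryError:
            print(f"{size}x{size} ran out of memory, trying a smaller size")
    raise MemoryError("Out of memory even at the smallest size; try --lowvram")


if __name__ == "__main__":
    def fake_generate(size: int) -> str:
        # Simulated GPU that can only handle 768x768 or smaller.
        if size > 768:
            raise MemoryError
        return f"image_{size}.png"

    print(generate_with_fallback(fake_generate))  # (768, 'image_768.png')
```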
ComfyUI Manager: install extensions easily
One of ComfyUI’s strengths is an extension-management tool called ComfyUI Manager. It comes preinstalled in the Desktop edition.
With the Manager, the following are all done within the GUI.
- Install, update, and remove custom nodes
- Search for and download model files
- Auto-detect and install missing nodes (handy when you load someone else’s workflow)
- Update ComfyUI itself
When you load a workflow someone else made (a .json file) into ComfyUI, nodes may show up in red because the custom nodes it uses are missing. In that case, just press “Install Missing Custom Nodes” in the Manager, and the required nodes install in one go.
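Conceptually, detecting missing nodes is a set difference between the node types a workflow uses and the ones you have installed. This is a simplified sketch, not the Manager’s actual implementation; it assumes the workflow JSON has a top-level "nodes" list whose items carry a "type" field, and the node names in the example are hypothetical.

```python
def missing_node_types(workflow: dict, installed: set[str]) -> list[str]:
    """Return the node types a workflow uses that are not in `installed`, sorted.

    Assumes a top-level "nodes" list whose items carry a "type" field.
    """
    used = {node["type"] for node in workflow.get("nodes", [])}
    return sorted(used - installed)


if __name__ == "__main__":
    # Hypothetical data: "SomeCustomUpscaler" stands in for a third-party node.
    workflow = {"nodes": [{"type": "KSampler"}, {"type": "SaveImage"},
                          {"type": "SomeCustomUpscaler"}]}
    installed = {"KSampler", "SaveImage", "VAEDecode"}
    print(missing_node_types(workflow, installed))  # ['SomeCustomUpscaler']
```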
Next steps
Once you can generate your first image, here are the features to try next.
| Feature | What it does | Difficulty |
|---|---|---|
| ControlNet | Generate an image with a specified pose or composition. “In this pose,” “in this composition” becomes possible | ★★☆ |
| LoRA | Add a model trained on a specific style, character, or art direction. Powerful in combination with SDXL | ★★☆ |
| img2img | Transform the style or content based on an existing image | ★☆☆ |
| Inpainting | Select just part of an image and redraw it. “Fix only the face,” “change only the background” become possible | ★★☆ |
| Upscaling | Increase the resolution of a generated image. 512×512→2048×2048, etc. | ★☆☆ |
| Video generation | Generate short videos with AnimateDiff, Wan 2.1, etc. (16GB+ VRAM recommended) | ★★★ |
My personal recommendation is to try img2img and upscaling first. Both have templates ready and are usable as an extension of the basic Text to Image.
I’d suggest getting into ControlNet and LoRA once you have a sense of “the direction of the images I want to make.” That way it’s easier to feel “ah, this is what I wanted to do.”
In Text to Image you instruct composition and pose via the prompt, but there’s a limit to fine control. ControlNet is a technique that extracts structural information such as “outline (Canny),” “depth (Depth),” or “pose (OpenPose)” from an input image, and generates a new image while keeping that structure.
For example, you can detect Canny edges from a hand-drawn rough sketch and generate an image that follows those edges. VRAM use is about +1–2GB. Download a ControlNet model from the ComfyUI Manager, add two or three nodes, and you’re set.
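To get a feel for what “extracting an outline” means, here is a toy edge detector in pure Python. It is a simplified brightness-difference threshold, not the full Canny algorithm that ControlNet preprocessors use, and the function name and threshold are illustrative choices.

```python
def edge_map(image: list[list[int]], threshold: int = 50) -> list[list[int]]:
    """Mark pixels whose brightness differs sharply from the right or lower neighbor.

    A simplified gradient threshold, not the full Canny algorithm.
    """
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx = abs(image[y][x + 1] - image[y][x]) if x + 1 < width else 0
            dy = abs(image[y + 1][x] - image[y][x]) if y + 1 < height else 0
            if max(dx, dy) >= threshold:
                edges[y][x] = 1
    return edges


if __name__ == "__main__":
    # A 4x4 grayscale image: dark left half, bright right half.
    image = [[0, 0, 255, 255] for _ in range(4)]
    for row in edge_map(image):
        print(row)  # each row is [0, 1, 0, 0]: the edge sits at the dark/bright boundary
```

The resulting map of ones and zeros is the kind of structural information that gets passed along as a guide, while the colors and textures are left for the model to fill in.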
Measured on the author’s machine (RTX 3090 24GB)
| Model (resolution) | Generation time (sec) |
|---|---|
| SD 1.5 (512×512) | 8.0 |
| SDXL (1024×1024) | 26.0 |
* ComfyUI / Linux / measured May 2026. 20 steps.
Summary
With the arrival of the Desktop edition, 2026’s ComfyUI has made “your first image five minutes after install” a reality.
Just run the installer and pick a template. No Python, Git, or CUDA knowledge required.
Once you’ve generated your first image, try changing the prompt, swapping the model, or specifying composition with ControlNet. As you get used to ComfyUI’s node-based design, there’s a joy in building up “a workflow that’s all your own.”
For how to choose a GPU, I’ve put together recommendations by VRAM capacity in a separate article, so if you’re considering a purchase, take a look at that too.
The specs and generation times in this article are a rough guide on an RTX 3090 environment and vary greatly by GPU. Prices and compatibility are as of April 2026.