
FLUX.2-klein-4B on H100: Image Generation Benchmark

We ran 6 benchmarks on Black Forest Labs' 4B-parameter diffusion model: generation speed, image quality, consistency, batch throughput, multi-GPU scaling, and MLPerf-style scenarios. A single H100 generates photorealistic 1024x1024 images in 0.57 seconds.

InferenceBench Team, Benchmark Engineering | April 12, 2026 | 18 min read

The 4-Billion Pixel Machine

Forget tokens per second. This benchmark measures images per second.

FLUX.2-klein-4B by Black Forest Labs generates a photorealistic 512x512 image in 0.19 seconds. That is not a typo. On a single NVIDIA H100, this 4-billion-parameter diffusion model produces images faster than most people can blink. At 1024x1024 — the standard resolution for production use — it completes a 4-step generation in 0.57 seconds.

Here is what 0.57 seconds buys you:

Figure: Photo realism sample (golden retriever puppy). CLIP score 0.373. Generated in 0.57s at 1024x1024, 4 steps.

And here is what happens when you ask a diffusion model to render text — historically the hardest task for image generators:

Figure: Text rendering sample (neon "HELLO"). CLIP score 0.335. Not perfect, but legible and stylistically coherent.

We ran six distinct benchmark suites on this model: generation speed across resolutions and step counts, CLIP-based quality evaluation across 10 categories, attribute binding consistency, maximum batch throughput, multi-GPU scaling from 1 to 8 H100s, and MLPerf-style scenario testing. This is the most comprehensive public benchmark of FLUX.2-klein-4B to date.

What Is FLUX.2?

FLUX.2 is the second-generation image synthesis family from Black Forest Labs, the company founded by the original creators of Stable Diffusion. If Stability AI brought text-to-image to the mainstream, Black Forest Labs is refining the architecture for production deployment.

The "klein" variant (German for "small") is the efficiency-focused member of the FLUX.2 lineup. At 4 billion parameters, it is roughly 60% the size of SDXL's 6.6B and a fraction of the proprietary models from Midjourney or DALL-E. The tradeoff is deliberate: Black Forest Labs optimized for inference speed and memory efficiency while maintaining competitive image quality.

Key architectural details:

Architecture: Diffusion Transformer (DiT) — not the original U-Net backbone. Transformer-based diffusion enables better scaling and parallelism.
Parameters: 4 billion (dense, all active during generation)
Pipeline: Hugging Face Diffusers (FluxPipeline) — not vLLM, not TensorRT. This is a Diffusers-native model.
Precision: bfloat16 natively
Step efficiency: Designed for 4-step generation (flow matching scheduler)
VRAM: 16-18 GB depending on resolution — fits on a single consumer GPU
Lineage: Successor to FLUX.1 Dev and FLUX.1 Schnell, with improved detail and text rendering

The 4-step capability is critical. Older diffusion models like Stable Diffusion 1.5 typically need 20-50 denoising steps. SDXL improved to 15-25 steps. FLUX.2-klein-4B achieves production-quality results in just 4 steps thanks to its flow matching training — and as our data shows, quality actually peaks at lower step counts.

6 Benchmarks, One GPU

Hardware and Methodology

Component	Specification
GPU	NVIDIA H100 SXM 80 GB HBM3
GPU Configs	1, 2, 4, 8 GPUs (multi-GPU tests)
Interconnect	NVLink 4.0 (900 GB/s bidirectional)
Pipeline	Hugging Face Diffusers (FluxPipeline)
Precision	bfloat16
Scheduler	Flow matching (default FLUX.2 scheduler)
Model Load Time	4.0 seconds
Scaling Strategy	Data parallel (replicated model per GPU)

Why not vLLM? FLUX.2 is a diffusion model, not an autoregressive language model. It does not produce tokens — it denoises latent images through iterative refinement. The inference pipeline is fundamentally different: there is no KV-cache, no sequence length, no time-to-first-token. The relevant metrics are images per second, latency per image, VRAM consumption, and CLIP alignment score.
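As a rough illustration of how latency per image and images per second relate, here is a minimal timing sketch. It is not the harness used for this benchmark: the `benchmark` helper, the warm-up count, and the `fake_generate` stand-in are all assumptions made for the example.

```python
import time
from statistics import mean

def benchmark(generate, n_warmup=2, n_runs=10):
    """Time a zero-argument callable that produces one image per call."""
    for _ in range(n_warmup):
        generate()  # warm-up runs are discarded
    latencies = []
    for _ in range(n_runs):
        start = time.perf_counter()
        generate()
        # With a real GPU pipeline you would synchronize the device
        # (for example torch.cuda.synchronize()) before reading the clock.
        latencies.append(time.perf_counter() - start)
    avg = mean(latencies)
    return {"latency_s": avg, "images_per_sec": 1.0 / avg}

def fake_generate():
    """Hypothetical stand-in for a pipeline call; it only sleeps."""
    time.sleep(0.01)

stats = benchmark(fake_generate, n_warmup=1, n_runs=5)
print(stats)
```

Swapping `fake_generate` for a function that calls a real pipeline would give the two headline metrics; VRAM and CLIP score need separate measurement.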

We tested 6 benchmark suites:

Generation Performance — 15 configurations (5 resolutions x 3 step counts): speed, latency, VRAM
Image Quality — CLIP alignment across 10 semantic categories, LPIPS diversity, steps-vs-quality
Consistency — Attribute binding accuracy across 6 prompt categories
InferenceMax — Maximum throughput via batch size ramp
Multi-GPU Scaling — Data parallel scaling from 1 to 8 GPUs
MLPerf-style Scenarios — Standard and fast configuration testing

Speed: Sub-Second Generation

The generation performance benchmark tested every combination of 5 resolutions and 3 step counts — 15 total configurations. The results separate fast, practical, and high-quality tiers clearly.

Full Results

Resolution	Steps	Time (s)	Images/sec
512x512	4	0.19	5.14
768x768	4	0.33	3.00
1280x720	4	0.50	2.01
1024x1024	4	0.57	1.77
1024x1024	8	1.00	1.00
1024x1024	12	1.43	0.70

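Generation time grows with resolution, but somewhat less than linearly in the 4-step results: going from 512x512 to 1024x1024 quadruples the pixel count and triples the generation time (0.19s to 0.57s). Growth with pixel count is expected, since diffusion models operate on latent representations proportional to image size and each denoising step processes the full latent tensor. Fixed per-image costs likely account for the gap from strict linearity.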
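The scaling behavior can be checked directly against the 4-step rows of the table. The sketch below compares pixel-count ratios with time ratios, using 512x512 as the baseline; the helper name `scaling_ratios` is ours, and the timings are copied from the table above.

```python
# 4-step generation times in seconds, copied from the results table.
results = {
    (512, 512): 0.19,
    (768, 768): 0.33,
    (1280, 720): 0.50,
    (1024, 1024): 0.57,
}

def scaling_ratios(times, base=(512, 512)):
    """Return {resolution: (pixel_ratio, time_ratio)} relative to `base`."""
    base_px = base[0] * base[1]
    base_t = times[base]
    return {
        (w, h): ((w * h) / base_px, t / base_t)
        for (w, h), t in times.items()
    }

ratios = scaling_ratios(results)
for (w, h), (px_ratio, t_ratio) in ratios.items():
    print(f"{w}x{h}: {px_ratio:.2f}x pixels, {t_ratio:.2f}x time")
```

In these numbers, 4x the pixels costs about 3x the time, so the growth is steady but slightly sub-linear.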

At 512x512 with 4 steps, the model produces 5.14 images per second. That is real-time generation. You could build a live preview that updates as the user types a prompt, and the latency would feel instantaneous.

At the standard production resolution of 1024x1024 with 4 steps, you get 1.77 images per second — roughly one image every 570 milliseconds. For comparison:

SDXL (6.6B): Typically 3-8 seconds per 1024x1024 image, depending on step count and GPU
Midjourney v6: 15-60 seconds including queue time (server-side)
DALL-E 3: 5-15 seconds via API (includes network latency)
Stable Diffusion 3.5: 2-5 seconds at 1024x1024

FLUX.2-klein-4B is 5-10x faster than SDXL on equivalent hardware. The combination of fewer parameters (4B vs 6.6B), the DiT architecture, and 4-step flow matching creates a significant speed advantage.

VRAM usage is remarkably stable: 16 GB at 512x512, climbing only to 17.8 GB at 1024x1024. This means the model comfortably fits on a single RTX 4090 (24 GB). A 16 GB card such as the RTX 4080 has no headroom at the measured 16 GB, so treat it as borderline even at lower resolutions.

Figure: Sample at 1024x1024, 4 steps, the sweet spot for production. 0.57 seconds on H100.

Quality: Where 4B Punches Above Its Weight

Speed means nothing if the images look bad. We evaluated FLUX.2-klein-4B's output quality using CLIP alignment scores across 10 distinct semantic categories, each with purpose-built prompts designed to stress different capabilities of the model.

CLIP Scores by Category

Category	CLIP Score	Assessment
Photo Realism	0.373	Strong — natural lighting, textures, depth
Macro Detail	0.372	Strong — close-up subjects rendered sharply
Spatial Reasoning	0.361	Good — understands relative positioning
Artistic Style	0.358	Good — transfers named styles convincingly
Composition	0.351	Good — follows layout instructions
Text Rendering	0.335	Decent — legible but occasionally imprecise
Human Anatomy	0.334	Decent — faces good, hands still challenging
Fine Detail	0.305	Moderate — small objects can blur
Infographic	0.302	Moderate — layout present but not precise
Diagram	0.263	Weak — not designed for structured visuals

Average CLIP score: 0.335. For context, CLIP scores above 0.30 generally indicate meaningful text-image alignment, and scores above 0.35 indicate strong prompt following. FLUX.2's strengths — photo realism, macro detail, spatial reasoning — align perfectly with its most common production use cases.
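The average and the threshold counts can be reproduced from the category table. This is a small arithmetic sketch; the scores are copied from the table, and the 0.30 and 0.35 cutoffs are the rule-of-thumb thresholds quoted above.

```python
from statistics import mean

# CLIP scores by category, copied from the table above.
clip_scores = {
    "Photo Realism": 0.373,
    "Macro Detail": 0.372,
    "Spatial Reasoning": 0.361,
    "Artistic Style": 0.358,
    "Composition": 0.351,
    "Text Rendering": 0.335,
    "Human Anatomy": 0.334,
    "Fine Detail": 0.305,
    "Infographic": 0.302,
    "Diagram": 0.263,
}

average = mean(clip_scores.values())
strong = [name for name, s in clip_scores.items() if s > 0.35]
aligned = [name for name, s in clip_scores.items() if s > 0.30]

print(f"average CLIP: {average:.3f}")
print(f"above 0.35: {len(strong)} of {len(clip_scores)} categories")
print(f"above 0.30: {len(aligned)} of {len(clip_scores)} categories")
```

Half of the categories clear the 0.35 bar, and only diagrams fall below 0.30.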

The weakest category, diagrams (0.263), is unsurprising. Diffusion models generate images through iterative denoising, not structured layout engines. Diagrams, flowcharts, and technical illustrations require precise geometric relationships that the stochastic generation process struggles with. This is not a flaw — it is a known boundary of the diffusion paradigm.

LPIPS diversity score: 0.400. This measures how different images generated from the same prompt are from each other. A higher score means more variety. At 0.400, FLUX.2 produces meaningfully diverse outputs — you will not get the same image twice with different seeds.
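The article does not spell out how the diversity score was aggregated. One common approach, assumed here, is to average a pairwise distance over all images generated from the same prompt. The sketch below shows that aggregation with a pluggable metric; `toy_distance` is a hypothetical stand-in for a perceptual metric such as LPIPS, which operates on image tensors rather than short lists.

```python
from itertools import combinations
from statistics import mean

def diversity(samples, distance):
    """Mean pairwise distance over all unordered pairs of samples."""
    return mean(distance(a, b) for a, b in combinations(samples, 2))

def toy_distance(a, b):
    """Mean absolute difference between two equal-length vectors.

    Hypothetical stand-in: a real evaluation would call a perceptual
    metric on decoded images instead.
    """
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Three tiny "images" standing in for outputs from different seeds.
samples = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
score = diversity(samples, toy_distance)
print(f"diversity: {score:.3f}")
```

A score of 0 would mean every seed produced the same image; larger values mean more variation between seeds.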

Figure: Artistic style sample (Japanese watercolor). CLIP 0.358. The model convincingly adapts to named artistic styles.

Figure: Human anatomy sample (portrait with natural proportions). CLIP 0.334. Faces and proportions are natural; hands remain a challenge for all diffusion models.

The Counterintuitive Finding: Fewer Steps = Higher Quality

This was the most surprising result in the entire benchmark. When we measured CLIP alignment across different step counts, the scores went down as steps increased:

Steps	CLIP Score	Relative
2	0.375	Best
4	0.373	Near-best
8	0.369	Slightly lower
12	0.367	Lowest

With traditional diffusion models, more steps almost always means better quality. With FLUX.2's flow matching scheduler, the model is trained to produce its best output in very few steps. Additional steps introduce marginal refinement to textures but can actually reduce prompt alignment as the model over-refines details at the expense of global coherence.

The practical takeaway: 4 steps is the sweet spot. You get about 99.5% of the peak CLIP score (0.373 vs 0.375) at 1.75x the speed of 8 steps and 2.5x the speed of 12 steps, based on the 1024x1024 timings in the speed table. There is no reason to run more than 4 steps in production.
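The tradeoff can be worked out from the two tables. The sketch below takes the CLIP scores from the steps table and the 1024x1024 timings from the speed table (no 2-step timing was published, so it is left out), then computes each step count's share of peak CLIP and the speed advantage of 4 steps.

```python
# CLIP score per step count (steps table) and 1024x1024 timings in
# seconds (speed table). No 2-step timing appears in the article.
clip_by_steps = {2: 0.375, 4: 0.373, 8: 0.369, 12: 0.367}
time_by_steps = {4: 0.57, 8: 1.00, 12: 1.43}

peak = max(clip_by_steps.values())
share_of_peak = {s: c / peak for s, c in clip_by_steps.items()}

# How many times faster 4 steps is than each slower setting.
speedup_of_4 = {s: t / time_by_steps[4] for s, t in time_by_steps.items()}

for steps in (4, 8, 12):
    print(
        f"{steps} steps: {share_of_peak[steps]:.1%} of peak CLIP, "
        f"4-step run is {speedup_of_4[steps]:.2f}x as fast"
    )
```

The CLIP differences across step counts are all under one percent of peak, while the time cost of extra steps is large.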

Text Rendering: The Litmus Test

For years, text rendering has been the Achilles' heel of diffusion models. Midjourney v5 routinely produced gibberish when asked to render words. DALL-E 2 was barely legible. Even SDXL with specialized training struggled with anything beyond short, common words.

Figure: Text rendering sample. CLIP 0.335. The neon "HELLO" is clearly legible, with consistent letter forms and realistic light diffusion.

FLUX.2-klein-4B scores 0.335 on text rendering — decent but not perfect. In our test prompts, short words (3-6 characters) rendered cleanly in most cases. Longer phrases, unusual fonts, and small text sizes showed occasional letter substitution or blurring. This is a meaningful improvement over SDXL-era models but still falls short of DALL-E 3, which benefits from a separate text understanding module.

For production text overlay, we recommend generating the image with FLUX.2 and compositing text separately. For decorative text, stylized signage, or short labels embedded in scenes, FLUX.2 handles it well enough to be useful.

Consistency and Attribute Binding

Attribute binding measures whether a model correctly associates attributes with the right objects in a prompt. If you ask for "a red car next to a blue house," does the car end up red and the house blue — or do the colors swap?

We tested 6 categories of attribute binding:

Category	Accuracy	What It Tests
Spatial	0.288	Above, below, left, right, behind, in front
Counting	0.282	Exact number of objects
Material	0.269	Wood, metal, glass, fabric textures
Complex	0.265	Multi-attribute combinations
Color	0.252	Specific color assignments to objects
Action	0.246	Subjects performing specific actions

Overall consistency: 0.265. Spatial reasoning leads at 0.288 — the model understands relative positioning better than any other attribute type. Action binding is weakest at 0.246, which is a known limitation across diffusion models: specifying "a person running while holding an umbrella" is harder for the denoising process than specifying "a person next to an umbrella."

Figure: Spatial reasoning sample. CLIP 0.361. The model correctly interprets positional relationships between objects.

What does this mean for production? If your prompts are concrete — specific objects, clear descriptions, single-subject focus — FLUX.2 follows them reliably. If your prompts are compositionally complex with multiple attributed objects and actions, expect occasional misattribution. This is where prompt engineering matters, and it is a limitation shared by every open-weight image model at this parameter count.

Multi-GPU Scaling: Data Parallel Done Right

Unlike language models that use tensor parallelism to split a single inference across GPUs, image generation models scale through data parallelism: each GPU holds a full copy of the model and processes its own batch of images independently. This is simpler and, for image generation, more efficient.
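In data parallelism the only coordination needed is deciding which worker gets which prompts. A minimal sketch of that step follows, assuming a simple round-robin split; the `shard` helper and the example prompts are ours, and the benchmark's actual distribution logic is not described in the article.

```python
def shard(items, n_workers):
    """Split a list round-robin into n_workers roughly equal shards."""
    return [items[i::n_workers] for i in range(n_workers)]

prompts = [f"prompt {i}" for i in range(10)]
shards = shard(prompts, 4)

for gpu_id, work in enumerate(shards):
    # In a real deployment each shard would go to a process that holds
    # its own full copy of the model on GPU `gpu_id`.
    print(f"GPU {gpu_id}: {len(work)} prompts")
```

Because no GPU ever waits on another during generation, the remaining overhead is in distributing prompts and collecting results.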

Scaling Results

GPUs	Throughput (img/s)	Speedup	Efficiency
1	0.43	1.00x	100%
2	0.82	1.93x	97%
4	1.48	3.46x	87%
8	2.53	5.94x	74%

97% efficiency at 2 GPUs is exceptional. The 3% loss comes from batch distribution overhead and result collection — trivial for data parallel. At 4 GPUs, efficiency drops to 87%, and at 8 GPUs to 74%. The 26% efficiency loss at 8 GPUs is primarily due to I/O contention and CPU-side prompt encoding becoming the bottleneck, not GPU compute.

The practical throughput numbers tell the real story:

1 GPU: 0.43 img/s = 1,548 images/hour
2 GPUs: 0.82 img/s = 2,952 images/hour
4 GPUs: 1.48 img/s = 5,328 images/hour
8 GPUs: 2.53 img/s = 9,108 images/hour

An 8-GPU node produces over 9,000 images per hour at 1024x1024. For batch processing workloads — e-commerce catalogs, dataset generation, content pipelines — this is transformative throughput.
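Speedup, efficiency, and hourly output all follow from the throughput column. The sketch below recomputes them from the published figures. Because those throughputs are rounded to two decimals, the recomputed speedups come out slightly below the table's (about 1.91x rather than 1.93x at 2 GPUs), which presumably reflects unrounded measurements in the original data.

```python
# Measured throughput in images per second, copied from the scaling table.
throughput = {1: 0.43, 2: 0.82, 4: 1.48, 8: 2.53}

baseline = throughput[1]
rows = {}
for gpus, ips in throughput.items():
    speedup = ips / baseline
    rows[gpus] = {
        "speedup": speedup,
        "efficiency": speedup / gpus,   # fraction of ideal linear scaling
        "images_per_hour": ips * 3600,
    }

for gpus, r in rows.items():
    print(
        f"{gpus} GPU(s): {r['speedup']:.2f}x, "
        f"{r['efficiency']:.0%} efficient, "
        f"{r['images_per_hour']:.0f} images/hour"
    )
```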

Maximum Batch Throughput

Our InferenceMax benchmark pushed batch sizes from 1 to 16 on a single GPU at 1024x1024, 4 steps:

Batch Size	Throughput (img/s)
1	1.75
2	1.82
4	1.89
8	1.91
16	1.88

Peak throughput is 1.91 img/s at batch size 8, a 9% improvement over single-image generation. Batch size 16 shows a slight regression, which suggests the GPU is already saturated by batch size 8. For latency-sensitive applications, batch size 1 is fine. For throughput-optimized pipelines, batch sizes of 4-8 squeeze out meaningful extra performance.
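Picking a batch size from a ramp like this comes down to finding the peak and checking how much it buys over batch size 1. A short sketch using the table's numbers:

```python
# Throughput in images per second by batch size, from the table above.
batch_throughput = {1: 1.75, 2: 1.82, 4: 1.89, 8: 1.91, 16: 1.88}

best_batch = max(batch_throughput, key=batch_throughput.get)
gain = batch_throughput[best_batch] / batch_throughput[1] - 1.0

print(f"best batch size: {best_batch}")
print(f"gain over batch size 1: {gain:.1%}")
```

Keep in mind that a larger batch raises the latency of each individual request, since every image in the batch finishes at the same time.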

The Economics of Image Generation

Here is where self-hosted FLUX.2 becomes a compelling story. Using our generation time data and current GPU pricing:

Cost Per Image (1024x1024, 4 Steps)

Service	Cost per Image	Images per $1
FLUX.2 self-hosted (Lambda H100, $2.49/hr)	$0.0004	2,500
FLUX.2 self-hosted (RunPod H100, $3.09/hr)	$0.0005	2,000
SDXL via API (Stability AI)	$0.002-0.006	167-500
Midjourney (subscription, ~$0.01-0.06/img)	$0.01-0.06	17-100
DALL-E 3 (OpenAI API)	$0.04	25

At $0.0004 per image, self-hosted FLUX.2 is 100x cheaper than DALL-E 3 and 25-150x cheaper than Midjourney. Even compared to other self-hosted options running SDXL, FLUX.2's speed advantage translates directly into cost savings — the same GPU generates 5-10x more images per hour.

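The breakeven math is straightforward. A Lambda H100 at $2.49/hour produces approximately 6,372 images per hour at 1024x1024 with 4 steps (1.77 img/s), which is where the $0.0004 per image figure comes from. At 512x512 the same GPU produces roughly 18,500 images per hour (5.14 img/s), while the more conservative 0.43 img/s single-GPU rate from the multi-GPU scaling suite works out to 1,548 images per hour. If you are generating more than 1,000 images per month, self-hosting on-demand GPUs is cheaper than any API. If you are generating more than 100,000 images per month, the savings are staggering.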
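The per-image cost is just the hourly GPU price divided by hourly output. The sketch below reproduces the self-hosted rows of the cost table from the 1.77 img/s figure measured at 1024x1024 with 4 steps; the function name is ours, and the prices are the ones quoted in the table.

```python
def cost_per_image(hourly_price_usd, images_per_sec):
    """Dollars per image for a GPU billed by the hour at full utilization."""
    return hourly_price_usd / (images_per_sec * 3600)

IMAGES_PER_SEC = 1.77  # 1024x1024, 4 steps, single H100 (speed table)

providers = {"Lambda H100": 2.49, "RunPod H100": 3.09}
costs = {name: cost_per_image(price, IMAGES_PER_SEC)
         for name, price in providers.items()}

for name, cost in costs.items():
    print(f"{name}: ${cost:.4f} per image, {1 / cost:.0f} images per $1")

# Ratio against the $0.04 per image API price quoted for DALL-E 3.
print(f"DALL-E 3 vs Lambda: {0.04 / costs['Lambda H100']:.0f}x")
```

This assumes the GPU is kept busy for the whole billed hour; idle time, model loading, and storage would raise the effective cost.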

Run your own numbers with our GPU cost calculator.

Industry Applications

FLUX.2-klein-4B's combination of speed, quality, and low cost opens production use cases that were previously impractical or prohibitively expensive:

E-Commerce and Product Photography

Generate product variations, lifestyle shots, and background swaps at scale. At 9,100 images/hour on an 8-GPU node, you can regenerate an entire product catalog overnight. The photo realism score (CLIP 0.373) is the model's strongest category — exactly what product photography demands.

Gaming and Creative Production

Concept art iteration, texture generation, and style exploration at interactive speeds. A game artist can generate and evaluate 5 variations per second at 512x512, making AI-assisted design a real-time workflow rather than a batch process.

Figure: Composition sample. CLIP 0.351. The model follows scene layout instructions, useful for concept art and creative direction.

Advertising and Marketing

Rapid creative iteration and A/B test visual generation. At $0.0004/image, generating 100 ad variations costs 4 cents. Test every visual concept, every background, every style: the cost of experimentation drops to nearly zero.

Architecture and Real Estate

Visualization rendering, virtual staging, and property marketing. FLUX.2's spatial reasoning (CLIP 0.361) ensures rooms, furniture, and lighting are placed coherently.

Media and Publishing

Editorial illustration, social media content, and visual storytelling at scale. A single newsroom GPU can produce every illustration needed for daily publication.

Figure: Macro detail sample. CLIP 0.372. Close-up subjects render with sharp detail and natural textures.

FLUX.2 vs the Competition

An honest comparison based on publicly available data and our own benchmark results:

Model	Params	1024x1024 Time	Quality	Cost Model
FLUX.2-klein-4B	4B	0.57s (4-step)	CLIP 0.335 avg	Open-weight, self-host
SDXL	6.6B	3-8s (20-25 step)	Comparable CLIP	Open-weight, self-host
SD 3.5 Large	8B	2-5s (28 step)	Slightly higher CLIP	Open-weight, self-host
FLUX.1 Schnell	12B	~1s (2-step)	Lower detail	Open-weight, self-host
Midjourney v6	Unknown	15-60s (queue)	Higher aesthetics	Subscription API only
DALL-E 3	Unknown	5-15s (API)	Better instruction following	$0.04/image API

FLUX.2 vs SDXL: Clear winner on speed (5-10x faster) with comparable quality. If you are currently running SDXL in production, FLUX.2 is a straightforward upgrade: it uses the same Diffusers library (with FluxPipeline in place of StableDiffusionXLPipeline) and the same deployment pattern, with dramatically better throughput.

FLUX.2 vs Midjourney v6: Midjourney produces more aesthetically polished images, particularly for artistic and photographic styles. But it is proprietary, API-only, queue-based, and 25-150x more expensive per image. For production pipelines that need control, speed, and cost efficiency, FLUX.2 wins on every axis except raw aesthetic preference.

FLUX.2 vs DALL-E 3: DALL-E 3 has superior instruction following and text rendering — OpenAI's model benefits from a separate language understanding stage that FLUX.2 lacks. But at $0.04/image, DALL-E 3 costs 100x what self-hosted FLUX.2 does. At scale, that difference is the entire business case.

FLUX.2 vs FLUX.1 Schnell: Schnell is a 2-step model optimized for maximum speed at the cost of detail. FLUX.2's 4-step generation gives noticeably better fine detail and text rendering while remaining sub-second at 1024x1024. If you need the absolute fastest generation and can accept rougher output, Schnell still has a place. For quality-sensitive work, FLUX.2 is the upgrade.

How to Deploy

FLUX.2-klein-4B uses the standard Hugging Face Diffusers pipeline. A minimal deployment:

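pip install diffusers transformers accelerate torch

import torch
from diffusers import FluxPipeline

# Load the weights in bfloat16 (the precision used throughout this benchmark)
# and move the whole pipeline to the GPU.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-4B",
    torch_dtype=torch.bfloat16,
).to("cuda")

# 4 steps at 1024x1024 is the configuration benchmarked at 0.57 s per image.
image = pipe(
    prompt="A photorealistic golden retriever puppy in a sunlit garden",
    num_inference_steps=4,
    height=1024,
    width=1024,
).images[0]

image.save("output.png")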

For production deployment with multi-GPU data parallel, Docker containerization, and batch processing pipelines, see our deployment guides or contact us at inferencebench.io/support.

To find the optimal GPU configuration for your image generation workload, use our workload matcher.

Conclusion: The Democratization of Image Generation

FLUX.2-klein-4B is not the highest-quality image model available. Midjourney produces more polished aesthetics. DALL-E 3 follows complex instructions more reliably. Larger diffusion models can generate finer detail.

But none of those models can do what FLUX.2 does at its price point and speed.

At 4 billion parameters, it runs on consumer hardware — an RTX 4090 handles it comfortably, and even an RTX 4080 can manage lower resolutions.

At 0.19 seconds per image (512x512), it enables real-time applications that were previously impossible with open-weight models — live previews, interactive editing, instant feedback loops.

At $0.0004 per image, it makes every API-based image service look expensive. Two thousand five hundred images for a dollar. One hundred thousand images for forty dollars of GPU time.

The multi-GPU scaling is nearly linear up to 4 GPUs (87% efficiency), and even an 8-GPU node maintains 74% efficiency while producing over 9,000 images per hour. For batch processing at enterprise scale, this is production-ready infrastructure.

The model is honest about its limitations. Diagram generation is weak (CLIP 0.263). Text rendering is decent but not flawless (0.335). Complex multi-attribute prompts occasionally misassign properties. These are real boundaries that affect real deployment decisions.

But within its strengths — photo realism (0.373), macro detail (0.372), spatial reasoning (0.361) — FLUX.2-klein-4B delivers quality that was state-of-the-art just 18 months ago, at a speed and cost that makes image generation a commodity rather than a luxury. That is what democratization looks like.

Figure: Fine detail sample. CLIP 0.305. The model handles intricate subjects, though the smallest elements can soften.

For more GPU benchmarks and cost analysis across 47 GPUs, 160 models, and 19 providers, visit inferencebench.io.
