Skip to content
All articles
NVIDIANemotronSuper-120BUltra-253BMoEFP8H100inferenceSWOTbenchmarkvLLMcost-analysis

Nemotron Super 120B vs Ultra 253B: NVIDIA's Best Open-Weight Models Benchmarked

We tested three configurations of NVIDIA's Nemotron lineup across 12 benchmarks on 8x H100 SXM. The FP8 quantized Ultra scores higher than BF16, Super delivers 1.7x the throughput, and both completely fail tool use and vision. Full SWOT analysis, deployment economics, and production recommendations inside.

InferenceBench TeamBenchmark Engineering22 min read
Nemotron Super 120B vs Ultra 253B: NVIDIA's Best Open-Weight Models Benchmarked

NVIDIA's Nemotron Lineup: The Complete Picture

NVIDIA's open-weight model strategy has split into two distinct product lines, each optimized for a different production reality. Nemotron-3 Super 120B is a mixture-of-experts (MoE) architecture with 120 billion total parameters but only 12 billion active per forward pass. Nemotron Ultra 253B is a dense transformer based on Llama 3.1 architecture with all 253 billion parameters firing on every token. Both target enterprise deployment on NVIDIA's own H100 hardware, and both represent NVIDIA's bet that open-weight models can compete with proprietary APIs from OpenAI and Anthropic.

The strategic logic is clear. Super is the throughput play: sparse activation means fewer FLOPs per token, which means more tokens per second per GPU dollar. Ultra is the quality play: dense architecture with massive parameter count pushes the quality ceiling as high as possible. Together, they are supposed to cover the entire enterprise inference spectrum. But do they?

We ran both models through InferenceBench's full 12-benchmark evaluation suite to find out. The results challenge several assumptions about quantization, MoE scaling, and what NVIDIA's models can actually do in production.

What We Tested

We evaluated three configurations on identical hardware — an 8x NVIDIA H100 SXM node with NVLink interconnect, served via vLLM with tensor parallelism across all 8 GPUs:

Configuration Model ID Architecture Precision Active Params
Super 120B nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 MoE (120B total) BF16 12B
Ultra 253B BF16 nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 Dense BF16 253B
Ultra 253B FP8 nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 Dense FP8 253B

The benchmark suite covers 12 evaluation dimensions:

  • MT-Bench — 8-category conversational quality (1–10 scale, GPT-4 judged)
  • InferenceMax — Peak throughput at concurrency=128
  • Best throughput — Optimal throughput at production-grade latency
  • Coding — 15 code generation and review tasks
  • IFBench — 20 instruction-following precision tests
  • Long context — 29 needle-in-haystack and multi-fact retrieval tests
  • Multilingual — 32 cross-language tasks
  • Tool use — 15 function-calling scenarios
  • Vision — 12 image understanding tasks
  • Throughput heatmaps — ISL x OSL x concurrency matrix
  • Latency profiling — TTFT, TPOT, E2E across concurrency levels
  • Pareto frontier — Throughput vs latency optimal operating points

Every test was run on the same physical node with the same vLLM version. No cherry-picking, no warm-up exclusions, no "best of N" reporting. What follows is what the hardware actually produced.

The Head-to-Head Results

Here is the complete comparison across all three configurations. Green indicates the best result in each row.

Metric Super 120B (BF16) Ultra 253B (BF16) Ultra 253B (FP8)
MT-Bench (overall) 6.0 / 10 9.2 / 10 9.47 / 10
InferenceMax (tok/s) 6,567 3,807 5,340
Best throughput (tok/s) 3,275 1,145 1,669
Coding 93% (14/15) 100% (15/15) 100% (15/15)
IFBench 40% (8/20) 40% (8/20) 40% (8/20)
Long context 100% (29/29) 100% (29/29) 100% (29/29)
Multilingual 100% (32/32) 100% (32/32) 100% (32/32)
Tool use 0% (0/15) 0% (0/15) 0% (0/15)
Vision 0% (0/12) 0% (0/12) 0% (0/12)

Three results jump off this table immediately: Ultra FP8 beats Ultra BF16 on quality. Super crushes both on throughput. And both models register zero on tool use and vision. We will dig into each of these in the sections that follow.

Quality: Ultra Dominates

The quality gap between Super and Ultra is not a marginal difference — it is a generational divide. Ultra at 9.2–9.47 MT-Bench competes with GPT-4 class models. Super at 6.0 sits below models half its parameter count. Here is the category-by-category breakdown:

MT-Bench Category Super 120B Ultra BF16 Ultra FP8
Writing 8.5 9.0 9.0
STEM 7.0 9.5 9.2
Coding 7.0 9.8 9.0
Roleplay 6.2 8.8 9.2
Reasoning 5.8 10.0 10.0
Humanities 5.0 9.5 9.5
Math 4.0 7.8 10.0
Extraction 3.0 9.5 10.0

The pattern is stark. Super's strengths cluster at the top of its distribution — writing at 8.5 and STEM at 7.0 are respectable. But the bottom half collapses: math at 4.0 and extraction at 3.0 are scores that would fail a production quality gate in any serious enterprise deployment. An extraction score of 3.0 means the model cannot reliably pull structured data from unstructured text. For customer support, legal analysis, or any workflow involving data extraction, this is disqualifying.

Ultra, by contrast, scores 9.0 or above in every category except math (7.8 in BF16). Its reasoning score of 10.0/10 is perfect. Its coding score in BF16 (9.8) and functional coding accuracy (100% on our 15-task suite) make it one of the strongest open-weight code models available. The quality ceiling of 253 billion dense parameters is doing exactly what it should: storing enough knowledge and reasoning patterns to compete at the frontier.

The coding benchmark further confirms this gap. Ultra achieved a perfect 15/15 on our code generation and review tasks. Super scored 14/15 — strong, but that single failure was on a complex multi-file refactoring task that required sustained reasoning across long contexts. At scale, a 93% pass rate versus 100% compounds quickly: over 1,000 code reviews, Super would miss approximately 70 issues that Ultra catches.

Throughput: Super's MoE Advantage

If Ultra wins on quality, Super wins on speed — and the margin is decisive.

Throughput Metric Super 120B Ultra BF16 Ultra FP8
InferenceMax (conc=128) 6,567 tok/s 3,807 tok/s 5,340 tok/s
Best throughput (production) 3,275 tok/s 1,145 tok/s 1,669 tok/s
Throughput config ISL=128, OSL=512, conc=32 ISL=512, OSL=1024, conc=32 ISL=512, OSL=1024, conc=32

Super at 6,567 tok/s is 1.7x faster than Ultra BF16 at peak concurrency and 1.23x faster than Ultra FP8. At production-grade concurrency (conc=32), the gap widens even further: Super at 3,275 tok/s is 2.86x faster than Ultra BF16 and 1.96x faster than Ultra FP8.

The explanation is architectural. Super's MoE design activates only 12 billion of its 120 billion total parameters on each forward pass. That means each token requires roughly the same compute as a 12B dense model, even though the model has the knowledge capacity of a much larger architecture. The remaining 108 billion parameters sit idle per token but store specialized knowledge in different expert sub-networks, available when the routing network selects them.

On 8x H100 SXM, this translates to dramatically lower per-token latency. The attention computation scales with active parameters, not total parameters. The feed-forward network — which accounts for roughly two-thirds of transformer FLOPs — only runs through the active experts. The result is that Super processes tokens almost as fast as a 12B dense model while (theoretically) accessing the knowledge base of a 120B model.

However, the quality results show that "theoretically" is doing heavy lifting in that sentence. Super's MT-Bench of 6.0 does not reflect 120B worth of knowledge — it reflects the limitations of routing 12B active parameters across a quality distribution. The MoE architecture buys throughput, not quality proportional to total parameters.

The FP8 Paradox: Lower Precision, Higher Quality

The most counterintuitive finding in this benchmark is that Ultra FP8 scores higher on MT-Bench than Ultra BF16: 9.47 versus 9.2. This challenges the foundational assumption in model deployment that quantization always degrades quality. You reduce numerical precision, you lose information, quality drops. That is the standard expectation. Ultra FP8 breaks it.

Metric Ultra BF16 Ultra FP8 Delta
MT-Bench (overall) 9.2 9.47 +0.27
Math 7.8 10.0 +2.2
Extraction 9.5 10.0 +0.5
Roleplay 8.8 9.2 +0.4
InferenceMax (tok/s) 3,807 5,340 +40%
Best throughput (tok/s) 1,145 1,669 +46%
Coding (functional) 100% 100% 0

Look at the math score: 7.8 in BF16 jumps to a perfect 10.0 in FP8. Extraction goes from 9.5 to 10.0. Roleplay from 8.8 to 9.2. Meanwhile, FP8 is 40% faster at peak throughput. Better quality and better performance. This needs explaining.

Why FP8 Might Outperform BF16

Several mechanisms could explain this phenomenon, and the honest answer is that the ML research community does not yet have a definitive explanation. Here are the leading hypotheses:

1. Stochastic rounding as implicit regularization. FP8 quantization introduces small, semi-random perturbations to weight values. These perturbations can act as a form of regularization, similar to dropout. For a model at 253 billion parameters, there may be mild overfitting to certain patterns in the training distribution. The noise introduced by FP8 quantization could reduce this overfitting, particularly on benchmark-style tasks where precision matters less than robust generalization.

2. Reduced attention sink concentration. BF16 allows very large attention logit values, which can create "attention sinks" where a few tokens dominate the attention distribution. FP8's narrower dynamic range naturally clips extreme attention scores, potentially producing a more uniform attention distribution that improves response quality on tasks requiring broad context integration — like math reasoning and multi-step extraction.

3. NVIDIA's FP8 quantization is not naive. The FP8 implementation in vLLM and TensorRT-LLM uses per-tensor or per-channel scaling factors calibrated on representative data. This is not a blind truncation of precision — it is a carefully engineered compression that preserves the weight distributions most critical for quality. NVIDIA has invested heavily in making FP8 a first-class precision format on H100's Transformer Engine, and the calibration may have been optimized specifically for Nemotron Ultra's weight distribution.

4. Benchmark variance. MT-Bench uses GPT-4 as a judge, and a 0.27-point difference (9.47 vs 9.2) is within the range of run-to-run variance for LLM-as-judge evaluation. However, the per-category improvements are consistent across math (+2.2), extraction (+0.5), and roleplay (+0.4), which makes pure variance less likely as the sole explanation.

Whatever the mechanism, the practical implication is unambiguous: there is no reason to deploy Ultra in BF16 when FP8 is available on H100. You get better quality, 40% more throughput, and approximately half the memory footprint. FP8 is not a compromise — it is the superior deployment format for this model.

SWOT Analysis

Nemotron-3 Super 120B (MoE)

Analysis
Strengths
  • 6,567 tok/s peak throughput — highest of the three configurations by a wide margin
  • MoE efficiency: 12B active params means per-token compute cost is comparable to a small dense model
  • Perfect long context (100%, 29/29) and multilingual (100%, 32/32) scores
  • Strong coding at 93% (14/15)
  • Writing quality at 8.5/10 is usable for content generation workloads
Weaknesses
  • MT-Bench 6.0/10 overall — mediocre quality that fails enterprise quality gates
  • 0% tool use (0/15) — complete failure on function calling
  • 0% vision (0/12) — no image understanding capability
  • Extraction at 3.0/10 — cannot reliably extract structured data from text
  • Math at 4.0/10 and reasoning at 5.8/10 — below the threshold for analytical workloads
  • IFBench at 40% — frequently misinterprets or ignores instructions
Opportunities
  • High-throughput batch processing where speed matters more than per-response quality
  • RAG pipelines where retrieval quality compensates for model quality deficits
  • Cost-sensitive deployments that need the lowest cost per token on H100 hardware
  • Content generation pipelines with human review loops (writing at 8.5 is good enough for drafts)
  • Embedding or pre-processing stages where the model is not the final output
Threats
  • Dense models like Gemma 4 31B achieve 9.73 MT-Bench on a single GPU at a fraction of the cost
  • Quality-per-dollar is poor: 6.0 MT-Bench from 8 GPUs vs 9.73 from 1 GPU (Gemma 4)
  • The MoE routing does not translate its 120B total params into proportional quality
  • Models with tool use capability (Gemma 4, Qwen 2.5) can serve workloads Super cannot

Nemotron Ultra 253B (Dense)

Analysis
Strengths
  • MT-Bench 9.2–9.47 — competitive with GPT-4 class models across all categories
  • Perfect reasoning (10/10) — one of the highest reasoning scores in any open-weight model
  • Perfect coding (100%, 15/15) — flawless on functional code generation and review
  • FP8 sweet spot: better quality than BF16, 40% faster, half the memory
  • Perfect long context (100%) and multilingual (100%)
  • Strong across all MT-Bench categories — no score below 7.8 (BF16) or 9.0 (FP8)
Weaknesses
  • Requires 8x H100 minimum — the most expensive open-weight deployment tier
  • 0% tool use (0/15) — complete failure on function calling, identical to Super
  • 0% vision (0/12) — no image understanding despite 253B parameters
  • 3,807–5,340 tok/s — slower than Super, which limits throughput-sensitive workloads
  • IFBench at 40% — instruction following is mediocre across both Nemotron models
  • Dense 253B architecture means no path to single-GPU or even 4-GPU deployment
Opportunities
  • Enterprise reasoning tasks: legal analysis, scientific research, complex Q&A
  • Code generation and review pipelines where 100% accuracy justifies the hardware cost
  • Replacing proprietary API dependencies (GPT-4, Claude) with on-premises deployment
  • FP8 deployment as the default — simultaneously cheaper and better than BF16
  • Research organizations needing frontier-quality open-weight reasoning
Threats
  • Frontier proprietary models (Claude Opus 4, GPT-4) dominate on tool use and multimodal
  • Gemma 4 31B achieves 9.73 MT-Bench on 1 GPU — higher quality at 1/8 the hardware cost
  • The 8x H100 requirement puts Ultra out of reach for most organizations
  • 0% tool use makes Ultra unusable for the fastest-growing AI workload category (agentic AI)
  • NVIDIA Blackwell B200 (192GB HBM3e) may shift the value proposition toward smaller dense models

The Economics

Deployment cost analysis requires concrete throughput-to-dollar conversions. Using Lambda Labs pricing of $19.92/hr for an 8x H100 SXM node (representative of major cloud providers in early 2026; see InferenceBench Calculator for current rates):

Configuration Peak tok/s $/hr $/M tokens (peak) $/M tokens (prod, conc=32) MT-Bench $/quality point
Super 120B BF16 6,567 $19.92 $0.84 $1.69 6.0 $0.28
Ultra 253B BF16 3,807 $19.92 $1.45 $4.83 9.2 $0.53
Ultra 253B FP8 5,340 $19.92 $1.04 $3.31 9.47 $0.35

$/M tokens = (hourly_cost / throughput_tok_s) × (1,000,000 / 3,600). $/quality point = $/M tokens (peak) / MT-Bench score.

The economics tell three stories:

Super is the cheapest per token at $0.84/M tokens at peak throughput. If your workload is pure token volume and quality is secondary, Super delivers the most tokens per dollar on 8x H100. But the quality cost is hidden in that number: at 6.0 MT-Bench, many of those tokens will be low-quality responses that require human review or re-generation, effectively doubling or tripling the true cost.

Ultra BF16 is the most expensive per token at $1.45/M tokens at peak and a painful $4.83/M at production concurrency. You are paying a premium for quality, but FP8 eliminates the need to pay this premium entirely.

Ultra FP8 is the optimal cost-per-quality-point. At $1.04/M tokens (peak) and 9.47 MT-Bench, the cost-per-quality-point is $0.35 — significantly lower than Ultra BF16 ($0.53) and only marginally higher than Super ($0.28). But that marginal cost difference buys you a 58% quality improvement (9.47 vs 6.0). For any workload where response quality matters, FP8 is the clear economic winner.

Who Should Deploy What

Based on the benchmark data, here is a clear recommendation matrix:

Workload Recommendation Why
High-throughput batch / RAG Super 120B 6,567 tok/s peak. Quality is compensated by retrieval in RAG. Cheapest per token.
Enterprise chat / reasoning Ultra FP8 9.47 MT-Bench. Perfect reasoning (10/10). Best quality-per-dollar ratio.
Code generation / review Ultra (either precision) 100% coding accuracy. Perfect scores on multi-file refactoring tasks.
Agentic AI / tool use NEITHER 0% tool use across all configs. Use Gemma 4, Qwen 2.5, or Claude instead.
Vision / multimodal NEITHER 0% vision across all configs. Use a dedicated VLM.
Budget-constrained Super 120B 1.7x cheaper per token than Ultra FP8. Accept the quality tradeoff.
Long context retrieval Either (both 100%) Both achieve perfect scores on needle-in-haystack and multi-fact retrieval.
Multilingual Either (both 100%) Both achieve perfect scores on 32 cross-language tasks.

The pattern is clear: Ultra FP8 is the default recommendation for any quality-sensitive workload. Super is the fallback for throughput-sensitive or cost-constrained scenarios. And for agentic AI or vision, neither Nemotron model is viable.

The Tool Use and Vision Elephant

Both Nemotron models scored 0% on tool use (0 out of 15 function-calling scenarios) and 0% on vision (0 out of 12 image understanding tasks). This is not a marginal weakness. It is a complete, categorical failure across two of the most important capability dimensions for modern AI deployment.

Let us be direct about what this means.

Tool use at 0% means these models cannot participate in agentic AI workflows. They cannot call functions, parse function schemas, route to the correct tool, or extract arguments from natural language queries. Agentic AI — where an LLM plans, executes tools, evaluates results, and iterates — is the fastest-growing production AI pattern. Every major framework (LangChain, LlamaIndex, CrewAI, AutoGen) depends on tool use. A model that scores 0% on tool use is excluded from this entire category of deployment.

Vision at 0% means these models are text-only in practice. Despite NVIDIA's broader investment in multimodal AI, neither Nemotron Super nor Ultra can process images, charts, screenshots, documents with embedded figures, or any other visual content. In an era where competing models (GPT-4o, Claude, Gemini) handle text and images seamlessly, shipping a text-only model at 253B parameters feels like a deliberate scope limitation rather than a technical constraint.

For context, in our Gemma 4 benchmark, a 31B model on a single GPU achieved non-zero tool use and vision scores. The fact that NVIDIA's 253B flagship cannot match a model 8x smaller on these dimensions is notable.

The production implication: if your deployment requires tool use or multimodal capabilities — and an increasing percentage of enterprise deployments do — Nemotron is not an option. Full stop. For agentic workloads, consider Gemma 4 31B (single GPU, tool use capable), Qwen 2.5 72B (2 GPUs, strong tool use), or proprietary APIs (Claude, GPT-4) where function calling is a first-class feature.

How to Deploy

All three configurations were served via vLLM on 8x H100 SXM with tensor parallelism. Here are the deployment commands:

Super 120B (BF16)

docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.92

Ultra 253B (BF16)

docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.95

Ultra 253B (FP8) — Recommended

docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --max-model-len 32768 \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.92

The key difference for FP8 is the --quantization fp8 flag, which activates NVIDIA H100's Transformer Engine FP8 kernels. This halves the memory footprint of model weights and enables the 40% throughput improvement we measured. No additional calibration data is required — vLLM handles the per-tensor scaling automatically.

For production deployments, contact InferenceBench Support for full serve.sh scripts with health checks, auto-scaling, and monitoring configuration.

Conclusion

NVIDIA's Nemotron lineup presents a genuine architectural tradeoff rather than a simple "bigger is better" story. Here is what the data says:

Ultra FP8 is NVIDIA's best open-weight model for quality-sensitive workloads. At 9.47 MT-Bench with perfect reasoning and coding scores, it competes with GPT-4 class models while running entirely on your own hardware. The fact that FP8 quantization improves quality over BF16 while delivering 40% more throughput makes it the unambiguous deployment recommendation over BF16. Deploy Ultra in BF16 only if your serving stack does not support FP8 — and if it does not, upgrade your serving stack.

Super is the throughput king for budget-constrained or batch-heavy workloads. At 6,567 tok/s, it processes tokens faster than anything else in the Nemotron family. But its quality ceiling of 6.0 MT-Bench limits it to workloads where speed matters more than per-response accuracy: RAG pre-processing, content draft generation with human review, or embedding pipelines where the model's output is an intermediate step.

Neither model is ready for agentic AI or multimodal deployment. The 0% tool use and 0% vision scores are not edge cases — they are complete capability gaps that exclude both models from two of the most important production AI categories. NVIDIA's open-weight models are text-in, text-out. If your production system needs function calling or image understanding, look elsewhere.

The FP8 paradox is the story that matters most for the broader industry. If a carefully calibrated lower-precision format can match or exceed full-precision quality while cutting memory and boosting throughput, the implications extend far beyond Nemotron. Every model deployer should be testing FP8 on H100 — you may be paying a BF16 tax for quality you are not getting.

Use the InferenceBench Workload Matcher to find the right model for your traffic pattern, the Compare tool to evaluate Nemotron against other models head-to-head, and the Calculator to model your deployment economics.

More articles