
NVIDIA Alpamayo 1.5-10B on H100: Autonomous Driving Inference Benchmark

We ran 5 inference tests on NVIDIA's vision-language driving model to measure trajectory prediction, visual QA, and probabilistic planning on a single H100 GPU.

By InferenceBench Team, Benchmark Engineering. 18 min read.
Introduction: What Is NVIDIA Alpamayo?

In early 2025, NVIDIA Research released Alpamayo 1.5-10B, a 10-billion parameter vision-language model designed specifically for autonomous driving. Unlike conventional trajectory prediction systems that operate as black-box regressors, Alpamayo generates natural language reasoning about a driving scene before outputting physical trajectory coordinates. NVIDIA calls this paradigm Chain-of-Causation (CoC) reasoning, and it represents a fundamental shift in how autonomous systems can explain their decisions.

The model processes six surround-view camera feeds simultaneously, fuses them into a coherent scene understanding, and then produces both a language explanation of the driving situation and a set of future trajectory waypoints. This dual-output architecture means you can ask Alpamayo why it chose a particular path, not just what path it chose. For safety-critical applications, that distinction matters enormously.

We wanted to know: how does Alpamayo actually perform on real driving data, what does it need in terms of GPU infrastructure, and what can you do with it today? We ran five distinct inference tests on the AIAV dataset using a single NVIDIA H100 SXM GPU and recorded everything. This post presents the full results.

What We Tested

Our test script (test_alpamayo_full.py) exercised five inference modes that cover the breadth of Alpamayo's capabilities:

| Test | Mode | What It Measures |
|---|---|---|
| 1 | Chain-of-Causation Trajectory Prediction | CoC reasoning quality + trajectory accuracy (minADE) |
| 2 | Visual Question Answering (VQA) | Scene understanding via free-form Q&A |
| 3 | Navigation-Conditioned Trajectories | How natural language nav commands steer predictions |
| 4 | BEV Counterfactual Analysis | Nav vs. no-nav vs. contradicting instructions |
| 5 | Trajectory Distribution (32 Samples) | Uncertainty quantification at temperature=0.8 |

Hardware Setup

All tests ran on a single NVIDIA H100 SXM GPU with 80 GB of HBM3 memory. The model was loaded in bfloat16 precision, consuming approximately 20 GB of VRAM at runtime. This leaves roughly 60 GB of headroom for batch processing, KV cache, or running additional models in parallel. No quantization was applied; the weights are held in 16-bit bfloat16 rather than a compressed INT8 or INT4 format.
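The 20 GB figure follows from simple arithmetic on the parameter count. The sketch below reproduces it and shows how the weight footprint would change at other precisions. The byte sizes are the standard ones for each format; the function names are ours, and the result covers weights only, so activations, the vision encoder's working memory, and the KV cache all come out of the remaining headroom.

```python
# Bytes per parameter for common weight formats (standard sizes, not Alpamayo-specific).
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def headroom_gb(gpu_memory_gb: float, num_params: float, dtype: str) -> float:
    """Memory left after loading weights; activations and KV cache come out of this."""
    return gpu_memory_gb - weight_memory_gb(num_params, dtype)


if __name__ == "__main__":
    # 10e9 parameters on an 80 GB GPU, as in our setup.
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weight_memory_gb(10e9, dtype):5.1f} GB weights, "
              f"{headroom_gb(80, 10e9, dtype):5.1f} GB left on an 80 GB GPU")
```

Treat these numbers as a lower bound on real usage: measured VRAM will be somewhat higher once the runtime allocates buffers.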

Test 1: Chain-of-Causation Trajectory Prediction

This is Alpamayo's signature capability. Given six surround-view camera images and ego-vehicle state, the model first generates a Chain-of-Causation explanation in natural language, then predicts a trajectory. We tested on three driving clips from the AIAV dataset, each presenting a different driving challenge.

Results

| Clip | Scenario | minADE (m) | CoC Reasoning |
|---|---|---|---|
| 0 | Construction zone | 0.375 | "Nudge to the left to clear the construction equipment blocking the right side" |
| 1 | Oncoming traffic | 0.254 | "Nudge to the right to increase clearance from the oncoming traffic" |
| 2 | Following lead car | 9.884 | "Keep distance to the lead vehicle since it is directly ahead" |

[Figure: Six-camera surround view for Clip 0: construction zone with equipment blocking the right lane.]

[Figure: Predicted trajectory (blue) vs. ground truth (green) for the construction zone clip. minADE = 0.375 m.]

Analysis

Clips 0 and 1 produced strong results. A minADE of 0.254 meters means the model's best predicted trajectory was, on average, about 25 centimeters from the ground truth across the prediction horizon. For context, that is well under a tenth of a typical lane width of 3 to 3.7 meters. The CoC reasoning is also semantically accurate: the model correctly identifies the construction equipment as a right-side obstruction and the oncoming traffic as a left-side hazard.
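For readers unfamiliar with the metric, here is a minimal sketch of how minADE is conventionally computed: take the average displacement error of each candidate trajectory against the ground truth, then keep the best candidate. The array shapes and the toy data are assumptions for illustration; this is not the evaluation code from our test script.

```python
import numpy as np


def min_ade(predictions: np.ndarray, ground_truth: np.ndarray) -> float:
    """predictions: (K, T, 2) candidate trajectories; ground_truth: (T, 2)."""
    # Euclidean error at every waypoint of every candidate -> shape (K, T)
    errors = np.linalg.norm(predictions - ground_truth[None, :, :], axis=-1)
    # Average over time per candidate, then keep the best candidate
    return float(errors.mean(axis=1).min())


# Toy data (synthetic, not from the benchmark): ground truth drives straight along x.
gt = np.stack([np.linspace(0.0, 20.0, 5), np.zeros(5)], axis=-1)
# Two candidates that run parallel to it, offset laterally by 0.25 m and 1.0 m.
preds = np.stack([gt + np.array([0.0, 0.25]), gt + np.array([0.0, 1.0])])

print(min_ade(preds, gt))  # the 0.25 m candidate wins
```

Because only the best candidate counts, minADE rewards a model for covering the right outcome somewhere in its samples, not for ranking that sample first.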

Clip 2 is the outlier at 9.884 meters. This is a car-following scenario where the lead vehicle's future behavior is inherently uncertain. The model correctly identifies the situation ("keep distance to the lead vehicle") but the trajectory diverges significantly, likely because the lead car's actual behavior over the prediction horizon was difficult to anticipate from a single frame. This is not a failure of the model so much as a demonstration of where single-frame prediction hits its limits. In production, you would feed a temporal sequence of frames rather than a single timestep.

The Chain-of-Causation reasoning is not just a debugging tool. It provides an auditable decision trace that regulators and safety engineers can inspect after the fact. Every trajectory comes with its justification.

Test 2: Visual Question Answering (VQA)

Alpamayo is not limited to trajectory prediction. Because it is built on a vision-language architecture, you can ask it arbitrary questions about the driving scene. We tested five questions on the construction zone clip (Clip 0).

| Question | Alpamayo Response |
|---|---|
| "What obstacles are in the scene?" | "Construction equipment blocking right side of lane" |
| "Describe the traffic situation" | Detailed description of construction zone, following lead car, lane constraints |
| "Is it safe to change lanes?" | "No, could lead to collision with oncoming traffic" |
| "What is the speed limit?" | "25 mph" |
| "Are there pedestrians?" | Returns bounding box: [0.762, 0.419, 0.81, 0.594] |

Analysis

The obstacle identification and traffic situation descriptions are grounded and specific. Across these five queries the model did not describe objects that were absent from the scene, and its lane-change safety assessment is correct given the scene context. Five questions on one clip is a small sample, so this should not be read as a hallucination rate. The speed limit response is interesting: the model appears to be reading a sign visible in the camera feed, though this should be verified against the actual signage in the AIAV dataset.

The pedestrian detection response is notable because it returns a bounding box in normalized coordinates rather than a natural language description. This suggests the model has been trained to output structured detection data when appropriate, which is far more useful for downstream systems than a text description of where a pedestrian might be.
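To use a box like this downstream, it has to be scaled to pixel coordinates. The sketch below assumes the four values are ordered [x_min, y_min, x_max, y_max] and normalized to the image size; the model output does not state the ordering, so check it against the model documentation before relying on it. The 1920x1080 resolution is also just an example value.

```python
from typing import Sequence, Tuple


def bbox_to_pixels(box: Sequence[float], width: int, height: int) -> Tuple[int, int, int, int]:
    """Scale a normalized [x_min, y_min, x_max, y_max] box to integer pixel coordinates."""
    x_min, y_min, x_max, y_max = box
    # Reject boxes that do not fit the assumed ordering or range.
    if not (0.0 <= x_min <= x_max <= 1.0 and 0.0 <= y_min <= y_max <= 1.0):
        raise ValueError(f"not a normalized x_min/y_min/x_max/y_max box: {box}")
    return (round(x_min * width), round(y_min * height),
            round(x_max * width), round(y_max * height))


# The box returned for "Are there pedestrians?", on an assumed 1920x1080 frame.
pixel_box = bbox_to_pixels([0.762, 0.419, 0.81, 0.594], width=1920, height=1080)
print(pixel_box)
```

The range check is deliberately strict: if the model actually emits a different layout such as [x, y, width, height], some boxes will fail it and surface the mismatch early.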

The VQA capability transforms Alpamayo from a trajectory predictor into a general-purpose driving scene understanding system. You can query it about anything visible in the surround cameras and get actionable answers.

Test 3: Navigation-Conditioned Trajectory Prediction

One of Alpamayo's most powerful features is its ability to condition trajectory predictions on natural language navigation commands. Instead of simply predicting what the ego vehicle will do, you can tell it what the vehicle should do and get a trajectory that follows those instructions while respecting the physical constraints of the scene.

We tested three navigation commands on the same construction zone scene:

| Navigation Command | Expected Behavior |
|---|---|
| "Turn left at the next intersection in 30m" | Trajectory curves left while avoiding construction equipment |
| "Continue straight ahead" | Trajectory maintains current heading with obstacle avoidance |
| "Turn right onto the highway ramp in 50m" | Trajectory curves right past the construction zone |

[Figure: Three trajectories from the same scene, conditioned on different navigation commands (left, straight, right). The model adjusts its predicted path while maintaining safe clearance from obstacles.]

Analysis

The navigation conditioning works as expected. Each command produces a geometrically distinct trajectory that follows the instruction while respecting scene constraints. The left-turn trajectory avoids the construction equipment; the straight trajectory nudges left as in the baseline CoC prediction; the right-turn trajectory plans a path that clears the construction zone before merging right.

This capability is critical for integration with route planners. A typical autonomous driving stack has a route planner that provides high-level navigation commands (turn left in 200m, take the second exit) and a motion planner that translates those into physical trajectories. Alpamayo can serve as that motion planner, accepting natural language commands rather than requiring a formalized waypoint interface.

Test 4: BEV Counterfactual Analysis

To stress-test the navigation conditioning, we compared three modes on the same scene, generating 8 trajectory samples for each:

| Mode | Condition | Samples |
|---|---|---|
| With Navigation | "Turn left at the next intersection in 30m" | 8 |
| Without Navigation | Standard unconditional prediction | 8 |
| Counterfactual | "Turn right at the next intersection in 30m" | 8 |

[Figure: BEV comparison of navigation-conditioned (left), unconditional (center), and counterfactual (right) trajectory distributions. Each mode shows 8 sampled trajectories.]

Analysis

This test confirms that the model genuinely conditions on navigation input rather than ignoring it. The left-turn and right-turn trajectory distributions are clearly separated in BEV space, with the unconditional prediction falling between them. The unconditional trajectories cluster around the CoC-predicted path (nudge left to avoid construction), while the counterfactual right-turn trajectories swing right despite the construction equipment, reflecting the model's attempt to follow the instruction while still maintaining some collision avoidance.
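The separation described above can also be checked numerically rather than by eye. One simple approach, sketched here, compares the mean lateral position of the trajectory endpoints for each mode. We assume an ego-centric frame with x pointing forward and y pointing left, and we use synthetic stand-in trajectories; the helper names are ours and the numbers are not benchmark output.

```python
import numpy as np


def mean_final_lateral_offset(samples: np.ndarray) -> float:
    """samples: (N, T, 2) trajectories in an ego frame, x forward and y to the left."""
    return float(samples[:, -1, 1].mean())


def modes_are_ordered(left: np.ndarray, unconditional: np.ndarray, right: np.ndarray) -> bool:
    """True if the unconditional endpoints sit laterally between the two commanded sets."""
    left_y, mid_y, right_y = (mean_final_lateral_offset(s)
                              for s in (left, unconditional, right))
    return left_y > mid_y > right_y


def toy_samples(final_y: float, n: int = 8, steps: int = 10) -> np.ndarray:
    """Synthetic stand-in for model output: n paths that drift to about final_y metres."""
    progress = np.linspace(0.0, 1.0, steps)
    jitter = np.linspace(-0.5, 0.5, n)  # deterministic spread between samples
    x = np.tile(progress * 30.0, (n, 1))
    y = (final_y + jitter)[:, None] * progress[None, :] ** 2
    return np.stack([x, y], axis=-1)


left = toy_samples(6.0)     # "turn left" command
uncond = toy_samples(1.0)   # no command, slight nudge left
right = toy_samples(-6.0)   # counterfactual "turn right" command
print(modes_are_ordered(left, uncond, right))
```

Endpoint means are a coarse summary. With only 8 samples per mode, it is worth also looking at the per-sample endpoints before concluding that two modes are separated.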

The counterfactual mode is particularly valuable for safety validation. By asking "what would the model predict if we gave it a dangerous instruction?", we can verify that the model still maintains safety constraints. In our test, the right-turn counterfactual trajectories do swing right, but they still attempt to clear the construction zone rather than driving directly into it. This suggests some degree of safety awareness even when following potentially hazardous instructions, though it is not a substitute for a dedicated safety layer.

Test 5: Uncertainty Distribution (32 Samples)

The final test generated 32 trajectory samples at temperature 0.8 to characterize the model's uncertainty distribution. Rather than a single "best guess" trajectory, this gives us a probability cloud that shows where the model thinks the vehicle might plausibly go.

[Figure: 32 trajectory samples at temperature=0.8. The spread increases with prediction horizon, reflecting growing uncertainty about the future.]

Analysis

The distribution exhibits the expected funnel shape: tight clustering in the near term (0-1 seconds) that fans out at longer horizons (3-5 seconds). This is physically plausible. The ego vehicle's immediate future is well-determined by its current state, but uncertainty compounds over time as other agents, traffic signals, and road geometry become harder to anticipate.

The practical value of this distribution is twofold. First, a downstream planner can use the spread as a confidence measure: if all 32 samples agree, the model is confident; if they diverge widely, the planner should be more cautious. Second, the distribution can capture multimodal futures. At an intersection, for instance, you would expect to see distinct clusters for "go straight", "turn left", and "turn right" rather than a uniform spread, and a sampled distribution is able to express exactly this kind of structured uncertainty.
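As a rough illustration of the first point, the sketch below turns a set of sampled trajectories into a per-timestep spread and a simple caution flag. The (N, T, 2) array layout, the 1.0 m threshold, and the synthetic fan-shaped samples are all assumptions for the example; a real planner would pick its own measure and threshold.

```python
import numpy as np


def spread_per_step(samples: np.ndarray) -> np.ndarray:
    """samples: (N, T, 2). Returns (T,) mean distance of samples from the per-step centroid."""
    centroid = samples.mean(axis=0, keepdims=True)                    # (1, T, 2)
    return np.linalg.norm(samples - centroid, axis=-1).mean(axis=0)   # (T,)


def needs_caution(samples: np.ndarray, max_spread_m: float = 1.0) -> bool:
    """Flag the scene when the final-step spread exceeds a (hypothetical) threshold."""
    return bool(spread_per_step(samples)[-1] > max_spread_m)


# Synthetic fan of 32 paths over 5 seconds: same forward speed, different lateral drift.
times = np.linspace(0.0, 5.0, 11)
rates = np.linspace(-1.0, 1.0, 32)            # lateral drift in m/s, one per sample
x = np.tile(5.0 * times, (32, 1))
y = rates[:, None] * times[None, :]
samples = np.stack([x, y], axis=-1)           # (32, 11, 2)

spread = spread_per_step(samples)
print(np.round(spread, 2))                    # grows with horizon: the funnel shape
print(needs_caution(samples))
```

One caveat on the design: a single spread number cannot tell a wide unimodal cloud from two tight clusters, so detecting multimodal futures needs a clustering step on top of this.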

A model that outputs a single trajectory is making a bet. A model that outputs a distribution is making a map of possibilities. For safety-critical applications, you want the map.

GPU Infrastructure Requirements

One of the most practical questions for teams evaluating Alpamayo is: what hardware do you actually need?

Minimum Viable Setup

| Component | Specification |
|---|---|
| GPU | NVIDIA H100 SXM (80 GB HBM3) or equivalent |
| VRAM Usage | ~20 GB at bf16 (10B params x 2 bytes) |
| Precision | bfloat16 (native H100 support) |
| Framework | HuggingFace Transformers + PyTorch 2.x |
| Headroom | 60 GB for batching, KV cache, multi-model |

Alternative GPU Options

The 20 GB VRAM footprint means Alpamayo can technically run on GPUs with less memory than the H100:

| GPU | VRAM | Viability | Notes |
|---|---|---|---|
| H100 SXM | 80 GB | Optimal | 60 GB headroom for batching and multi-model |
| A100 80GB | 80 GB | Good | Lower memory bandwidth (about 2 TB/s vs 3.35 TB/s); expect roughly 40% lower generation throughput if decoding is bandwidth-bound (estimate, not measured) |
| A100 40GB | 40 GB | Viable | 20 GB headroom, limited batch size |
| L40S | 48 GB | Viable | PCIe form factor, lower bandwidth, suitable for non-real-time |
| RTX 4090 | 24 GB | Tight | 4 GB headroom at bf16; usable for development, not production |
| H200 | 141 GB | Ideal for multi-model | Run Alpamayo alongside perception and planning models |
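The A100 slowdown in the table can be sanity-checked with a back-of-the-envelope model. If single-stream token generation is limited by memory bandwidth, each new token requires reading all the weights once, so bandwidth divided by weight size gives a ceiling on tokens per second. The sketch below applies that to the bandwidth figures above. It is a simplification we are assuming for illustration: it ignores KV cache reads, the vision encoder, and the compute-bound prefill phase, and it is not a measurement.

```python
def decode_tokens_per_s_upper_bound(bandwidth_tb_s: float, weight_gb: float) -> float:
    """Ceiling on single-stream decode speed if each token reads all weights once."""
    return bandwidth_tb_s * 1e12 / (weight_gb * 1e9)


# 20 GB of bf16 weights; bandwidths as quoted in the table above.
h100 = decode_tokens_per_s_upper_bound(3.35, 20.0)
a100 = decode_tokens_per_s_upper_bound(2.0, 20.0)

print(f"H100 ceiling: {h100:.1f} tok/s, A100 ceiling: {a100:.1f} tok/s")
print(f"A100 relative throughput: {a100 / h100:.2f}")
```

The ratio comes out at about 0.60, which is where the "roughly 40% lower" estimate originates. Real throughput on both GPUs will sit below these ceilings.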

InferenceBench Recommendation

For production autonomous driving workloads, we recommend the H100 SXM or H200 as the baseline. The 60+ GB of headroom is not a luxury; you will need it for batched inference across multiple camera frames, KV cache for multi-turn VQA, and potentially co-locating other models (object detection, segmentation, mapping) on the same GPU. Use InferenceBench's GPU comparison tool to model the cost per inference across providers and find the optimal price-performance point for your deployment scale.

Real-World Applications

Alpamayo's architecture (surround-view vision + language reasoning + trajectory prediction) is designed for autonomous vehicles, but the underlying pattern applies to any domain where you need to fuse multi-camera visual input with structured reasoning and physical action planning.

Autonomous Passenger Vehicles (L2-L5)

This is the primary target. Alpamayo can serve as the prediction and planning backbone of an autonomous driving stack, replacing or augmenting traditional motion planning modules. The CoC reasoning provides the explainability layer that regulators increasingly demand. At L2/L3 (partial and conditional automation, with a human driver still in the loop), the VQA capability can power natural language dashboards that explain the system's decisions to the driver. At L4/L5 (high and full automation), the trajectory distribution output can feed directly into a safety-critical motion planner.

The navigation conditioning is especially relevant for ride-hailing: the route planner provides turn-by-turn instructions in natural language, and Alpamayo translates those into smooth, safe trajectories. No handcrafted waypoint interface needed.

Autonomous Trucking and Logistics

Long-haul trucking faces different challenges than passenger vehicles: longer prediction horizons, larger vehicle dynamics, and highway-dominated driving. Alpamayo's 32-sample uncertainty distribution is particularly valuable here because highway merging and lane changing require understanding the multimodal future behaviors of surrounding traffic. The CoC reasoning also provides the audit trail that fleet operators and insurance companies require.

For last-mile delivery, the VQA capability could answer questions like "Is this driveway blocked?" or "Can the truck fit under this bridge?" using the surround cameras, reducing the need for pre-mapped route constraints.

Construction and Mining Equipment

Autonomous haul trucks, excavators, and loaders operate in unstructured environments where traditional HD maps are unavailable or constantly changing. Alpamayo's ability to reason about obstacles ("construction equipment blocking the right side") translates directly to reasoning about rock piles, trenches, and other equipment. The 6-camera surround view maps well to the multi-camera rigs already standard on large mining equipment.

The VQA mode enables remote operators to query the vehicle's perception system: "What is blocking the haul road?" or "Is the dump zone clear?" This reduces the need for constant video monitoring by human operators.

Agricultural Robotics

Autonomous tractors, sprayers, and harvesters navigate fields with varying crop density, irrigation infrastructure, and terrain. The navigation conditioning allows a farm management system to issue commands like "Follow the row to the end, then turn left into the next row" in natural language. The model's obstacle reasoning handles irrigation pivots, fence posts, and wildlife in the same framework that handles construction equipment on a road.

HVAC and Building Automation

While not a driving application, the sensor fusion + reasoning pattern applies to smart building systems. Replace the six surround cameras with thermal cameras, CO2 sensors, and occupancy detectors. Replace trajectory prediction with HVAC control trajectories (temperature setpoints over time). The CoC reasoning becomes: "Reduce cooling in Zone 3 because occupancy dropped after the meeting ended at 3 PM." The VQA mode becomes: "Why is Zone 7 above setpoint?" The architecture is the same; only the domain-specific training data changes.

This is speculative but worth noting: the vision-language-action paradigm that Alpamayo embodies is not limited to vehicles. Any system that observes the physical world, reasons about it, and takes physical actions can benefit from this architecture.

Warehouse and Manufacturing Robots

Autonomous mobile robots (AMRs) in warehouses navigate dynamic environments with human workers, forklifts, and constantly changing inventory layouts. Alpamayo's multi-camera fusion handles the 360-degree awareness requirement, and its trajectory prediction accounts for the movements of other agents. The VQA capability enables warehouse management systems to query robot perception: "Is aisle 14 blocked?" or "How many pallets are on rack B7?"

In manufacturing, collaborative robots (cobots) working alongside humans need to predict human motion trajectories to maintain safety. The uncertainty distribution output directly serves this need: wider distribution means the cobot should slow down or stop.

Delivery Robots and Drones

Sidewalk delivery robots and urban delivery drones face the same perception challenges as autonomous vehicles but at a different scale. Pedestrian detection, obstacle avoidance, and navigation conditioning all apply directly. The 10B parameter count and 20 GB VRAM footprint are small enough to run on edge GPUs like the NVIDIA Jetson AGX Orin (with quantization), making onboard inference feasible for delivery robots without cloud connectivity.

Alpamayo represents a specific moment in the convergence of vision-language models and robotics. Here is how we see this trajectory evolving over the next decade:

| Period | Development | Implication |
|---|---|---|
| 2025-2026 | VLMs for driving become standard research tools | Every major AV company integrates VLM reasoning into their stack, at least for validation and testing |
| 2026-2027 | Edge deployment of 10B-class VLMs via quantization | INT4/INT8 Alpamayo-class models run on Jetson Orin / automotive SoCs, with weights shrinking to roughly 5 GB (INT4) or 10 GB (INT8) |
| 2027-2028 | Temporal VLMs process video sequences natively | Single-frame limitations (like our Clip 2 result) disappear; models reason across time |
| 2028-2029 | Regulatory frameworks require explainable AV decisions | CoC-style reasoning becomes mandatory, not optional; VLMs are the compliance path |
| 2029-2030 | Multi-agent VLMs coordinate vehicle fleets | A single model reasons about the joint behavior of dozens of vehicles simultaneously |
| 2030-2035 | Unified perception-reasoning-action models replace modular AV stacks | End-to-end VLMs handle the full autonomy pipeline; traditional HD maps, hand-tuned planners, and rule-based safety layers are phased out |

The GPU economics of this transition are significant. Today's H100-class hardware handles a 10B VLM comfortably. By 2028, the models will likely be 50-100B parameters with video input, requiring H200 or Blackwell-class hardware for real-time inference. Planning your GPU infrastructure roadmap now, using tools like InferenceBench, ensures you are not caught off guard when the model sizes scale.

How to Reproduce Our Benchmark

Our test script is designed to be straightforward to run. Here is the essential setup:

Environment

# Create environment
conda create -n alpamayo python=3.10
conda activate alpamayo
pip install torch torchvision transformers accelerate
pip install matplotlib numpy Pillow

Download the Model

# Model is hosted on HuggingFace
import torch  # needed for torch.bfloat16 below
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "nvidia/Alpamayo-1.5-10B"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

Run a Basic Trajectory Prediction

# Load surround-view images (6 cameras)
images = load_surround_cameras(clip_path)  # Your data loading function

# Chain-of-Causation prediction
inputs = processor(
    images=images,
    text="Predict the trajectory with chain-of-causation reasoning.",
    return_tensors="pt",
).to("cuda")

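# do_sample=True is required for temperature to take effect; otherwise decoding is greedy
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)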
response = processor.decode(outputs[0], skip_special_tokens=True)
# response contains both CoC reasoning text and trajectory waypoints

Key Parameters to Tune

| Parameter | Our Setting | Effect |
|---|---|---|
| temperature | 0.7 (CoC), 0.8 (distribution) | Higher = more diverse trajectory samples |
| max_new_tokens | 512 | Enough for CoC text + trajectory coordinates |
| num_samples | 8 (BEV), 32 (distribution) | More samples = better uncertainty estimate, slower inference |
| torch_dtype | bfloat16 | Half the VRAM of float32, negligible quality loss on H100 |

The complete test script (test_alpamayo_full.py) covers all five inference modes, AIAV data loading, and visualization utilities. To request access to the full benchmark script and dataset configuration, contact our support team. We will share the script along with setup instructions tailored to your hardware environment.

Conclusion

NVIDIA Alpamayo 1.5-10B is a genuinely novel contribution to autonomous driving research. The Chain-of-Causation reasoning is not a gimmick; it produces semantically meaningful explanations that align with the predicted trajectories. The VQA capability turns a prediction model into a general scene understanding system. The navigation conditioning and counterfactual analysis enable the kind of what-if reasoning that safety validation demands. And the probabilistic trajectory output provides the uncertainty quantification that responsible deployment requires.

From a GPU infrastructure perspective, the 20 GB VRAM footprint at bf16 makes this a practical model to deploy. A single H100 handles it with room to spare, and even an A100 40GB or RTX 4090 can run it for development purposes. The real infrastructure question is not "can I run Alpamayo?" but "what else do I need to run alongside it?" In a full autonomous driving stack, Alpamayo is one model among several, and understanding the total GPU budget across your entire model ensemble is where tools like InferenceBench become essential.

The Clip 2 result (minADE of 9.884m) is a useful reminder that no single model is the complete solution. Temporal context, sensor fusion beyond cameras, and traditional safety layers all remain necessary. But Alpamayo shows that vision-language models can reason about driving in a way that is both physically grounded and linguistically interpretable. That is the foundation on which the next generation of autonomous systems will be built.

Explore GPU costs for running Alpamayo and similar vision-language models at scale using the InferenceBench calculator. Compare H100, A100, and L40S pricing across 19 cloud providers to find the optimal deployment configuration for your autonomous driving workload.

Want to run this benchmark yourself? Request access to the full test script and we will get you set up.
