
NVIDIA Jetson AGX Orin: Deep Technical Reference for AV Deployment

Last Updated: 2026-03-22
Applicable SKUs: Jetson AGX Orin 32GB, 64GB, Industrial


Table of Contents

  1. SoC Architecture Overview
  2. CPU Subsystem
  3. GPU Subsystem
  4. Deep Learning Accelerator (NVDLA v2.0)
  5. TOPS Breakdown
  6. Memory Subsystem
  7. Power Modes and Performance Scaling
  8. JetPack SDK Versions
  9. TensorRT and DLA-Compatible Layers
  10. Multi-Process GPU Sharing (MPS)
  11. Isaac ROS Packages
  12. Thermal Management
  13. I/O and Peripheral Interfaces
  14. Real-World Inference Benchmarks
  15. Industrial Variant Specifications
  16. Competitive Landscape
  17. Jetson AGX Orin vs DRIVE Orin
  18. Successor: Jetson Thor
  19. Pricing and Availability
  20. AV Deployment Considerations

1. SoC Architecture Overview

The Jetson AGX Orin is built on the NVIDIA Orin SoC, a heterogeneous compute platform integrating multiple accelerator types on a single die. The SoC combines an Arm CPU complex, NVIDIA Ampere-architecture GPU, two second-generation Deep Learning Accelerators (NVDLA v2.0), a Programmable Vision Accelerator (PVA v2.0), video encode/decode engines, and an Image Signal Processor (ISP).

| Feature | AGX Orin 32GB | AGX Orin 64GB |
|---|---|---|
| Process Node | Samsung 8nm | Samsung 8nm |
| SoC | NVIDIA Orin | NVIDIA Orin |
| GPU Architecture | NVIDIA Ampere | NVIDIA Ampere |
| AI Performance (Sparse INT8) | 200 TOPS | 275 TOPS |
| Module Dimensions | 100mm x 87mm | 100mm x 87mm |
| Pin Compatibility | AGX Xavier compatible | AGX Xavier compatible |

The Orin SoC uses a unified memory architecture where CPU, GPU, DLA, PVA, and other engines share the same LPDDR5 DRAM pool. This eliminates data copy overhead between accelerators but requires careful memory bandwidth management when running concurrent workloads.


2. CPU Subsystem

| Specification | AGX Orin 32GB | AGX Orin 64GB |
|---|---|---|
| Core Architecture | Arm Cortex-A78AE | Arm Cortex-A78AE |
| ISA | ARMv8.2-A (64-bit) | ARMv8.2-A (64-bit) |
| Core Count | 8 cores | 12 cores |
| Cluster Configuration | 2 clusters x 4 cores | 3 clusters x 4 cores |
| Max Clock Frequency | 2.2 GHz | 2.2 GHz |
| L2 Cache | 2 MB (per cluster) | 2 MB (per cluster) |
| L3 Cache | 4 MB | 6 MB |

The Cortex-A78AE is the automotive-enhanced variant of the A78, providing:

  • Lock-step mode support for safety-critical computations (split-lock configuration)
  • ECC on caches for data integrity
  • ARMv8.2 extensions including dot-product instructions beneficial for quantized inference pre/post-processing

The 12-core configuration on the 64GB variant provides 1.7x the CPU throughput of the 8-core Jetson AGX Xavier, relevant for non-GPU-accelerated pipeline stages such as point cloud preprocessing, ROS node orchestration, and sensor data marshalling.


3. GPU Subsystem

The Orin's GPU is based on the NVIDIA Ampere architecture, the same generation as the data-center A100/A30 but in a mobile-optimized configuration.

3.1 GPU Configuration

| Specification | AGX Orin 32GB | AGX Orin 64GB |
|---|---|---|
| Graphics Processor Clusters (GPC) | 2 | 2 |
| Texture Processor Clusters (TPC) | 7 | 8 |
| Streaming Multiprocessors (SM) | 14 | 16 |
| CUDA Cores (128 per SM) | 1792 | 2048 |
| 3rd-Gen Tensor Cores (4 per SM) | 56 | 64 |
| L1 Cache per SM | 192 KB | 192 KB |
| L2 Cache | 4 MB | 4 MB |
| Max GPU Clock | 930.75 MHz | 1300 MHz |

3.2 Compute Performance

| Precision | AGX Orin 32GB | AGX Orin 64GB |
|---|---|---|
| FP32 (CUDA cores) | ~3.8 TFLOPS | 5.3 TFLOPS |
| FP16 (CUDA cores) | ~7.6 TFLOPS | 10.6 TFLOPS |
| FP16 (Tensor Cores) | ~60 TFLOPS | 85 TFLOPS |
| INT8 Dense (Tensor Cores) | ~60 TOPS | 85 TOPS |
| INT8 Sparse (Tensor Cores) | ~120 TOPS | 170 TOPS |

3.3 Ampere Architecture Advantages for AV

  • Structured sparsity (2:4): The 3rd-generation Tensor Cores support fine-grained structured sparsity, doubling INT8 throughput for networks pruned to the 2:4 pattern. This is particularly relevant for deploying optimized perception models (see the build-flag sketch after this list).
  • TF32 precision: Available for training/fine-tuning workflows directly on the device.
  • Asynchronous copy: Hardware-accelerated data movement from global memory to shared memory, reducing CUDA kernel latency.
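
As a hedged sketch of the build step referenced in the sparsity bullet above (the ONNX filename is a placeholder; verify flag behavior against the installed TensorRT release):

```bash
# Build an INT8/FP16 engine and let TensorRT select sparse Tensor Core kernels
# for weight tensors that already follow the 2:4 structured-sparsity pattern
trtexec --onnx=model.onnx \
        --fp16 --int8 \
        --sparsity=enable \
        --saveEngine=model_sparse.engine
```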

4. Deep Learning Accelerator (NVDLA v2.0)

The Orin integrates two independent NVDLA v2.0 cores, representing a 9x performance improvement over the first-generation DLA found on Xavier.

4.1 DLA Architecture

| Specification | Value |
|---|---|
| DLA Cores | 2 (independently schedulable) |
| SRAM per DLA Core | 1 MiB dedicated |
| Supported Precisions | INT8, FP16 |
| INT8 Sparse TOPS (both DLAs) | 105 TOPS |
| INT8 Dense TOPS (both DLAs) | ~52.5 TOPS |

4.2 DLA Power Efficiency

DLA performance per watt is roughly 3-5x higher than the GPU's, depending on power mode and workload. This efficiency is what makes the DLA essential for reaching the full 275 TOPS within the platform's power envelope.

DLA contribution by power mode:

| Power Mode | DLA TOPS | Total TOPS | DLA Contribution |
|---|---|---|---|
| MAXN (60W) | 105 | 275 | 38% |
| 50W | 92 | 200 | 46% |
| 30W | 90 | 131 | 69% |
| 15W | 40 | 54 | 74% |

At lower power budgets, DLA becomes the dominant compute engine. This is a key architectural insight for AV systems that need to operate in reduced-power thermal states.

4.3 DLA Throughput Benchmarks (GPU + 2x DLA Combined)

| Model | FPS (GPU + 2x DLA) |
|---|---|
| PeopleNet (ResNet18) | 346 |
| TrafficCamNet | 425 |
| FaceDetect-IR | 2,381 |
| VehicleMakeNet | 3,600 |

5. TOPS Breakdown

The headline 275 Sparse INT8 TOPS for the AGX Orin 64GB is the sum of concurrent GPU Tensor Core and DLA execution:

275 TOPS (Sparse INT8) = 170 TOPS (GPU Tensor Cores, Sparse)
                        + 105 TOPS (2x NVDLA v2.0, Sparse)

5.1 Dense vs Sparse Performance

| Accelerator | INT8 Sparse | INT8 Dense | FP16 |
|---|---|---|---|
| GPU Tensor Cores (64GB) | 170 TOPS | 85 TOPS | 85 TFLOPS |
| 2x NVDLA v2.0 | 105 TOPS | ~52.5 TOPS | — |
| GPU CUDA Cores (64GB) | — | — | 10.6 TFLOPS |
| Total | 275 TOPS | ~138 TOPS | — |

Key consideration for AV deployment: The 275 TOPS figure assumes both structured sparsity (2:4 pruning) in all weight matrices AND concurrent GPU+DLA utilization. Real-world AV perception pipelines typically achieve a fraction of peak TOPS due to:

  • Not all layers being sparsity-compatible
  • Memory bandwidth bottlenecks on complex models
  • Pipeline stage serialization
  • DLA layer coverage gaps requiring GPU fallback

5.2 PVA v2.0 (Programmable Vision Accelerator)

The PVA is not included in the TOPS figure but provides additional compute for classical computer vision kernels:

  • Image warping and undistortion
  • Fast Fourier Transform (FFT)
  • Image pyramid generation
  • Stereo disparity computation
  • Harris/FAST feature detection
  • Optical flow

PVA is particularly useful for offloading camera preprocessing from the GPU in multi-camera AV perception stacks.


6. Memory Subsystem

| Specification | AGX Orin 32GB | AGX Orin 64GB |
|---|---|---|
| Memory Type | LPDDR5 | LPDDR5 |
| Capacity | 32 GB | 64 GB |
| Bus Width | 256-bit | 256-bit |
| Clock Speed | 3200 MHz | 3200 MHz |
| Data Rate | 6400 Mbps/pin | 6400 Mbps/pin |
| Bandwidth | 204.8 GB/s | 204.8 GB/s |
| ECC Support | No (standard module) | No (standard module) |
| Storage | 64 GB eMMC 5.1 | 64 GB eMMC 5.1 |

6.1 Memory Bandwidth Analysis for AV Workloads

At 204.8 GB/s, the memory bandwidth is shared across all SoC engines (CPU, GPU, DLA, PVA, video codecs, ISP). For a multi-sensor AV stack running concurrent LiDAR 3D detection, multi-camera perception, and sensor fusion:

  • BEVFusion (6-camera + LiDAR): Consumes significant bandwidth for feature map storage and BEV grid generation
  • Multi-camera backbone inference: Each camera stream requires feature extraction bandwidth
  • Point cloud processing: Voxelization and sparse convolution are memory-intensive

The unified memory architecture means there are no PCIe transfer bottlenecks between CPU and GPU (unlike discrete GPU systems), but total bandwidth is the hard constraint.
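
As a rough, illustrative sizing exercise (camera count, resolution, and precision here are assumptions, not measurements): reading the raw FP16 camera tensors alone uses only a small slice of the budget, and intermediate feature maps, weights, and LiDAR buffers typically multiply this several-fold.

6 cameras x 30 FPS x (1920 x 1080 x 3 ch x 2 B) = ~2.2 GB/s   (raw input reads alone, out of 204.8 GB/s)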

6.2 Memory Capacity Recommendations

  • 32 GB: Sufficient for single-pipeline perception (e.g., 1 LiDAR + 4 cameras) with moderate model complexity
  • 64 GB: Required for multi-pipeline AV stacks, BEVFusion with 6+ cameras, concurrent perception + planning + mapping, or development/profiling workloads

7. Power Modes and Performance Scaling

The Orin supports multiple pre-configured power modes via the nvpmodel utility, plus custom mode creation.

7.1 AGX Orin 64GB Power Modes

| Parameter | MAXN (Mode 0) | 50W (Mode 3) | 30W (Mode 2) | 15W (Mode 1) |
|---|---|---|---|---|
| Power Budget | No cap (~60W) | 50W | 30W | 15W |
| Online CPU Cores | 12 | 12 | 8 | 4 |
| CPU Max Freq (MHz) | 2201.6 | 1497.6 | 1728 | 1113.6 |
| GPU TPC Count | 8 | 8 | 4 | 3 |
| GPU Max Freq (MHz) | 1301 | 828.75 | 624.75 | 420.75 |
| DLA Cores | 2 | 2 | 2 | 2 |
| Memory Max Freq (MHz) | 3200 | 3200 | 3200 | 2133 |
| AI Performance (TOPS) | 275 | ~200 | ~131 | ~54 |

7.2 AGX Orin 32GB Power Modes

| Parameter | MAXN (Mode 0) | 40W (Mode 3) | 30W (Mode 2) | 15W (Mode 1) |
|---|---|---|---|---|
| Power Budget | No cap (~40W) | 40W | 30W | 15W |
| Online CPU Cores | 8 | 8 | 8 | 4 |
| CPU Max Freq (MHz) | 2188.8 | 1497.6 | 1728 | 1113.6 |
| GPU TPC Count | 7 | 7 | 4 | 3 |
| GPU Max Freq (MHz) | 930.75 | 828.75 | 624.75 | 420.75 |
| DLA Cores | 2 | 2 | 2 | 2 |
| Memory Max Freq (MHz) | 3200 | 3200 | 3200 | 2133 |
| AI Performance (TOPS) | 200 | — | — | — |

7.3 Power Mode Selection for AV

  • MAXN: Development and benchmarking; maximum performance with no power cap
  • 50W: Recommended for production AV systems with active cooling; best balance of performance and thermal manageability
  • 30W: Viable for limited perception stacks (e.g., camera-only) or failover/limp-home modes
  • 15W: Standby/monitoring modes; insufficient for full AV perception

Custom power modes can be created to tune CPU/GPU frequencies independently, enabling application-specific optimization (e.g., higher GPU clocks with fewer CPU cores for inference-heavy workloads).
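
As a minimal sketch (mode IDs follow the tables above but can differ between BSP releases, so always confirm with a query first):

```bash
sudo nvpmodel -q          # report the currently active power mode
sudo nvpmodel -m 3        # switch to the 50W profile on AGX Orin 64GB
sudo jetson_clocks        # optionally pin clocks to the max of the active mode
```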


8. JetPack SDK Versions

JetPack is the comprehensive SDK for Jetson, bundling the BSP (Jetson Linux), CUDA toolkit, TensorRT, cuDNN, and AI frameworks.

8.1 JetPack 5.x Series (Legacy)

| Component | JetPack 5.1.1 | JetPack 5.1.2 |
|---|---|---|
| Jetson Linux (L4T) | R35.3.1 | R35.4.1 |
| CUDA | 11.4 | 11.4 |
| TensorRT | 8.5.2 | 8.5.2 |
| cuDNN | 8.6 | 8.6 |
| VPI | 2.3 | 2.3 |

8.2 JetPack 6.x Series (Current)

| Component | JetPack 6.0 (GA) | JetPack 6.1 | JetPack 6.2 |
|---|---|---|---|
| Jetson Linux (L4T) | R36.3 | R36.4 | R36.4.3 |
| CUDA | 12.2 | 12.5 | 12.6 |
| TensorRT | 8.6 | 10.1 | 10.3 |
| cuDNN | 8.9 | 9.1 | 9.3 |
| VPI | 3.1 | 3.2 | 3.2 |
| DLA Compiler | 3.14 | 3.17 | 3.1 |
| DLFW | 24.0 | — | — |

8.3 Key JetPack 6 Features for AV

  • Upgradable compute stack: CUDA, TensorRT, cuDNN, DLA, and VPI can be upgraded independently without reflashing the entire Jetson Linux image.
  • Over-The-Air (OTA) updates: Supports field updates from JetPack 5 to JetPack 6 and incremental JetPack 6 updates.
  • Jetson Platform Services: Pre-built containerized services for AI application deployment, including video analytics, API gateway, and fleet management integration.
  • PREEMPT_RT kernel support: Real-time Linux kernel for deterministic latency, critical for AV control loops.
  • Container support: Native Docker and Kubernetes support for modular AV software deployment (see the container example after this list).
  • MPS support (JetPack 6.1+): Multi-Process Service for GPU sharing across concurrent inference processes.
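
As a hedged example of the container workflow mentioned above (the image tag is an assumption; match it to the installed L4T release reported by cat /etc/nv_tegra_release):

```bash
# Run a JetPack container with GPU/DLA access through the NVIDIA container runtime
sudo docker run -it --rm --runtime nvidia \
    nvcr.io/nvidia/l4t-jetpack:r36.4.0
```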

9. TensorRT and DLA-Compatible Layers

9.1 TensorRT Capabilities

TensorRT is NVIDIA's inference optimization engine that performs:

  • Layer fusion: Combines sequential operations (Conv + BN + ReLU) into single kernels
  • Precision calibration: FP32 to FP16/INT8 quantization with calibration datasets
  • Kernel auto-tuning: Selects optimal CUDA kernels for each layer based on input dimensions
  • Dynamic tensor memory: Minimizes memory footprint through buffer reuse
  • DLA offloading: Automatically partitions networks between GPU and DLA

Current version on JetPack 6.2: TensorRT 10.3

9.2 DLA-Compatible Layers (TensorRT)

The following layers can execute on DLA. Unsupported layers automatically fall back to GPU.

Fully Supported on DLA

| Layer Type | Supported Operations | Constraints |
|---|---|---|
| Convolution | Standard, depthwise, grouped | Kernel [1,32], stride [1,8], padding [0,31], channels [1,8192] |
| Deconvolution | Transposed convolution | Kernel [1,32], padding must be 0, no grouped deconv |
| Fully Connected | Dense layers | Same constraints as convolution |
| Activation | ReLU, Sigmoid, TanH, Clipped ReLU, Leaky ReLU | TanH/Sigmoid auto-upgrade to FP16 in INT8 mode |
| Pooling | Max, Average | Window [1,8], stride [1,16], padding [0,7] |
| ElementWise | Sum, Sub, Product, Max, Min, Div, Pow | Broadcasting supported (NCHW, NC11, N111) |
| Scale | Uniform, Per-Channel, ElementWise | Scale and shift only |
| Concatenation | Along channel axis only | Min 2 inputs, same spatial dims |
| LRN | Across-channels | Window sizes: 3, 5, 7, 9 only |
| Parametric ReLU | PReLU | Slope must be build-time constant |
| Softmax | Orin only (not Xavier) | Axis dimension limit: 1024 |
| Resize | Nearest-neighbor, bilinear | Scale [1,32] nearest, [1,4] bilinear |
| Slice | Static slicing on CHW | 4D inputs only |
| Shuffle | Reshape/transpose | 4D tensors, batch dim excluded |
| Reduce | MAX only | 4D tensors, dims [1,8192] |
| Comparison | Equal, Greater, Less | INT8 precision only, requires Cast |
| Unary | ABS, SIN, COS, ATAN | SIN/COS/ATAN require INT8 input |

Notable DLA Limitations

  • No support for: GroupNorm, LayerNorm, InstanceNorm, attention mechanisms, dynamic shapes, batch sizes > 4096
  • No separate accelerator assignment for activation vs. parent layer
  • Formatting overhead: Transitions between GPU and DLA incur data reformatting cost. Minimizing GPU-DLA transitions is critical for performance
  • 1 MiB SRAM per DLA core (vs. 4 MiB shared on Xavier); larger activation maps spill to DRAM
  • No dynamic dimensions: All tensor shapes must be known at engine build time

9.3 DLA Deployment Best Practices for AV Models

  1. Profile before committing: Use trtexec --useDLACore=0 --allowGPUFallback to identify which layers run on DLA vs. GPU (expanded in the sketch after this list)
  2. Minimize DLA-GPU transitions: Consecutive DLA-compatible layers execute efficiently; interleaved unsupported layers cause costly reformatting
  3. Use INT8 on DLA: DLA's INT8 throughput is significantly higher than FP16, and quantization-aware training (QAT) preserves accuracy
  4. Run both DLA cores: Schedule independent models or pipeline stages on DLA0 and DLA1 concurrently
  5. Combine GPU + DLA: Run the camera backbone on GPU while running the LiDAR detection head on DLA
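
A minimal trtexec sketch combining the practices above (model filenames are placeholders; the verbose build log reports which layers stayed on DLA and which fell back to GPU):

```bash
# One engine per DLA core so both run concurrently alongside GPU engines
trtexec --onnx=freespace_unet.onnx --int8 --useDLACore=0 --allowGPUFallback \
        --saveEngine=freespace_dla0.engine --verbose
trtexec --onnx=lidar_head.onnx --int8 --useDLACore=1 --allowGPUFallback \
        --saveEngine=lidar_head_dla1.engine --verbose
```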

10. Multi-Process GPU Sharing (MPS)

10.1 MPS Availability

| JetPack Version | CUDA Version | MPS Support |
|---|---|---|
| JetPack 5.x | CUDA 11.4 | Not supported |
| JetPack 6.0 | CUDA 12.2 | Not supported |
| JetPack 6.1 | CUDA 12.5 | Supported |
| JetPack 6.2 | CUDA 12.6 | Supported |

MPS was historically unavailable on Tegra (Jetson) devices. Support was introduced with JetPack 6.1 / CUDA 12.5.

10.2 How MPS Works on Jetson

MPS shifts GPU sharing from temporal multiplexing (time-slicing, where processes take turns) to spatial sharing (concurrent execution on different SMs). This provides:

  • Reduced context-switch overhead
  • Better GPU utilization for bursty inference workloads
  • Lower tail latency for multi-model AV stacks

10.3 Enabling MPS

```bash
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps-pipe
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps-log
mkdir -p /tmp/mps-pipe /tmp/mps-log
sudo -E nvidia-cuda-mps-control -d
```

Thread percentage allocation per process:

```bash
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50  # Allocate 50% of GPU SMs
```
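
To stop the daemon when reconfiguring (standard MPS control behavior; worth confirming on the JetPack release in use):

```bash
echo quit | sudo -E nvidia-cuda-mps-control   # shut down the MPS control daemon
```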

10.4 Known Limitations

  • Single-user constraint: Once an MPS server starts under one user, only that user's processes can access the GPU
  • Kubernetes issues: MPS within containerized K8s deployments has reported compatibility issues with the NVIDIA k8s-device-plugin
  • Thread percentage enforcement: At lower allocation thresholds (< 30%), priority enforcement is inconsistent on the integrated GPU
  • No hardware partitioning: Unlike NVIDIA MIG on data-center GPUs, MPS on Orin provides software-level isolation only

10.5 AV Multi-Process Architecture Implications

For AV stacks running concurrent perception models (e.g., camera detection + LiDAR detection + tracking + freespace), consider:

  • MPS for lightweight concurrent models: Good for running 2-3 small models simultaneously
  • DLA offloading preferred: Offload one pipeline to DLA and run another on GPU, achieving true hardware-level parallelism without MPS overhead
  • Batch consolidation: Where possible, batch multiple inference requests through a single process to avoid MPS overhead entirely

11. Isaac ROS Packages

NVIDIA Isaac ROS provides GPU-accelerated ROS 2 packages optimized for Jetson platforms. These maintain standard ROS 2 APIs while leveraging NVIDIA hardware acceleration.

11.1 Perception Packages Relevant to AV

| Package | Description | AV Relevance |
|---|---|---|
| isaac_ros_visual_slam (cuVSLAM) | GPU-accelerated Visual SLAM | Localization without GNSS |
| isaac_ros_nvblox | 3D scene reconstruction | Real-time occupancy mapping |
| isaac_ros_detectnet | 2D object detection (DetectNet) | Vehicle/pedestrian detection |
| isaac_ros_yolov8 | YOLOv8 inference pipeline | Real-time 2D detection |
| isaac_ros_rt_detr | RT-DETR detection | Transformer-based detection |
| isaac_ros_foundationpose | 6-DOF pose estimation | Object pose tracking |
| isaac_ros_centerpose | CenterPose 6D estimation | Object pose from single image |
| isaac_ros_depth_image_proc | Depth image processing | Stereo/depth camera pipeline |
| isaac_ros_ess | ESS stereo depth | Dense depth estimation |
| isaac_ros_foundationstereo | Foundation stereo depth | Learned stereo matching |
| isaac_ros_occupancy_grid_localizer | LiDAR-based localization | Map-relative positioning |
| isaac_ros_image_pipeline | Camera calibration/rectification | Multi-camera preprocessing |
| isaac_ros_h264_encoder | H.264 video compression | Bandwidth-efficient recording |
| isaac_ros_cumotion | GPU-accelerated motion planning | Trajectory generation |

11.2 NITROS (NVIDIA Isaac Transport for ROS)

NITROS is a zero-copy GPU-accelerated transport layer for ROS 2 that:

  • Eliminates CPU-GPU memory copies between consecutive Isaac ROS nodes
  • Uses CUDA shared memory for inter-node data transfer
  • Provides 2-5x throughput improvement over standard ROS 2 message passing
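
Because NITROS zero-copy only applies between nodes that share a process, Isaac ROS graphs are normally composed into a single component container. A minimal sketch using ros2 CLI composition follows; the plugin class names are illustrative assumptions and should be taken from each package's documentation:

```bash
# Start a multithreaded component container, then load Isaac ROS nodes into it
ros2 run rclcpp_components component_container_mt &
ros2 component load /ComponentManager isaac_ros_image_proc \
    nvidia::isaac_ros::image_proc::RectifyNode
ros2 component load /ComponentManager isaac_ros_yolov8 \
    nvidia::isaac_ros::yolov8::YoloV8DecoderNode
```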

11.3 Isaac Perceptor

A reference workflow combining multiple Isaac ROS packages for autonomous navigation:

  • Multi-camera 3D detection
  • Visual SLAM with loop closure
  • Real-time 3D reconstruction (nvblox)
  • Obstacle avoidance and path planning

Current release: Isaac ROS 4.2 (as of early 2026), supporting ROS 2 Humble and Jazzy.


12. Thermal Management

12.1 Thermal Specifications

| Parameter | Value |
|---|---|
| SoC Junction Temperature (Tj max) | 105 C |
| Thermal Trip (hardware reset) | 105 C |
| Recommended Operating Tj | < 95 C for sustained operation |
| Module TDP (MAXN, 64GB) | ~60W |
| Module TDP (MAXN, 32GB) | ~40W |
| Industrial Variant TDP | Up to 75W |

12.2 Cooling Solution Design

Active fanned heatsinks are recommended for AV deployment due to:

  • Sustained high-power operation under MAXN or 50W modes
  • Enclosed vehicle cabin thermal environments
  • Ambient temperature variability in outdoor AV operations

NVIDIA's reference design uses a top-mounted heatsink with embedded fan, connected to the module via thermal interface material (TIM). Dynamic fan control adjusts speed based on SoC temperature thresholds configured in the thermal management configuration file.
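
On JetPack 5 and 6 that configuration is owned by the nvfancontrol service; a minimal sketch, assuming the default configuration path:

```bash
# Adjust the temperature/PWM points of the active fan profile, then reload
sudo vi /etc/nvfancontrol.conf
sudo systemctl restart nvfancontrol
```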

Passive Cooling

Passive heatsinks (aluminum fin arrays, natural convection) are viable only for:

  • Lower power modes (15W-30W)
  • Open-air or well-ventilated enclosures
  • Low ambient temperature environments

Design Guidelines

  • Mounting: Heatsink should mount directly to carrier board for structural support and load relief on the SoM connector
  • Thermal gap pads: Embedded between module components and heatsink for maximum heat transmission
  • Contact pressure: Uniform contact to the Orin SoC die is critical; uneven pressure causes hotspots
  • Airflow: For active solutions, ensure intake/exhaust paths are not obstructed by enclosure geometry

12.3 Thermal Throttling Behavior

The Orin implements progressive thermal management:

  1. Software throttling: As Tj approaches the warning threshold, the software thermal governor dynamically reduces GPU/CPU clocks
  2. Hardware thermal trip: At 105 C, hardware forces system reset to prevent damage
  3. Fan ramp-up: Dynamic fan control increases speed proportionally with temperature rise

For AV systems, maintaining stable thermal performance is critical. Design the cooling solution to keep Tj below 90 C under sustained maximum load to avoid clock throttling that impacts inference latency consistency.
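
One way to validate this during a thermal soak test (the thermal-zone paths are standard Linux sysfs entries; the log location is arbitrary):

```bash
# Log per-engine utilization and temperatures once per second under full load
sudo tegrastats --interval 1000 --logfile /var/log/orin_soak.log &
# Spot-check the individual thermal zones exposed by the kernel
paste <(cat /sys/class/thermal/thermal_zone*/type) \
      <(cat /sys/class/thermal/thermal_zone*/temp)
```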


13. I/O and Peripheral Interfaces

13.1 High-Speed Interfaces

| Interface | Specification |
|---|---|
| PCIe | 7 controllers, 22 lanes total, Gen 4 (16 Gbps/lane) |
| Ethernet | 1x GbE + 4x XFI (10GbE) |
| USB | USB 3.2 Gen 2 (10 Gbps) |
| NVMe Storage | Via PCIe Gen 4 x4 |
| Display | DisplayPort 1.4a |

13.2 Camera and Sensor Interfaces

| Interface | Specification |
|---|---|
| MIPI CSI-2 | 16 lanes total |
| CSI Configuration | Up to 6x 2-lane or 4x 4-lane |
| D-PHY | v2.1 (up to 4.5 Gbps/lane, 40 Gbps aggregate) |
| C-PHY | v2.0 (up to 164 Gbps aggregate) |
| Camera Support | Up to 6 cameras simultaneously |

13.3 Video Encode/Decode (64GB Variant)

| Capability | Specification |
|---|---|
| Encode | 2x 4K60, 4x 4K30, 8x 1080p60, 16x 1080p30 (H.265/H.264/AV1) |
| Decode | 1x 8K30, 3x 4K60, 7x 4K30, 11x 1080p60, 22x 1080p30 (H.265/H.264/VP9/AV1) |

13.4 Low-Speed Interfaces

| Interface | Count |
|---|---|
| UART | 4 |
| SPI | 3 |
| I2C | 8 |
| CAN (FD-capable) | 2 |
| GPIO | Multiple |
| I2S/TDM Audio | 4x DAP ports |
| DMIC | 2x PDM |

13.5 AV-Relevant I/O Notes

  • CAN FD: Two CAN interfaces (LS and FD) for vehicle bus communication; sufficient for basic CAN integration but may need external CAN controllers for multi-bus AV architectures
  • PCIe Gen 4: 22 lanes enable connection of NVMe storage, additional Ethernet NICs, FPGA co-processors, or LiDAR interface cards
  • 10GbE via XFI: Four 10GbE interfaces support high-bandwidth sensor data ingestion (e.g., multiple LiDAR units, high-res cameras over Ethernet)
  • CSI cameras: Direct MIPI CSI connection for up to 6 cameras eliminates USB/Ethernet camera latency

14. Real-World Inference Benchmarks

14.1 MLPerf v3.1 Official Benchmarks (Jetson AGX Orin 64GB)

Tested with JetPack 5.1.1, TensorRT 8.5.2, CUDA 11.4:

| Model | Task | Single-Stream Latency | Offline Throughput | Power |
|---|---|---|---|---|
| ResNet-50 | Image Classification | 0.64 ms | 6,424 samples/s | 23.6W |
| RetinaNet | Object Detection | 11.67 ms | 149 samples/s | 22.3W |
| BERT-Large | NLP (SQuAD) | 5.71 ms | 554 samples/s | — |
| RNN-T | Speech-to-Text | 94.01 ms | 1,170 samples/s | — |
| 3D-UNet | Medical Imaging | 4,371 ms | 0.51 samples/s | — |

14.2 3D LiDAR Object Detection Benchmarks

PointPillars (TensorRT, Jetson AGX Orin)

| Precision | Latency (ms) | FPS | mAP (KITTI) |
|---|---|---|---|
| FP32 | 32.91 | ~30 | 64.64 |
| FP16 | 18.27 | ~55 | ~64.5 |
| INT8 | 14.77 | ~68 | ~63.8 |
| Mixed (FP16:1) | 14.29 | ~70 | 64.47 (QAT) |

Source: Mixed Precision PointPillars (arXiv:2601.12638), TensorRT 10.3

From the MDPI Sensors benchmark study (full pipeline, not TensorRT-optimized for all models):

| Detector | FPS (AGX Orin) | GPU Util | CPU Util | Power |
|---|---|---|---|---|
| PointPillar | 9.7 | ~80% | >60% | ~29W |
| SECOND | 5.21 | — | — | — |
| CIA-SSD | 5.79 | — | — | — |
| SE-SSD | 5.82 | — | — | — |
| PointRCNN | 1.98 | — | — | — |
| Part-A2 | 2.54 | — | — | — |
| PV-RCNN | 2.27 | — | — | — |
| FastPillars | 18 | — | — | — |

CenterPoint (NVIDIA CUDA-CenterPoint, TensorRT, AGX Orin)

| Pipeline Stage | Latency (ms) FP16 | Latency (ms) INT8 |
|---|---|---|
| Voxelization | 1.36 | 1.36 |
| 3D Backbone (Sparse Conv) | 22.3 | 22.3 |
| RPN + Detection Head | 11.3 | 7.0 |
| Decode + NMS | 4.4 | 4.4 |
| Total | ~40.0 | ~35.7 |

Effective throughput: ~23-28 FPS. The 3D sparse convolution backbone is the primary bottleneck, consuming ~56% of total latency.

Cross-platform comparison (CenterPoint total latency):

| Platform | FP16 Total | Mixed Total |
|---|---|---|
| Tesla A30 (data center) | 21.3 ms | 20.0 ms |
| Jetson AGX Orin | 40.0 ms | 35.7 ms |

BEVFusion (NVIDIA CUDA-BEVFusion, TensorRT, AGX Orin)

Tested on nuScenes validation set (6019 samples):

| Configuration | FPS (AGX Orin) | mAP | NDS |
|---|---|---|---|
| ResNet50, FP16 | 18 | 67.89 | 70.98 |
| ResNet50, FP16+INT8 (PTQ) | 25 | 67.66 | 70.81 |

For reference, Swin-Tiny BEVFusion on RTX 3090 (PyTorch FP32+FP16): 8.4 FPS, 68.52 mAP.

The number of LiDAR points per frame is the dominant factor affecting BEVFusion FPS. A lighter camera backbone (e.g., ResNet34 instead of ResNet50) reduces latency with minimal accuracy impact.

14.3 YOLOv8 2D Detection Benchmarks (TensorRT, AGX Orin)

| Model | FP32 Latency | FP32 FPS | INT8 Latency | INT8 FPS |
|---|---|---|---|---|
| YOLOv8n | ~2.0 ms | ~500 | ~1.2 ms | ~830 |
| YOLOv8s | 7.2 ms | 139 | 3.2 ms | 313 |
| YOLOv8m | ~12 ms | ~83 | ~6 ms | ~167 |
| YOLOv8l | ~18 ms | ~56 | ~10 ms | ~100 |
| YOLOv8x | ~25 ms | ~40 | ~13 ms | ~75 |

Note: YOLOv8 shows +4 to +9 mAP improvement over YOLOv5 at similar runtime on AGX Orin with TensorRT FP16.

14.4 NVIDIA DRIVE AV Reference

The NVIDIA DRIVE AV perception pipeline achieved a 2.5x latency reduction by leveraging DLA for suitable network components, demonstrating the importance of GPU+DLA co-scheduling in production AV stacks.


15. Industrial Variant Specifications

The Jetson AGX Orin Industrial is designed for harsh-environment deployment including outdoor autonomous vehicles, agriculture, construction, aerospace, and energy applications.

15.1 Key Differences from Standard Module

| Parameter | AGX Orin 64GB (Standard) | AGX Orin Industrial |
|---|---|---|
| AI Performance | 275 TOPS | 248 TOPS |
| Power Range | 15-60W | 15-75W |
| Memory | 64 GB LPDDR5 | 64 GB LPDDR5 + Inline ECC |
| Operating Temp (TTP) | -25 C to +80 C | -40 C to +85 C |
| Operational Shock | — | 50G, 11 ms |
| Non-Operational Shock | 140G, 2 ms | 140G, 2 ms |
| Operational Vibration | — | 5G |
| Non-Operational Vibration | 3G | 3G |
| Humidity Tolerance | — | 85 C / 85% RH, 1000 hrs powered |
| Operating Lifetime | — | 10 years; 87,000 hrs @ 85 C |
| Production Lifecycle | — | 10 years (through 2033) |
| Mechanical Protection | Standard | SoC corner bonding + component underfill |

15.2 ECC Memory

The Industrial variant includes inline ECC on its LPDDR5 memory, which:

  • Detects and corrects single-bit errors
  • Detects double-bit errors
  • Reduces effective memory capacity by ~12.5% (ECC overhead)
  • Critical for safety-relevant AV compute where memory bit-flips could cause perception errors

15.3 TOPS Reduction Explanation

The Industrial variant's 248 TOPS (vs. 275 TOPS standard) results from slightly reduced clock frequencies to maintain reliability across the extended -40 C to +85 C temperature range. The 75W maximum power budget compensates with additional thermal headroom.

15.4 AV Deployment Recommendation

For production AV deployment, the Industrial variant is strongly recommended due to:

  • Extended temperature range covering outdoor operational extremes
  • ECC memory protecting against radiation-induced bit-flips
  • Vibration and shock ratings appropriate for vehicle-mounted compute
  • 10-year production lifecycle providing supply chain stability
  • Component underfill preventing solder joint failures under vibration

16. Competitive Landscape

16.1 Edge AI Compute Comparison

| Platform | AI Performance | Power | Architecture | Target Application |
|---|---|---|---|---|
| NVIDIA Jetson AGX Orin 64GB | 275 TOPS (INT8 Sparse) | 15-60W | Ampere GPU + DLA | Robotics, AV (non-automotive) |
| NVIDIA DRIVE Orin | 254 TOPS | — | Same SoC, automotive grade | L2-L4 automotive (ASIL-D) |
| Qualcomm Snapdragon Ride Elite | 300 TOPS | — | Kryo CPU + Adreno GPU | L3 automotive ADAS |
| Qualcomm Snapdragon Ride Flex | 10-1000 TOPS (scalable) | — | Kryo + Adreno | L1-L4 scalable |
| TI TDA4VM | 8 TOPS | 5-20W | C7x DSP + MMA | L2-L3 ADAS (camera+radar) |
| Hailo-8 | 26 TOPS | 2.5W | Dataflow architecture | Inference-only accelerator |
| Hailo-8L | 13 TOPS | ~1.5W | Dataflow architecture | Low-power inference |
| Intel Movidius Myriad X | 4 TOPS | ~1.5W | VPU | Low-power vision |
| Qualcomm RB5 | 15 TOPS | ~15W | AI Engine + Hexagon DSP | Robotics |

16.2 Comparative Analysis for AV

NVIDIA Jetson AGX Orin Strengths:

  • Highest single-module TOPS in the Jetson class
  • Full CUDA ecosystem (TensorRT, cuDNN, CUDA kernels) for custom model development
  • Unified memory architecture eliminates host-device transfers
  • DLA + GPU heterogeneous compute for power-efficient inference
  • Comprehensive software stack (JetPack, Isaac ROS, DeepStream)
  • Active developer community and extensive documentation

NVIDIA Jetson AGX Orin Weaknesses:

  • Not automotive-safety certified (no ASIL-D; use DRIVE Orin for that)
  • Higher power consumption than specialized accelerators (Hailo, TDA4)
  • 8nm process vs. competitors on 4-5nm nodes
  • TOPS/watt lower than dedicated inference accelerators

Hailo-8 as Complementary Accelerator:

Hailo-8 modules (M.2 form factor) can be paired with Jetson AGX Orin via PCIe to add dedicated inference capacity for specific perception models, freeing the GPU for complex models like BEVFusion while Hailo handles simpler 2D detection tasks.

16.3 Market Position

NVIDIA holds an estimated 25-35% global share in autonomous driving compute (H1 2025). Qualcomm Snapdragon Ride is emerging in L2+ ADAS with ~5% share. TI TDA4 dominates the lower-TOPS L2 camera/radar ADAS segment.


17. Jetson AGX Orin vs DRIVE Orin

Both use the same Orin SoC, but target different markets:

| Aspect | Jetson AGX Orin | DRIVE Orin |
|---|---|---|
| Target Market | Robotics, industrial, non-automotive AV | Automotive L2-L5 |
| Safety Certification | None (no ASIL) | ISO 26262 ASIL-D systematic, ASIL-B random |
| Operating System | Jetson Linux (Ubuntu-based) | DriveOS (safety-certified RTOS) |
| Software Stack | JetPack, Isaac ROS | DRIVE SDK, DriveWorks |
| Functional Safety | Not designed for safety-critical | Hardware lockstep, safety island, ECC |
| Availability | Module + dev kit, broad distribution | Automotive Tier-1 channel |
| Pricing | $999 (32GB module) | OEM/Tier-1 pricing |

For airport airside AV operations: Jetson AGX Orin (Industrial) is appropriate because:

  • Airport airside is a controlled, geo-fenced environment with lower speed requirements
  • ASIL-D certification is typically not required for non-road autonomous vehicles
  • The JetPack/Isaac ROS software ecosystem provides faster development iteration
  • The Industrial variant's temperature/vibration ratings meet outdoor vehicle requirements

18. Successor: Jetson Thor

Jetson AGX Thor, powered by NVIDIA Blackwell architecture, became generally available in August 2025.

| Specification | AGX Orin 64GB | AGX Thor |
|---|---|---|
| GPU Architecture | Ampere | Blackwell |
| AI Performance | 275 TOPS | 2,070 FP4 TFLOPS |
| AI Improvement | Baseline | 7.5x over Orin |
| Energy Efficiency | Baseline | 3.5x better than Orin |
| Memory | 64 GB | 128 GB |
| Power Range | 15-60W | 40-130W |
| Dev Kit Price | $1,999 | $3,499 |

Thor is positioned for next-generation physical AI and foundation model inference at the edge. For AV stacks requiring transformer-based models (BEVFormer, UniAD, end-to-end driving models), Thor's increased memory and compute may be necessary.

However, AGX Orin remains the production-proven platform with broader ecosystem maturity and lower power requirements, making it suitable for AV deployments where the perception stack fits within 275 TOPS.


19. Pricing and Availability

| Product | Price (USD) | Status |
|---|---|---|
| Jetson AGX Orin 32GB Module | $999 | Production |
| Jetson AGX Orin 64GB Module | ~$1,599 | Production |
| Jetson AGX Orin 64GB Dev Kit | $1,999 | Production |
| Jetson AGX Orin Industrial Module | Contact NVIDIA | Production (through 2033) |

Modules are available through NVIDIA's authorized distribution network (Arrow, Mouser, DigiKey) and ecosystem partners (Connect Tech, Seeed Studio, etc.). Carrier boards from third-party vendors (Connect Tech, Stereolabs, etc.) provide ruggedized and application-specific form factors.


20. AV Deployment Considerations

20.1 Perception Stack Sizing

For a typical airport airside AV with 6 cameras + 1 LiDAR + 3 radars:

| Pipeline Stage | Model | Estimated Latency (AGX Orin) | Accelerator |
|---|---|---|---|
| Camera 2D Detection | YOLOv8s (INT8) | ~3.2 ms | GPU |
| LiDAR 3D Detection | CenterPoint (INT8) | ~35 ms | GPU |
| Camera-LiDAR Fusion | BEVFusion (FP16+INT8) | ~40 ms | GPU |
| Freespace Segmentation | Custom UNet (INT8) | ~5 ms | DLA |
| Tracking | Kalman/ByteTrack | ~1 ms | CPU |
| Planning | Custom planner | ~10 ms | CPU |
| Total Pipeline | — | ~50-80 ms (target) | — |

The 64GB variant is recommended to accommodate BEVFusion's memory requirements and allow headroom for mapping/localization workloads.

20.2 Recommended System Configuration

  • Module: Jetson AGX Orin Industrial (64GB, ECC)
  • Power Mode: 50W for production, MAXN for development
  • Cooling: Active heatsink with ducted airflow, rated for +50 C ambient
  • Storage: NVMe SSD via PCIe Gen 4 x4 for data logging
  • Networking: 10GbE for LiDAR data ingestion; GbE for vehicle CAN gateway
  • Software: JetPack 6.2+, TensorRT 10.3, Isaac ROS for perception, custom ROS 2 nodes for planning/control

20.3 Key Risk Factors

  1. Memory bandwidth saturation: Multi-camera BEVFusion can saturate the 204.8 GB/s bandwidth; profile with nsys and tegrastats (see the profiling sketch after this list)
  2. Thermal derating: In enclosed vehicle compute boxes, sustained thermal loads may trigger clock throttling; validate cooling under worst-case ambient
  3. DLA coverage gaps: Transformer-based models (attention layers) cannot run on DLA; future perception architectures trending toward transformers may underutilize DLA
  4. Supply chain: Plan for the Industrial variant's 10-year lifecycle; coordinate with NVIDIA for volume commitments
  5. No functional safety certification: For operations requiring safety certification, a separate safety controller (e.g., TI TDA4 or dedicated safety MCU) may be needed alongside the Orin for monitoring and fallback
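
For risk 1 above, a minimal profiling sketch (the binary name is a placeholder; trace options can vary slightly across Nsight Systems releases):

```bash
# Capture a 60 s system trace of the perception process to look for
# memory-bandwidth stalls and GPU/DLA idle gaps
nsys profile --trace=cuda,nvtx,osrt --duration=60 \
     --output=perception_trace ./perception_node
sudo tegrastats --interval 500    # coarse EMC/GPU utilization alongside the trace
```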

Sources

Public research notes collected from public sources.