Cooperative V2X End-to-End Driving

Key Takeaway: Cooperative V2X should not stop at perception. The useful E2E target is a jointly optimized perception, mapping, occupancy, prediction, planning, and communication policy that decides what to share, when to trust it, and how it changes the ego trajectory. Current research is moving from cooperative perception benchmarks toward full cooperative planning with UniV2X, V2X-VLM, Coopernaut, V2X-Real, V2X-ReaLO, and M3CAD. Airside autonomy is a natural deployment domain because airports are bounded, low-speed, infrastructure-rich, and centrally governed.


Problem Definition

Classical cooperative perception asks: "Can infrastructure or another vehicle improve my detections?" Cooperative end-to-end driving asks a harder question:

multi-agent sensors + maps + messages + ego goal
  -> shared representation
  -> ego trajectory/action

The learned system must optimize planning performance under realistic communication limits, not only maximize mAP. A feature that improves detection but arrives too late, is untrusted, or is irrelevant to the route should not dominate the driving decision.


Method Taxonomy

Cooperation Level

| Level | Shared content | Planning coupling | Example methods/data |
|---|---|---|---|
| Late fusion | Boxes, tracks, hazards, intent messages | Planner consumes fused outputs | DAIR-V2X TCLF, V2X-ReaLO late-fusion tracks |
| Intermediate fusion | BEV features, sparse queries, compressed latent tokens | Planner consumes fused features or downstream perception | V2X-ViT, Where2comm, FFNet, UniV2X |
| Early fusion | Raw point clouds/images | Planner sees fused perception output after centralized processing | Research upper bound, high bandwidth |
| VLM/VLA fusion | Images/features plus text scene descriptions | Planner/VLA reasons over cooperative context | V2X-VLM |
| End-to-end cooperative policy | Communication and driving jointly optimized | Loss includes perception, mapping, occupancy, and planning | UniV2X, M3CAD baselines |

Agent Topology

| Topology | Description | Airside analogy |
|---|---|---|
| V2V | Vehicles share perception or intent | Tug-to-tug, baggage train coordination |
| V2I | Vehicle fuses infrastructure sensors | Stand pole sensors, terminal cameras, road-side LiDAR |
| I2I | Infrastructure nodes fuse among themselves | Multi-stand ramp monitoring |
| V2N/MEC | Vehicle and infrastructure send to edge server | Airport 5G/MEC cooperative perception |
| V2P | Personnel devices or wearables broadcast position/status | Ground crew beacons, marshaller safety wearables |

Planning Integration

| Integration pattern | Description | Strength | Risk |
|---|---|---|---|
| Advisory input | Cooperative output appears as extra obstacle/context | Easy retrofit into modular stacks | Planner may ignore or over-trust data |
| Fused BEV planner | E2E planner consumes fused BEV or occupancy | Better occlusion handling | Sensitive to pose/time errors |
| Communication-aware planner | Planner reasons about latency, confidence, and bandwidth | More deployable | Harder training/evaluation |
| VLA cooperative reasoner | VLM/VLA reasons over multi-view scenes and text | Useful for semantic events | Hallucination and latency risks |
| Simplex-gated cooperative stack | High-performance cooperative stack with onboard-only fallback | Safety case friendly | More integration and evidence work |

Key Research Threads

Coopernaut

Coopernaut demonstrated end-to-end driving with cooperative perception in networked vehicles. It showed that cross-vehicle perception can improve success rate in challenging scenarios and reduce bandwidth versus earlier V2VNet-style approaches. Its main value today is the framing: cooperation must be evaluated by driving outcomes, not only perception AP.

UniV2X

UniV2X is a unified end-to-end V2X cooperative driving framework. It integrates perception, online mapping, occupancy prediction, and planning across ego and infrastructure views. Its sparse-dense hybrid transmission design is important because airside deployments cannot assume unlimited bandwidth from every stand sensor to every vehicle.

V2X-VLM

V2X-VLM adds large vision-language models to vehicle-infrastructure cooperative autonomous driving. It combines vehicle and infrastructure camera views with text scene descriptions and uses contrastive alignment and distillation to improve trajectory planning. This is relevant for airside because cooperative text can encode operational context: pushback clearance, stand phase, jet blast warning, and ramp-control instructions.

V2X-Real and V2X-ReaLO

V2X-Real provides real-world multi-agent, multi-modal cooperative perception data with two vehicles and two infrastructure units. V2X-ReaLO shifts evaluation from offline cooperative AP to online replay with latency, synchronization, communication, and real-time fusion as first-class metrics. Airside V2X should adopt this online framing early.

M3CAD

M3CAD is a generic cooperative autonomous driving benchmark with 204 sequences and 30,000 frames, supporting object detection/tracking, mapping, motion forecasting, occupancy prediction, and path planning. Its importance is breadth: it moves cooperative autonomy toward multi-task planning benchmarks rather than isolated perception.


Relevance by Domain

Generic Road AV

V2X E2E driving is most useful for occluded intersections, emergency vehicles, blind merges, and infrastructure-assisted work zones. Public-road deployment faces fragmented infrastructure ownership and heterogeneous trust, so safety cases must assume unreliable participation.

Indoor Autonomy

Indoor fleets already rely on infrastructure and dispatch systems. Cooperative E2E concepts transfer to AMR/forklift interactions through shared maps, aisle occupancy, dock-door status, and WMS task messages. Wireless latency and localization drift still need explicit evaluation.

Outdoor Industrial Autonomy

Yards, ports, mining, and campuses are strong fits. They have bounded sites, private networks, central task systems, and recurring occlusions from trailers, containers, equipment, buildings, and stockpiles. Cooperative autonomy can improve both throughput and safety.

Airside Autonomy

Airports are one of the strongest fits for cooperative E2E autonomy:

  • The airport authority can govern infrastructure, maps, PKI, and network access.
  • Stand-level sensors can see around aircraft and GSE.
  • Operational messages carry safety-critical context that onboard sensors cannot infer.
  • Low speed makes 100 to 200 ms of cooperative latency more tolerable than it would be in highway driving.
  • V2X can encode default-deny clearances for hold-short and movement-area boundaries.
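
The latency point can be made concrete with simple arithmetic: the distance a vehicle covers while a cooperative message is in flight is speed times latency. The sketch below assumes illustrative speeds of 25 km/h for an apron service road and 110 km/h for a highway; neither figure comes from the source notes.

```python
def distance_during_latency(speed_kmh: float, latency_ms: float) -> float:
    """Metres travelled while a cooperative message is in flight."""
    speed_m_per_s = speed_kmh / 3.6
    return speed_m_per_s * (latency_ms / 1000.0)


# Assumed speeds for illustration only (not taken from any standard).
for label, speed_kmh in (("airside", 25.0), ("highway", 110.0)):
    for latency_ms in (100.0, 200.0):
        d = distance_during_latency(speed_kmh, latency_ms)
        print(f"{label:8s} {speed_kmh:5.0f} km/h {latency_ms:4.0f} ms -> {d:.2f} m")
```

Under these assumptions, 200 ms corresponds to roughly 1.4 m of travel airside versus roughly 6.1 m on a highway, which is why the same latency budget is easier to absorb at ramp speeds.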

Airside Architecture Pattern

Infrastructure sensors:
  stand LiDAR/cameras/radar, A-SMGCS, ADS-B, CCTV, FOD sensors

V2X / airport messages:
  cooperative features, object tracks, stand status, task assignment,
  pushback intent, jet blast warning, runway/taxiway clearance

Vehicle stack:
  onboard perception -> ego BEV / occupancy
  V2X receiver -> time/pose compensation -> trust scoring
  cooperative fusion -> future occupancy / intent
  E2E planner or VLA reasoner -> candidate trajectory
  Simplex safety gate -> control or fallback
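
The last stage of the vehicle stack can be sketched as a small decision function. This is a minimal illustration of the Simplex idea, not a reference design: the `Trajectory` type, the `simplex_gate` function, the 2.0 m clearance threshold, and the choice to check the candidate only against obstacles seen by onboard sensors are all assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in the ego map frame, metres


@dataclass(frozen=True)
class Trajectory:
    points: List[Point]
    source: str  # "cooperative" or "onboard_fallback"


def min_clearance(traj: Trajectory, obstacles: List[Point]) -> float:
    """Smallest waypoint-to-obstacle distance; inf when there are no obstacles."""
    best = float("inf")
    for px, py in traj.points:
        for ox, oy in obstacles:
            best = min(best, math.hypot(px - ox, py - oy))
    return best


def simplex_gate(candidate: Trajectory, fallback: Trajectory,
                 onboard_obstacles: List[Point], v2x_healthy: bool,
                 min_clearance_m: float = 2.0) -> Trajectory:
    """Accept the cooperative candidate only if the link is healthy and the
    candidate keeps the required clearance from onboard-detected obstacles."""
    if not v2x_healthy:
        return fallback
    if min_clearance(candidate, onboard_obstacles) < min_clearance_m:
        return fallback
    return candidate


fallback = Trajectory(points=[(0.0, 0.0), (1.0, 0.0)], source="onboard_fallback")
candidate = Trajectory(points=[(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)], source="cooperative")
onboard_obstacles = [(10.0, 1.0)]  # 1 m from the final candidate waypoint
chosen = simplex_gate(candidate, fallback, onboard_obstacles, v2x_healthy=True)
print(chosen.source)
```

Checking against onboard-only obstacles means cooperative data can add caution but cannot be the sole reason the gate accepts a trajectory. A real gate would also need kinematic, hazard-zone, and clearance checks.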

Required Runtime Metadata

Every cooperative message used by the planner should carry:

  • Sensor capture timestamp.
  • Publish, receive, and planner-consume timestamps.
  • Source identity and certificate/trust state.
  • Coordinate frame and pose covariance.
  • Staleness deadline.
  • Compression/fusion mode.
  • Confidence and uncertainty.
  • Degradation/fallback policy if the message disappears.
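
One way to carry the fields above is a small immutable record plus two helpers. This is a sketch under stated assumptions: the class and field names are hypothetical, timestamps are integer milliseconds on a shared synchronized clock, the staleness deadline is an absolute time, and pose covariance is reduced to a single position standard deviation for brevity.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class CoopMessageMeta:
    capture_ms: int        # sensor capture timestamp
    publish_ms: int        # time the source published the message
    receive_ms: int        # time the vehicle received it
    source_id: str         # source identity
    cert_valid: bool       # certificate/trust state (simplified to a flag)
    frame_id: str          # coordinate frame
    pose_sigma_m: float    # scalar stand-in for pose covariance
    deadline_ms: int       # absolute staleness deadline
    fusion_mode: str       # compression/fusion mode, e.g. "late" or "bev"
    confidence: float      # source-reported confidence in [0, 1]
    on_loss: str           # degradation/fallback policy name


def latency_breakdown(meta: CoopMessageMeta, consume_ms: int) -> Dict[str, int]:
    """Split message age into pipeline stages at planner-consume time."""
    return {
        "sensor_to_publish": meta.publish_ms - meta.capture_ms,
        "network": meta.receive_ms - meta.publish_ms,
        "queue_to_planner": consume_ms - meta.receive_ms,
        "total_age": consume_ms - meta.capture_ms,
    }


def is_usable(meta: CoopMessageMeta, consume_ms: int,
              min_confidence: float = 0.5, max_pose_sigma_m: float = 0.5) -> bool:
    """Reject untrusted, stale, low-confidence, or poorly localized messages."""
    return (meta.cert_valid
            and consume_ms <= meta.deadline_ms
            and meta.confidence >= min_confidence
            and meta.pose_sigma_m <= max_pose_sigma_m)


meta = CoopMessageMeta(1000, 1020, 1060, "stand_12_lidar", True, "apron_map",
                       0.2, 1200, "late", 0.9, "reduce_speed_onboard_only")
print(latency_breakdown(meta, consume_ms=1090))
print(is_usable(meta, consume_ms=1090), is_usable(meta, consume_ms=1250))
```

Recording the consume timestamp at the planner, rather than at the receiver, keeps queueing delay inside the vehicle visible in the same breakdown as network delay. The thresholds are placeholders to be set from site evidence.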

Implementation and Evaluation Notes

Training

  • Train ego-only, late-fusion, intermediate-fusion, and cooperative-E2E baselines on the same scenario split.
  • Include communication dropout, latency jitter, pose noise, and malicious/stale messages during training and validation.
  • Use planning loss in addition to perception, mapping, and occupancy losses.
  • For VLM/VLA variants, include text-only and wrong-text controls to detect language leakage.
  • Keep a vehicle-only fallback policy trained and evaluated separately.
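
The communication perturbations in the second bullet can be applied as a data augmentation step. The following is a minimal sketch, assuming a hypothetical `CoopSample` record and illustrative default rates; it covers dropout, latency jitter, and pose noise, and leaves malicious or replayed messages to a separate injector.

```python
import random
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class CoopSample:
    age_ms: float                  # age of the cooperative feature at fusion time
    pose_xy: Tuple[float, float]   # source pose in the map frame, metres
    payload_id: str                # handle to the shared feature or track list


def perturb(sample: CoopSample, rng: random.Random, p_drop: float = 0.2,
            jitter_ms: float = 100.0, pose_sigma_m: float = 0.3) -> Optional[CoopSample]:
    """Return a degraded copy of the sample, or None to simulate a dropped message."""
    if rng.random() < p_drop:
        return None  # dropout: the model must plan from ego sensors alone
    extra_delay = rng.uniform(0.0, jitter_ms)   # added latency jitter
    dx = rng.gauss(0.0, pose_sigma_m)           # pose noise, x
    dy = rng.gauss(0.0, pose_sigma_m)           # pose noise, y
    return replace(sample,
                   age_ms=sample.age_ms + extra_delay,
                   pose_xy=(sample.pose_xy[0] + dx, sample.pose_xy[1] + dy))


rng = random.Random(0)  # fixed seed so validation perturbations are repeatable
sample = CoopSample(age_ms=40.0, pose_xy=(1.0, 2.0), payload_id="stand_12")
print(perturb(sample, rng))
```

Using a seeded generator for validation keeps the perturbed split identical across baselines, which matters when comparing ego-only, late-fusion, intermediate-fusion, and cooperative-E2E models on the same scenarios.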

Evaluation

| Metric | Why it matters |
|---|---|
| Planning score delta over ego-only | Shows whether cooperation changes actual driving behavior |
| Deadline-aware AP / occupancy IoU | Measures perception improvement before data becomes stale |
| Latency-to-impact curve | Shows how performance degrades at 50, 100, 200, 500 ms |
| Bandwidth per vehicle and per stand | Determines network feasibility |
| Stale-message rejection | Prevents replayed clearances or old tracks from driving actions |
| Pose-error robustness | Airside infrastructure and vehicle frames drift |
| Fallback correctness | Vehicle must remain safe without V2X |
| Safety-gate failures | Aircraft/personnel/GSE/hazard-zone violations must be visible |
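
The first and third metrics combine naturally: sweep injected latency, score the cooperative planner at each point, and subtract the ego-only score. The sketch below assumes a hypothetical `evaluate` callable standing in for a closed-loop or replay evaluation; the scores in the usage example are made-up placeholders, not results.

```python
from typing import Callable, Dict, Optional, Sequence


def latency_to_impact(evaluate: Callable[[int], float], ego_only_score: float,
                      latencies_ms: Sequence[int] = (50, 100, 200, 500)) -> Dict[int, float]:
    """Planning score delta over ego-only at each injected latency."""
    return {lat: evaluate(lat) - ego_only_score for lat in latencies_ms}


def break_even_latency(curve: Dict[int, float]) -> Optional[int]:
    """Smallest tested latency at which cooperation no longer beats ego-only."""
    for lat in sorted(curve):
        if curve[lat] <= 0.0:
            return lat
    return None


# Placeholder scores for illustration only; replace with real evaluation runs.
toy_scores = {50: 78.0, 100: 74.0, 200: 66.0, 500: 51.0}
curve = latency_to_impact(lambda lat: toy_scores[lat], ego_only_score=70.0)
print(curve)
print(break_even_latency(curve))
```

The break-even latency gives a single number to compare against the measured network budget: if the deployed link regularly exceeds it, the planner should be treating cooperative input as advisory at best.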

Airside Test Scenarios

| Scenario | Cooperative value |
|---|---|
| Aircraft fuselage occludes ground crew | Infrastructure view resolves occlusion |
| Pushback starts near ego route | V2X intent and swept-volume prediction alter ego plan |
| Service-road crossing behind terminal corner | Stand or pole sensor gives early detection |
| FOD detected by another vehicle | Fleet-level report changes route and dispatches cleanup |
| Jet blast warning broadcast | Vehicle avoids invisible hazard |
| Ramp-control clearance revoked | Planner defaults to stop/hold |
| V2X link lost during stand approach | Simplex fallback reduces speed and uses onboard-only policy |
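
The clearance-revoked scenario depends on default-deny logic: entry is allowed only when a matching, current, unrevoked clearance is present, and every other case resolves to stop/hold. A minimal sketch follows, assuming a hypothetical `Clearance` record with integer-millisecond timestamps on a synchronized clock.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Clearance:
    zone_id: str          # movement-area or hold-short zone the clearance covers
    granted: bool         # explicit grant from ramp control
    issued_ms: int        # issue time
    valid_for_ms: int     # validity window after issue
    revoked: bool = False


def may_enter(zone_id: str, clearance: Optional[Clearance], now_ms: int) -> bool:
    """Default-deny: any missing, mismatched, revoked, or expired clearance means hold."""
    if clearance is None:
        return False
    if clearance.zone_id != zone_id:
        return False
    if clearance.revoked or not clearance.granted:
        return False
    age_ms = now_ms - clearance.issued_ms
    return 0 <= age_ms <= clearance.valid_for_ms


ok = Clearance(zone_id="twy_crossing_A3", granted=True, issued_ms=1000, valid_for_ms=5000)
print(may_enter("twy_crossing_A3", ok, now_ms=3000))
print(may_enter("twy_crossing_A3", None, now_ms=3000))
```

Rejecting negative ages as well as expired ones guards against clearances stamped in the future, which can indicate clock faults or replay attempts. Authenticating the clearance itself is out of scope for this sketch.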

Failure Modes

| Failure mode | Description | Mitigation |
|---|---|---|
| Stale cooperation | Planner uses features after the source scene has changed | Deadline checks and feature-flow/time compensation |
| Pose misalignment | Infrastructure features are fused in the wrong map location | Online calibration, pose covariance, map anchoring |
| Bandwidth collapse | Too many sensors publish dense features at peak operations | Where2comm-style selective sharing and DCC policies |
| False trust | Spoofed or faulty source injects wrong object/clearance | PKI, trust scoring, sensor cross-checks, misbehavior detection |
| Overfitting to cooperation | Policy cannot drive safely when V2X is absent | Mandatory ego-only fallback and V2X-loss scenarios |
| Planning-irrelevant perception gain | mAP improves but route safety/progress does not | Planning-coupled metrics |
| Language hallucination | V2X-VLM invents context or over-trusts text | Text-only controls, structured messages, Simplex safety gate |
| Deadlock from polite policies | Multiple vehicles yield forever | Maneuver coordination and task-level priority rules |
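
The deadlock mitigation can be illustrated with a deterministic tie-break that every vehicle evaluates identically from the same shared inputs. This is a sketch only: the `YieldCandidate` fields, the convention that a higher `task_priority` is more urgent, and the ordering (priority, then waiting time, then vehicle ID) are assumptions, not a rule from any airside standard.

```python
from typing import List, NamedTuple


class YieldCandidate(NamedTuple):
    vehicle_id: str
    task_priority: int   # higher means more urgent (assumed convention)
    waiting_ms: int      # how long the vehicle has been yielding


def who_proceeds(candidates: List[YieldCandidate]) -> str:
    """Pick exactly one vehicle to proceed; all others keep yielding."""
    if not candidates:
        raise ValueError("no candidates at the conflict point")
    winner = min(candidates,
                 key=lambda c: (-c.task_priority, -c.waiting_ms, c.vehicle_id))
    return winner.vehicle_id


conflict = [
    YieldCandidate("tug_07", task_priority=2, waiting_ms=4000),
    YieldCandidate("belt_loader_03", task_priority=2, waiting_ms=9000),
    YieldCandidate("bus_01", task_priority=1, waiting_ms=20000),
]
print(who_proceeds(conflict))
```

Because the key is a total order ending in a unique ID, the result does not depend on message arrival order, so two vehicles cannot both conclude that the other should go first.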

| Document | Relevance |
|---|---|
| V2X Protocols Airside | Airside message standards and fallback behavior |
| Infrastructure-Cooperative Perception | Cooperative perception methods and airport deployment |
| Collaborative Fleet Perception | Fleet-level perception sharing |
| V2X-ReaLO | Online cooperative perception benchmark pattern |
| Fleet Coordination | Multi-agent and fleet dispatch context |
| Airside Multi-Agent | Airside coordination scenarios |
| Evaluation Benchmarks: NAVSIM and Bench2Drive | E2E evaluation methodology |
| Airside Autonomy Benchmark Spec | Domain-specific cooperative benchmark track |
| VLM/VLA Reliability Benchmarks | V2X-VLM and VLA reliability controls |
| Simplex Safety Architecture | Safety-gated cooperative stack pattern |

Sources

Public research notes collected from public sources.