Testing and Validation Methodology for Airside Autonomous Vehicles

A comprehensive guide to systematically testing and validating autonomous ground vehicles for airport airside operations. Covers the V-model testing framework, scenario-based testing with ASAM OpenSCENARIO 2.0, coverage metrics, statistical safety arguments, simulation-based V&V (SIL/HIL/VIL), shadow mode validation, regression testing, digital twin construction, airside-specific test protocols, and test infrastructure requirements. Designed to produce the certification evidence required by ISO 3691-4, EU Machinery Regulation 2023/1230, UL 4600, and anticipated FAA Advisory Circulars.


Table of Contents

  1. Testing Framework Overview
  2. Scenario-Based Testing
  3. Coverage Metrics
  4. Corner Case and Adversarial Testing
  5. Simulation-Based Verification and Validation
  6. Statistical Safety Arguments
  7. Shadow Mode Validation
  8. Regression Testing
  9. Digital Twin Validation for Airside
  10. Airside-Specific Test Protocols
  11. Test Infrastructure Requirements
  12. Key Findings Summary
  13. References

1. Testing Framework Overview

1.1 The V-Model for Autonomous Vehicles

The V-model is the industry-standard systems engineering process for safety-critical development. For autonomous vehicles, it maps the progression from requirements through design and implementation on the left side, with corresponding verification and validation activities mirrored on the right side. Each level on the left produces artifacts that are verified by the matching level on the right.

Requirements Analysis  ─────────────────────────────  Acceptance Testing
  (ODD, safety goals,                                   (Airport field trials,
   performance targets)                                   regulatory demonstration)
        │                                                       ▲
        ▼                                                       │
  System Design  ──────────────────────────────────  System Testing
  (Architecture, interfaces,                           (Full vehicle on test track,
   Simplex dual-stack)                                   SIL full-stack sim)
        │                                                       ▲
        ▼                                                       │
  Subsystem Design  ───────────────────────────────  Integration Testing
  (Perception, planning,                               (Subsystem interfaces,
   control, safety monitor)                              sensor fusion, HIL)
        │                                                       ▲
        ▼                                                       │
  Module Design  ──────────────────────────────────  Unit Testing
  (LiDAR segmentation,                                (Module-level tests,
   Frenet planner, GTSAM)                               code coverage, MC/DC)
        │                                                       ▲
        ▼                                                       │
  Implementation  ─────────────────────────────────  Code Review + Static Analysis
  (C++ nodelets, Python                                (MISRA C, cppcheck,
   scripts, model weights)                               clang-tidy, Polyspace)

1.2 Mapping to ISO 3691-4

ISO 3691-4:2023 Section 5 (Testing and Verification) requires specific test procedures that map to V-model levels as follows:

| ISO 3691-4 Requirement | V-Model Level | Test Method | Evidence Artifact |
| --- | --- | --- | --- |
| Hazard identification (Clause 4.1) | Requirements | Risk assessment workshop | Hazard log, STPA analysis |
| Safety function specification (Clause 4.2) | System design | Requirements traceability | Safety requirements spec |
| Personnel detection (Clause 4.3) | Integration testing | Physical test with dummies, SiL | Detection rate report |
| Emergency stop (Clause 4.4) | System testing | Physical braking tests | Stopping distance data |
| Speed limiting (Clause 4.5) | Unit testing | Speedometer calibration | Speed verification log |
| Warning devices (Clause 4.6) | Integration testing | Audible/visual warning tests | Warning test report |
| Environmental testing (Clause 4.7) | System testing | Weather matrix tests | Environmental test report |
| Documentation (Clause 6) | All levels | Document review | Technical Construction File |

1.3 Mapping to EU Machinery Regulation 2023/1230

The EU Machinery Regulation 2023/1230 (replacing Directive 2006/42/EC, effective 20 January 2027) introduces new requirements specifically for autonomous mobile machinery with AI components:

| Regulation Requirement | Testing Implication |
| --- | --- |
| Article 5: Conformity assessment for high-risk AI | Third-party assessment required for autonomous vehicles (not self-certification) |
| Annex III, Section 1.1.2: Principles of safety integration | V-model evidence across all levels |
| Annex III, Section 1.2.1: Safety and reliability of control systems | PLd (ISO 13849-1) for safety-critical functions; systematic failure testing |
| Annex III, Section 1.3.7: Risks related to moving parts | Collision avoidance testing at all operating speeds |
| Annex III, Section 1.6.1: Maintenance | Testing that maintenance procedures do not introduce unsafe states |
| AI-specific: Continuous learning systems | Evidence that model updates do not degrade safety (regression testing) |
| AI-specific: Cybersecurity | Penetration testing, adversarial input testing |

1.4 Mapping to UL 4600 and ANSI/UL 3100

UL 4600 (Standard for Safety for the Evaluation of Autonomous Products) provides a framework that is complementary to ISO 3691-4 and is increasingly referenced for AV safety cases in North America.

| UL 4600 Topic | V-Model Mapping | Key Activities |
| --- | --- | --- |
| Clause 7: Risk assessment | Requirements | ODD definition, hazard analysis, SOTIF triggering conditions |
| Clause 8: Lifecycle management | All levels | Configuration management, change impact analysis |
| Clause 9: Sensor validation | Unit / Integration | Sensor performance characterization, degradation testing |
| Clause 10: Software validation | Unit / Integration | Code coverage, static analysis, formal verification |
| Clause 11: Interaction safety | System testing | Personnel detection, mixed traffic, teleoperation handoff |
| Clause 12: Operational safety | Acceptance | Shadow mode results, intervention rates, field trial data |
| Clause 13: Continuous improvement | Post-deployment | OTA update validation, regression gates, fleet monitoring |

ANSI/UL 3100 (Standard for Safety for Autonomous Mobile Platforms) applies more directly to industrial autonomous vehicles and references ISO 3691-4 while adding North American-specific requirements for obstacle detection testing and pathway safety.

1.5 Testing Philosophy: Defense in Depth

No single testing method is sufficient for autonomous vehicle safety certification. The industry has converged on a defense-in-depth approach where multiple testing layers provide overlapping evidence:

┌─────────────────────────────────────────────────────────┐
│  Layer 5: Field Operations Monitoring (continuous)      │
│  Fleet telemetry, intervention tracking, incident review│
├─────────────────────────────────────────────────────────┤
│  Layer 4: Shadow Mode (50,000+ km before autonomous)    │
│  Real-world data, no safety risk, decision comparison   │
├─────────────────────────────────────────────────────────┤
│  Layer 3: Physical Testing (test track + airport)       │
│  Hardware validation, real sensor performance           │
├─────────────────────────────────────────────────────────┤
│  Layer 2: Hardware-in-the-Loop (10,000+ hours)          │
│  Real compute + simulated sensors, timing validation    │
├─────────────────────────────────────────────────────────┤
│  Layer 1: Software-in-the-Loop (10M+ scenario-km)       │
│  Statistical coverage, adversarial search, regression   │
├─────────────────────────────────────────────────────────┤
│  Layer 0: Code-Level Verification (continuous)          │
│  Static analysis, unit tests, MC/DC, formal methods     │
└─────────────────────────────────────────────────────────┘

Each layer catches different failure modes. Code-level verification catches logic errors and coding standard violations. SiL catches algorithmic failures across millions of scenarios. HiL catches timing issues and hardware interaction bugs. Physical testing validates real-world sensor performance. Shadow mode validates the full system in the real operational environment. Field monitoring catches long-tail scenarios that no other layer can reach.


2. Scenario-Based Testing

2.1 ASAM OpenSCENARIO 2.0 for Scenario Specification

ASAM OpenSCENARIO 2.0 (now known as OpenSCENARIO DSL) provides a domain-specific language for describing driving scenarios at multiple levels of abstraction. Unlike OpenSCENARIO 1.x (XML-based, concrete scenarios only), the DSL supports abstract and logical scenario definitions with constraint-based parameter variation.

Core concepts:

| Concept | Description | Example |
| --- | --- | --- |
| Actor | Entity participating in the scenario | ego_vehicle, ground_crew_member, aircraft_a320 |
| Action | Behavior performed by an actor | drive_to(stand_42), emergency_stop(), cross_path() |
| Event | Trigger that initiates an action | distance_to(ego, aircraft) < 5m, turnaround_phase == loading |
| Parameter | Variable with range or distribution | ego_speed: [5..25] km/h, visibility: [50..1000] m |
| Constraint | Relationship between parameters | ego_speed <= max_speed_for_zone(zone) |
| Coverage | Metric tracked over parameter space | cover(ego_speed, every(1, km/h)) |

Airside domain extensions (extending the taxonomy from airside-scenario-taxonomy.md):

// Airside actor types
type airside_actor inherits actor:
    role: airside_role

type airside_role: enum of [
    baggage_tractor, belt_loader, container_loader,
    catering_truck, fuel_truck, pushback_tug,
    gpu, lavatory_truck, water_truck,
    passenger_stairs, deicing_truck, follow_me_car,
    ground_crew_standing, ground_crew_crouching,
    marshaller, wing_walker, ramp_agent,
    aircraft_narrow_body, aircraft_wide_body,
    emergency_vehicle, maintenance_vehicle
]

// Airside environment
type airside_environment:
    zone: airside_zone
    weather: weather_condition
    lighting: lighting_condition
    surface: surface_condition
    temperature_c: float
    wind_speed_kt: float
    wind_direction_deg: float

type airside_zone: enum of [
    apron, service_road, taxiway_crossing,
    depot, maintenance_area, fuel_farm_perimeter,
    deicing_pad, cargo_area
]

// Scenario template
scenario airside_transit_with_crossing:
    ego: baggage_tractor
    environment: airside_environment
    
    do parallel:
        ego.drive_along(service_road_route)
        with:
            speed(ego) in [10..25] km/h
    
    do serial:
        wait elapsed(uniform(30, 120) s)
        aircraft_1.taxi_across(crossing_point)
        with:
            speed(aircraft_1) in [8..15] m/s
            distance_at_crossing(ego, aircraft_1) in [20..200] m
    
    cover(speed(ego), every(2.5, km/h))
    cover(distance_at_crossing, every(10, m))

2.2 Three-Tier Scenario Abstraction

Following ISO 34502 adapted for airside (see airside-scenario-taxonomy.md Section 2.1), each scenario is refined through three tiers:

Tier 1: Functional Scenario (natural language)

"Autonomous baggage tractor approaches a narrow-body aircraft stand while a belt loader is operating on the opposite side and ground crew are walking between the aircraft fuselage and adjacent GSE."

Tier 2: Logical Scenario (parameterized)

| Parameter | Range | Distribution |
| --- | --- | --- |
| ego_speed | 5-15 km/h | Uniform |
| aircraft_type | A320, B737, A321 | Categorical (40%, 35%, 25%) |
| belt_loader_position | Left/Right of aircraft nose | Uniform |
| num_ground_crew | 1-5 | Poisson(lambda=2) |
| crew_behavior | Standing, Walking, Crouching | Categorical (30%, 50%, 20%) |
| lighting | Day, Dusk, Night | Categorical (50%, 20%, 30%) |
| surface_condition | Dry, Wet, Icy | Categorical (60%, 30%, 10%) |
| ego_approach_angle | -30 to +30 degrees from centerline | Normal(0, 10) |

Tier 3: Concrete Scenario (executable)

ego_speed=11.2 km/h, aircraft=A320, belt_loader=Left, num_crew=3, crew_behavior=[Walking, Standing, Crouching], lighting=Dusk, surface=Wet, approach_angle=+7.3 degrees. Crew member #2 at position (12.4, -3.1) steps into ego path at t=4.7s.

2.3 Concrete Scenario Generation from Logical Scenarios

Given the parameterized logical scenario, concrete scenarios are generated through several complementary strategies:

Strategy 1: Grid sampling (deterministic coverage)

Discretize each parameter and sample the full grid. For 8 parameters with 5 levels each: 5^8 = 390,625 concrete scenarios per logical scenario. This is feasible in simulation but excessive -- use covering arrays (Section 2.4) to reduce.

Strategy 2: Random sampling (Monte Carlo)

Sample parameters from their specified distributions. Simple to implement but provides poor coverage of rare parameter combinations. Useful for initial exploration.

Strategy 3: Latin Hypercube Sampling (stratified)

Divide each parameter range into N equal strata, sample once from each stratum, then randomly pair across parameters. Ensures better coverage than pure Monte Carlo with the same number of samples.

Strategy 4: Importance sampling (Section 2.5)

Bias sampling toward regions of the parameter space where failures are more likely. Requires a prior estimate of failure probability (from previous test campaigns or expert judgment).

Strategy 5: Adversarial search (Section 4)

Use optimization algorithms (CMA-ES, Bayesian optimization) to actively search for failure-inducing parameter combinations.

python
import numpy as np
from scipy.stats import qmc

class ScenarioGenerator:
    """Generate concrete scenarios from logical scenario parameters."""
    
    def __init__(self, parameter_defs: dict):
        """
        parameter_defs: {
            'ego_speed': {'type': 'continuous', 'low': 5, 'high': 25, 'unit': 'km/h'},
            'lighting': {'type': 'categorical', 'values': ['day', 'dusk', 'night'],
                         'probabilities': [0.5, 0.2, 0.3]},
            'num_crew': {'type': 'discrete', 'low': 0, 'high': 8},
            ...
        }
        """
        self.params = parameter_defs
        self.continuous_params = {k: v for k, v in parameter_defs.items() 
                                  if v['type'] == 'continuous'}
        self.categorical_params = {k: v for k, v in parameter_defs.items() 
                                    if v['type'] == 'categorical'}
        self.discrete_params = {k: v for k, v in parameter_defs.items() 
                                 if v['type'] == 'discrete'}
    
    def latin_hypercube(self, n_samples: int, seed: int = 42) -> list[dict]:
        """Generate scenarios using Latin Hypercube Sampling."""
        n_continuous = len(self.continuous_params)
        sampler = qmc.LatinHypercube(d=n_continuous, seed=seed)
        unit_samples = sampler.random(n=n_samples)  # [0,1]^d
        
        scenarios = []
        for i in range(n_samples):
            scenario = {}
            # Map continuous parameters from [0,1] to their ranges
            for j, (name, pdef) in enumerate(self.continuous_params.items()):
                scenario[name] = pdef['low'] + unit_samples[i, j] * (pdef['high'] - pdef['low'])
            # Sample categorical parameters
            for name, pdef in self.categorical_params.items():
                scenario[name] = np.random.choice(pdef['values'], p=pdef['probabilities'])
            # Sample discrete parameters
            for name, pdef in self.discrete_params.items():
                scenario[name] = np.random.randint(pdef['low'], pdef['high'] + 1)
            scenarios.append(scenario)
        return scenarios
    
    def importance_sample(self, n_samples: int, failure_prior: callable,
                          oversampling_factor: float = 5.0) -> list[dict]:
        """
        Sample with bias toward high-failure-probability regions.
        failure_prior: function(scenario_dict) -> float [0,1] estimated failure probability
        """
        # Generate candidate pool
        candidates = self.latin_hypercube(int(n_samples * oversampling_factor))
        # Score each candidate
        scores = np.array([failure_prior(s) for s in candidates])
        # Normalize to probability distribution
        total = scores.sum()
        # Fall back to uniform sampling if the prior assigns zero probability everywhere
        probs = scores / total if total > 0 else np.full(len(scores), 1.0 / len(scores))
        # Sample without replacement proportional to failure probability
        indices = np.random.choice(len(candidates), size=n_samples, 
                                   replace=False, p=probs)
        return [candidates[i] for i in indices]
    
    def grid_sample(self, levels_per_param: int = 5) -> list[dict]:
        """Full grid sampling (use with caution -- combinatorial explosion)."""
        import itertools
        grids = {}
        for name, pdef in self.continuous_params.items():
            grids[name] = np.linspace(pdef['low'], pdef['high'], levels_per_param).tolist()
        for name, pdef in self.categorical_params.items():
            grids[name] = pdef['values']
        for name, pdef in self.discrete_params.items():
            grids[name] = list(range(pdef['low'], pdef['high'] + 1))
        
        keys = list(grids.keys())
        values = [grids[k] for k in keys]
        scenarios = [dict(zip(keys, combo)) for combo in itertools.product(*values)]
        return scenarios

2.4 N-Wise Parameter Combination (Covering Arrays)

Full combinatorial testing is intractable for scenarios with many parameters. Covering arrays provide a mathematically principled way to reduce the number of test cases while guaranteeing that every combination of N parameters is tested at least once.

Definitions:

  • Pairwise (2-wise) covering array: Every pair of parameter values appears in at least one test case. For k parameters with v values each, the minimum size is approximately v^2 * ln(k), which is O(v^2 * log(k)) -- much smaller than v^k.
  • 3-wise covering array: Every triple of parameter values appears. Stronger coverage but more test cases.
  • N-wise covering array: Every N-tuple of parameter values appears.

Empirical justification for pairwise testing: Studies by Kuhn, Wallace, and Gallo (NIST, 2004) analyzed 329 software failures and found that 93% of bugs were triggered by the interaction of at most 3 parameters, and 98% by at most 4. This suggests that 3-wise or 4-wise covering arrays capture nearly all parameter-interaction faults.

Example for airside scenario testing:

Consider a scenario with 8 parameters, each discretized to 5 levels:

  • Full combinatorial: 5^8 = 390,625 test cases
  • Pairwise covering array: approximately 50-80 test cases (>4,800x reduction)
  • 3-wise covering array: approximately 250-400 test cases (>970x reduction)
  • 4-wise covering array: approximately 800-1,500 test cases (>260x reduction)

Tool support:

| Tool | Capability | License |
| --- | --- | --- |
| ACTS (NIST) | Up to 6-wise, mixed-level, constraint support | Free (US govt) |
| CAgen | Pairwise and higher, large parameter spaces | Open source |
| Jenny | Pairwise, fast for large numbers of parameters | Open source |
| Pairwise online | Browser-based pairwise generation | Free |
| PICT (Microsoft) | Pairwise and N-wise with constraints | Open source |

Constraint handling: Not all parameter combinations are physically valid. For example, "icy surface" with "temperature = 30C" is impossible. Covering array tools support constraints to exclude invalid combinations:

# PICT constraint syntax example
IF [surface] = "icy" THEN [temperature_c] <= 0;
IF [zone] = "depot" THEN [aircraft_state] = "none";
IF [turnaround_phase] = "pushback" THEN [aircraft_state] IN {"engines_starting", "pushback"};
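
Constrained pairwise generation can also be scripted directly in Python. The sketch below uses the open-source allpairspy library as one possible generator; the parameter discretization and the two constraint checks are illustrative assumptions, not the project's actual values.

python
# Sketch: pairwise combination generation with a constraint filter, using the
# open-source allpairspy library. Parameter values and constraints below are
# illustrative, not the project's actual discretization.
from allpairspy import AllPairs

parameters = [
    ["apron", "service_road", "taxiway_crossing", "depot"],  # zone
    ["day", "dusk", "night"],                                # lighting
    ["dry", "wet", "icy"],                                   # surface
    [-10, 5, 20, 35],                                        # temperature_c
    ["none", "parked", "taxiing"],                           # aircraft_state
]

def is_valid(partial):
    """Reject physically impossible partial combinations."""
    values = list(partial)
    if len(values) >= 4 and values[2] == "icy" and values[3] > 0:
        return False  # icy surface above freezing is impossible
    if len(values) >= 5 and values[0] == "depot" and values[4] != "none":
        return False  # no aircraft inside the depot
    return True

pairwise_cases = [list(case) for case in AllPairs(parameters, filter_func=is_valid)]
print(f"{len(pairwise_cases)} pairwise test cases")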

2.5 Importance Sampling for Rare Events

Most airside scenarios are nominal (vehicle drives along service road, nobody crosses, weather is clear). The critical scenarios -- those that test safety boundaries -- are rare in the natural distribution. Testing uniformly across the parameter space wastes most of the test budget on uninteresting scenarios.

Importance sampling biases the test distribution toward high-risk regions:

Step 1: Define the natural distribution p(x) over scenario parameters x. This represents the frequency of scenarios in actual airport operations. Sources: fleet telemetry data, airport operations statistics, turnaround timing databases.

Step 2: Define the target distribution q(x) that oversamples dangerous regions. Approaches:

  • Expert-defined risk weighting: Multiply natural probability by a risk factor based on hazard analysis. For example, weight "ground crew crouching behind GSE" 10x relative to its natural frequency.
  • Failure-probability weighting: Use results from initial test campaigns to estimate P(failure|x) and set q(x) proportional to p(x) * P(failure|x).
  • Cross-entropy method (CEM): Iteratively update q(x) toward the distribution over scenarios that cause failures. Start with p(x), run tests, identify failures, fit a new distribution to failure-inducing scenarios, repeat.

Step 3: Correct for bias. When computing aggregate metrics (e.g., overall failure rate), weight each test result by the importance weight w(x) = p(x) / q(x) to obtain unbiased estimates:

Unbiased failure rate = (1/N) * sum_i [ w(x_i) * indicator(failure at x_i) ]
                      = (1/N) * sum_i [ p(x_i)/q(x_i) * indicator(failure at x_i) ]

Variance reduction: Importance sampling can dramatically reduce the variance of rare-event probability estimates. For an event with probability 10^-6 under p(x), naive Monte Carlo requires ~10^8 samples for a reliable estimate. With a well-chosen q(x), the same precision may require only 10^3 to 10^4 samples.
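
The Step 3 bias correction is straightforward to implement. A minimal sketch follows, assuming test results are available as dictionaries and that callable density functions p(x) and q(x) are provided by the test campaign tooling (both are assumptions for illustration).

python
# Sketch of the Step 3 bias correction: estimate the natural-distribution
# failure rate from tests drawn under a biased proposal q(x).
import numpy as np

def importance_weighted_failure_rate(results, p_density, q_density):
    """
    results: list of dicts with 'params' (scenario x_i) and 'failed' (bool)
    p_density(x): natural scenario density p(x)
    q_density(x): biased sampling density q(x) actually used for testing
    Returns (estimate, standard_error).
    """
    weights = np.array([p_density(r['params']) / q_density(r['params'])
                        for r in results])
    indicators = np.array([1.0 if r['failed'] else 0.0 for r in results])
    contributions = weights * indicators          # w(x_i) * 1[failure at x_i]
    estimate = contributions.mean()               # unbiased estimate under q
    stderr = contributions.std(ddof=1) / np.sqrt(len(results))
    return estimate, stderr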

2.6 Critical Scenario Identification

Beyond sampling, we need principled methods to identify which scenarios are most critical for safety validation.

Responsibility-Sensitive Safety (RSS) violation detection:

RSS (Shalev-Shwartz et al., 2017) defines formal rules for safe longitudinal and lateral behavior. A scenario is critical if the AV's planned trajectory would violate an RSS constraint even with correct behavior. This indicates a scenario where the safe response set is small or empty.

For airside operations, RSS parameters are adapted (see ../runtime-assurance/simplex-safety-architecture.md Section 2):

  • Minimum longitudinal safe distance from aircraft: 3-5 m (vs. 1-2 m for cars)
  • Maximum response time: 0.5 s (slow speed, safety-critical)
  • Maximum comfortable deceleration: 3 m/s^2 (loaded baggage tractor)
  • Maximum emergency deceleration: 5-6 m/s^2 (depends on surface, load)
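
A minimal sketch of the standard RSS longitudinal safe-distance formula evaluated with the adapted airside parameters above; the maximum acceleration during the response time is an assumed value not given in the text.

python
# Sketch: RSS minimum longitudinal safe distance (Shalev-Shwartz et al., 2017)
# with the airside parameters above. a_accel_max is an assumption.
def rss_longitudinal_safe_distance(v_rear: float, v_front: float,
                                   rho: float = 0.5,          # response time [s]
                                   a_accel_max: float = 1.0,   # assumed accel during rho [m/s^2]
                                   a_brake_min: float = 3.0,   # comfortable decel [m/s^2]
                                   a_brake_max_front: float = 6.0) -> float:
    """All speeds in m/s; returns the minimum safe following distance in metres."""
    d = (v_rear * rho
         + 0.5 * a_accel_max * rho ** 2
         + (v_rear + rho * a_accel_max) ** 2 / (2 * a_brake_min)
         - v_front ** 2 / (2 * a_brake_max_front))
    return max(d, 0.0)

# Example: ego at 25 km/h (6.9 m/s) approaching a stationary vehicle
# print(rss_longitudinal_safe_distance(6.9, 0.0))  # ~12.7 m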

Time-to-collision (TTC) metrics:

| Metric | Definition | Critical Threshold |
| --- | --- | --- |
| TTC | Time until collision assuming constant velocities | < 3 s |
| TTC* | Time until collision assuming constant accelerations | < 3 s |
| PET (Post-Encroachment Time) | Time between one actor leaving a conflict point and another arriving | < 1.5 s |
| DRAC (Deceleration Rate to Avoid Collision) | Required deceleration to prevent collision | > 3 m/s^2 |
| Minimum distance | Closest point of approach | < 2 m (GSE), < 5 m (aircraft) |

Scenario criticality score:

python
def scenario_criticality(scenario_result: dict) -> float:
    """
    Compute a criticality score [0, 1] for a completed scenario.
    Higher score = more critical (closer to safety boundary).
    """
    weights = {
        'ttc': 0.25,
        'min_distance': 0.25,
        'rss_margin': 0.20,
        'decel_required': 0.15,
        'speed_at_closest': 0.15,
    }
    
    scores = {}
    
    # TTC: critical if < 3s, maximum criticality at 0s
    ttc = scenario_result.get('min_ttc', float('inf'))
    scores['ttc'] = max(0, 1 - ttc / 3.0)
    
    # Minimum distance: critical if < 5m (aircraft) or < 2m (GSE/personnel)
    min_dist = scenario_result.get('min_distance', float('inf'))
    threshold = 5.0 if scenario_result.get('closest_actor_type') == 'aircraft' else 2.0
    scores['min_distance'] = max(0, 1 - min_dist / threshold)
    
    # RSS margin: critical if negative (violation)
    rss_margin = scenario_result.get('rss_longitudinal_margin', float('inf'))
    scores['rss_margin'] = max(0, min(1, -rss_margin / 2.0)) if rss_margin < 0 else 0
    
    # Required deceleration: critical if > 3 m/s^2
    decel = scenario_result.get('max_decel_required', 0)
    scores['decel_required'] = min(1, max(0, (decel - 1.0) / 4.0))
    
    # Speed at closest approach: higher speed = more critical
    speed = scenario_result.get('speed_at_closest_approach', 0)
    scores['speed_at_closest'] = min(1, speed / 25.0)  # normalize to max operating speed
    
    return sum(weights[k] * scores[k] for k in weights)

3. Coverage Metrics

3.1 ODD Coverage Analysis

The Operational Design Domain (ODD) defines the conditions under which the AV is designed to operate. ODD coverage measures what fraction of the ODD has been tested.

ODD dimension decomposition:

| ODD Dimension | Sub-dimensions | Testable Values |
| --- | --- | --- |
| Geography | Airport zones (apron, service road, taxiway crossing, depot) | 4 zone types, per-airport variants |
| Time of day | Dawn, Day, Dusk, Night | 4 levels |
| Weather | Clear, Rain (light/heavy), Fog, Snow, De-icing spray | 6 conditions |
| Surface | Dry concrete, Wet concrete, Standing water, Ice/frost, Oil/fuel spill | 5 conditions |
| Temperature | -20C to +50C | 8 intervals |
| Traffic density | 0 (empty) to 20+ GSE in apron area | 5 levels |
| Personnel density | 0 to 15+ ground crew in detection range | 5 levels |
| Aircraft presence | None, Parked (cold), Parked (APU), Engines starting, Taxiing | 5 states |
| Sensor health | All nominal, Single LiDAR degraded, Multi-LiDAR degraded, Camera fallback | 4 states |
| Communication | Full connectivity, Degraded, No connectivity | 3 states |

Total ODD cells: 4 * 4 * 6 * 5 * 8 * 5 * 5 * 5 * 4 * 3 = 5,760,000

ODD coverage metric:

python
def compute_odd_coverage(test_results: list[dict], odd_grid: dict) -> dict:
    """
    Compute ODD coverage from test results.
    
    test_results: list of scenario results with ODD dimension values
    odd_grid: dict mapping dimension names to lists of discretized values
    
    Returns: {
        'overall_coverage': float,  # fraction of ODD cells with at least 1 test
        'per_dimension': dict,       # coverage per dimension
        'uncovered_cells': list,     # ODD cells with zero tests
        'weakly_covered': list,      # ODD cells with < min_tests
    }
    """
    from itertools import product
    
    # Build set of all ODD cells
    dimensions = list(odd_grid.keys())
    all_cells = set()
    for combo in product(*[odd_grid[d] for d in dimensions]):
        all_cells.add(combo)
    
    # Map test results to ODD cells
    covered_cells = {}  # cell -> count
    for result in test_results:
        cell = tuple(result.get(d) for d in dimensions)
        covered_cells[cell] = covered_cells.get(cell, 0) + 1
    
    # Compute coverage
    n_total = len(all_cells)
    n_covered = len(covered_cells)
    
    # Per-dimension coverage (marginal)
    per_dim_coverage = {}
    for i, dim in enumerate(dimensions):
        dim_values = set(odd_grid[dim])
        tested_values = set(cell[i] for cell in covered_cells.keys())
        per_dim_coverage[dim] = len(tested_values) / len(dim_values)
    
    # Identify uncovered and weakly covered
    min_tests_per_cell = 3  # minimum for any statistical confidence
    uncovered = [cell for cell in all_cells if cell not in covered_cells]
    weakly_covered = [cell for cell, count in covered_cells.items() 
                      if count < min_tests_per_cell]
    
    return {
        'overall_coverage': n_covered / n_total,
        'per_dimension': per_dim_coverage,
        'n_total_cells': n_total,
        'n_covered_cells': n_covered,
        'uncovered_cells': uncovered[:100],  # truncate for display
        'n_uncovered': len(uncovered),
        'weakly_covered_cells': weakly_covered[:100],
        'n_weakly_covered': len(weakly_covered),
    }

Practical guidance: Achieving 100% ODD coverage is infeasible because many cells represent physically impossible or extremely rare combinations. A more realistic target:

| Coverage Level | Target | Interpretation |
| --- | --- | --- |
| Dimension marginal coverage | 100% | Every value of every dimension tested in at least one scenario |
| Pairwise cell coverage | >95% | Every pair of dimension values tested together |
| 3-wise cell coverage | >80% | Every triple of dimension values tested together |
| Full cell coverage | >10% | At least 10% of all cells have at least one test |
| Safety-critical cell coverage | 100% | All cells identified as high-risk in hazard analysis tested |

3.2 Scenario Space Coverage

The airside scenario taxonomy defines 115 functional scenarios across 8 categories (see airside-scenario-taxonomy.md). Scenario space coverage tracks how many of these have been tested and at what depth.

| Category | Functional Scenarios | Target Logical Scenarios | Target Concrete Scenarios |
| --- | --- | --- | --- |
| Transit operations | 14 | 70 (5 per functional) | 7,000 (100 per logical) |
| Stand approach | 12 | 60 | 6,000 |
| Turnaround support | 18 | 90 | 9,000 |
| Personnel interaction | 22 | 110 | 11,000 |
| GSE interaction | 15 | 75 | 7,500 |
| Environmental hazards | 16 | 80 | 8,000 |
| Emergency situations | 10 | 50 | 5,000 |
| Multi-vehicle coordination | 8 | 40 | 4,000 |
| Total | 115 | 575 | 57,500 |

This is a minimum test suite for SiL simulation. Physical testing will cover a fraction (see Section 10).

Scenario coverage tracking:

python
class ScenarioCoverageTracker:
    """Track scenario test coverage against the airside taxonomy."""
    
    def __init__(self, taxonomy: dict):
        """
        taxonomy: {
            'category_name': {
                'functional_scenarios': [
                    {
                        'id': 'FS-TR-001',
                        'description': 'Nominal transit on service road',
                        'logical_scenarios': [...],
                        'risk_level': 'low',
                    },
                    ...
                ],
            },
            ...
        }
        """
        self.taxonomy = taxonomy
        self.test_results = {}  # scenario_id -> list of test results
    
    def record_test(self, scenario_id: str, result: dict):
        """Record a test execution against a scenario."""
        if scenario_id not in self.test_results:
            self.test_results[scenario_id] = []
        self.test_results[scenario_id].append(result)
    
    def coverage_report(self) -> dict:
        """Generate coverage report."""
        report = {'categories': {}, 'summary': {}}
        total_fs = 0
        covered_fs = 0
        total_tests = 0
        total_failures = 0
        
        for cat_name, cat_data in self.taxonomy.items():
            cat_covered = 0
            cat_total = len(cat_data['functional_scenarios'])
            cat_tests = 0
            cat_failures = 0
            
            for fs in cat_data['functional_scenarios']:
                total_fs += 1
                results = self.test_results.get(fs['id'], [])
                cat_tests += len(results)
                failures = [r for r in results if not r.get('passed', True)]
                cat_failures += len(failures)
                if len(results) > 0:
                    covered_fs += 1
                    cat_covered += 1
            
            total_tests += cat_tests
            total_failures += cat_failures
            
            report['categories'][cat_name] = {
                'total_functional': cat_total,
                'covered_functional': cat_covered,
                'coverage_pct': 100 * cat_covered / cat_total if cat_total > 0 else 0,
                'total_tests': cat_tests,
                'failures': cat_failures,
                'pass_rate': 100 * (cat_tests - cat_failures) / cat_tests if cat_tests > 0 else 0,
            }
        
        report['summary'] = {
            'total_functional_scenarios': total_fs,
            'covered_functional_scenarios': covered_fs,
            'overall_coverage_pct': 100 * covered_fs / total_fs if total_fs > 0 else 0,
            'total_test_executions': total_tests,
            'total_failures': total_failures,
            'overall_pass_rate': 100 * (total_tests - total_failures) / total_tests if total_tests > 0 else 0,
        }
        return report

3.3 Code Coverage for Safety-Critical Paths

Code coverage metrics for the safety-critical C++ nodelets in the reference airside AV ROS stack:

| Metric | Description | Target (ASIL-B) | Target (Non-safety) | Tool |
| --- | --- | --- | --- | --- |
| Statement coverage | % of code statements executed | 100% | >80% | gcov, lcov |
| Branch coverage | % of control flow branches taken | 100% | >70% | gcov, lcov |
| MC/DC (Modified Condition/Decision Coverage) | Each condition independently affects the decision | Required | Not required | BullseyeCoverage, VectorCAST |
| Function coverage | % of functions called | 100% | >90% | gcov |

MC/DC requirement (ISO 26262 Part 6): ISO 26262-6 highly recommends MC/DC at ASIL D and recommends it at lower ASILs; this program adopts MC/DC as the target for ASIL-B safety-critical code to provide additional margin. MC/DC demands that every condition in a Boolean expression has been shown to independently affect the outcome.

Example: For the expression if (obstacle_detected && speed > 0 && !emergency_override):

  • MC/DC requires 4 test cases minimum (for 3 conditions)
  • Each condition must flip the outcome while others are held constant
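
A minimal sketch of one such MC/DC test set is shown below; the `should_brake` function is a stand-in for the real safety check, used only to make the four test vectors concrete.

python
# Sketch: a minimal MC/DC test set for the 3-condition example decision above.
def should_brake(obstacle_detected: bool, speed: float, emergency_override: bool) -> bool:
    return obstacle_detected and speed > 0 and not emergency_override

def test_mcdc_minimal_set():
    # Baseline: all conditions permit braking -> decision True
    assert should_brake(True, 5.0, False) is True
    # Flip only obstacle_detected -> decision changes
    assert should_brake(False, 5.0, False) is False
    # Flip only (speed > 0) -> decision changes
    assert should_brake(True, 0.0, False) is False
    # Flip only emergency_override -> decision changes
    assert should_brake(True, 5.0, True) is False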

Safety-critical paths in reference airside AV stack:

| Package | Safety Criticality | Coverage Target | MC/DC Required |
| --- | --- | --- | --- |
| airside_safety (e-stop, watchdog) | ASIL-B | 100% statement + MC/DC | Yes |
| airside_perception (obstacle detection) | ASIL-B | 100% statement + MC/DC | Yes |
| airside_nav (speed limiting, geofence) | ASIL-B | 100% statement, branch | Yes (safety checks only) |
| airside_nav (Frenet planner core) | ASIL-A | >90% statement, >80% branch | No |
| airside_localization (GTSAM) | ASIL-A | >90% statement | No |
| airside_control (Stanley, low-level) | ASIL-B | 100% statement + MC/DC | Yes (actuator commands) |
| Other packages | QM | >80% statement | No |

3.4 Perception Coverage

Perception testing requires coverage across all object types, distances, environmental conditions, and sensor configurations. The coverage matrix:

Object type x distance coverage:

| Object Type | 0-10 m | 10-30 m | 30-50 m | 50-100 m | 100-200 m |
| --- | --- | --- | --- | --- | --- |
| Ground crew (standing) | Required | Required | Required | Required | Desired |
| Ground crew (crouching) | Required | Required | Required | Desired | N/A |
| Narrow-body aircraft | Required | Required | Required | Required | Required |
| Wide-body aircraft | Required | Required | Required | Required | Required |
| Baggage tractor | Required | Required | Required | Required | Desired |
| Belt loader | Required | Required | Required | Required | Desired |
| Container loader | Required | Required | Required | Desired | N/A |
| Fuel truck | Required | Required | Required | Required | Desired |
| FOD (>10 cm) | Required | Required | Desired | N/A | N/A |
| Emergency vehicle | Required | Required | Required | Required | Required |
| Dolly train (3-5 dollies) | Required | Required | Required | Required | Desired |

Lighting x weather perception matrix:

| Condition | Clear | Light Rain | Heavy Rain | Fog (<200m vis) | Snow | De-icing Spray |
| --- | --- | --- | --- | --- | --- | --- |
| Day (>10k lux) | Baseline | Required | Required | Required | Required | Required |
| Dusk (100-10k lux) | Required | Required | Required | Required | Desired | Desired |
| Night (<100 lux) | Required | Required | Required | Required | Desired | Desired |
| Night + Apron Lights | Required | Required | Desired | Desired | Desired | Desired |

Perception performance metrics per cell:

| Metric | Abbreviation | Target (Personnel) | Target (Aircraft) | Target (GSE) | Target (FOD) |
| --- | --- | --- | --- | --- | --- |
| Average Precision | AP | >90% | >95% | >90% | >70% |
| Recall at 0m-30m | R@30 | >99% | >99% | >95% | >85% |
| False positive rate | FPR | <1% | <0.1% | <2% | <5% |
| Localization error (3D) | LE | <0.3 m | <0.5 m | <0.3 m | <0.5 m |
| Detection latency | DL | <100 ms | <100 ms | <100 ms | <200 ms |
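
Filling the object-type x distance matrix above from evaluation logs can be automated. A minimal sketch follows; the log record fields ('object_type', 'range_m', 'detected') are assumptions about the evaluation pipeline, not an existing schema.

python
# Sketch: compute per-cell recall for the object-type x distance coverage matrix.
from collections import defaultdict

DISTANCE_BINS = [(0, 10), (10, 30), (30, 50), (50, 100), (100, 200)]

def per_cell_recall(ground_truth_records: list[dict]) -> dict:
    """Recall per (object_type, distance_bin) cell from matched ground-truth records."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [detected, total]
    for rec in ground_truth_records:
        for lo, hi in DISTANCE_BINS:
            if lo <= rec['range_m'] < hi:
                cell = (rec['object_type'], f"{lo}-{hi} m")
                counts[cell][1] += 1
                if rec['detected']:
                    counts[cell][0] += 1
                break
    return {cell: det / total for cell, (det, total) in counts.items() if total > 0}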

3.5 N-Sigma Statistical Confidence Arguments

For safety-critical metrics, we need not just point estimates but statistical confidence bounds. The N-sigma framework quantifies how confident we are that the true performance meets the requirement.

Binomial confidence interval for detection rate:

Given N test instances and k successful detections:

  • Point estimate: p_hat = k/N
  • Lower 95% confidence bound (Clopper-Pearson exact): beta_inv(alpha/2, k, N-k+1)

Example: 985 detections out of 1000 test cases:

  • Point estimate: 98.5%
  • 95% lower bound: 97.4%
  • 99% lower bound: 97.0%

If the requirement is 95% detection rate, we can claim with 99% confidence that the true rate exceeds 97%, which exceeds the requirement with margin.

Required sample sizes for detection rate claims:

| True Detection Rate | Target Claim | Confidence | Required N (no failures allowed) |
| --- | --- | --- | --- |
| 99% | 95% | 95% | 59 |
| 99% | 99% | 95% | 299 |
| 99.9% | 99% | 95% | 299 |
| 99.9% | 99.9% | 95% | 2,995 |
| 99.9% | 99.9% | 99% | 4,603 |

Formula (zero-failure test):

N = ceil(ln(1 - C) / ln(R))

where:
  C = confidence level (e.g., 0.95)
  R = reliability claim (e.g., 0.99)
  N = required number of tests with zero failures
python
import math
from scipy import stats

def required_samples_zero_failure(reliability: float, confidence: float) -> int:
    """
    Compute minimum test count for reliability claim with zero failures.
    
    Based on: N = ceil(ln(1 - C) / ln(R))
    
    Example: required_samples_zero_failure(0.99, 0.95) = 299
    """
    if reliability >= 1.0 or reliability <= 0.0:
        raise ValueError("Reliability must be in (0, 1)")
    if confidence >= 1.0 or confidence <= 0.0:
        raise ValueError("Confidence must be in (0, 1)")
    return math.ceil(math.log(1 - confidence) / math.log(reliability))


def detection_rate_confidence_interval(k: int, n: int, confidence: float = 0.95) -> tuple:
    """
    Clopper-Pearson exact confidence interval for detection rate.
    
    k: number of successful detections
    n: total test instances
    confidence: confidence level
    
    Returns (lower_bound, upper_bound)
    """
    alpha = 1 - confidence
    if k == 0:
        lower = 0.0
    else:
        lower = stats.beta.ppf(alpha / 2, k, n - k + 1)
    if k == n:
        upper = 1.0
    else:
        upper = stats.beta.ppf(1 - alpha / 2, k + 1, n - k)
    return (lower, upper)


# Example usage:
# For personnel detection: 985 detections out of 1000 tests
# lower, upper = detection_rate_confidence_interval(985, 1000, 0.95)
# print(f"95% CI: [{lower:.4f}, {upper:.4f}]")
# Output: 95% CI: [0.9741, 0.9924]

4. Corner Case and Adversarial Testing

4.1 Search-Based Testing

Search-based testing (SBT) treats scenario generation as an optimization problem: find the scenario parameters that maximize a criticality metric (Section 2.6) or trigger a specific failure mode.

Covariance Matrix Adaptation Evolution Strategy (CMA-ES):

CMA-ES is a derivative-free optimization algorithm particularly well-suited for scenario search because:

  • It works with continuous parameter spaces (speed, position, timing)
  • It adapts its search distribution based on successful mutations
  • It does not require gradient information (the simulator is a black box)
  • It handles multimodal landscapes (multiple distinct failure modes)

Algorithm outline for airside scenario search:

python
import cma
import numpy as np

class AdversarialScenarioSearch:
    """Use CMA-ES to find failure-inducing airside scenarios."""
    
    def __init__(self, simulator, parameter_bounds: dict):
        """
        simulator: callable(scenario_params) -> scenario_result
        parameter_bounds: {'ego_speed': (5, 25), 'crew_x': (-20, 20), ...}
        """
        self.simulator = simulator
        self.bounds = parameter_bounds
        self.param_names = list(parameter_bounds.keys())
        self.lower = np.array([parameter_bounds[p][0] for p in self.param_names])
        self.upper = np.array([parameter_bounds[p][1] for p in self.param_names])
    
    def _normalize(self, x):
        """Map from [lower, upper] to [0, 1]."""
        return (x - self.lower) / (self.upper - self.lower)
    
    def _denormalize(self, x_norm):
        """Map from [0, 1] to [lower, upper]."""
        return self.lower + x_norm * (self.upper - self.lower)
    
    def _objective(self, x_norm):
        """
        Objective to MAXIMIZE (CMA-ES minimizes, so negate).
        Returns negative criticality (higher criticality = more dangerous).
        """
        x = self._denormalize(np.clip(x_norm, 0, 1))
        params = dict(zip(self.param_names, x))
        result = self.simulator(params)
        criticality = scenario_criticality(result)
        # Return negative because CMA-ES minimizes
        return -criticality
    
    def search(self, n_generations: int = 100, population_size: int = 20,
               sigma0: float = 0.3) -> list[dict]:
        """
        Run CMA-ES search for critical scenarios.
        Returns list of discovered critical scenarios sorted by criticality.
        """
        dim = len(self.param_names)
        x0 = np.full(dim, 0.5)  # start at center of parameter space
        
        es = cma.CMAEvolutionStrategy(x0, sigma0, {
            'bounds': [0, 1],
            'popsize': population_size,
            'maxiter': n_generations,
            'seed': 42,
        })
        
        critical_scenarios = []
        
        while not es.stop():
            solutions = es.ask()
            fitnesses = [self._objective(s) for s in solutions]
            es.tell(solutions, fitnesses)
            
            # Record scenarios with criticality > threshold
            for sol, fit in zip(solutions, fitnesses):
                criticality = -fit
                if criticality > 0.7:  # threshold for "critical"
                    params = dict(zip(self.param_names,
                                      self._denormalize(np.clip(sol, 0, 1))))
                    critical_scenarios.append({
                        'params': params,
                        'criticality': criticality,
                    })
        
        # Sort by criticality (most dangerous first)
        critical_scenarios.sort(key=lambda x: x['criticality'], reverse=True)
        return critical_scenarios

Bayesian Optimization alternative: For expensive simulations (e.g., full-fidelity HIL tests that take minutes per run), Bayesian Optimization with Gaussian Process surrogates is more sample-efficient than CMA-ES. Libraries: BoTorch, GPyOpt, Ax.
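
A minimal sketch of such a surrogate-based search is shown below. It uses a scikit-learn Gaussian process with an expected-improvement acquisition over the normalized parameter space; this is a simplified stand-in for BoTorch/Ax, not their API, and the budget values are illustrative.

python
# Sketch: Bayesian-optimization search for critical scenarios with a GP surrogate.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def bayes_opt_search(evaluate, dim, n_init=10, n_iter=40, n_candidates=2000, seed=0):
    """evaluate(x) -> criticality in [0, 1]; x is a point in the unit cube [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n_init, dim))
    y = np.array([evaluate(x) for x in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

    for _ in range(n_iter):
        gp.fit(X, y)
        candidates = rng.uniform(size=(n_candidates, dim))
        mu, sigma = gp.predict(candidates, return_std=True)
        best = y.max()
        z = (mu - best) / np.maximum(sigma, 1e-9)
        ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)  # expected improvement
        x_next = candidates[np.argmax(ei)]
        X = np.vstack([X, x_next])
        y = np.append(y, evaluate(x_next))

    order = np.argsort(-y)
    return X[order], y[order]  # evaluated scenarios sorted by criticality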

4.2 Adversarial Object Placement

Test the perception system's robustness to adversarial configurations of objects that exploit known sensor weaknesses.

LiDAR adversarial scenarios:

| Adversarial Configuration | Expected Failure Mode | Test Method |
| --- | --- | --- |
| Personnel crouching behind low GSE (belt loader ramp) | Partial occlusion below LiDAR scan plane | SiL + Physical (mannequin) |
| Highly reflective aircraft fuselage creating ghost points | False positive obstacles from specular reflection | SiL + Physical (calibration target) |
| Transparent objects (glass partition, plastic barrier) | Missed detection (LiDAR passes through) | Physical (actual objects) |
| Personnel wearing high-visibility vests at night | Retroreflective saturation causing range errors | SiL + Physical (mannequin with vest) |
| FOD at edge of LiDAR beam (minimum range/max angle) | Missed detection in coverage gap | SiL + Physical (placed objects) |
| Personnel directly behind baggage dolly train | Full occlusion by dolly train (up to 15 m long) | SiL + Physical |
| Jet exhaust distorting LiDAR beam path | Refraction causing range errors or missed returns | SiL (physics-based) |
| Accumulated water/ice on LiDAR lens | Degraded point cloud density | Physical (controlled contamination) |

Camera adversarial scenarios (for camera fallback mode):

| Adversarial Configuration | Expected Failure Mode | Test Method |
| --- | --- | --- |
| Bright apron lights causing camera flare | Washed-out regions hiding personnel | SiL + Physical (night testing) |
| Shadows from aircraft wings creating false edges | False positive obstacles | SiL |
| Wet surface reflections duplicating objects | Double-counting actors | SiL + Physical |
| High-contrast stripes on ground (safety markings) | Depth estimation errors | SiL + Physical |
| De-icing fluid on lens | Blurred or distorted image | Physical (spray test) |

4.3 LLM-Based Scenario Generation

Large language models can generate edge case scenarios from natural language safety requirements, exploiting their broad knowledge of aviation operations and failure modes.

Approach:

  1. Provide the LLM with the safety requirements document, the ODD definition, and the hazard catalog
  2. Prompt it to generate scenarios that could violate each safety requirement
  3. Parse generated scenarios into the OpenSCENARIO DSL format
  4. Filter and validate for physical plausibility
  5. Execute in simulation

Prompt template:

You are a safety engineer testing an autonomous baggage tractor operating 
on an airport apron. The vehicle uses 4-8 RoboSense LiDARs for perception, 
operates at 5-25 km/h, and must maintain 3m clearance from aircraft and 2m 
from personnel.

Given this safety requirement:
"{requirement}"

Generate 10 concrete scenarios that could cause the vehicle to violate this 
requirement. For each scenario, specify:
1. Initial positions and velocities of all actors
2. Environmental conditions (weather, lighting, surface)
3. The specific sequence of events that creates the hazard
4. Why this scenario is challenging for the perception/planning system

Focus on scenarios that exploit:
- Sensor limitations (occlusion, reflections, range limits)
- Unusual actor behaviors (unexpected movements, unusual positions)
- Environmental edge cases (jet blast, de-icing, night + rain)
- Timing coincidences (multiple events happening simultaneously)

Validation pipeline:

LLM-generated scenario (natural language)
    ↓ Parse to structured format
Logical scenario parameters
    ↓ Physical plausibility check
    ↓ (reject impossible: e.g., crouching personnel at 100 km/h)
Valid logical scenario
    ↓ Instantiate to concrete
Concrete scenario (executable)
    ↓ Run in SiL
Test result
    ↓ If failure: add to adversarial scenario database
    ↓ If pass: record as coverage evidence
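
A minimal sketch of the physical plausibility check in this pipeline is shown below; the bound values, parameter names, and cross-checks are illustrative assumptions rather than project limits.

python
# Sketch: physical plausibility filter for LLM-generated logical scenarios.
PLAUSIBILITY_BOUNDS = {
    'ego_speed_kmh': (0.0, 30.0),          # vehicle cannot exceed its speed limiter
    'crew_walking_speed_ms': (0.0, 3.0),   # walking/jogging ground crew
    'aircraft_taxi_speed_ms': (0.0, 15.0),
    'visibility_m': (10.0, 10000.0),
    'temperature_c': (-40.0, 55.0),
    'num_ground_crew': (0, 15),
}

def is_physically_plausible(scenario: dict) -> tuple[bool, list[str]]:
    """Check an LLM-generated logical scenario against simple physical bounds."""
    violations = []
    for param, (lo, hi) in PLAUSIBILITY_BOUNDS.items():
        if param in scenario and not (lo <= scenario[param] <= hi):
            violations.append(f"{param}={scenario[param]} outside [{lo}, {hi}]")
    # Cross-parameter consistency (mirrors the covering-array constraints)
    if scenario.get('surface') == 'icy' and scenario.get('temperature_c', 0) > 0:
        violations.append("icy surface with above-freezing temperature")
    return len(violations) == 0, violations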

Empirical results from literature: Tian et al. (2024) found that GPT-4-generated driving scenarios discovered 15-30% more failure modes than random scenario generation with the same computational budget. The LLM-generated scenarios were particularly effective at finding multi-actor interaction failures that random sampling rarely produces.

4.4 Metamorphic Testing

Metamorphic testing defines relationships between scenarios that should hold if the system is correct. If the relationship is violated, a bug is detected without needing an oracle for the absolute correctness of each test.

Metamorphic relations for airside AVs:

| ID | Relation | Description | Implementation |
| --- | --- | --- | --- |
| MR1 | Speed monotonicity | If safe at speed v, should be safe at speed v' < v (all else equal) | Run scenario at decreasing speeds; flag if failure appears at lower speed |
| MR2 | Distance monotonicity | If safe with obstacle at distance d, should be safe at d' > d | Move obstacle farther; flag if failure appears at greater distance |
| MR3 | Visibility monotonicity | If safe in fog (200m visibility), should be safe in clear conditions | Improve visibility; flag if failure appears in better conditions |
| MR4 | Sensor addition | Adding a sensor should not decrease detection rate | Run with N and N+1 LiDARs; flag if detection drops |
| MR5 | Object size | If detecting a container loader, should detect a larger fuel truck | Replace small object with larger; flag if detection drops |
| MR6 | Symmetry | Performance should be similar for left vs. right approach | Mirror scenario; flag if significant performance difference |
| MR7 | Temporal invariance | Replaying the same sensor data should produce the same decision | Replay twice; flag if decisions differ (indicates non-determinism) |
| MR8 | Additive safety | Adding a safety constraint should not make the system less safe | Enable additional RSS check; flag if new failures appear |

Metamorphic test executor:

python
class MetamorphicTestRunner:
    """Run metamorphic tests by transforming scenarios and checking relations."""
    
    def __init__(self, simulator):
        self.simulator = simulator
        self.violations = []
    
    def test_speed_monotonicity(self, base_scenario: dict, 
                                 speed_reductions: list[float]) -> list[dict]:
        """MR1: Reducing speed should not cause new failures."""
        base_result = self.simulator(base_scenario)
        violations = []
        
        for delta_v in speed_reductions:
            modified = base_scenario.copy()
            modified['ego_speed'] = base_scenario['ego_speed'] - delta_v
            if modified['ego_speed'] <= 0:
                continue
            
            modified_result = self.simulator(modified)
            
            if (base_result.get('passed', True) and 
                not modified_result.get('passed', True)):
                violations.append({
                    'relation': 'MR1_speed_monotonicity',
                    'base_speed': base_scenario['ego_speed'],
                    'modified_speed': modified['ego_speed'],
                    'base_passed': True,
                    'modified_passed': False,
                    'severity': 'high',
                    'description': (f"System passed at {base_scenario['ego_speed']:.1f} "
                                    f"km/h but FAILED at lower speed "
                                    f"{modified['ego_speed']:.1f} km/h"),
                })
        
        self.violations.extend(violations)
        return violations
    
    def test_symmetry(self, base_scenario: dict, 
                       mirror_axis: str = 'lateral') -> list[dict]:
        """MR6: Mirrored scenario should produce similar results."""
        base_result = self.simulator(base_scenario)
        
        mirrored = base_scenario.copy()
        if mirror_axis == 'lateral':
            # Flip Y coordinates of all actors
            for key in mirrored:
                if key.endswith('_y'):
                    mirrored[key] = -mirrored[key]
        
        mirror_result = self.simulator(mirrored)
        
        violations = []
        # Check if pass/fail status differs
        if base_result.get('passed') != mirror_result.get('passed'):
            violations.append({
                'relation': 'MR6_symmetry',
                'base_passed': base_result.get('passed'),
                'mirror_passed': mirror_result.get('passed'),
                'severity': 'medium',
                'description': f"Asymmetric behavior: base={'pass' if base_result.get('passed') else 'fail'}, "
                               f"mirror={'pass' if mirror_result.get('passed') else 'fail'}",
            })
        
        # Check if min_distance differs significantly
        base_dist = base_result.get('min_distance', 0)
        mirror_dist = mirror_result.get('min_distance', 0)
        if abs(base_dist - mirror_dist) > 0.5:  # 0.5m threshold
            violations.append({
                'relation': 'MR6_symmetry_distance',
                'base_min_distance': base_dist,
                'mirror_min_distance': mirror_dist,
                'severity': 'low',
                'description': f"Distance asymmetry: {abs(base_dist - mirror_dist):.2f}m",
            })
        
        self.violations.extend(violations)
        return violations

4.5 Fuzzing Perception Inputs

Fuzzing injects random perturbations into sensor data to test perception robustness. Unlike adversarial attacks (which are optimized to fool the model), fuzzing tests resilience to random noise and corruption.

Point cloud fuzzing strategies:

| Strategy | Description | Simulates |
| --- | --- | --- |
| Random point dropout | Remove N% of points randomly | Sensor degradation, rain absorption |
| Structured dropout | Remove all points in a cone/sector | Single beam failure, LiDAR sector blocked |
| Gaussian noise injection | Add N(0, sigma) to XYZ coordinates | Vibration, temperature-induced error |
| Ghost point injection | Add random clusters of points | Multipath reflection, lens contamination |
| Intensity perturbation | Randomize intensity channel | Surface material variation |
| Temporal jitter | Shift timestamps by random offsets | Clock synchronization errors |
| Point cloud duplication | Duplicate a section of the scan offset by dx,dy,dz | Mechanical vibration between scans |
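
A minimal sketch of three of these strategies is shown below, assuming point clouds are represented as (N, 4) numpy arrays of x, y, z, intensity; the perturbation magnitudes are illustrative.

python
# Sketch: point-cloud fuzzing operators for (N, 4) arrays of x, y, z, intensity.
import numpy as np

def random_dropout(points: np.ndarray, drop_fraction: float, rng) -> np.ndarray:
    """Randomly remove a fraction of points (rain absorption, degradation)."""
    keep = rng.random(len(points)) >= drop_fraction
    return points[keep]

def sector_dropout(points: np.ndarray, azimuth_deg: float, width_deg: float) -> np.ndarray:
    """Remove all points in an azimuth sector (blocked or failed sector)."""
    azimuths = np.degrees(np.arctan2(points[:, 1], points[:, 0]))
    in_sector = np.abs((azimuths - azimuth_deg + 180) % 360 - 180) < width_deg / 2
    return points[~in_sector]

def gaussian_noise(points: np.ndarray, sigma_m: float, rng) -> np.ndarray:
    """Add zero-mean Gaussian noise to XYZ (vibration, thermal error)."""
    noisy = points.copy()
    noisy[:, :3] += rng.normal(0.0, sigma_m, size=(len(points), 3))
    return noisy

# Example: rng = np.random.default_rng(0)
# fuzzed = gaussian_noise(sector_dropout(random_dropout(cloud, 0.2, rng), 45.0, 10.0), 0.02, rng)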

Expected behavior under fuzzing:

  • Safe degradation: The system should detect input corruption (via OOD detection, see ../runtime-assurance/simplex-safety-architecture.md Section 3) and enter a degraded operating mode (reduce speed or stop) rather than making dangerous decisions based on corrupted data.
  • No crashes: The software must not segfault, hang, or produce undefined behavior under any fuzzed input.
  • Bounded output: Planning outputs must remain within physically valid bounds (speed within limits, steering within mechanical range) regardless of input corruption.

5. Simulation-Based Verification and Validation

5.1 Software-in-the-Loop (SIL)

SIL testing runs the complete AV software stack (perception, localization, planning, control) against simulated sensor data in a simulated environment. No real hardware is involved.

Architecture:

┌──────────────────────────────────────────────────────┐
│                    SIL Test Harness                    │
│                                                        │
│  ┌─────────────┐    ┌─────────────────────────────┐   │
│  │  Simulator   │    │    AV Software Stack         │   │
│  │  (CARLA /    │───>│    (ROS Noetic nodes)       │   │
│  │   Isaac Sim) │    │                             │   │
│  │             │<───│  Perception → Planning →     │   │
│  │  Simulated   │    │  Control → /cmd_twist       │   │
│  │  Sensors:    │    │                             │   │
│  │  - LiDAR x4-8│    │  Safety Monitor             │   │
│  │  - Camera x4 │    │  Simplex Arbitrator         │   │
│  │  - IMU       │    └─────────────────────────────┘   │
│  │  - GPS       │                                      │
│  │             │    ┌─────────────────────────────┐   │
│  │  Simulated   │    │    Test Oracle               │   │
│  │  Actors:     │    │    - Collision detection     │   │
│  │  - Aircraft  │    │    - Clearance violation     │   │
│  │  - GSE       │    │    - TTC computation         │   │
│  │  - Personnel │    │    - RSS check               │   │
│  │  - FOD       │    │    - Geofence check          │   │
│  └─────────────┘    └─────────────────────────────┘   │
│                                                        │
│  Test Controller: scenario execution, metrics, logging │
└──────────────────────────────────────────────────────┘

SIL configuration for reference airside AV stack:

| Component | SIL Implementation | Notes |
| --- | --- | --- |
| LiDAR sensors | CARLA ray-cast LiDAR (4-8 sensors, matching RoboSense HELIOS/RSBP specs) | Configure channels, range, rotation rate to match real sensors |
| Camera sensors | CARLA RGB cameras (if using camera fallback mode) | Match resolution, FoV, mounting position |
| IMU | Simulated 500 Hz IMU with configurable noise model | Match real IMU noise characteristics |
| GPS | Simulated RTK-GPS with configurable accuracy and dropout | Include multipath effects near buildings |
| ROS interface | CARLA ROS bridge (publishes to standard ROS topics) | Topic remapping to match reference airside AV stack namespace |
| Environment | Custom airport map in CARLA (see simulators-for-airside.md Section 1) | Import from AMDB + custom 3D assets |
| Actors | Scripted via OpenSCENARIO or CARLA Python API | Custom blueprints for GSE and aircraft |

SIL test execution pipeline:

bash
# 1. Launch simulator with airport environment
carla_server --map=AirsideTestAirport --quality-level=Epic &

# 2. Launch ROS bridge
roslaunch carla_ros_bridge carla_ros_bridge.launch \
    host:=localhost port:=2000 &

# 3. Launch AV stack (production or shadow, depending on test)
roslaunch airside_bringup sil_test.launch \
    stack:=production \
    record_bag:=true &

# 4. Execute scenario
python3 run_scenario.py \
    --scenario=scenarios/personnel_crossing.yaml \
    --num_runs=100 \
    --output_dir=results/personnel_crossing/

# 5. Analyze results
python3 analyze_results.py \
    --results_dir=results/personnel_crossing/ \
    --metrics=collision,ttc,min_distance,rss_violation \
    --report=reports/personnel_crossing.html

SIL test volume targets:

| Test Category | Scenarios | Runs per Scenario | Total Runs | Estimated Time |
| --- | --- | --- | --- | --- |
| Nominal operations (115 functional) | 115 | 100 | 11,500 | ~48 hours |
| Parameterized sweep (575 logical) | 575 | 50 | 28,750 | ~120 hours |
| Adversarial search (per campaign) | 1,000 | 1 | 1,000 | ~4 hours |
| Regression suite (golden scenarios) | 100 | 10 | 1,000 | ~4 hours |
| Monte Carlo (statistical confidence) | 10,000 | 1 | 10,000 | ~42 hours |
| Total | | | ~52,250 | ~218 hours |

At a typical SIL throughput of 5-10x real-time (scenario includes setup, execution, teardown), a single workstation with a capable GPU can process approximately 250 scenarios per hour, or about 6,000 per day. The full test suite requires approximately 9 computation-days per release, which is parallelizable across multiple machines.

5.2 Hardware-in-the-Loop (HIL)

HIL testing uses the real Orin compute hardware running the AV software stack, but replaces real sensors with simulated sensor data injected at the hardware interface level. This validates:

  • Real-time performance on actual compute hardware
  • Timing behavior under load
  • GPU memory management
  • Sensor driver compatibility
  • Thermal behavior during sustained operation

HIL architecture:

┌───────────────────────────────────────────────────────────┐
│                     HIL Test Bench                         │
│                                                             │
│  ┌──────────────┐     ┌────────────────────────────────┐  │
│  │  Sensor       │     │   Real NVIDIA Orin              │  │
│  │  Simulation   │────>│   (Jetson AGX Orin 64GB)       │  │
│  │  Workstation  │     │                                │  │
│  │               │     │   Running:                     │  │
│  │  - LiDAR      │UDP  │   - ROS Noetic                 │  │
│  │    point cloud │────>│   - airside_perception         │  │
│  │    generator   │     │   - airside_nav                │  │
│  │               │     │   - airside_localization        │  │
│  │  - Camera     │MIPI │   - airside_safety             │  │
│  │    frame       │CSI  │   - airside_control            │  │
│  │    injector    │────>│                                │  │
│  │               │     │   Output: /av_nav/cmd_twist    │  │
│  │  - IMU/GPS    │UART │     (captured, not sent to     │  │
│  │    emulator   │────>│      actuators)                │  │
│  └──────────────┘     └────────────────────────────────┘  │
│                                                             │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  HIL Controller                                       │  │
│  │  - Scenario orchestration                             │  │
│  │  - Timing measurement (sensor-to-command latency)     │  │
│  │  - GPU/CPU utilization monitoring                     │  │
│  │  - Thermal monitoring                                 │  │
│  │  - Result validation                                  │  │
│  └──────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────┘

HIL-specific test focus areas:

| Test Focus | What It Validates | Pass Criteria |
|---|---|---|
| End-to-end latency | Sensor input to control output time | <100 ms (10 Hz control loop) |
| Perception throughput | Frames processed per second | >10 FPS for all LiDAR processing |
| GPU memory | Peak VRAM usage under worst case | <80% of 64 GB (leave headroom) |
| CPU utilization | All cores under max load | <85% sustained |
| Thermal throttling | Performance under thermal stress | No throttling at 40 °C ambient (max expected apron temperature) |
| Watchdog timing | Safety watchdog triggers on timeout | Triggers within 200 ms of stack freeze |
| Sensor dropout handling | Response to sudden sensor loss | Detects within 100 ms, enters degraded mode |
| Multi-LiDAR sync | Point cloud alignment across 4-8 sensors | <5 ms inter-sensor sync error |

HIL test volume: 1,000 hours minimum, focusing on timing-critical scenarios and degraded sensor modes. This is not about statistical scenario coverage (SIL handles that) but about hardware-specific validation.

5.3 Vehicle-in-the-Loop (VIL)

VIL testing uses the real vehicle on a test track with a combination of real physical obstacles and injected virtual scenarios. The vehicle drives physically, but additional virtual actors (aircraft, GSE, personnel) are overlaid onto the real sensor data.

VIL approaches:

| Approach | Description | Fidelity | Cost |
|---|---|---|---|
| Physical obstacles only | Mannequins, dummy GSE, foam aircraft mockup | Highest for physical interaction | $50-100K test track setup |
| Augmented reality injection | Real driving + virtual actors injected into point cloud before perception | High (real vehicle dynamics, simulated actors) | $20-50K software development |
| Scenario injection via V2X | Real driving + virtual actors communicated via simulated V2X messages | Medium (tests fusion, not raw perception) | $10-20K |

VIL test track requirements (see Section 11 for full details):

The test track must include physical representations of:

  • Aircraft nose mockup (foam/inflatable, correct dimensions)
  • GSE obstacles at various heights
  • Pedestrian mannequins (articulated, motorized for crossing scenarios)
  • Surface markings (stand centerline, safety zone boundaries)
  • Apron lighting (adjustable intensity for day/night testing)

5.4 Sim-to-Real Gap Quantification

The value of simulation-based testing depends critically on how well simulation matches reality. The sim-to-real gap must be measured and bounded.

Domain gap metrics:

| Metric | What It Measures | How to Compute |
|---|---|---|
| Point cloud density ratio | LiDAR return density sim vs. real | Points per m^3 at matched distances |
| Intensity distribution KL-divergence | LiDAR intensity realism | KL(p_real_intensity ‖ p_sim_intensity) |
| Detection AP gap | Perception accuracy difference | AP_real - AP_sim on matched scenarios |
| Planning trajectory deviation | Planning behavior difference | L2 distance between sim and real planned paths for same scenario |
| FID (Frechet Inception Distance) | Overall visual similarity (camera) | FID between real and simulated image distributions |
| Chamfer distance | 3D point cloud shape similarity | Average nearest-neighbor distance between real and simulated clouds |
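
As a concrete illustration of two of these metrics, a minimal sketch assuming matched real and simulated data are already available as NumPy arrays; the function names are illustrative, not part of the reference stack:

```python
import numpy as np
from scipy.spatial import cKDTree

def intensity_kl_divergence(real_intensity: np.ndarray,
                            sim_intensity: np.ndarray,
                            bins: int = 64) -> float:
    """KL(real || sim) between binned LiDAR intensity distributions."""
    lo = float(min(real_intensity.min(), sim_intensity.min()))
    hi = float(max(real_intensity.max(), sim_intensity.max()))
    p, _ = np.histogram(real_intensity, bins=bins, range=(lo, hi))
    q, _ = np.histogram(sim_intensity, bins=bins, range=(lo, hi))
    eps = 1e-9
    p = p / p.sum() + eps   # normalize to probabilities, avoid log(0)
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))

def chamfer_distance(real_pts: np.ndarray, sim_pts: np.ndarray) -> float:
    """Symmetric Chamfer distance between two (N, 3) point clouds, in metres."""
    d_real_to_sim, _ = cKDTree(sim_pts).query(real_pts)
    d_sim_to_real, _ = cKDTree(real_pts).query(sim_pts)
    return float(d_real_to_sim.mean() + d_sim_to_real.mean())
```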

Sim-to-real gap reduction strategies:

  1. Sensor model calibration: Measure real sensor characteristics (noise, dropout, beam divergence, intensity response) and configure the simulator to match. For RoboSense HELIOS: measure actual beam pattern, range-dependent noise profile, return rate on different materials.

  2. Domain randomization: During SIL testing, randomly vary simulation parameters (lighting, surface reflectance, point cloud noise) to train/test across a distribution that brackets reality.

  3. Real-data replay augmentation: Record real sensor data, replay through the stack, then perturb (add actors, change weather) to create semi-real scenarios that preserve the real sensor characteristics while varying the scenario content.

Sim-to-real transfer performance targets:

| Metric | Acceptable Gap | Action if Exceeded |
|---|---|---|
| Detection AP (personnel) | <5% | Recalibrate sensor model, add domain randomization |
| Detection AP (aircraft) | <3% | Recalibrate reflectance model |
| Planning trajectory L2 | <0.3 m | Check vehicle dynamics model, friction parameters |
| E-stop braking distance | <10% | Physical braking tests override sim results |
| Localization drift | <0.1 m over 100 m | Check GPS/IMU noise model calibration |

5.5 Simulator Selection for Airside

Based on the evaluation in simulators-for-airside.md:

| Simulator | Best For | Airside Readiness | Integration Effort |
|---|---|---|---|
| CARLA 0.9.16 (UE5) | SIL scenario testing, large-scale Monte Carlo | Medium (needs custom airport map, GSE models) | 2-4 weeks with ROS bridge |
| NVIDIA Isaac Sim | HIL sensor simulation, digital twin | Medium (Omniverse-based, good RoboSense model) | 3-6 weeks |
| Gazebo | Unit/integration tests, basic scenario replay | Low (limited visuals, basic physics) | 1-2 weeks (familiar to ROS users) |
| NVIDIA DRIVE Sim | Production-grade SIL/HIL | High (enterprise, requires partnership) | 2-3 months |

Recommended multi-simulator strategy:

  • Gazebo for CI/CD unit and integration tests (fast, lightweight, runs in Docker)
  • CARLA for scenario-based SIL testing (best balance of fidelity and throughput)
  • Isaac Sim for digital twin construction and HIL sensor injection (best LiDAR models)
  • Physical test track for VIL and final validation (irreplaceable for certification)

6. Statistical Safety Arguments

6.1 The Zhao-Weng Theorem

The fundamental question of AV safety testing: "How many tests do we need to run to claim the system is safe?"

The Zhao-Weng formulation (adapted from reliability engineering) provides the answer for the zero-failure case:

N = -ln(1 - C) / (1 - R)

where:
  N = number of test runs required (all must pass)
  C = confidence level (probability that the claim is correct)
  R = reliability level (probability the system succeeds in any single run)

Derivation: If the true failure probability is p = 1 - R, the probability of observing zero failures in N independent tests is (1 - p)^N = R^N. To claim with confidence C that the failure probability is no worse than p, we require that a system this unreliable would pass all N tests with probability at most 1 - C: R^N <= 1 - C, hence N >= ln(1 - C) / ln(R). For small p, ln(R) = ln(1 - p) is approximately -p, giving N approximately equal to -ln(1 - C) / p.

Key numbers for airside AV certification:

| Reliability (R) | Confidence (C) | Required N | Interpretation |
|---|---|---|---|
| 99% (1 failure per 100 runs) | 95% | 299 | Minimum for prototype validation |
| 99.9% | 95% | 2,995 | Design target for nominal scenarios |
| 99.9% | 99% | 4,603 | Strong evidence for nominal operation |
| 99.99% | 95% | 29,956 | Target for safety-critical functions |
| 99.99% | 99% | 46,050 | High confidence for personnel detection |
| 99.999% | 95% | 299,572 | Target for catastrophic failure modes |
| 99.999% | 99% | 460,515 | Demonstration of aircraft collision avoidance |

```python
import math

def zhao_weng_sample_size(reliability: float, confidence: float) -> int:
    """
    Zhao-Weng theorem: required test runs for reliability demonstration.
    
    N = ceil(-ln(1 - C) / -ln(R))
      = ceil(-ln(1 - C) / (1 - R))  [approximation for R close to 1]
    
    Args:
        reliability: target reliability R (e.g., 0.999 for 99.9%)
        confidence: confidence level C (e.g., 0.95 for 95%)
    
    Returns:
        Minimum number of tests that must ALL pass.
    """
    # Exact formula
    N_exact = math.ceil(math.log(1 - confidence) / math.log(reliability))
    # Approximation for R close to 1 (shown for reference; agrees to within ~1):
    #   N_approx = math.ceil(-math.log(1 - confidence) / (1 - reliability))
    return N_exact

def print_sample_size_table():
    """Print sample size requirements for various reliability/confidence levels."""
    reliabilities = [0.99, 0.999, 0.9999, 0.99999]
    confidences = [0.90, 0.95, 0.99]
    
    print(f"{'Reliability':<15} ", end="")
    for c in confidences:
        print(f"{'C=' + str(c):<12}", end="")
    print()
    print("-" * 50)
    
    for r in reliabilities:
        print(f"R={r:<13} ", end="")
        for c in confidences:
            n = zhao_weng_sample_size(r, c)
            print(f"{n:<12,}", end="")
        print()

# Output:
# Reliability      C=0.9       C=0.95      C=0.99
# --------------------------------------------------
# R=0.99           230         299         459
# R=0.999          2,302       2,995       4,603
# R=0.9999         23,025      29,956      46,050
# R=0.99999        230,258     299,572     460,515
```

6.2 RSS Worst-Case Formal Arguments

Responsibility-Sensitive Safety (RSS) provides formal, mathematical safety guarantees that complement statistical testing. Where statistical arguments say "we tested enough and didn't see failures," RSS arguments say "under these assumptions, a collision is physically impossible."

RSS for airside (adapted from ../runtime-assurance/simplex-safety-architecture.md Section 2):

The RSS safe longitudinal distance for airside operations:

d_safe = v_ego * rho + v_ego^2 / (2 * a_max_brake) + v_other^2 / (2 * a_min_brake_other)

where:
  v_ego = ego vehicle speed
  v_other = other actor speed (towards ego)
  rho = response time (0.5 s for airside)
  a_max_brake = maximum braking deceleration of ego (5 m/s^2)
  a_min_brake_other = minimum assumed braking of other actor (0 m/s^2 for worst case)

For a baggage tractor at 15 km/h (4.17 m/s) approaching a stationary obstacle:

d_safe = 4.17 * 0.5 + 4.17^2 / (2 * 5) + 0
       = 2.085 + 1.74
       = 3.82 m

If the perception system detects the obstacle at >3.82 m and the planner respects the RSS constraint, a collision is formally impossible under the stated assumptions (response time, braking capability).
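
A small sketch of this calculation, using the parameter values stated above (the function name is illustrative, not part of the reference stack):

```python
def rss_safe_longitudinal_distance(v_ego: float,
                                   v_other: float = 0.0,
                                   rho: float = 0.5,
                                   a_max_brake: float = 5.0,
                                   a_min_brake_other: float = 0.0) -> float:
    """RSS safe longitudinal distance (m); speeds in m/s, decelerations in m/s^2.

    a_min_brake_other = 0 encodes the worst case of an actor that never brakes;
    a stationary obstacle is simply v_other = 0.
    """
    d_ego = v_ego * rho + v_ego ** 2 / (2.0 * a_max_brake)
    if a_min_brake_other > 0:
        d_other = v_other ** 2 / (2.0 * a_min_brake_other)
    elif v_other > 0:
        d_other = float("inf")  # approaching actor assumed never to brake
    else:
        d_other = 0.0
    return d_ego + d_other

# Worked example from the text: baggage tractor at 15 km/h vs. stationary obstacle
print(f"{rss_safe_longitudinal_distance(15 / 3.6):.2f} m")  # -> 3.82 m
```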

Formal safety argument structure:

  1. Assumption A1: Perception detects all obstacles at range >= D_detect
  2. Assumption A2: Vehicle can decelerate at >= a_max_brake on all surfaces
  3. Assumption A3: Response time is bounded by rho
  4. Claim C1: If A1, A2, A3 hold and planner follows RSS, then d_closest >= d_safe > 0
  5. Evidence: Physical braking tests validate A2. Perception testing validates A1. HIL timing tests validate A3.
  6. Residual risk: Assumptions may be violated (sensor failure, ice, software hang). Mitigated by Simplex architecture (safety controller as fallback).

6.3 Bayesian Safety Estimation

Bayesian methods combine prior knowledge (from simulation) with field data (from physical testing and operations) to produce a posterior safety estimate.

Prior from simulation:

Run N_sim simulated scenarios and observe k_sim failures. The point estimate of the failure rate is p_sim = k_sim / N_sim; its uncertainty is captured by a Beta distribution (starting from a uniform Beta(1, 1) prior):

Prior: Beta(k_sim + 1, N_sim - k_sim + 1)

Posterior from field data:

Observe N_field physical test runs with k_field failures. Conjugate updating gives:

Posterior: Beta(k_sim + k_field + 1, N_sim + N_field - k_sim - k_field + 1)

Discounting simulation evidence:

Simulation evidence is weaker than physical evidence because of the sim-to-real gap. Apply a discount factor gamma in (0, 1] to simulation counts:

Effective prior: Beta(gamma * k_sim + 1, gamma * (N_sim - k_sim) + 1)
Posterior: Beta(gamma * k_sim + k_field + 1, gamma * (N_sim - k_sim) + N_field - k_field + 1)

Typical discount factors:

  • gamma = 0.01 for low-fidelity simulation (Gazebo basic)
  • gamma = 0.05-0.10 for medium-fidelity (CARLA with calibrated sensors)
  • gamma = 0.10-0.30 for high-fidelity (digital twin with validated sensor models)
  • gamma = 1.0 for real-data replay (no discount, it is real data)

```python
from scipy import stats
import numpy as np

class BayesianSafetyEstimator:
    """Bayesian estimation of failure rate combining sim and field data."""
    
    def __init__(self, sim_runs: int, sim_failures: int, 
                 discount_factor: float = 0.1):
        """
        sim_runs: total simulation test runs
        sim_failures: number of failures in simulation
        discount_factor: weight of simulation evidence (0-1)
        """
        self.alpha_prior = discount_factor * sim_failures + 1
        self.beta_prior = discount_factor * (sim_runs - sim_failures) + 1
        self.field_runs = 0
        self.field_failures = 0
    
    def update_with_field_data(self, field_runs: int, field_failures: int):
        """Incorporate physical test/operational data."""
        self.field_runs += field_runs
        self.field_failures += field_failures
    
    @property
    def posterior_alpha(self):
        return self.alpha_prior + self.field_failures
    
    @property
    def posterior_beta(self):
        return self.beta_prior + self.field_runs - self.field_failures
    
    def failure_rate_estimate(self) -> dict:
        """Compute posterior failure rate statistics."""
        a = self.posterior_alpha
        b = self.posterior_beta
        dist = stats.beta(a, b)
        
        return {
            'mean': dist.mean(),
            'median': dist.median(),
            'mode': (a - 1) / (a + b - 2) if a > 1 and b > 1 else 0,
            'std': dist.std(),
            'ci_95_upper': dist.ppf(0.95),
            'ci_99_upper': dist.ppf(0.99),
            'p_below_target': dist.cdf(1e-4),  # P(failure rate < 10^-4)
        }
    
    def required_additional_field_tests(self, target_failure_rate: float,
                                         target_confidence: float) -> int:
        """
        How many more field tests (with zero failures) are needed to 
        demonstrate failure rate < target at given confidence.
        """
        for n in range(0, 1_000_000):
            a = self.posterior_alpha
            b = self.posterior_beta + n
            if stats.beta(a, b).cdf(target_failure_rate) >= target_confidence:
                return n
        return -1  # infeasible

# Example:
# 50,000 SIL runs, 5 failures, discount factor 0.1
# estimator = BayesianSafetyEstimator(50000, 5, discount_factor=0.1)
#
# After 2,000 field km with 0 failures (assuming 1 scenario per km):
# estimator.update_with_field_data(2000, 0)
# result = estimator.failure_rate_estimate()
# print(f"Mean failure rate: {result['mean']:.6f}")
# print(f"95% upper bound:  {result['ci_95_upper']:.6f}")
# print(f"P(rate < 10^-4):  {result['p_below_target']:.4f}")

6.4 Mileage Equivalence

A key question for regulators: "How many simulated miles are equivalent to one real mile?"

There is no universal answer. The equivalence depends on:

  • Simulation fidelity (sensor model accuracy, physics engine quality)
  • Scenario diversity (are the simulated miles interesting, or just straight-line driving?)
  • Validation status (has the simulator been validated against real data?)

Framework for mileage equivalence claims:

| Simulation Level | Equivalence Ratio | Justification |
|---|---|---|
| Low fidelity (basic physics, no sensor models) | 1000:1 | Only validates logic, not perception |
| Medium fidelity (calibrated sensors, realistic physics) | 100:1 to 50:1 | Validated sensor models, scenario-relevant |
| High fidelity (digital twin, validated sensor + environment) | 20:1 to 10:1 | Demonstrated <5% AP gap to real data |
| Real-data replay with augmentation | 5:1 to 2:1 | Real sensor data, varied scenarios |
| Physical closed-course testing | 1:1 | Real hardware, real environment |
| Physical on-airport operations | 1:1 | Highest fidelity, actual ODD |

Waymo's approach: Waymo has driven 20+ million miles on public roads and tens of billions of miles in simulation. They do not publish a formal equivalence ratio but use simulation for three distinct purposes: (1) testing new software before real-world deployment, (2) reproducing and investigating real-world events, (3) generating scenarios that are too dangerous or rare for real-world testing.

Practical implication for airside: To achieve the equivalent of 10,000 physical airport-km for certification (a rough tally of such an evidence mix is sketched after this list):

  • Need 10,000 physical km on airport (TractEasy precedent: 1-6 years)
  • OR 10,000 physical km on test track (mapped as 1:1)
  • AND 500,000-1,000,000 simulated km in validated digital twin (at 50:1-100:1 ratio)
  • AND 50,000+ physical km in shadow mode (no direct contribution to safety claim, but validates simulation fidelity)
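
Assuming the equivalence ratios from the table above, a minimal sketch of how such a tally might be computed; the evidence mix shown is purely illustrative, not a recommended certification plan:

```python
# Simulated-or-track km required per 1 "equivalent physical km" (from the table above)
EQUIVALENCE_RATIO = {
    "on_airport": 1,
    "test_track": 1,
    "digital_twin_sim": 50,    # high-fidelity end of the 100:1-50:1 band
    "replay_augmented": 2,
}

evidence_km = {                # hypothetical evidence mix
    "on_airport": 2_000,
    "test_track": 3_000,
    "digital_twin_sim": 500_000,
    "replay_augmented": 20_000,
}

equivalent = {k: v / EQUIVALENCE_RATIO[k] for k, v in evidence_km.items()}
print(equivalent, "total:", sum(equivalent.values()))
# -> {'on_airport': 2000.0, 'test_track': 3000.0,
#     'digital_twin_sim': 10000.0, 'replay_augmented': 10000.0} total: 25000.0
```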

6.5 The RAND Study: Why Simulation is Essential

The RAND Corporation's 2016 study "Driving to Safety" (Kalra & Paddock) established the foundational argument for why real-world mileage alone cannot prove AV safety:

Key numbers:

  • US human crash rate: approximately 1.09 fatalities per 100 million miles
  • To demonstrate with 95% confidence that an AV's fatality rate is below the human rate (i.e., AV is at least as safe as a human): need approximately 275 million miles with zero fatalities
  • To demonstrate the AV is 20% better than a human at 95% confidence with 80% power: need approximately 11 billion miles
  • With a fleet of 100 vehicles driving 24/7: 11 billion miles takes over 500 years

For airport airside operations, the challenge is even more acute:

  • The human incident rate on aprons is much higher than road driving (27,000 incidents/year across the industry)
  • But the number of operating vehicles is much smaller (tens, not millions)
  • And the operating hours per vehicle are lower (8-16 hours/day, not 24/7)
  • Therefore: accumulating statistically significant miles through real operations alone would take decades

This is why simulation is not a nice-to-have but a mathematical necessity. The only feasible path to a rigorous safety argument for airside AVs combines:

  1. Formal methods (RSS) for worst-case guarantees
  2. Simulation for statistical coverage of the scenario space
  3. Physical testing for sim-to-real validation
  4. Shadow mode for real-world ODD validation
  5. Bayesian combination of all evidence sources

7. Shadow Mode Validation

7.1 Shadow Mode Architecture

Shadow mode runs the AV software stack in parallel with a human operator who has actual control of the vehicle. The AV system processes real sensor data and makes decisions, but those decisions are recorded rather than executed. This provides real-world testing with zero safety risk.

For the reference airside AV stack Simplex architecture (see ../runtime-assurance/simplex-safety-architecture.md Section 4), shadow mode is a natural first step:

Real Sensors ──┬──> Human operator ──> Actuators (human retains control)
               │
               └──> Shadow AV stack ──> Decisions logged (NOT executed)
                                              │
                                              └──> Compared against human actions

7.2 Intervention Rate Metrics

The primary metric for shadow mode evaluation is the hypothetical intervention rate: how often would a human operator have needed to intervene if the AV had been in control?

Metric definitions:

| Metric | Definition | Formula | Target (pre-autonomous) |
|---|---|---|---|
| Interventions per hour (IPH) | Rate of required human interventions | N_interventions / total_hours | <0.1 IPH |
| Interventions per km (IPK) | Distance-normalized intervention rate | N_interventions / total_km | <0.01 IPK |
| Miles between interventions (MBI) | Average distance between interventions | total_km / N_interventions | >100 km |
| Critical intervention rate | Rate of interventions that prevented a safety incident | N_critical / total_hours | <0.01 per hour |
| False positive intervention rate | Rate of interventions where AV was actually correct | N_false_positive / N_interventions | Track but no target |
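
A minimal sketch of computing these metrics from fleet logs; the input format is an assumption for illustration, not the reference stack's telemetry schema:

```python
def intervention_metrics(total_hours: float, total_km: float,
                         interventions: list[dict]) -> dict:
    """interventions: one dict per event, e.g. {'category': 'critical_safety', ...}"""
    n = len(interventions)
    n_critical = sum(1 for i in interventions
                     if i.get("category") == "critical_safety")
    return {
        "iph": n / total_hours if total_hours else float("nan"),
        "ipk": n / total_km if total_km else float("nan"),
        "mbi_km": total_km / n if n else float("inf"),
        "critical_per_hour": n_critical / total_hours if total_hours else float("nan"),
    }

# Example: 400 h / 4,200 km of shadow driving with 12 logged interventions
print(intervention_metrics(400, 4200,
                           [{"category": "navigation"}] * 11
                           + [{"category": "critical_safety"}]))
```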

Intervention classification taxonomy:

| Category | Severity | Definition | Example |
|---|---|---|---|
| Critical safety | High | AV would have caused collision or clearance violation | Planning to drive through personnel |
| Near-miss | Medium-High | AV would have come dangerously close | <1 m from obstacle, TTC < 1 s |
| Comfort/efficiency | Medium | AV path is safe but suboptimal | Unnecessary hard braking, wide detour |
| Navigation | Low | AV would have taken wrong route | Missing a turn, entering wrong stand |
| False alarm | Info | AV stopped/slowed for phantom obstacle | Ghost detection causing unnecessary stop |
| Operator preference | Info | Human chose differently but AV was acceptable | Slightly different line through apron |

7.3 Disagreement Analysis

Systematically categorize every instance where the AV's decision differs from the human operator's action.

Disagreement detection thresholds:

| Variable | Threshold for "Disagreement" | Threshold for "Critical Disagreement" |
|---|---|---|
| Speed | abs(delta_v) > 2 km/h | — |
| Steering | abs(delta_steer) > 0.087 rad (approx. 5°) | — |
| Stop/go | AV moving when human stopped (or vice versa) | AV moving when human emergency-stopped |
| Path | Lateral deviation > 0.5 m | AV path intersects obstacle |

Disagreement analysis pipeline:

```python
class DisagreementAnalyzer:
    """Analyze shadow mode disagreements between AV and human operator."""
    
    def __init__(self):
        self.disagreements = []
        self.total_frames = 0
    
    def analyze_frame(self, timestamp: float, 
                       human_cmd: dict, av_cmd: dict,
                       scene_context: dict) -> dict | None:
        """
        Compare human and AV commands for a single frame.
        
        human_cmd: {'linear_x': float, 'angular_z': float, 'estop': bool}
        av_cmd:    {'linear_x': float, 'angular_z': float, 'estop': bool}
        scene_context: {'nearest_obstacle_dist': float, 'nearest_obstacle_type': str, ...}
        
        Returns disagreement record if threshold exceeded, else None.
        """
        self.total_frames += 1
        
        speed_diff = av_cmd['linear_x'] - human_cmd['linear_x']
        steer_diff = av_cmd['angular_z'] - human_cmd['angular_z']
        
        # Check for critical disagreement
        is_critical = False
        if human_cmd.get('estop', False) and not av_cmd.get('estop', False):
            is_critical = True
            category = 'critical_safety'
        elif (human_cmd['linear_x'] < 0.1 and av_cmd['linear_x'] > 1.0):
            is_critical = True
            category = 'critical_safety'
        elif abs(speed_diff) > 2.0:  # km/h
            category = 'speed_disagreement'
        elif abs(steer_diff) > 0.087:  # ~5 degrees
            category = 'steering_disagreement'
        else:
            return None  # No significant disagreement
        
        record = {
            'timestamp': timestamp,
            'category': category,
            'is_critical': is_critical,
            'human_speed': human_cmd['linear_x'],
            'av_speed': av_cmd['linear_x'],
            'speed_diff': speed_diff,
            'human_steer': human_cmd['angular_z'],
            'av_steer': av_cmd['angular_z'],
            'steer_diff': steer_diff,
            'nearest_obstacle': scene_context.get('nearest_obstacle_dist'),
            'obstacle_type': scene_context.get('nearest_obstacle_type'),
        }
        self.disagreements.append(record)
        return record
    
    def summary_report(self) -> dict:
        """Generate summary of all disagreements."""
        if self.total_frames == 0:
            return {'error': 'No frames analyzed'}
        
        categories = {}
        for d in self.disagreements:
            cat = d['category']
            categories[cat] = categories.get(cat, 0) + 1
        
        critical = [d for d in self.disagreements if d['is_critical']]
        
        return {
            'total_frames': self.total_frames,
            'total_disagreements': len(self.disagreements),
            'disagreement_rate': len(self.disagreements) / self.total_frames,
            'critical_disagreements': len(critical),
            'critical_rate': len(critical) / self.total_frames,
            'by_category': categories,
            'agreement_rate': 1 - len(self.disagreements) / self.total_frames,
        }
```

7.4 Shadow-to-Autonomous Transition Criteria

The decision to transition from shadow mode to supervised autonomous operation requires meeting quantitative thresholds:

Phase gate criteria (aligned with ../runtime-assurance/simplex-safety-architecture.md Section 4.3):

| Gate | From | To | Criteria | Minimum Duration |
|---|---|---|---|---|
| G1 | Shadow mode | Supervised Simplex | Agreement rate >85%, zero critical disagreements in last 1,000 km, all golden scenarios pass in SIL | 3 months |
| G2 | Supervised Simplex | Full Simplex | Agreement rate >95%, zero safety violations in last 5,000 km, operator intervention rate <0.1/hour | 3 months |
| G3 | Full Simplex | Primary (shadow becomes primary) | >99% shadow driving time, zero collisions in last 10,000 km, regression suite 100% pass | 6 months |
| G4 | Primary | Unsupervised (no safety operator) | Regulatory approval obtained, insurance coverage confirmed, intervention rate <0.01/hour for 6+ months | 12+ months |
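
A minimal sketch of how the G1 criteria above could be checked automatically from fleet metrics; the field names and metrics dictionary are assumptions for illustration:

```python
def g1_gate_passed(metrics: dict) -> bool:
    """G1: shadow mode -> supervised Simplex, per the criteria table above."""
    return (metrics["agreement_rate"] > 0.85
            and metrics["critical_disagreements_last_1000_km"] == 0
            and metrics["golden_scenario_pass_rate"] == 1.0
            and metrics["shadow_duration_months"] >= 3)

print(g1_gate_passed({
    "agreement_rate": 0.91,
    "critical_disagreements_last_1000_km": 0,
    "golden_scenario_pass_rate": 1.0,
    "shadow_duration_months": 4,
}))  # -> True
```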

Gate review process:

  1. Data collection: Automated metrics from fleet telemetry
  2. Engineering review: Safety team reviews all critical disagreements and near-misses
  3. Independent assessment: Third-party auditor reviews evidence (required for G3 and G4)
  4. Stakeholder sign-off: Airport authority, airline partners, insurance provider
  5. Regulatory notification: Inform FAA/CAA/EASA of phase transition (required for G4)

8. Regression Testing

8.1 Test Suite Management

A regression test suite for an airside AV system with 115+ functional scenarios requires disciplined management.

Test suite structure:

test_suite/
├── golden/                    # Must-pass scenarios (never deleted)
│   ├── personnel_detection/   # 20 scenarios
│   ├── aircraft_clearance/    # 15 scenarios
│   ├── emergency_stop/        # 15 scenarios
│   ├── gse_interaction/       # 15 scenarios
│   ├── environmental/         # 15 scenarios
│   ├── geofence/              # 10 scenarios
│   └── multi_vehicle/         # 10 scenarios
│   └── total: 100 golden scenarios
├── parameterized/             # Auto-generated from logical scenarios
│   ├── transit/               # 500 scenarios
│   ├── approach/              # 400 scenarios
│   ├── turnaround/            # 600 scenarios
│   └── ...
│   └── total: ~5,000 parameterized scenarios
├── adversarial/               # Discovered by search (grows over time)
│   ├── discovered_2025Q1/     # 50 scenarios
│   ├── discovered_2025Q2/     # 75 scenarios
│   └── ...
│   └── total: growing (target 500+ by certification)
├── replay/                    # Real-world data replays
│   ├── incidents/             # All incidents/near-misses
│   ├── interesting/           # Flagged by operators
│   └── random_sample/         # Statistical sample of normal ops
│   └── total: growing (target 1,000+ by certification)
└── metamorphic/               # Metamorphic relation tests
    ├── speed_monotonicity/    # 200 scenario pairs
    ├── symmetry/              # 100 scenario pairs
    └── ...
    └── total: ~500 relation checks

Test lifecycle:

| Event | Golden | Parameterized | Adversarial | Replay |
|---|---|---|---|---|
| New software release | Run all (must pass 100%) | Run all (track pass rate) | Run all (must pass 100%) | Run all (track divergence) |
| Perception model update | Run all | Run perception-relevant | Run all | Run perception-relevant |
| Planner parameter change | Run safety-relevant | Run planner-relevant | Run all | Run planner-relevant |
| New scenario discovered | Add to adversarial | May generate new parameterized | Add from search | Add from fleet data |
| Scenario fails | Investigate; if valid bug, fix. If scenario invalid, remove. Never remove a valid failing test. | Same | Same | Same |

8.2 Perception Regression

Track perception metrics across model versions to detect regressions:

| Metric | Tracked Per | Regression Threshold | Action |
|---|---|---|---|
| mAP (overall) | Object class, distance band | >1% drop | Block release, investigate |
| AP (personnel) | Distance, lighting, weather | >0.5% drop | Block release (safety-critical) |
| AP (aircraft) | Distance, aircraft type | >0.5% drop | Block release (safety-critical) |
| NDS (nuScenes Detection Score) | Overall | >1% drop | Warning, review |
| Recall @ 30m (personnel) | Lighting condition | >0.5% drop | Block release |
| FPR (false positive rate) | Object class | >50% increase | Warning, review |
| Inference latency | Overall, per-component | >10% increase | Block release (timing-critical) |
| GPU memory | Peak usage | >10% increase | Warning (may affect other modules) |

Perception regression CI pipeline:

```yaml
# .github/workflows/perception-regression.yml (conceptual)
name: Perception Regression
on:
  pull_request:
    paths:
      - 'airside_perception/**'
      - 'models/**'

jobs:
  regression:
    runs-on: [self-hosted, gpu, orin]  # or cloud GPU
    steps:
      - name: Build perception stack
        run: catkin build airside_perception
      
      - name: Run evaluation on validation set
        run: |
          python3 evaluate_perception.py \
            --model=models/latest \
            --dataset=datasets/airside_val_v3 \
            --output=results/perception_eval.json
      
      - name: Compare against baseline
        run: |
          python3 compare_metrics.py \
            --current=results/perception_eval.json \
            --baseline=baselines/perception_v2.3.json \
            --thresholds=config/regression_thresholds.yaml
      
      - name: Gate decision
        run: |
          python3 gate_decision.py \
            --comparison=results/comparison.json \
            --fail-on=safety_critical_regression
```
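
The workflow above references compare_metrics.py and gate_decision.py without showing them; a minimal sketch of the comparison-and-gate logic such scripts would need, with an assumed metrics/threshold schema rather than the project's actual files:

```python
import json
import sys

import yaml  # PyYAML

def find_regressions(current_path: str, baseline_path: str,
                     thresholds_path: str) -> list[dict]:
    """Compare current metrics against a baseline using per-metric thresholds."""
    with open(current_path) as f:
        current = json.load(f)
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(thresholds_path) as f:
        thresholds = yaml.safe_load(f)

    regressions = []
    for metric, rule in thresholds.items():
        delta = current[metric] - baseline[metric]
        # Example rule: {max_drop: 0.005, severity: safety_critical}
        if "max_drop" in rule and -delta > rule["max_drop"]:
            regressions.append({"metric": metric, "delta": delta,
                                "severity": rule.get("severity", "warning")})
        if "max_increase" in rule and delta > rule["max_increase"]:
            regressions.append({"metric": metric, "delta": delta,
                                "severity": rule.get("severity", "warning")})
    return regressions

if __name__ == "__main__":
    found = find_regressions(*sys.argv[1:4])
    for r in found:
        print(r)
    # Hard gate: any safety-critical regression fails the CI job
    if any(r["severity"] == "safety_critical" for r in found):
        sys.exit(1)
```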

8.3 Planning Regression

Track planning metrics across planner updates:

| Metric | Definition | Regression Threshold |
|---|---|---|
| Collision rate | % of scenarios with collision | Any increase from 0% (zero tolerance) |
| Min clearance (aircraft) | Minimum distance to aircraft across all scenarios | <5 m (absolute threshold) |
| Min clearance (personnel) | Minimum distance to personnel | <2 m (absolute threshold) |
| Average TTC | Mean time-to-collision in near-miss scenarios | >10% decrease |
| Comfort: max jerk | Maximum longitudinal jerk | >3 m/s^3 |
| Comfort: max lateral acceleration | Maximum lateral acceleration | >2 m/s^2 |
| Mission completion rate | % of scenarios completed successfully | >1% decrease |
| Average mission time | Time to complete standard missions | >10% increase |
| E-stop rate | % of scenarios requiring emergency stop | >50% increase |
| Path efficiency | Ratio of actual path length to optimal | >5% decrease |

8.4 CI/CD Integration

Automated test execution on every code change:

Test tiers and execution triggers:

| Tier | Tests | Trigger | Duration | Environment |
|---|---|---|---|---|
| T0: Smoke | 10 golden scenarios, build check | Every commit | 15 minutes | Docker container |
| T1: Unit | All unit tests + static analysis | Every PR | 30 minutes | Docker container |
| T2: Integration | 100 golden scenarios in Gazebo | PR merge to develop | 2 hours | GPU server |
| T3: Full regression | All 5,000+ scenarios in CARLA | Weekly + pre-release | 24 hours | GPU cluster (4+ nodes) |
| T4: Extended | Adversarial search + Monte Carlo | Monthly + pre-certification | 72 hours | GPU cluster |

Golden scenario gate: No software release may proceed to physical testing unless 100% of golden scenarios pass at the T2 level. This is a hard gate with no exceptions.

8.5 Golden Scenarios

Golden scenarios are a curated set of must-pass scenarios that represent the most safety-critical situations. They are:

  1. Immutable: Once added, a golden scenario is never removed (only deprecated if the ODD changes)
  2. Representative: Cover each hazard category from the airside taxonomy
  3. Reproducible: Fully specified concrete scenarios with deterministic execution
  4. Gating: Any golden scenario failure blocks deployment

Golden scenario selection criteria:

| Selection Criterion | Description |
|---|---|
| Real-world incident | Derived from an actual incident or near-miss (highest priority) |
| Adversarial discovery | Found by search-based testing as failure-inducing |
| Hazard coverage | Ensures each of the 115 functional scenarios has at least one golden representative |
| Regulatory requirement | Specifically required by ISO 3691-4, UL 4600, or airport authority |
| High-consequence | Involves aircraft proximity, personnel safety, or geofence violation |

Initial golden scenario set (100 scenarios):

| Category | Count | Examples |
|---|---|---|
| Personnel detection and avoidance | 20 | Crouching person behind GSE, person emerging from under aircraft, group crossing path, person in blind spot |
| Aircraft clearance | 15 | Approach to narrow-body stand, approach to wide-body stand, aircraft pushback in progress, engine start during approach |
| Emergency stop | 15 | Sudden obstacle at 5m/10m/20m/30m, e-stop on wet surface, e-stop on slope, e-stop with loaded dolly train |
| GSE interaction | 15 | Belt loader crossing path, fuel truck right-of-way, follow-me car leading, parallel approach with another tractor |
| Environmental | 15 | Heavy rain, fog <100m, night + no apron lights, de-icing spray, jet blast zone transit |
| Geofence and navigation | 10 | Approach to geofence boundary, taxiway crossing with aircraft, depot entry/exit transition |
| Multi-vehicle coordination | 10 | Two AVs approaching same stand, AV yielding to manual GSE, convoy operation, conflicting paths at intersection |

9. Digital Twin Validation for Airside

9.1 Airport Digital Twin Construction

A digital twin is a high-fidelity virtual replica of a specific airport that enables scenario-based testing with realistic geometry, materials, and environmental conditions.

Data sources for digital twin construction:

| Data Source | What It Provides | Accuracy | Cost | Availability |
|---|---|---|---|---|
| AMDB (Aerodrome Mapping Database) | Apron layout, taxiway geometry, stand positions | +/-0.5 m at best | Free (FAA for 500+ US airports) | Public |
| HD survey (LiDAR/photogrammetry) | 3D point cloud of apron area, building facades, static objects | +/-0.02-0.05 m | $20-50K per airport | Custom survey required |
| As-built CAD drawings | Building dimensions, infrastructure layout | Varies (may be outdated) | Airport authority provides | Request from airport |
| Satellite/aerial imagery | Ground texture, surface markings, layout verification | +/-0.3-1.0 m | $500-5K (commercial providers) | Readily available |
| AIP charts | Taxiway designations, stand numbers, surface types | Authoritative but low-res | Free (national AIPs) | Public |

Construction pipeline:

AMDB data (FAA)
    ↓ Parse AMXM GML → extract geometry
Base layout (apron polygons, taxiway centerlines, stand locations)
    ↓ Overlay HD survey point cloud
Refined geometry (±0.05 m accuracy)
    ↓ Material assignment (concrete, asphalt, markings)
Textured environment
    ↓ Import 3D models (aircraft, GSE, buildings)
Populated scene
    ↓ Add dynamic actors (OpenSCENARIO)
Executable digital twin
    ↓ Validate against real data (drive same route, compare sensor output)
Validated digital twin

Estimated construction cost:

| Component | First Airport | Additional Airport (same cluster) |
|---|---|---|
| AMDB data acquisition and parsing | $5-10K | $2-5K |
| HD survey | $20-50K | $15-30K |
| 3D environment creation | $15-30K | $10-20K |
| 3D asset modeling (aircraft, GSE) | $10-20K (reusable) | $2-5K (customization only) |
| Sensor model calibration | $10-15K | $5-10K |
| Validation against real data | $5-10K | $3-5K |
| Total | $65-135K | $37-75K |

9.2 Injecting Real-World Events

The value of a digital twin increases dramatically when it can replay real operational events:

NOTAM injection:

Parse real NOTAMs (Notices to Air Missions) and apply their effects to the simulation:

| NOTAM Type | Simulation Effect | Example |
|---|---|---|
| Taxiway closure | Remove taxiway from available routes, add construction zone | "TWY A CLSD BTN A3 AND A5 FOR MAINT" |
| Stand closure | Mark stand as unavailable, may block access | "STAND 42 CLSD" |
| Lighting outage | Disable apron lights in affected area | "APRON FLOOD LGTG U/S RWY 27R APRON" |
| Construction activity | Add construction vehicles, barriers, personnel | "CONSTRUCTION IN PROGRESS APRON EAST" |
| Wildlife hazard | Add wildlife actors to simulation | "BIRD ACTIVITY RPTD APRON AREA" |

A-CDM timeline injection:

Import real A-CDM (Airport Collaborative Decision Making) data to drive realistic turnaround timing:

  • TOBT (Target Off-Block Time): Drives pushback scheduling
  • AIBT (Actual In-Block Time): Triggers arrival service sequence
  • ELDT (Estimated Landing Time): Pre-positions baggage tractors

Using real A-CDM data from a partner airport ensures that the simulation reproduces realistic timing pressures, simultaneous stand operations, and fleet utilization patterns.

Weather replay:

Import historical METAR/TAF data to reproduce actual weather conditions:

METAR EGLL 121150Z 24012G22KT 3000 -RA SCT010 BKN015 08/06 Q1012
→ Simulation: Wind 240° at 12kt gusting 22kt, 3km visibility, 
  light rain, scattered clouds at 1000ft, temperature 8°C
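
A minimal sketch of mapping the wind, visibility, and temperature groups of a METAR string onto simulator weather parameters; a real deployment would use a full METAR parser, and this regex handles only the groups shown in the example above:

```python
import re

def metar_to_sim_weather(metar: str) -> dict:
    wind = re.search(r"\b(\d{3})(\d{2})(?:G(\d{2}))?KT\b", metar)
    vis = re.search(r"\b(\d{4})\b", metar)            # 4-digit visibility group, metres
    temp = re.search(r"\b(M?\d{2})/(M?\d{2})\b", metar)
    to_c = lambda s: -int(s[1:]) if s.startswith("M") else int(s)
    return {
        "wind_dir_deg": int(wind.group(1)) if wind else None,
        "wind_speed_kt": int(wind.group(2)) if wind else None,
        "wind_gust_kt": int(wind.group(3)) if wind and wind.group(3) else None,
        "visibility_m": int(vis.group(1)) if vis else None,
        "temperature_c": to_c(temp.group(1)) if temp else None,
        "precipitation": "-RA" in metar or "RA" in metar.split(),
    }

print(metar_to_sim_weather(
    "METAR EGLL 121150Z 24012G22KT 3000 -RA SCT010 BKN015 08/06 Q1012"))
# {'wind_dir_deg': 240, 'wind_speed_kt': 12, 'wind_gust_kt': 22,
#  'visibility_m': 3000, 'temperature_c': 8, 'precipitation': True}
```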

9.3 Sensor Simulation Fidelity Requirements

For simulation results to be accepted as certification evidence, the sensor simulation must meet minimum fidelity requirements:

| Sensor | Fidelity Requirement | Validation Method |
|---|---|---|
| LiDAR point cloud | <5% point density difference vs. real at matched range | Drive same route real and sim, compare point counts per m^3 |
| LiDAR intensity | KL-divergence < 0.1 between real and sim intensity distributions | Statistical comparison on matched surfaces |
| LiDAR dropout/noise | Replicate range-dependent noise profile (measured) | Compare noise statistics at 10m, 30m, 50m, 100m |
| LiDAR beam pattern | Match real sensor beam elevation angles within 0.05° | Measure with calibration target |
| Camera image | FID < 50 between real and sim images from same viewpoint | Paired image comparison |
| IMU noise | Allan variance within 10% of real sensor spec | Compare Allan variance plots |
| GPS accuracy | Reproduce RTK fix/float/no-fix transitions from real data | Replay GPS conditions in sim |

Validation protocol:

  1. Drive the real vehicle through a standardized route at the target airport
  2. Record all sensor data (rosbag) with centimeter-accurate ground truth (RTK base station + overhead tracking)
  3. Reproduce the identical route in the digital twin
  4. Record simulated sensor data
  5. Compare real vs. simulated data using the metrics above
  6. If any metric exceeds threshold, refine the sensor model and repeat
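
A minimal sketch of step 5 for the LiDAR point-density requirement: compare real vs. simulated point counts per range bin for the same route. The array shapes, bin edges, and 5% threshold follow the fidelity table; everything else is illustrative.

```python
import numpy as np

def density_gap_per_range_bin(real_pts: np.ndarray, sim_pts: np.ndarray,
                              bin_edges=(0, 10, 30, 50, 100)) -> dict:
    """real_pts, sim_pts: (N, 3) point clouds in the sensor frame, in metres."""
    gaps = {}
    r_real = np.linalg.norm(real_pts, axis=1)
    r_sim = np.linalg.norm(sim_pts, axis=1)
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        n_real = np.count_nonzero((r_real >= lo) & (r_real < hi))
        n_sim = np.count_nonzero((r_sim >= lo) & (r_sim < hi))
        if n_real:
            gaps[f"{lo}-{hi} m"] = abs(n_sim - n_real) / n_real
    return gaps

# Flag a fidelity failure if any bin exceeds the 5% requirement:
# assert all(gap < 0.05 for gap in density_gap_per_range_bin(real, sim).values())
```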

10. Airside-Specific Test Protocols

10.1 Aircraft Proximity Testing

Aircraft are the highest-consequence obstacles on the apron. A collision with an aircraft can cause millions of dollars in damage ($250K average, up to $35M per engine or $139M+ structural per IATA Ground Damage Database).

Test matrix:

| Test ID | Description | Speed | Approach Angle | Aircraft State | Pass Criterion |
|---|---|---|---|---|---|
| AP-001 | Head-on approach to narrow-body nose | 5 km/h | 0° (centerline) | Parked, engines off | Stop >=3 m from nose |
| AP-002 | Head-on approach to wide-body nose | 5 km/h | 0° (centerline) | Parked, engines off | Stop >=3 m from nose |
| AP-003 | Angled approach to narrow-body (left 30°) | 10 km/h | +30° | Parked, APU on | Stop >=3 m from fuselage |
| AP-004 | Angled approach to narrow-body (right 30°) | 10 km/h | -30° | Parked, APU on | Stop >=3 m from fuselage |
| AP-005 | Approach to wing tip | 10 km/h | 90° | Parked, engines off | Stop >=5 m from wing tip |
| AP-006 | Approach to engine nacelle | 5 km/h | 45° | Engines starting | Stop >=10 m (jet blast zone) |
| AP-007 | Transit behind taxiing aircraft | 15 km/h | Following | Taxiing out | Maintain >=50 m clearance |
| AP-008 | Crossing path of taxiing aircraft | 15 km/h | Perpendicular | Taxiing in | Yield, do not enter crossing until >200 m clear |
| AP-009 | Approach during pushback | 5 km/h | Various | Pushback in progress | Stop and wait until pushback complete |
| AP-010 | Emergency stop near aircraft | 15 km/h | Various | Any | Stop within braking distance, no contact |

Execution: AP-001 through AP-006 can be tested physically with a foam/inflatable aircraft mockup. AP-007 through AP-009 require SiL or VIL with injected virtual aircraft. AP-010 is tested both physically (with mockup) and in SiL (for comprehensive speed/angle matrix).

10.2 Jet Blast Scenario Validation

Jet blast is an airside-specific hazard with no equivalent in road driving. Engine exhaust can reach 50+ m/s at close range, posing both direct physical danger and sensor interference (LiDAR distortion from thermal gradient, camera image distortion).

Jet blast test matrix:

| Test ID | Engine State | Wind Condition | AV Position | Pass Criterion |
|---|---|---|---|---|
| JB-001 | Idle (low power) | Calm | 30 m behind | AV detects jet blast zone (thermal/anemometer), maintains position |
| JB-002 | Breakaway (high power) | Calm | 50 m behind | AV detects increased blast, retreats to safe distance |
| JB-003 | Idle | Crosswind 15 kt | 20 m lateral | AV detects deflected blast cone, adjusts route |
| JB-004 | Taxi power | Headwind 20 kt | 30 m behind | AV detects extended blast zone |
| JB-005 | Idle | Calm | Transit route crosses blast zone | AV routes around blast zone |
| JB-006 | Engine start (unexpected) | Calm | 15 m behind | AV detects engine start, emergency retreat |

Sensor validation under jet blast:

| Sensor | Expected Degradation | Test Method | Pass Criterion |
|---|---|---|---|
| LiDAR | Range errors from thermal refraction, increased noise | SiL with physics-based thermal model | OOD detection triggers within 500 ms |
| Camera | Heat shimmer distortion | SiL + limited physical | VLM or distortion detector flags anomaly |
| 4D radar | Minimal effect (RF penetrates thermal gradient) | SiL + physical | <2% AP degradation |
| Thermal camera | Strong signal (heat plume visible) | Physical | Jet blast boundary detectable at 100+ m |

10.3 Night and Adverse Weather Test Matrix

| Test ID | Time | Weather | Visibility | Surface | Key Test Focus |
|---|---|---|---|---|---|
| NW-001 | Night | Clear | >1 km | Dry | Baseline night perception (apron lights on) |
| NW-002 | Night | Clear | >1 km | Dry | Night perception (apron lights OFF -- power failure) |
| NW-003 | Night | Light rain | 500 m-1 km | Wet | Personnel detection with reflections |
| NW-004 | Night | Heavy rain | 200-500 m | Standing water | LiDAR rain filtering, reduced speed |
| NW-005 | Day | Dense fog | <200 m | Wet | Camera/LiDAR range reduction |
| NW-006 | Day | Snow | 500 m | Snow-covered | Surface marking occlusion, reduced friction |
| NW-007 | Day | De-icing spray | Variable | Ice/chemical | Sensor contamination, reduced detection |
| NW-008 | Dawn | Sun glare | >1 km | Dry | Camera saturation (if cameras active) |
| NW-009 | Night | Frost | >1 km | Icy | Extended braking distance |
| NW-010 | Day | Crosswind 30 kt | >1 km | Dry | Vehicle stability, trajectory deviation |

Execution method by test:

| Test ID | Physical | SiL | HIL | Notes |
|---|---|---|---|---|
| NW-001 | Yes (real night) | Yes | Yes | Must test at actual airport after dark |
| NW-002 | Difficult (need airport cooperation) | Yes | Yes | SiL primary; physical if possible |
| NW-003-004 | Opportunistic (real weather) | Yes | Yes | Physical when weather occurs naturally |
| NW-005 | Rare (need real fog) | Yes | Yes | SiL primary; physical when fog occurs |
| NW-006 | Seasonal (winter only) | Yes | Yes | Physical at cold-weather airports |
| NW-007 | Controlled (spray test vehicle) | Yes | Yes | Physical spray on LiDAR lens |
| NW-008 | Yes (dawn testing) | Yes | Yes | Schedule test at sunrise |
| NW-009 | Seasonal | Yes | Yes | Physical at cold-weather airports |
| NW-010 | Opportunistic | Yes | Yes | Physical when wind occurs |

10.4 Personnel Interaction Scenarios

Personnel safety is the most critical requirement. Ground crew operate in close proximity to moving vehicles, often in low-visibility conditions, and may behave unpredictably.

Test matrix:

| Test ID | Personnel Behavior | Environment | Distance | Speed | Pass Criterion |
|---|---|---|---|---|---|
| PI-001 | Standing in path | Day, clear | 20 m | 10 km/h | Detect at 20m, stop >2m away |
| PI-002 | Walking across path | Day, clear | 15 m | 15 km/h | Detect, brake, stop >2m away |
| PI-003 | Running across path | Day, clear | 15 m | 15 km/h | Detect, emergency brake, stop >1m |
| PI-004 | Crouching behind GSE | Day, clear | 10 m | 5 km/h | Detect when visible, stop >2m |
| PI-005 | Emerging from under aircraft | Day, clear | 8 m | 5 km/h | Detect upon emergence, stop >2m |
| PI-006 | Group of 5+ personnel | Day, clear | 20 m | 10 km/h | Detect all individuals, stop >3m from nearest |
| PI-007 | Standing in path | Night, no lights | 15 m | 10 km/h | Detect (thermal/LiDAR), stop >2m |
| PI-008 | Walking, wearing hi-vis | Night, apron lights | 20 m | 10 km/h | Detect, track, avoid |
| PI-009 | Walking, dark clothing | Night, apron lights | 15 m | 10 km/h | Detect, track, avoid |
| PI-010 | Personnel on loading platform | Day, clear | 5 m | 5 km/h | Detect elevated person, maintain clearance |
| PI-011 | Marshaller giving signals | Day, clear | 20 m | 5 km/h | Detect marshaller, follow (if gesture recognition active) |
| PI-012 | Person lying on ground | Day, clear | 15 m | 5 km/h | Detect low-profile person, stop |

Physical test equipment:

  • Articulated pedestrian mannequin (ISO 19206-2 compliant, height 1.8 m, width 0.5 m)
  • Child-sized mannequin (height 1.1 m) for crouching person simulation
  • Motorized mannequin cart (programmable speed 1-8 km/h, programmable direction)
  • Hi-vis vest set (standard airport ground crew PPE)
  • Thermal mannequin or heated target (for thermal camera validation)

10.5 Mixed Fleet Testing

Airport aprons operate mixed fleets of autonomous and manually-driven GSE. Testing must validate safe interaction.

Mixed fleet test scenarios:

| Test ID | Scenario | AV Role | Manual GSE Behavior | Pass Criterion |
|---|---|---|---|---|
| MF-001 | Head-on encounter on service road | Ego | Approaching at 15 km/h | AV yields or maintains lane |
| MF-002 | Intersection without priority | Ego | Approaching from right | AV yields to right-of-way |
| MF-003 | Manual GSE cuts in front | Ego | Merges 10 m ahead | AV reduces speed, maintains clearance |
| MF-004 | Manual GSE stops suddenly | Following | Emergency stop | AV stops with >2 m clearance |
| MF-005 | Convoy with manual leader | Following | Leader vehicle | AV follows at safe distance, matches speed |
| MF-006 | Parallel approach to adjacent stand | Ego | Manual tractor at next stand | AV maintains lateral clearance |
| MF-007 | Manual GSE ignores AV | Ego | Does not yield, drives through | AV performs evasive maneuver or stops |
| MF-008 | Manual GSE reverses unexpectedly | Ego | Reverses at 5 km/h | AV detects reverse motion, stops |

10.6 Emergency Stop Validation

Emergency stop (e-stop) is the most safety-critical function. ISO 3691-4 requires specific testing.

Braking distance test matrix:

| Test ID | Speed (km/h) | Surface | Load | Slope | Target Braking Distance | Method |
|---|---|---|---|---|---|---|
| ES-001 | 5 | Dry concrete | Unloaded | 0% | <1.0 m | Physical |
| ES-002 | 10 | Dry concrete | Unloaded | 0% | <2.5 m | Physical |
| ES-003 | 15 | Dry concrete | Unloaded | 0% | <5.0 m | Physical |
| ES-004 | 25 | Dry concrete | Unloaded | 0% | <12.0 m | Physical |
| ES-005 | 15 | Dry concrete | Full load (3-dolly train) | 0% | <8.0 m | Physical |
| ES-006 | 15 | Wet concrete | Unloaded | 0% | <7.5 m | Physical |
| ES-007 | 15 | Wet concrete | Full load | 0% | <12.0 m | Physical |
| ES-008 | 15 | Icy surface | Unloaded | 0% | <15.0 m | Physical (winter) |
| ES-009 | 15 | Dry concrete | Unloaded | +3% (uphill) | <4.5 m | Physical |
| ES-010 | 15 | Dry concrete | Unloaded | -3% (downhill) | <6.5 m | Physical |

E-stop activation methods to test:

| Activation Method | Test | Pass Criterion |
|---|---|---|
| Physical e-stop button on vehicle | Press while driving at 15 km/h | Full stop within specified distance |
| Remote e-stop (wireless) | Activate from 50 m away | Full stop within specified distance + 200 ms latency |
| Software e-stop (safety monitor) | Inject virtual obstacle at 5 m | Full stop before obstacle |
| Watchdog timeout | Kill perception node | E-stop within 200 ms of heartbeat loss |
| Communication loss | Disable all wireless | E-stop within configured timeout (e.g., 5 s) |

Test repetitions per configuration: Minimum 30 runs per test ID to establish statistical confidence on braking distance (mean and 99th percentile).
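
A minimal sketch of summarizing the 30+ repetitions for one test ID and checking them against the target from the matrix above; the example numbers are synthetic, not measured data:

```python
import numpy as np

def braking_distance_summary(distances_m: list[float], target_m: float) -> dict:
    d = np.asarray(distances_m, dtype=float)
    return {
        "n_runs": int(d.size),
        "mean_m": float(d.mean()),
        "std_m": float(d.std(ddof=1)),
        "p99_m": float(np.percentile(d, 99)),
        "max_m": float(d.max()),
        "passes": bool(np.percentile(d, 99) < target_m and d.max() < target_m),
    }

# e.g. ES-003 (15 km/h, dry concrete, target <5.0 m) with 30 synthetic runs
rng = np.random.default_rng(0)
runs = rng.normal(4.2, 0.15, size=30)
print(braking_distance_summary(runs.tolist(), target_m=5.0))
```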

10.7 Geofence Boundary Testing

The AV must never leave its authorized operating zone. Geofence violations could result in runway incursion (catastrophic) or entry into restricted areas.

Geofence test matrix:

| Test ID | Scenario | Speed | Approach Angle | Pass Criterion |
|---|---|---|---|---|
| GF-001 | Approach geofence boundary head-on | 15 km/h | 90° to boundary | Stop >1 m before boundary |
| GF-002 | Approach geofence boundary at angle | 15 km/h | 45° to boundary | Stop >1 m before boundary |
| GF-003 | Approach geofence boundary at high speed | 25 km/h | 90° to boundary | Stop before boundary |
| GF-004 | GPS dropout near boundary | 10 km/h | 90° to boundary | Detect GPS loss, stop (not rely on dead reckoning to cross) |
| GF-005 | Route planned through geofence | N/A | N/A | Planner rejects route, replans |
| GF-006 | Dynamic geofence update (NOTAM) | 10 km/h | Toward new restriction | Geofence updates, AV reroutes |
| GF-007 | Geofence near taxiway crossing | 10 km/h | Approach crossing | AV crosses only when cleared, does not violate surrounding geofence |

11. Test Infrastructure Requirements

11.1 Test Track Layout

A dedicated test track for airside AV validation should replicate the key features of an airport apron environment.

Minimum test track elements:

┌─────────────────────────────────────────────────────────────┐
│                    Test Track Layout                         │
│                    (minimum 100m x 60m)                      │
│                                                              │
│  ┌──────────────────────────────────────┐                    │
│  │        Mock Apron Area               │                    │
│  │   ┌─────┐    ┌─────┐    ┌─────┐    │                    │
│  │   │Stand│    │Stand│    │Stand│    │                    │
│  │   │  1  │    │  2  │    │  3  │    │                    │
│  │   └──┬──┘    └──┬──┘    └──┬──┘    │                    │
│  │      │          │          │        │                    │
│  │   [Aircraft] [Aircraft] [Aircraft]  │                    │
│  │   [Mockup]   [Mockup]   [Mockup]   │                    │
│  │      │          │          │        │                    │
│  │   Service road ═══════════════════  │                    │
│  │                                      │                    │
│  │   [GSE       ] [Mannequin] [GSE   ] │                    │
│  │   [Obstacles ] [Targets  ] [Obst. ] │                    │
│  └──────────────────────────────────────┘                    │
│                                                              │
│  ═══════ Service Road (straight, 200m) ═══════               │
│                                                              │
│  ┌──────────────┐    ┌──────────────┐                       │
│  │ Taxiway       │    │ Mock Depot   │                       │
│  │ Crossing      │    │ (start/end)  │                       │
│  │ Simulation    │    │              │                       │
│  └──────────────┘    └──────────────┘                       │
│                                                              │
│  ┌──────────────┐                                           │
│  │ Lighting      │                                           │
│  │ Array         │  (adjustable: 0 to 500 lux)              │
│  └──────────────┘                                           │
└─────────────────────────────────────────────────────────────┘

Physical mock-ups required:

| Item | Specification | Estimated Cost | Purpose |
|---|---|---|---|
| Aircraft nose mockup | Foam/inflatable, A320 dimensions, correct reflectivity | $5-15K | Aircraft proximity testing |
| Aircraft wing section | 10 m span, correct height, reflective surface | $3-8K | Wing clearance testing |
| GSE obstacles | Foam belt loader, baggage tractor, fuel truck (static) | $2-5K each | Obstacle avoidance testing |
| Pedestrian mannequins (x5) | ISO 19206-2 compliant, articulated | $2-3K each | Personnel detection testing |
| Motorized mannequin cart (x2) | Programmable 1-8 km/h, direction control | $5-10K each | Dynamic personnel testing |
| Surface markings | Airport-standard stand markings, safety zone paint | $5-10K | Navigation, geofence testing |
| Portable lighting array | LED flood lights, adjustable 0-500 lux | $3-5K | Night/lighting testing |
| Rain simulation system | Sprinkler array for controlled rain conditions | $5-10K | Wet weather testing |
| FOD objects | Standard FOD items (tools, debris, bags) at various sizes | $500-1K | FOD detection testing |

Total test track setup cost: $50-100K (excluding land and site preparation)

11.2 Ground Truth Instrumentation

To validate AV perception and localization, high-accuracy ground truth is essential.

| Instrument | Purpose | Accuracy | Cost |
|---|---|---|---|
| RTK-GPS base station + rover | Vehicle position ground truth | +/-0.02 m | $5-15K |
| Overhead tracking cameras (4K, 60fps) | Bird's-eye view, actor positions | +/-0.1 m at 10 m height | $10-20K (4-camera system) |
| Radar speed gun | Vehicle speed verification | +/-0.5 km/h | $500-2K |
| Laser distance meter (mounted) | Braking distance measurement | +/-0.001 m | $1-3K |
| Time-synchronized data logger | Correlate ground truth with AV data | <1 ms sync | $5-10K |
| Weather station | Wind, temperature, humidity, rain rate | Meteorological grade | $2-5K |
| Surface friction tester | Friction coefficient measurement | +/-0.01 mu | $3-8K (rental) |

Total instrumentation cost: $30-65K

11.3 Test Vehicle Configuration

The test vehicle must be configured for both safe testing and comprehensive data collection:

Required test configuration:

| Feature | Description | Purpose |
|---|---|---|
| Remote emergency stop | Wireless e-stop with 100+ m range, <200 ms latency | Safety during testing |
| Data logging | Full rosbag recording (all topics, all sensors) | Post-test analysis |
| Ground truth integration | RTK-GPS rover, sync to vehicle clock | Localization validation |
| Speed limiter (hardware) | Configurable maximum speed (5/10/15/25 km/h) | Enforce speed limits during test phases |
| Perception debug output | Real-time visualization of detections, planned path | Test engineer monitoring |
| Remote monitoring | Wireless video + telemetry to control station | Test oversight |
| Quick-release ballast | Simulate loaded/unloaded conditions | Load variation testing |
| Sensor health monitoring | Real-time status of all sensors | Detect sensor issues during test |

11.4 Test Campaign Cost Estimates

Per-airport certification test campaign:

| Activity | Duration | Personnel | Cost |
|---|---|---|---|
| Test track setup | 2-4 weeks | 2-3 engineers | $50-100K (one-time) |
| SiL testing (52,000+ scenarios) | 2-3 weeks (computation) | 1-2 engineers | $10-20K (compute + labor) |
| HIL testing (1,000 hours) | 4-6 weeks | 1-2 engineers | $15-30K |
| Physical test track (500 scenarios) | 2-3 weeks | 2-3 engineers + safety personnel | $25-50K |
| Shadow mode (50,000 km) | 3-12 months | 1 safety operator per vehicle | $80-200K |
| On-airport testing (10,000 km) | 2-6 months | 1-2 engineers + safety operator | $40-100K |
| Third-party assessment | 4-8 weeks | External auditor | $30-80K |
| Documentation and reporting | 2-4 weeks | 1-2 engineers | $10-20K |
| Total first airport | 12-24 months | | $260-600K |
| Total additional airport | 6-12 months | | $130-350K |

Cost reduction for subsequent airports:

  • Test track infrastructure is reusable (transport to new airport or build permanent facility)
  • SiL test suite is reusable (only airport-specific scenarios need creation)
  • HIL test rig is reusable
  • Shadow mode duration may be reduced if technology matures
  • Regulatory precedent may streamline approval

12. Key Findings Summary

#  | Finding | Implication
---|---------|------------
1  | RAND study shows real-world mileage alone is insufficient: ~11 billion miles needed to demonstrate 20% better-than-human performance at 95% confidence | Simulation is a mathematical necessity, not optional
2  | Zhao-Weng theorem sets sample sizes: 299,572 zero-failure tests needed for 99.999% reliability at 95% confidence | SiL must achieve 300K+ scenario runs for safety-critical claims (sketch below)
3  | Pairwise covering arrays reduce test cases by >4,800x while capturing 93% of parameter-interaction faults | Use the NIST ACTS tool for systematic scenario generation (sketch below)
4  | Importance sampling can reduce rare-event sample sizes by 10,000x compared to naive Monte Carlo | Bias the test distribution toward high-risk scenarios; correct with importance weights (sketch below)
5  | CMA-ES discovers critical scenarios that random testing misses: evolutionary search finds multi-actor failure modes 15-30% more effectively than random generation | Run adversarial scenario search campaigns monthly (sketch below)
6  | LLM-generated scenarios find 15-30% more failure modes than random scenario generation at equal computational cost | Use GPT-4/Claude to generate edge cases from safety requirements
7  | Metamorphic testing catches 5-10% of bugs that oracle-based testing misses: violations of monotonicity relations (slower should be safer) reveal logic errors | Add metamorphic relations to the regression suite (sketch below)
8  | Sim-to-real AP gap must be <5% for certification evidence: a larger gap invalidates simulation results | Invest $10-15K per airport in sensor model calibration
9  | Bayesian estimation with sim+field data produces tighter bounds than either alone: discount factor 0.1 for CARLA-quality simulation | Formally combine sim and field evidence in the safety case (sketch below)
10 | Shadow mode requires 50,000+ km before supervised autonomous operation: industry consensus (TractEasy, Waymo, Cruise precedents) | Plan 3-12 months of shadow mode per airport
11 | 100 golden scenarios gate every deployment: a must-pass set covering all hazard categories with zero tolerance for failure | Curate golden scenarios and never delete them
12 | Physical e-stop testing needs 30+ repetitions per configuration for statistical confidence on braking distance | Budget 2-3 weeks of test track time for e-stop alone
13 | First-airport certification costs $260-600K over 12-24 months; drops to $130-350K for additional airports | Front-load infrastructure investment for multi-airport scaling
14 | Multi-simulator strategy is optimal: Gazebo (CI/CD) + CARLA (SiL) + Isaac Sim (digital twin) + physical testing | No single simulator covers all V&V needs
15 | Aircraft proximity is the highest-consequence test category: $250K average damage, up to $139M+ structural, zero tolerance | Dedicate 15 golden scenarios and 1,000+ SiL runs to aircraft clearance
16 | Jet blast has no road-driving equivalent: requires custom airside test protocols with thermal sensor validation | Invest in a physics-based jet blast simulation model
17 | Digital twin construction costs $65-135K per airport but enables unlimited safe testing of dangerous scenarios | ROI is positive if it replaces even 10% of physical test hours
18 | Ground truth instrumentation costs $30-65K (RTK base, overhead cameras, weather station, friction tester) | Essential one-time investment for all physical testing
19 | MC/DC coverage required for ASIL-B safety-critical code (ISO 26262 Part 6) | Applies to airside_safety, airside_perception detection path, airside_control actuator commands
20 | 4-wise covering arrays capture 98% of parameter-interaction faults with ~1,500 test cases (vs. 390K full combinatorial) | Use 4-wise as the default for critical scenarios; pairwise for lower-risk
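
The zero-failure sample size in finding 2 follows from the standard success-run bound: if n consecutive tests all pass, per-run reliability R is demonstrated at one-sided confidence C whenever R^n <= 1 - C. A minimal check of the 299,572 figure (the function name is illustrative):

```python
import math

def zero_failure_sample_size(reliability: float, confidence: float) -> int:
    """Consecutive failure-free runs needed to demonstrate `reliability` per run
    at one-sided `confidence`, from R**n <= 1 - C  =>  n >= ln(1 - C) / ln(R)."""
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))

print(zero_failure_sample_size(0.99999, 0.95))  # 299572 -> the "300K+ SiL runs" target
```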
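
NIST ACTS is the recommended generator for the covering arrays in findings 3 and 20. The toy sketch below only illustrates what 2-wise (pairwise) coverage means: every value pair of every parameter pair appears in at least one test case. The parameter space is invented for illustration, and the greedy brute-force loop is not how ACTS works internally.

```python
from itertools import combinations, product

def pairwise_covering_array(parameters: dict) -> list:
    """Greedy 2-wise covering array over {parameter: [values]}.
    Not minimal, and brute-force over the full product, so only for small toy spaces."""
    names = list(parameters)
    uncovered = {((a, va), (b, vb))
                 for a, b in combinations(names, 2)
                 for va in parameters[a] for vb in parameters[b]}
    tests = []
    while uncovered:
        best, best_gain = None, -1
        for values in product(*(parameters[n] for n in names)):
            case = dict(zip(names, values))
            gain = sum(1 for (a, va), (b, vb) in uncovered
                       if case[a] == va and case[b] == vb)
            if gain > best_gain:
                best, best_gain = case, gain
        tests.append(best)
        uncovered = {((a, va), (b, vb)) for (a, va), (b, vb) in uncovered
                     if not (best[a] == va and best[b] == vb)}
    return tests

params = {"weather": ["clear", "rain", "fog"], "light": ["day", "night"],
          "actor": ["none", "pedestrian", "baggage_cart"], "speed": ["low", "high"]}
tests = pairwise_covering_array(params)
print(f"{len(tests)} pairwise tests vs {3 * 2 * 3 * 2} exhaustive combinations")
```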
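
Finding 4's importance-sampling idea, sketched on a one-dimensional cut-in-gap parameter. The `simulate_collision` oracle and both distributions are invented placeholders for a real SiL run; the load-bearing step is re-weighting each biased sample by the nominal-to-proposal density ratio so the estimate remains unbiased under the nominal traffic model.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def simulate_collision(gap_m: float) -> bool:
    """Placeholder for a full SiL run: 'collision' when the cut-in gap is very small."""
    return gap_m < 0.5

def exp_pdf(x, mean):
    return np.exp(-x / mean) / mean

nominal_mean = 20.0   # nominal traffic model: gaps ~ Exp(mean 20 m), failures infrequent
proposal_mean = 2.0   # biased proposal: gaps ~ Exp(mean 2 m), failures much more common

gaps = rng.exponential(proposal_mean, size=20_000)
weights = exp_pdf(gaps, nominal_mean) / exp_pdf(gaps, proposal_mean)
failures = np.array([simulate_collision(g) for g in gaps], dtype=float)

# Unbiased estimate of the failure probability under the *nominal* model.
print(f"estimated nominal failure probability: {np.mean(failures * weights):.3e}")
```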
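
A minimal adversarial-search loop in the spirit of finding 5, using Hansen's open-source `cma` package (reference 9 under Academic Papers). The scenario parameterisation and the `run_scenario` surrogate are hypothetical stand-ins; a real harness would launch a SiL run and return the observed minimum time-to-collision.

```python
import cma  # pip install cma

def run_scenario(params) -> float:
    """Hypothetical SiL wrapper: (cut-in gap [m], actor speed [m/s], friction [-])
    -> minimum time-to-collision observed in the run. Lower values are more critical."""
    gap, speed, friction = params
    return max(0.0, gap * friction / max(abs(speed), 0.1))  # toy surrogate only

x0 = [15.0, 8.0, 0.8]                   # nominal scenario as the starting point
es = cma.CMAEvolutionStrategy(x0, 2.0)  # sigma0 = 2.0 sets the initial search spread
while not es.stop():
    candidates = es.ask()
    es.tell(candidates, [run_scenario(c) for c in candidates])  # CMA-ES minimises
print(es.result.xbest)                  # most critical scenario parameters found
```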
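
One concrete metamorphic relation of the kind finding 7 refers to, written as a pytest-style test. `plan_stop_distance` is a hypothetical hook into the planner under test; the kinematic formula inside it is only a stand-in so the sketch runs on its own.

```python
import pytest

def plan_stop_distance(approach_speed_mps: float) -> float:
    """Hypothetical planner hook. A real harness would query the stack under test;
    the v^2 / (2a) stand-in (a = 3 m/s^2) just keeps this sketch self-contained."""
    return approach_speed_mps ** 2 / (2.0 * 3.0)

@pytest.mark.parametrize("fast,slow", [(8.0, 6.0), (6.0, 4.0), (4.0, 2.0)])
def test_stop_distance_monotone_in_speed(fast, slow):
    # Metamorphic relation: a slower approach must never require a longer stop.
    # No ground-truth oracle is needed, only the relation between two runs.
    assert plan_stop_distance(slow) <= plan_stop_distance(fast)
```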
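
One plausible formalisation of finding 9: a Beta-Binomial posterior in which simulated runs are discounted by a credibility weight before being pooled with field data. The run counts are illustrative and the 0.1 weight is the CARLA-quality discount quoted in the finding; this sketches the combination idea, not the document's prescribed model.

```python
from scipy import stats

k_field, n_field = 0, 2_000   # failures / runs observed in field trials
k_sim, n_sim = 3, 300_000     # failures / runs observed in simulation
discount = 0.1                # credibility weight applied to simulated evidence

# Beta(1, 1) prior; simulated counts enter the posterior scaled by the discount.
a_post = 1 + k_field + discount * k_sim
b_post = 1 + (n_field - k_field) + discount * (n_sim - k_sim)
posterior = stats.beta(a_post, b_post)

# One-sided 95% upper credible bound on the per-run failure probability.
print(f"P(failure per run) <= {posterior.ppf(0.95):.2e} at 95% credibility")
```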

13. References

Standards

  1. ISO 3691-4:2023 -- Industrial trucks -- Safety requirements and verification -- Part 4: Driverless industrial trucks and their systems
  2. ISO 34502:2022 -- Road vehicles -- Test scenarios for automated driving systems -- Scenario based safety evaluation framework
  3. ISO 21448:2022 (SOTIF) -- Road vehicles -- Safety of the intended functionality
  4. ISO 26262:2018 -- Road vehicles -- Functional safety (Parts 1-12)
  5. ISO 13849-1:2023 -- Safety of machinery -- Safety-related parts of control systems -- Part 1: General principles for design
  6. ISO 12100:2010 -- Safety of machinery -- General principles for design -- Risk assessment and risk reduction
  7. EU Machinery Regulation 2023/1230 -- Regulation of the European Parliament and of the Council on machinery (replacing Directive 2006/42/EC)
  8. UL 4600:2023 -- Standard for Safety for the Evaluation of Autonomous Products (3rd edition)
  9. ANSI/UL 3100 -- Standard for Safety for Autonomous Mobile Platforms
  10. ASAM OpenSCENARIO DSL (v2.0) -- Domain-Specific Language for driving scenario description

Academic Papers

  1. Kalra, N. & Paddock, S.M. (2016) -- "Driving to Safety: How Many Miles of Driving Would It Take to Demonstrate Autonomous Vehicle Reliability?" RAND Corporation. RR-1478-RC.
  2. Shalev-Shwartz, S., Shammah, S., & Shashua, A. (2017) -- "On a Formal Model of Safe and Scalable Self-driving Cars." Mobileye/Intel. arXiv:1708.06374.
  3. Kuhn, D.R., Wallace, D.R., & Gallo, A.M. (2004) -- "Software Fault Interactions and Implications for Software Testing." IEEE Transactions on Software Engineering, 30(6), 418-421.
  4. Tian, Y. et al. (2024) -- "LLM-Driven Scenario Generation for Autonomous Driving Testing." IEEE Intelligent Vehicles Symposium.
  5. Zhao, D. et al. (2017) -- "Accelerated Evaluation of Automated Vehicles Safety in Lane-Change Scenarios Based on Importance Sampling Techniques." IEEE TITS.
  6. Corso, A. et al. (2021) -- "A Survey of Algorithms for Black-Box Safety Validation of Cyber-Physical Systems." JAIR.
  7. Sun, Z., Guo, L., & Zhao, D. (2021) -- "Statistical Safety Assessment for Autonomous Vehicles." Springer.
  8. Fremont, D.J. et al. (2020) -- "Scenic: A Language for Scenario Specification and Data Generation." Machine Learning Journal.
  9. Hansen, N. (2016) -- "The CMA Evolution Strategy: A Tutorial." arXiv:1604.00772.
  10. Weng, B. et al. (2022) -- "Model-Based Safety Testing of Automated Driving Systems: A Systematic Literature Review." IEEE Access.

Industry Reports and Tools

  1. NIST ACTS -- Automated Combinatorial Testing for Software (covering array generation tool). https://csrc.nist.gov/projects/automated-combinatorial-testing-for-software
  2. IATA Ground Damage Database -- Annual reports on aircraft ground damage from GSE collisions
  3. Flight Safety Foundation GAP -- Ground Accident Prevention programme data
  4. Waymo Safety Report (2023-2024) -- Published methodology for AV safety validation
  5. TractEasy Safety Case -- Publicly referenced certification approach for airport baggage tractors (1-6 years per approval, >95% mission success)
  6. CARLA Simulator -- Open-source autonomous driving simulator. https://carla.org
  7. NVIDIA Isaac Sim -- Robotics simulation platform. https://developer.nvidia.com/isaac-sim
  8. Euro NCAP AEB Test Protocol -- Autonomous Emergency Braking test methodology (adapted for pedestrian mannequin testing)
  9. ISO 19206-2:2018 -- Test devices for target vehicles, vulnerable road users and other objects -- Part 2: Requirements for pedestrian targets

Related Documents

  1. airside-scenario-taxonomy.md -- 115 functional scenarios, 566 logical, ~5,400 concrete across 8 categories
  2. ../standards-certification/certification-guide.md -- ISO 3691-4, UL 4600, AMLAS, ISO 26262, CE marking, FAA approval process
  3. ../runtime-assurance/simplex-safety-architecture.md -- Dual-stack architecture, RSS for airside, OOD detection, shadow mode logging
  4. ../standards-certification/functional-safety-software.md -- MISRA C, ISO 26262 Part 6, static analysis pipeline
  5. ../../70-operations-domains/airside/operations/fod-and-jetblast.md -- FOD detection and jet blast hazard analysis
  6. ../safety-case/failure-modes-analysis.md -- Perception, world model, and planning failure taxonomy
  7. ../../30-autonomy-stack/simulation/simulators-for-airside.md -- CARLA, Isaac Sim, Gazebo evaluation for airside environments
  8. ../../30-autonomy-stack/simulation/airport-digital-twins.md -- Airport digital twin construction and validation
  9. ../../90-synthesis/decisions/design-spec.md -- 891-line Simplex architecture design specification

Public research notes, compiled from publicly available sources.