
SLAM Timing Health Dashboard

Last updated: 2026-05-09

Purpose

This dashboard specification defines fleet observability for SLAM, localization, map-building, and sensor-fusion timing health. It turns PTP/PHC status, sensor timestamps, ROS time, TF/message-filter behavior, latency, jitter, and localization integrity into operational panels and alerts. The dashboard must catch timing regressions during canary, route expansion, map publication, and incident response before they become silent wrong-pose or stale-obstacle failures.

This dashboard supports the following related specifications: the perception-SLAM fleet data contract; map hygiene operational monitoring; time-sync fault injection; timestamp shift sweep; sensor dropout, latency, and jitter stress; replay time semantics validation; and map-localization release gates for timing health.

Monitoring Goals

| Goal | Signal |
| --- | --- |
| Detect clock discipline degradation | PTP state, grandmaster ID, PHC offset, system-to-PHC offset, path delay |
| Detect sensor timestamp faults | Sensor timestamp source, stamp age, inter-sensor skew, fallback mode |
| Detect replay/data quality issues | /clock validity, MCAP/rosbag metadata, message order, replay determinism tags |
| Detect fusion timing loss | TF failures, message-filter drops, queue age, stale/future rejects |
| Detect latency and jitter risk | Source-to-output latency, inter-arrival jitter, callback age, executor delay |
| Detect localization integrity risk | Pose covariance/protection level, residuals, relocalization failures, alert-limit approach |
| Protect map publication | Source-session timing health, map tile quarantine, timing health tag coverage |
| Support incidents | Join vehicle, route, map, calibration, build, bag/MCAP, timing telemetry, and operator response |

Required Setup

| Item | Requirement |
| --- | --- |
| Telemetry producers | Vehicle clock service, sensor drivers, ROS nodes, localization, map runtime, recorder, and fleet uploader |
| Schema registry | Versioned custom fields for timing, fusion, localization, map timing, and evidence IDs |
| Threshold inputs | Timestamp sweep, timing fault injection, latency/jitter stress, and SLAM integrity release thresholds |
| Alert routing | Vehicle response, fleet SRE, mapping, maintenance, release manager, and safety escalation owners |
| Incident joins | Vehicle ID, route, map, calibration, build, bag/MCAP, event ID, and operator action correlation |

Telemetry Schema

| Field | Type | Notes |
| --- | --- | --- |
| time.ptp.state | enum | listening, slave, master, fault, uncalibrated, holdover |
| time.ptp.grandmaster_id | string | Required for failover and site clock audits |
| time.ptp.offset_from_master_ns | int64 | From ptp4l or equivalent |
| time.ptp.mean_path_delay_ns | int64 | Track path-delay spikes and asymmetry proxies |
| time.phc.system_offset_ns | int64 | Host system clock to PHC offset from phc2sys |
| time.phc.frequency_ppb | double | Clock servo correction trend |
| sensor.<name>.timestamp_source | enum | ptp, gnss, pps, host_receive, internal, unknown |
| sensor.<name>.stamp_age_ms | double | Host receive or processing time minus message stamp |
| sensor.<name>.inter_arrival_jitter_ms | double | Rolling p95/p99 per topic |
| sensor.<name>.dropout_rate | double | Missing messages over expected rate |
| fusion.tf.lookup_failures | int | By frame pair and reason |
| fusion.filter.drop_count | int | By filter instance and drop reason |
| fusion.filter.queue_age_ms | double | Oldest queued message age |
| localization.pose.covariance_xy | double array | Or derived error ellipse |
| localization.integrity.protection_level_m | double | If implemented by stack |
| localization.residual.scan_match | double | NDT/ICP/factor residual or equivalent |
| localization.relocalization_events | int | Include map/tile/frame context |
| map.source_timing_health | enum | green, yellow, red, unknown |
| map.tile_timing_quarantine | bool | True if timing evidence blocks use/publication |
| replay.time_policy | enum | sim_time, wall_time, mixed, invalid |
| evidence.mcap_id | string | Immutable bag/MCAP reference |
| release.id | string | Release/canary cohort correlation |

Use OpenTelemetry semantic conventions where standard fields fit, and publish a versioned schema for custom robotics fields so dashboards can evolve without silently changing meaning.
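As one way to make the versioned schema concrete, the sketch below models the clock-discipline fields as a typed record and flattens them to the dotted field names listed in the schema. The `SCHEMA_VERSION` value, the `schema.version` key, and the `ClockSample` and `parse_ptp_state` names are illustrative assumptions, not part of this specification.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Union

# Hypothetical version tag; the spec only requires that custom fields be versioned.
SCHEMA_VERSION = "1.0.0"


class PtpState(str, Enum):
    """Allowed values of time.ptp.state, as listed in the telemetry schema."""
    LISTENING = "listening"
    SLAVE = "slave"
    MASTER = "master"
    FAULT = "fault"
    UNCALIBRATED = "uncalibrated"
    HOLDOVER = "holdover"


@dataclass(frozen=True)
class ClockSample:
    """One clock-discipline sample (illustrative container, not a spec type)."""
    ptp_state: PtpState
    grandmaster_id: str
    offset_from_master_ns: int
    mean_path_delay_ns: int
    phc_system_offset_ns: int
    phc_frequency_ppb: float

    def to_fields(self) -> Dict[str, Union[str, int, float]]:
        """Flatten to the dotted field names used by the dashboard."""
        return {
            "schema.version": SCHEMA_VERSION,  # assumed key name
            "time.ptp.state": self.ptp_state.value,
            "time.ptp.grandmaster_id": self.grandmaster_id,
            "time.ptp.offset_from_master_ns": self.offset_from_master_ns,
            "time.ptp.mean_path_delay_ns": self.mean_path_delay_ns,
            "time.phc.system_offset_ns": self.phc_system_offset_ns,
            "time.phc.frequency_ppb": self.phc_frequency_ppb,
        }


def parse_ptp_state(raw: str) -> PtpState:
    """Reject strings outside the published enum rather than passing them through."""
    return PtpState(raw)  # raises ValueError for unknown values
```

Carrying a version key on every record lets a dashboard query refuse, or separately bucket, records whose field meanings it was not built for.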

Metrics

| Metric | Definition | Operational use |
| --- | --- | --- |
| Clock offset band | PTP/PHC/system offset classified green, yellow, red, or unknown | Drives timing health state |
| Sensor stamp age | Processing or receive time minus source stamp by sensor | Detects stale/future data |
| Inter-sensor skew | Pairwise skew for synchronized physical events or frame groups | Detects fusion timing risk |
| Topic freshness | Rate, dropout, inter-arrival jitter, and last-message age | Detects sensor or middleware degradation |
| TF/filter health | Lookup failures, extrapolation direction, sync misses, and queue age | Detects fusion fail-closed behavior |
| Localization integrity margin | Protection level or covariance margin to alert limit | Detects wrong-pose risk |
| Runtime latency tail | Source-to-output p95, p99, p99.9, burst maximum | Detects tail latency that can consume stale data |
| Map timing coverage | Fraction of source sessions and active tiles with green timing provenance | Controls map publication and quarantine |
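Several of these metrics reduce to simple arithmetic over timestamps. The following is a minimal sketch under stated assumptions: jitter is taken as the absolute deviation of each inter-arrival gap from the expected period, percentiles use the nearest-rank method, and the yellow and red band limits are caller-supplied because this document does not fix numeric thresholds. All function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def stamp_age_ms(receive_time_ns: int, stamp_ns: int) -> float:
    """Receive (or processing) time minus message stamp; negative means a future stamp."""
    return (receive_time_ns - stamp_ns) / 1e6


def inter_arrival_jitter_ms(arrival_times_ns: Sequence[int],
                            expected_period_ms: float) -> List[float]:
    """Absolute deviation of each inter-arrival gap from the expected period (assumed definition)."""
    gaps_ms = [(b - a) / 1e6 for a, b in zip(arrival_times_ns, arrival_times_ns[1:])]
    return [abs(gap - expected_period_ms) for gap in gaps_ms]


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]; one of several common definitions."""
    if not values:
        raise ValueError("percentile of an empty window is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def dropout_rate(received_count: int, expected_rate_hz: float, window_s: float) -> float:
    """Missing messages over expected messages in the window, clamped at zero."""
    expected = expected_rate_hz * window_s
    if expected <= 0:
        raise ValueError("expected message count must be positive")
    return max(0.0, 1.0 - received_count / expected)


def classify_band(offset_ns: Optional[int], yellow_ns: int, red_ns: int) -> str:
    """Map an offset magnitude to green/yellow/red, or unknown when no sample exists."""
    if offset_ns is None:
        return "unknown"
    magnitude = abs(offset_ns)
    if magnitude >= red_ns:
        return "red"
    if magnitude >= yellow_ns:
        return "yellow"
    return "green"
```

Keeping the sign on stamp age matters: a negative value indicates a future-stamped message, which the stale/future reject logic needs to distinguish from ordinary staleness.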

Dashboard Views

| View | Panels |
| --- | --- |
| Fleet timing overview | Vehicles by timing state, PTP lock, offset bands, active alerts, canary cohort |
| Clock discipline | Grandmaster identity, PTP state dwell, PHC/system offset, path delay, frequency correction |
| Sensor timestamp health | Timestamp source, stamp age, skew matrix, dropout, jitter, fallback events |
| Fusion timing | TF failures, message-filter drops, sync match rate, queue age, stale/future rejects |
| Localization integrity | Residuals, covariance/protection level, relocalization, unsafe-with-alert, unsafe-without-alert |
| Latency budget | Source-to-output latency p50/p95/p99/p99.9, executor delay, CPU/GPU/logging load |
| Map timing provenance | Source-session timing health by tile, quarantine state, map bundle/canary rollout |
| Replay evidence | Bags/MCAPs by time policy validity, /clock anomalies, conversion warnings |
| Incident drilldown | Vehicle, route, time window, bag/MCAP, map/calibration/build IDs, timing trace, safety action |

Alerts

| Alert | Trigger pattern | Severity | Required action |
| --- | --- | --- | --- |
| PTP unlock | PTP state not locked for longer than approved holdover | P1 or P0 if moving in autonomous mode | Degrade or controlled stop depending on route risk |
| Grandmaster change | Unexpected GM identity or offset step on failover | P1 | Compare against allowlist and watch localization residuals |
| PHC drift | System-to-PHC offset or frequency correction exceeds yellow/red band | P1/P0 | Route hold if persistent; maintenance ticket |
| Sensor timestamp fallback | Any safety-relevant sensor switches to host/internal/unknown time | P0 | Controlled stop or remove modality; block map-building use |
| Stamp age high | Message age exceeds validated stale-data threshold | P1/P0 | Reject stale data, degrade, or stop |
| Inter-sensor skew high | Skew exceeds timestamp sweep green/yellow/red threshold | P1/P0 | Degrade fusion; open timing fault event |
| TF failure spike | Past/future extrapolation or missing transform exceeds threshold | P1 | Investigate frame/time source; pause release promotion |
| Filter drop spike | Sync or transform filter drop rate exceeds validation envelope | P1 | Inspect sensor skew, queue size, and topic rate |
| Localization integrity alert | Protection level, residual, or covariance crosses red threshold | P0 | Controlled stop/remote assist; preserve incident evidence |
| Timing-red map source | Map tile built from timing-red/unknown source session | P0 for map ops | Quarantine tile and block publication |
| Replay invalid evidence | Release run uses mixed/invalid replay time policy | P0 for release | Invalidate release metric until replay is corrected |
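Two of these alerts, PTP unlock and sensor timestamp fallback, can be sketched as small pure functions over telemetry samples. This is an illustration only: which PTP states count as "locked" is an assumption here (`slave` alone), the holdover duration must come from the approved budget rather than from this code, and the `Alert` type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Mapping, Optional, Set, Tuple

# Assumption: only "slave" counts as locked for a vehicle clock; adjust per deployment.
LOCKED_STATES = frozenset({"slave"})
# Timestamp sources the alert table treats as fallback for safety-relevant sensors.
FALLBACK_SOURCES = frozenset({"host_receive", "internal", "unknown"})


@dataclass(frozen=True)
class Alert:
    name: str
    severity: str
    detail: str


def ptp_unlock_alert(samples: Iterable[Tuple[float, str]],
                     approved_holdover_s: float,
                     autonomous_moving: bool) -> Optional[Alert]:
    """Fire when the latest unlocked run outlasts the approved holdover.

    samples are (timestamp_s, time.ptp.state) pairs in time order.
    """
    unlocked_since = None
    last_t = None
    for t, state in samples:
        if state in LOCKED_STATES:
            unlocked_since = None      # lock regained, reset the dwell timer
        elif unlocked_since is None:
            unlocked_since = t         # first unlocked sample of this run
        last_t = t
    if unlocked_since is None or last_t - unlocked_since <= approved_holdover_s:
        return None
    severity = "P0" if autonomous_moving else "P1"
    return Alert("PTP unlock", severity,
                 f"unlocked for {last_t - unlocked_since:.1f} s")


def timestamp_fallback_alerts(sources: Mapping[str, str],
                              safety_relevant: Set[str]) -> List[Alert]:
    """Flag safety-relevant sensors whose current timestamp source is a fallback.

    This checks current state only; detecting the switch itself needs prior state.
    """
    return [Alert("Sensor timestamp fallback", "P0", f"{name} on {source}")
            for name, source in sorted(sources.items())
            if name in safety_relevant and source in FALLBACK_SOURCES]
```

Measuring dwell from the first unlocked sample of the current run, rather than counting unlocked samples, keeps the rule independent of the telemetry publish rate.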

Operational Runbook

| State | Vehicle operation | Fleet/map operation |
| --- | --- | --- |
| Green | Continue mission and canary promotion if other gates pass | Use logs for release and map-building if data contract is complete |
| Yellow | Continue only inside approved ODD; reduce speed if configured | Watch canary, prevent automatic map publication, create reliability ticket |
| Red | Controlled stop, remote-assist, or route hold according to safety budget | Quarantine logs/maps, open incident, block release promotion |
| Unknown | Treat as yellow for availability, red for release evidence | Exclude from release claims and map publication until telemetry is repaired |
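The Unknown row is the one most easily mishandled in code, because its meaning depends on who is asking. A minimal sketch of that resolution follows. The purpose strings `availability` and `release_evidence`, and the worst-of aggregation across several signals, are assumptions made for illustration; the runbook states only how Unknown is treated.

```python
from enum import Enum
from typing import Iterable


class TimingState(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"
    UNKNOWN = "unknown"


# Severity ordering, used only by the worst-of aggregation below (an assumption).
_RANK = {TimingState.GREEN: 0, TimingState.YELLOW: 1, TimingState.RED: 2}

# Vehicle-side actions copied from the runbook.
VEHICLE_ACTION = {
    TimingState.GREEN: "Continue mission and canary promotion if other gates pass",
    TimingState.YELLOW: "Continue only inside approved ODD; reduce speed if configured",
    TimingState.RED: "Controlled stop, remote-assist, or route hold according to safety budget",
}


def effective_state(state: TimingState, purpose: str) -> TimingState:
    """Resolve UNKNOWN per the runbook: yellow for availability, red for release evidence."""
    if purpose not in ("availability", "release_evidence"):
        raise ValueError(f"unsupported purpose: {purpose!r}")
    if state is TimingState.UNKNOWN:
        return TimingState.YELLOW if purpose == "availability" else TimingState.RED
    return state


def worst_state(states: Iterable[TimingState], purpose: str) -> TimingState:
    """Combine several signal states by taking the most severe one.

    An empty input is treated as UNKNOWN, since no telemetry means no evidence.
    """
    resolved = [effective_state(s, purpose) for s in states]
    if not resolved:
        return effective_state(TimingState.UNKNOWN, purpose)
    return max(resolved, key=_RANK.__getitem__)


def vehicle_action(states: Iterable[TimingState]) -> str:
    """Look up the vehicle-side runbook action for the combined availability state."""
    return VEHICLE_ACTION[worst_state(states, "availability")]
```

Resolving Unknown before aggregation means a vehicle with one missing signal keeps operating at yellow, while the same gap still blocks release claims.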

Pass and Fail Rules

| Rule | Pass condition | Fail condition |
| --- | --- | --- |
| Dashboard coverage | Every release/canary vehicle publishes required timing, fusion, localization, and map fields | Missing required field for active release cohort |
| Alert alignment | Green/yellow/red thresholds match validated timestamp sweep and integrity gates | Dashboard threshold differs from release evidence without approval |
| Incident join | Timing telemetry joins to vehicle, route, map, calibration, build, bag/MCAP, and operator event | Incident cannot reconstruct timing state |
| Map quarantine | Timing-red or unknown source sessions are blocked from automatic map publication | Timing-red data reaches active map without review |
| Alert latency | Dashboard alert and on-vehicle action are visible within operational SLA | Alert arrives too late for operator response |
| False alarm management | Yellow alerts have review queue and suppression rules with owner approval | Alert fatigue causes unacknowledged timing regressions |
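The dashboard coverage and map quarantine rules lend themselves to automated checks. The sketch below is one possible shape, assuming telemetry arrives as flat dictionaries keyed by the schema field names. The `REQUIRED_FIELDS` subset is hypothetical (the per-sensor `sensor.<name>.*` fields are omitted because they are templated), and treating a tile with no recorded source sessions as unknown is an assumption.

```python
from typing import Dict, Iterable, List, Mapping

# Hypothetical required subset; the authoritative list belongs in the schema registry.
REQUIRED_FIELDS = (
    "time.ptp.state",
    "time.ptp.offset_from_master_ns",
    "time.phc.system_offset_ns",
    "fusion.tf.lookup_failures",
    "fusion.filter.drop_count",
    "localization.residual.scan_match",
    "map.source_timing_health",
    "release.id",
)


def missing_required_fields(record: Mapping[str, object]) -> List[str]:
    """Return required fields that are absent or null in one vehicle's latest record."""
    return [name for name in REQUIRED_FIELDS if record.get(name) is None]


def coverage_failures(cohort: Mapping[str, Mapping[str, object]]) -> Dict[str, List[str]]:
    """Map vehicle ID to its missing fields; an empty result means the rule passes."""
    failures = {}
    for vehicle_id, record in cohort.items():
        missing = missing_required_fields(record)
        if missing:
            failures[vehicle_id] = missing
    return failures


def tiles_to_quarantine(tile_sources: Mapping[str, Iterable[str]]) -> List[str]:
    """Return tiles with any red or unknown source session.

    Tiles with no recorded source sessions are treated as unknown (assumption).
    """
    blocked = []
    for tile_id, healths in sorted(tile_sources.items()):
        healths = list(healths)
        if not healths or any(h in ("red", "unknown") for h in healths):
            blocked.append(tile_id)
    return blocked


def map_timing_coverage(tile_sources: Mapping[str, Iterable[str]]) -> float:
    """Fraction of tiles whose source sessions are all green; 0.0 when there are no tiles."""
    if not tile_sources:
        return 0.0
    green = sum(1 for healths in tile_sources.values()
                if (hs := list(healths)) and all(h == "green" for h in hs))
    return green / len(tile_sources)
```

Note that a tile with only yellow sessions is neither quarantined nor counted as green coverage here, which matches the runbook's yellow handling of preventing automatic publication without forcing quarantine.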

Evidence and Retention

| Evidence | Retention rule |
| --- | --- |
| P0 timing/localization event | Preserve raw bag/MCAP, clock telemetry, diagnostics, map/build/calibration IDs, operator notes |
| Canary timing alert | Preserve dashboard trace and event window through release review period |
| Map timing quarantine | Preserve source-session logs, tile IDs, review decision, and release disposition |
| Replay invalidation | Preserve invalid replay manifest and corrected rerun for audit trail |
| Maintenance timing ticket | Preserve sensor serial, firmware, cabling/NIC/clock changes, and post-repair timing check |
