
ROS 2 Timing Diagnostics and Observability

Last updated: 2026-05-09

Why It Matters

Timing faults in SLAM and fusion rarely announce themselves as timing faults. They show up as pose instability, intermittent object jumps, TF extrapolation, or planning oscillation. The runtime needs first-class timing observability: topic age, period, transport delay, executor backlog, callback duration, TF lookup health, message-filter drops, and recorder health.

ROS 2 provides building blocks through /diagnostics, diagnostic_updater, Topic Statistics, QoS events, ros2_tracing, and CLI tools. Autoware adds topic state monitors and localization diagnostics. The deployment task is to wire these into one acceptance and operations contract.
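QoS deadline events, for example, surface a missed period from the middleware layer with no application code in the hot path. A minimal rclpy sketch, with an illustrative 150 ms budget and topic; note that SubscriptionEventCallbacks is imported from rclpy.qos_event on older distributions and rclpy.event_handler on newer ones:

```python
import rclpy
from rclpy.duration import Duration
from rclpy.node import Node
from rclpy.qos import QoSProfile
from rclpy.qos_event import SubscriptionEventCallbacks  # rclpy.event_handler on newer distros
from sensor_msgs.msg import PointCloud2


class DeadlineWatcher(Node):
    def __init__(self):
        super().__init__('deadline_watcher')
        # Require a message at least every 150 ms (illustrative budget).
        # The publisher must offer a deadline <= this, or the QoS will not match.
        qos = QoSProfile(depth=5, deadline=Duration(seconds=0.15))
        events = SubscriptionEventCallbacks(deadline=self._on_deadline_missed)
        self.create_subscription(PointCloud2, '/sensing/lidar/top/pointcloud_raw',
                                 lambda msg: None, qos, event_callbacks=events)

    def _on_deadline_missed(self, event):
        # event.total_count accumulates over the subscription's lifetime.
        self.get_logger().warn(f'deadline missed, total={event.total_count}')


if __name__ == '__main__':
    rclpy.init()
    rclpy.spin(DeadlineWatcher())
```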

Observability Contract

| Layer | Required signals | Tooling |
| --- | --- | --- |
| Application stamp | Header age, stamp regression, tuple skew, freshness gate result. | Node metrics, /diagnostics, fusion debug topics. |
| Middleware | Source timestamp, receive timestamp, sequence gaps, QoS deadline/liveliness. | rclcpp::MessageInfo, RMW metadata, QoS event callbacks. |
| Topic flow | Period, rate, age, missing topic, low frequency, timeout. | Topic Statistics, Autoware topic state monitor, ros2 topic hz/bw/info -v. |
| Executor | Callback queue delay, callback duration, timer jitter, dropped work. | ros2_tracing/LTTng, custom tracepoints. |
| TF | Transform age, lookup latency, extrapolation, missing frame, message-filter drops. | tf2 tools, diagnostics, message-filter metrics. |
| Recorder | Bag cache size, write latency, dropped messages, split/snapshot timing. | rosbag2 diagnostics and recorder service state. |
| Fleet | p50/p95/p99 age and latency by route, sensor, software version, and host. | Metrics pipeline and incident dashboards. |
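A minimal rclpy sketch of the subscription-side signals from the first rows, computing header age, inter-arrival period, and stamp regression per message (topic and message type are illustrative):

```python
import rclpy
from rclpy.node import Node
from rclpy.time import Time
from sensor_msgs.msg import PointCloud2


class TopicTimingMonitor(Node):
    def __init__(self):
        super().__init__('topic_timing_monitor')
        self._last_rx = None
        self._last_stamp = None
        self.create_subscription(PointCloud2, '/sensing/lidar/top/pointcloud_raw',
                                 self._on_msg, 10)

    def _on_msg(self, msg):
        now = self.get_clock().now()
        stamp = Time.from_msg(msg.header.stamp)

        # topic_age_ms: data age at receive time (assumes synchronized clocks).
        age_ms = (now - stamp).nanoseconds / 1e6

        # topic_period_ms: inter-arrival time seen by this subscription.
        if self._last_rx is not None:
            period_ms = (now - self._last_rx).nanoseconds / 1e6
            self.get_logger().debug(f'age={age_ms:.1f} ms period={period_ms:.1f} ms')
        self._last_rx = now

        # Stamp regression: a replay seek or a misbehaving driver.
        if self._last_stamp is not None and stamp < self._last_stamp:
            self.get_logger().warn('header stamp went backwards')
        self._last_stamp = stamp

# Spin with rclpy.init() and rclpy.spin() as usual; in production, export the
# values to /diagnostics or a metrics pipeline instead of the log.
```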

Minimum Metrics

| Metric | Unit | Scope | Alerting use |
| --- | --- | --- | --- |
| topic_period_ms{topic} | ms | Subscription side | Detect low or bursty rate. |
| topic_age_ms{topic} | ms | Subscription side | Detect stale data. |
| source_to_receive_ms{topic} | ms | Middleware | Detect network/RMW delay. |
| callback_duration_ms{node,callback} | ms | Executor | Detect compute overrun. |
| callback_queue_delay_ms{node,callback} | ms | Executor | Detect starvation/backpressure. |
| tf_lookup_fail_count{reason} | count | TF consumer | Detect missing/stale transforms. |
| message_filter_drop_count{reason} | count | Fusion synchronizer | Detect starvation and queue overflow. |
| clock_jump_count{node} | count | Time-sensitive nodes | Verify replay seek handling. |
| bag_write_latency_ms | ms | Recorder | Detect logging interference and evidence gaps. |
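Tuple skew and tuple age can be measured inside the synchronized callback. A sketch assuming the Python message_filters API, with illustrative topics; message_filter_drop_count would come from comparing per-input arrival counters against emitted tuples, omitted here:

```python
import message_filters
import rclpy
from rclpy.node import Node
from rclpy.time import Time
from sensor_msgs.msg import Image, PointCloud2


class FusionSyncMetrics(Node):
    def __init__(self):
        super().__init__('fusion_sync_metrics')
        cloud_sub = message_filters.Subscriber(self, PointCloud2, '/lidar/points')
        image_sub = message_filters.Subscriber(self, Image, '/camera/image')
        self._sync = message_filters.ApproximateTimeSynchronizer(
            [cloud_sub, image_sub], queue_size=10, slop=0.05)
        self._sync.registerCallback(self._on_tuple)

    def _on_tuple(self, cloud, image):
        t_cloud = Time.from_msg(cloud.header.stamp)
        t_image = Time.from_msg(image.header.stamp)

        # Tuple skew: stamp spread inside one matched set.
        skew_ms = (max(t_cloud, t_image) - min(t_cloud, t_image)).nanoseconds / 1e6

        # Tuple age: oldest member vs. now, i.e. the latency fusion actually sees.
        age_ms = (self.get_clock().now() - min(t_cloud, t_image)).nanoseconds / 1e6

        self.get_logger().debug(f'skew={skew_ms:.1f} ms age={age_ms:.1f} ms')
```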

Diagnostic Severity Mapping

| Severity | Timing condition | Vehicle behavior |
| --- | --- | --- |
| OK | All freshness, period, TF, and execution budgets are within the nominal envelope. | Normal autonomy. |
| WARN | p95 age/period or callback duration exceeds the warning threshold but output remains within the safety budget. | Continue with degraded confidence, log the event, increase monitoring. |
| ERROR | Output is stale, missing, or produced after a required deadline. | Block downstream trust, request fallback behavior, capture an incident clip. |
| STALE | The diagnostic publisher itself is late or missing. | Treat the component as unknown health; do not assume OK. |

Diagnostics should report both state and numbers: "NDT delayed" without the measured sensor_points_delay_time_sec, the input stamp, and the active threshold is not enough for incident triage.
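A sketch of such a state-plus-numbers task, assuming the Python port of diagnostic_updater; the thresholds, task name, and delay source are illustrative:

```python
import diagnostic_updater
import rclpy
from diagnostic_msgs.msg import DiagnosticStatus
from rclpy.node import Node


class NdtDelayDiagnostics(Node):
    WARN_SEC = 0.2   # illustrative budgets
    ERROR_SEC = 0.5

    def __init__(self):
        super().__init__('ndt_delay_diagnostics')
        self.sensor_points_delay_sec = 0.0  # updated by the scan callback (omitted)
        self._updater = diagnostic_updater.Updater(self)
        self._updater.setHardwareID('ndt_scan_matcher')
        self._updater.add('scan delay', self._check_scan_delay)

    def _check_scan_delay(self, stat):
        delay = self.sensor_points_delay_sec
        if delay > self.ERROR_SEC:
            stat.summary(DiagnosticStatus.ERROR, 'scan stale beyond deadline')
        elif delay > self.WARN_SEC:
            stat.summary(DiagnosticStatus.WARN, 'scan delayed')
        else:
            stat.summary(DiagnosticStatus.OK, 'scan fresh')
        # Attach the numbers triage needs, not just the state.
        stat.add('sensor_points_delay_time_sec', f'{delay:.3f}')
        stat.add('warn_threshold_sec', str(self.WARN_SEC))
        stat.add('error_threshold_sec', str(self.ERROR_SEC))
        return stat
```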

Tooling Pattern

Use fast CLI checks during development:

```bash
ros2 topic info -v /localization/kinematic_state   # endpoints and QoS details
ros2 topic hz /sensing/lidar/top/pointcloud_raw    # measured receive rate
ros2 topic bw /sensing/lidar/top/pointcloud_raw    # measured bandwidth
ros2 topic echo /diagnostics                       # aggregated component health
ros2 topic echo /statistics                        # Topic Statistics output (if enabled)
```

Use trace runs for executor and callback timing:

```bash
ros2 trace -s slam_fusion_timing   # start an LTTng session capturing ros2 tracepoints
```

Use Autoware monitors for operational topic health:

| Monitor | Detects |
| --- | --- |
| autoware_topic_state_monitor | Not received, low frequency, significantly low frequency, timeout. |
| NDT diagnostics | Scan delay, transform success, point count, matching score, execution time. |
| EKF diagnostics | Pose/twist delay, no update, gate rejection, state validity. |
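These monitors stop at topic and estimator level; TF consumers still have to export tf_lookup_fail_count{reason} themselves. A minimal sketch with tf2_ros, bucketing failures by exception type (frames and probe rate are illustrative):

```python
from collections import Counter

import rclpy
import tf2_ros
from rclpy.node import Node
from rclpy.time import Time


class TfHealth(Node):
    def __init__(self):
        super().__init__('tf_health')
        self._buffer = tf2_ros.Buffer()
        self._listener = tf2_ros.TransformListener(self._buffer, self)
        self.fail_count = Counter()
        self.create_timer(0.1, self._probe)

    def _probe(self):
        try:
            # A real consumer looks up at its input stamp; Time() (latest
            # available) is used here so the probe never blocks.
            self._buffer.lookup_transform('map', 'base_link', Time())
        except tf2_ros.LookupException:
            self.fail_count['missing_frame'] += 1
        except tf2_ros.ExtrapolationException:
            self.fail_count['extrapolation'] += 1
        except tf2_ros.ConnectivityException:
            self.fail_count['disconnected_tree'] += 1
        # Export fail_count to /diagnostics or a metrics pipeline periodically.
```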

Failure Modes

| Failure mode | Symptom | Control |
| --- | --- | --- |
| Only rate is monitored | Topic remains at 10 Hz but messages are 500 ms old. | Monitor both period and age. |
| Average hides tails | Mean callback time looks fine while p99 misses deadlines. | Export histograms or rolling p95/p99. |
| Diagnostics backlog | /diagnostics arrives late and reports an old OK state. | Monitor diagnostic message age and publisher liveliness. |
| CLI observer perturbs system | An extra reliable subscriber changes network or CPU load. | Use low-impact metrics paths and profile observer overhead. |
| Trace disabled in release | A timing fault cannot be root-caused after an incident. | Keep low-overhead tracepoints compiled in and enable capture on trigger. |
| Fleet dashboard loses time base | Cross-host latency charts are negative or impossible. | Include host clock offset and time-source metadata. |
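"Average hides tails" is the cheapest of these to fix: keep a bounded sample window per metric and export high quantiles beside the mean. A dependency-free sketch using the standard library:

```python
from collections import deque
from statistics import quantiles


class RollingTail:
    """Bounded window of latency samples exporting p50/p95/p99."""

    def __init__(self, window=1000):
        self._samples = deque(maxlen=window)

    def record(self, value_ms):
        self._samples.append(value_ms)

    def snapshot(self):
        if len(self._samples) < 2:
            return {}
        cuts = quantiles(self._samples, n=100)  # 99 cut points: p1..p99
        return {'p50_ms': cuts[49], 'p95_ms': cuts[94], 'p99_ms': cuts[98]}


# Feed it callback_duration_ms samples; publish snapshot() at 1 Hz.
tail = RollingTail()
for sample in (5.0, 6.0, 5.5, 40.0, 5.2):
    tail.record(sample)
print(tail.snapshot())
```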

Acceptance Checks

  • Every SLAM/fusion input has rate and age monitoring, not rate alone.
  • Every fusion synchronizer reports tuple skew, tuple age, drops, and queue depth.
  • Every localization output has a freshness gate before planning/control consumption (see the sketch after this list).
  • ros2_tracing or equivalent tracepoints can measure callback duration and executor delay in a representative run.
  • Diagnostic messages older than their own freshness budget are treated as stale health, not OK health.
  • Fault injection for delayed sensor, missing TF, executor stall, and recorder disk pressure produces distinct diagnostic signatures.
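A minimal sketch of the freshness gate from the third check, assuming nav_msgs/Odometry for the localization output and an illustrative 200 ms budget; absence (as opposed to staleness) still needs a separate watchdog timer:

```python
import rclpy
from nav_msgs.msg import Odometry
from rclpy.node import Node
from rclpy.time import Time


class FreshnessGate(Node):
    """Forwards localization output only while it is inside its age budget."""

    MAX_AGE_SEC = 0.2  # illustrative budget

    def __init__(self):
        super().__init__('freshness_gate')
        self._pub = self.create_publisher(Odometry, '/localization/gated_state', 10)
        self.create_subscription(Odometry, '/localization/kinematic_state',
                                 self._on_state, 10)

    def _on_state(self, msg):
        age_sec = (self.get_clock().now()
                   - Time.from_msg(msg.header.stamp)).nanoseconds / 1e9
        if age_sec > self.MAX_AGE_SEC:
            # Stale output: block downstream trust instead of forwarding.
            self.get_logger().error(f'kinematic_state stale: {age_sec:.3f} s')
            return
        self._pub.publish(msg)


if __name__ == '__main__':
    rclpy.init()
    rclpy.spin(FreshnessGate())
```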
