Data Catalog, Lineage, and Quality Operations

Last updated: 2026-05-09

Why It Matters

Fleet data becomes useful only when engineers can answer three questions quickly: what does this dataset contain, where did it come from, and is it fit for the model or safety decision being made? A catalog without lineage is a search index. Lineage without quality checks is an audit trail for bad data. Quality checks without ownership decay into dashboards nobody trusts.

This page covers operational controls for curated fleet data products: raw logs, processed events, labels, features, replay sets, training splits, and evaluation datasets.

Operating Model

  1. Define data products with named owners: raw bag archive, normalized telemetry, object labels, scenario clips, model training tables, and evaluation tables.
  2. Store large analytical datasets in snapshot-capable tables. Use Apache Iceberg snapshots, schema evolution, partition evolution, and retention policies to preserve reproducibility without freezing all storage forever.
  3. Emit lineage events from each pipeline step. OpenLineage concepts of runs, jobs, datasets, and facets map cleanly to bag extraction, decoding, label import, feature generation, and training-set assembly.
  4. Attach quality rules to the catalog entry, not only to the pipeline code. Rules should cover completeness, timestamp monotonicity, frame drops, calibration presence, label validity, class balance, schema compatibility, and privacy filters.
  5. Promote data by state: raw, decoded, validated, curated, approved_for_training, approved_for_safety_evidence, deprecated.
  6. Review quality exceptions weekly with data owners; during model-release windows, review release blockers daily.
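
Steps 1 and 5 above can be sketched as a small promotion state machine. The state names come from this page; the transition rules, field names, and the specific quality gate are illustrative assumptions, not a prescribed implementation.

```python
from dataclasses import dataclass

# Promotion states from the operating model, in order. "deprecated" is an
# out-of-band terminal state reachable from any other state, so it is not
# part of the linear sequence here.
STATES = [
    "raw", "decoded", "validated", "curated",
    "approved_for_training", "approved_for_safety_evidence",
]

@dataclass
class DataProduct:
    name: str
    owner: str                    # promotion is blocked without a named owner
    state: str = "raw"
    quality_passed: bool = False  # set by the pipeline quality gate

    def promote(self) -> str:
        """Advance one state; refuse if ownership or quality gates fail."""
        if not self.owner:
            raise ValueError(f"{self.name}: no owner, promotion blocked")
        idx = STATES.index(self.state)
        if idx + 1 >= len(STATES):
            raise ValueError(f"{self.name}: already at terminal state")
        # Hypothetical rule: quality checks must pass before leaving 'decoded'.
        if self.state == "decoded" and not self.quality_passed:
            raise ValueError(f"{self.name}: quality gate failed")
        self.state = STATES[idx + 1]
        return self.state
```

The point of the sketch is that promotion is an enforced transition, not a label edit: an ownerless product or a failed quality gate stops the state change rather than producing a dataset that merely looks promoted.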

Evidence Artifacts

| Artifact | Minimum contents | Owner |
| --- | --- | --- |
| Catalog entry | Dataset purpose, schema, ODD scope, owner, retention, access class | Data platform |
| Lineage graph | Source datasets, pipeline run IDs, code version, parameters, outputs | Data platform |
| Iceberg snapshot record | Table snapshot ID, schema ID, partition spec ID, branch/tag if used | Data engineer |
| Quality report | Rule results, sample counts, failure rows, waived failures, trend | Data quality owner |
| Data contract | Required fields, units, coordinate frames, timing assumptions, valid ranges | Producer and consumer |
| Label-schema record | Taxonomy, label versions, ontology references, compatibility notes | Label operations |
| Approval decision | Accepted use, restrictions, expiry, approvers, downstream consumers | Data steward |
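
A training-set manifest that ties several of these artifacts together might look like the sketch below. All field names and values are illustrative assumptions modeled on the table above, not a published schema; the one hard requirement it encodes is that every source pins an immutable snapshot ID.

```python
# Hypothetical manifest for a derived training table. Source tables resolve
# to Iceberg snapshot IDs, and the producing pipeline run is recorded so the
# lineage graph can reference it.
manifest = {
    "dataset": "model_training_tables/obstacle_detection",
    "owner": "data-platform",
    "state": "approved_for_training",
    "sources": [
        {"table": "normalized_telemetry", "snapshot_id": 4872113, "schema_id": 7},
        {"table": "object_labels", "snapshot_id": 9921404, "schema_id": 3},
    ],
    "pipeline": {
        "job": "training_set_assembly",
        "run_id": "example-run-id",      # illustrative placeholder
        "code_version": "abc1234",       # illustrative placeholder
    },
}

def resolvable(m: dict) -> bool:
    """Reproducible only if every source pins an immutable snapshot ID."""
    return all("snapshot_id" in s for s in m["sources"])
```

A gate like `resolvable` is what turns "require snapshot IDs in manifests" from a convention into a check a pipeline can enforce before promotion.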

Acceptance Checks

  • Every training and evaluation dataset resolves to immutable source snapshots.
  • Every derived dataset has machine-readable lineage back to raw logs, labels, and processing code.
  • Quality checks run before promotion and store both pass/fail status and failure samples.
  • Schema changes are reviewed for downstream model, feature, replay, and safety evidence impact.
  • Catalog entries identify the data owner, business purpose, access restrictions, retention class, and approved uses.
  • Data used in release evidence is marked approved_for_safety_evidence, not only approved_for_training.
  • Waivers have an owner, expiry date, scope, and measurable containment rule.
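
The final check above is mechanically enforceable. A minimal sketch, assuming a waiver is a record with the four required fields (the field names are illustrative, not a defined schema):

```python
from datetime import date

# Every waived quality failure must carry an owner, an expiry date, a scope,
# and a measurable containment rule. Field names are assumptions.
REQUIRED = ("owner", "expiry", "scope", "containment_rule")

def waiver_valid(waiver: dict, today: date) -> bool:
    """A waiver is valid only if complete and not yet expired."""
    if any(not waiver.get(k) for k in REQUIRED):
        return False
    return waiver["expiry"] >= today  # expired waivers no longer apply

example = {
    "owner": "label-operations",
    "expiry": date(2026, 6, 1),
    "scope": "night clips, camera_2 only",
    "containment_rule": "affected rows < 0.5% of the evaluation split",
}
```

Checking expiry at evaluation time, rather than at waiver creation, is what keeps stale exceptions from silently becoming permanent.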

Failure Modes

| Failure mode | Consequence | Control |
| --- | --- | --- |
| Dataset name reused for mutable contents | Model release cannot be reproduced | Require snapshot IDs in manifests |
| Pipeline lineage stops at a staging table | Root-cause analysis cannot trace bad labels or corrupted logs | Emit lineage at every materialization boundary |
| Quality checks live only in notebooks | Failures are not enforced in production | Move checks into scheduled pipeline gates |
| Schema evolution breaks consumers | Training jobs silently drop or misread fields | Data contract review before schema promotion |
| Catalog has owner gaps | Exceptions are never resolved | Block promotion for ownerless data products |
| Quality rules ignore ODD slices | Dataset passes globally but misses airport-specific defects | Require zone, weather, lighting, sensor, and vehicle slices |
| Retention deletes evidence inputs | Safety case cannot be reconstructed | Lock release-evidence snapshots under retention hold |
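
The "quality rules ignore ODD slices" failure mode deserves a concrete illustration: a rule can pass on the full dataset while failing badly on one slice. The sketch below evaluates a frame-drop rule per slice; the slice keys come from the table above, while the rule, threshold, and row fields are illustrative assumptions.

```python
# A rule must pass on every required ODD slice, not just globally.
REQUIRED_SLICES = ["zone", "weather", "lighting", "sensor", "vehicle"]

def frame_drop_rate(rows):
    dropped = sum(r["frames_dropped"] for r in rows)
    total = sum(r["frames_total"] for r in rows)
    return dropped / total if total else 1.0

def slice_gate(rows, threshold=0.01):
    """Return failing (slice_key, slice_value) pairs; empty means pass."""
    failures = []
    for key in REQUIRED_SLICES:
        for value in {r[key] for r in rows}:
            subset = [r for r in rows if r[key] == value]
            if frame_drop_rate(subset) > threshold:
                failures.append((key, value))
    return sorted(failures)

# Illustrative rows: the global drop rate passes, but the taxiway/night
# slice conceals a concentrated defect.
rows = [
    {"zone": "apron", "weather": "clear", "lighting": "day",
     "sensor": "cam0", "vehicle": "v1",
     "frames_dropped": 0, "frames_total": 10000},
    {"zone": "taxiway", "weather": "clear", "lighting": "night",
     "sensor": "cam0", "vehicle": "v1",
     "frames_dropped": 30, "frames_total": 1000},
]
```

Here the global rate is about 0.27%, comfortably under a 1% threshold, yet the taxiway and night slices each fail at 3%. That is exactly the defect a global-only rule waves through.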
Related Pages

  • 50-cloud-fleet/data-platform/fleet-data-pipeline.md
  • 50-cloud-fleet/data-platform/perception-slam-fleet-data-contract.md
  • 50-cloud-fleet/data-platform/data-engine-from-bags.md
  • 50-cloud-fleet/mlops/data-flywheel-airside.md
  • 50-cloud-fleet/data-governance/fleet-data-privacy-governance.md
  • 60-safety-validation/safety-case/safety-case-evidence-traceability.md
  • 60-safety-validation/verification-validation/perception-slam-statistical-validity-protocol.md

Sources

Compiled from research notes drawn from public sources.