Evaluation, Calibration, and Data Leakage: First Principles

Visual: evaluation split firewall showing train/validation/test/calibration partitions, leakage paths, reliability diagram, and uncertainty interval.

Scope

ML evaluation turns model behavior into evidence. Calibration decides whether a reported confidence means what it says. Data leakage decides whether the evidence is real or inflated. This note explains first-principles metrics, calibration math, leakage modes, benchmark contamination, and AV-specific evaluation practices.

1. The First-Principles Rule

An evaluation is valid only relative to a claim:

text
claim = model + task + operating domain + metric + protocol + uncertainty

Examples:

text
"Improves cone detection on night airside logs by 5 AP"
"Reduces future occupancy false negatives in rain"
"Keeps p99 latency below 50 ms on Orin"
"Calibrated 95% risk score covers 95% of realized hazards"

A metric without a claim is just a number.

2. Dataset Splits Are Part of the Model Contract

Random frame splits are often invalid for AV data because adjacent frames are near duplicates. The split should match the generalization claim.

| Claim | Split should hold out |
| --- | --- |
| same route, new time | date/time blocks |
| new route | route IDs |
| new airport/site | site IDs |
| new weather | weather condition |
| new sensor rig | vehicle or calibration ID |
| new map version | map version |
| rare event robustness | event families |
| future prediction | future clips and labels |

The evaluation protocol should store:

text
dataset version
split IDs
label version
map version
sensor calibration IDs
time sync assumptions
preprocessing commit
model checkpoint
thresholds and temperatures

Without this provenance, a reported metric is hard to audit.

3. Metrics Are Estimators

A metric estimates some underlying operational quantity.

For binary detection:

text
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
F1        = 2 * precision * recall / (precision + recall)

For probabilistic prediction:

text
negative_log_likelihood = -mean_i log p_theta(y_i | x_i)
brier_score = mean_i sum_k (p_ik - 1[y_i = k])^2

For calibration:

text
confidence = max_k p_theta(k | x)
accuracy   = 1[predicted_label = true_label]

The metric is meaningful only if the matching rules, thresholds, labels, and aggregation match the decision being made.

For AV perception, average precision can hide exactly the failure that matters: a rare small obstacle, a night pedestrian, an aircraft edge, or a wet-pavement reflection.
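
A minimal sketch of these estimators in NumPy, assuming per-example probability arrays and integer class labels (all names are illustrative):

python
import numpy as np

def detection_metrics(tp: int, fp: int, fn: int):
    # Point estimates from confusion counts; degenerate cases return 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

def nll_and_brier(probs: np.ndarray, labels: np.ndarray):
    # probs: (n, k) predicted class probabilities; labels: (n,) true classes.
    n, k = probs.shape
    nll = -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))
    onehot = np.eye(k)[labels]
    brier = np.mean(np.sum((probs - onehot) ** 2, axis=1))
    return nll, brier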

4. Calibration

A model is calibrated if events assigned confidence p happen with frequency p.

For classification:

text
P(correct | confidence = 0.8) = 0.8

This is different from accuracy. A model can be accurate but overconfident, or less accurate but better calibrated.

Reliability bins approximate calibration:

text
bin B_m = examples with confidence in interval m
acc(B_m)  = mean correctness in bin
conf(B_m) = mean confidence in bin

Expected calibration error:

text
ECE = sum_m (|B_m| / n) * |acc(B_m) - conf(B_m)|

ECE is useful but imperfect. It depends on binning and can hide class-specific or scenario-specific miscalibration.
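
A binned estimator for ECE under the definitions above; a sketch, with the bin count as a free choice that affects the estimate:

python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    # probs: (n, k) class probabilities; labels: (n,) true classes.
    confidence = probs.max(axis=1)
    correct = probs.argmax(axis=1) == labels
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            # Weighted gap between accuracy and mean confidence in the bin.
            ece += in_bin.mean() * abs(correct[in_bin].mean()
                                       - confidence[in_bin].mean())
    return ece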

5. Temperature Scaling

Guo et al. (2017) showed that a simple post-hoc temperature often calibrates modern classifiers well.

Given logits z, calibrated probabilities are:

text
p_k = softmax(z_k / T)

Where:

  • T > 1 softens probabilities and reduces overconfidence.
  • T < 1 sharpens probabilities.
  • T is fitted on a held-out calibration set by minimizing NLL.

Implementation:

python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemperatureScaler(nn.Module):
    def __init__(self):
        super().__init__()
        # Optimize log T so the temperature stays positive.
        self.log_t = nn.Parameter(torch.zeros(()))

    def forward(self, logits):
        temperature = self.log_t.exp().clamp(min=1e-3)
        return logits / temperature

def fit_temperature(logits_val, labels_val):
    # Fit T on the held-out calibration set by minimizing NLL.
    scaler = TemperatureScaler()
    opt = torch.optim.LBFGS(scaler.parameters(), lr=0.1, max_iter=50)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(scaler(logits_val), labels_val)
        loss.backward()
        return loss

    opt.step(closure)
    return scaler
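
A usage sketch, assuming logits_val and labels_val come from the held-out calibration set and logits_test from the frozen test set (names are illustrative):

python
scaler = fit_temperature(logits_val, labels_val)
with torch.no_grad():
    probs_test = F.softmax(scaler(logits_test), dim=-1)
print("fitted T:", scaler.log_t.exp().item())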

Temperature scaling does not fix ranking, recall, class confusion, or OOD failure. It only rescales confidence on data similar to the calibration set.

6. Calibration for Detection, Occupancy, and Forecasting

AV outputs are often structured, not single-label classification.

Detection

Calibrate object scores by:

  • class
  • range bucket
  • object size
  • occlusion level
  • weather/time condition
  • sensor modality

The same 0.8 score may mean different reliability at 8 m versus 80 m.
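
One way to make this visible is per-slice calibration error. The sketch below groups detection scores by an illustrative slice key such as (class, range_bucket) and computes a binned calibration gap per group:

python
import numpy as np
from collections import defaultdict

def ece_by_slice(scores, matched, slice_keys, n_bins=10):
    # scores: detection confidences; matched: True if the detection was a
    # true positive; slice_keys: e.g. (class, range_bucket) per detection.
    groups = defaultdict(list)
    for score, ok, key in zip(scores, matched, slice_keys):
        groups[key].append((score, ok))
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    out = {}
    for key, pairs in groups.items():
        s = np.array([p[0] for p in pairs])
        c = np.array([p[1] for p in pairs], dtype=float)
        err = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            m = (s > lo) & (s <= hi)
            if m.any():
                err += m.mean() * abs(c[m].mean() - s[m].mean())
        out[key] = err
    return out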

Occupancy

For occupancy probability:

text
P(cell occupied | predicted probability = p) should equal p

Important slices:

  • near field versus far field
  • dynamic versus static cells
  • free-space boundaries
  • occluded regions
  • drivable area
  • high-consequence zones around aircraft or people
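
A sketch of this check for one slice, assuming flattened per-cell arrays and a boolean mask selecting the slice (names are illustrative):

python
import numpy as np

def occupancy_reliability(pred_prob, occupied, slice_mask, n_bins=10):
    # pred_prob: predicted occupancy probability per cell; occupied: 0/1
    # ground truth per cell; slice_mask: cells in the slice of interest.
    p = pred_prob[slice_mask]
    y = occupied[slice_mask]
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (p > lo) & (p <= hi)
        if m.any():
            # Calibrated cells satisfy observed frequency ~ mean probability.
            rows.append((lo, hi, p[m].mean(), y[m].mean(), int(m.sum())))
    return rows  # (bin_lo, bin_hi, mean_prob, observed_freq, count)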

Forecasting

For trajectory or future occupancy uncertainty, evaluate:

  • coverage of prediction sets
  • NLL of realized future under multimodal distribution
  • miss rate at fixed false positive rate
  • calibration of risk scores
  • closed-loop planner sensitivity
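
For the first item, a coverage check can be as simple as the sketch below, which assumes scalar prediction intervals per scenario (a deliberate simplification of multimodal trajectory sets):

python
import numpy as np

def interval_coverage(lower, upper, realized):
    # Fraction of realized outcomes falling inside their predicted interval.
    covered = (realized >= lower) & (realized <= upper)
    return covered.mean()

# A nominal 95% interval should cover ~0.95 of held-out scenarios;
# systematic under-coverage indicates overconfident forecasts.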

7. Data Leakage

Data leakage means evaluation data influences training, tuning, or model selection in a way the claim does not allow.

AV leakage examples:

| Leakage type | Example | Result |
| --- | --- | --- |
| adjacent-frame leakage | frame t train, frame t+1 validation | inflated perception metrics |
| route leakage | same route/day in train and test | weak new-route evidence |
| map leakage | future map used in current prediction | inflated mapping and planning |
| label leakage | auto-labeler saw future frames | unrealistic online performance |
| metadata leakage | filename encodes scenario class | shortcut learning |
| tuning leakage | repeated threshold tuning on test set | overfit benchmark |
| teacher leakage | foundation teacher trained on test images | inflated SSL transfer |
| simulation leakage | same random seeds across splits | weak sim generalization |

Leakage is not only a data engineering bug. It is a scientific validity bug.
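
A simple guard against adjacent-frame and route leakage is to assign splits by group ID rather than by frame. A deterministic hash keeps assignment stable across runs; a sketch with illustrative names:

python
import hashlib

def split_for(group_id: str, val_fraction: float = 0.1) -> str:
    # Every frame from the same route (or site, or day) lands in the
    # same split, so near-duplicate frames cannot straddle the boundary.
    digest = hashlib.sha256(group_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "val" if bucket < val_fraction else "train"

assert split_for("route_0042") == split_for("route_0042")  # deterministic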

8. Benchmark Contamination

Benchmark contamination is a broader form of leakage where evaluation examples, labels, or benchmark-specific artifacts are present in pretraining, fine-tuning, prompt tuning, data filtering, or repeated model selection.

In foundation-model workflows, contamination can happen because:

  • pretraining data is web-scale and hard to audit
  • benchmark datasets are public and copied into many corpora
  • synthetic data is generated from models that saw the benchmark
  • teams repeatedly tune on public leaderboards
  • evaluation prompts leak into training logs

For AV foundation models, analogous contamination includes:

  • training on validation routes through unlabeled SSL
  • using final test logs for "just representation learning"
  • deriving pseudo-labels from a model trained on the test site
  • evaluating on public driving datasets that were part of generic pretraining
  • tuning prompts, adapters, or thresholds on the benchmark

Mitigations:

text
private holdout sets
time-based data cutoffs
route/site quarantines
deduplicated pretraining corpora
leaderboard-limited submissions
contamination audits by hash, embedding, and metadata
fresh scenario generation after model freeze
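
A sketch of the hash audit, assuming a caller-supplied normalize function that maps each item to canonical bytes. Exact hashing catches verbatim copies only; embedding-based near-duplicate search should complement it:

python
import hashlib

def contamination_overlap(train_items, test_items, normalize):
    # normalize(item) -> bytes, stripping formatting differences
    # (whitespace, resizing, re-encoding) before hashing.
    def fingerprint(item):
        return hashlib.sha256(normalize(item)).hexdigest()
    train_hashes = {fingerprint(x) for x in train_items}
    return [x for x in test_items if fingerprint(x) in train_hashes]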

9. Test, Validation, Calibration, and Shadow Sets

Separate sets by purpose:

| Set | Purpose | May tune on it? |
| --- | --- | --- |
| train | fit model parameters | yes |
| validation | choose architecture and hyperparameters | yes |
| calibration | fit thresholds, temperatures, conformal scores | yes, for calibration only |
| test | final evidence for release claim | no |
| shadow/fleet monitor | post-release drift detection | no for initial claim |
| incident set | regression and safety analysis | only under controlled protocol |

Thresholds are model parameters for evaluation purposes. If thresholds are tuned on the test set, the test set is no longer a clean test.

10. Implementation Interface

An evaluation harness should make protocol explicit:

python
from typing import NamedTuple

class EvalExample(NamedTuple):
    # Provenance fields make leakage audits and per-slice reporting possible.
    input_ref: str
    label_ref: str
    route_id: str
    site_id: str
    timestamp_ns: int
    map_version: str
    calibration_id: str
    scenario_tags: tuple[str, ...]

class EvalProtocol(NamedTuple):
    # The protocol is a versioned artifact, not an implicit convention.
    split_name: str
    split_manifest: str
    metrics: tuple[str, ...]
    thresholds: dict
    calibration_artifact: str | None
    aggregation: str
    primary_slices: tuple[str, ...]

Every result should emit:

text
metric table
confidence intervals
per-slice metrics
calibration curves
confusion examples
latency distribution
model and data provenance

For AV release gates, also emit a failure bundle:

text
worst false negatives
worst false positives
highest-confidence wrong predictions
uncertain near misses
route/site/weather regressions
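
A sketch of mining one bundle entry type, the highest-confidence wrong predictions, from per-example arrays (names are illustrative):

python
import numpy as np

def highest_confidence_errors(confidence, correct, k=50):
    # confidence: (n,) predicted confidences; correct: (n,) bool outcomes.
    wrong = np.flatnonzero(~np.asarray(correct, dtype=bool))
    order = np.argsort(-confidence[wrong])
    return wrong[order[:k]]  # indices to dereference into EvalExample refs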

11. Statistical Uncertainty

A metric from finite data is noisy.

For paired comparisons, evaluate per-scenario deltas:

text
d_i = metric_i(new_model) - metric_i(old_model)

Then report:

text
mean(d)
confidence interval over scenarios
number of improved/regressed scenarios
tail regressions

Bootstrap by scenario, not by frame, when frames within a scenario are correlated.
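
A scenario-level bootstrap sketch for the paired deltas above (n_boot and alpha are illustrative defaults):

python
import numpy as np

def scenario_bootstrap_ci(deltas, n_boot=10_000, alpha=0.05, seed=0):
    # deltas: per-scenario d_i. Resampling whole scenarios, not frames,
    # respects within-scenario correlation.
    d = np.asarray(deltas, dtype=float)
    rng = np.random.default_rng(seed)
    means = np.array([rng.choice(d, size=len(d), replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return d.mean(), (lo, hi)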

For safety-relevant metrics, do not rely only on mean improvement. A small mean gain with severe rare-scenario regressions is not a deployable win.

12. Failure Modes

| Failure mode | Symptom | Mitigation |
| --- | --- | --- |
| test leakage | great benchmark, poor field results | split audit and clean holdout |
| miscalibration | high-confidence wrong predictions | temperature scaling, class/range calibration |
| aggregate masking | average improves, rare scenario regresses | primary slices and tail metrics |
| threshold overfit | test score improves after repeated tuning | freeze thresholds before final test |
| benchmark contamination | public benchmark no longer discriminates | private/fresh holdouts and audits |
| label noise | model penalized for correct behavior | label QA and uncertainty-aware metrics |
| ground-truth mismatch | metric rewards unsafe behavior | align metric with operational risk |
| latency ignored | accurate model misses runtime budget | p50/p95/p99 latency gates |
| calibration-set drift | temperature works offline but not in field | recalibration protocol and drift monitors |

13. AV and Research Relevance

Evaluation is an autonomy subsystem, not a report-writing step. It determines whether a model can safely enter a stack.

AV-specific priorities:

  • split by physical correlation, not random frames
  • report rare-class and rare-scenario behavior
  • calibrate by range, class, occlusion, and weather
  • include latency and compute
  • test under sensor degradation and calibration perturbation
  • preserve evidence bundles for safety review
  • keep final test sets quarantined
  • use closed-loop evaluation when model outputs affect planning

For airside systems, the evaluation set should explicitly cover:

  • aircraft proximity
  • stand entry and exit
  • service-road crossings
  • baggage trains and dollies
  • cones, chocks, FOD-like objects
  • night, glare, rain, and wet apron
  • jet blast or exhaust haze where relevant
  • map changes and temporary closures

14. Practical Checklist

Before trusting a metric:

  1. State the claim.
  2. Identify the operational domain.
  3. Confirm the split supports the claim.
  4. Check whether thresholds and temperatures were fit on a separate set.
  5. Audit leakage and contamination paths.
  6. Inspect per-scenario and tail metrics.
  7. Report confidence intervals.
  8. Review high-confidence failures.
  9. Verify latency and deployment constraints.
  10. Preserve the exact protocol artifact.

Sources

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. "On Calibration of Modern Neural Networks." ICML 2017.

Public research notes collected from public sources.