Uncertainty Quantification, Calibration, and Conformal Prediction

Visual: reliability diagram plus conformal prediction-set construction showing calibration data, coverage guarantee, uncertainty type, and AV decision use.

Uncertainty quantification is the practice of making model uncertainty explicit enough to support decisions. Calibration asks whether predicted probabilities match empirical frequencies. Conformal prediction wraps a predictor with finite-sample coverage guarantees under exchangeability assumptions.

Why it matters for AV, perception, SLAM, and mapping

Autonomy decisions depend on uncertainty:

  • A tracker gates detections using predicted innovation covariance.
  • A planner slows down when occupancy or localization uncertainty grows.
  • A perception model threshold turns scores into object reports.
  • A world model exposes multiple future modes for interactive agents.
  • A safety monitor needs coverage claims that survive distribution shift audits.

Raw neural confidence is not enough. Guo et al. showed that modern neural networks can be poorly calibrated and that temperature scaling is a practical post-hoc fix for many classification settings. Deep ensembles often provide strong predictive uncertainty and OOD sensitivity. MC dropout provides one approximate Bayesian route. Conformal prediction provides distribution-free prediction sets with explicit coverage when the calibration and test examples are exchangeable.

Core definitions

Aleatoric and epistemic uncertainty

Aleatoric uncertainty is irreducible randomness in observations or outcomes:

text
wet pavement returns, photon noise, occlusion, future human choice

Epistemic uncertainty is model ignorance:

text
unseen airport layout, novel vehicle type, insufficient training data

In practice the two interact: a distant pedestrian in rain is both noisy to observe (aleatoric) and underrepresented in training data (epistemic).

Predictive distribution

A probabilistic predictor returns:

text
p_theta(y | x)

For regression, a common Gaussian output is:

text
y | x ~ N(mu_theta(x), Sigma_theta(x))

For classification:

text
p_theta(y = k | x) = softmax_k(z_theta(x))

Uncertainty quality is about the distribution, not just the point estimate.

Calibration

A classifier is calibrated if:

text
P(Y = y_hat | confidence = p) = p

For binary or multiclass confidence:

text
confidence = max_k p_theta(k | x)

If examples assigned 0.9 confidence are correct about 90 percent of the time, the model is calibrated on that slice.

Sharpness

Calibration alone is not enough. A model that always predicts base rates may be calibrated but useless. Sharpness measures concentration of predictive distributions. The goal is calibrated and sharp.

Prediction set

A prediction set returns a set C(x) rather than one label:

text
P(Y in C(X)) >= 1 - alpha

The set should be small when the model is confident and large when the input is ambiguous.

First-principles math

Negative log-likelihood

For data {(x_i, y_i)}:

text
NLL = - (1 / n) sum_i log p_theta(y_i | x_i)

NLL rewards probability mass on the realized outcome. It penalizes confident wrong predictions heavily, which is useful for safety-critical perception.

Brier score

For class probabilities p_i and one-hot target e_{y_i}:

text
Brier = (1 / n) sum_i ||p_i - e_{y_i}||_2^2

It is a proper scoring rule, and because it is bounded it is often easier to compare across models and decompose visually than NLL.
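
As a concrete reference, here is a minimal NumPy sketch of both scoring rules; the function names, array shapes, and the `eps` guard are illustrative assumptions, not anything prescribed above.

python
import numpy as np

def nll(probs: np.ndarray, labels: np.ndarray, eps: float = 1e-12) -> float:
    """Mean negative log-likelihood; probs is (n, K), labels is (n,) ints."""
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + eps)))

def brier(probs: np.ndarray, labels: np.ndarray) -> float:
    """Mean squared norm between probability vectors and one-hot targets."""
    onehot = np.eye(probs.shape[1])[labels]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))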

Expected calibration error

Partition examples into confidence bins B_m:

text
acc(B_m)  = mean_i in B_m 1[y_i = y_hat_i]
conf(B_m) = mean_i in B_m confidence_i

Then:

text
ECE = sum_m |B_m| / n * |acc(B_m) - conf(B_m)|

ECE is easy to report but depends on binning and can hide class, range, weather, and scene-specific miscalibration.
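
A minimal sketch of binned ECE, assuming equal-width bins over max-confidence scores; the bin count and names are illustrative choices.

python
import numpy as np

def ece(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 15) -> float:
    """Binned ECE: weighted mean gap between accuracy and confidence per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total, err = len(confidences), 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            err += mask.sum() / total * gap
    return err

Running the same function per slice (class, range, weather) is what exposes the cancellation problem noted above.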

Temperature scaling

Given logits z, calibrated probabilities are:

text
p_k = softmax(z_k / T)

T is fitted on a held-out calibration set by minimizing NLL. If T > 1, probabilities are softened. Temperature scaling does not change the predicted class ranking; it changes confidence.
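
A possible fitting routine, assuming held-out logits and integer labels and using SciPy's bounded scalar minimizer; the search bounds are an arbitrary choice.

python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits: np.ndarray, labels: np.ndarray) -> float:
    """Fit scalar T on a held-out calibration split by minimizing NLL."""
    def nll(t: float) -> float:
        z = logits / t
        z = z - z.max(axis=1, keepdims=True)                     # stable softmax
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))  # log-probabilities
        return -logp[np.arange(len(labels)), labels].mean()
    return minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x

Fit on the calibration split, then divide deployment logits by the returned T before the softmax.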

Ensembles

For M models:

text
p(y | x) = (1 / M) sum_m p_m(y | x)

Regression with mean and variance can decompose uncertainty:

text
E[y] = (1 / M) sum_m mu_m
Var[y] = (1 / M) sum_m (sigma_m^2 + mu_m^2) - E[y]^2

The average predicted variance captures aleatoric uncertainty; disagreement between means is a proxy for epistemic uncertainty.
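
A sketch of this decomposition, assuming each member outputs a Gaussian mean and standard deviation; the (M, n) array layout is an assumption for illustration.

python
import numpy as np

def ensemble_moments(mus: np.ndarray, sigmas: np.ndarray):
    """mus, sigmas: (M, n) per-member predictive means and standard deviations."""
    mean = mus.mean(axis=0)
    aleatoric = (sigmas ** 2).mean(axis=0)           # average member variance
    epistemic = (mus ** 2).mean(axis=0) - mean ** 2  # disagreement between means
    return mean, aleatoric + epistemic, aleatoric, epistemic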

MC dropout

At test time, keep dropout active and sample predictions:

text
y_hat_s = f_{theta, dropout_s}(x)

The empirical mean and variance approximate a Bayesian predictive distribution under assumptions that connect dropout to variational inference.
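
One way to retrofit this in PyTorch, assuming the model uses torch.nn.Dropout modules; forcing only those modules back into train mode at inference is the usual trick, and the sample count here is arbitrary.

python
import torch

@torch.no_grad()
def mc_dropout(model: torch.nn.Module, x: torch.Tensor, n_samples: int = 30):
    """Sample stochastic forward passes with dropout left active at test time."""
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()                      # keep only dropout layers stochastic
    samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.var(dim=0)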

Split conformal prediction

Given a trained model and calibration examples (x_i, y_i), define a nonconformity score s_i where larger means worse fit.

For classification, one simple score is:

text
s_i = 1 - p_theta(y_i | x_i)

Let q be the empirical quantile:

text
q = ceil((n + 1) * (1 - alpha)) / n quantile of {s_i}

For a new example:

text
C(x) = {y: 1 - p_theta(y | x) <= q}

Under exchangeability of calibration and test examples:

text
P(Y_new in C(X_new)) >= 1 - alpha

For regression, if s_i = |y_i - mu(x_i)|, the conformal interval is:

text
[mu(x) - q, mu(x) + q]

For heteroscedastic models, use normalized scores:

text
s_i = |y_i - mu(x_i)| / sigma(x_i)
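
A compact sketch of the split conformal recipe above; it assumes NumPy >= 1.22 for the method="higher" quantile argument, and the helper names are invented for illustration.

python
import numpy as np

def conformal_quantile(scores: np.ndarray, alpha: float) -> float:
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    level = np.ceil((n + 1) * (1 - alpha)) / n
    return float(np.quantile(scores, min(level, 1.0), method="higher"))

# Classification: calibration scores are s_i = 1 - p(y_i | x_i).
def prediction_sets(probs_new: np.ndarray, q: float):
    return [np.flatnonzero(1.0 - p <= q) for p in probs_new]

# Heteroscedastic regression: scores are s_i = |y_i - mu(x_i)| / sigma(x_i).
def prediction_interval(mu: float, sigma: float, q: float):
    return mu - q * sigma, mu + q * sigma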

Algorithmic patterns

| Pattern | Output | Best use | Main caveat |
| --- | --- | --- | --- |
| Temperature scaling | calibrated class probabilities | classification post-processing | assumes validation slice matches deployment |
| Isotonic or Platt scaling | calibrated scores | binary detectors | can overfit small calibration sets |
| Deep ensembles | predictive distribution | robust UQ, OOD detection | higher train and inference cost |
| MC dropout | sample-based uncertainty | approximate Bayesian retrofit | dropout distribution may be weak |
| Evidential models | distribution over evidence parameters | compact uncertainty head | can be miscalibrated without strong validation |
| Quantile regression | conditional intervals | regression bounds | quantile crossing and coverage drift |
| Split conformal | prediction sets or intervals | coverage guarantee on exchangeable data | guarantee is marginal, not per-slice |
| Mondrian conformal | group-conditional sets | class/range/weather slices | needs enough calibration data per group |
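
For the Mondrian row above, a sketch of group-conditional quantiles, one per slice key; note that small groups push the corrected level past 1, which is clipped here and weakens the per-group claim.

python
import numpy as np

def mondrian_quantiles(scores: np.ndarray, groups: np.ndarray, alpha: float) -> dict:
    """One finite-sample-corrected quantile per group key (class, range bin, ...)."""
    qs = {}
    for g in np.unique(groups):
        s = scores[groups == g]
        level = np.ceil((len(s) + 1) * (1 - alpha)) / len(s)
        qs[g] = float(np.quantile(s, min(level, 1.0), method="higher"))
    return qs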

AV, perception, SLAM, mapping, and planning relevance

Detection

Detector scores are often ranking scores, not calibrated probabilities. Calibrate by slices that affect sensor quality:

  • class
  • range
  • object size
  • occlusion
  • weather
  • time of day
  • sensor modality
  • map region or site

A 0.8 score for a close vehicle and a 0.8 score for a far cone may not mean the same empirical correctness.

Tracking and fusion

Kalman-style tracking requires covariance consistency. Use normalized innovation squared:

text
NIS = innovation^T S^-1 innovation

If the model is consistent, NIS follows an approximate chi-square distribution with degrees of freedom equal to measurement dimension. Persistent excess NIS means uncertainty is underestimated or the model is wrong.
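
A minimal NIS check with a chi-square gate, assuming a 1-D innovation vector and SciPy for the quantile; the 0.99 gate level is an illustrative choice.

python
import numpy as np
from scipy.stats import chi2

def nis_gate(innovation: np.ndarray, S: np.ndarray, level: float = 0.99):
    """Normalized innovation squared plus a chi-square consistency gate."""
    nis = float(innovation @ np.linalg.solve(S, innovation))
    return nis, nis <= chi2.ppf(level, df=len(innovation))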

Occupancy and mapping

For occupancy:

text
P(cell occupied | predicted p = 0.7) should be about 0.7

But cell-wise calibration is not enough. Evaluate connected components, object surfaces, drivable-space boundaries, and high-consequence regions separately.

Forecasting and world models

Future prediction is multimodal. A single mean trajectory can score well on displacement error and still be useless for interaction-aware planning. Use probabilistic metrics:

  • NLL under a mixture or sample distribution
  • coverage of prediction sets
  • miss rate at fixed false positive rate
  • closed-loop planner regret or collision rate
  • slice calibration for interactive and occluded scenarios

Planning

A planner should consume uncertainty through explicit contracts:

text
state estimate + covariance
object distribution or prediction set
occupancy probability and age
model confidence or OOD score
calibration domain metadata

Do not pass an uncalibrated neural score as if it were a probability in a safety cost.
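
One way to make such a contract explicit is a typed record; every field name below is a hypothetical schema choice, not an established interface.

python
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class UncertaintyContract:
    state_mean: np.ndarray      # state estimate
    state_cov: np.ndarray       # full covariance, not a bare score
    prediction_set: list        # conformal set or sampled futures
    occupancy_prob: np.ndarray  # per-cell occupancy probability
    occupancy_age_s: float      # staleness of the occupancy estimate
    ood_score: float            # model confidence / OOD indicator
    calibration_domain: str     # slice the calibration artifact covers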

Implementation notes

  • Keep a separate calibration split. Do not tune temperature, thresholds, or conformal quantiles on the final test set.
  • Version calibration artifacts with dataset slice, model checkpoint, label version, preprocessing, and operating domain.
  • Report reliability diagrams by scenario slice, not only globally.
  • Use proper scoring rules such as NLL and Brier score when evaluating predictive distributions.
  • For conformal prediction, document exchangeability assumptions. Route, date, weather, sensor rig, and map version splits matter.
  • Avoid treating marginal conformal coverage as per-class or per-scenario coverage. Use grouped methods or separate calibration when operationally required.
  • Monitor uncertainty drift online with NIS, score histograms, set sizes, OOD rates, and post-deployment label audits.

Failure modes and diagnostics

| Symptom | Likely cause | Diagnostic |
| --- | --- | --- |
| Confident false detections | score is uncalibrated or OOD | reliability by class/range/weather |
| Conformal sets too large | base model weak or calibration slice broad | inspect nonconformity quantiles by slice |
| Conformal coverage fails in deployment | exchangeability broken | recalibrate by route/site/weather/time split |
| Ensemble disagreement low but wrong | shared training bias or blind spot | diversify data, architecture, and OOD tests |
| MC dropout variance meaningless | dropout not aligned with epistemic uncertainty | compare to ensembles and held-out OOD |
| Planner overreacts to uncertainty | uncertainty not tied to consequence | calibrate risk costs against closed-loop outcomes |
| Planner underreacts to uncertainty | probabilities treated as scores | enforce calibrated probability contracts |
| Global ECE looks good | slice errors cancel | class/range/scenario reliability diagrams |

Sources

Public research notes collected from public sources.