
Information Theory for Perception and Machine Learning

Visual: relationship map for entropy, cross-entropy, KL, mutual information, compression, prediction error, and active sensing value.

Information theory gives a quantitative language for uncertainty, compression, prediction, sensing, and representation learning. For autonomy, it explains why some measurements reduce pose uncertainty, why some features are sufficient for planning, why calibration matters, and why lossy learned representations can be useful or dangerous.

Why it matters for AV, perception, SLAM, and mapping

Autonomy systems repeatedly ask information-theoretic questions:

  • How much does this LiDAR scan reduce map uncertainty?
  • Is a BEV feature sufficient for planning, or did the encoder discard a small obstacle?
  • Does a detector score behave like a probability, or only like a ranking?
  • Which route, viewpoint, or maneuver would reveal the most about an occluded area?
  • How much prediction detail is worth carrying into the planner?

Shannon's communication theory starts from uncertainty over possible messages. Cover and Thomas develop entropy, relative entropy, mutual information, channel capacity, and rate-distortion theory as a general mathematical toolkit. In perception and ML, the "channel" may be a sensor, neural encoder, map update, or latent world model.

Core definitions

Entropy

For a discrete random variable X with probabilities p(x):

```text
H(X) = - sum_x p(x) log p(x)
```

If the logarithm is base 2, units are bits. Entropy is low when the outcome is nearly known and high when many outcomes remain plausible.

For a Bernoulli occupancy cell:

```text
H(p) = -p log p - (1 - p) log(1 - p)
```

H(p) is highest at p = 0.5, where the cell is maximally uncertain, and zero at p = 0 or p = 1.
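A minimal sketch of this binary entropy in bits (the helper name is illustrative):

```python
import math

def bernoulli_entropy(p: float) -> float:
    """Entropy of a Bernoulli(p) occupancy cell in bits; 0 log 0 is taken as 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

# An unknown cell is maximally uncertain; a confident cell carries little entropy.
print(bernoulli_entropy(0.5))   # 1.0 bit
print(bernoulli_entropy(0.99))  # ~0.08 bits
```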

Conditional entropy

```text
H(Y | X) = - sum_x p(x) sum_y p(y | x) log p(y | x)
```

It measures remaining uncertainty in Y after observing X. In perception, H(map | sensor history) should decrease as observations make the map more certain.

Mutual information

```text
I(X; Y) = H(Y) - H(Y | X)
        = H(X) - H(X | Y)
        = sum_{x,y} p(x,y) log (p(x,y) / (p(x)p(y)))
```

Mutual information is expected uncertainty reduction. If I(sensor; map) is large, the sensor measurement is informative about the map. If I(feature; label) is large, the feature carries label-relevant information.

Conditional mutual information is:

```text
I(X; Y | Z) = H(Y | Z) - H(Y | X, Z)
```

This is useful when scoring a candidate action given what the robot already knows.
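The sum form of mutual information can be evaluated directly on a small joint table; a minimal sketch with illustrative names:

```python
import math

def mutual_information(joint):
    """I(X; Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal p(x)
    py = [sum(col) for col in zip(*joint)]      # marginal p(y)
    mi = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0.0:
                mi += pxy * math.log2(pxy / (px[x] * py[y]))
    return mi

# Perfectly correlated binary variables share exactly 1 bit.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))  # 1.0
# Independent variables share 0 bits.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
```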

KL divergence

Relative entropy, or KL divergence, is:

```text
D_KL(p || q) = sum_x p(x) log (p(x) / q(x))
```

It measures the extra code length, in expectation under p, from using model q when the true distribution is p. It is not symmetric.

In ML, negative log-likelihood and cross-entropy are empirical ways to push model distributions toward data distributions.

Cross entropy

```text
H(p, q) = - sum_x p(x) log q(x)
        = H(p) + D_KL(p || q)
```

Minimizing cross entropy over q is equivalent to minimizing KL divergence from the data distribution to the model distribution, up to the fixed entropy of the data distribution.
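A quick numerical check of this decomposition on a toy three-class distribution (the function names and the values of p and q are illustrative):

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # "data" distribution over three classes
q = [0.6, 0.3, 0.1]   # model distribution
# The decomposition holds term by term: H(p, q) = H(p) + D_KL(p || q),
# so minimizing cross entropy over q minimizes the KL term.
print(cross_entropy(p, q), entropy(p) + kl(p, q))
```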

Differential entropy

For a continuous variable:

```text
h(X) = - integral p(x) log p(x) dx
```

Unlike discrete entropy, differential entropy can be negative and changes under coordinate scaling. Mutual information and KL divergence, by contrast, are invariant under smooth invertible reparameterizations, which makes them the safer quantities to compare across feature spaces.

For a Gaussian N(mu, Sigma) in d dimensions:

```text
h(X) = 0.5 * log((2 pi e)^d det(Sigma))
```

This connects uncertainty volume to covariance determinant.
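A minimal sketch of this formula in nats (names are illustrative): shrinking det(Sigma) by 100x in 2D removes 0.5 * log(100), about 2.3 nats of uncertainty.

```python
import math

def gaussian_entropy_nats(cov_det: float, d: int) -> float:
    """Differential entropy of N(mu, Sigma) in nats, from det(Sigma) and dimension d."""
    return 0.5 * math.log((2.0 * math.pi * math.e) ** d * cov_det)

# A pose filter whose 2x2 position covariance determinant drops from 1.0 to 0.01
# has gained 0.5 * log(100) ~ 2.3 nats of information about position.
before = gaussian_entropy_nats(cov_det=1.0, d=2)
after = gaussian_entropy_nats(cov_det=0.01, d=2)
print(before - after)
```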

First-principles math

Information gain from a measurement

Before a measurement z, the map belief is p(m). After the measurement:

```text
p(m | z) = p(z | m) p(m) / p(z)
```

The information gain from observing z is:

```text
D_KL(p(m | z) || p(m))
```

The expected information gain before knowing which z will arrive is:

```text
E_z[D_KL(p(m | z) || p(m))] = I(M; Z)
```

This is the basis of active SLAM, next-best-view planning, and uncertainty-aware mapping.
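For a single binary occupancy cell observed through a binary hit/miss sensor, I(M; Z) has a closed form; a minimal sketch, with illustrative names and sensor parameters:

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def expected_info_gain(prior: float, p_hit_occ: float, p_hit_free: float) -> float:
    """I(M; Z) for a binary occupancy cell M and a binary sensor return Z,
    given p(z=hit | occupied) and p(z=hit | free)."""
    p_hit = prior * p_hit_occ + (1 - prior) * p_hit_free     # marginal p(z=hit)
    h_m = h2(prior)                                          # H(M) before measuring
    # H(M | Z): average posterior entropy over the two returns, via Bayes' rule.
    post_hit = prior * p_hit_occ / p_hit
    post_miss = prior * (1 - p_hit_occ) / (1 - p_hit)
    h_m_given_z = p_hit * h2(post_hit) + (1 - p_hit) * h2(post_miss)
    return h_m - h_m_given_z

# A sharp sensor on an unknown cell is worth most of a bit;
# a near-useless sensor is worth almost nothing.
print(expected_info_gain(0.5, 0.95, 0.05))
print(expected_info_gain(0.5, 0.55, 0.45))
```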

Fisher information

For likelihood p(z | theta), the score is:

```text
s(theta) = grad_theta log p(z | theta)
```

The Fisher information matrix is:

```text
I(theta) = E[s(theta) s(theta)^T]
```

Under regularity conditions, it also equals:

```text
I(theta) = -E[Hessian_theta log p(z | theta)]
```

For Gaussian residual models, the Gauss-Newton Hessian approximates accumulated information. This links measurement geometry, calibration observability, and SLAM covariance.
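A minimal scalar sketch (illustrative names): for z ~ N(theta, sigma^2) the score is (z - theta) / sigma^2, so each measurement contributes 1 / sigma^2 of Fisher information; information from independent measurements adds, which yields the Cramer-Rao variance bound sigma^2 / n:

```python
def fisher_information(sigma: float, n_measurements: int) -> float:
    """Total Fisher information about theta from n independent
    measurements z_i ~ N(theta, sigma^2): each contributes 1 / sigma^2."""
    return n_measurements / sigma ** 2

def cramer_rao_bound(sigma: float, n_measurements: int) -> float:
    """Lower bound on the variance of any unbiased estimate of theta."""
    return 1.0 / fisher_information(sigma, n_measurements)

# 10 range measurements with 0.2 m noise: the best achievable standard
# deviation is about 0.063 m, and it shrinks as constraints accumulate.
print(cramer_rao_bound(sigma=0.2, n_measurements=10) ** 0.5)
```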

Rate-distortion

Compression is only meaningful relative to distortion. Rate-distortion theory asks for the minimum code rate R needed to represent X with expected distortion at most D:

```text
R(D) = min_{p(x_hat | x): E[d(X, X_hat)] <= D} I(X; X_hat)
```

For AV learned representations, distortion should not mean pixel MSE by default. A small cone, reflective aircraft towbar, or lane boundary may occupy few pixels but carry high planning value.
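The simplest case with a closed form is a Bernoulli(p) source under Hamming distortion, where R(D) = H(p) - H(D) for D up to min(p, 1 - p); a sketch with illustrative names:

```python
import math

def h2(p: float) -> float:
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def rate_distortion_bernoulli(p: float, d: float) -> float:
    """R(D) = H(p) - H(D) bits/symbol for a Bernoulli(p) source under
    Hamming distortion, valid for 0 <= D <= min(p, 1 - p); 0 beyond that."""
    if d >= min(p, 1 - p):
        return 0.0
    return h2(p) - h2(d)

# A fair-coin source needs 1 bit/symbol losslessly, but only ~0.53 bits/symbol
# if 10% of symbols may be reconstructed wrongly.
print(rate_distortion_bernoulli(0.5, 0.0))
print(rate_distortion_bernoulli(0.5, 0.1))
```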

Information bottleneck

The information bottleneck method seeks a representation T that compresses input X while preserving information about target Y:

```text
min_{p(t | x)}  I(X; T) - beta I(T; Y)
```

Equivalently, keep target-relevant information and discard nuisance variation. For perception, Y may be object labels, occupancy, future trajectories, or planning cost. The failure mode is obvious: if the target omits rare hazards, the bottleneck may learn to discard them.

Channel capacity

A noisy sensor or communication link can be abstracted as a channel:

```text
p(y | x)
```

Channel capacity is:

```text
C = max_{p(x)} I(X; Y)
```

This gives a clean mental model for bandwidth, quantization, sensor noise, compression, and lossy logging. A perception stack cannot recover information that the sensor channel or preprocessing destroyed.
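The standard worked example is the binary symmetric channel with flip probability eps, where C = 1 - H(eps); a sketch with illustrative names:

```python
import math

def h2(p: float) -> float:
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(flip_prob: float) -> float:
    """Capacity C = 1 - H(eps) of a binary symmetric channel, in bits per use."""
    return 1.0 - h2(flip_prob)

# A clean bit channel carries 1 bit/use; a 10% flip rate destroys ~47% of it;
# a 50% flip rate carries nothing, no matter how clever the decoder.
print(bsc_capacity(0.0), bsc_capacity(0.1), bsc_capacity(0.5))
```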

Algorithmic patterns

| Pattern | Information-theoretic view | AV use |
| --- | --- | --- |
| Entropy map | uncertainty per cell or object | active mapping, safety visualization |
| Information gain planner | choose action maximizing expected uncertainty reduction | next best view, occlusion handling |
| NLL training | minimize empirical code length under model | probabilistic perception and forecasting |
| Cross entropy classification | fit class probability model | object classes, semantic cells |
| Contrastive learning | maximize lower bound on mutual information or separability | sensor representation learning |
| VAE / latent model | trade reconstruction, likelihood, and latent prior | world-model state compression |
| Rate-distortion tokenizer | compress inputs under task distortion | BEV, point cloud, image tokenizers |
| Calibration metric | compare confidence to empirical frequency | risk scoring and thresholding |

AV, perception, SLAM, mapping, and planning relevance

Sensor selection and active perception

An active perception objective can be written:

```text
score(a) = E_{z | a}[D_KL(p(M | z, a) || p(M))]
```

The action a may be a viewpoint, speed profile, lane position, sensor mode, or query policy. This objective rewards actions that are expected to make the map or scene belief more certain.

Feature sufficiency

A representation r_t is sufficient for a task if:

```text
p(y | history) = p(y | r_t)
```

For planning, y may be future collision, rule violation, or progress. This is why planner-friendly BEV features can be useful: they compress history into a state intended to preserve decision-relevant information. It is also why they need targeted validation.

Occupancy and entropy

Occupancy cells near p = 0.5 have high entropy. But planning risk is not entropy alone:

```text
risk(cell) = P(occupied) * consequence(cell)
```

A far unknown cell and a near unknown cell may have similar entropy but very different safety implications.

World models

World models choose what future information to predict. Pixel prediction carries many bits unrelated to control. Occupancy, agent state, and latent predictive features may carry fewer bits but more planning value. Rate-distortion and information bottleneck thinking help define the trade:

```text
compress observations enough for real-time planning
preserve enough information for safe decisions
```

Implementation notes

  • Specify log base. Bits use log2; nats use natural log. Mixing units causes confusing dashboards.
  • Estimate entropy and mutual information cautiously. Finite-sample estimators can be biased, especially for high-dimensional continuous features.
  • Do not optimize generic information gain without consequence weighting. A planner can waste effort learning irrelevant scene details.
  • Separate aleatoric sensor uncertainty from epistemic model uncertainty. Entropy of a softmax output is not a complete uncertainty estimate.
  • Treat compression as a safety-critical design choice. Downsampling, quantization, tokenization, and feature pooling can erase rare small objects.
  • Use NLL only when the likelihood family is meaningful. A low NLL from a misspecified model can still hide spatially concentrated failures.
  • When scoring active mapping, account for motion cost, latency, observability, and collision risk, not only expected entropy reduction.

Failure modes and diagnostics

| Symptom | Likely cause | Diagnostic |
| --- | --- | --- |
| High-confidence wrong predictions | model is miscalibrated or missing OOD state | reliability diagrams and scenario-sliced NLL |
| Active planner stares at irrelevant regions | information objective ignores task value | weight information gain by consequence and mission |
| Latent model misses small hazards | bottleneck or distortion erased rare details | inspect reconstructions and task probes by hazard type |
| Entropy dashboard looks good but planner unsafe | entropy not tied to consequence | evaluate risk-weighted occupancy and closed-loop outcomes |
| Mutual information estimate is unstable | high-dimensional estimator bias | use lower-dimensional probes and bootstrap intervals |
| Sensor fusion double counts evidence | correlated channels treated as independent | estimate conditional information and audit shared preprocessing |
| Map updates become overconfident | likelihood too sharp or repeated data | compare predicted entropy drop to empirical error reduction |

Sources

  • Shannon, "A Mathematical Theory of Communication" (1948).
  • Cover and Thomas, "Elements of Information Theory."