Convolutional Neural Networks: First Principles
Visual: grid convolution diagram showing shared kernel, stride, padding, dilation, receptive-field growth, feature pyramid, and aliasing tradeoff.
Locality, Weight Sharing, and Translation Structure
A fully connected layer treats every input position as unrelated:
flatten image -> W x + b
For images and BEV grids, that ignores two facts:
- Nearby pixels or cells are strongly related.
- The same visual pattern can appear at many locations.
A convolutional neural network encodes those assumptions by applying a small learned kernel across spatial positions with shared weights. This makes the model more parameter-efficient and gives it translation equivariance:
shift input -> shift feature map
Equivariance is not invariance. A CNN preserves where evidence appears. Later pooling, striding, global aggregation, or detection heads decide how much position sensitivity to keep.
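A minimal pure-Python sketch of this property, using a 1D cross-correlation with circular wrapping so the equivariance is exact (with zero padding it holds only approximately near borders); `circ_corr1d` and `roll` are illustrative helpers:

```python
def circ_corr1d(x, w):
    """1D cross-correlation with circular wrap: y[i] = sum_u w[u] * x[(i+u) % n]."""
    n = len(x)
    return [sum(wu * x[(i + u) % n] for u, wu in enumerate(w)) for i in range(n)]

def roll(x, s):
    """Circularly shift a sequence right by s positions."""
    return x[-s:] + x[:-s]

x = [0.0, 1.0, 4.0, 2.0, 0.0, 3.0]
w = [1.0, -2.0, 1.0]

# shift input -> shift feature map, exactly, under circular boundary conditions
assert circ_corr1d(roll(x, 2), w) == roll(circ_corr1d(x, w), 2)
```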
1. Cross-Correlation In Deep Learning
Most deep-learning libraries implement cross-correlation, even when the layer is called convolution. For input x and kernel w:
y[i, j] = sum_{u, v, c} w[c, u, v] * x[c, i + u, j + v] + b
The kernel is not flipped as in mathematical convolution. Because weights are learned, the distinction usually does not matter for modeling, but it matters when comparing formulas or porting weights.
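A small pure-Python sketch of the distinction: cross-correlation slides the kernel as-is, while mathematical convolution flips it first, so the two differ for asymmetric kernels. `corr1d` and `conv1d` here are toy "valid"-mode helpers for illustration:

```python
def corr1d(x, w):
    """'Valid' cross-correlation: kernel is applied as-is, no flip."""
    k = len(w)
    return [sum(w[u] * x[i + u] for u in range(k)) for i in range(len(x) - k + 1)]

def conv1d(x, w):
    """'Valid' mathematical convolution: kernel is flipped before sliding."""
    return corr1d(x, w[::-1])

x = [1.0, 2.0, 0.0, -1.0, 3.0]
w = [1.0, 0.0, -1.0]          # asymmetric kernel, so the two operations disagree

print(corr1d(x, w))  # [1.0, 3.0, -3.0]
print(conv1d(x, w))  # [-1.0, -3.0, 3.0]
```

For a symmetric kernel the flip is invisible, which is why learned weights make the distinction mostly cosmetic.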
For a 2D convolution layer:
input: (B, C_in, H_in, W_in)
weight: (C_out, C_in, K_h, K_w)
bias: (C_out)
output: (B, C_out, H_out, W_out)
Each output channel has its own bank of kernels over all input channels.
2. Output Shape Arithmetic
For one spatial dimension:
out = floor((in + 2 * padding - dilation * (kernel - 1) - 1) / stride + 1)
For height and width, apply the formula separately.
Common cases:
kernel=3, stride=1, padding=1, dilation=1 -> same spatial size
kernel=3, stride=2, padding=1 -> roughly half spatial size
kernel=1, stride=1, padding=0 -> channel mixing only
Shape arithmetic is safety-critical in AV because spatial tensors often align to camera intrinsics, feature pyramids, BEV cells, voxel grids, anchors, or map coordinates. An off-by-one padding error can shift detections or occupancy evidence relative to the world frame.
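The shape formula can be wrapped in a small helper and checked against the common cases (the `conv_out` name is illustrative):

```python
import math

def conv_out(size, kernel, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension of a conv layer."""
    return math.floor((size + 2 * padding - dilation * (kernel - 1) - 1) / stride + 1)

assert conv_out(224, kernel=3, stride=1, padding=1) == 224   # same spatial size
assert conv_out(224, kernel=3, stride=2, padding=1) == 112   # roughly half
assert conv_out(224, kernel=1) == 224                        # 1x1: channel mixing only
assert conv_out(10, kernel=3, stride=2, padding=1) == 5      # odd sizes round down
```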
3. Receptive Field
The receptive field of an output unit is the input region that can influence it. Stacking small kernels grows receptive field:
two 3x3 stride-1 convs -> effective 5x5 receptive field
three 3x3 stride-1 convs -> effective 7x7 receptive field
Striding and pooling grow receptive field faster but reduce resolution.
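The theoretical receptive field can be computed with the standard recurrence: each layer adds (kernel - 1) * dilation * jump input pixels, where jump is the cumulative stride so far. `receptive_field` below is a sketch:

```python
def receptive_field(layers):
    """layers: list of (kernel, stride, dilation) tuples, input to output order.
    Returns the theoretical receptive field of one output unit on the input."""
    rf, jump = 1, 1   # jump = input-pixel distance between adjacent units at this depth
    for kernel, stride, dilation in layers:
        rf += dilation * (kernel - 1) * jump
        jump *= stride
    return rf

assert receptive_field([(3, 1, 1)] * 2) == 5         # two 3x3 stride-1 convs
assert receptive_field([(3, 1, 1)] * 3) == 7         # three 3x3 stride-1 convs
assert receptive_field([(3, 2, 1), (3, 1, 1)]) == 7  # striding grows RF faster
```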
Important distinction:
- The theoretical receptive field is the maximum possible influence region.
- The effective receptive field is where gradients and learned influence actually concentrate.
AV examples:
- A lane boundary head needs long spatial context but also precise localization.
- A small-object detector needs high-resolution features.
- A traffic-light classifier needs local appearance plus map/crop context.
- BEV occupancy needs enough receptive field to infer occluded areas from visible geometry.
4. Padding And Boundary Effects
Zero padding lets features near borders use the same kernel size as interior features, but it injects artificial values. Boundary artifacts matter when:
- Crops truncate objects.
- Wide-angle camera images have strong distortion near edges.
- BEV grids have physical boundaries.
- Map raster layers use special unknown or out-of-range values.
Padding choices:
zero: simple, common, but creates an artificial border
reflect: useful for natural images, less obvious for semantic rasters
replicate: repeats edge values
circular: appropriate only for periodic domains
valid: no padding, shrinks spatial size
same: pads to preserve output size
Do not choose padding only to make tensor shapes convenient. Choose it based on what the boundary means.
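NumPy's `np.pad` modes roughly parallel these choices (`constant` ~ zero, `edge` ~ replicate, `wrap` ~ circular) and make the boundary values easy to inspect on a toy row of cells:

```python
import numpy as np

edge = np.array([1, 2, 3])  # imagine the last cells before a BEV grid boundary

print(np.pad(edge, 1, mode="constant"))  # [0 1 2 3 0]  zero: artificial border values
print(np.pad(edge, 1, mode="reflect"))   # [2 1 2 3 2]  mirror without repeating the edge
print(np.pad(edge, 1, mode="edge"))      # [1 1 2 3 3]  replicate edge values
print(np.pad(edge, 1, mode="wrap"))      # [3 1 2 3 1]  circular: only for periodic domains
```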
5. Stride, Pooling, and Aliasing
Stride reduces resolution:
stride=2 -> sample every second location
Pooling aggregates local regions:
max_pool -> presence of strong feature
avg_pool -> local average evidence
Downsampling without low-pass filtering can alias high-frequency patterns. In perception, aliasing can affect lane markings, traffic signs, small objects, and thin occupancy structures. Anti-aliased downsampling or careful feature pyramid design can improve stability.
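A toy illustration of the aliasing risk: stride-2 sampling of a one-pixel-period stripe pattern (think of a distant dashed lane marking) with and without a crude low-pass filter:

```python
# A one-pixel-period stripe pattern, e.g., a distant dashed lane marking.
stripes = [1, 0] * 8            # 16 pixels: 1, 0, 1, 0, ...

naive = stripes[::2]            # stride-2 sampling, no low-pass filter
print(naive)                    # [1, 1, 1, 1, 1, 1, 1, 1] -> aliased into a solid line

# Average-pool over each window of 2 before subsampling: a crude low-pass filter.
blurred = [(stripes[i] + stripes[i + 1]) / 2 for i in range(0, len(stripes), 2)]
print(blurred)                  # [0.5, ...] -> reports "half covered", not "solid"
```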
The detection tradeoff:
more downsampling -> more context, lower compute, worse localization
less downsampling -> better small objects, higher memory and latency
Feature pyramid networks exist because no single stride is ideal for all object scales.
6. Dilation
Dilated convolution spaces kernel samples apart:
effective_kernel = dilation * (kernel - 1) + 1
A 3x3 kernel with dilation 2 covers a 5x5 region using 9 sampled points. This increases receptive field without reducing spatial resolution.
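A quick check of the effective-kernel formula, plus the input offsets a dilated kernel actually samples; the skipped positions in between are what cause gridding artifacts (helper names are illustrative):

```python
def effective_kernel(kernel, dilation):
    """Spatial extent covered by a dilated kernel along one axis."""
    return dilation * (kernel - 1) + 1

def sampled_offsets(kernel, dilation):
    """Input offsets a dilated kernel actually reads along one axis."""
    return [u * dilation for u in range(kernel)]

assert effective_kernel(3, 2) == 5          # 3x3, dilation 2 -> 5x5 extent, 9 taps
assert sampled_offsets(3, 2) == [0, 2, 4]   # offsets 1 and 3 are never read
```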
Use cases:
- Semantic segmentation.
- BEV occupancy.
- Road topology context.
- Long-range spatial context in dense grids.
Failure modes:
- Gridding artifacts when repeated dilation samples disconnected positions.
- Poor local detail if dilation replaces too many standard convolutions.
- Hardware inefficiency depending on accelerator kernels.
7. Grouped And Depthwise Convolution
Grouped convolution splits channels into groups:
groups = 1 -> standard convolution
groups = C_in -> depthwise convolution
Depthwise separable convolution:
depthwise KxK conv: per-channel spatial filtering
pointwise 1x1 conv: channel mixing
This reduces parameters and compute:
standard: K*K*C_in*C_out
depthwise + pointwise: K*K*C_in + C_in*C_out
It is useful for efficient deployment, but channel mixing is delayed until the pointwise stage. In early fusion layers, overly aggressive grouping can reduce cross-channel interaction that might be needed for color, intensity, or multi-sensor cues.
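Plugging typical backbone numbers into the two parameter-count formulas shows the scale of the savings (helper names are illustrative):

```python
def standard_params(k, c_in, c_out):
    """Weights in a standard KxK convolution (bias ignored)."""
    return k * k * c_in * c_out

def separable_params(k, c_in, c_out):
    """Weights in a depthwise KxK conv plus a pointwise 1x1 conv."""
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 64, 128
print(standard_params(k, c_in, c_out))   # 73728
print(separable_params(k, c_in, c_out))  # 8768, roughly 8.4x fewer parameters
```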
8. CNN Building Blocks
Conv-BN-ReLU
Classic block:
conv -> batch norm -> ReLU
BatchNorm stabilizes training and ReLU provides nonlinearity. During export, conv and BatchNorm can often be fused:
y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta
The fused layer has adjusted weights and bias.
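A scalar sketch of the fusion algebra, treating the conv as a single multiply-add so the fold is easy to verify; real exporters apply the same scale and shift per output channel:

```python
import math

# Toy single-channel "conv": y = w * x + b, followed by BatchNorm statistics.
w, b = 0.5, 0.1
gamma, beta, mean, var, eps = 1.2, -0.3, 0.4, 2.0, 1e-5

def conv_bn(x):
    """Conv followed by BatchNorm, as during training."""
    return gamma * ((w * x + b) - mean) / math.sqrt(var + eps) + beta

# Fold BN into the conv weights, as done at export time.
scale = gamma / math.sqrt(var + eps)
w_fused = w * scale
b_fused = (b - mean) * scale + beta

x = 3.7
assert abs(conv_bn(x) - (w_fused * x + b_fused)) < 1e-9
```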
Residual Blocks
Residual block:
y = x + F(x)
The identity path improves gradient flow and lets blocks learn refinements. For perception, residual designs are common in image backbones, BEV encoders, and occupancy decoders.
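A minimal PyTorch sketch of such a block, assuming a standard conv-BN-ReLU body; the exact body F(x) varies by architecture:

```python
import torch

class ResidualBlock(torch.nn.Module):
    """Basic residual block: y = relu(x + F(x)), with F a conv-BN-ReLU-conv-BN body."""
    def __init__(self, channels):
        super().__init__()
        self.f = torch.nn.Sequential(
            torch.nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            torch.nn.BatchNorm2d(channels),
            torch.nn.ReLU(inplace=True),
            torch.nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            torch.nn.BatchNorm2d(channels),
        )
        self.relu = torch.nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.f(x))   # identity path preserves gradient flow

block = ResidualBlock(32)
x = torch.randn(2, 32, 16, 16)
assert block(x).shape == x.shape   # 3x3, stride 1, padding 1 -> shape-preserving
```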
1x1 Convolution
A 1x1 convolution mixes channels independently at each spatial location. It is used for:
- Reducing channel count.
- Expanding channel count before cheap spatial operations.
- Per-cell classification or regression heads.
- Feature fusion after concatenation.
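Because a 1x1 convolution is just a per-location matrix multiply over channels, it can be sketched in NumPy with a single einsum (shapes and names are illustrative):

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: channel mixing at each location. x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))   # C_in=8 feature map
w = rng.standard_normal((16, 8))     # C_out=16 mixing matrix

y = conv1x1(x, w)
assert y.shape == (16, 4, 4)
# Same result as a plain matrix multiply applied independently at every location:
assert np.allclose(y[:, 1, 2], w @ x[:, 1, 2])
```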
9. LeNet To Modern Perception
LeNet demonstrated that learned convolutional features, pooling, and gradient training could solve visual recognition tasks end to end. Modern AV CNNs are larger and often hybridized with attention or state-space models, but the same core assumptions remain:
local evidence matters
patterns repeat across space
hierarchies compose simple features into complex features
spatial resolution is traded for context and computeIn AV, CNNs remain strong for:
- Camera image backbones.
- Feature pyramid networks.
- BEV raster encoders.
- Occupancy and segmentation decoders.
- Efficient local refinement after sparse attention.
- Learned image or BEV preprocessing before transformer blocks.
10. Implementation Notes
PyTorch Conv2d
conv = torch.nn.Conv2d(
    in_channels=64,
    out_channels=128,
    kernel_size=3,
    stride=2,
    padding=1,
    dilation=1,
    groups=1,
    bias=False,
)
Input layout is usually (N, C, H, W). Some accelerators prefer channels-last memory format, but APIs may still present tensors in NCHW shape. Performance debugging should distinguish semantic shape from memory layout.
Dense Prediction Loss Shapes
For segmentation:
# logits: (B, C, H, W)
# target: (B, H, W)
loss = torch.nn.functional.cross_entropy(logits, target, ignore_index=255)
Do not flatten away the class dimension accidentally.
Coordinate Alignment
Every resize, crop, pad, stride, and deconvolution changes coordinate mapping. Keep an explicit record of:
input pixel center convention
feature stride
BEV cell size
origin and axis direction
padding offsets
camera projection scale
Many AV "model" bugs are coordinate convention bugs.
11. Failure Modes
Small Object Loss
Aggressive stride can erase pedestrians, cones, debris, or distant traffic lights. Feature pyramids, higher-resolution branches, and task-specific crops can help.
Border Artifacts
Padding can create false evidence near image or BEV edges. Inspect predictions near crop boundaries and camera seams.
Shortcut Texture Bias
CNNs can rely on texture instead of shape. For AV, this can fail under weather, lighting, construction zones, or new pavement. Shape-biased augmentation and cross-domain validation help reveal the issue.
Misaligned Upsampling
Decoder outputs can shift by half a pixel or one cell due to interpolation settings. This matters for lane boundaries, occupancy, and freespace edges.
BatchNorm Deployment Drift
CNN backbones often include BatchNorm. If deployment batch size or domain differs from training, running statistics can hurt reliability.
Inefficient Kernel Choices
Dilated, grouped, or unusual kernel shapes may be mathematically elegant but slow on the target accelerator. Measure latency on the deployment stack, not only FLOPs.
12. AV Review Checklist
What is the feature stride at every head?
What receptive field does each output need?
Are padding and coordinate conventions explicit?
How are small objects preserved?
Are BatchNorm statistics deployment-safe?
Are upsampling operations aligned to labels and world coordinates?
Are grouped/depthwise/dilated convolutions efficient on target hardware?
Do augmentations preserve convolutional assumptions?
CNNs are not just older backbones. They are the standard tool for encoding local spatial structure, and the mistakes they prevent or introduce are still central to AV perception.
13. Sources
- LeCun et al., Gradient-Based Learning Applied to Document Recognition, 1998.
- Stanford CS231n, Convolutional Networks.
- PyTorch, Conv2d.
- Dumoulin and Visin, A guide to convolution arithmetic for deep learning, 2016.
- Goodfellow, Bengio, and Courville, Deep Learning, especially convolutional networks.