SplatFlow
What It Is
- SplatFlow is a CVPR 2025 method for self-supervised dynamic Gaussian Splatting in autonomous-driving scenes.
- It reconstructs dynamic 4D scenes without requiring tracked 3D bounding boxes for dynamic-object supervision.
- It introduces Neural Motion Flow Field (NMFF), a set of implicit functions that model temporal motion for both LiDAR points and Gaussians.
- The method decomposes static background and dynamic objects, representing background with 3D Gaussians and dynamic content with 4D Gaussians.
- It is a reconstruction, rendering, and simulation-support method, not a production tracker or occupancy estimator.
Core Technical Idea
- Replace object-box-driven dynamic Gaussian decomposition with self-supervised motion-flow learning.
- Pretrain or learn 3D motion priors from LiDAR data so dynamic and static points can be separated in 3D.
- Use NMFF to estimate continuous motion flow fields over space and time.
- Warp dynamic LiDAR points and dynamic Gaussians across timestamps through the learned motion field (see the sketch after this list).
- Aggregate temporal features for each 4D Gaussian so dynamic actors remain consistent across views and times.
- Distill features from 2D foundation models into the 4D space-time representation to improve dynamic-object identification.
- Use separate 3D static Gaussians and 4D dynamic Gaussians for rendering RGB, depth, and flow.
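A minimal sketch of what an NMFF-style implicit motion field could look like, assuming a plain MLP that maps a 3D point plus source and target timestamps to a displacement; the class name, layer sizes, and warp helper below are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MotionFlowField(nn.Module):
    """Illustrative implicit motion field: (xyz, t_src, t_dst) -> 3D displacement.

    A hedged sketch, not the paper's exact NMFF design.
    """
    def __init__(self, hidden: int = 128):
        super().__init__()
        # 3 coordinates + 2 normalized timestamps in, 3D flow out.
        self.mlp = nn.Sequential(
            nn.Linear(3 + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, xyz: torch.Tensor, t_src: torch.Tensor,
                t_dst: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3); t_src, t_dst: (N, 1) timestamps scaled to [0, 1].
        return self.mlp(torch.cat([xyz, t_src, t_dst], dim=-1))

def warp_points(field: MotionFlowField, xyz: torch.Tensor,
                t_src: torch.Tensor, t_dst: torch.Tensor) -> torch.Tensor:
    """Warp dynamic LiDAR points (or Gaussian centers) from t_src to t_dst."""
    return xyz + field(xyz, t_src, t_dst)
```

The same field can be queried in both directions between timestamps, which is what makes temporal-consistency losses and cross-time feature aggregation possible without per-object boxes.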
Inputs and Outputs
- Inputs: synchronized calibrated cameras, LiDAR point clouds, camera poses or ego poses, and temporal driving sequences.
- Training signals: image reconstruction, LiDAR-based motion priors, self-supervised temporal correspondence, and distilled 2D foundation-model features.
- Explicitly avoided input: manually labeled tracked 3D dynamic-object boxes as the core dynamic supervision.
- Outputs: reconstructed dynamic Gaussian scene, novel-view RGB renderings, rendered depth, and rendered flow.
- Intermediate output: static/dynamic decomposition of scene elements and NMFF-based temporal correspondences.
- Non-output: SplatFlow does not provide safety-certified object velocities, semantic occupancy grids, or production map updates by itself.
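To make the interface concrete, a hypothetical container for one synchronized sample and the per-view outputs might look like the sketch below; all field names are illustrative assumptions, not the authors' data format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DrivingFrame:
    """One synchronized multi-sensor sample (field names are illustrative)."""
    images: dict[str, np.ndarray]        # camera name -> HxWx3 RGB
    intrinsics: dict[str, np.ndarray]    # camera name -> 3x3 K matrix
    cam_to_world: dict[str, np.ndarray]  # camera name -> 4x4 pose
    lidar_points: np.ndarray             # Nx4 (x, y, z, intensity) in ego frame
    ego_pose: np.ndarray                 # 4x4 ego-to-world transform
    timestamp: float                     # seconds on a shared sensor clock

@dataclass
class RenderedView:
    """Per-view renderings produced from the composed Gaussian scene."""
    rgb: np.ndarray           # HxWx3
    depth: np.ndarray         # HxW, metres
    flow: np.ndarray          # HxWx2 rendered motion flow
    dynamic_mask: np.ndarray  # HxW bool, from the static/dynamic decomposition
```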
Architecture or Pipeline
- Ingest multi-sensor driving sequences with RGB and LiDAR.
- Learn or initialize NMFF motion priors from 3D LiDAR observations.
- Identify dynamic components through motion-field behavior and foundation-feature distillation.
- Represent static background with ordinary 3D Gaussians.
- Represent moving objects with time-dependent 4D Gaussians whose motion and temporal correspondence are modeled by NMFF.
- Aggregate features across timestamps for each dynamic Gaussian to improve cross-view consistency.
- Render RGB, depth, and flow from the composed static plus dynamic scene.
- Optimize reconstruction, temporal motion consistency, and feature-distillation losses.
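One way to picture the composition step at render time, as a minimal sketch: dynamic Gaussian centers are warped from a canonical frame to the query timestamp (for example by an NMFF-style field) and concatenated with the static set before rasterization. The function below is an illustrative fragment, not the paper's renderer.

```python
import torch

def compose_scene_at_time(static_xyz: torch.Tensor,
                          dyn_xyz_canon: torch.Tensor,
                          dyn_flow_to_t: torch.Tensor) -> torch.Tensor:
    """Compose static and dynamic Gaussian centers for rendering at one timestamp.

    static_xyz:    (Ns, 3) static 3D Gaussian centers, shared across all times.
    dyn_xyz_canon: (Nd, 3) dynamic Gaussian centers in their canonical frame.
    dyn_flow_to_t: (Nd, 3) per-Gaussian displacement to the query time,
                   e.g. predicted by the learned motion field.
    Returns the (Ns + Nd, 3) centers handed to the Gaussian rasterizer, which
    then produces RGB, depth, and flow for the query view.
    """
    dyn_xyz_t = dyn_xyz_canon + dyn_flow_to_t  # move dynamic content to time t
    return torch.cat([static_xyz, dyn_xyz_t], dim=0)
```

Attributes other than centers (covariances, opacities, colors, distilled features) ride along with the same concatenation; only the time-dependent parts are warped.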
Training and Evaluation
- Evaluation is reported on Waymo Open Dataset and KITTI.
- The paper evaluates image reconstruction and novel-view synthesis with PSNR, SSIM, and LPIPS.
- Baselines include NeRF, NSG, SUDS, MARS, 3DGS, PVG, StreetGS (Street Gaussians), and EmerNeRF-style dynamic neural fields.
- The paper reports state-of-the-art performance across standard rendering metrics on the evaluated dynamic urban scenes.
- KITTI split experiments test robustness under reduced training data settings.
- Ablations show the value of NMFF priors, NMFF optimization, and optical-flow or foundation-model feature distillation.
- Dynamic-region metrics are separated from whole-scene metrics to show whether moving objects are actually improved rather than hidden by static background quality.
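A minimal sketch of the dynamic-region split, assuming a boolean mask of moving pixels is available (from the learned decomposition, projected annotations, or flow); this mirrors the evaluation practice described above rather than the paper's exact protocol, and shows PSNR only.

```python
import numpy as np

def psnr(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray | None = None) -> float:
    """PSNR over the whole image, or restricted to a boolean pixel mask.

    pred, gt: HxWx3 float arrays in [0, 1]; mask: HxW bool (True = evaluate here).
    """
    err = (pred.astype(np.float64) - gt.astype(np.float64)) ** 2
    if mask is not None:
        err = err[mask]  # boolean indexing keeps only masked pixels (and their channels)
    mse = float(err.mean())
    return float(10.0 * np.log10(1.0 / max(mse, 1e-12)))

# Report both numbers so moving objects are not hidden by static background quality:
# whole_scene  = psnr(pred, gt)
# dynamic_only = psnr(pred, gt, mask=dynamic_mask)
```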
Strengths
- Reduces dependence on expensive dynamic object boxes and tracks.
- Static/dynamic decomposition is learned from motion and features, which helps with unlabeled fleet-scale data.
- LiDAR motion priors give the method a stronger geometric basis than RGB-only dynamic splatting.
- Dynamic 4D Gaussians preserve object detail better than methods that smear or ghost moving objects.
- Rendered depth and flow make the reconstruction more useful for perception QA than RGB-only scene replay.
- Good candidate for offline dynamic-object removal because dynamic regions are explicitly discovered and separated.
Failure Modes
- Self-supervised motion decomposition can conflate ego-motion with object motion and be misled by shadows, reflections, and calibration errors.
- LiDAR sparsity or poor synchronization can create wrong motion priors.
- Foundation-model features may not identify airside-specific equipment or may group operationally different objects together.
- Dynamic objects with slow or intermittent motion can be absorbed into the static layer.
- Rendered flow is a reconstruction signal, not automatically a calibrated velocity estimate for planning.
- 4D Gaussian memory and optimization cost can grow quickly on long airport sequences with many independently moving assets.
Airside AV Fit
- Very relevant for airport domains because labeled 3D boxes and tracks for every GSE class are expensive and incomplete.
- Useful for cleaning static stand maps by discovering moving GSE, workers, carts, aircraft under pushback, and temporary obstructions.
- LiDAR-informed motion fields can transfer better to low-texture aprons than pure photometric dynamic methods.
- Dynamic decomposition can support simulation variants: remove all transient assets, replay observed movement, or insert edited motion.
- Airside transfer needs validation on slow-moving, stop-start, articulated, and reflective objects that differ from road vehicles.
- Treat it as offline reconstruction and simulation infrastructure; production AV stacks still need explicit tracking, occupancy, and safety monitors.
Implementation Notes
- Preserve raw LiDAR timestamps and camera exposure times, and interpolate ego poses carefully, because motion-field supervision is synchronization-sensitive (see the interpolation sketch after this list).
- Audit static-only and dynamic-only renders separately.
- For map hygiene, measure false-static insertions and false-dynamic deletions against repeated-day logs.
- Add manual review slices for parked aircraft, parked GSE that later moves, shadows under aircraft, wet pavement, and floodlit night operations.
- If foundation features are used for airside labels, evaluate class coverage before trusting dynamic identification.
- Store NMFF and Gaussian outputs with source log provenance so edited simulation assets do not contaminate real map-building data.
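As referenced in the synchronization note above, a minimal sketch of interpolating an ego pose to a raw LiDAR timestamp between two stamped 4x4 ego-to-world poses; SLERP on rotation plus linear interpolation on translation is one reasonable choice, not a prescription from the paper.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolate_pose(pose_a: np.ndarray, t_a: float,
                     pose_b: np.ndarray, t_b: float,
                     t_query: float) -> np.ndarray:
    """Interpolate a 4x4 ego-to-world pose at t_query between two stamped poses."""
    alpha = float(np.clip((t_query - t_a) / (t_b - t_a), 0.0, 1.0))
    # Rotation: spherical interpolation between the two orientations.
    slerp = Slerp([0.0, 1.0], Rotation.from_matrix([pose_a[:3, :3], pose_b[:3, :3]]))
    rot = slerp([alpha]).as_matrix()[0]
    # Translation: linear interpolation.
    out = np.eye(4)
    out[:3, :3] = rot
    out[:3, 3] = (1.0 - alpha) * pose_a[:3, 3] + alpha * pose_b[:3, 3]
    return out
```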
Sources
- CVPR 2025 paper page: https://openaccess.thecvf.com/content/CVPR2025/html/Sun_SplatFlow_Self-Supervised_Dynamic_Gaussian_Splatting_in_Neural_Motion_Flow_Field_CVPR_2025_paper.html
- CVPR 2025 paper PDF: https://openaccess.thecvf.com/content/CVPR2025/papers/Sun_SplatFlow_Self-Supervised_Dynamic_Gaussian_Splatting_in_Neural_Motion_Flow_Field_CVPR_2025_paper.pdf
- 3D Gaussian Splatting foundation paper: https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/