SplatFlow

What It Is

  • SplatFlow is a CVPR 2025 method for self-supervised dynamic Gaussian Splatting in autonomous-driving scenes.
  • It reconstructs dynamic 4D scenes without requiring tracked 3D bounding boxes for dynamic-object supervision.
  • It introduces Neural Motion Flow Field (NMFF), a set of implicit functions that model temporal motion for both LiDAR points and Gaussians.
  • The method decomposes static background and dynamic objects, representing background with 3D Gaussians and dynamic content with 4D Gaussians.
  • It is a reconstruction, rendering, and simulation-support method, not a production tracker or occupancy estimator.

Core Technical Idea

  • Replace object-box-driven dynamic Gaussian decomposition with self-supervised motion-flow learning.
  • Pretrain or learn 3D motion priors from LiDAR data so dynamic and static points can be separated in 3D.
  • Use NMFF to estimate continuous motion flow fields over space and time.
  • Convert dynamic LiDAR points and dynamic Gaussians across timestamps through the learned motion field.
  • Aggregate temporal features for each 4D Gaussian so dynamic actors remain consistent across views and times.
  • Distill features from 2D foundation models into the 4D space-time representation to improve dynamic-object identification.
  • Use separate 3D static Gaussians and 4D dynamic Gaussians for rendering RGB, depth, and flow.
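The warping step in the bullets above can be sketched as a toy motion field: a function that, given 3D points and two timestamps, returns displacements used to carry dynamic LiDAR points or Gaussian centers across time, while static points stay fixed. This is a minimal illustration; the real NMFF is a learned implicit function, and all names and the constant-velocity field here are assumptions, not the paper's interfaces.

```python
import numpy as np

def toy_motion_field(points, t_src, t_dst, velocity=np.array([1.0, 0.0, 0.0])):
    """Toy stand-in for a Neural Motion Flow Field: a constant-velocity
    field instead of a learned implicit function of (position, time).
    Returns per-point displacement from t_src to t_dst."""
    return np.broadcast_to(velocity * (t_dst - t_src), points.shape)

def warp_dynamic_points(points, dynamic_mask, t_src, t_dst):
    """Warp only the dynamic subset across timestamps; static points
    are left untouched, mirroring the static/dynamic decomposition."""
    out = points.copy()
    out[dynamic_mask] = out[dynamic_mask] + toy_motion_field(
        points[dynamic_mask], t_src, t_dst)
    return out
```

The same warp applied in reverse (swapping `t_src` and `t_dst`) gives the temporal correspondences used to aggregate features for each dynamic Gaussian.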

Inputs and Outputs

  • Inputs: synchronized calibrated cameras, LiDAR point clouds, camera poses or ego poses, and temporal driving sequences.
  • Training signals: image reconstruction, LiDAR-based motion priors, self-supervised temporal correspondence, and distilled 2D foundation-model features.
  • Explicitly avoided input: manually labeled tracked 3D dynamic-object boxes as the core dynamic supervision.
  • Outputs: reconstructed dynamic Gaussian scene, novel-view RGB renderings, rendered depth, and rendered flow.
  • Intermediate output: static/dynamic decomposition of scene elements and NMFF-based temporal correspondences.
  • Non-output: SplatFlow does not provide safety-certified object velocities, semantic occupancy grids, or production map updates by itself.
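The input and output lists above can be organized as typed containers; this is an illustrative schema under assumed shapes, not the paper's actual data format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DrivingSample:
    """Illustrative per-frame training input (assumed layout)."""
    images: np.ndarray        # (num_cams, H, W, 3) synchronized, calibrated RGB
    lidar_points: np.ndarray  # (N, 4) x, y, z, per-point timestamp
    cam_poses: np.ndarray     # (num_cams, 4, 4) camera-to-world
    ego_pose: np.ndarray      # (4, 4) ego-to-world at frame time
    timestamp: float

@dataclass
class RenderedOutputs:
    """Illustrative rendered outputs plus the decomposition byproduct."""
    rgb: np.ndarray           # (H, W, 3) novel-view color
    depth: np.ndarray         # (H, W) rendered depth
    flow: np.ndarray          # (H, W, 2) rendered flow
    dynamic_mask: np.ndarray  # (H, W) bool static/dynamic split
```

Note what is absent by design: no tracked 3D box fields, matching the method's avoidance of box supervision.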

Architecture or Pipeline

  • Ingest multi-sensor driving sequences with RGB and LiDAR.
  • Learn or initialize NMFF motion priors from 3D LiDAR observations.
  • Identify dynamic components through motion-field behavior and foundation-feature distillation.
  • Represent static background with ordinary 3D Gaussians.
  • Represent moving objects with time-dependent 4D Gaussians whose motion and temporal correspondence are modeled by NMFF.
  • Aggregate features across timestamps for each dynamic Gaussian to improve cross-view consistency.
  • Render RGB, depth, and flow from the composed static plus dynamic scene.
  • Optimize reconstruction, temporal motion consistency, and feature-distillation losses.
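The composition step in the pipeline above can be sketched as follows: advance dynamic Gaussian centers to the query time via a motion field, then merge them with the static set before rasterization. A minimal sketch, assuming centers only; a real renderer also carries covariances, opacities, and distilled features.

```python
import numpy as np

def compose_scene(static_centers, dynamic_centers_t0, flow_per_unit_time, t):
    """Compose static 3D Gaussians with dynamic 4D Gaussians advanced
    to query time t. flow_per_unit_time is a toy linearization of the
    motion field (one displacement per dynamic Gaussian per unit time);
    the real NMFF is a continuous learned field."""
    dynamic_at_t = dynamic_centers_t0 + flow_per_unit_time * t
    return np.concatenate([static_centers, dynamic_at_t], axis=0)
```

Rendering then treats the composed set uniformly, which is why static-only and dynamic-only renders (see Implementation Notes) are cheap to audit: drop one operand of the concatenation.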

Training and Evaluation

  • Evaluation is reported on Waymo Open Dataset and KITTI.
  • The paper evaluates image reconstruction and novel-view synthesis with PSNR, SSIM, and LPIPS.
  • Baselines include NeRF, NSG, SUDS, MARS, 3DGS, PVG, Street Gaussians (StreetGS), and EmerNeRF-style dynamic neural fields.
  • The paper reports state-of-the-art performance across standard rendering metrics on the evaluated dynamic urban scenes.
  • KITTI split experiments test robustness under reduced-training-data settings.
  • Ablations show the value of NMFF priors, NMFF optimization, and optical-flow or foundation-model feature distillation.
  • Dynamic-region metrics are separated from whole-scene metrics to show whether moving objects are actually improved rather than hidden by static background quality.
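Of the rendering metrics listed above, PSNR is the simplest to reproduce when auditing results; a minimal implementation, assuming images scaled to [0, 1]:

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images with
    pixel values in [0, max_val]. Higher is better."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

The dynamic-region variant is the same computation restricted to a dynamic mask, which is exactly why separating those metrics from whole-scene numbers matters: static background dominates the unmasked average.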

Strengths

  • Reduces dependence on expensive dynamic object boxes and tracks.
  • Static/dynamic decomposition is learned from motion and features, which helps with unlabeled fleet-scale data.
  • LiDAR motion priors give the method a stronger geometric basis than RGB-only dynamic splatting.
  • Dynamic 4D Gaussians preserve object detail better than methods that smear or ghost moving objects.
  • Rendered depth and flow make the reconstruction more useful for perception QA than RGB-only scene replay.
  • Good candidate for offline dynamic-object removal because dynamic regions are explicitly discovered and separated.

Failure Modes

  • Self-supervised motion decomposition can confuse ego-motion, object motion, shadows, reflections, and calibration errors.
  • LiDAR sparsity or poor synchronization can create wrong motion priors.
  • Foundation-model features may not identify airside-specific equipment or may group operationally different objects together.
  • Dynamic objects with slow or intermittent motion can be absorbed into the static layer.
  • Rendered flow is a reconstruction signal, not automatically a calibrated velocity estimate for planning.
  • 4D Gaussian memory and optimization cost can grow quickly on long airport sequences with many independently moving assets.
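The slow-motion failure mode above is easy to see in a toy decomposition rule. Any threshold-style test on inter-frame displacement, sketched here with an assumed 0.5 m threshold, will absorb objects moving slower than the threshold into the static layer:

```python
import numpy as np

def classify_dynamic(displacement, threshold=0.5):
    """Toy static/dynamic split on per-point displacement magnitude
    between frames (meters). The threshold is illustrative; real
    self-supervised decomposition is learned, but the same blind spot
    applies to any motion-magnitude criterion."""
    return np.linalg.norm(displacement, axis=-1) > threshold

# A tug creeping at 0.1 m/frame falls below the threshold and is
# treated as static, while a road-speed vehicle is caught.
```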

Airside AV Fit

  • Very relevant for airport domains because labeled 3D boxes and tracks for every GSE class are expensive and incomplete.
  • Useful for cleaning static stand maps by discovering moving GSE, workers, carts, aircraft under pushback, and temporary obstructions.
  • LiDAR-informed motion fields can transfer better to low-texture aprons than pure photometric dynamic methods.
  • Dynamic decomposition can support simulation variants: remove all transient assets, replay observed movement, or insert edited motion.
  • Airside transfer needs validation on slow-moving, stop-start, articulated, and reflective objects that differ from road vehicles.
  • Treat it as offline reconstruction and simulation infrastructure; production AV stacks still need explicit tracking, occupancy, and safety monitors.

Implementation Notes

  • Preserve raw LiDAR timestamps, camera exposure times, and ego-pose interpolation because motion-field supervision is synchronization-sensitive.
  • Audit static-only and dynamic-only renders separately.
  • For map hygiene, measure false-static insertions and false-dynamic deletions against repeated-day logs.
  • Add manual review slices for parked aircraft, parked GSE that later moves, shadows under aircraft, wet pavement, and floodlit night operations.
  • If foundation features are used for airside labels, evaluate class coverage before trusting dynamic identification.
  • Store NMFF and Gaussian outputs with source log provenance so edited simulation assets do not contaminate real map-building data.
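The map-hygiene measurement suggested above can be made concrete with two rates over boolean map-cell masks; a sketch assuming a reference static mask built from repeated-day logs:

```python
import numpy as np

def map_hygiene_rates(pred_static, truth_static):
    """Compare a predicted static-map mask against a repeated-day
    reference mask (both boolean arrays over map cells).
    false_static: dynamic content wrongly baked into the static map.
    false_dynamic: real static structure wrongly deleted as dynamic."""
    false_static = np.mean(pred_static & ~truth_static)
    false_dynamic = np.mean(~pred_static & truth_static)
    return float(false_static), float(false_dynamic)
```

Tracking both rates per log and per stand keeps the manual-review slices (parked aircraft, shadows, wet pavement) tied to measurable regressions.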

Sources

Notes compiled from publicly available materials on the SplatFlow paper.