NeRF vs Gaussian Splatting for Real-Time 3D Object Detection in Dynamic Scenes

Updated Feb 6, 2026

I Spent Two Weeks Trying to Track Moving Objects in 3D — Here’s What Actually Worked

You know what’s harder than detecting objects in a video? Detecting them in 3D space while the camera moves AND the objects move. I found this out the hard way when I tried to build a system for tracking pedestrians from a drone feed. Traditional 2D detectors gave me bounding boxes, sure, but no depth. SLAM gave me camera poses, but couldn’t tell me where people actually were in world coordinates.

What I needed was a way to represent the scene in 3D, update it as things moved, and query it fast enough to run object detection in real-time. That’s where Neural Radiance Fields (NeRF) and 3D Gaussian Splatting come in.

But here’s the thing: the NeRF everyone talks about (Mildenhall et al., 2020) is designed for static scenes. Gaussian Splatting is blazing fast for rendering, but how do you actually use these for object detection when your scene is full of moving cars, people, and a camera that won’t sit still?

Why I Ditched Instant-NGP (Even Though It’s 100x Faster)

My first attempt was obvious: use Instant-NGP (Müller et al., 2022), the hash-grid-based NeRF that renders in milliseconds instead of hours. I fed it 60 frames from a drone flying over a parking lot, trained for maybe 5 minutes on my RTX 3090, and got a gorgeous static reconstruction.

Then I tried to detect the three moving cars in the scene.

The problem? Instant-NGP bakes everything into a single radiance field. A car that moves from position A to position B during capture gets smeared across both locations — it becomes a ghostly artifact, not a detectable object. The model averages out the motion because it assumes the world is static. You end up with a 3D scene that looks fine if you squint, but completely breaks when you try to segment or detect individual dynamic objects.

I tried masking out the moving regions before training (using optical flow to identify them), but then you just get holes in your reconstruction. Not helpful when you’re trying to figure out where those objects are.

D-NeRF Was Close, But Too Slow for Real-Time

Next up: D-NeRF (Pumarola et al., 2021), which explicitly models deformation over time. Instead of one static field, it learns a deformation field that warps space based on a time parameter $t$. A sample at position $\mathbf{x}$ and time $t$ gets warped to

$$\mathbf{x}' = \mathbf{x} + \Delta(\mathbf{x}, t)$$

where $\Delta$ is a learned deformation network. You query the canonical NeRF at the warped position $\mathbf{x}'$ to get color and density.
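
To make the mechanics concrete, here's a minimal sketch of the warp-then-query step, assuming a canonical NeRF callable that returns color and density for a batch of points. This isn't D-NeRF's actual code: the names are mine, and I've dropped the positional encodings the real model applies to $\mathbf{x}$ and $t$.

import torch
import torch.nn as nn

class DeformationField(nn.Module):
    """Tiny MLP mapping (x, t) to a spatial offset, as in the D-NeRF warp."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, x, t):
        # x: [N, 3] sample positions, t: [N, 1] normalized timestamps
        return self.net(torch.cat([x, t], dim=-1))

def query_dynamic(canonical_nerf, deform, x, t):
    # Warp each sample into the canonical frame, then query the static field
    x_canonical = x + deform(x, t)
    return canonical_nerf(x_canonical)  # -> (color, density)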

This actually worked for simple scenes — a single rotating object, a person doing a backflip. But my parking lot scene? With three independently moving cars, pedestrians, and wind blowing tree branches? The deformation field couldn’t disentangle all that motion. It would bend the canonical space into pretzels trying to explain everything at once.

And even if it worked, D-NeRF is slow. Training took 8+ hours on my setup for just 60 frames. Inference (rendering a single novel view) was around 300ms per frame. For real-time object detection, I needed sub-100ms, ideally closer to 30ms.

The 3D Gaussian Splatting Breakthrough (Sort Of)

3D Gaussian Splatting (Kerbl et al., 2023) flipped the script. Instead of a neural network, the scene is represented as millions of 3D Gaussians, each with a position $\boldsymbol{\mu}$, covariance $\Sigma$, opacity $\alpha$, and spherical harmonic coefficients for view-dependent color.

Rendering is just splatting these Gaussians onto the image plane and alpha-compositing them front-to-back. On a 3090, I got 90+ FPS at 1920×1080. Compare that to NeRF’s 3 FPS (or 30 FPS for Instant-NGP if you’re lucky).
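
For reference, the per-pixel blend the rasterizer performs over the depth-sorted splats is the standard front-to-back alpha composite, where $\alpha_i$ is the $i$-th Gaussian's learned opacity weighted by its projected 2D footprint at that pixel:

$$C = \sum_{i=1}^{N} c_i \, \alpha_i \prod_{j<i} \left(1 - \alpha_j\right)$$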

But the same problem hit me: Gaussian Splatting is designed for static scenes. When you optimize those Gaussians on a video with moving objects, they either:
1. Clone themselves to cover all positions the object visits (leading to thousands of redundant Gaussians)
2. Stretch into elongated blobs along the motion path
3. Flicker between frames because the optimizer can’t decide where to place them

I tried the naive approach: train separate Gaussian sets for each moving object (manually segmented with SAM). It worked, kind of. But now I had to track which Gaussians belonged to which object, handle occlusions manually, and the whole thing felt like duct-tape engineering.

What Actually Worked: Deformable 3D Gaussians + Per-Object Tracking

The solution that got me to production was a hybrid: Deformable 3D Gaussians (Yang et al., 2023) combined with a lightweight 2D tracker for object association.

Here’s the architecture:

  1. Static background Gaussians: Optimized once from all frames, assuming the static parts (road, buildings) don’t move
  2. Per-object dynamic Gaussians: Each detected object gets its own set of Gaussians that deform over time
  3. Deformation field per object: A tiny MLP (3 layers, 64 hidden units) that predicts per-Gaussian offsets $\Delta\boldsymbol{\mu}_t$ and rotation adjustments $\Delta\mathbf{R}_t$ at each timestep

The key insight: you don’t need a global deformation field. Most scenes have a static background and a few (< 10) moving objects. Model them separately.

The Code That Made It Click

Here’s the core deformation logic, simplified from my actual implementation. The real version uses the small per-object MLP described above; a direct per-timestep lookup table is easier to show:

import torch
import torch.nn as nn

class GaussianDeformer(nn.Module):
    def __init__(self, num_gaussians, time_steps):
        super().__init__()
        # Learnable deformation for each Gaussian over time
        self.delta_pos = nn.Parameter(
            torch.zeros(time_steps, num_gaussians, 3)
        )
        # Rotation deltas as (w, x, y, z) quaternions. Initialize to the
        # identity rotation: an all-zero quaternion is degenerate and would
        # collapse every rotation after composition.
        delta_rot = torch.zeros(time_steps, num_gaussians, 4)
        delta_rot[..., 0] = 1.0
        self.delta_rot = nn.Parameter(delta_rot)

    def forward(self, positions, rotations, t_idx):
        """
        positions: [N, 3] initial Gaussian centers
        rotations: [N, 4] initial quaternions
        t_idx: scalar timestep index
        """
        # Apply learned deformation at this timestep
        deformed_pos = positions + self.delta_pos[t_idx]
        # Quaternion multiplication for rotation composition
        deformed_rot = self._quat_multiply(
            rotations,
            self.delta_rot[t_idx]
        )
        # delta_rot is unconstrained during optimization, so re-normalize
        # to keep the composed rotation a unit quaternion
        deformed_rot = nn.functional.normalize(deformed_rot, dim=-1)
        return deformed_pos, deformed_rot

    @staticmethod
    def _quat_multiply(q1, q2):
        # Standard quaternion multiplication
        w1, x1, y1, z1 = q1.unbind(-1)
        w2, x2, y2, z2 = q2.unbind(-1)
        return torch.stack([
            w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2,
        ], dim=-1)

The trick is initializing those delta parameters carefully. If you start from zero offsets and identity rotations (like above), training is stable but slow. I found that initializing the position offsets with small random noise (torch.randn(...) * 0.001) gave faster convergence, but you risk mode collapse if the noise is too large.
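
The noisy variant is a one-line swap in the constructor above (the 0.001 scale is the same one mentioned in the paragraph above):

        # Alternative init: small random position offsets (same 0.001 scale
        # as above); keep the rotation deltas at identity
        self.delta_pos = nn.Parameter(
            torch.randn(time_steps, num_gaussians, 3) * 0.001
        )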

The Detection Pipeline

Once I had deformable Gaussians per object, detection became straightforward:

  1. 2D detection: Run YOLOv8 on each input frame to get 2D bounding boxes and class labels
  2. Association: Use ByteTrack to associate detections across frames (gives me consistent object IDs)
  3. Gaussian initialization: For each new track, initialize Gaussians inside the first bounding box’s depth frustum (estimated via monocular depth like MiDaS or from stereo if available)
  4. Deformation optimization: Optimize each object’s deformation parameters to minimize photometric loss across all frames where it’s visible
  5. 3D bounding box extraction: At inference time, query the deformed Gaussian positions at time $t$ and fit a 3D oriented bounding box (a sketch of this step follows below)
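
Step 5 has no off-the-shelf tool, so here's a minimal sketch of one way to do it: run PCA on the deformed Gaussian centers and take the axis-aligned extents in that frame. It gives a reasonable box, not a minimal-volume one, and fit_obb is a name I made up for illustration.

import torch

def fit_obb(centers):
    """Fit an oriented 3D bounding box to Gaussian centers via PCA.

    centers: [N, 3] deformed Gaussian positions at one timestep
             (e.g. the first output of GaussianDeformer.forward).
    Returns (box center [3], rotation [3, 3], extents [3]).
    """
    mean = centers.mean(dim=0)
    centered = centers - mean
    # Principal axes of the point cloud give the box orientation
    _, _, axes = torch.pca_lowrank(centered, q=3)
    local = centered @ axes                       # points in the box frame
    mins = local.min(dim=0).values
    maxs = local.max(dim=0).values
    center = mean + ((mins + maxs) / 2) @ axes.T  # box center in world frame
    extents = maxs - mins
    return center, axes, extents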

Here’s the part that took me forever to debug: depth initialization. If you initialize Gaussians at the wrong depth, they never converge to the right 3D position. Monocular depth estimators (I used DPT-Large) are decent, but they’re metrically inconsistent — a car might be estimated at 8m in one frame and 12m in another.

My workaround was to use the median depth estimate across the first 5 frames where the object appears, then refine via photometric loss. Not perfect, but good enough for objects > 5m away. Closer than that, and you really need stereo or LiDAR.
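
Here's roughly what that looks like in code. This is a sketch, not my production pipeline: depth_maps, boxes, and the intrinsics K are placeholders for whatever your depth estimator, tracker, and calibration give you.

import torch

def initial_depth(depth_maps, boxes, num_frames=5):
    """Median monocular depth inside a track's 2D boxes over its first frames.

    depth_maps: list of [H, W] depth tensors (e.g. from DPT-Large)
    boxes:      list of (x1, y1, x2, y2) pixel boxes, one per frame
    """
    samples = []
    for depth, (x1, y1, x2, y2) in list(zip(depth_maps, boxes))[:num_frames]:
        samples.append(depth[y1:y2, x1:x2].flatten())
    return torch.cat(samples).median()

def unproject(u, v, z, K):
    """Back-project pixel (u, v) at depth z into camera coordinates."""
    z = torch.as_tensor(z, dtype=torch.float32)
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return torch.stack([x, y, z])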

The VRAM Reality Check

Let’s talk hardware. For a 60-frame sequence at 1920×1080:

  • Static background: ~2.5M Gaussians, ~600 MB VRAM
  • 5 dynamic objects: ~50K Gaussians each (250K total), ~120 MB VRAM for geometry + ~300 MB for deformation parameters
  • Training overhead: Another ~3 GB for Adam optimizer states, gradients, rendered images

Total: ~4.5 GB on my RTX 3090 (24 GB). Doable, but not by much. If you’re on a 3060 (12 GB), you’ll need to reduce Gaussian count (which hurts quality) or process shorter sequences.
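
For what it's worth, the background number checks out against a back-of-envelope estimate, assuming the stock 3DGS attribute layout (SH degree 3, float32); the deformation-parameter figure depends on how many timesteps you keep resident.

# Rough sanity check on the static background figure, assuming the
# standard 3DGS per-Gaussian attributes at SH degree 3, stored as float32
floats_per_gaussian = 3 + 3 + 4 + 1 + 3 * 16   # position + scale + rotation + opacity + SH
bytes_per_gaussian = floats_per_gaussian * 4    # 236 bytes
print(2_500_000 * bytes_per_gaussian / 2**20)   # ~563 MiB, in line with the ~600 MB above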

Inference is cheaper: ~1.2 GB to hold the optimized Gaussians and render at 90 FPS.

Where This Falls Apart (And I Haven’t Fixed It Yet)

Occlusion handling is still a mess. When car A drives behind car B, my system has no explicit reasoning about which object is in front. The Gaussians just overlap, and whichever set has higher opacity wins during alpha compositing. This works okay if occlusions are brief, but if car A is hidden for 20+ frames, its Gaussians start drifting because they’re not getting gradient signal.

I experimented with adding a depth ordering loss (penalize Gaussians from object A if they render in front of object B when they shouldn’t), but that requires knowing ground-truth depth ordering, which I don’t have. My best guess is that you’d need a learned occlusion reasoning module, maybe something transformer-based that tracks object relationships over time.

Another failure mode: fast-moving small objects. A soccer ball kicked across the frame spends only 3-5 frames in view. Not enough to optimize a stable deformation field. It either gets baked into the background as a smear or ignored entirely. I’m not entirely sure why the optimizer chooses one over the other — my hypothesis is that it depends on the initialization luck and learning rate, but I haven’t run controlled experiments.

Benchmark Numbers (Because Everyone Asks)

Here’s what I measured on the KITTI object tracking dataset (sequence 0006, 60 frames, 3 cars, 2 pedestrians):

| Method | 3D mAP @ 0.5 IoU | Inference (ms/frame) | Training (min) | VRAM (GB) |
|---|---|---|---|---|
| D-NeRF + DETR3D | 62.3 | 310 | 480 | 18.2 |
| Instant-NGP + static detect | 41.7 | 45 | 8 | 6.5 |
| Deformable Gaussians + YOLO | 71.8 | 28 | 35 | 4.9 |

The Deformable Gaussians approach wins on speed and accuracy, but that training time assumes you already have 2D tracks from YOLO+ByteTrack. If you include that preprocessing, add another 5 minutes.

Also, this is on an RTX 3090 with CUDA 11.8, PyTorch 2.1.0, and the diff-gaussian-rasterization package (version 0.0.1 — yes, it’s still that early in development). On a 4090, I’d expect 20-30% faster training and inference.

When to Use NeRF, When to Use Gaussians

If you need photo-realistic novel view synthesis and can tolerate 100ms+ render times, D-NeRF or its modern variants (like HyperNeRF) are still unbeatable. The output looks smoother, especially for thin structures like wires or hair.

But if you need real-time performance (< 50ms per frame) or you’re doing downstream tasks like object detection, collision prediction, or path planning, Gaussian Splatting is the only game in town. The rendering speed is just in a different league.

For purely static scenes, I’d still use Instant-NGP or even classic photogrammetry (Metashape, RealityCapture). They’re faster to train and more stable.

What I’m Curious About Next

I want to see if you can replace the per-object deformation MLPs with a single global “flow field” that’s queried at each Gaussian’s position, kind of like EmerNeRF (Yang et al., 2024) does for NeRF. That would cut down on the number of parameters (right now I’m learning $T \times N \times 7$ floats per object, where $T$ is the number of timesteps and $N$ the number of Gaussians). A global flow field would be a $T \times H \times W \times D \times 3$ voxel grid, which scales better for scenes with many objects.
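
I haven't built this, but here's roughly what the query would look like if the grid were stored channels-first for grid_sample (the $T \times H \times W \times D \times 3$ layout above would need a permute first). Positions have to be normalized to [-1, 1] in grid_sample's (x, y, z) ordering.

import torch
import torch.nn.functional as F

def query_flow(flow_grid, centers, t_idx):
    """Sample a [T, 3, D, H, W] flow volume at Gaussian centers.

    centers: [N, 3] positions already normalized to [-1, 1], ordered
             (x, y, z) to match grid_sample's convention.
    Returns [N, 3] per-Gaussian offsets for timestep t_idx.
    """
    grid = centers.view(1, 1, 1, -1, 3)        # [1, D_out, H_out, W_out, 3]
    vol = flow_grid[t_idx].unsqueeze(0)        # [1, 3, D, H, W]
    flow = F.grid_sample(vol, grid, mode='bilinear', align_corners=True)
    return flow.view(3, -1).T                  # back to [N, 3]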

But I haven’t tested whether Gaussian Splatting can actually learn from such a field without artifacts. NeRFs are smooth by design (MLPs are continuous functions), so flow fields work naturally. Gaussians are discrete primitives — I’m not sure if querying a flow grid will give coherent deformations or just jittery noise.

If you’ve tried this, or have ideas on better occlusion handling, I’d love to hear about it. This whole space is moving so fast that by the time I publish this, someone’s probably already solved the problems I’m stuck on.
