SelfEvo

Self-Improving 4D Perception via Self-Distillation

1UIUC   2Impossible Research   3Harvard   4MPI for Intelligent Systems   5UC Berkeley   6UBC  
*Equal contribution   Equal advising
TL;DR: SelfEvo applies self-distillation to improve multi-view reconstruction models without any annotations, yielding relative gains of up to 36.5% in video depth estimation and 20.1% in camera estimation.
Interactive comparison: input frames alongside point clouds from pretrained VGGT and SelfEvo (VGGT). Scene geometry is downsampled for faster loading; sky masking is applied for visualization only.

Abstract

Large-scale feedforward multi-view reconstruction models have made remarkable progress, but existing approaches still rely on fully supervised training with 3D annotations. Such annotations are expensive and particularly scarce for dynamic scenes, limiting scalability. We propose SelfEvo, a self-improving framework that continually improves pretrained multi-view reconstruction models using unlabeled videos. SelfEvo introduces a self-distillation scheme using spatiotemporal context asymmetry, enabling self-improvement for learning-based 4D perception without external supervision. We systematically study design choices that make self-improvement effective, including loss signals, forms of asymmetry, and other training strategies. Across eight benchmarks spanning diverse datasets and domains, SelfEvo consistently improves pretrained baselines and generalizes across base models (e.g., VGGT and π3), with significant gains on dynamic scenes. Overall, SelfEvo achieves up to 36.5% relative improvement in video depth estimation and 20.1% in camera estimation, without using any labeled data.

Method

SelfEvo pipeline

SelfEvo is an annotation-free self-improving framework that continually post-trains pretrained multi-view reconstruction models on unlabeled videos. It forms an online self-distillation loop: a teacher with richer context (the full frame set) provides stop-gradient pseudo targets to a student that sees only a subset of frames, and the teacher is updated as an exponential moving average (EMA) of the student after each step.
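The loop above can be sketched in a few lines of PyTorch. This is a minimal toy illustration, not the released code: the function names (`self_distill_step`, `ema_update`), the L1 distillation loss, and the `drop_ratio` default are our own stand-ins, and a simple per-frame model replaces the actual reconstruction network.

```python
import copy
import torch

def ema_update(teacher, student, decay=0.999):
    # Teacher parameters track an exponential moving average of the student's.
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)

def self_distill_step(student, teacher, frames, optimizer, drop_ratio=0.5):
    """One step: the teacher sees every frame (rich context), the student
    sees a random subset (frames dropped), and the teacher's predictions on
    the kept frames serve as stop-gradient pseudo targets."""
    num_frames = frames.shape[0]
    num_keep = max(1, int(num_frames * (1.0 - drop_ratio)))
    keep = torch.randperm(num_frames)[:num_keep].sort().values

    with torch.no_grad():                 # stop-gradient pseudo targets
        targets = teacher(frames)[keep]
    preds = student(frames[keep])

    loss = torch.nn.functional.l1_loss(preds, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)          # teacher co-evolves online
    return loss.item()
```

Because the teacher receives strictly more context than the student, its predictions on the kept frames are a meaningful target even though no labels are involved.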

How Does the Model Evolve?

We visualize how SelfEvo's predictions evolve over training, revealing a clear and sustained improvement trend as iterations increase.

Want to explore the 3D point clouds interactively? Try the interactive timeline viewer →

Quantitative Metrics Across Iterations

Metrics tracked across iterations: Depth Abs Rel ↓, Depth δ < 1.25 ↑, and Camera AUC@30 ↑.
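For reference, the two depth metrics are standard and easy to compute; here is a minimal NumPy sketch (function names are illustrative). Camera AUC@30 aggregates pose errors across thresholds and is omitted for brevity.

```python
import numpy as np

def abs_rel(pred, gt):
    # Absolute relative error: mean(|pred - gt| / gt); lower is better.
    return float(np.mean(np.abs(pred - gt) / gt))

def delta_threshold(pred, gt, thresh=1.25):
    # Fraction of pixels whose ratio max(pred/gt, gt/pred) falls below
    # the threshold; higher is better.
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < thresh))

gt = np.array([1.0, 2.0, 4.0, 8.0])
pred = np.array([1.1, 2.0, 5.0, 8.0])
print(abs_rel(pred, gt))          # 0.0875
print(delta_threshold(pred, gt))  # 0.75
```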

Experiments

Self-Improvement Results

SelfEvo (VGGT) consistently outperforms pretrained VGGT across all eight benchmarks, spanning new-domain, original-domain, and unseen-domain settings, without using any labeled data.

Results shown for Depth (Abs Rel ↓), Depth (δ < 1.25 ↑), and Camera (AUC@30 ↑).

Generality Across Models & Domains

The gains are not tied to a specific model or dataset: SelfEvo improves both VGGT and π³ when self-improved on DROID and BEDLAM2.0, with annotations used only for evaluation, never for training.

Results shown for Depth (Abs Rel ↓), Depth (δ < 1.25 ↑), and Camera (AUC@30 ↑).

Analysis

We present four ablations of our design choices on the Omniworld-Game dataset, with each subsection showing two scatter plots for depth and camera estimation. ★ denotes our default choice.

Asymmetry Mechanism

Frame dropping (★) achieves the best performance, surpassing photometric perturbations (aug-stu, aug-all) and frame cropping. Dropping frames creates a strong spatiotemporal context asymmetry, leading to more effective training signals.

Scatter plots: Depth ↖ · Camera ↗

Frame Selection Strategy

We study three frame dropping strategies: random sampling, attention-based sampling with high attention scores (keep top), and attention-based sampling with low attention scores (keep bottom). Random sampling (★) achieves the best performance on both depth and camera estimation.
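A small NumPy sketch of the three selection strategies, under the assumption that a per-frame attention score is available for the attention-based variants (the function and argument names are ours, not the paper's):

```python
import numpy as np

def select_frames(num_frames, num_keep, strategy="random", attn=None, seed=None):
    """Pick which frame indices the student keeps.
    'random' ignores attn; 'keep_top' / 'keep_bottom' keep the frames with
    the highest / lowest per-frame attention scores."""
    if strategy == "random":
        rng = np.random.default_rng(seed)
        idx = rng.choice(num_frames, size=num_keep, replace=False)
    elif strategy == "keep_top":
        idx = np.argsort(attn)[-num_keep:]   # highest-attention frames
    elif strategy == "keep_bottom":
        idx = np.argsort(attn)[:num_keep]    # lowest-attention frames
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return np.sort(idx)

attn = np.array([0.1, 0.9, 0.5, 0.3])
print(select_frames(4, 2, "keep_top", attn))     # [1 2]
print(select_frames(4, 2, "keep_bottom", attn))  # [0 3]
```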

Scatter plots: Depth ↖ · Camera ↗

Online vs. Offline Teacher

Online EMA (★) shows the largest shift toward the optimal corner in both plots. Unlike a fixed teacher, which grows stale as the student evolves, keeping the teacher and student co-evolving allows continuous improvement.

Scatter plots: Depth ↖ · Camera ↗

Training Recipe

Freeze-C (★) occupies the best corner in both plots. Freeze-A is a clear outlier on camera (bottom-left), confirming backbone adaptation is critical for pose learning. Freezing the camera decoder anchors pose quality while the backbone and depth decoder continue to improve.

freeze-A: freeze aggregator · freeze-C: freeze camera head · freeze-D: freeze depth head · freeze-C&D: freeze both heads · train-all: update all modules
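In PyTorch, such recipes reduce to toggling `requires_grad` on submodules. A minimal sketch with a toy model; the submodule names (`aggregator`, `camera_head`, `depth_head`) mirror the recipe labels here and will not match the real model's attribute names.

```python
import torch.nn as nn

# Toy stand-in with the three module groups the ablation toggles.
class ToyRecon(nn.Module):
    def __init__(self):
        super().__init__()
        self.aggregator = nn.Linear(8, 8)
        self.camera_head = nn.Linear(8, 6)
        self.depth_head = nn.Linear(8, 1)

RECIPES = {
    "freeze-A": {"aggregator"},
    "freeze-C": {"camera_head"},
    "freeze-D": {"depth_head"},
    "freeze-C&D": {"camera_head", "depth_head"},
    "train-all": set(),
}

def apply_recipe(model, recipe):
    # Disable gradients for frozen submodules, enable them everywhere else.
    frozen = RECIPES[recipe]
    for name, module in model.named_children():
        for p in module.parameters():
            p.requires_grad = name not in frozen

model = ToyRecon()
apply_recipe(model, "freeze-C")  # camera head frozen; backbone and depth train
```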

Scatter plots: Depth ↖ · Camera ↗

Just for Fun: What If We Train on a Single Movie?

SelfEvo also improves in this data-constrained, test-time adaptation setting.

Interactive comparison: input frames alongside point clouds from pretrained VGGT and SelfEvo (VGGT). Scene geometry is downsampled for faster loading.

Future Work

Our framework is most effective in settings with sufficient camera motion, where frame dropping provides a strong context asymmetry signal. When the camera remains static, it is difficult to create asymmetry through frame dropping alone. Future work could extend frame-level selection to the token level by selectively dropping tokens for greater flexibility. Additionally, as with other self-improving frameworks, the absence of ground-truth supervision means extended training may risk model collapse. In practice, however, we generally observe stable performance without significant degradation. Understanding how to sustain improvement over longer training horizons remains an important direction for future work.

BibTeX


@misc{huang2026selfimproving4dperceptionselfdistillation,
      title={Self-Improving 4D Perception via Self-Distillation}, 
      author={Nan Huang and Pengcheng Yu and Weijia Zeng and James M. Rehg and Angjoo Kanazawa and Haiwen Feng and Qianqian Wang},
      year={2026},
      eprint={2604.08532},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.08532}, 
}