Abstract
This paper presents a unified approach to understanding dynamic scenes from casual videos. Large pretrained vision models, such as vision-language, video depth prediction, motion tracking, and segmentation models, offer promising capabilities. However, training a single model for comprehensive 4D understanding remains challenging. We introduce Uni4D, a multi-stage optimization framework that harnesses multiple pretrained models to advance dynamic 3D modeling, including static/dynamic reconstruction, camera pose estimation, and dense 3D motion tracking. Our results show state-of-the-art performance in dynamic 4D modeling with superior visual quality. Notably, Uni4D requires no retraining or fine-tuning, highlighting the effectiveness of repurposing visual foundation models for 4D understanding.
Method Overview

Given a casually captured video, Uni4D harnesses visual foundation models and structured prediction to jointly estimate camera poses, dynamic geometry, and dense 3D motion. Static geometry and camera poses are obtained through tracklet-based structure-from-motion on top of predicted depths. Dynamic geometry is refined with temporal-smoothness and as-rigid-as-possible (ARAP) motion priors. A final fusion stage densifies the geometry to produce a high-quality 4D reconstruction.
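To give intuition for the ARAP motion prior mentioned above, here is a minimal, hypothetical sketch: it penalizes changes in pairwise distances between neighboring 3D points across frames, so locally rigid motion incurs zero cost. The function name and the simplified pairwise form are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def arap_energy(P_t, P_t1, neighbors):
    """Simplified as-rigid-as-possible (ARAP) prior (illustrative sketch).

    P_t, P_t1: (N, 3) arrays of 3D point positions at consecutive frames.
    neighbors: list of (i, j) index pairs defining the local neighborhood graph.
    Penalizes changes in pairwise distances, so rigid motion costs zero.
    """
    energy = 0.0
    for i, j in neighbors:
        d0 = np.linalg.norm(P_t[i] - P_t[j])    # edge length at frame t
        d1 = np.linalg.norm(P_t1[i] - P_t1[j])  # edge length at frame t+1
        energy += (d1 - d0) ** 2                # deviation from rigidity
    return energy
```

Under this sketch, a pure translation of all points leaves the energy at zero, while stretching the point set is penalized.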
Results - Geometry
[Qualitative comparisons: video gallery]
[Quantitative comparisons: results table]
We compare Uni4D's geometry qualitatively and quantitatively on multiple datasets against strong baselines, ranging from metric depth models such as Metric3D and DepthPro to reconstruction methods such as CasualSAM and MonST3R. Uni4D demonstrates superior visual quality, temporal consistency, and geometric accuracy across a variety of challenging scenes. Quantitative evaluation on depth-map metrics further supports the accuracy of Uni4D's reconstructed geometry.
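As one example of a standard depth-map metric, here is a sketch of absolute relative error (Abs Rel) with optional median-scale alignment, which is common when comparing up-to-scale reconstructions against metric ground truth. The function name and the median-alignment choice are assumptions for illustration; the exact evaluation protocol may differ.

```python
import numpy as np

def abs_rel(pred, gt, align_scale=True):
    """Absolute relative depth error (illustrative sketch).

    pred, gt: depth maps of the same shape; gt <= 0 marks invalid pixels.
    If align_scale, prediction is median-scaled to ground truth first,
    as is common for up-to-scale reconstructions.
    """
    mask = gt > 0                      # evaluate only valid ground-truth pixels
    pred, gt = pred[mask], gt[mask]
    if align_scale:
        pred = pred * np.median(gt) / np.median(pred)
    return float(np.mean(np.abs(pred - gt) / gt))
```

With alignment enabled, a prediction that is a constant multiple of the ground truth scores a perfect zero error.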
Results - Pose
[Qualitative comparisons: trajectory visualizations]
[Quantitative comparisons: results table]
We compare Uni4D's pose estimation qualitatively and quantitatively on multiple datasets against strong baselines such as LEAPVO, DPVO, CasualSAM, and MonST3R. Uni4D produces accurate and smooth camera trajectories across a variety of challenging scenes. Quantitative evaluation on standard pose metrics further supports Uni4D's pose estimation results. * indicates methods with known camera intrinsics.
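A common pose metric in this setting is absolute trajectory error (ATE) after similarity alignment of the predicted camera centers to ground truth via the Umeyama method. The sketch below is an assumed, simplified implementation for intuition; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def ate_rmse(pred, gt):
    """Absolute trajectory error RMSE after Sim(3) alignment (sketch).

    pred, gt: (N, 3) arrays of camera centers. Aligns pred to gt with a
    least-squares similarity transform (Umeyama), then reports RMSE.
    """
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g              # centered trajectories
    U, S, Vt = np.linalg.svd(G.T @ P / len(pred))
    D = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:              # guard against reflections
        D[2, 2] = -1.0
    R = U @ D @ Vt                             # optimal rotation
    scale = np.trace(np.diag(S) @ D) / P.var(0).sum()
    aligned = scale * (R @ P.T).T + mu_g       # pred mapped into gt frame
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))
```

Because the alignment absorbs any global rotation, translation, and scale, a trajectory that differs from ground truth only by a similarity transform scores an ATE of zero.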