Final project for 6.8300 Computer Vision (Spring 2026), MIT

Video Real2Sim (VR2S)

Zehua Wang · Joy Zhuo

1. Introduction

Turning a single tabletop photograph into a 3D scene that can support simulation is still a bottleneck for scalable real2sim. The hard part is not only engineering. It is information. A single camera view cannot observe the back of an object, the contact patch hidden under it, or geometry clipped by the image boundary. Monocular depth can infer a plausible front surface, but the missing views remain missing.

Prior approaches usually choose one of three compromises. Monocular-depth pipelines preserve the single-image interface, but lift only the visible surface. Per-object generators can hallucinate complete assets, but have trouble preserving scene scale, contacts, and support geometry. Dense multi-view capture recovers much richer structure, but requires the user to walk around the scene. That walkaround is exactly what a single-image real2sim system is supposed to avoid. We characterize these choices in Section 2.

Video Real2Sim (VR2S) tests a different compromise. Instead of asking a depth model to infer all geometry from one image, we ask a general image-to-video model to synthesize the missing views, then pass those views to a pose-free reconstruction stack. The video model is not given camera extrinsics, depth maps, or 3D supervision at inference time. It receives only the input image and a prompt asking for an orbital video. With small exceptions noted in the experimental setup, this is mostly a single-image pipeline: the populated scene is always specified by one photo, while the video model supplies a synthetic walkaround rather than a real capture.

The answer is not simply that video diffusion solves real2sim. VR2S improves perceptual fidelity and back-side appearance in our evaluation, especially when extra views reveal object or background structure that a single photo cannot. It also loses some silhouette accuracy because the generated views, segmentation masks, and ICP placement introduce drift. The useful claim is therefore narrower and more concrete: synthetic video can relax the single-view information bottleneck, but it does not remove the need for careful reconstruction, alignment, and evaluation.

2. Related Work

Single-image real2sim. RoLA [1] and OpenReal2Sim [2] typically segment the scene, estimate monocular depth with models such as MoGe [6], then back-project the visible surface. This gives strong input-view placement, but no direct observation of object backs, occluded contact regions, or clipped geometry. Per-object generators such as Gen2Sim [3] and InstantMesh [4] improve object completeness but weaken scene-scale placement. Dense multi-view methods such as SplatSim [5], NeRF, and 3DGS recover richer geometry, but require a real walkaround.

Pose-free multi-view reconstruction. DUSt3R [7], MASt3R [8], VGGT [9], and DA3 [10] recover geometry from unposed views using feed-forward models. We use DA3 because it turns an unposed video into depth, intrinsics, and extrinsics in a shared frame. Since DA3 also appears in our evaluation pipeline, we treat those outputs as proxy targets rather than sensor-grade ground truth.

Video diffusion as a view synthesizer. General image-to-video models such as Wan [11], CogVideoX [12], and Veo [13] can produce orbit-like videos from one image. The views are not metrically guaranteed, but they often contain side and back evidence that monocular depth never receives. We deliberately avoid camera-conditioned generators such as Stable Video 4D [14], MotionCtrl [15], and CameraCtrl [16], because VR2S asks whether general video priors alone are useful.

3. Method

Figure 1. VR2S pipeline. Two orbital videos are synthesized from one photo by Veo3: one over the populated scene and one over the background-removed image. Separate DA3 branches reconstruct foreground and support geometry, then ICP placement assembles the final scene mesh.

Core idea. OpenReal2Sim lifts the visible surface from monocular depth. This preserves input-view placement well, but cannot observe object backs, under-object support regions, or clipped surfaces.

VR2S changes the source of evidence. Instead of asking one image to carry the whole reconstruction, we ask a general video model to synthesize an orbit. DA3 estimates poses and depth, SAM2 supplies masks, Hunyuan3D builds object meshes, TSDF fusion constructs the background, and ICP places objects back into the scene. The contribution is the substitution of synthetic multi-view evidence for the missing physical walkaround.

The central design choice. Before reconstruction, VR2S passes the input image to Veo3 and asks for an orbital video.

We give Veo3 no camera extrinsics, no depth conditioning, and no geometric supervision at inference time. It receives only the image and a prompt requesting a 360° orbit. The experiment is whether that general model can provide enough side and back evidence to help real2sim reconstruction.

The claim is narrow: generated views can expose geometry and appearance that the input photo does not contain. This should improve back-side and perceptual reconstruction when the orbit is coherent, while hurting placement-sensitive metrics when the generated video or alignment drifts.

Scope of contribution. DA3, SAM2, Hunyuan3D, TSDF fusion, and ICP are supporting components, not independent contributions. The design choice we defend is using a general video model to create the missing observations that the rest of the pipeline consumes.

The two-video trick. VR2S generates one orbit of the populated scene and one orbit of a clean-background image produced by object removal. The second orbit lets us reconstruct support geometry under and behind objects. The two reconstructions are not explicitly matched frame-by-frame; placement handles alignment through the recovered support plane.

Components. DA3 [10] processes the generated videos and emits per-frame depth, intrinsics, and extrinsics. For each pixel, back-projection follows the standard pinhole model

$X_c = (u - c_x) d / f_x, Y_c = (v - c_y) d / f_y, Z_c = d, X_w = R_c_to_w X_c + t_c_to_w,$

which lifts confident depth pixels into a world pointcloud. We anchor DA3 to the first frame (ref_view_strategy="first") because the default reference selection drifts on low-texture generated videos.

Object masks come from Grounding DINO [18] and SAM2 [17]. Object meshes come from Hunyuan3D-2mv [19] using four RGBA crops sampled near 0°, 90°, 180°, and 270° around each object. The clean-background orbit is converted to a background mesh by DA3+3DGS export, gsplat rendering, and Open3D TSDF fusion.

Placement and finalization. Each object mesh is placed into the DA3 world frame by Sim3 ICP. We fit the support plane with RANSAC, rotate the scene upright, crop the background mesh, and lift each object slightly above the plane to avoid immediate intersection.

4. Experiments

We evaluate VR2S on four real-world tabletop scenes chosen to vary support geometry and object shape, comparing against OpenReal2Sim (OR2S) on identical held-out novel views. Section 4.1 describes the proxy evaluation protocol. Section 4.2 reports the main tradeoff: VR2S improves perceptual fidelity, while OR2S keeps stronger silhouette alignment. Section 4.3 examines where synthetic multi-view evidence helps. Section 4.4 studies the background-video and multi-view-object stages. Section 4.5 characterizes two systematic failure modes.

Scene 1

Scene 2

Scene 3

Scene 4

Veo3-generated orbital videos for the four evaluation scenes. These videos provide the synthetic multi-view evidence consumed by DA3 and the downstream VR2S pipeline.

4.1 Setup

Scenes. We evaluate on four real-world scenes that span a deliberate axis of support type and object geometry: a flat wooden table, a slanted deformable beanbag, a curved upholstered sofa cushion, and a grey floor with thin elongated objects. Scene compositions are shown in Figure 2. The setup is single-image for the populated scene in all cases. One protocol exception is worth stating up front: for Scene 2, the background branch uses a real captured clean-beanbag pass rather than a Gemini-inpainted clean image, because inpainting the deformable support produced unusable geometry. We keep that scene in the aggregate, but interpret background-branch results with this exception in mind.

Evaluation protocol. For each scene we capture a held-out walkaround video covering roughly 360° of the static scene. Running DA3 on this video recovers per-frame camera intrinsics, extrinsics, dense depth, and SAM2 object silhouettes. We use these as proxy evaluation targets, not as sensor-grade ground truth. This distinction matters because DA3 is also part of VR2S, and shared model bias could make some metrics more favorable to VR2S. We sample 16 evenly-spaced novel views from the recovered trajectory, of which 8 are back-side views with azimuth offset greater than 90° from the input photo's camera. Back-side views carry the load-bearing evaluation signal, since front-facing views can score well even on a depth-pancake reconstruction. Each method's reconstruction is Sim3-aligned to the proxy pointcloud before rendering, and the input view is excluded from scoring. This means the reported metrics evaluate reconstructed appearance and geometry after global alignment; they do not prove that the pipeline recovers deployable scene scale or placement without held-out information.

Metrics. For each held-out view, we compute per-object silhouette IoU against the SAM2 proxy mask, and masked LPIPS, PSNR, and SSIM against the corresponding held-out photo. Each photometric metric is reported separately for the foreground (the union of object masks) and the background regions. We aggregate to per-scene means. IoU is most sensitive to placement and scale, while LPIPS and PSNR better reflect back-side appearance and texture fidelity. We therefore read the metrics as complementary rather than expecting one method to dominate every column.

4.2 Headline comparison

Figure 2. The four evaluation scenes, spanning background type (flat wooden table, slanted beanbag, curved sofa cushion, grey floor) and object geometry (including thin elongated objects).

Method	Mean IoU (all) ↑	Mean IoU (back) ↑	FG LPIPS (back) ↓	BG LPIPS (all) ↓	FG PSNR (back) ↑
OR2S	0.562 ± 0.133	0.538 ± 0.145	0.701 ± 0.084	0.646 ± 0.108	11.36 ± 2.43
VR2S (no bg video)	0.449 ± 0.082	0.442 ± 0.109	0.594 ± 0.074	0.544 ± 0.085	11.35 ± 2.08
VR2S (full)	0.525 ± 0.092	0.509 ± 0.121	0.557 ± 0.053	0.601 ± 0.067	12.12 ± 2.51

Table 1. Headline comparison aggregated across all four scenes (mean ± per-scene std). Best value per column in bold. "back" restricts to the 8 held-out views with azimuth offset >90° from the input photo.

VR2S improves the perceptual metrics we care about most for novel-view reconstruction. Foreground LPIPS drops from 0.701 to 0.557, background LPIPS drops from 0.646 to 0.601, and foreground PSNR rises from 11.36 dB to 12.12 dB. The smaller per-scene standard deviations suggest that this is not driven by a single easy scene, although the dataset is too small for strong statistical claims. The background branch should be read more narrowly: it helps support geometry and contact reasoning in specific cases, especially when the relevant support surface is largely occluded, but it does not monotonically improve every background photometric metric. Qualitatively, the generated orbit gives VR2S evidence for sides and support regions that the input photo never observes, so back-side renders tend to look more complete.

OR2S performs better on silhouette IoU, scoring 0.562 to VR2S's 0.525 across all views, and 0.538 to 0.509 on back-side views. This is not a minor footnote; it is the main tradeoff. Because OR2S lifts the visible front surface directly from the input photo, each reconstructed object is anchored to the input silhouette by construction. VR2S reconstructs objects from generated multi-view evidence and then aligns them through Sim3 ICP. A few centimeters of placement or scale error can hurt IoU even when the rendered back side looks more plausible. The headline result is therefore mixed but useful: VR2S buys perceptual and back-side completeness at the cost of some placement-sensitive silhouette accuracy.

4.3 Where multi-view evidence helps

VR2S helps most when the generated orbit exposes information that is absent or ambiguous in the input image. Three cases show up repeatedly in the evaluation.

Non-planar support. Scene 2 uses a slanted beanbag rather than a flat table, exposing the limits of single-image back-projection. VR2S improves back-side IoU on this scene from 0.482 to 0.541. This result should be read with the setup caveat above: Scene 2 uses a real clean-background pass for the background branch, so it is evidence that support geometry matters, not a pure test of inpainted clean-background generation.

Object-background separation. Single-image methods must infer contact regions from one view, so objects and support surfaces can fuse near their boundary. VR2S reconstructs a clean-background orbit separately from foreground objects, then places each object onto the recovered surface. In Scene 1, this improves back-side PSNR from 11.65 dB to 13.31 dB on foreground regions and from 15.71 dB to 18.30 dB on background regions.

View-dependent surfaces. A single input view gives ambiguous evidence for glossy or curved objects. The clearest example is the Nintendo Switch display: VR2S uses four orbit crops, while OR2S sees only the front view. Switch back-side LPIPS improves from 0.899 to 0.544. Similar gains appear on the Scene 3 plate and balloon.

Aggregating per-object back-side LPIPS within each scene gives the same pattern. VR2S wins in every scene, with the largest gaps on Scenes 2 and 3 and smaller gains on the more box-like or flat cases.

Scene	OR2S ↓	VR2S ↓	Δ (VR2S − OR2S)
Scene 1 (laptop, mug, tissue)	0.613	0.534	−0.079
Scene 2 (book, tennis, earmuffs)	0.825	0.479	−0.347
Scene 3 (switch, plate, balloon)	0.776	0.539	−0.237
Scene 4 (box, umbrella, orange)	0.573	0.555	−0.018

Table 2. Per-scene mean of per-object back-side LPIPS (averaged across the objects in each scene; lower is better). Scene 4's blue paint bottle is excluded since it is fully occluded from every back-side view.

4.4 Ablation analysis

Two ablations isolate the pipeline components most directly tied to the multi-view-evidence hypothesis: the dedicated background video and the multi-view conditioning of the per-object mesh generator. Both ablations modify one stage of the full pipeline while keeping the orientation and snap fixes in icp_place and finalize intact. The goal is not to explain every metric, but to test whether the extra generated views help in the places the method expects them to help.

4.4.1 Background video. In the full pipeline, the background mesh is reconstructed from a dedicated clean-background orbit produced by Gemini-inpainting the input photo and feeding the result to the video model. In this ablation (noBgVid), we instead derive the background from the main video alone, masking out object pixels with dilated SAM masks before TSDF fusion.

Scene	VR2S (full) ↑	noBgVid ↑	Δ
1 (table)	0.584	0.571	−0.013
2 (beanbag)	0.541	0.528	−0.013
3 (sofa)	0.609	0.349	−0.260
4 (table)	0.305	0.320	+0.015

Table 3. Mean back-side IoU with and without the dedicated background video.

The background-video ablation is an indirect but useful test. Table 3 reports back-side object IoU, not a direct under-object surface metric, so it should not be overread as a complete measurement of support quality. Still, the pattern is consistent with the visual failure mode. On the mostly planar or externally supported scenes (1, 2, 4), removing the background video produces near-neutral changes in back-side IoU. Scene 3, on a curved sofa cushion, exhibits a 0.260 drop. The contact patch directly under each placed object is never observed in the main video, and the resulting background TSDF has a crater there. When finalization snaps objects to the local surface, they fall through or tilt into the geometry. The dedicated background orbit helps because it reconstructs the support before objects occlude it.

4.4.2 Multi-view conditioning of the mesh generator. The full pipeline feeds Hy3D-2mv four near-orthogonal RGBA crops drawn from the diffusion orbit. In this ablation (singleViewHy3d), we replace Hy3D-2mv with Hy3D-2 single-view, fed only the front SAM crop from the input view.

Scene	VR2S (full) ↑	singleViewHy3d ↑	Δ
1 (table)	0.584	0.565	−0.019
2 (beanbag)	0.541	0.573	+0.032
3 (sofa)	0.609	0.548	−0.060
4 (table)	0.305	0.349	+0.044

Table 4. Mean back-side IoU with and without multi-view conditioning of the per-object mesh generator.

The result is more nuanced than a simple "more views are better" hypothesis. Multi-view conditioning improves back-side IoU on Scene 3, where the curved cushion occludes shape-relevant parts of the objects in the input view, notably the balloon's tail and the underside of the plate. In that case, the multi-view crops supply evidence that the front view cannot. On Scenes 2 and 4, single-view Hy3D-2 outperforms the multi-view variant. The side and back SAM crops drawn from the diffusion orbit carry segmentation drift and silhouette noise, and that noise propagates into the shape model. For those scenes, the clean front crop alone produces a more reliable mesh.

The pure mesh-quality evaluation reinforces this pattern. Measured as per-object Chamfer distance under a free Sim3 alignment, singleViewHy3d produces the better mesh on three of four scenes:

Scene	VR2S (full) ↓	singleViewHy3d ↓	Better mesh
1	12.06 mm	10.38 mm	single-view
2	16.62 mm	15.21 mm	single-view
3	12.56 mm	18.62 mm	multi-view (+6 mm)
4	16.31 mm	16.07 mm	single-view

Table 5. Mean Chamfer distance to proxy meshes, decoupled from placement.

Multi-view conditioning therefore helps when the orbit's additional views contain shape-relevant geometric information that the front view cannot supply (Scene 3), and hurts when the additional views inject silhouette noise that the front view would not have had (Scenes 1, 2, 4). The video model contributes useful information for scene reconstruction, but the per-object mesh stage is sensitive to the quality of that information in a way that does not always favor multi-view input.

4.5 Failure modes

Two failure modes dominate our errors.

FM1: Video-diffusion artifacts. The generated orbit can drift from the input scene. The folded umbrella in Scene 4 is the clearest case: its silhouette shifts across frames, and the resulting multi-view mesh is worse than the single-view alternative. Umbrella back-side IoU is 0.380 for VR2S, 0.425 for OR2S, and 0.493 for the singleViewHy3d ablation.

FM2: Pose drift on inpainted backgrounds. The clean-background orbit is usually generated from a Gemini-inpainted image. When the inpainted image loses geometric constraints, DA3 pose estimates drift and TSDF fusion creates background artifacts. This is most visible in Scene 1. Scene 2 avoids the failure by using a real clean-beanbag pass, which is why its background results should not be treated as evidence that inpainted clean-background videos always work.

5. Discussion and limitations

The thesis partially holds. A general video model can supply useful missing views for real2sim reconstruction. VR2S improves LPIPS and PSNR on held-out views and gives the object mesh generator evidence for backs and sides the input photo never observed. It does not dominate every metric: OR2S still wins silhouette IoU, and single-view object meshing gives lower Chamfer distance on three of four scenes.

The main limitation is fidelity. Veo3 produces plausible orbits, but objects drift, surfaces deform, and lighting changes inconsistently. These artifacts propagate through DA3, SAM2, Hy3D-2mv, TSDF fusion, and ICP. Our evaluation is also small and proxy-based: DA3 and SAM2 produce the held-out trajectory and masks, and there is no real-robot deployment. The output should therefore be read as simulation-oriented reconstruction, not proven simulation readiness. Collision proxies, watertightness, physical scale, material properties, and robot policy transfer remain untested.

References

[1] S. Zhao, J. Mao, W. Chow, Z. Shangguan, T. Shi, R. Xue, Y. Zheng, Y. Weng, Y. You, D. Seita, L. Guibas, S. Zakharov, V. Guizilini, Y. Wang. Robot Learning from Any Images. CoRL 2025. arXiv:2509.22970.

[2] PointsCoder et al. OpenReal2Sim: A Toolbox for Real-to-Sim Reconstruction and Robotic Simulation. 2025. github.com/PointsCoder/OpenReal2Sim.

[3] P. Katara, Z. Xian, K. Fragkiadaki. Gen2Sim: Scaling Up Robot Learning in Simulation with Generative Models. ICRA 2024. arXiv:2310.18308.

[4] J. Xu, W. Cheng, Y. Gao, X. Wang, S. Gao, Y. Shan. InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models. 2024. arXiv:2404.07191.

[5] M. N. Qureshi, S. Garg, F. Yandun, D. Held, G. Kantor, A. Silwal. SplatSim: Zero-Shot Sim2Real Transfer of RGB Manipulation Policies Using Gaussian Splatting. 2024. arXiv:2409.10161.

[6] R. Wang, S. Xu, C. Dai, J. Xiang, Y. Deng, X. Tong, J. Yang. MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision. CVPR 2025. arXiv:2410.19115.

[7] S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, J. Revaud. DUSt3R: Geometric 3D Vision Made Easy. CVPR 2024. arXiv:2312.14132.

[8] V. Leroy, Y. Cabon, J. Revaud. Grounding Image Matching in 3D with MASt3R. ECCV 2024. arXiv:2406.09756.

[9] J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, D. Novotny. VGGT: Visual Geometry Grounded Transformer. CVPR 2025. arXiv:2503.11651.

[10] H. Lin, S. Chen, J. Liew, D. Y. Chen, Z. Li, G. Shi, J. Feng, B. Kang. Depth Anything 3: Recovering the Visual Space from Any Views. 2025. arXiv:2511.10647.

[11] Team Wan. Wan: Open and Advanced Large-Scale Video Generative Models. Alibaba, 2025. arXiv:2503.20314.

[12] Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, et al. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer. ICLR 2025. arXiv:2408.06072.

[13] Google DeepMind. Veo 3 Technical Report. 2025. deepmind.google/models/veo.

[14] Y. Xie, C.-H. Yao, V. Voleti, H. Jiang, V. Jampani. SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency. 2024. arXiv:2407.17470.

[15] Z. Wang, Z. Yuan, X. Wang, T. Chen, M. Xia, P. Luo, Y. Shan. MotionCtrl: A Unified and Flexible Motion Controller for Video Generation. SIGGRAPH 2024. arXiv:2312.03641.

[16] H. He, Y. Xu, Y. Guo, G. Wetzstein, B. Dai, H. Li, C. Yang. CameraCtrl: Enabling Camera Control for Text-to-Video Generation. 2024. arXiv:2404.02101.

[17] N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Radle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V. Alwala, N. Carion, C.-Y. Wu, R. Girshick, P. Dollar, C. Feichtenhofer. SAM 2: Segment Anything in Images and Videos. ICLR 2025. arXiv:2408.00714.

[18] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu, L. Zhang. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. ECCV 2024. arXiv:2303.05499.

[19] Z. Zhao et al. Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation. Tencent, 2025. arXiv:2501.12202.