From an Image to a Scene: Learning to Imagine the World from a Million 360 Videos (2412.07770v1)

Published 10 Dec 2024 in cs.CV and cs.LG

Abstract: Three-dimensional (3D) understanding of objects and scenes plays a key role in humans' ability to interact with the world and has been an active area of research in computer vision, graphics, and robotics. Large-scale synthetic and object-centric 3D datasets have been shown to be effective in training models with a 3D understanding of objects. However, applying a similar approach to real-world objects and scenes is difficult due to a lack of large-scale data. Videos are a potential source of real-world 3D data, but finding diverse yet corresponding views of the same content has proven difficult at scale. Furthermore, standard videos come with fixed viewpoints, determined at the time of capture. This restricts the ability to access scenes from a variety of more diverse and potentially useful perspectives. We argue that large-scale 360 videos can address these limitations by providing scalable corresponding frames from diverse views. In this paper, we introduce 360-1M, a 360 video dataset, and a process for efficiently finding corresponding frames from diverse viewpoints at scale. We train our diffusion-based model, Odin, on 360-1M. Empowered by the largest real-world, multi-view dataset to date, Odin is able to freely generate novel views of real-world scenes. Unlike previous methods, Odin can move the camera through the environment, enabling the model to infer the geometry and layout of the scene. Additionally, we show improved performance on standard novel view synthesis and 3D reconstruction benchmarks.

Summary

  • The paper introduces the ODIN model, trained on the large-scale 360-1M dataset of one million 360° videos, achieving state-of-the-art novel view synthesis and 3D reconstruction from single images.
  • ODIN utilizes viewpoint-conditioned diffusion models and scalable correspondence search on 360° video data, demonstrating improved performance over prior methods on benchmarks like DTU and Mip-NeRF 360.
  • Leveraging the diversity and scale of the 360-1M dataset enables more robust 3D scene understanding from complex real-world environments, with implications for AR/VR and robotics.

From an Image to a Scene: Learning to Imagine the World from a Million 360° Videos

The paper presents a significant contribution to the field of computer vision by tackling the challenges associated with three-dimensional (3D) understanding of real-world scenes from videos. It posits that leveraging a large-scale dataset of 360-degree videos can overcome the limitations of traditional video datasets, which offer fixed viewpoints and sparse correspondences. The authors introduce a novel approach for transforming 360° videos into usable multi-view data and demonstrate an efficient method for novel view synthesis (NVS) and 3D reconstruction.

The core innovation of the paper lies in its ODIN model, which is trained on 360-1M, the largest known real-world multi-view dataset, composed of over a million 360° videos collected from YouTube and covering an extensive range of diverse, real-world scenes. The use of diffusion models for novel view synthesis allows ODIN to freely generate novel views from a single input image, a significant advancement over previous methodologies that required dense multi-view datasets or known camera poses.

Methodology and Technical Approach

The authors address the bottleneck of finding high-quality frame correspondences by employing a scalable correspondence search that leverages the rotational freedom afforded by 360° videos. This is operationalized by sub-sampling frames and re-projecting the equirectangular panoramas into aligned perspective camera views. The use of Dust3R enables rapid and extensive generation of correspondences, which is essential for constructing the 360-1M dataset.
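To make the re-projection step concrete, the sketch below shows one common way to render a virtual perspective camera from an equirectangular frame using NumPy and OpenCV. The field of view, camera conventions, and parameter choices are illustrative assumptions, not the authors' released pipeline; the pairing of such views (e.g., with Dust3R) would happen downstream.

```python
# Minimal sketch (not the authors' code): extract a pinhole perspective view from an
# equirectangular 360° frame at a chosen yaw/pitch. Because a 360° frame contains the
# full sphere, many such virtual cameras can be rendered per frame before pairwise
# correspondences are estimated.
import numpy as np
import cv2


def equirect_to_perspective(equi, fov_deg=90.0, yaw_deg=0.0, pitch_deg=0.0, out_hw=(512, 512)):
    """Render a perspective view from an equirectangular panorama (H x W x 3)."""
    out_h, out_w = out_hw
    f = 0.5 * out_w / np.tan(np.radians(fov_deg) / 2.0)  # focal length in pixels

    # Pixel grid -> camera-space ray directions (x right, y down, z forward).
    xs, ys = np.meshgrid(np.arange(out_w), np.arange(out_h))
    dirs = np.stack([(xs - out_w / 2.0) / f,
                     (ys - out_h / 2.0) / f,
                     np.ones_like(xs, dtype=np.float64)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Rotate the rays by the requested yaw (around y) and pitch (around x).
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    Ry = np.array([[np.cos(yaw), 0, np.sin(yaw)], [0, 1, 0], [-np.sin(yaw), 0, np.cos(yaw)]])
    Rx = np.array([[1, 0, 0], [0, np.cos(pitch), -np.sin(pitch)], [0, np.sin(pitch), np.cos(pitch)]])
    dirs = dirs @ (Ry @ Rx).T

    # Ray directions -> longitude/latitude -> equirectangular pixel coordinates.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])           # [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))      # [-pi/2, pi/2]
    H, W = equi.shape[:2]
    u = ((lon / np.pi + 1.0) * 0.5 * W).astype(np.float32)
    v = ((lat / (np.pi / 2) + 1.0) * 0.5 * H).astype(np.float32)
    return cv2.remap(equi, u, v, interpolation=cv2.INTER_LINEAR, borderMode=cv2.BORDER_WRAP)
```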

The paper further explores enhancing NVS with viewpoint-conditioned diffusion, allowing novel view synthesis to be conditioned on both rotation and translation. Another notable addition is motion masking, which predicts and filters out dynamic scene elements that are difficult to reconstruct. This prevents moving objects from dominating the training loss, a significant concern when training on in-the-wild videos.
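The snippet below sketches how such a motion mask might enter a viewpoint-conditioned denoising objective in PyTorch. The `model`, `noise_schedule`, and conditioning arguments are hypothetical placeholders and do not reflect the actual ODIN architecture or training code.

```python
# Illustrative sketch only (assumed names, not the released training code): a denoising
# objective where a per-pixel motion mask zeroes out dynamic regions, so moving objects
# in in-the-wild videos do not penalize the model.
import torch
import torch.nn.functional as F


def masked_diffusion_loss(model, x_target, x_source, rel_pose, motion_mask, noise_schedule):
    """x_target/x_source: (B,3,H,W); rel_pose: relative rotation/translation conditioning;
    motion_mask: (B,1,H,W) with 1 = static, 0 = moving."""
    b = x_target.shape[0]
    t = torch.randint(0, noise_schedule.num_steps, (b,), device=x_target.device)
    noise = torch.randn_like(x_target)
    alpha_bar = noise_schedule.alpha_bar[t].view(b, 1, 1, 1)

    # Standard forward diffusion of the target view.
    x_noisy = alpha_bar.sqrt() * x_target + (1 - alpha_bar).sqrt() * noise

    # The network is conditioned on the source view and the relative camera pose.
    noise_pred = model(x_noisy, t, cond_image=x_source, cond_pose=rel_pose)

    # Per-pixel loss, weighted by the motion mask before reduction.
    per_pixel = F.mse_loss(noise_pred, noise, reduction="none")
    mask = motion_mask.expand_as(per_pixel)
    return (per_pixel * mask).sum() / mask.sum().clamp(min=1.0)
```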

Experimental Results and Evaluation

The authors conducted extensive benchmarking against established models for both novel view synthesis and 3D reconstruction. ODIN demonstrated improved performance on benchmarks such as DTU and Mip-NeRF 360, quantitatively outperforming existing methods like Zero-1-to-3 and ZeroNVS (LPIPS scores of 0.378 on DTU and 0.587 on Mip-NeRF 360). On complex, real-world scenes, ODIN was markedly better at maintaining geometric consistency over significant camera movements, showcasing its ability to generate plausible novel views of the full scene.
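For reference, LPIPS scores like those reported above can be computed with the publicly available `lpips` package, as sketched below; the backbone choice, input resolution, and evaluation protocol here are assumptions and may differ from the paper's exact setup.

```python
# Sketch of an LPIPS computation with the public `lpips` package; not the paper's
# evaluation harness.
import torch
import lpips

loss_fn = lpips.LPIPS(net="alex")  # AlexNet backbone shown here; some benchmarks use VGG


def lpips_score(pred, target):
    """pred/target: float tensors in [0, 1], shape (N, 3, H, W)."""
    # The lpips package expects inputs scaled to [-1, 1].
    with torch.no_grad():
        return loss_fn(pred * 2 - 1, target * 2 - 1).mean().item()
```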

The approach to 3D reconstruction, which applies Dust3R to ODIN-generated images along camera trajectories, proves effective, as evidenced by competitive Chamfer Distance and IoU metrics on the Google Scanned Objects benchmark.
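The two reconstruction metrics mentioned above can be computed as in the following sketch; point sampling, voxelization, and normalization choices are illustrative rather than the paper's exact evaluation protocol.

```python
# Hedged sketch of the two reconstruction metrics: symmetric Chamfer distance between
# sampled point clouds, and volumetric IoU over boolean occupancy grids.
import numpy as np
from scipy.spatial import cKDTree


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between (N,3) and (M,3) point clouds."""
    d_ab, _ = cKDTree(points_b).query(points_a)  # nearest neighbour in B for each point of A
    d_ba, _ = cKDTree(points_a).query(points_b)  # nearest neighbour in A for each point of B
    return d_ab.mean() + d_ba.mean()


def voxel_iou(occ_a, occ_b):
    """IoU of two boolean occupancy grids of identical shape."""
    inter = np.logical_and(occ_a, occ_b).sum()
    union = np.logical_or(occ_a, occ_b).sum()
    return inter / max(union, 1)
```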

Implications and Future Directions

The implications of these findings are multifaceted. Practically, the scalability and diversity of the 360-1M dataset open up opportunities for more sophisticated AR/VR applications and innovations in robotic navigation. Theoretically, the integration of large-scale, real-world data into 3D scene understanding affirms the potential for advances in AI's perception of complex environments. However, despite its strengths, the approach does not yet fully model dynamic scene elements, pointing towards future work on 4D scene modeling.

The paper concludes with a discussion of broader societal impacts, acknowledging potential misuse such as the creation of fake images. The open-sourcing of the model, dataset, and code promotes transparency and reproducibility within the research community, offering a foundation for further innovation in AI-driven 3D modeling techniques.
