- The paper presents a direct synthesis method that bypasses explicit depth estimation to convert monocular videos into immersive stereo views.
- It employs a two-stage pipeline with a low-resolution base generator and a high-resolution refiner to ensure accurate global layouts and fine details.
- Quantitative studies and user evaluations show Eye2Eye outperforms traditional methods, with 66% user preference and superior iSQoE scores in challenging reflective and transparent scenes.
The paper "Eye2Eye: A Simple Approach for Monocular-to-Stereo Video Synthesis" (2505.00135) presents a method for generating a stereoscopic 3D video from a single monocular video input. The goal is to synthesize a corresponding video from a horizontally shifted viewpoint (e.g., the left eye view from a right eye input) to enable immersive 3D viewing experiences using VR headsets or 3D displays.
Traditional approaches to mono-to-stereo conversion typically involve a multi-stage pipeline (sketched in code after the list):
- Estimate video disparity or depth for the input video.
- Warp the video based on the estimated geometry to produce the second view.
- Inpaint the disoccluded regions that become visible in the new viewpoint.
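Below is a minimal, self-contained sketch of this baseline for a single frame, using NumPy. The disparity formula, sign convention, lack of z-buffering, and nearest-neighbor hole filling are illustrative simplifications, not any specific paper's implementation.

```python
# Hedged sketch of the warp-and-inpaint baseline (single frame, NumPy only).
import numpy as np

def warp_and_inpaint(frame_right, depth, baseline=0.06, focal=500.0):
    """Warp a right-eye frame to a left-eye view using a per-pixel depth map."""
    h, w, _ = frame_right.shape
    # Step 1: convert depth to horizontal disparity (closer pixels shift more).
    disparity = np.round(baseline * focal / np.clip(depth, 1e-3, None)).astype(int)

    frame_left = np.zeros_like(frame_right)
    filled = np.zeros((h, w), dtype=bool)

    # Step 2: forward-warp each pixel horizontally by its disparity
    # (no occlusion handling / z-buffering in this simplified version).
    for y in range(h):
        for x in range(w):
            x_new = x + disparity[y, x]
            if 0 <= x_new < w:
                frame_left[y, x_new] = frame_right[y, x]
                filled[y, x_new] = True

    # Step 3: "inpaint" disoccluded holes by propagating the nearest filled pixel.
    for y in range(h):
        for x in range(1, w):
            if not filled[y, x] and filled[y, x - 1]:
                frame_left[y, x] = frame_left[y, x - 1]
                filled[y, x] = True
    return frame_left
```

Note how every pixel is assigned exactly one disparity value; this is precisely the assumption that breaks down for reflections and transparency, as discussed next.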
The paper highlights a fundamental limitation of this warp-and-inpaint approach: it assumes a single, well-defined depth for each pixel. This assumption breaks down in scenes with complex light transport, such as specular reflections or transparent surfaces. In such cases, a single depth map cannot correctly represent the scene (e.g., the depth of a reflective surface vs. the virtual depth of the reflected object). Warping based on a single depth leads to visual artifacts and incorrect 3D perception, where reflections might appear "pasted" flat onto the surface rather than at their true virtual depth.
Eye2Eye proposes a different strategy: directly synthesizing the target view (the left eye video) from the input view (the right eye video), bypassing explicit disparity estimation and warping steps. This direct synthesis leverages the implicit priors about geometry, object materials, optics, and semantics learned by large, pre-trained generative video models. The method is built on top of Lumiere [bartal2024lumiere], a cascaded text-to-video diffusion model.
The Eye2Eye framework adapts the pre-trained Lumiere model, which originally takes text as input, to accept a video as an additional conditioning input. The process involves two main components:
- Base Eye2Eye Generator: A fine-tuned version of Lumiere's low-resolution model. It is trained on downsampled (128x128) rectified stereo video pairs from the Stereo4D dataset [jin2024stereo4d]. This model learns to generate a low-resolution left view conditioned on the low-resolution right view and a text caption. It establishes the correct global layout and the appropriate scale of stereo disparity required for a compelling 3D effect.
```python
# Pseudocode for Base Eye2Eye generation (inference)
input_video_right_low_res = downsample(input_video_right, target_res=128)
text_caption = get_caption(input_video_right)  # e.g., using BLIP2
noise = random_noise_like(input_video_right_low_res)

# The base Eye2Eye model denoises pure noise (starting from the final diffusion
# timestep T), conditioned on the low-res right view and the text caption.
generated_video_left_base = base_eye2eye_generator(
    noise, timestep_T, input_video_right_low_res, text_caption
)
```
- Eye2Eye Refiner: Another fine-tuned instance of the base Lumiere model, trained on 128x128 crops taken from high-resolution (512x512) videos in the Stereo4D dataset. This model excels at generating fine details and inpainting disoccluded regions at higher resolution. However, when sampling at full resolution, a model trained only on crops can exhibit a bias towards uniform shifts, which reduces the perceived 3D depth.
To combine the strengths of both models, Eye2Eye uses a two-stage inference pipeline similar to SDEdit [SDEdit]:
- Generate an initial low-resolution left-view video using the Base Eye2Eye Generator (sampled at 256x256, a resolution found to balance visual quality and the strength of the 3D effect).
- Upsample this low-resolution output to the target high resolution (e.g., 512x512).
- Add noise to the upsampled video up to a certain diffusion timestep t (e.g., t=0.9).
- Denoise this noised, upsampled video using the Eye2Eye Refiner, conditioned on the original high-resolution right view and the text caption.
```python
# Pseudocode for Eye2Eye refinement (inference)
generated_video_left_base = generate_low_res(input_video_right_low_res, text_caption)  # stage 1
upsampled_video_left_base = upsample(generated_video_left_base, target_res=512)

# Add noise to the upsampled base output (SDEdit-style partial noising).
timestep_t = 0.9  # diffusion timestep at which denoising starts
noisy_upsampled_video = add_noise(upsampled_video_left_base, timestep_t)

# The refiner denoises from timestep_t, conditioned on the high-res right view and text.
final_video_left = eye2eye_refiner(
    noisy_upsampled_video, timestep_t, input_video_right_high_res, text_caption
)
```
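The `add_noise` step above is the standard SDEdit-style forward diffusion. A minimal sketch follows, assuming a variance-preserving process with a continuous timestep in [0, 1] and an illustrative cosine noise schedule; the exact schedule used by Lumiere is not specified here.

```python
# Hedged sketch of SDEdit-style noising to an intermediate timestep t.
import math
import torch

def add_noise(video, t):
    """Noise a clean video tensor to continuous diffusion timestep t (t=1 -> pure noise)."""
    alpha_bar = math.cos(0.5 * math.pi * t) ** 2   # illustrative cosine schedule (assumption)
    noise = torch.randn_like(video)
    return math.sqrt(alpha_bar) * video + math.sqrt(1.0 - alpha_bar) * noise
```

At t = 0.9 most of the signal is replaced by noise, so the refiner is free to resynthesize fine details and disoccluded content, while the coarse layout and disparity scale from the base output still survive the noising.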
This two-stage approach ensures that the global scene structure and correct disparity scale are determined by the base model's low-resolution output, while the refiner adds high-quality details and handles inpainting at the target resolution.
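As noted above, Eye2Eye adapts Lumiere to take the right-eye video as an additional conditioning input; how that conditioning is wired into the backbone is not detailed here. A common pattern for video-conditioned diffusion models is to concatenate the conditioning frames with the noisy target frames along the channel axis, sketched below purely as an assumption rather than Eye2Eye's actual mechanism (the `base_unet` call signature is hypothetical).

```python
# Hedged sketch of channel-wise video conditioning for a diffusion denoiser.
import torch
import torch.nn as nn

class VideoConditionedDenoiser(nn.Module):
    def __init__(self, base_unet: nn.Module, channels: int = 3):
        super().__init__()
        self.base_unet = base_unet
        # New input projection accepting noisy target + conditioning video (2x channels).
        self.in_proj = nn.Conv3d(2 * channels, channels, kernel_size=1)

    def forward(self, noisy_left, t, cond_right, text_emb):
        # Concatenate along channels: (B, 2C, T, H, W) -> project back to C channels.
        x = torch.cat([noisy_left, cond_right], dim=1)
        x = self.in_proj(x)
        # Hypothetical signature for the underlying text-conditioned video U-Net.
        return self.base_unet(x, t, text_emb)
```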
The training data is curated from the Stereo4D dataset [jin2024stereo4d], consisting of over 100K rectified stereo VR180 videos. Videos with excessive disparity are filtered out. BLIP2 [li2023blip] is used for captioning. Training details include fine-tuning Lumiere for 120K steps with a batch size of 32. For the base generator, videos are downsampled to 128x128. For the refiner, random 128x128 crops from 512x512 videos are used.
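A hedged sketch of how one training example might be assembled for each model, based on the resolutions reported above; the tensor layout, preprocessing helpers, and the choice of cropping both views at the same location (which keeps the rectified horizontal disparity intact) are assumptions.

```python
# Hedged sketch of per-example preprocessing for the base generator and refiner.
import torch
import torch.nn.functional as F

def prepare_base_example(video_left, video_right):
    """Downsample a rectified stereo pair to 128x128 for the base generator."""
    def down(v):  # assumed layout: (frames, channels, H, W)
        return F.interpolate(v, size=(128, 128), mode="bilinear", align_corners=False)
    return down(video_left), down(video_right)   # (target, conditioning)

def prepare_refiner_example(video_left, video_right, crop=128):
    """Take the same random 128x128 crop from both views of a 512x512 pair."""
    _, _, h, w = video_right.shape
    y = torch.randint(0, h - crop + 1, (1,)).item()
    x = torch.randint(0, w - crop + 1, (1,)).item()
    return (video_left[..., y:y + crop, x:x + crop],
            video_right[..., y:y + crop, x:x + crop])
```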
The method was evaluated qualitatively and quantitatively against baselines including a re-implemented warp-and-inpaint method using Lumiere, StereoCrafter [stereocrafter], Deep3D [Xie2016Deep3DFA], and Dynamic Gaussian Marbles (DGM) [Stearns_2024]. Evaluations were performed on a held-out test set featuring challenging scenes with reflections and transparencies.
Qualitative comparisons demonstrated that Eye2Eye successfully handles complex light transport scenarios, correctly shifting layered content (like a transparent umbrella vs. the building behind it, or a reflection vs. the surface) according to the depth of each element. Warp-and-inpaint and DGM baselines struggled with these cases, exhibiting distortions or incorrect depth perception. DGM also produced visible holes due to its lack of a generative prior for inpainting.
Quantitative evaluation included:
- User Study: A two-alternative forced-choice (2AFC) study in VR comparing Eye2Eye against the warp-and-inpaint baseline. Participants favored Eye2Eye's more realistic 3D effect, especially in reflective and transparent regions, in 66% of overall judgments and for 74% of the evaluated videos (statistically significant).
- iSQoE Metric: The iSQoE [tamir2024makesgoodstereoscopicimage] stereoscopic quality metric was used. Eye2Eye achieved higher average scores on 84% of test videos compared to StereoCrafter and 74% compared to the re-implemented warp-and-inpaint baseline.
The paper concludes that the direct synthesis approach, powered by large pre-trained video diffusion models and trained on real-world stereo data, is effective for mono-to-stereo conversion, particularly overcoming challenges posed by complex light transport that traditional depth-based methods cannot handle. Acknowledged limitations include the lack of explicit control over the camera baseline, which affects the strength of the 3D effect. Future work could explore methods to dynamically adjust the baseline.