UniStereo: Unified Stereo Dataset

Updated 21 December 2025
  • UniStereo is a unified dataset offering both parallel and converged stereo video formats for comprehensive, reproducible research.
  • It comprises three diverse subsets with standardized 81-frame clips and paired textual captions to support multi-modal studies.
  • The dataset underpins standardized benchmarking protocols for mono-to-stereo conversion, with models trained on it showing substantial gains in fidelity and inference speed.

UniStereo is the first large-scale, unified dataset designed for stereo video conversion research, explicitly constructed to support both parallel and converged (toe-in) stereo formats. It enables fair benchmarking and robust training of monocular-to-stereo video conversion models for applications such as virtual reality (VR) and 3D cinema. UniStereo addresses limitations of previous datasets by providing format-diverse, standardized training and evaluation splits, paired with per-clip textual captions to support multi-modal research and avoid evaluation mismatches across different stereo geometries (Shen et al., 18 Dec 2025).

1. Motivation, Design, and Objectives

UniStereo responds to the rapidly increasing demand for high-quality stereo content driven by widespread adoption of stereoscopic displays. Traditional 3D video acquisition remains expensive, while manual 2D-to-3D workflows are labor-intensive. Contemporary automatic approaches—dominated by the depth-warp-inpaint pipeline—suffer from error propagation, intrinsic depth ambiguity, and lack of robustness to differing stereo geometries. Previous methods have been limited by reliance on private datasets or on content in a single stereo format, undermining the reproducibility and comparability of results.

UniStereo was constructed with three main objectives:

  • Provide a unified corpus for both parallel and converged stereo geometries.
  • Supply standardized test splits for each format to ensure consistent evaluation.
  • Include paired textual captions to advance text-conditioned mono-to-stereo research (Shen et al., 18 Dec 2025).

2. Dataset Composition and Structure

UniStereo comprises three subsets: Stereo4D (parallel), 3DMovie (converged), and UE5-Syn (synthetic, parallel). Each subset contributes diverse and complementary data sources, with all clips standardized to 81 frames (≈5 seconds at 16 fps) at 832×480 resolution.

| Subset | Source | Format | Approx. Clips | Training / Test Clips | Domains / Notables |
|---|---|---|---|---|---|
| Stereo4D | VR180 YouTube (~7,000 videos) | Parallel | ~100,000 | 58,000 / 400 | Indoor/outdoor, moving/static, specular |
| 3DMovie | 142 commercial 3D films | Converged | ~44,900 | 44,879 / 400 | Live-action, animation, cinematic scenes |
| UE5-Syn | Unreal Engine 5 synthetic | Parallel | 200 | n/a / 200 | 28 animal classes, day/night, synthetic |

Each clip is a fixed 81-frame segment, yielding approximately 103,000 stereo pairs in total. The dataset's domain diversity includes indoor/outdoor environments, dynamic/static content, and a balance of reflective and transparent surfaces, challenging models to generalize across varied parallax and scene content.
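
As an illustration of this standardized structure, the following minimal sketch shows how such clips might be loaded; the directory layout and file names are hypothetical, not the dataset's published interface.

```python
import json
from pathlib import Path

import numpy as np

CLIP_FRAMES = 81          # fixed clip length
WIDTH, HEIGHT = 832, 480  # standardized resolution

def load_clip(clip_dir: Path):
    """Load one stereo clip and its caption (hypothetical layout)."""
    # Hypothetical file names; the released dataset may be organized differently.
    left = np.load(clip_dir / "left.npy")    # (81, 480, 832, 3) uint8 frames
    right = np.load(clip_dir / "right.npy")  # (81, 480, 832, 3) uint8 frames
    caption = json.loads((clip_dir / "caption.json").read_text())["caption"]
    assert left.shape == (CLIP_FRAMES, HEIGHT, WIDTH, 3)
    return left, right, caption
```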

3. Data Acquisition and Preprocessing Pipeline

Stereo4D (Parallel)

  • Sourced from ~7,000 VR180 YouTube videos; each clip includes camera-to-world matrices and rectification parameters.
  • Rectified to perspective views (HFOV = 90°) and cropped to 832×480.
  • Downsampled to 16 fps and sliced into 81-frame clips; short remainders are discarded (see the resampling sketch after this list).
  • Captions generated for left views using ShareGPT4Video (Shen et al., 18 Dec 2025).
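
A minimal sketch of the temporal standardization step, assuming already-decoded frames with a known source frame rate; nearest-timestamp frame selection is one plausible way to realize the 16 fps downsampling described above.

```python
import numpy as np

def resample_and_slice(frames: np.ndarray, src_fps: float,
                       dst_fps: float = 16.0, clip_len: int = 81):
    """Downsample a frame sequence to dst_fps by nearest-timestamp
    selection, then cut it into fixed-length clips; short remainders
    are discarded, mirroring the Stereo4D preprocessing."""
    n_src = len(frames)
    duration = n_src / src_fps
    n_dst = int(duration * dst_fps)
    # Index of the source frame nearest to each target timestamp.
    idx = np.minimum(
        np.round(np.arange(n_dst) / dst_fps * src_fps).astype(int),
        n_src - 1,
    )
    resampled = frames[idx]
    n_clips = len(resampled) // clip_len
    return [resampled[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]
```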

3DMovie (Converged)

  • Manually curated 142 true converged films in side-by-side (SBS) format; excludes pseudo-stereo and top-bottom layouts.
  • Clips segmented using PySceneDetect; only segments ≥81 frames retained.
  • Manual trimming removes credits/overlays; strict odd-indexed sampling reduces redundancy.
  • Symmetric border cropping eliminates visual artifacts.
  • Frames resized to 832×480; left-view captions created analogously.

3DMovie assumes temporal alignment from its SBS encoding, so no further calibration is required. Stereo4D provides per-clip geometric calibration data, enabling on-the-fly computation of disparity from geometry when required. A sketch of the scene-segmentation step follows.
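
The segmentation step can be approximated with PySceneDetect's documented detect API, as in the sketch below; the ≥81-frame filter matches the description above, while the alternate-segment subsampling is only one plausible reading of the "odd-indexed sampling" step.

```python
from scenedetect import detect, ContentDetector

MIN_FRAMES = 81  # a segment must yield at least one full 81-frame clip

def usable_scenes(video_path: str):
    """Detect scene cuts and keep only segments long enough to use."""
    scenes = detect(video_path, ContentDetector())
    long_enough = [
        (start, end) for start, end in scenes
        if end.get_frames() - start.get_frames() >= MIN_FRAMES
    ]
    # One plausible reading of "odd-indexed sampling": keep every
    # other qualifying segment to reduce redundancy between clips.
    return long_enough[::2]
```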

4. Annotations, Quality Assurance, and Distribution

Annotations consist of paired left/right stereo video clips, which serve as ground-truth for supervised learning. While no explicit pixelwise depth maps or disparity maps are provided, Stereo4D’s geometric metadata allows derivation for evaluation purposes. Left-eye captions, obtained via ShareGPT4Video, offer a foundation for text-conditioned baseline and multi-modal task development.

Quality control includes:

  • Manual verification of converged geometry in 3DMovie to prevent mislabeling.
  • Automated filtration of non-informative segments (e.g., logos, credits).
  • Uniform, symmetric border cropping to maintain data integrity (Shen et al., 18 Dec 2025).

Distribution is managed via a public repository (https://github.com/KlingTeam/StereoPilot), with download scripts enabling retrieval of video sources and metadata; users acquire raw assets using tools such as yt-dlp or from institutional archives.
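
Retrieval of the raw sources can be scripted around yt-dlp, as in the minimal wrapper below; the one-URL-per-line manifest format is an assumption for illustration, not the repository's documented interface.

```python
import subprocess
from pathlib import Path

def fetch_sources(url_file: str, out_dir: str = "raw_videos"):
    """Download source videos listed one URL per line (hypothetical
    manifest format) using the yt-dlp command-line tool."""
    Path(out_dir).mkdir(exist_ok=True)
    for url in Path(url_file).read_text().splitlines():
        if not url.strip():
            continue
        subprocess.run(
            ["yt-dlp", "-o", f"{out_dir}/%(id)s.%(ext)s", url.strip()],
            check=True,
        )
```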

5. Geometric Formulation and Statistical Properties

Stereo Geometry

Parallel

  • $x_l = f X / Z$, $x_r = f (X - B) / Z$
  • Disparity: $d = x_l - x_r = f B / Z$, hence $Z = f B / d$ (see the sketch following this list)
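
These relations invert directly. A small numeric sketch, assuming the focal length derived in the statistics below ($f \approx 416$ px) and an illustrative 7 cm baseline from the stated 6–8 cm range:

```python
def disparity_from_depth(depth_m: float, f_px: float = 416.0,
                         baseline_m: float = 0.07) -> float:
    """Parallel-rig disparity d = f * B / Z, in pixels.
    f = 416 px and B = 7 cm are illustrative values consistent with the
    dataset statistics (HFOV 90 deg at 832 px width, baseline 6-8 cm)."""
    return f_px * baseline_m / depth_m

def depth_from_disparity(d_px: float, f_px: float = 416.0,
                         baseline_m: float = 0.07) -> float:
    """Inverse relation Z = f * B / d, in meters."""
    return f_px * baseline_m / d_px

# Example: a point 1 m from the rig yields roughly 29 px of disparity.
print(disparity_from_depth(1.0))  # ~29.1
```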

Converged

  • Cameras rotated by $\theta$ about the vertical axis, with baseline $B$
  • Left camera: $[X_L; Y_L; Z_L] = R(-\theta)\,[X + B/2;\ Y;\ Z]$; right camera: $[X_R; Y_R; Z_R] = R(+\theta)\,[X - B/2;\ Y;\ Z]$
  • $x_l = f X_L / Z_L$, $x_r = f X_R / Z_R$
  • Rotation matrix (see the projection sketch following this list): $R(\theta) = \begin{bmatrix} \cos\theta & 0 & -\sin\theta \\ 0 & 1 & 0 \\ \sin\theta & 0 & \cos\theta \end{bmatrix}$
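
A runnable sketch implementing these formulas literally. Under the stated sign convention, a toe-in angle of $\theta = -\arctan(B / (2 Z_c))$ places the zero-disparity plane at depth $Z_c$; the values used below ($B$ = 7 cm, $Z_c$ = 5 m, $f$ = 416 px) are illustrative choices consistent with the statistics that follow, not parameters published with the dataset.

```python
import numpy as np

def R(theta: float) -> np.ndarray:
    """Rotation about the vertical axis, as defined above."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, -s],
                     [0.0, 1.0, 0.0],
                     [s, 0.0, c]])

def project_converged(point, B: float, theta: float, f: float):
    """Pinhole projection into a converged (toe-in) stereo pair,
    implementing the formulas above: each camera is offset by +-B/2
    and its points are rotated by R(-+theta) before projection."""
    p = np.asarray(point, dtype=float)
    pl = R(-theta) @ (p + np.array([B / 2, 0.0, 0.0]))
    pr = R(+theta) @ (p - np.array([B / 2, 0.0, 0.0]))
    return f * pl[0] / pl[2], f * pr[0] / pr[2]

# Illustrative check: with convergence set at Zc = 5 m, a point on the
# optical axis at that depth lands on the zero-disparity plane.
B, Zc, f = 0.07, 5.0, 416.0
theta = -np.arctan(B / (2 * Zc))
xl, xr = project_converged((0.0, 0.0, Zc), B, theta, f)
print(xl - xr)  # ~0 px
```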

Statistical Highlights

  • Baseline $B$ in Stereo4D: 6–8 cm, with per-clip rectification supplied.
  • 3DMovie convergence angles are distributed so that cinematic zero-disparity planes fall at roughly 5–10 m.
  • Disparity spans approximately $\pm 32$ px at the prescribed resolution.
  • Focal length $f \approx 416$ px, from $f = (832/2)/\tan(90^\circ/2)$ (see the worked computation after this list).
  • Scene types: even split between static/dynamic, indoor/outdoor; moving objects, specular materials, and complex lighting are well-represented (Shen et al., 18 Dec 2025).
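
These figures connect through the pinhole relations above. A short worked computation, assuming a representative 7 cm baseline from the stated 6–8 cm range:

```python
import math

f = (832 / 2) / math.tan(math.radians(90 / 2))  # 416.0 px, as stated above
B = 0.07                                         # representative baseline (m)

# For a parallel rig, the +-32 px disparity extreme corresponds to a
# nearest recoverable depth of Z = f * B / d.
z_near = f * B / 32
print(f, z_near)  # 416.0 px, ~0.91 m
```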

6. Benchmarking Protocols and Impact

UniStereo enables fair, cross-format evaluation through standardized test splits and dataset-wide diversity. The introduction of a learnable “domain switcher” mechanism in associated models such as StereoPilot allows end-to-end adaptation to both stereo geometries. Evaluation protocols follow the prescribed splits: 58,000 Stereo4D clips + 44,879 3DMovie clips for training, and 400 reserved clips per subset for testing (no predefined validation split is published).

Baseline Metrics and Results

Metrics: SSIM, MS-SSIM, PSNR, LPIPS, SIOU, and inference latency per 81-frame clip on a single GPU.

| Method | SSIM (Par/Conv) | PSNR (Par/Conv) | LPIPS (Par/Conv) | SIOU (Par/Conv) | Latency |
|---|---|---|---|---|---|
| StereoDiffusion | 0.642 / 0.678 | 20.54 / 20.70 | 0.245 / 0.341 | 0.252 / 0.181 | 60 min |
| StereoCrafter | 0.553 / 0.706 | 17.67 / 23.79 | 0.298 / 0.203 | 0.226 / 0.213 | 1 min |
| Mono2Stereo | 0.649 / 0.795 | 20.89 / 25.76 | 0.222 / 0.191 | 0.241 / 0.201 | 15 min |
| StereoPilot | 0.861 / 0.837 | 27.74 / 27.86 | 0.087 / 0.122 | 0.408 / 0.260 | 11 s |
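
The fidelity metrics in this table are standard and can be reproduced per clip with torchmetrics, as in the sketch below; SIOU and MS-SSIM are omitted here, and the sketch assumes synthesized and ground-truth right views supplied as float tensors in [0, 1].

```python
import torch
from torchmetrics.image import (
    StructuralSimilarityIndexMeasure,
    PeakSignalNoiseRatio,
    LearnedPerceptualImagePatchSimilarity,
)

ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
psnr = PeakSignalNoiseRatio(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)

def score_clip(pred: torch.Tensor, gt: torch.Tensor) -> dict:
    """Average frame-level fidelity metrics over one 81-frame clip.
    pred, gt: (81, 3, 480, 832) float tensors in [0, 1] holding the
    synthesized and ground-truth right views."""
    return {
        "ssim": ssim(pred, gt).item(),
        "psnr": psnr(pred, gt).item(),
        "lpips": lpips(pred, gt).item(),
    }
```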

StereoPilot, trained on UniStereo, demonstrates leading fidelity and perceptual alignment: PSNR improves by roughly 7 dB over the prior state of the art, and LPIPS is cut by more than half. Inference time drops to 11 seconds per clip, versus minutes to an hour for previous approaches. These results indicate that UniStereo is a key resource for advancing real-time, robust, and generalizable mono-to-stereo models (Shen et al., 18 Dec 2025).

7. Significance and Research Implications

UniStereo establishes the first framework for equitable evaluation of mono-to-stereo conversion across both prevailing geometric conventions in stereoscopic video. Its careful design (unified geometry coverage, exhaustive preprocessing, domain-balanced splits, and multi-modal annotations) directly addresses format-specific biases that previously hampered the field. The dataset lowers the barrier to reproducible benchmarking, catalyzing robust model development (as in the StereoPilot framework) and facilitating new research on text-conditioned conversion and generalization across diverse content domains. The public data release, detailed metadata, and alignment with open-source evaluation protocols are designed to advance scientific rigor and comparability in stereo video research (Shen et al., 18 Dec 2025).
