UniStereo: Unified Stereo Dataset
- UniStereo is a unified dataset offering both parallel and converged stereo video formats for comprehensive, reproducible research.
- It comprises three diverse subsets with standardized 81-frame clips and paired textual captions to support multi-modal studies.
- The dataset underpins standardized benchmarking protocols for mono-to-stereo conversion and has enabled models with substantially improved fidelity and inference speed.
UniStereo is the first large-scale, unified dataset designed for stereo video conversion research, explicitly constructed to support both parallel and converged (toe-in) stereo formats. It enables fair benchmarking and robust training of monocular-to-stereo video conversion models for applications such as virtual reality (VR) and 3D cinema. UniStereo addresses limitations of previous datasets by providing format-diverse, standardized training and evaluation splits, paired with per-clip textual captions to support multi-modal research and avoid evaluation mismatches across different stereo geometries (Shen et al., 18 Dec 2025).
1. Motivation, Design, and Objectives
UniStereo responds to the rapidly increasing demand for high-quality stereo content driven by widespread adoption of stereoscopic displays. Traditional 3D video acquisition remains expensive, while manual 2D-to-3D workflows are labor-intensive. Contemporary automatic approaches—dominated by the depth-warp-inpaint pipeline—suffer from error propagation, intrinsic depth ambiguity, and lack of robustness to differing stereo geometries. Previous methods have been limited by reliance on private datasets or on content in a single stereo format, undermining the reproducibility and comparability of results.
UniStereo was constructed with three main objectives:
- Provide a unified corpus for both parallel and converged stereo geometries.
- Supply standardized test splits for each format to ensure consistent evaluation.
- Include paired textual captions to advance text-conditioned mono-to-stereo research (Shen et al., 18 Dec 2025).
2. Dataset Composition and Structure
UniStereo comprises three subsets: Stereo4D (parallel), 3DMovie (converged), and UE5-Syn (synthetic, parallel). Each subset contributes diverse and complementary data sources, with all clips standardized to 81 frames (roughly 5 seconds at 16 fps) at 832×480 resolution.
| Subset | Source | Format | Approx. Clips | Training / Test Clips | Domains / Notables |
|---|---|---|---|---|---|
| Stereo4D | VR180 YouTube (~7,000 videos) | Parallel | ~100,000 | 58,000 / 400 | Indoor/outdoor, moving/static, specular |
| 3DMovie | 142 commercial 3D films | Converged | ~44,900 | 44,879 / 400 | Live-action, animation, cinematic scenes |
| UE5-Syn | Unreal Engine 5 synthetic | Parallel | 200 | n/a / 200 | 28 animal classes, day/night, synthetic |
Each clip is a fixed 81-frame segment, and the corpus totals approximately 103,000 stereo pairs. The dataset's domain diversity includes indoor/outdoor environments, dynamic/static content, and a balance of reflective and transparent surfaces, challenging models to generalize across varied parallax and scene content.
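For concreteness, a minimal loading sketch is shown below. The directory layout, file names (left.mp4, right.mp4, caption.json), and caption schema are assumptions made for illustration; the released repository defines its own structure.

```python
# Minimal sketch of a clip loader for UniStereo-style data.
# File layout, field names, and the caption JSON format are assumptions,
# not the released repository's actual structure.
import json
from pathlib import Path

from torch.utils.data import Dataset
from torchvision.io import read_video


class StereoClipDataset(Dataset):
    """Loads paired left/right 81-frame clips plus a left-view caption."""

    def __init__(self, root: str, split: str = "train", num_frames: int = 81):
        self.root = Path(root) / split
        self.num_frames = num_frames
        # Assumed layout: <root>/<split>/<clip_id>/{left.mp4, right.mp4, caption.json}
        self.clip_dirs = sorted(d for d in self.root.iterdir() if d.is_dir())

    def __len__(self):
        return len(self.clip_dirs)

    def __getitem__(self, idx):
        d = self.clip_dirs[idx]
        left, _, _ = read_video(str(d / "left.mp4"), output_format="TCHW", pts_unit="sec")
        right, _, _ = read_video(str(d / "right.mp4"), output_format="TCHW", pts_unit="sec")
        caption = json.loads((d / "caption.json").read_text())["caption"]
        # Clips are standardized to 81 frames at 832x480; truncate defensively.
        left = left[: self.num_frames].float() / 255.0
        right = right[: self.num_frames].float() / 255.0
        return {"left": left, "right": right, "caption": caption}
```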
3. Data Acquisition and Preprocessing Pipeline
Stereo4D (Parallel)
- Sourced from ~7,000 VR180 YouTube videos; includes camera-to-world matrices and rectification parameters.
- Rectification to perspective views (HFOV = 90°), cropped to 832×480.
- Downsampled to 16 fps and sliced into 81-frame clips; shorter clips are discarded (see the sketch after this list).
- Captions generated for left views using ShareGPT4Video (Shen et al., 18 Dec 2025).
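A minimal sketch of the temporal standardization step (nearest-frame resampling to 16 fps, then non-overlapping 81-frame slicing) is given below; the helper is illustrative and assumes frames have already been decoded into an array, which the actual pipeline need not do.

```python
# Sketch: temporal standardization of a source video into 81-frame clips at 16 fps.
# The frame counts follow the paper's description; the helper name and the
# decode-then-index approach are illustrative, not the released pipeline.
import numpy as np


def slice_clips(frames: np.ndarray, src_fps: float, target_fps: float = 16.0,
                clip_len: int = 81) -> list[np.ndarray]:
    """frames: (T, H, W, C) decoded video. Returns fixed-length clips; remainders dropped."""
    # Nearest-frame temporal resampling to the target frame rate.
    step = src_fps / target_fps
    idx = np.round(np.arange(0, len(frames), step)).astype(int)
    idx = idx[idx < len(frames)]
    resampled = frames[idx]
    # Slice into non-overlapping 81-frame clips; clips shorter than 81 frames are discarded.
    n_clips = len(resampled) // clip_len
    return [resampled[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]
```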
3DMovie (Converged)
- Manually curated 142 true converged films in side-by-side (SBS) format; excludes pseudo-stereo and top-bottom layouts.
- Clips segmented using PySceneDetect; only segments ≥81 frames retained (see the sketch after this list).
- Manual trimming removes credits and overlays; odd-indexed (stride-2) frame sampling reduces temporal redundancy.
- Symmetric border cropping eliminates visual artifacts.
- Frames resized to 832×480; left-view captions created analogously.
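A minimal sketch of the segmentation-and-filtering step, using PySceneDetect's public detect/ContentDetector API, is shown below. The length threshold mirrors the ≥81-frame rule; the helper names and the stride-2 selection are illustrative rather than the released pipeline.

```python
# Sketch of the scene-segmentation step described above, using PySceneDetect.
from scenedetect import detect, ContentDetector


def usable_scenes(video_path: str, min_frames: int = 81):
    """Return (start, end) scene boundaries long enough to yield an 81-frame clip."""
    scenes = detect(video_path, ContentDetector())
    keep = []
    for start, end in scenes:
        if end.get_frames() - start.get_frames() >= min_frames:
            keep.append((start, end))
    return keep


def odd_indexed(frame_indices):
    """Odd-indexed (stride-2) sampling used to reduce redundancy within a scene."""
    return [i for i in frame_indices if i % 2 == 1]
```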
3DMovie assumes temporal alignment from SBS encoding, requiring no further calibration. Stereo4D provides per-clip geometric calibration data, enabling, for example, on-the-fly computation of disparity from geometry if required.
4. Annotations, Quality Assurance, and Distribution
Annotations consist of paired left/right stereo video clips, which serve as ground truth for supervised learning. While no explicit pixelwise depth or disparity maps are provided, Stereo4D's geometric metadata allows their derivation for evaluation purposes. Left-view captions, obtained via ShareGPT4Video, provide a foundation for text-conditioned baselines and multi-modal task development.
Quality control includes:
- Manual verification of converged geometry in 3DMovie to prevent mislabeling.
- Automated filtration of non-informative segments (e.g., logos, credits).
- Uniform, symmetric border cropping to maintain data integrity (Shen et al., 18 Dec 2025).
Distribution is managed via a public repository (https://github.com/KlingTeam/StereoPilot), with download scripts enabling retrieval of video sources and metadata; users acquire raw assets using tools such as yt-dlp or from institutional archives.
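As an illustration of the retrieval step, a minimal yt-dlp call is sketched below; the actual download scripts, video ID lists, and output conventions are defined by the StereoPilot repository, and the helper shown here is hypothetical.

```python
# Illustrative retrieval step only: the real download scripts live in the
# StereoPilot repository; video IDs and output paths here are placeholders.
import yt_dlp


def fetch(video_id: str, out_dir: str = "raw_videos"):
    opts = {
        "outtmpl": f"{out_dir}/%(id)s.%(ext)s",   # save as <out_dir>/<video id>.<ext>
        "format": "bestvideo+bestaudio/best",     # prefer highest-quality streams
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        ydl.download([f"https://www.youtube.com/watch?v={video_id}"])
```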
5. Geometric Formulation and Statistical Properties
Stereo Geometry
Parallel
- Left and right cameras share an identical orientation and are offset along the horizontal axis by the baseline $b$: $\mathbf{c}_L = (-b/2, 0, 0)$, $\mathbf{c}_R = (+b/2, 0, 0)$.
- Disparity: $d = \frac{f\,b}{Z}$ → disparity is inversely proportional to depth and vanishes only as $Z \to \infty$, so the zero-disparity plane lies at infinity.
Converged
- Cameras rotated by $\pm\theta/2$ about the vertical axis, baseline $b$.
- For the left camera: $R_L = R_y(+\theta/2)$; right: $R_R = R_y(-\theta/2)$.
- Zero-disparity (convergence) plane at $Z_0 = \frac{b}{2\tan(\theta/2)}$; points nearer than $Z_0$ yield crossed disparity, points farther yield uncrossed disparity.
- Rotation matrix: $R_y(\theta) = \begin{pmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{pmatrix}$
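The relations above can be checked numerically. The short sketch below uses the standard pinhole model with the dataset's stated HFOV and resolution; the 7 cm baseline and 0.5° toe-in angle are illustrative values chosen from within the reported ranges, not per-clip dataset parameters.

```python
# Numeric sketch of the pinhole relations above. The 7 cm baseline and 0.5 deg
# toe-in are illustrative values consistent with the reported ranges.
import numpy as np

W = 832                                        # frame width in px
f = W / (2 * np.tan(np.radians(90.0) / 2))     # HFOV = 90 deg  ->  f = 416 px


def parallel_disparity(Z, b=0.07):
    """Parallel rig: d = f * b / Z (px); decays toward zero as Z grows."""
    return f * b / Z


def convergence_distance(b, theta):
    """Converged rig: zero-disparity plane distance Z0 = b / (2 * tan(theta / 2))."""
    return b / (2 * np.tan(theta / 2))


print(f)                                              # 416.0
print(parallel_disparity(1.0))                        # ~29 px at 1 m with a 7 cm baseline
print(convergence_distance(0.07, np.radians(0.5)))    # ~8 m zero-disparity plane
```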
Statistical Highlights
- Baseline in Stereo4D: 6–8 cm, per-clip rectification supplied.
- 3DMovie convergence angles distributed for cinematic zero-disparity planes at ~5–10 m.
- Disparity spans 32 px at the prescribed resolution.
- Focal length $f = \frac{W}{2\tan(\mathrm{HFOV}/2)} = 416$ px from HFOV = 90° at the 832 px frame width.
- Scene types: even split between static/dynamic, indoor/outdoor; moving objects, specular materials, and complex lighting are well-represented (Shen et al., 18 Dec 2025).
6. Benchmarking Protocols and Impact
UniStereo enables fair, cross-format evaluation through standardized test splits and dataset-wide diversity. The introduction of a learnable “domain switcher” mechanism in associated models such as StereoPilot allows end-to-end adaptation to both stereo geometries. Evaluation protocols follow the prescribed splits: 58,000 Stereo4D clips + 44,879 3DMovie clips for training, and 400 reserved clips per subset for testing (no predefined validation split is published).
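The switcher's internals are not spelled out in this summary; one plausible reading is a learnable per-format embedding injected into the generator's features so that a single model serves both geometries. The sketch below is a guess for illustration, not StereoPilot's published design.

```python
# One plausible reading of a "domain switcher": a learnable per-format embedding
# added to intermediate features. This is an assumption, not the paper's mechanism.
import torch
import torch.nn as nn


class DomainSwitcher(nn.Module):
    def __init__(self, feat_dim: int, num_formats: int = 2):
        super().__init__()
        self.embed = nn.Embedding(num_formats, feat_dim)  # 0: parallel, 1: converged

    def forward(self, features: torch.Tensor, format_id: torch.Tensor) -> torch.Tensor:
        # features: (B, T, C); format_id: (B,) integer format labels.
        return features + self.embed(format_id)[:, None, :]
```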
Baseline Metrics and Results
Metrics: SSIM, MS-SSIM, PSNR, LPIPS, SIOU, and inference latency per 81-frame clip on a single GPU.
| Method | SSIM (Par/Conv) | PSNR (Par/Conv) | LPIPS (Par/Conv) | SIOU (Par/Conv) | Latency |
|---|---|---|---|---|---|
| StereoDiffusion | 0.642 / 0.678 | 20.54 / 20.70 | 0.245 / 0.341 | 0.252 / 0.181 | 60 min |
| StereoCrafter | 0.553 / 0.706 | 17.67 / 23.79 | 0.298 / 0.203 | 0.226 / 0.213 | 1 min |
| Mono2Stereo | 0.649 / 0.795 | 20.89 / 25.76 | 0.222 / 0.191 | 0.241 / 0.201 | 15 min |
| StereoPilot | 0.861 / 0.837 | 27.74 / 27.86 | 0.087 / 0.122 | 0.408 / 0.260 | 11 s |
StereoPilot, trained on UniStereo, demonstrates leading fidelity and perceptual alignment: PSNR improves by roughly 7 dB on the parallel split and 2 dB on the converged split over the strongest prior baseline, and LPIPS is reduced by more than half on the parallel split. Inference time drops to 11 seconds per clip, versus minutes to an hour for previous approaches. This suggests UniStereo is critical for advancing real-time, robust, and generalizable mono-to-stereo models (Shen et al., 18 Dec 2025).
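For reference, a minimal sketch of computing the per-frame image metrics above is given below, using scikit-image for SSIM/PSNR and the lpips package for LPIPS; MS-SSIM, SIOU, and the exact clip-level aggregation protocol are not reproduced here and would follow the released evaluation code.

```python
# Minimal per-frame metric sketch (SSIM, PSNR via scikit-image; LPIPS via lpips).
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # perceptual distance network


def frame_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """pred, gt: (H, W, 3) uint8 right-view frames."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_fn(to_t(pred), to_t(gt)).item()
    return {"psnr": psnr, "ssim": ssim, "lpips": lp}
```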
7. Significance and Research Implications
UniStereo establishes the first unified benchmark for equitable evaluation of mono-to-stereo conversion across both prevailing geometric conventions in stereoscopic video. Its careful design—unified geometry coverage, thorough preprocessing, domain-balanced splits, and multi-modal annotations—directly addresses format-specific biases that previously hampered the field. The dataset lowers the barrier to reproducible benchmarking, catalyzing robust model development (as in the StereoPilot framework) and facilitating new research on text-conditioned conversion and generalization across diverse content domains. The public data release, detailed metadata, and alignment with open-source evaluation protocols are designed to advance scientific rigor and comparability in stereo video research (Shen et al., 18 Dec 2025).