StereoWorld-11M: Human-IPD Stereo Video Dataset

Updated 12 December 2025

StereoWorld-11M is a large-scale, high-definition stereo video dataset of more than 11 million frames drawn from professionally produced 3D films and intended for geometry-aware video synthesis.
It is calibrated to natural human IPD, carries dense frame-level depth and disparity annotations, and is benchmarked with metrics including PSNR, SSIM, and LPIPS.
Its 142,520 clips span diverse cinematic genres, with varied motion, lighting, and scene settings to support stereo video research.

StereoWorld-11M is a large-scale, high-definition stereo video dataset comprising over 11 million frames sourced from more than 100 professionally produced Blu-ray 3D movies. The dataset is curated to enable large-scale training and evaluation of geometry-aware monocular-to-stereo video generation frameworks. StereoWorld-11M is explicitly aligned to natural human interpupillary distance (IPD), contains dense frame-level depth and disparity annotations, and is benchmarked with a comprehensive set of visual fidelity and stereo consistency metrics. It is released for non-commercial research use, supporting the advancement of stereo video synthesis and evaluation (Xing et al., 10 Dec 2025).

1. Dataset Composition and Statistics

StereoWorld-11M contains more than 11,000,000 frames structured into 142,520 distinct side-by-side (SBS) stereo clips of 81 frames each (142,520 × 81 = 11,544,120 frames). The clips are derived from professionally produced SBS Blu-ray 3D films. The original source material is encoded at 1920×1080 resolution per eye at 24 frames per second (fps). The dataset's training release consists of downsampled clips at 832×480 pixels (approximately 16:9 aspect ratio) and 12 fps, giving a total training duration of approximately 254.6 hours when computed from the rounded frame count (11,000,000 frames / 12 fps ≈ 916,667 seconds ≈ 254.6 hours) (Xing et al., 10 Dec 2025).

Component	Statistic	Notes
Total frames	>11,000,000
Source clips	142,520
Clip length	81 frames	≈7 s per clip at 12 fps
Original spatial resolution	1920×1080	Blu-ray masters at 24 fps
Training resolution	832×480	12 fps
Train/test split	141,520 / 1,000 clips	No validation set
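
The figures in the table can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this section; the variable names are illustrative and are not part of any released tooling.

```python
# Figures quoted in the statistics table above.
NUM_CLIPS = 142_520
FRAMES_PER_CLIP = 81
TRAIN_FPS = 12

# Exact frame count implied by clips x frames per clip.
total_frames = NUM_CLIPS * FRAMES_PER_CLIP

# Duration of one clip at the training frame rate, in seconds.
clip_seconds = FRAMES_PER_CLIP / TRAIN_FPS

# Total duration from the exact frame count, in hours.
total_hours = total_frames / TRAIN_FPS / 3600

# Total duration from the rounded 11,000,000-frame figure, in hours.
rounded_hours = 11_000_000 / TRAIN_FPS / 3600

print(total_frames)             # 11544120
print(clip_seconds)             # 6.75
print(round(total_hours, 1))    # 267.2
print(round(rounded_hours, 1))  # 254.6
```

The 254.6-hour figure corresponds to the rounded frame count; the exact clip-based count gives a somewhat longer duration.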

2. IPD Calibration and Stereo Geometry

StereoWorld-11M is aligned to natural human IPD, approximately 55–75 mm. The stereo baseline (the physical distance between the optical centers of the two cameras) is inherited unchanged from the original cinematic stereo rigs, which are typically calibrated to ~65 mm. No synthetic widening, baseline stretching, or other artificial alteration of the baseline is performed. In rectified image pairs, stereo geometry and depth are linked by the relation:

$d = \frac{f \cdot b}{Z}$

where $d$ is the observed disparity (in pixels), $f$ the focal length (in pixels), $b$ the stereo baseline (in mm), and $Z$ the scene depth (in the same units as $b$). Disparity is therefore proportional to the baseline and inversely proportional to depth. While no explicit formula mapping baseline to depth is released, this relationship forms the geometric foundation for disparity and depth alignment in the dataset. Each frame pair thus retains IPD-consistent geometry, facilitating realistic stereo synthesis and evaluation (Xing et al., 10 Dec 2025).
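
As a worked illustration, the standard rectified-stereo relation (disparity = focal length × baseline / depth) can be sketched as below. The focal length of 1000 px is a hypothetical value chosen for the example; the dataset does not publish camera intrinsics, and the function names are illustrative.

```python
def disparity_from_depth(focal_px: float, baseline_mm: float, depth_mm: float) -> float:
    """Disparity in pixels for a point at the given depth (rectified stereo)."""
    if depth_mm <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_mm / depth_mm


def depth_from_disparity(focal_px: float, baseline_mm: float, disparity_px: float) -> float:
    """Depth in mm recovered from a disparity in pixels (inverse of the above)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_mm / disparity_px


# Hypothetical focal length; 65 mm is the typical rig baseline cited in the text.
FOCAL_PX = 1000.0
BASELINE_MM = 65.0

d = disparity_from_depth(FOCAL_PX, BASELINE_MM, depth_mm=2000.0)
print(d)                                                # 32.5
print(depth_from_disparity(FOCAL_PX, BASELINE_MM, d))   # 2000.0
```

Under these assumed values, a point 2 m from the camera shifts by 32.5 px between the two views, and halving the depth would double the shift.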

3. Source Content, Scene Diversity, and Motion

The dataset includes more than 100 SBS Blu-ray releases, encompassing a broad range of genres:

Animation (~20,000 clips)
Realism/drama/documentary (~30,000 clips)
War/action (~25,000 clips)
Sci-fi/fantasy (~20,000 clips)
Historical (~15,000 clips)
Other (comedy/drama, ~32,520 clips)

Motion ranges from static dialogue to high-speed chase sequences. Disparities reach approximately 50 pixels at the downsampled 480p resolution. Scene content and lighting conditions cover indoor, outdoor, day, and night settings. Depth ranges from near-field (≈0.5 m foreground) to far-field (exceeding 100 m), a breadth intended to help models trained or evaluated on this data generalize across structural and semantic contexts (Xing et al., 10 Dec 2025).
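
Because disparity is a horizontal pixel offset, it scales linearly with image width when frames are resized. The sketch below, which assumes a plain uniform horizontal resize between the 1920-pixel source width and the 832-pixel training width, converts the roughly 50-pixel maximum quoted above back to source resolution. The function name is illustrative.

```python
def rescale_disparity(disparity_px: float, src_width: int, dst_width: int) -> float:
    """Scale a disparity measured at src_width to its value at dst_width."""
    if src_width <= 0 or dst_width <= 0:
        raise ValueError("widths must be positive")
    return disparity_px * dst_width / src_width


# ~50 px at the 832-pixel training width, expressed at the 1920-pixel source width.
at_source = rescale_disparity(50.0, src_width=832, dst_width=1920)
print(round(at_source, 1))  # 115.4

# The same shift as a fraction of image width (resolution independent).
print(round(50.0 / 832, 3))  # 0.06
```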

4. Preprocessing, Organization, and Annotation

The original SBS frames are split horizontally into left-eye and right-eye views, and any horizontal stretch artifacts are undone, to recover clean per-eye images at 1080p, 24 fps. All data are then uniformly downscaled to 480p (832×480) for compatibility with base video generation models. Clips are formed by sampling 81 frames at fixed intervals to maximize temporal diversity.
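
The splitting and temporal sampling steps can be sketched with NumPy as follows. This is a minimal illustration, not the authors' pipeline: it assumes the left view occupies the left half of each SBS frame, uses nearest-neighbor column repetition as a stand-in for whatever resampling filter was actually used, and assumes a step of 2 to go from 24 fps to 12 fps.

```python
from typing import Tuple

import numpy as np


def split_sbs(frame: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Split a side-by-side frame (H, W, C) into left and right halves."""
    half = frame.shape[1] // 2
    return frame[:, :half], frame[:, half:2 * half]


def unsqueeze_width(view: np.ndarray, factor: int = 2) -> np.ndarray:
    """Undo a horizontal squeeze by repeating columns (nearest neighbor)."""
    return np.repeat(view, factor, axis=1)


def subsample_frames(frames: np.ndarray, step: int = 2) -> np.ndarray:
    """Keep every step-th frame, e.g. 24 fps -> 12 fps with step=2."""
    return frames[::step]


# A dummy half-width SBS frame: two 960-pixel-wide views packed into 1920 pixels.
sbs = np.zeros((1080, 1920, 3), dtype=np.uint8)
left, right = split_sbs(sbs)
print(left.shape, right.shape)        # (1080, 960, 3) (1080, 960, 3)
print(unsqueeze_width(left).shape)    # (1080, 1920, 3)

# 162 consecutive source frames reduce to one 81-frame clip.
indices = subsample_frames(np.arange(162))
print(len(indices))                   # 81
```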

Directory and file organization are as follows:

StereoWorld-11M/
  ├── clips/
  │     ├── movieA_clip00001.mp4       (left eye)
  │     ├── movieA_clip00001_R.mp4     (right eye)
  │     ├── movieA_clip00002_left.mp4  ...
  │     └── ...
  └── metadata.json

RGB video is provided as H.264‐encoded MP4 files (81 frames at 832×480). Per-frame depth maps ($D_r$) are estimated using Video Depth Anything and saved as latent codes $d_r$ via a 3D VAE. Ground-truth disparity maps ($\hat{b}_\text{gt}$) are computed using Stereo Any Video and are also stored in latent form. All annotations are pixel-aligned and temporally synchronized with the RGB data. Metadata includes clip identifiers, movie titles, genres, frame counts, resolution descriptors, and train/test split indicators. Camera extrinsics or poses are not explicitly released, because the stereo geometry is implied by the rectified SBS format (Xing et al., 10 Dec 2025).
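
A minimal sketch of reading such metadata and selecting one split is shown below. The schema of metadata.json is not documented here, so every field name (`clip_id`, `genre`, `num_frames`, `split`) and the list-of-records layout are assumptions made for illustration only.

```python
import json
from collections import Counter

# Hypothetical metadata layout; the real metadata.json schema may differ.
SAMPLE_METADATA = """
[
  {"clip_id": "movieA_clip00001", "genre": "animation", "num_frames": 81, "split": "train"},
  {"clip_id": "movieA_clip00002", "genre": "animation", "num_frames": 81, "split": "test"},
  {"clip_id": "movieB_clip00001", "genre": "sci-fi",    "num_frames": 81, "split": "train"}
]
"""


def load_records(text: str) -> list:
    """Parse the metadata JSON text into a list of per-clip dictionaries."""
    return json.loads(text)


def select_split(records: list, split: str) -> list:
    """Return only the records whose (assumed) 'split' field matches."""
    return [r for r in records if r.get("split") == split]


records = load_records(SAMPLE_METADATA)
train = select_split(records, "train")
print([r["clip_id"] for r in train])       # ['movieA_clip00001', 'movieB_clip00001']
print(Counter(r["genre"] for r in train))  # per-genre clip counts in the train split
```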

5. Splits, Licensing, and Usage Constraints

The dataset is split into 141,520 training clips and 1,000 test clips, assigned via random sampling at the clip level with no overlap. No explicit validation set is provided. Usage is restricted to non-commercial research:

Users must legally own or license the original Blu-ray discs.
Direct redistribution of raw video clips is prohibited.
Only pre-computed latent representations and associated annotations are publicly shared.
Publications utilizing StereoWorld-11M must cite "StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation" and include the dataset URL.

These conditions are intended to ensure legal compliance while keeping the dataset accessible for research (Xing et al., 10 Dec 2025).
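
The clip-level random split described at the start of this section can be sketched as follows. The seed, the clip identifier format, and the function name are assumptions; the official assignment is the one recorded in the dataset's metadata, and this sketch will not reproduce it.

```python
import random


def split_clips(clip_ids, num_test=1000, seed=0):
    """Randomly hold out num_test clips; the rest form the training set."""
    ids = sorted(clip_ids)            # fixed order so the seed is reproducible
    rng = random.Random(seed)         # local generator, no global state
    test = set(rng.sample(ids, num_test))
    train = [c for c in ids if c not in test]
    return train, sorted(test)


# Placeholder identifiers standing in for the 142,520 real clip names.
all_clips = [f"clip{i:06d}" for i in range(142_520)]
train_ids, test_ids = split_clips(all_clips)

print(len(train_ids), len(test_ids))            # 141520 1000
print(set(train_ids).isdisjoint(test_ids))      # True
```

Sampling at the clip level means that clips from the same film can appear in both the training and test sets.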

6. Benchmarks and Quantitative Evaluation

StereoWorld-11M includes a unified suite of evaluation metrics:

Visual fidelity: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), Learned Perceptual Image Patch Similarity (LPIPS)
Quality/temporal: VBench IQ-Score, VBench TF-Score
Stereo consistency: End-Point Error (EPE) on disparity, D1-all (ratio of pixels whose disparity error exceeds both 3 px and 5% of the ground-truth disparity, following the usual KITTI convention)
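
Three of these metrics are simple enough to sketch directly in NumPy. The code below is an illustrative implementation, not the benchmark's evaluation script: PSNR assumes 8-bit images by default, and D1-all follows the common KITTI-style convention in which a pixel counts as an outlier when its error exceeds both 3 px and 5% of the ground-truth disparity. SSIM, LPIPS, and the VBench scores require dedicated libraries and are omitted.

```python
import numpy as np


def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))


def epe(pred_disp: np.ndarray, gt_disp: np.ndarray) -> float:
    """Mean absolute disparity error in pixels."""
    return float(np.mean(np.abs(pred_disp - gt_disp)))


def d1_all(pred_disp: np.ndarray, gt_disp: np.ndarray,
           abs_thresh: float = 3.0, rel_thresh: float = 0.05) -> float:
    """Fraction of pixels whose error exceeds both thresholds."""
    err = np.abs(pred_disp - gt_disp)
    outlier = (err > abs_thresh) & (err > rel_thresh * np.abs(gt_disp))
    return float(np.mean(outlier))


gt = np.array([10.0, 20.0, 100.0, 100.0])
pred = np.array([10.0, 24.0, 104.0, 110.0])
print(epe(pred, gt))     # 4.5
print(d1_all(pred, gt))  # 0.5
```

In the example, the third pixel has a 4 px error but is not an outlier, because 4 px is below 5% of its ground-truth disparity of 100 px.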

Baseline results on the 1,000-clip test set are summarized as follows:

Model	PSNR	SSIM	LPIPS	IQ	TF	EPE	D1-all
StereoWorld	25.98	0.7964	0.0952	0.5019	0.9704	17.45	0.4213
StereoCrafter	23.04	0.6561	0.1869	0.4370	0.9685	24.78	0.5271
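
The gaps between the two rows can be quantified with a short script. The values are copied from the table above; the helper name is illustrative. Higher is better for PSNR, SSIM, IQ, and TF, while lower is better for LPIPS, EPE, and D1-all.

```python
# Values copied from the results table.
RESULTS = {
    "StereoWorld":   {"PSNR": 25.98, "SSIM": 0.7964, "LPIPS": 0.0952,
                      "IQ": 0.5019, "TF": 0.9704, "EPE": 17.45, "D1-all": 0.4213},
    "StereoCrafter": {"PSNR": 23.04, "SSIM": 0.6561, "LPIPS": 0.1869,
                      "IQ": 0.4370, "TF": 0.9685, "EPE": 24.78, "D1-all": 0.5271},
}


def relative_change(new: float, old: float) -> float:
    """Signed fractional change of new relative to old."""
    return (new - old) / old


ours, base = RESULTS["StereoWorld"], RESULTS["StereoCrafter"]

# PSNR is logarithmic, so report the absolute difference in dB.
print(f"PSNR gain: {ours['PSNR'] - base['PSNR']:.2f} dB")

# For the remaining metrics, report the relative change in percent.
for name in ("SSIM", "LPIPS", "IQ", "TF", "EPE", "D1-all"):
    change = 100.0 * relative_change(ours[name], base[name])
    print(f"{name}: {change:+.1f}%")
```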

StereoWorld-11M is described as the first publicly documented, large-scale, human-IPD-aligned stereo video dataset with dense depth and disparity annotations, made available under a non-commercial research license and benchmarked with an extensive suite of stereo and perceptual metrics (Xing et al., 10 Dec 2025).

References (1)

Xing et al. (2025). StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation.