Sekai: A Video Dataset towards World Exploration (2506.15675v2)

Published 18 Jun 2025 in cs.CV and cs.AI

Abstract: Video generation techniques have made remarkable progress, promising to be the foundation of interactive world exploration. However, existing video generation datasets are not well-suited for world exploration training as they suffer from some limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning "world" in Japanese), a high-quality first-person view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking or drone view (FPV and UAV) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Experiments demonstrate the quality of the dataset. And, we use a subset to train an interactive video world exploration model, named YUME (meaning "dream" in Japanese). We believe Sekai will benefit the area of video generation and world exploration, and motivate valuable applications. The project page is https://lixsp11.github.io/sekai-project/.

Summary

  • The paper introduces Sekai, a dataset featuring over 5,000 hours of first-person videos to support interactive global exploration.
  • It details a four-stage pipeline that curates high-quality clips with rich annotations including location, weather, and camera trajectories.
  • The authors validate Sekai's utility by training the YUME model, while addressing limitations in existing video datasets for exploration.

This paper introduces Sekai, a large-scale egocentric video dataset designed specifically to support the training of models for interactive video-based world exploration. The authors identify limitations in existing video generation datasets, such as restricted locations, short durations, static scenes, and a lack of annotations relevant to exploration and the world state (like camera trajectories, location, weather, etc.). Sekai aims to address these gaps.

Key Features of Sekai:

  • Content: Over 5,000 hours of high-quality, first-person view (FPV) videos, including walking and drone views. Videos include audio for immersion.
  • Diversity: Covers 101 countries and regions across more than 750 cities, capturing diverse cultures, scenes, weather conditions, times of day, and crowd densities.
  • Duration: Walking videos are at least 60 seconds long, with clips ranging from 1 to 39 minutes, enabling training for long-term exploration.
  • Annotations: Rich annotations are provided for each video clip (an illustrative record follows this list), including:
    • Location (country, city, chapter-specific)
    • Scene type
    • Weather
    • Time of day
    • Crowd density
    • Detailed captions
    • Camera trajectories (for a subset)
  • Sources: The dataset comprises two parts:
    • Sekai-Real: Collected from YouTube videos, totaling over 5,000 hours. Annotations are derived using automated and semi-automated methods.
    • Sekai-Game: Collected from the realistic video game Lushfoil Photography Sim, totaling 60 hours. Provides ground-truth annotations.
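
The summary does not spell out the on-disk annotation schema. Purely as an illustrative sketch, the metadata listed above could be organized per clip roughly as follows; all field names and values here are hypothetical, not Sekai's actual format:

```python
# Hypothetical per-clip annotation record illustrating the metadata described above.
# Field names and values are assumptions for illustration, not Sekai's actual schema.
example_clip = {
    "clip_id": "example_000001",
    "source": "Sekai-Real",                      # or "Sekai-Game"
    "location": {"country": "Japan", "city": "Tokyo"},
    "scene": "urban street",
    "weather": "sunny",
    "time_of_day": "daytime",
    "crowd_density": "crowded",
    "caption": "A first-person walk along a busy shopping street lined with ...",
    "camera_trajectory": [                       # present only for an annotated subset
        {"t": 0.0, "position": [0.0, 0.0, 0.0], "rotation_wxyz": [1.0, 0.0, 0.0, 0.0]},
        {"t": 0.5, "position": [0.4, 0.0, 0.1], "rotation_wxyz": [1.0, 0.0, 0.0, 0.0]},
    ],
}
```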

Dataset Curation Pipeline:

The authors developed a four-stage pipeline: collection, pre-processing, annotation (performed separately for Sekai-Real and Sekai-Game), and sampling.

  1. Video Collection:
    • YouTube: Manual search for high-quality walking and drone videos, followed by downloading. The initial collection exceeded 10,000 hours and was reduced to around 8,600 hours due to download issues.
    • Video Game: Recording gameplay from Lushfoil Photography Sim using OBS Studio. 40 hours were recorded initially.
  2. Pre-processing: Applied to both YouTube and game videos to standardize format and filter low quality content.
    • Trimming: Removing opening/ending segments.
    • Shot Boundary Detection: Using a refactored, GPU-accelerated TransNetV2 [soucek2024transnet] to split videos into continuous shots.
    • Clip Extraction and Transcoding: Splitting shots into one-minute clips and re-encoding to 720p, 30fps, H.265 MP4 with 4 Mbps bitrate using PyNVideoCodec. Audio is kept and re-encoded to AAC.
    • Luminance Filtering: Removing clips with prolonged periods of extreme brightness or darkness based on the YUV luma channel (a minimal sketch of this idea follows the pipeline).
    • Quality Filtering: Using the COVER [cover2024cpvrws] metric to remove the bottom 10% of clips based on technical quality.
    • Subtitle Filtering: Detecting and removing clips with hardcoded subtitles using VideoSubFinder.
    • Camera Trajectory Filtering: For clips with extracted trajectories, filtering out implausible motions based on abrupt directional changes, viewpoint shifts, or position displacement.
  3. Annotation (Sekai-Real): Leveraging large vision-language models and specialized tools.
    • Location: Extracting location information from YouTube titles/descriptions using GPT-4o [hurst2024gpt] and matching it to video clips based on timestamps.
    • Category and Caption: A two-stage process. First, classifying videos along four dimensions (scene, weather, time of day, crowd density). Second, generating detailed captions using frames (one every two seconds) and the 72B version of Qwen2.5-VL [bai2025qwen2] with vLLM [kwon2023efficient] for inference, incorporating location and category information (a minimal captioning sketch follows the pipeline).
    • Camera Trajectories: Extracting 600+ hours of trajectories using a modified MegaSaM [li2024megasam], replacing Depth Anything [yang2024depth] with Video Depth Anything [chen2025video] for better temporal consistency and optimizing for parallel inference.
  4. Annotation (Sekai-Game):
    • Developed a toolchain using RE-UE4SS and OBS Studio to capture ground-truth camera poses, location, and other metadata directly from the game engine. Captured poses are calibrated and interpolated.
  5. Video Sampling (Sekai-Real-HQ):
    • To create a high-quality subset (400 hours), a sampling strategy based on quality and diversity is applied (a simplified sketch follows the pipeline).
    • Quality Sampling: Ranking clips by the sum of aesthetic and semantic quality scores from COVER [cover2024cpvrws] and keeping the top $\alpha_{quality} = 70\%$.
    • Diversity Sampling: Iteratively balancing content, location, category, and camera trajectory distributions. Content diversity uses InternVideo2 [wang2024internvideo2] embeddings and K-Means clustering to remove similar videos. Location and category diversity use inverse-probability weighted sampling to ensure representation. Camera trajectory diversity bins trajectories by direction and jitter and samples from each bin.
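
Several of the pipeline steps above describe concrete filtering or sampling logic without giving implementations; the sketches below are minimal illustrations of those ideas, not the authors' code. First, luminance filtering on the YUV luma channel; the thresholds, frame stride, and bad-frame fraction are assumptions:

```python
import cv2
import numpy as np

def is_badly_exposed(video_path: str,
                     dark_thresh: float = 20.0,
                     bright_thresh: float = 235.0,
                     max_bad_fraction: float = 0.5,
                     frame_stride: int = 30) -> bool:
    """Flag clips that stay extremely dark or bright for too long.

    Thresholds and the bad-frame fraction are illustrative assumptions,
    not the values used to build Sekai.
    """
    cap = cv2.VideoCapture(video_path)
    bad, total, idx = 0, 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_stride == 0:
            # OpenCV decodes to BGR; take the Y (luma) plane after converting to YUV.
            luma = cv2.cvtColor(frame, cv2.COLOR_BGR2YUV)[:, :, 0]
            mean_luma = float(np.mean(luma))
            if mean_luma < dark_thresh or mean_luma > bright_thresh:
                bad += 1
            total += 1
        idx += 1
    cap.release()
    return total > 0 and bad / total > max_bad_fraction
```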
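
The caption-annotation step pairs frames sampled every two seconds with location and category labels and runs Qwen2.5-VL through vLLM; the exact prompt and serving setup are not described in this summary. A minimal sketch, assuming a vLLM OpenAI-compatible server (e.g. started with `vllm serve Qwen/Qwen2.5-VL-72B-Instruct`) and an invented prompt:

```python
import base64
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is already running, e.g.:
#   vllm serve Qwen/Qwen2.5-VL-72B-Instruct --port 8000
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def caption_clip(frame_paths, location, categories):
    """Caption one clip from frames sampled every two seconds.

    The prompt wording and message layout are illustrative assumptions.
    """
    content = [{
        "type": "text",
        "text": (f"These frames were sampled every two seconds from a first-person video "
                 f"recorded in {location}. Scene: {categories['scene']}; "
                 f"weather: {categories['weather']}; time of day: {categories['time_of_day']}; "
                 f"crowd density: {categories['crowd_density']}. "
                 "Write a detailed caption describing the camera motion and surroundings."),
    }]
    content += [
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{encode_image(p)}"}}
        for p in frame_paths
    ]
    resp = client.chat.completions.create(
        model="Qwen/Qwen2.5-VL-72B-Instruct",
        messages=[{"role": "user", "content": content}],
        max_tokens=512,
    )
    return resp.choices[0].message.content
```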
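
Finally, a simplified sketch of the sampling stage: keep the top 70% of clips by a COVER-style quality score, then draw a location-balanced subset with inverse-probability weights. The score fields, the weighting detail, and the use of NumPy's weighted sampling are assumptions; the real procedure also balances content, categories, and camera trajectories:

```python
from collections import Counter
import numpy as np

def quality_sample(clips, alpha_quality=0.7):
    """Keep the top alpha_quality fraction by aesthetic + semantic quality score."""
    ranked = sorted(clips, key=lambda c: c["aesthetic"] + c["semantic"], reverse=True)
    return ranked[: int(len(ranked) * alpha_quality)]

def location_balanced_sample(clips, target_size, seed=0):
    """Sample clips with probability inversely proportional to their country's
    frequency, so long-tail countries remain represented."""
    counts = Counter(c["country"] for c in clips)
    weights = np.array([1.0 / counts[c["country"]] for c in clips], dtype=float)
    probs = weights / weights.sum()
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(clips), size=min(target_size, len(clips)), replace=False, p=probs)
    return [clips[i] for i in idx]

# Example usage (hypothetical clip dicts):
# hq = quality_sample(all_clips, alpha_quality=0.7)
# subset = location_balanced_sample(hq, target_size=10_000)
```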

Dataset Statistics:

Sekai-Real consists of over 5,000 hours of video from 101 countries and regions. Duration follows a long-tail distribution, with the top 8 countries accounting for roughly 60% of the total. Category distributions across weather, scene, time of day, and crowd density reflect a mix of common and unique scenarios. Sekai-Real-HQ shows improved distributions, particularly in location diversity and camera-trajectory jitter uniformity, and higher average video quality than Sekai-Real. Captions in Sekai are significantly longer than in comparable datasets such as OpenVid-1M [nan2024openvid].

YUME Model:

The authors trained an interactive video world exploration model called YUME using a subset of Sekai-Real-HQ. This model takes an initial image and allows users to explore via keyboard and mouse control, generating subsequent video frames corresponding to the user's input trajectory.
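
This summary does not describe YUME's architecture or control interface. Purely as an illustration of the interaction pattern described above (initial image in, keyboard/mouse actions steering frame generation), a hypothetical control loop might look like the following; `Action`, `model.next_frame`, and the callbacks are all invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Action:
    """Hypothetical per-step user input: WASD-style translation plus mouse look."""
    forward: float = 0.0   # W / S
    strafe: float = 0.0    # A / D
    yaw: float = 0.0       # mouse x
    pitch: float = 0.0     # mouse y

def explore(model, initial_image, read_user_input, render, num_steps=300):
    """Hypothetical interactive loop: condition on the frame history and the
    latest user action, generate the next frame, and display it. `model` stands
    in for an interactive world-exploration model such as YUME; its API is assumed."""
    frame = initial_image
    history = [frame]
    for _ in range(num_steps):
        action = read_user_input()                 # e.g. poll keyboard/mouse state
        frame = model.next_frame(history, action)  # assumed API
        history.append(frame)
        render(frame)
    return history
```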

Limitations:

The paper notes two primary limitations:

  • Insufficient training: The full potential of the Sekai dataset could not be realized due to limited computational resources for training large models on the entire dataset.
  • Insufficient camera trajectory annotation: Only a partial subset of Sekai-Real videos has camera trajectory annotations due to computational constraints.

Overall, Sekai is presented as a novel and valuable resource for the development of advanced video generation models capable of supporting interactive, long-term exploration of diverse, realistic, and synthetic environments.
