
Omni360-X Dataset: 360° Video & Panoramic Data

Updated 11 April 2026
  • Omni360-X dataset is a collection of three specialized resources for 360° video summarization, synthetic 3D scene generation, and panoptic scene understanding.
  • Each resource employs rigorous acquisition, annotation, and benchmarking protocols, offering high-definition video, detailed synthetic panoramas, and multi-view audio-visual data.
  • The dataset enables robust evaluation of methods in saliency-based summarization, physically-based rendering, and multi-modal action localization across diverse environments.

The term "Omni360-X Dataset" is used as an umbrella for three 360-degree video and panoramic datasets, each targeting a distinct problem: (1) 360-VSumm (also referred to as Omni360-X), for human-annotated 360° video summarization (Kontostathis et al., 2024); (2) PanoX, a set of synthetic multimodal panoramas for graphics-ready 3D scene generation (Huang et al., 30 Oct 2025); and (3) 360+x, a panoptic, multi-modal, multi-view video and audio scene understanding benchmark (Chen et al., 2024). Each resource is built with documented acquisition, annotation, and benchmarking protocols, and all are open to academic research use under explicit licensing.

1. Dataset Variants and Scope

Three datasets commonly fall under the "Omni360-X" designation (the alias used here is given in parentheses):

Dataset (Alias) | Principal Task | Key Modalities/Annotations
360-VSumm (Omni360-X) | Video summarization | 360° videos, human summaries
PanoX (Omni360-X) | Panoramic inverse rendering | RGB, depth, normals, PBR maps
360+x (Omni360-X) | Panoptic scene understanding | Video (multi-view), audio, ITD

The 360-VSumm dataset focuses on supervised training and objective evaluation of 360°→2D video summarization. PanoX targets graphics-ready 3D scene generation, supporting tasks such as physically-based rendering (PBR), inverse rendering, and panoramic perception. The 360+x resource is designed for research in multi-modal, multi-view scene comprehension, with extensive annotation for action localization, cross-modal retrieval, and self-supervised learning.

2. Acquisition, Modalities, and Content Specifications

2.1. 360-VSumm / Omni360-X (Video Summarization)

  • 40 panoramic clips (1–4 min each), content includes indoor scenes, outdoor activities, underwater, sports, narrative films, and music shows.
  • Video captured at 25–30 fps in high-definition (~4K×2K) equirectangular projection; down-sampled to 2 fps for summarization purposes.
  • Human annotators select concise 2D-view summaries; each original is segmented into 2 s non-overlapping fragments.
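
The fragment segmentation described above can be sketched as follows. This is a minimal illustration, not the dataset's own tooling: the function names, the inclusive end index, and the rounding of the fragment budget are assumptions, and the 15% ratio is taken from the annotation protocol in Section 3.1.

```python
def fragment_bounds(num_frames, fps, fragment_sec=2.0):
    """Split a video into non-overlapping fragments of fragment_sec seconds.

    Returns a list of (start_frame, end_frame) pairs with inclusive ends.
    The last fragment may be shorter if the video length is not a multiple
    of the fragment length.
    """
    frag_len = max(1, int(round(fps * fragment_sec)))
    bounds = []
    for start in range(0, num_frames, frag_len):
        end = min(start + frag_len, num_frames) - 1
        bounds.append((start, end))
    return bounds


def summary_budget(num_fragments, ratio=0.15):
    """Number of fragments in a summary (rounding rule is an assumption)."""
    return max(1, round(num_fragments * ratio))


# Hypothetical 2-minute clip at 25 fps: 3000 frames -> 60 fragments of 50 frames.
bounds = fragment_bounds(num_frames=3000, fps=25)
budget = summary_budget(len(bounds))
```

At the 2 fps rate used for summarization, the same function yields fragments of 4 frames each.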

2.2. PanoX (Synthetic Panoramas for 3D Scene Generation)

  • Eight large-scale scenes (5 indoor, 3 outdoor) constructed or imported into Unreal Engine 5.
  • For each of ~10,000 camera positions, the system captures pixel-aligned, equirectangular panoramas in six modalities: RGB, Euclidean distance, world-space normals, albedo, roughness, and metallicity.
  • Native resolutions up to 2048×4096, normalized to 512×1024 for training; all modalities use the same projection and dimensions.
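
Because the distance modality stores Euclidean distance along each viewing ray (rather than planar z-depth), a panorama can be lifted to 3D points by scaling per-pixel unit directions. The sketch below assumes a y-up, z-forward camera frame and pixel-centre sampling; PanoX's actual axis convention is not stated here, so treat these choices as illustrative.

```python
import numpy as np


def equirect_directions(height, width):
    """Unit view direction for every pixel centre of an equirectangular image."""
    u = (np.arange(width) + 0.5) / width    # horizontal position in [0, 1)
    v = (np.arange(height) + 0.5) / height  # vertical position in [0, 1)
    lon = (u - 0.5) * 2.0 * np.pi           # longitude in (-pi, pi)
    lat = (0.5 - v) * np.pi                 # latitude in (-pi/2, pi/2), top row is up
    lon, lat = np.meshgrid(lon, lat)
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)     # shape (H, W, 3)


def distance_to_points(distance):
    """Convert an (H, W) Euclidean-distance panorama to (H, W, 3) points."""
    h, w = distance.shape
    return equirect_directions(h, w) * distance[..., None]


# Toy example: a sphere of radius 2 m seen from its centre.
points = distance_to_points(np.full((512, 1024), 2.0))
```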

2.3. 360+x (Panoptic Multi-Modal Scene Understanding)

  • 232 examples, spanning 2,152 video streams and ~68 hours total; up to four view streams per scene: full spherical (5760×2880 @ 25 fps), front view, egocentric monocular, and binocular (2432×1216 @ 60 fps).
  • Multi-channel audio (ambisonic, binaural), GPS traces, ITD (interaural time delay), fine-grained action segments (38 labels), and meta-data (scene description, weather, time).
  • Balanced scene composition: 28 scene categories, with both indoor and outdoor coverage, and diverse geographic locations (UK, France, Spain, China, Japan).
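
For readers unfamiliar with ITD, the quantity can be illustrated by cross-correlating the two channels of a binaural recording and taking the lag with the strongest match. This is a generic textbook estimator shown under assumed conventions (positive values mean the right channel lags the left); it is not necessarily how the ITD data shipped with 360+x was computed.

```python
import numpy as np


def estimate_itd(left, right, sample_rate, max_itd_s=1e-3):
    """Estimate the delay of `right` relative to `left`, in seconds.

    Searches integer lags within +/- max_itd_s and returns the lag that
    maximises the cross-correlation of the two mean-removed signals.
    """
    left = np.asarray(left, dtype=float) - np.mean(left)
    right = np.asarray(right, dtype=float) - np.mean(right)
    max_lag = int(round(max_itd_s * sample_rate))
    lags = np.arange(-max_lag, max_lag + 1)
    corr = []
    for lag in lags:
        if lag >= 0:
            a, b = left[: len(left) - lag], right[lag:]
        else:
            a, b = left[-lag:], right[: len(right) + lag]
        corr.append(np.dot(a, b))
    best_lag = lags[int(np.argmax(corr))]
    return best_lag / sample_rate


# Synthetic check: white noise, right channel delayed by 12 samples at 48 kHz.
rng = np.random.default_rng(0)
left = rng.standard_normal(4800)
right = np.concatenate([np.zeros(12), left[:-12]])
itd = estimate_itd(left, right, sample_rate=48000)
```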

3. Annotation Protocols and Ground Truth Structures

3.1. 360-VSumm Annotations

  • Annotation interface: Desktop GUI ("Fragment Selector") in Unity/C# enabling precise fragment selection and real-time feedback.
  • 15 annotators per clip, selecting exactly 15% of fragments; ground truth stored as per-annotator .txt (start_frame, end_frame) and as binary frame vectors.
  • Averaged summaries used for aggregate evaluation: for each video frame $i$, $s_i = \frac{1}{15} \sum_{a=1}^{15} b_{a,i}$.
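
The averaged summary is the per-frame mean of the annotators' binary frame vectors. A minimal sketch, assuming the vectors are stacked into an array of shape (annotators, frames); the function name is illustrative:

```python
import numpy as np


def average_summary(binary_summaries):
    """Per-frame mean over annotators: s_i = (1/A) * sum_a b[a, i]."""
    b = np.asarray(binary_summaries, dtype=float)  # shape (A, T)
    return b.mean(axis=0)


# Toy example with 3 annotators and 4 frames (the dataset uses 15 annotators).
scores = average_summary([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
])
```

Each entry of `scores` is the fraction of annotators who kept that frame, so values lie in [0, 1].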

3.2. PanoX Ground Truth

  • All modalities (RGB, depth, normals, albedo, roughness, metallic) are per-pixel and pixel-aligned across passes.
  • File formats: PNG (8/16 bit) for RGB and albedo; EXR (32-bit float) for depth, normals, roughness, and metallicity.
  • Scene-level directory structure for organization and deterministic file naming patterns.

3.3. 360+x Annotations

  • Per-example JSON: scene label (one of 28), temporal segmentation for fine-grained actions (three annotators, consensus protocol).
  • View-specific files: panoramic, front, egocentric (mono and stereo), each with aligned video, audio, ITD, and annotation JSONs.
  • Additional metadata: GPS, UTC timestamp, weather, high-resolution scene descriptions.

4. Experimental Protocols and Evaluation Benchmarks

4.1. 360-VSumm

  • 5-fold cross-validation: 80% train, 20% test per split; optional 10% of train for validation.
  • Evaluation metrics: Precision, Recall, F₁, as

$$P = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}},\quad R = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}},\quad F_1 = \frac{2PR}{P+R}$$

  • Aggregation: For each test video, report the highest F₁ over all annotators; then mean over all test videos.
  • Baseline performance:
Method | Pre-train Set | Test F₁ (%)
Random | - | 35.2
PGL-SUM (SumMe) | SumMe | 33.7
PGL-SUM (TVSum) | TVSum | 33.2
CA-SUM (SumMe) | SumMe | 36.5
CA-SUM (TVSum) | TVSum | 35.4
PGL-SUM (Omni360-X) | Omni360-X | 47.9
PGL-SUM + saliency | Omni360-X | 48.2
CA-SUM (Omni360-X) | Omni360-X | 45.3
CA-SUM + saliency | Omni360-X | 46.6
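
The split and aggregation rules above can be sketched as follows. The random fold assignment, the seed, and all function names are assumptions made for illustration; the official splits may be fixed differently, and the optional validation subset is omitted.

```python
import numpy as np


def five_fold_splits(video_ids, n_folds=5, seed=0):
    """Return a list of (train_ids, test_ids), one pair per fold."""
    rng = np.random.default_rng(seed)
    ids = np.array(video_ids)
    rng.shuffle(ids)
    folds = np.array_split(ids, n_folds)
    splits = []
    for k in range(n_folds):
        test = folds[k].tolist()
        train = [v for j, f in enumerate(folds) if j != k for v in f.tolist()]
        splits.append((train, test))
    return splits


def f1(gt, pred):
    """F1 between two binary frame vectors (0.0 if there is no overlap)."""
    gt, pred = np.asarray(gt, dtype=bool), np.asarray(pred, dtype=bool)
    tp = np.sum(gt & pred)
    if tp == 0:
        return 0.0
    p, r = tp / pred.sum(), tp / gt.sum()
    return float(2 * p * r / (p + r))


def video_score(annotator_summaries, pred):
    """Highest F1 of the prediction against any single annotator."""
    return max(f1(gt, pred) for gt in annotator_summaries)


def benchmark_score(per_video):
    """Mean of per-video scores; per_video holds (annotator_summaries, pred)."""
    return float(np.mean([video_score(a, p) for a, p in per_video]))


splits = five_fold_splits([f"v{i:02d}" for i in range(40)])
```

With 40 videos, each fold holds 8 test videos and 32 training videos, matching the 80/20 split.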

4.2. PanoX

  • Panoramic perception: Train/test splits by scene (8:1:1 for train, val, test; 2 extra scenes held for out-of-domain testing).
  • Tasks: Estimation of depth (AbsRel, δ<1.25, RMSE), normals (mean/median angle error, % below threshold), and PBR maps (PSNR, LPIPS).
  • Representative out-of-domain results: e.g., albedo (PSNR 17.76, LPIPS 0.344), depth (AbsRel 0.158, δ1.25 0.787, RMSE 6.83 m).
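
The depth and normal metrics named above follow standard definitions, sketched here with NumPy. The validity mask (ground-truth depth greater than zero) and the function names are assumptions; the PanoX evaluation code may mask or scale predictions differently.

```python
import numpy as np


def depth_metrics(pred, gt, eps=1e-8):
    """Return (AbsRel, delta < 1.25 accuracy, RMSE) over valid pixels."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = gt > eps                      # assumed validity rule
    p, g = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(p - g) / g)
    rmse = np.sqrt(np.mean((p - g) ** 2))
    ratio = np.maximum(p / g, g / np.maximum(p, eps))
    delta1 = np.mean(ratio < 1.25)
    return float(abs_rel), float(delta1), float(rmse)


def normal_angle_errors(pred, gt, eps=1e-8):
    """Return (mean, median) angular error in degrees between normal maps."""
    pred = np.asarray(pred, dtype=float).reshape(-1, 3)
    gt = np.asarray(gt, dtype=float).reshape(-1, 3)
    pred = pred / np.maximum(np.linalg.norm(pred, axis=1, keepdims=True), eps)
    gt = gt / np.maximum(np.linalg.norm(gt, axis=1, keepdims=True), eps)
    cos = np.clip(np.sum(pred * gt, axis=1), -1.0, 1.0)
    ang = np.degrees(np.arccos(cos))
    return float(ang.mean()), float(np.median(ang))


# Toy values, not dataset results.
abs_rel, delta1, rmse = depth_metrics([1.1, 2.0, 6.0], [1.0, 2.0, 4.0])
mean_err, median_err = normal_angle_errors([[0, 0, 1], [1, 0, 0]],
                                           [[0, 0, 1], [0, 1, 0]])
```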

4.3. 360+x

  • Video scene classification (mAP): the 360° view alone yields 56.3%; combining all views with audio and ITD reaches up to 80.6%.
  • Temporal action localization: mAP@{0.5,0.75,0.95}, best model achieves Avg mAP 17.6%.
  • Cross-modality retrieval (Recall@K): A+D→V yields R@1=55.9%, R@10=86.6%.
  • Self-supervised learning: VP+CO pretext tasks provide +16.9% mAP gain in classification and +1.9% in TAL.
  • Dataset adaptation: 360+x pre-trained features boost out-of-domain mAP in THUMOS14 (69.5→71.9→73.7%) and Epic-Kitchens (∼+0.5%).
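
Two building blocks behind these numbers are the temporal IoU used to threshold action-localization matches and Recall@K for retrieval. The sketch below uses standard definitions; it assumes segments are (start, end) pairs in seconds and that the correct gallery item for query i sits at index i, which is an illustrative convention rather than the benchmark's data layout.

```python
import numpy as np


def temporal_iou(seg_a, seg_b):
    """Intersection over union of two (start, end) time intervals."""
    s1, e1 = seg_a
    s2, e2 = seg_b
    inter = max(0.0, min(e1, e2) - max(s1, s2))
    union = (e1 - s1) + (e2 - s2) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(similarity, k):
    """Fraction of queries whose true match is among the top-k results.

    similarity[i, j] is the score of query i against gallery item j.
    """
    sim = np.asarray(similarity, dtype=float)
    top_k = np.argsort(-sim, axis=1)[:, :k]
    hits = (top_k == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())


# Toy similarity matrix, not benchmark data.
sim = [[0.9, 0.1, 0.0],
       [0.8, 0.2, 0.1],
       [0.1, 0.2, 0.7]]
r1, r2 = recall_at_k(sim, 1), recall_at_k(sim, 2)
iou = temporal_iou((0.0, 10.0), (5.0, 15.0))
```

A predicted segment counts as correct at threshold 0.5 only if its temporal IoU with a same-class ground-truth segment is at least 0.5, which is why mAP drops sharply at the 0.95 threshold.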

5. Data Access, Organization, and Licensing

  • 360-VSumm/Omni360-X: Publicly available at https://github.com/IDT-ITI/360-VSumm; open-source for non-commercial/academic use with required attribution ("Kontostathis et al., Video4IMX-2024").
  • PanoX: Scene-based directories, consistent file naming; synthetic, thus free from real-person privacy constraints.
  • 360+x: Hierarchical structure with scene directories, organized by split and viewpoint; data includes video, audio, ITD, GPS/timestamp, and annotation JSONs.
  • All datasets emphasize open research usage, stratified splits for unbiased benchmarking, and reproducible pipelines.

6.1. Loading and Processing (360-VSumm Example)

Typical workflow for loading and binarizing segmentations:

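import os
import cv2
import numpy as np

DATA_ROOT = "360-VSumm/videos"
ANN_ROOT  = "360-VSumm/annotations"

def load_annotations(video_id):
    """Read (start_frame, end_frame) pairs, one per line, for a video."""
    ann_file = os.path.join(ANN_ROOT, f"{video_id}.txt")
    segs = []
    with open(ann_file) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines instead of failing on unpacking
            a, b = map(int, line.split())
            segs.append((a, b))
    return segs

def load_video_frames(video_id, fps=2):
    """Decode a video and keep roughly `fps` frames per second."""
    path = os.path.join(DATA_ROOT, f"{video_id}.mp4")
    cap = cv2.VideoCapture(path)
    if not cap.isOpened():
        raise IOError(f"Cannot open video: {path}")
    # Compute the sampling step once; clamp to 1 so the modulo below never
    # divides by zero when the native rate is missing or below `fps`.
    step = max(1, int(cap.get(cv2.CAP_PROP_FPS) // fps))
    frames = []
    frame_idx = 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        if frame_idx % step == 0:
            frames.append(frame)
        frame_idx += 1
    cap.release()
    return frames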

Binarization and scoring strategy for fragment-level evaluation:

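def binarize_summary(segs, total_frames):
    """Turn (start_frame, end_frame) segments into a 0/1 frame vector."""
    arr = np.zeros(total_frames, dtype=int)
    for a, b in segs:
        arr[a:b+1] = 1  # end_frame is treated as inclusive here
    return arr

def f1_score(gt_arr, pred_arr):
    """Frame-level precision, recall, and F1 between two 0/1 vectors."""
    TP = np.sum((gt_arr==1)&(pred_arr==1))
    FP = np.sum((gt_arr==0)&(pred_arr==1))
    FN = np.sum((gt_arr==1)&(pred_arr==0))
    # The 1e-8 terms guard against division by zero for empty summaries
    P  = TP / (TP+FP+1e-8)
    R  = TP / (TP+FN+1e-8)
    F1 = 2*P*R / (P+R+1e-8)
    return P, R, F1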

6.2. Best Practices

  • For 360-VSumm: Down-sample to 2 fps; extract GoogLeNet pool5 features; apply 5-fold cross-validation; use saliency-weighted attention as appropriate.
  • For PanoX: Store geometry and physically-based maps as EXR; use Separate-Adapter cross-attention when modeling multiple modalities; validate generalization on the out-of-domain benchmarks.
  • For 360+x: Enforce hierarchical attention for ITD→audio→video fusion; synchronize AV/ITD for all augmentations; maintain stratified splits by scene.

7. Limitations and Community Recommendations

  • Synthetic datasets (PanoX) exhibit domain gaps in material and lighting realism; fine-tuning on real panoramas is recommended to mitigate this effect.
  • For 360-VSumm, short video duration and coarse annotation granularity (2 s) may limit fine-grained temporal analysis; the 15% length constraint mirrors prevailing summarization protocol but may suppress scene diversity.
  • 360+x does not provide spatial annotations (e.g., object masks, depth) in its initial release; future efforts may extend panoptic coverage to more spatially localized tasks.
  • Seam artifacts and projection wrap-around issues remain a challenge for equirectangular formats, particularly for learning positional representations in panoramic domains.
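
One common way to soften the wrap-around seam is to pad panoramas circularly along the horizontal axis before convolution, so that the left and right borders see each other as neighbours. The sketch below is a generic technique, not something prescribed by any of the three datasets; replicating the top and bottom rows at the poles is a simplification.

```python
import numpy as np


def pad_equirect(image, pad):
    """Pad an (H, W) or (H, W, C) equirectangular image.

    Longitude is periodic, so columns wrap around; rows are edge-replicated
    because latitude does not wrap.
    """
    extra = ((0, 0),) * (image.ndim - 2)
    out = np.pad(image, ((0, 0), (pad, pad)) + extra, mode="wrap")
    out = np.pad(out, ((pad, pad), (0, 0)) + extra, mode="edge")
    return out


img = np.arange(12).reshape(3, 4)
padded = pad_equirect(img, 1)
```

After padding, the first column of the interior rows equals the last column of the original image, so a 3×3 kernel applied at the seam sees continuous content.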

These datasets collectively provide a foundation for research in summarization, scene generation, panoramic perception, and multi-modal scene understanding. All protocols, metrics, and baseline results are reported in detail to foster reproducibility and comparability in future work (Kontostathis et al., 2024, Huang et al., 30 Oct 2025, Chen et al., 2024).
