Multi-view Timelapse Datasets
- Multi-view timelapse datasets are curated collections capturing the same scene from diverse viewpoints over time, enabling spatiotemporal analysis.
- They employ structure-from-motion techniques, automated filtering, and segmentation to align images and manage transient elements effectively.
- This approach supports modeling continuous lighting variations and discrete scene changes, establishing benchmarks for realistic 4D neural reconstructions.
A multi-view timelapse dataset is a curated collection of images capturing the same scene or landmark from a variety of viewpoints and at numerous temporal instances. Such datasets are foundational for 4D neural reconstruction tasks, where the objective is to model changes in appearance and structure of real-world environments over time, incorporating spatial (multi-view) and chronological (timelapse) dimensions. The "Neural Scene Chronology" dataset exemplifies these principles and introduces high-value resources for learning intricate spatiotemporal phenomena from uncontrolled Internet imagery (Lin et al., 2023).
1. Dataset Composition and Scope
The multi-view timelapse dataset introduced in "Neural Scene Chronology" comprises four distinct urban landmarks, each exhibiting a history of discrete and sporadic content changes. The scenes are:
- Times Square (New York City), characterized by billboards and neon signage
- Akihabara (Tokyo), featuring electronic storefronts and signage
- 5Pointz (Long Island City), an outdoor graffiti-focused venue
- The Metropolitan Museum of Art ("The Met," New York City), with frequently changing façade banners
After calibration and selection by structure-from-motion (SfM), the dataset includes 12,691 multi-view images in total, distributed as follows:
| Scene | Time Span | Calibrated Images | Distinct Camera Poses |
|---|---|---|---|
| Times Sq. | 2009–2013 | 5,965 | ~2,000 |
| Akihabara | 2009–2013 | 1,078 | ~700 |
| 5Pointz | 2009–2013 | 3,521 | ~1,200 |
| The Met | 2009–2013 | 2,127 | ~800 |
All images are sourced from Flickr public uploads, spanning approximately four years per site (2009–2013). The viewpoint distribution reflects dense clustering around popular vantage points and marked sparsity or absence in "blind spots" such as high façades.
2. Data Collection and Calibration
Images are assembled by keyword-based crawling of Flickr, typically yielding an initial candidate pool of up to ~300,000 photos per scene. Automatic filtering proceeds via COLMAP SfM, which estimates camera poses and generates sparse point clouds. The process is parallelized by partitioning the images into overlapping batches, each reconstructed separately and then merged into a global model. Images that fail to register, along with outliers, are discarded.
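The batching step can be illustrated with a minimal sketch. The source does not say how batches are formed, so the function name `overlapping_batches`, the fixed `batch_size`, and the fixed `overlap` are all assumptions made for illustration; the only point carried over from the text is that neighbouring batches share images.

```python
from typing import List, Sequence


def overlapping_batches(
    items: Sequence[str], batch_size: int, overlap: int
) -> List[List[str]]:
    """Split items into consecutive batches; each batch repeats the last
    `overlap` items of the previous one (hypothetical batching scheme)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if not 0 <= overlap < batch_size:
        raise ValueError("overlap must satisfy 0 <= overlap < batch_size")

    step = batch_size - overlap
    batches: List[List[str]] = []
    start = 0
    while start < len(items):
        batches.append(list(items[start:start + batch_size]))
        if start + batch_size >= len(items):
            break  # the last batch already reaches the end of the list
        start += step
    return batches


if __name__ == "__main__":
    names = [f"img_{i}" for i in range(10)]
    for batch in overlapping_batches(names, batch_size=4, overlap=1):
        print(batch)
```

Images that appear in two batches give the separately reconstructed models something in common, which is presumably what makes merging them into one global model possible.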
Manual curation then reduces ambiguity and dataset size. The steps include:
- Selecting a region-of-interest within the reconstructed point cloud (using MeshLab) to ensure tractable neural rendering
- Excluding non-photorealistic or portrait-style shots, especially from held-out test splits
- Annotating and masking transient or moving scene elements (e.g., pedestrians, vehicles) by using a pretrained segmentation network. A learned per-pixel uncertainty weighting further refines this masking, minimizing the impact of labeling errors or ambiguity
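The interaction between the segmentation mask and the uncertainty weighting can be sketched as follows. The exact loss used in the original pipeline is not given in the text; this example assumes one common form, a Gaussian negative log-likelihood in which each pixel's squared error is divided by a predicted variance. The function name, the `beta_min` floor, and the flat per-pixel lists are all illustrative assumptions.

```python
import math
from typing import Sequence


def weighted_photometric_loss(
    pred: Sequence[float],
    target: Sequence[float],
    transient_mask: Sequence[bool],
    beta: Sequence[float],
    beta_min: float = 0.03,
) -> float:
    """Mean per-pixel loss over pixels NOT flagged as transient.

    pred, target   : per-pixel intensities
    transient_mask : True where the segmentation network flags a transient
    beta           : per-pixel predicted uncertainty (standard deviation)
    beta_min       : assumed lower bound that keeps the division stable
    """
    total = 0.0
    count = 0
    for p, t, is_transient, b in zip(pred, target, transient_mask, beta):
        if is_transient:
            continue  # masked pixels contribute nothing
        b = max(b, beta_min)
        # The squared error is down-weighted by the uncertainty; the log
        # term penalizes declaring every pixel uncertain.
        total += (p - t) ** 2 / (2.0 * b * b) + math.log(b)
        count += 1
    return total / count if count else 0.0
```

Under this assumed form, a pixel that the segmentation network missed but that the model cannot explain can be assigned a larger `beta`, which reduces its influence on the reconstruction.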
There is no manual temporal binning; EXIF timestamps are used directly as continuous chronological signals. Discovery of discrete scene changes, such as mural repaints or signage swaps, is handled algorithmically during model training.
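As an illustration of treating timestamps as a continuous signal, the sketch below converts an EXIF date string (EXIF stores dates as `YYYY:MM:DD HH:MM:SS`) into a single floating-point value. The normalization window from 2009 to 2014 and the function name are assumptions chosen to match the dataset's stated time span, not details taken from the source.

```python
from datetime import datetime

EXIF_FORMAT = "%Y:%m:%d %H:%M:%S"  # standard EXIF date-time layout


def exif_to_unit_time(
    stamp: str,
    start: datetime = datetime(2009, 1, 1),  # assumed window start
    end: datetime = datetime(2014, 1, 1),    # assumed window end
) -> float:
    """Map an EXIF timestamp to a continuous value; 0.0 is `start`, 1.0 is `end`."""
    taken = datetime.strptime(stamp, EXIF_FORMAT)
    return (taken - start).total_seconds() / (end - start).total_seconds()


if __name__ == "__main__":
    print(exif_to_unit_time("2011:07:04 18:30:00"))
```

Values outside the range 0 to 1 indicate a timestamp outside the assumed window, which can serve as a cheap check for misconfigured camera clocks.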
3. Dataset Characteristics and Annotation
Original images exhibit high variability in resolution (from low hundreds to several thousand pixels per edge). Preprocessing rescales the shorter side to approximately 600–800 pixels before NeRF (Neural Radiance Field) training.
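The rescaling arithmetic can be shown with a short sketch. The source gives only the approximate target range for the shorter side, so the default `target` of 800 pixels, the unconditional scaling (which would also enlarge small images), and the function name are assumptions.

```python
from typing import Tuple


def rescale_shorter_side(width: int, height: int, target: int = 800) -> Tuple[int, int]:
    """Return (new_width, new_height) so that the shorter side equals `target`,
    preserving the aspect ratio up to rounding."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


if __name__ == "__main__":
    print(rescale_shorter_side(4000, 3000))  # landscape photo
    print(rescale_shorter_side(3000, 4000))  # portrait photo
```

When pixel dimensions change, the focal length and principal point estimated by SfM need to be scaled by the same factor so that the calibrated cameras stay consistent with the resized images.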
The dataset encompasses a broad range of appearance conditions:
- Illumination varies from dawn to dusk, including complex artificial lighting at night
- Weather variation (sunny, cloudy) is present, as inherited from uncontrolled photography
- Critical discrete events include:
  - Billboard image swaps (Times Square, Akihabara)
  - Replacement or overpainting of graffiti (5Pointz)
  - Banner updates (The Met)
Such discrete changes are non-periodic: a scene typically goes weeks or months with little or no alteration and then changes abruptly. Each image carries an EXIF-derived timestamp and, within the modeling pipeline, a learned per-image 32-dimensional illumination embedding is computed. This embedding decouples lighting and white balance effects from structural or content change.
4. Preprocessing, Usage, and Accessibility
For optimal reuse, standard preprocessing steps must be executed:
- SfM (COLMAP or equivalent) to determine extrinsic camera parameters and generate a supporting sparse cloud for scene alignment
- Definition of a spatial region-of-interest (via point cloud selection) to limit computational demands of subsequent reconstructions
- Masking of transient scene objects, typically by deploying a pretrained segmentation model coupled with uncertainty-aware loss weighting
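The region-of-interest step can be approximated programmatically. The original workflow selects the region interactively, so the axis-aligned bounding box used here, along with the function name `crop_to_box`, is a hypothetical stand-in for that manual selection.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float, float]


def crop_to_box(points: Iterable[Point], box_min: Point, box_max: Point) -> List[Point]:
    """Keep only the sparse-cloud points that lie inside an axis-aligned box."""
    return [
        p
        for p in points
        if all(lo <= c <= hi for c, lo, hi in zip(p, box_min, box_max))
    ]


if __name__ == "__main__":
    cloud = [(0.0, 0.0, 0.0), (5.0, 1.0, 0.5), (0.5, 0.5, 0.5)]
    print(crop_to_box(cloud, box_min=(-1.0, -1.0, -1.0), box_max=(1.0, 1.0, 1.0)))
```

The resulting box also gives the scene bounds within which a neural representation has to allocate capacity, which is how the selection limits computational cost.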
The dataset and codebase are openly distributed at https://zju3dv.github.io/neusc/ with licensing for academic research by Cornell and Zhejiang University. Users are advised to review licensing details on the project website.
5. Methodological Innovations for Chronology Modeling
A central challenge in multi-view timelapse modeling is encoding discrete, temporally localized content-change events amid continuous appearance variation from lighting or viewpoint. To address this, the dataset is paired with a representation in which each timestamp $t$ is mapped through a learnable bank of smooth step functions, each of the form
$$h(t) = \bar{H}_{\alpha}(t - u),$$
where $\bar{H}_{\alpha}$ is a smooth approximation of the Heaviside step function, $u$ is the learned transition date, and $\alpha$ the sharpness. A vector with 16–32 such components enables the representation of multiple discrete transitions (e.g., billboard swaps, mural overpainting) along the timeline, without ground-truth change annotations or explicit interval binning.
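A minimal sketch of such a step-function encoding follows. The particular smooth step used here (an exponential ramp on each side of the transition) is an assumed choice for illustration, and in a real model the transition dates and sharpness would be trainable parameters rather than the fixed numbers shown. The names `smooth_step`, `step_encoding`, `transitions`, and `alpha` are likewise illustrative.

```python
import math
from typing import List, Sequence


def smooth_step(x: float, alpha: float) -> float:
    """Smooth approximation of a unit step at x = 0.

    alpha > 0 sets the transition width: smaller alpha gives a sharper step.
    The two-sided exponential form is an assumption made for this sketch.
    """
    if x <= 0:
        return 0.5 * math.exp(x / alpha)
    return 1.0 - 0.5 * math.exp(-x / alpha)


def step_encoding(t: float, transitions: Sequence[float], alpha: float) -> List[float]:
    """Encode timestamp t as one smooth step per transition date."""
    return [smooth_step(t - u, alpha) for u in transitions]


if __name__ == "__main__":
    # Two hypothetical transition dates on a normalized timeline.
    for t in (0.1, 0.5, 0.9):
        print(t, step_encoding(t, transitions=[0.2, 0.8], alpha=0.01))
```

Each component stays near 0 before its transition date and near 1 after it, so two timestamps that fall between the same pair of transitions receive almost identical encodings. That is what lets a downstream network keep scene content constant between changes while still switching abruptly at a change.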
A plausible implication is that future multi-view timelapse datasets may benefit from such algorithmic, unsupervised discovery of discrete changes, particularly when annotations are ambiguous or unavailable.
6. Limitations, Biases, and Benchmarking Value
The inherent properties of Internet photo collections lead to several known limitations and biases:
- Temporal coverage is concentrated in 2009–2013, reflecting Flickr's period of peak popularity; more recent images are scarce.
- Viewpoints are heavily clustered in tourist-frequented areas; many scene aspects (e.g., high façades) are under- or un-sampled, introducing viewpoint bias and “blind spots.”
- Scene chronology inference relies on the correctness of EXIF timestamps, which, if erroneous, may degrade the learned spatiotemporal sequences.
- No human-annotated ground-truth exists for the timing of discrete content changes; learned step-function transitions provide only approximate temporal localization.
Despite these limitations, the dataset’s scale, diversity, and coverage of real-world, sporadic changes establish it as a new benchmark for 4D (space + time) neural reconstruction using unconstrained Internet imagery (Lin et al., 2023). The strengths lie in supporting research on robust, large-scale, and realistic rendering of temporally evolving scenes, unconstrained by periodic or synthetic change assumptions.