Multi-view Timelapse Datasets

Updated 10 February 2026

Multi-view timelapse datasets are curated collections capturing the same scene from diverse viewpoints over time, enabling spatiotemporal analysis.
They employ structure-from-motion techniques, automated filtering, and segmentation to align images and manage transient elements effectively.
This approach supports modeling continuous lighting variations and discrete scene changes, establishing benchmarks for realistic 4D neural reconstructions.

A multi-view timelapse dataset is a curated collection of images capturing the same scene or landmark from a variety of viewpoints and at numerous points in time. Such datasets are foundational for 4D neural reconstruction tasks, where the objective is to model changes in the appearance and structure of real-world environments over time, combining spatial (multi-view) and chronological (timelapse) dimensions. The "Neural Scene Chronology" dataset exemplifies these principles and provides resources for learning intricate spatiotemporal phenomena from uncontrolled Internet imagery (Lin et al., 2023).

1. Dataset Composition and Scope

The multi-view timelapse dataset introduced in "Neural Scene Chronology" comprises four distinct urban landmarks, each exhibiting a history of discrete and sporadic content changes. The scenes are:

Times Square (New York City), characterized by billboards and neon signage
Akihabara (Tokyo), featuring electronic storefronts and signage
5Pointz (Long Island City), an outdoor graffiti-focused venue
The Metropolitan Museum of Art ("The Met," New York City), with frequently changing façade banners

After calibration and selection by structure-from-motion (SfM), the dataset includes 12,691 multi-view images in total, distributed as follows:

Scene	Time Span	Calibrated Images	Distinct Camera Poses
Times Sq.	2009–2013	5,965	~2,000
Akihabara	2009–2013	1,078	~700
5Pointz	2009–2013	3,521	~1,200
The Met	2009–2013	2,127	~800

All images are sourced from Flickr public uploads, spanning approximately four years per site (2009–2013). The viewpoint distribution reflects dense clustering around popular vantage points and marked sparsity or absence in "blind spots" such as high façades.

2. Data Collection and Calibration

Images are assembled by keyword-based crawling of Flickr, which typically yields an initial candidate pool of up to ~300,000 photos per scene. Automatic filtering proceeds via COLMAP SfM, which estimates camera poses and generates sparse point clouds. The process is parallelized by partitioning the images into overlapping batches, each reconstructed separately and then merged into a global model. Images that fail to register, along with outliers, are discarded.
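The batching step can be illustrated with a minimal sketch. The source does not say how the batches are formed, so the fixed-size sliding window below, the function name `overlapping_batches`, and its parameters are assumptions made for illustration; the SfM reconstruction and merging themselves are not shown. The images shared between neighbouring batches are what would give separately reconstructed sub-models common ground for merging.

```python
def overlapping_batches(items, batch_size, overlap):
    """Split `items` into consecutive batches that share `overlap` items.

    Hypothetical helper: the actual batching rule used for the dataset
    is not specified in the source.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if not 0 <= overlap < batch_size:
        raise ValueError("overlap must satisfy 0 <= overlap < batch_size")
    if not items:
        return []

    step = batch_size - overlap  # how far the window advances each time
    batches = []
    start = 0
    while True:
        batches.append(items[start:start + batch_size])
        if start + batch_size >= len(items):
            break  # the last window already reaches the end of the list
        start += step
    return batches


if __name__ == "__main__":
    image_ids = list(range(10))  # stand-ins for image file names
    for batch in overlapping_batches(image_ids, batch_size=4, overlap=2):
        print(batch)
```

A larger overlap gives neighbouring sub-models more shared images at the cost of more redundant reconstruction work.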

Manual curation reduces ambiguities and scale, including:

Selecting a region-of-interest within the reconstructed point cloud (using MeshLab) to ensure tractable neural rendering
Excluding non-photorealistic or portrait-style shots, especially from held-out test splits
Annotating and masking transient or moving scene elements (e.g., pedestrians, vehicles) by using a pretrained segmentation network. A learned per-pixel uncertainty weighting further refines this masking, minimizing the impact of labeling errors or ambiguity

There is no manual temporal binning; EXIF timestamps are utilized directly as continuous chronological signals. Discovery of discrete scene changes—such as mural repaints or signage swaps—is handled algorithmically during model training.
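As a rough illustration of using EXIF timestamps as a continuous signal, the sketch below parses timestamp strings and rescales them to $[0,1]$, the range assumed for $t$ in Section 5. Min-max normalization over the collection is an assumption here, not a documented step of the pipeline, and `normalize_timestamps` is a hypothetical helper. The string format `YYYY:MM:DD HH:MM:SS` is the standard EXIF date-time layout.

```python
from datetime import datetime

# Standard EXIF date-time layout, e.g. "2011:06:15 14:30:00".
EXIF_FORMAT = "%Y:%m:%d %H:%M:%S"


def normalize_timestamps(exif_strings):
    """Map EXIF date-time strings to floats in [0, 1] (min-max scaling).

    Assumption: the earliest photo maps to 0 and the latest to 1.
    """
    if not exif_strings:
        return []
    times = [datetime.strptime(s, EXIF_FORMAT) for s in exif_strings]
    t_min, t_max = min(times), max(times)
    span = (t_max - t_min).total_seconds()
    if span == 0:
        # All photos share one timestamp; place them at the start.
        return [0.0 for _ in times]
    return [(t - t_min).total_seconds() / span for t in times]


if __name__ == "__main__":
    stamps = ["2009:01:01 00:00:00", "2011:01:01 00:00:00", "2013:01:01 00:00:00"]
    print(normalize_timestamps(stamps))
```

No binning is applied: each image keeps its own continuous value, which matches the description above.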

3. Dataset Characteristics and Annotation

Original images exhibit high variability in resolution (from low hundreds to several thousand pixels per edge). Preprocessing rescales the shorter side to approximately 600–800 pixels before NeRF (Neural Radiance Field) training.
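The rescaling rule can be written down as a small size computation. This is a sketch under stated assumptions: the function name and the default target of 800 pixels are illustrative, and the source does not say whether images smaller than the target are upscaled, so the sketch simply applies the same scale factor in both directions.

```python
def rescale_shorter_side(width, height, target=800):
    """Return (new_width, new_height) with the shorter side equal to `target`.

    The aspect ratio is preserved up to rounding. `target` is a hypothetical
    default inside the 600-800 pixel range mentioned in the text.
    """
    if width <= 0 or height <= 0 or target <= 0:
        raise ValueError("width, height and target must be positive")
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


if __name__ == "__main__":
    print(rescale_shorter_side(4000, 3000))       # landscape photo
    print(rescale_shorter_side(3000, 4000))       # portrait photo
    print(rescale_shorter_side(1600, 1200, 600))  # smaller target
```

The returned size would then be passed to whatever image library performs the actual resampling.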

The dataset encompasses a broad range of appearance conditions:

Illumination varies from dawn to dusk, including complex artificial lighting at night
Weather variation (sunny, cloudy) is present, as inherited from uncontrolled photography
Critical discrete events include:
- Billboard image swaps (Times Square, Akihabara)
- Replacement or overpainting of graffiti (5Pointz)
- Banner updates (The Met)

Such discrete changes are non-periodic: a scene typically remains unchanged for weeks or months and then changes abruptly. Each image is associated with an EXIF-derived timestamp and, within the modeling pipeline, a learned per-image 32-dimensional illumination embedding. This embedding decouples lighting and white-balance effects from structural or content change.

4. Preprocessing, Usage, and Accessibility

For optimal reuse, standard preprocessing steps must be executed:

SfM (COLMAP or equivalent) to determine extrinsic camera parameters and generate a supporting sparse cloud for scene alignment
Definition of a spatial region-of-interest (via point cloud selection) to limit computational demands of subsequent reconstructions
Masking of transient scene objects, typically by deploying a pretrained segmentation model coupled with uncertainty-aware loss weighting
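To make the last step more concrete, the sketch below combines a hard transient mask with an uncertainty-weighted photometric loss. The exact loss used with the dataset is not given in the source; the Gaussian negative-log-likelihood form used here (squared error divided by $2\sigma^2$, plus $\log\sigma$) is an assumption borrowed from common practice, and all names (`masked_uncertainty_loss`, `sigma`, the flat per-pixel lists) are hypothetical.

```python
import math


def masked_uncertainty_loss(pred, target, transient_mask, sigma):
    """Mean uncertainty-weighted loss over pixels not flagged as transient.

    pred, target    : per-pixel intensities (floats)
    transient_mask  : True where a segmentation model flagged a transient object
    sigma           : per-pixel predicted uncertainty, must be > 0

    Assumed loss per pixel: (pred - target)^2 / (2 * sigma^2) + log(sigma).
    """
    total, count = 0.0, 0
    for p, y, is_transient, s in zip(pred, target, transient_mask, sigma):
        if is_transient:
            continue  # masked pixels contribute nothing
        if s <= 0:
            raise ValueError("sigma must be positive")
        # A high sigma down-weights the residual; log(sigma) penalizes
        # declaring everything uncertain.
        total += (p - y) ** 2 / (2.0 * s ** 2) + math.log(s)
        count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    pred = [0.5, 0.9, 0.2]
    target = [0.5, 0.1, 0.4]
    mask = [False, True, False]   # the middle pixel is a pedestrian, say
    sigma = [1.0, 1.0, 1.0]
    print(masked_uncertainty_loss(pred, target, mask, sigma))
```

The hard mask removes pixels the segmentation model is confident about, while the uncertainty term softens the influence of pixels it may have mislabeled.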

The dataset and codebase are openly distributed at https://zju3dv.github.io/neusc/ and are licensed for academic research by Cornell and Zhejiang University. Users should review the licensing details on the project website.

5. Methodological Innovations for Chronology Modeling

A central challenge in multi-view timelapse modeling is encoding discrete, temporally localized content-change events amid continuous appearance variation from lighting or viewpoint. To address this, the dataset is paired with a representation wherein each timestamp $t \in [0,1]$ is mapped through a learnable bank of smooth step functions:

$\bar h(t;u,\beta) = \begin{cases} \frac{1}{2} \exp\left(\frac{t-u}{\beta}\right), & t \le u \\ 1 - \frac{1}{2}\exp\left(-\frac{t-u}{\beta}\right), & t > u \end{cases}$

where $u$ is the learned transition time and $\beta$ controls the sharpness of the step (a smaller $\beta$ gives a more abrupt transition). At $t = u$ both branches equal $\frac{1}{2}$, so the function is continuous. A vector $H(t)$ with 16–32 such components can represent multiple discrete transitions (e.g., billboard swaps, mural overpainting) along the timeline, without ground-truth change annotations or explicit interval binning.
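The step function defined above can be evaluated directly, as in the following minimal sketch. In the paired model $u$ and $\beta$ are learned parameters; here they are fixed, hypothetical values, and the names `smooth_step` and `step_encoding` are illustrative only.

```python
import math


def smooth_step(t, u, beta):
    """Evaluate the smooth step: 0.5*exp((t-u)/beta) for t <= u,
    1 - 0.5*exp(-(t-u)/beta) for t > u."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    if t <= u:
        return 0.5 * math.exp((t - u) / beta)
    return 1.0 - 0.5 * math.exp(-(t - u) / beta)


def step_encoding(t, transitions, betas):
    """Build H(t): one smooth step per (transition time, sharpness) pair."""
    return [smooth_step(t, u, b) for u, b in zip(transitions, betas)]


if __name__ == "__main__":
    # Two hypothetical transitions at 25% and 75% of the timeline.
    transitions = [0.25, 0.75]
    betas = [0.01, 0.01]
    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(t, step_encoding(t, transitions, betas))
```

Each component stays near 0 before its transition and near 1 after it, so the vector $H(t)$ acts as a soft indicator of which changes have already happened at time $t$. Because the exponent is never positive in either branch, the evaluation cannot overflow even for very small $\beta$.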

A plausible implication is that future multi-view timelapse datasets may benefit from such algorithmic, unsupervised discovery of discrete changes, particularly when annotations are ambiguous or unavailable.

6. Limitations, Biases, and Benchmarking Value

The inherent properties of Internet photo collections lead to several known limitations and biases:

Temporal coverage is peaked (2009–2013), reflecting historic popularity on Flickr; more recent images are scarce.
Viewpoints are heavily clustered in tourist-frequented areas; many scene aspects (e.g., high façades) are under- or un-sampled, introducing viewpoint bias and “blind spots.”
Scene chronology inference relies on the correctness of EXIF timestamps, which, if erroneous, may degrade the learned spatiotemporal sequences.
No human-annotated ground-truth exists for the timing of discrete content changes; learned step-function transitions provide only approximate temporal localization.

Despite these limitations, the dataset’s scale, diversity, and coverage of real-world, sporadic changes establish it as a new benchmark for 4D (space + time) neural reconstruction using unconstrained Internet imagery (Lin et al., 2023). The strengths lie in supporting research on robust, large-scale, and realistic rendering of temporally evolving scenes, unconstrained by periodic or synthetic change assumptions.

References

Lin et al. (2023). Neural Scene Chronology.
