EgoNight-VQA: Benchmark for Nighttime VQA
- EgoNight-VQA is a comprehensive benchmark evaluating egocentric visual question answering under low-light conditions using day–night aligned video pairs.
- It employs a multi-stage annotation pipeline to generate reliable QA pairs and adds auxiliary tasks, including day–night correspondence retrieval and depth estimation.
- The benchmark reveals significant performance drops in current multimodal models, emphasizing the need for robust, illumination-invariant solutions.
EgoNight-VQA is a comprehensive benchmark specifically designed to evaluate egocentric vision understanding under nighttime, low-illumination conditions. It introduces day–night aligned video pairs and a multi-stage annotation pipeline to create a visual question answering (VQA) dataset with reliable ground truth. The benchmark also includes auxiliary tasks such as day–night correspondence retrieval and egocentric depth estimation at night, with the explicit aim of exposing and quantifying the limitations of current multimodal LLMs (MLLMs) in realistic, low-light first-person scenarios (Zhang et al., 7 Oct 2025).
1. Motivation and Benchmark Objectives
Most prior egocentric vision datasets and VQA benchmarks focus on daytime conditions, which is a significant limitation when considering real-world, application-driven deployments (e.g., lifelogging, security, or wearable AI). EgoNight-VQA addresses this gap by:
- Systematically collecting and aligning day–night video pairs to ensure that scene content, actions, and trajectories are temporally and spatially matched, so that only the illumination conditions differ.
- Designing a VQA task suite that encompasses both perception (object/text recognition) and complex reasoning (spatial, temporal, navigation, and non–common-sense questions) in low-light contexts.
- Introducing auxiliary retrieval and depth tasks to holistically probe spatial correspondence and geometry understanding at night, extending beyond classical VQA.
The explicit goal is to rigorously evaluate model generalization and reveal performance degradation attributable to low-illumination (Zhang et al., 7 Oct 2025).
2. Data Construction and Annotation Protocol
EgoNight-VQA is constructed from three primary sources:
- EgoNight-Synthetic: Simulated indoor videos (50 pairs) rendered in Blender using Infinigen, offering exact day–night alignment and access to ground truth scene geometry (RGB, depth, normals).
- EgoNight-Sofia: Real-world day–night paired videos (20 pairs) captured under a guided protocol in Sofia, Bulgaria. Wearers repeat the same trajectories and activities in both lighting conditions, using daytime footage as a reference for nighttime recording.
- EgoNight-Oxford: Nighttime-only egocentric videos from the Oxford Day-and-Night dataset (20 videos), included to expand scene and activity diversity.
The annotation pipeline is three-staged:
- Nighttime Captioning: Advanced MLLMs generate rich captions describing low-light video clips.
- Question Generation: For each caption, new candidate questions are proposed using the same model.
- Day-Augmented Pseudo Answer Synthesis: For "paired" QA types (e.g., object recognition), answers are extracted from the day video, mitigating the unreliability of human annotation under severe low illumination. For "unpaired" types (e.g., lighting recognition), answers are derived directly from nighttime content.
Manual post-processing (over 300 hours of annotator time) removes, corrects, or supplements QA pairs, yielding a final corpus of 3,658 high-reliability QA pairs across 90 videos (Zhang et al., 7 Oct 2025). A minimal sketch of the automated stages follows.
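For concreteness, here is a minimal Python-style sketch of the three-stage pipeline. The helper calls (`mllm.caption`, `mllm.propose_questions`, `mllm.answer`) and the set of paired types are illustrative assumptions, not the benchmark's actual prompts or interfaces.

```python
# Illustrative sketch of the three-stage annotation pipeline.
# All helper calls below are hypothetical placeholders for MLLM prompting;
# the paper's actual prompts, models, and interfaces are not reproduced here.

PAIRED_TYPES = {"object_recognition", "text_recognition", "spatial_reasoning"}  # subset, for illustration

def annotate_clip(night_clip, day_clip, mllm):
    qa_pairs = []
    # Stage 1: nighttime captioning of the low-light clip.
    caption = mllm.caption(night_clip)
    # Stage 2: question generation from the caption.
    for question, qa_type in mllm.propose_questions(caption):
        # Stage 3: day-augmented pseudo-answer synthesis.
        if qa_type in PAIRED_TYPES:
            # Paired types: read the answer off the well-lit day counterpart.
            answer = mllm.answer(day_clip, question)
        else:
            # Unpaired types (e.g., lighting recognition): answer from the night clip.
            answer = mllm.answer(night_clip, question)
        qa_pairs.append({"question": question, "type": qa_type, "answer": answer})
    # QA pairs are subsequently filtered, corrected, or supplemented by human annotators.
    return qa_pairs
```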
3. Dataset Structure and QA Taxonomy
The dataset consists of approximately 40 QA pairs per video with twelve defined QA types:
- Paired Types: Identical questions can be answered for both day and night. Includes object recognition, text recognition, spatial reasoning, scene sequence, navigation, counting of statics, action recognition, and non–common-sense reasoning.
- Unpaired Types: Night-only queries (lighting recognition, lighting dynamic, dynamic detection, and counting of dynamics) that directly probe features unique to low-light environments.
Temporal granularity varies by QA type: static recognition tasks use 3-second clips; temporal reasoning and navigation use longer sequences. The day–night alignment in both synthetic and real data underpins the reliability of annotations and allows for controlled illumination-difference studies (Zhang et al., 7 Oct 2025).
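As an illustration of how such an annotation might be organized, a single record could look like the hypothetical example below; the field names and values are assumptions for exposition, not the released dataset's actual schema.

```python
# Hypothetical QA record; field names are illustrative, not the dataset's real schema.
qa_record = {
    "video_id": "sofia_013_night",        # nighttime video from the EgoNight-Sofia split
    "paired_video_id": "sofia_013_day",   # day counterpart (None for night-only sources)
    "qa_type": "object_recognition",      # one of the twelve QA types
    "paired": True,                       # paired vs. unpaired (night-only) type
    "clip_span_s": [42.0, 45.0],          # 3-second clip for static recognition tasks
    "question": "What object is on the table next to the lamp?",
    "answer": "a ceramic mug",
}
```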
4. Evaluation Protocols and Experimental Results
A broad set of MLLMs was evaluated:
| Model Type | Example Models | Nighttime VQA Accuracy (Best) | Day–Night Drop |
|---|---|---|---|
| Closed-source | GPT-4.1, Gemini 2.5 Pro | ~30.93% (on synthetic data) | 25–32% |
| Open-source | Qwen2.5-VL (various sizes), InternVL3-8B, EgoGPT | Significantly lower | Comparable (~25–32%) |
- Open-ended accuracy was judged using an "LLM-as-a-Judge" approach: GPT-4.1 grades generated answers for correctness and semantic alignment (a minimal grading sketch follows this list).
- On day–night paired questions, all models showed ~25–32% absolute performance drops from day to night.
- Perceptual tasks (object/text recognition) are most severely affected, while higher-level reasoning (navigation, scene sequence) exhibits relatively less, but still substantial, sensitivity to lighting conditions.
- Egocentric-specialized models (e.g., EgoGPT) do not outperform generic MLLMs under night conditions.
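A minimal sketch of such an LLM-as-a-Judge scoring loop is given below, assuming an OpenAI-compatible Python client and a simple binary grading prompt; the benchmark's exact judging prompt and rubric are not reproduced here.

```python
# Minimal LLM-as-a-Judge sketch: an LLM grades each generated answer for
# correctness and semantic alignment with the reference answer.
# The prompt wording and response parsing are assumptions, not the benchmark's.
from openai import OpenAI

client = OpenAI()

def judge_answer(question: str, reference: str, prediction: str) -> int:
    prompt = (
        "You are grading a VQA answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Reply with 1 if the model answer is correct and semantically "
        "consistent with the reference, otherwise reply with 0."
    )
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return 1 if resp.choices[0].message.content.strip().startswith("1") else 0

# Open-ended accuracy is then the mean of these per-question judgments.
```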
In addition to VQA, the auxiliary tasks show:
- Spatial Retrieval: Top-1 accuracy for matching night–day scene pairs is well below that for day–day pairs, reflecting difficulty in robust visual correspondence under illumination shift.
- Temporal Localization: Mean Intersection-over-Union (mIoU) between predicted and ground-truth temporal segments serves as the metric.
- Depth Estimation: On synthetic data, models are evaluated with AbsRel and the accuracy thresholds δ₁, δ₂, δ₃; none approach the precision seen on daytime data, an indicator of how challenging low light is for geometry estimation (standard definitions of these metrics are sketched below).
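For reference, the sketch below gives conventional NumPy formulations of top-1 retrieval accuracy, temporal IoU, AbsRel, and the δ-threshold accuracies; it reflects the standard definitions of these metrics rather than the benchmark's released evaluation code.

```python
import numpy as np

def top1_retrieval_accuracy(similarity: np.ndarray) -> float:
    """Top-1 accuracy for night-to-day retrieval, given an (N_night, N_day)
    similarity matrix where the correct match for query i is assumed to be index i."""
    return float(np.mean(similarity.argmax(axis=1) == np.arange(similarity.shape[0])))

def temporal_iou(pred_seg, gt_seg) -> float:
    """IoU between two temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred_seg[1], gt_seg[1]) - max(pred_seg[0], gt_seg[0]))
    union = (pred_seg[1] - pred_seg[0]) + (gt_seg[1] - gt_seg[0]) - inter
    return inter / union if union > 0 else 0.0

def depth_metrics(pred: np.ndarray, gt: np.ndarray, valid: np.ndarray):
    """Standard monocular-depth metrics: AbsRel and delta-threshold accuracies."""
    p, g = pred[valid], gt[valid]
    abs_rel = float(np.mean(np.abs(p - g) / g))        # mean |d_pred - d_gt| / d_gt
    ratio = np.maximum(p / g, g / p)
    deltas = {f"delta_{i}": float(np.mean(ratio < 1.25 ** i)) for i in (1, 2, 3)}
    return abs_rel, deltas
```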
5. Technical Innovations and Annotation Strategies
A central innovation is the use of day-augmented auto-labeling for nighttime QA generation. For each paired QA type, the pipeline extracts or synthesizes answers directly from the daytime scene, capitalizing on better visibility and thus greatly improving annotation quality for otherwise ambiguous nighttime content. This day–night-aligned paradigm facilitates both automatic mass annotation and rigorous human verification.
The dataset's structure supports fine-grained benchmarking across QA types, scenarios, and lighting conditions, addressing key weaknesses in prior VQA resources (which either ignore low-light or do not measure generalization across illumination domains) (Zhang et al., 7 Oct 2025).
6. Broader Impact and Future Directions
EgoNight-VQA exposes a critical deficiency in the robustness of current MLLMs for real-world egocentric understanding under environmental variation. The observed degradation highlights the need for new training methods, architectures, and data augmentation strategies focused on illumination invariance, as well as effective transfer learning or domain adaptation between day and night visual domains.
The day–night annotation protocol introduced here suggests a generalizable strategy for leveraging privileged data for hard-to-annotate domains. The inclusion of auxiliary tasks (retrieval and depth) broadens the evaluation axis, supporting future research in robust navigation, embodied AI in low-light, and general-purpose, illumination-invariant vision–language systems.
A plausible implication is that future improvements will require both low-level representation advances (e.g., denoising, brightness normalization, sensor fusion) and higher-level reasoning adaptation (e.g., prompt tuning, domain-adaptive pretraining) to bridge the day–night gap (Zhang et al., 7 Oct 2025).
7. Position Within the Egocentric VQA Landscape
EgoNight-VQA complements and extends prior benchmarks such as EgoTaskQA (Jia et al., 2022), EgoThink (Cheng et al., 2023), EgoTextVQA (Zhou et al., 11 Feb 2025), and EgoCross (Li et al., 14 Aug 2025) by uniquely foregrounding illumination as a key axis of difficulty. While earlier efforts focused on task structure, first-person perspective, scene text, or domain-shift, EgoNight explicitly benchmarks the performance gap induced by lighting. Its data, methodology, and multi-modal evaluation regime render it an indispensable resource for advancing the field toward robust, generalizable egocentric video understanding.
EgoNight-VQA thus serves as a pivotal resource, both empirically and methodologically, for the diagnosis and advancement of multimodal systems expected to operate reliably in real-world, variable lighting—directly addressing crucial deficiencies in current state-of-the-art models and datasets (Zhang et al., 7 Oct 2025).