E2E-3M: Egocentric VQA Dataset
- The E2E-3M Dataset is a comprehensive collection of 1.2M clips and 3.0M VQA pairs derived from first-person videos and validated with strict rule-based checks.
- The translation pipeline converts unstructured egocentric video into structured VQA supervision through temporal segmentation, schema annotation, and quality assurance.
- Utilized for embodied planning and robotic manipulation, E2E-3M improves sample efficiency, long-horizon planning, and overall reliability in vision–language training.
The Egocentric2Embodiment Dataset (E2E-3M) is a large-scale, rule-validated corpus of multi-level visual question answering (VQA) instances derived from human first-person videos. Developed as the foundation for constructing egocentric-aware embodied vision–language systems, E2E-3M enables learning physical intelligence—reasoning about state changes, contact-rich interactions, and long-horizon planning—from richly annotated, temporally grounded, and physically coherent supervision extracted from egocentric perception and action (Lin et al., 18 Dec 2025).
1. Translation Pipeline and Dataset Construction
E2E-3M is generated by the Egocentric2Embodiment translation pipeline, which consists of four sequential stages designed to convert unstructured human egocentric video into actionable VQA supervision for embodied learning:
1. Data Intake & Pre-processing:
Videos are sourced from three major corpora: Ego4D (~1,500 h; household), BuildAI (~700 h; factory), and EgoDex (~300 h; laboratory). Temporal segmentation is performed using fixed-interval, event-driven (e.g., scene change, hand–object contact), or kinematic-aware (hand-motion peaks) strategies to produce short, scenario-aware clips, each indexed with precise start/end times and metadata, including location type and object inventory (a minimal segmentation sketch follows this list).
2. Schema-Driven Annotation:
For each clip, a VQA mode is randomly sampled from the seven-mode annotation schema (temporal, spatial, attribute, mechanics, reasoning, summary, trajectory). A question template is instantiated based on domain metadata, while answers are generated by a dedicated VLMAnnotator, yielding provisional (clip, question, answer) tuples.
3. Quality Assurance & Validation:
A deterministic rule checker enforces three constraint families:
- Evidence grounding: All referenced entities (object, hand, action verb) must be visually present in the selected frames.
- Egocentric consistency: Hand designations match the observed wrist pose; references to non-visible limbs are disallowed.
- Temporal logic: Temporal relations in questions and answers (e.g., “before”) are validated against annotated event timestamps.
Only records passing all three constraint families are admitted; otherwise, the annotation is automatically regenerated with corrective error messaging (see the validation sketch after this list).
4. Structured Output:
Validated instances are emitted as structured records linking each question–answer pair to its source clip, selected frames, and VQA mode, enabling full traceability to the original video context.
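A minimal sketch of the event-driven segmentation strategy in stage 1, assuming a list of event timestamps (scene changes, hand–object contacts) has already been detected upstream; the `Clip` fields, the duration thresholds, and the fixed-interval fallback are illustrative assumptions rather than the authors' exact implementation:

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Clip:
    """A scenario-aware clip with precise start/end times and domain metadata."""
    start: float                          # seconds into the source video
    end: float
    location_type: str                    # e.g. "household", "factory", "laboratory"
    objects: List[str] = field(default_factory=list)  # object inventory


def segment_video(duration: float,
                  event_times: Sequence[float],
                  min_len: float = 2.0,      # assumed minimum clip length (s)
                  max_len: float = 12.0,     # assumed maximum clip length (s)
                  location_type: str = "household",
                  objects: Sequence[str] = ()) -> List[Clip]:
    """Cut a video at event boundaries, splitting overly long event-free spans
    at a fixed interval as a fallback."""
    boundaries = sorted(t for t in event_times if 0.0 < t < duration)
    clips: List[Clip] = []
    start = 0.0
    for t in boundaries + [duration]:
        # Fixed-interval fallback for long spans with no detected events.
        while t - start > max_len:
            clips.append(Clip(start, start + max_len, location_type, list(objects)))
            start += max_len
        # Keep the remaining span only if it is long enough to be meaningful.
        if t - start >= min_len:
            clips.append(Clip(start, t, location_type, list(objects)))
        start = t
    return clips
```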
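A minimal sketch of the deterministic rule checker in stage 3, assuming each provisional record carries the clip's visible-entity set, wrist-pose hand annotations, and event timestamps; the field names and the `annotator.regenerate` interface are hypothetical, but the three checks mirror the constraint families listed above:

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class ProvisionalRecord:
    clip_id: str
    question: str
    answer: str
    mode: str
    referenced_entities: Set[str]    # objects, hands, verbs named in the QA pair
    visible_entities: Set[str]       # entities annotated as visible in the frames
    hand_used: str                   # "left" / "right" as stated in the answer, or ""
    observed_hands: Set[str]         # hands visible from wrist-pose annotation
    event_order: List[str]           # event names in annotated temporal order
    claimed_before: List[Tuple[str, str]] = field(default_factory=list)


def check_record(rec: ProvisionalRecord) -> List[str]:
    """Return a list of constraint violations; an empty list means the record passes."""
    errors = []
    # Evidence grounding: every referenced entity must be visible in the frames.
    missing = rec.referenced_entities - rec.visible_entities
    if missing:
        errors.append(f"ungrounded entities: {sorted(missing)}")
    # Egocentric consistency: the stated hand must match the observed wrist pose.
    if rec.hand_used and rec.hand_used not in rec.observed_hands:
        errors.append(f"hand '{rec.hand_used}' not observed in clip")
    # Temporal logic: every claimed "before" relation must match the annotated order.
    index = {e: i for i, e in enumerate(rec.event_order)}
    for earlier, later in rec.claimed_before:
        if earlier in index and later in index and index[earlier] >= index[later]:
            errors.append(f"temporal violation: '{earlier}' is not before '{later}'")
    return errors


def validate_or_regenerate(rec, annotator, max_retries=3):
    """Admit the record if it passes all three checks; otherwise feed the error
    messages back to the (hypothetical) annotator and regenerate."""
    for _ in range(max_retries):
        errors = check_record(rec)
        if not errors:
            return rec                              # admitted into E2E-3M
        rec = annotator.regenerate(rec, errors)     # corrective error messaging
    return None                                     # dropped after repeated failures
```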
2. Dataset Scale, Modalities, and Diversity
After complete pipeline execution, E2E-3M comprises approximately 1.2 million temporal clips and 3.0 million uniquely validated VQA pairs, each encoding both visual and linguistic modalities:
- Visual: Each clip includes 3–5 RGB frame crops capturing key temporal moments.
- Language: Each instance contains a natural-language question and an answer corresponding to the annotation schema.
The annotation schema encompasses 7 distinct modes, each with its own finite set of question templates. Object and verb lexical diversity is quantified as follows:
| Domain | Object diversity (distinct nouns per 1k tokens) | Verb diversity (distinct verbs per 1k QAs) |
|---|---|---|
| Household | 200–400 | 80–160 |
| Factory | 200–400 | 80–160 |
| Laboratory | 200–400 | 80–160 |
This ensures broad coverage of physical entities and interactional verbs across diverse egocentric scenarios.
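A minimal sketch of how per-domain lexical diversity figures of this kind could be computed, assuming a dictionary mapping each domain to (question, answer) string pairs and using spaCy part-of-speech tags; the tokenizer choice and the per-1k normalization are assumptions, not the authors' reported measurement protocol:

```python
import spacy

# Assumes the small English model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")


def lexical_diversity(qa_pairs_by_domain, window=1000):
    """For each domain, estimate distinct nouns per `window` tokens and distinct
    verbs per `window` QA pairs."""
    stats = {}
    for domain, qa_pairs in qa_pairs_by_domain.items():
        nouns, verbs, n_tokens = set(), set(), 0
        for question, answer in qa_pairs:
            doc = nlp(f"{question} {answer}")
            n_tokens += len(doc)
            nouns.update(t.lemma_.lower() for t in doc if t.pos_ == "NOUN")
            verbs.update(t.lemma_.lower() for t in doc if t.pos_ == "VERB")
        n_pairs = max(len(qa_pairs), 1)
        stats[domain] = {
            "nouns_per_1k_tokens": len(nouns) * window / max(n_tokens, 1),
            "verbs_per_1k_qas": len(verbs) * window / n_pairs,
        }
    return stats
```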
3. Schema-Driven, Multi-Level VQA Supervision
Each VQA mode targets a discrete facet of embodied planning or interaction:
- Temporal: Ordering of actions and events (e.g., “What did the agent do before placing the cup?”)
- Spatial: Egocentric spatial relations (“Where is the onion relative to the towel?”)
- Attribute: Object properties perceivable in context (“What color is the tool in the left hand?”)
- Mechanics: Contact and manipulation dynamics (“Which hand lifts the lid?”)
- Reasoning: Causal or motivational explanations (“Why did the agent push the button?”)
- Summary: High-level task progression (“What is the next step?”)
- Trajectory: Motion path or region traversal (“Through which region does the slider move?”)
Critically, schema-driven rules restrict language to visible, temporally consistent phenomena, enforcing alignments such as $t(e_1) < t(e_2)$ for “before” relations and $\{\text{hand}, \text{object}\} \subseteq \mathcal{V}$ for contact events, where $\mathcal{V}$ is the set of entities annotated as visible in the clip.
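A minimal sketch of a schema-driven template set for the seven modes, built from the example questions above; the slot names (`{action}`, `{object}`, `{hand}`, `{landmark}`) and the instantiation helper are illustrative assumptions rather than the released schema:

```python
import random

# One illustrative template per mode; the actual schema has a finite set per mode.
TEMPLATES = {
    "temporal":   ["What did the agent do before {action} the {object}?"],
    "spatial":    ["Where is the {object} relative to the {landmark}?"],
    "attribute":  ["What color is the {object} in the {hand} hand?"],
    "mechanics":  ["Which hand {action} the {object}?"],
    "reasoning":  ["Why did the agent {action} the {object}?"],
    "summary":    ["What is the next step?"],
    "trajectory": ["Through which region does the {object} move?"],
}


def instantiate_question(mode: str, metadata: dict) -> str:
    """Sample a template for the given mode and fill its slots from clip metadata."""
    template = random.choice(TEMPLATES[mode])
    return template.format_map(metadata)


# Example: a household clip annotated with a cup-placing interaction.
question = instantiate_question("temporal", {"action": "placing", "object": "cup"})
# -> "What did the agent do before placing the cup?"
```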
4. Model Training and Benchmark Evaluation
PhysBrain, the embodied vision–language model, is trained via supervised fine-tuning (SFT) on a balanced mixture of E2E-3M and FineVision data. The SFT objective is the standard cross-entropy loss over answer tokens:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{(v, q, a) \in \mathcal{D}_{\text{E2E}} \cup \mathcal{D}_{\text{FV}}} \sum_{t=1}^{|a|} \log p_\theta\!\left(a_t \mid v, q, a_{<t}\right),$$

where $\mathcal{D}_{\text{E2E}}$ is the E2E-3M corpus, $\mathcal{D}_{\text{FV}}$ is FineVision, and $\theta$ are the model parameters. Ego4D-derived clips are excluded from $\mathcal{D}_{\text{E2E}}$ during evaluation on EgoThink to prevent data leakage. Optimization is performed using AdamW, DeepSpeed ZeRO, and a cosine learning-rate schedule.
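A minimal PyTorch-style sketch of the answer-token cross-entropy objective above, assuming a causal vision–language model whose input sequence concatenates visual placeholders, question tokens, and answer tokens; masking non-answer positions with -100 is a standard SFT recipe, not necessarily the authors' exact training code:

```python
import torch

IGNORE_INDEX = -100  # positions with this label are excluded from the loss


def build_labels(input_ids: torch.Tensor, answer_start: int) -> torch.Tensor:
    """Copy input_ids as labels, masking everything before the answer span so the
    cross-entropy is computed over answer tokens only."""
    labels = input_ids.clone()
    labels[:answer_start] = IGNORE_INDEX
    return labels


def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Standard next-token cross-entropy: predict token t+1 from tokens <= t."""
    shift_logits = logits[:-1, :]   # (T-1, vocab)
    shift_labels = labels[1:]       # (T-1,)
    return torch.nn.functional.cross_entropy(
        shift_logits, shift_labels, ignore_index=IGNORE_INDEX
    )
```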
Benchmarks:
- EgoThink: Six subtasks (Activity, Forecast, Localization, Object, Planning, Reasoning).
  - Planning: 64.5% for PhysBrain versus 32.0% for Qwen2.5-VL-7B.
  - Average across all subtasks: 64.3% versus 57.3%.
- SimplerEnv (WidowX): Four robot manipulation tasks under the PhysGR00T VLA head.
  - Aggregate success rate: 53.9% (task-level rates: 65.6%, 37.5%, 33.3%, 79.2%).
  - Roughly +9% absolute improvement over the next-best VLM-initialized vision–language–action (VLA) baseline.
5. Downstream Applications and Significance
E2E-3M is the first empirically grounded, large-scale egocentric VQA dataset optimized for injecting planning structure and hand–object interaction semantics into vision–language models (VLMs). Rule-based validation ensures that annotated language aligns precisely with visible phenomena and temporal structure, suppressing hallucinations and facilitating physically meaningful affordance learning.
Fine-tuning on E2E-3M confers several empirical benefits:
- Improved sample efficiency for downstream VLA adaptation.
- Enhanced reliability for long-horizon planning under partial observability.
- Increased robotic manipulation success rates when combined with even limited robot-collected data.
Anticipated future directions include curriculum learning for transfer from human to robotic tasks, hybrid supervision frameworks, and real-world deployment of VLM-based controllers in environments spanning household assistance, industrial assembly, and laboratory automation.
6. Context in Embodiment and Vision–Language Research
E2E-3M addresses a fundamental viewpoint mismatch between existing third-person VLM training data and the egocentric nature of robotic perception. By leveraging the scalability and diversity of human first-person video, and enforcing physically grounded, temporally coherent annotation schemas, E2E-3M enables new directions for research in embodied intelligence, bridging the gap between large-scale vision–language pretraining and the practical demands of physical agents in diverse, unstructured settings (Lin et al., 18 Dec 2025).