- The paper introduces OSNOM (Out of Sight, Not Out of Mind), a new 3D tracking task, and LMK, a method that lifts 2D observations into 3D trajectories and maintains them through occlusions and out-of-view periods.
- The methodology employs a lift, match, and keep process that combines depth estimation with camera alignment to track objects across frames.
- Evaluations on the EPIC-KITCHENS dataset demonstrate robust performance, with roughly 60% 3D-localization accuracy two minutes after objects leave the view.
Exploring Spatial Cognition in AI through Egocentric Videos
Introduction to OSNOM
Recent advancements in computer vision have allowed machines to mimic complex human abilities, yet spatial cognition in egocentric settings remains a significant challenge. The paper "Spatial Cognition from Egocentric Video: Out of Sight, Not Out of Mind" introduces a new task, OSNOM (Out of Sight, Not Out of Mind): tracking active objects in 3D from observations captured from an egocentric perspective, even when those objects move out of the camera's sight. It is a pivotal step towards AI systems that build and maintain a continuously updated cognitive map of their environment, mirroring a human-like understanding of space and object permanence.
Methodology: Lift, Match, and Keep (LMK)
The paper presents LMK, a novel method to address the OSNOM task. LMK encompasses three core processes:
- Lift - This initial step takes partial 2D object observations from video frames and projects them into 3D world coordinates, combining depth estimation with aligned camera poses to localize each object in a shared world frame (see the lifting sketch after this list).
- Match - To form consistent object trajectories over time, LMK matches these observations based on both visual appearance and 3D location. This matching is crucial for tracking objects across frames, even when they momentarily disappear from view (see the matching sketch below).
- Keep - Perhaps the most distinctive aspect of LMK, this process maintains the 3D trajectories of objects when they are no longer visible in the video stream, embodying the "out of sight, not out of mind" principle (see the track-keeping sketch below).
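To make the lifting step concrete, here is a minimal sketch of back-projecting a 2D detection into world coordinates, assuming a pinhole camera model with known intrinsics and an aligned camera-to-world pose. The function and variable names are illustrative, not the authors' implementation:

```python
import numpy as np

def lift_to_world(u, v, depth, K, cam_to_world):
    """Back-project pixel (u, v) with estimated metric depth into 3D world coordinates.

    K            -- 3x3 pinhole intrinsics matrix
    cam_to_world -- 4x4 aligned camera pose (camera frame -> shared world frame)
    """
    # Unproject into the camera frame: X_cam = depth * K^{-1} [u, v, 1]^T
    pixel_h = np.array([u, v, 1.0])
    point_cam = depth * (np.linalg.inv(K) @ pixel_h)
    # Move into the world frame shared across all frames of the video
    point_world = cam_to_world @ np.append(point_cam, 1.0)
    return point_world[:3]
```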
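Matching can then be framed as an assignment problem over a score that blends appearance similarity with 3D proximity. The sketch below is one plausible formulation; the weight w_app and the distance scale max_dist are hypothetical hyperparameters, not values from the paper:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_to_tracks(track_feats, track_locs, obs_feats, obs_locs,
                    w_app=0.5, max_dist=1.0):
    """Assign new observations to existing tracks using two cues.

    track_feats, obs_feats -- (T, D) and (O, D) appearance descriptors
    track_locs, obs_locs   -- (T, 3) and (O, 3) 3D world positions
    """
    # Cosine similarity between L2-normalized appearance features
    tf = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    of = obs_feats / np.linalg.norm(obs_feats, axis=1, keepdims=True)
    app_sim = tf @ of.T                                     # (T, O)
    # 3D proximity, mapped into [0, 1] (1 = same place, 0 = >= max_dist apart)
    dist = np.linalg.norm(track_locs[:, None] - obs_locs[None], axis=-1)
    loc_sim = 1.0 - np.clip(dist / max_dist, 0.0, 1.0)      # (T, O)
    # Maximize the blended score via the Hungarian algorithm
    cost = -(w_app * app_sim + (1.0 - w_app) * loc_sim)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))
```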
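Finally, the keep step amounts to never discarding a track just because it went unmatched: the last lifted 3D location persists until the object reappears. A minimal, hypothetical track store might look like this:

```python
class ObjectTrack:
    """A 3D track that outlives visibility: its state persists off-screen."""

    def __init__(self, feature, location, frame_idx):
        self.feature = feature      # latest appearance descriptor
        self.location = location    # last known 3D world position
        self.last_seen = frame_idx  # frame index of the last observation

def advance(tracks, matches, obs_feats, obs_locs, frame_idx):
    """Refresh matched tracks; deliberately keep unmatched ones alive."""
    for ti, oi in matches:
        tracks[ti].feature = obs_feats[oi]
        tracks[ti].location = obs_locs[oi]
        tracks[ti].last_seen = frame_idx
    # No pruning of unmatched tracks: an object that is out of sight keeps
    # its last 3D location and remains queryable until it is seen again.
    return tracks
```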
This approach not only tracks objects with high fidelity over short and long durations but also reframes egocentric video analysis around spatial cognition rather than frame-by-frame visibility.
Evaluation and Findings
The authors tested LMK on the challenging EPIC-KITCHENS dataset, comprising 100 long videos showcasing various everyday activities. The results were compelling:
- For actively moved objects, LMK correctly estimated 3D locations long after they had exited the camera view - roughly 60% accuracy 2 minutes after the last sighting, with consistent performance over longer durations (a simple version of such a localization metric is sketched after this list).
- Ablation studies show the indispensable role of combining visual features with 3D locations for object tracking: the joint criterion outperforms tracking with either cue alone.
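For context on what a figure like 60% means, a localization-accuracy metric typically counts a prediction as correct when it falls within some distance of the ground-truth 3D position. The sketch below uses an illustrative 0.5 m threshold; the paper's exact evaluation protocol may differ:

```python
import numpy as np

def localization_accuracy(pred_locs, gt_locs, threshold_m=0.5):
    """Fraction of predicted 3D locations within threshold_m meters of ground truth."""
    errors = np.linalg.norm(np.asarray(pred_locs) - np.asarray(gt_locs), axis=-1)
    return float(np.mean(errors < threshold_m))
```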
Implications and Future Directions
This research not only contributes a novel task and method in the field of computer vision but also sets the stage for future developments in AI and robotics. Understanding how to effectively track objects in 3D, beyond the line of sight, opens new avenues for developing autonomous systems that can more naturally interact with their surroundings. It holds potential benefits for assistive technologies, augmented reality, and more efficient navigation systems in dynamic environments.
Looking ahead, the groundwork laid by this paper invites further exploration into refining object detection and tracking mechanisms under variable conditions, extending OSNOM methodologies to more complex and unstructured environments, and integrating these capabilities into larger systems designed to emulate human-like understanding and interaction with the physical world.