An Evaluation of Long-Horizon Memory in Embodied AI: Insights from the FindingDory Benchmark
The paper "FindingDory: A Benchmark to Evaluate Memory in Embodied Agents" addresses a foundational aspect of artificial intelligence, memory, and its integration into embodied agents operating in complex simulated environments. The authors introduce FindingDory, a benchmark designed to assess how vision-language models use memory in long-horizon control and decision-making. This essay summarizes and critically analyzes the paper, focusing on its contributions to the field, its key findings, and its implications for future research on embodied intelligent systems.
Memory is essential for humans and animals interacting with dynamic environments; it underpins navigation, reasoning, and decision-making. Translating this capability to artificial agents, particularly agents that must operate over long timescales in complex environments, remains challenging. The paper effectively highlights the limits of existing vision-language models (VLMs) in this regard: current VLMs are typically optimized for short-horizon tasks that involve processing only a small number of images at once, such as Visual Question Answering (VQA), and that do not require long-term memory integration.
Benchmark Design and Contribution
FindingDory is a rigorously designed benchmark that introduces 60 diverse tasks within the Habitat simulator, requiring agents to demonstrate robust memory use over extended temporal and spatial horizons. Because the tasks are procedurally generated, the benchmark can be extended and scaled in difficulty as models improve. Tasks are crafted so that agents must recall and use historical interaction data to complete navigation and manipulation goals. This approach contrasts with conventional QA benchmarks, where long video sequences are often reduced to simplified multiple-choice questions that do not seriously test multi-step reasoning or memory retrieval.
The benchmark evaluates an agent's capability along three primary axes: spatial, temporal, and multi-goal memory. By grouping tasks into interpretable categories, the benchmark supports a granular analysis of performance, making it possible to isolate specific challenges in memory retention and retrieval. The tasks demand engagement with past interactions and require agents to reconcile them with current goals, thereby testing both the storage and the application of long-term memory.
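To make the granular, per-category analysis described above concrete, it can be sketched as a simple aggregation over episode records. The `Episode` schema and field names below are illustrative assumptions for this essay, not the benchmark's actual data format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Episode:
    """Hypothetical record of one benchmark episode (illustrative schema)."""
    task_id: str
    category: str  # e.g. "spatial", "temporal", or "multi-goal"
    success: bool


def per_category_success(episodes):
    """Aggregate success rates per memory category for granular analysis."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for ep in episodes:
        totals[ep.category] += 1
        wins[ep.category] += int(ep.success)
    return {cat: wins[cat] / totals[cat] for cat in totals}
```

Reporting results this way, rather than as a single aggregate score, is what lets a benchmark isolate, say, temporal-ordering failures from spatial-recall failures.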
Results and Insights
The empirical investigations with various VLM architectures, including proprietary models like GPT-4o and open-source variants such as Qwen2.5-VL, reveal critical insights. Across all task categories, state-of-the-art VLMs show limited success in accurately retrieving and using long-term memory from observation streams, particularly in tasks involving complex spatio-temporal reasoning. Despite these models' strength on short-context multimodal tasks, the findings expose significant gaps in their ability to integrate extended contextual understanding, a hurdle that current neural architectures and training paradigms do not fully address.
Moreover, the hierarchical approach, which combines high-level reasoning modules with low-level navigation policies, further illustrates the complexity of integrating memory into spatial and object-centric tasks. The finding that simply scaling context length with longer video sequences does not reliably improve performance suggests that models are needed which can select and prioritize the information relevant to a given task's memory demands.
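The hierarchical decomposition can be illustrated with a minimal sketch: a high-level module selects a goal from the stored interaction history, and a low-level policy steps toward it. Both functions below are toy stand-ins (a keyword match in place of a VLM call, a greedy grid step in place of a learned navigation policy); none of the names or mechanisms come from the paper.

```python
def high_level_select_goal(history, instruction):
    """Toy stand-in for a high-level VLM reasoner: pick the index of the
    remembered frame whose description matches the instruction."""
    for i, (frame_desc, _pos) in enumerate(history):
        if instruction.lower() in frame_desc.lower():
            return i
    return len(history) - 1  # fall back to the most recent memory


def low_level_navigate(current_pos, goal_pos):
    """Toy stand-in for a low-level navigation policy: one greedy unit
    step per axis toward the goal position."""
    step = tuple((g > c) - (g < c) for c, g in zip(current_pos, goal_pos))
    return tuple(c + s for c, s in zip(current_pos, step))
```

The design point the sketch conveys is the division of labor: the high-level module queries memory once to produce a goal, and the low-level policy handles the many control steps needed to reach it.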
Implications and Future Directions
The implications of the FindingDory benchmark are multifaceted. Practically, it lays a substantial foundation for developing embodied AI systems that must navigate, reason, and adapt within environments reflective of real-world complexity. Theoretically, it underscores the importance of memory-efficient architectures capable of long-term engagement with dynamic inputs, a prerequisite for progress in areas such as household robotics, autonomous exploration, and interactive AI agents.
Looking ahead, the authors suggest that future research should explore memory-compression techniques that enhance the spatio-temporal reasoning capabilities of VLMs, potentially through architectural innovations that more deeply interlink high-level cognition with perceptual input processing. The benchmark also serves as a valuable resource for ablation studies and comparative analyses across different memory mechanisms within embodied learning paradigms.
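As a toy illustration of the memory-compression direction, one of the simplest baselines is uniform subsampling of the observation stream down to a fixed context budget. The function and parameter names below are assumptions for illustration, not a method proposed in the paper.

```python
def compress_memory(frames, keep_every=10, max_frames=32):
    """Uniformly subsample a long observation stream so it fits within a
    fixed context budget; a crude baseline against which learned,
    task-aware memory compression would be compared."""
    sampled = frames[::keep_every]
    if len(sampled) > max_frames:
        # Re-stride so the result is spread evenly over the whole stream.
        stride = len(sampled) / max_frames
        sampled = [sampled[int(i * stride)] for i in range(max_frames)]
    return sampled
```

The limitation of such uniform schemes is exactly the paper's point about context scaling: they shrink the input but cannot prioritize the frames that matter for a specific task, which is what learned memory mechanisms would need to do.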
By exposing current limitations and setting a standard for future evaluation, this paper contributes valuable insights toward advancing the intersection of memory systems and embodied AI, a rapidly growing frontier in artificial intelligence research.