- The paper introduces a multi-store memory architecture that leverages sensory, working, and long-term memory to scale video object segmentation to long videos efficiently.
- A novel memory potentiation algorithm consolidates working memory into long-term storage to mitigate memory explosion in extended video sequences.
- Empirical results show XMem surpasses state-of-the-art methods on long videos while maintaining competitive accuracy on short-video benchmarks.
Overview of XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model
The paper presents XMem, an architecture for long-term video object segmentation (VOS) that improves upon previous methods by adopting a memory design inspired by the Atkinson-Shiffrin model of human memory. Traditional methods generally use a single type of feature memory, which ties memory consumption to video length and forces a trade-off between prediction accuracy and memory cost, limiting their scalability to longer videos. XMem addresses these limitations with a multi-store memory approach comprising sensory, working, and long-term memory components.
Key Contributions
- Multi-Store Memory Architecture:
- XMem employs a multi-component memory: a rapidly updated sensory memory, a high-resolution working memory, and a compact, sustained long-term memory. This hierarchical structure enables segmentation over very long videos while keeping memory consumption manageable (see the sketch after this list).
- Memory Potentiation Algorithm:
- A novel memory potentiation algorithm consolidates actively used working-memory elements into the long-term memory. This consolidation bounds memory growth and prevents the memory explosion that otherwise occurs on long video sequences (the consolidate step in the sketch after this list).
- Anisotropic L2 Similarity Function:
- The paper extends the L2 similarity used for memory reading with anisotropic scaling terms that break key-query symmetry, yielding a more expressive similarity measure and better memory readout; a second sketch after this list illustrates the idea.
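The following is a minimal, illustrative sketch of how a three-store memory with usage-based consolidation could be organized. The class and method names, the fixed prototype budget, and the simple top-k-by-usage selection are assumptions made for clarity; they simplify the paper's actual prototype selection and value aggregation.

```python
# Illustrative sketch of XMem-style sensory / working / long-term memory stores.
# Names and the top-k usage heuristic are assumptions, not the authors' exact code.
import torch


class MultiStoreMemory:
    def __init__(self, max_working_frames=5, num_prototypes=128):
        self.sensory = None            # per-frame hidden state, updated every frame (e.g. by a GRU)
        self.working_keys = []         # high-resolution keys of recent memory frames (each C x N)
        self.working_values = []       # corresponding value features (each D x N)
        self.usage = []                # accumulated read weights per working-memory element
        self.lt_keys = None            # compact long-term memory (selected prototype keys)
        self.lt_values = None
        self.max_working_frames = max_working_frames
        self.num_prototypes = num_prototypes

    def add_working_frame(self, key, value):
        """Append one memory frame (key: C x N, value: D x N)."""
        self.working_keys.append(key)
        self.working_values.append(value)
        self.usage.append(torch.zeros(key.shape[-1]))
        if len(self.working_keys) > self.max_working_frames:
            self.consolidate()

    def record_usage(self, frame_idx, read_weights):
        """Accumulate how strongly each element of a frame was attended by queries."""
        self.usage[frame_idx] += read_weights

    def consolidate(self):
        """Memory potentiation: move the most-used working-memory elements into the
        compact long-term store, then free older working frames."""
        keys = torch.cat(self.working_keys, dim=-1)      # C x (T*N)
        values = torch.cat(self.working_values, dim=-1)  # D x (T*N)
        usage = torch.cat(self.usage, dim=-1)            # (T*N,)
        k = min(self.num_prototypes, usage.numel())
        top = usage.topk(k).indices                      # keep frequently read elements
        proto_k, proto_v = keys[:, top], values[:, top]
        if self.lt_keys is None:
            self.lt_keys, self.lt_values = proto_k, proto_v
        else:
            self.lt_keys = torch.cat([self.lt_keys, proto_k], dim=-1)
            self.lt_values = torch.cat([self.lt_values, proto_v], dim=-1)
        # drop all but the most recent working frame to bound memory growth
        self.working_keys = self.working_keys[-1:]
        self.working_values = self.working_values[-1:]
        self.usage = self.usage[-1:]
```

The design point this illustrates is that read-usage statistics, rather than recency or resolution alone, decide what survives into long-term memory, which is what keeps the long-term store compact over thousands of frames.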
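Below is a hedged sketch of an anisotropic L2 similarity that breaks key-query symmetry: a per-key scalar and a per-query, per-channel weighting are applied inside the squared distance, so exchanging keys and queries no longer yields the same score. The exact parameterization (a per-key term `s` and a per-query channel term `e`) reflects my reading of the paper, and the function name and shapes are illustrative.

```python
# Sketch of an anisotropic L2 similarity; parameter names and shapes are assumptions.
import torch


def anisotropic_l2_similarity(keys, queries, s, e):
    """
    keys:    C x N   memory keys
    queries: C x M   query features
    s:       N       per-key scaling, sharpens confident keys
    e:       C x M   per-query channel weighting, selects informative channels
    returns: N x M   similarity matrix (higher = more similar)
    """
    # weighted squared distance: d[i, j] = sum_c e[c, j] * (keys[c, i] - queries[c, j])**2
    k2 = torch.einsum('cm,cn->nm', e, keys ** 2)          # N x M: sum_c e_cj * k_ci^2
    q2 = (e * queries ** 2).sum(dim=0, keepdim=True)      # 1 x M: sum_c e_cj * q_cj^2
    kq = torch.einsum('cn,cm->nm', keys, e * queries)     # N x M: sum_c e_cj * k_ci * q_cj
    dist = k2 - 2 * kq + q2
    return -s.unsqueeze(1) * dist                         # scale each key's row by s_i


# usage: a softmax over memory elements turns similarities into read weights
# affinity = torch.softmax(anisotropic_l2_similarity(keys, queries, s, e), dim=0)
```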
Numerical Results and Implications
Empirically, XMem substantially surpasses contemporary state-of-the-art VOS methods on long-video benchmarks, scaling robustly without sacrificing segmentation accuracy, even over tens of thousands of frames. On standard short-video benchmarks it remains competitive with state-of-the-art methods that do not scale to long videos.
Theoretical and Practical Implications
The research demonstrates that a multi-store memory system inspired by cognitive psychology can handle very long sequences of visual data effectively. The memory potentiation algorithm points to a promising way of bounding memory usage and mitigating accuracy degradation over long temporal contexts, which may inform future video-processing systems on resource-constrained platforms.
Future Directions
This work opens several avenues for further exploration, particularly in VOS and memory-efficient neural design. Future research could investigate memory systems that further reduce computational overhead while improving prediction accuracy, and evaluation on more complex and diverse datasets could clarify how the approach behaves in varied real-world scenarios. Adapting similar memory models to other tasks, such as speech processing or natural language understanding, is another promising direction.
In conclusion, XMem is a substantial step forward for VOS, addressing the scalability and resource-efficiency limitations of previous models. Through its multi-store memory architecture and memory-reading mechanisms, the work achieves strong results and stands as a significant contribution to long-term video analysis.