- The paper introduces a novel attentional model combining deep learning (RBMs) for object identity and particle filtering for spatial control, inspired by the human visual system, to enhance object tracking with gaze data.
- Numerical results demonstrate that the proposed attentional policy improves tracking performance over baselines, particularly through a Gaussian Process-based gaze strategy that effectively explores the action space.
- The work validates integrating deep learning with probabilistic models for dynamic state estimation, offering practical potential in real-time applications like surveillance and robotics, while highlighting avenues for future research in online adaptation and handling complex scenarios.
Insights on "Learning where to Attend with Deep Architectures for Image Tracking"
This paper introduces an attentional model for the joint task of object tracking and recognition using gaze data, drawing inspiration from the human visual system's division of perception into 'what' (identity) and 'where' (control) pathways. The resulting two-pathway architecture models object appearance with Restricted Boltzmann Machines (RBMs) and estimates object location, orientation, scale, and speed with a particle filter.
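To make this division of labour concrete, below is a minimal sketch (not the authors' code) of how a particle filter over object state can be re-weighted by an appearance score that, in the paper, would come from the RBM identity pathway. The dummy `appearance_score`, the (x, y, scale) state parametrization, and the noise settings are illustrative assumptions only.

```python
import numpy as np

def appearance_score(frame, state):
    """Placeholder for the identity pathway: in the paper this role is played
    by an RBM appearance model scoring the foveated glimpse at `state`.
    Here it is just a dummy similarity around a fixed point."""
    x, y, s = state
    return np.exp(-0.5 * ((x - 60.0) ** 2 + (y - 40.0) ** 2) / (50.0 * s))

def particle_filter_step(particles, frame, motion_std=2.0):
    # Predict: diffuse (x, y) with a random-walk motion model; keep scale stable.
    noise = np.random.normal(0.0, motion_std, particles.shape)
    noise[:, 2] *= 0.01
    particles = particles + noise
    particles[:, 2] = np.clip(particles[:, 2], 0.5, 2.0)
    # Update: re-weight each particle using the identity pathway's score.
    weights = np.array([appearance_score(frame, p) for p in particles])
    weights /= weights.sum()
    # Resample to avoid weight degeneracy.
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]

# Usage: particles hold (x, y, scale); the tracked state is their mean.
particles = np.column_stack([np.random.uniform(0, 120, 200),
                             np.random.uniform(0, 80, 200),
                             np.ones(200)])
for _ in range(10):
    particles = particle_filter_step(particles, frame=None)
print(particles.mean(axis=0))
```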
Architecture and Methodology
A notable contribution of this work is the integration of deep architectures with particle filtering for robust object tracking. The identity pathway models object appearance with multi-layered RBMs applied to foveated glimpses, whose resolution decreases towards the periphery, mirroring processing in the human ventral stream. The control pathway uses a Gaussian Process-based model for gaze selection, allowing exploration of a continuous action space rather than a fixed set of discrete fixation points. Together, the two pathways emulate human visual attention by directing computational resources towards regions of high expected reward, thereby reducing tracking uncertainty.
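As an illustration of the control pathway idea, the following sketch fits a Gaussian Process to past (gaze, reward) pairs and picks the next fixation with an upper-confidence-bound acquisition over a continuous 2-D gaze space. The RBF kernel, the UCB acquisition, and all hyperparameters are assumptions made for the sketch rather than the paper's exact choices.

```python
import numpy as np

def rbf_kernel(A, B, length=0.2):
    # Squared-exponential kernel between two sets of 2-D gaze points.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length ** 2)

def gp_posterior(X, y, Xs, noise=1e-3):
    # Standard GP regression posterior mean and std at test points Xs.
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Ks = rbf_kernel(X, Xs)
    Kss = rbf_kernel(Xs, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(Kss) - (v ** 2).sum(0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def next_gaze(X, y, beta=2.0, n_candidates=500):
    # UCB acquisition over random candidate gaze points in [0, 1]^2.
    Xs = np.random.rand(n_candidates, 2)
    mu, sd = gp_posterior(X, y, Xs)
    return Xs[np.argmax(mu + beta * sd)]

# Usage: in the tracker, rewards would reflect the reduction in uncertainty
# achieved by each past fixation; here they are a toy function of location.
X = np.random.rand(5, 2)                      # past gaze locations
y = np.exp(-((X - 0.5) ** 2).sum(1) * 10.0)   # toy rewards
print(next_gaze(X, y))
```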
Numerical Findings and Comparisons
The results show that the proposed attentional policy yields a tangible improvement in tracking performance over baseline methods. The paper benchmarks the approach against deterministic and random gaze policies, and further contrasts Hedge and EXP3 for learning the gaze strategy under different information settings. Hedge, which assumes full knowledge of the rewards of all gaze actions, performs well; the Gaussian Process-based strategy, however, is particularly effective in the partial-information setting and extends the action space beyond a fixed discrete set of fixation points.
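For readers unfamiliar with these two bandit baselines, below is a compact sketch of their updates over K discrete gaze locations: Hedge sees every gaze's reward each step (full information), while EXP3 observes only the chosen gaze's reward (partial information) and corrects for this with importance weighting. The learning rates and the toy reward function are arbitrary, and this is not the paper's implementation.

```python
import numpy as np

def hedge_update(weights, rewards, eta=0.1):
    """Full-information update: exponentiate every arm's observed reward."""
    weights = weights * np.exp(eta * rewards)
    return weights / weights.sum()

def exp3_step(weights, reward_fn, gamma=0.1):
    """Partial-information step: sample one gaze, observe only its reward."""
    K = len(weights)
    probs = (1 - gamma) * weights / weights.sum() + gamma / K
    arm = np.random.choice(K, p=probs)
    estimate = np.zeros(K)
    estimate[arm] = reward_fn(arm) / probs[arm]   # unbiased reward estimate
    weights = weights * np.exp(gamma * estimate / K)
    return weights, arm

# Usage with a toy reward vector over 4 candidate fixation points.
rewards_true = np.array([0.1, 0.8, 0.3, 0.5])
w_hedge, w_exp3 = np.ones(4), np.ones(4)
for _ in range(50):
    w_hedge = hedge_update(w_hedge, rewards_true)
    w_exp3, _ = exp3_step(w_exp3, lambda a: rewards_true[a])
print(w_hedge.argmax(), (w_exp3 / w_exp3.sum()).argmax())
```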
Theoretical and Practical Implications
Theoretically, this work substantiates the utility of combining deep learning with probabilistic models for dynamic state estimation. The concept of directly learning attentional strategies is an interesting avenue for further research, potentially expanding to other modalities like auditory or multimodal attention models. Practically, these results could be employed in real-time object tracking in fields like automated surveillance or robotic vision, where computational efficiency must be balanced with adaptability and accuracy.
Future Research Directions
The potential for extending this model lies in its adaptability and modularity. Future research could develop online training algorithms that adapt RBM weights continuously, and investigate how to handle long sequences and significant changes in object appearance. Incorporating attention to subsets of features or objects within the RBM framework could yield higher-level attentional strategies, and understanding how these models might recover from tracking failures using classifier outputs would ease practical deployment in dynamic environments.
This paper contributes to the ongoing discourse on attention mechanisms in deep learning, underscoring the complementarity of biologically inspired models and machine learning algorithms for sophisticated perceptual tasks.