
GRAB: Guided Representation with Attentive Bank

Updated 10 January 2026
  • The paper introduces GRAB, which uses guided self-attention to identify high-saliency tokens for precise local alignment without extra supervision.
  • It leverages an Attentive Bank to capture diverse, informative regions, enabling effective multimodal correspondence in text-based person search.
  • Integration with MARS and ATS dynamically refines token selection, improving discrimination and ensuring robust performance across TBPS benchmarks.

Guided Representation with Attentive Bank (GRAB) is a mechanism introduced within the ITSELF framework for vision-language retrieval tasks, aimed at addressing the challenges of fine-grained local alignment between image and text modalities in text-based person search (TBPS). GRAB leverages self-attention derived from the model's encoder to select a set of high-saliency tokens, forming an "Attentive Bank" to which local alignment objectives are applied. This selection requires no additional supervision, enabling the system to learn implicit, fine-grained correspondences between multimodal inputs through the internal structure of attention.

1. Motivation and Theoretical Foundations

The ITSELF framework, motivated by empirical findings that encoder attention surfaces spatially precise evidence early in training, seeks to avoid reliance on explicit local alignment that can lead to shortcut learning and spurious correlations. Prior approaches to TBPS often incorporate external prior knowledge or local alignment methods but inadvertently distort intra-modality structure. GRAB addresses these limitations by proposing a method where attention, developed endogenously during model training, is utilized to identify critical input regions for alignment without introducing extra supervision or handcrafted knowledge. This minimizes intra-modality distortion and regularizes the alignment process (Nguyen et al., 3 Jan 2026).

2. Construction and Operation of the Attentive Bank

At the core of GRAB is the extraction of an Attentive Bank: a collection of high-saliency tokens as indicated by the model’s own attention scores. These tokens are presumed to encode salient visual or textual regions relevant for distinguishing entities in TBPS scenarios. The bank is constructed by converting raw attention scores into a set of selected tokens, which serve as anchors for the subsequent application of local alignment objectives. This data-driven approach to token selection is fully self-supervised within the model and exploits the spatial and contextual focus organically developed in self-attention layers.
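The selection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the interface (attention from a global query such as `[CLS]` to each token, averaged over heads) and the bank size `k` are assumptions, since the abstract does not specify the exact attention source.

```python
import numpy as np

def build_attentive_bank(tokens, attn, k=16):
    """Select the k highest-saliency tokens using the model's own attention.

    tokens: (N, D) token embeddings from one modality.
    attn:   (H, N) attention weights from a global query (e.g. [CLS]) to
            each token, one row per head. Hypothetical interface; the
            paper's exact attention source is not specified.
    k:      bank size (assumed hyperparameter).
    """
    saliency = attn.mean(axis=0)              # average over heads -> (N,)
    order = np.argsort(saliency)[::-1][:k]    # indices of top-k saliency
    return tokens[order], order               # the Attentive Bank + indices

# Toy usage: 50 tokens of dim 64, 8 attention heads
tokens = np.random.randn(50, 64)
attn = np.random.rand(8, 50)
bank, idx = build_attentive_bank(tokens, attn, k=16)   # bank.shape == (16, 64)
```

The key design point is that saliency comes from scores the encoder already computes, so the bank is obtained for free, with no extra annotation or auxiliary detector.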

3. Local Objectives Without Additional Supervision

GRAB applies local alignment objectives directly on the Attentive Bank. These objectives enforce that each token in the Attentive Bank aligns with corresponding elements in the paired input modality (for example, aligning a high-attention visual patch with salient words in the textual description). The process remains implicit—there is no need for ground-truth alignment labels or external part-based annotations. This stands in contrast to previous methods that inject prior knowledge or supervision for region-word or part-level matching, which can inadvertently bias or restrict the model’s intra-modality representations (Nguyen et al., 3 Jan 2026).
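One way such a label-free local objective could look is sketched below: each bank token is softly matched to its most similar token in the paired modality, and the loss rewards confident matches. This is an illustrative stand-in only; ITSELF's actual objective is not detailed in the available abstract, and the temperature and matching rule here are assumptions.

```python
import numpy as np

def local_alignment_loss(visual_bank, text_tokens, temp=0.07):
    """Illustrative implicit local alignment over the Attentive Bank.

    visual_bank: (K, D) high-saliency visual tokens.
    text_tokens: (M, D) tokens from the paired caption.
    No ground-truth alignment labels: each bank token's alignment is a
    softmax over text tokens, and the loss encourages it to be confident.
    """
    v = visual_bank / np.linalg.norm(visual_bank, axis=1, keepdims=True)
    t = text_tokens / np.linalg.norm(text_tokens, axis=1, keepdims=True)
    sim = v @ t.T / temp                          # (K, M) scaled cosine sims
    # softmax over text tokens: each bank token's implicit soft alignment
    p = np.exp(sim - sim.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    # penalize diffuse alignments: maximize the best soft-match probability
    return float(-np.log(p.max(axis=1) + 1e-8).mean())
```

Because the target of each match emerges from similarity alone, the objective never imposes external region-word correspondences, which is what lets it avoid distorting intra-modality structure.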

4. Integration with Multi-Layer Attention for Robust Selection (MARS) and Adaptive Token Scheduler (ATS)

To improve the quality and diversity of token selection, GRAB is combined with two auxiliary modules:

  • Multi-Layer Attention for Robust Selection (MARS): MARS aggregates attention scores across multiple layers and implements diversity-aware top-k selection. This reduces redundancy and ensures selected tokens represent diverse, informative regions within the input. This suggests a mechanism by which tokens are chosen to cover a wide range of salient features, though specific algorithmic details are not provided in the available abstract.
  • Adaptive Token Scheduler (ATS): ATS schedules the number of selected tokens over training epochs. During early stages, it preserves more context with a larger selection budget, then progressively narrows the focus to more discriminative details as training advances. This scheduling enables the model to balance holistic representation and fine-grained discrimination. A plausible implication is that ATS dynamically tunes the granularity of local alignment objectives to match the evolving focus of the network over the course of training.
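The two modules above can be sketched together as follows. Both bodies are assumptions: the abstract states only that MARS does multi-layer aggregation with diversity-aware top-k (here approximated by greedy MMR-style selection) and that ATS shrinks the token budget over training (here approximated by a linear schedule).

```python
import numpy as np

def mars_select(tokens, layer_attns, k, div_weight=0.5):
    """Diversity-aware top-k over multi-layer attention (sketch of MARS).

    tokens:      (N, D) token embeddings.
    layer_attns: list of (N,) per-layer saliency scores, aggregated by
                 averaging (the true aggregation rule is an assumption).
    Greedy MMR-style rule: score = saliency - div_weight * redundancy,
    where redundancy is similarity to already-selected tokens.
    """
    saliency = np.stack(layer_attns).mean(axis=0)          # (N,)
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    chosen = []
    for _ in range(k):
        if chosen:
            redundancy = (normed @ normed[chosen].T).max(axis=1)
        else:
            redundancy = np.zeros(len(tokens))
        score = saliency - div_weight * redundancy
        score[chosen] = -np.inf                            # no repeats
        chosen.append(int(score.argmax()))
    return chosen

def ats_budget(epoch, total_epochs, k_max=32, k_min=8):
    """Sketch of ATS: linearly shrink the selection budget from k_max
    (broad context early) to k_min (sharp focus late). The true schedule
    shape and endpoints are assumptions."""
    frac = epoch / max(total_epochs - 1, 1)
    return int(round(k_max + frac * (k_min - k_max)))
```

A typical training loop would call `ats_budget(epoch, total_epochs)` to get `k`, then pass it to `mars_select` so that early epochs keep a wide, diverse bank and later epochs concentrate on the most discriminative tokens.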

5. Empirical Validation and Generalization in TBPS

GRAB, as implemented in ITSELF, is validated through extensive experiments on three widely-used TBPS benchmarks. The results demonstrate state-of-the-art performance and strong cross-dataset generalization, without reliance on additional prior supervision or handcrafted knowledge. The self-supervised, attention-guided mechanism underlying GRAB is shown to yield effective and robust alignment of image and text modalities, substantiating its utility for fine-grained retrieval tasks (Nguyen et al., 3 Jan 2026).

6. Relation to Prior and Contemporary Research

GRAB is proposed in the context of ongoing efforts to enhance the capability of vision-language models, especially for retrieval tasks requiring local, fine-grained alignment. Previous solutions have commonly leveraged explicit correspondence modeling or injected prior part/object knowledge, often resulting in model misalignment or intra-modality distortion. By instead utilizing the internal distributions discovered by self-attention, GRAB departs from reliance on external supervision and seeks to improve retrieval quality through implicit structure exploitation. This places GRAB among a new class of methods focused on endogenously guided local alignment in multimodal models.

7. Implications and Prospective Impact

By demonstrating the effectiveness of attention-driven token banks for local alignment, GRAB potentially signals a shift toward less supervised, more internally-coherent methods in vision-language retrieval. Its demonstrated ability to generalize across datasets further suggests robustness to distributional shifts, a desirable property for deployment in real-world person retrieval systems. The mechanism is released as part of the ITSELF project, facilitating its adoption and further study within the research community (Nguyen et al., 3 Jan 2026).
