Anchor Captioning Module (AnCM) Overview
- Anchor Captioning Module (AnCM) is a mechanism that uses discrete anchor tokens and regions to guide image captioning with enhanced grounding and interpretability.
- It constructs anchor-centered graphs and employs dual Transformer architectures, thereby enabling multi-view caption diversity and fine-grained image-text alignment.
- Empirical studies show that AnCM frameworks significantly improve caption accuracy and diversity, offering robust solutions for both text-based and zero-shot image captioning tasks.
The Anchor Captioning Module (AnCM) designates a class of mechanisms that leverage discrete “anchor” elements—tokens, regions, or labels—as the primary foci for selective, structured image caption generation. AnCMs explicitly structure the information grounding and attention process of captioning models, either by organizing content around anchor-centric graph structures for multi-caption diversity or by injecting anchor tokens into language generation pipelines to enhance fine-grained image-text alignment. These mechanisms enable more accurate, interpretable, and diverse captions in complex visual scenes, particularly in tasks such as text-based image captioning and zero-shot image captioning. Notable instantiations include the Anchor-Captioner AnCM (Xu et al., 2021) and Anchor-Augment AnCM (Wang et al., 2022), each demonstrating distinctive architectural features and training regimes.
1. Anchor Selection
Anchor selection is the initial and foundational step in AnCM frameworks. For OCR-based image captioning, as in (Xu et al., 2021), given OCR-token embeddings $\bT = [\bt_1,\dots,\bt_M]^\top\in\mathbb{R}^{M\times d}$, each token receives an “importance score” via a learnable projection network: $\bs_{\rm anchor} = \Softmax(\phi(\bT)) \in \mathbb{R}^{M}$, where $\phi$ is typically a linear layer followed by LayerNorm and a non-linearity. Training selects the highest-scoring token as the anchor $\bT_{\rm anchor}$; inference uses the top-$K$ anchors. For anchor-augmented vision-language alignment in zero-shot captioning (Wang et al., 2022), anchor extraction at training uses a part-of-speech parser to select nouns from reference captions. At inference, object detector outputs (labels with confidence above a threshold) yield the anchor set.
These anchor mechanisms provide explicit and adjustable control points for downstream captioning, facilitating both content focus and diversity.
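A minimal PyTorch sketch of the OCR-token scoring and selection step is given below. The class name `AnchorScorer`, the exact composition of $\phi$ (linear, LayerNorm, ReLU, linear head), and the feature dimension are illustrative assumptions, not the reference implementation.

```python
# Sketch of AnCM anchor scoring over OCR-token embeddings (Xu et al., 2021).
# The layer composition of phi and all hyperparameters are assumptions.
import torch
import torch.nn as nn


class AnchorScorer(nn.Module):
    def __init__(self, d_model: int = 768):
        super().__init__()
        # phi: projects each OCR-token embedding to a scalar importance logit
        self.phi = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.LayerNorm(d_model),
            nn.ReLU(),
            nn.Linear(d_model, 1),
        )

    def forward(self, ocr_tokens: torch.Tensor, top_k: int = 1):
        """ocr_tokens: (M, d) OCR-token embeddings T."""
        logits = self.phi(ocr_tokens).squeeze(-1)       # (M,)
        s_anchor = torch.softmax(logits, dim=-1)        # importance scores
        # training: keep the highest-scoring token; inference: top-K anchors
        _, top_idx = s_anchor.topk(top_k)
        anchors = ocr_tokens[top_idx]                   # (top_k, d)
        return s_anchor, anchors


# usage: score M=30 OCR tokens and keep the 5 highest-scoring anchors
scorer = AnchorScorer(d_model=768)
T = torch.randn(30, 768)
scores, anchors = scorer(T, top_k=5)
```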
2. Anchor-Centered Graph (ACG) Construction
Anchors in AnCMs support groupings of relevant tokens or regions via anchor-centered graphs (ACGs) (Xu et al., 2021). For every selected anchor $\bT_{\rm anchor}$, a unidirectional RNN (e.g., GRU) projects $\bT$ into “view-specific” contexts, initialized with $\bT_{\rm anchor}$: $\bT_{\rm graph} = \RNN\left(\bT;~\text{init\_hidden}=\bT_{\rm anchor}\right) \in \mathbb{R}^{M\times d}$. Each token then receives a membership score
$\bs_{\rm graph} = \sigma\left(f_3(\bT_{\rm graph})\right) \in \mathbb{R}^M$
where $f_3$ is a linear layer and $\sigma$ the sigmoid function. Tokens with $s_{{\rm graph},i}>0.5$ are included as nodes, yielding an ACG
$\mathcal{G} = \{\bT_{\rm anchor}\} \cup \{\bT_{\rm graph}^i~|~s_{{\rm graph},i}>0.5\}$
These node sets function as discrete, focus-adaptive token collections for subsequent self-attention modules. In contrast, (Wang et al., 2022) relies on explicit anchor token concatenation within the Transformer’s input stream, providing direct positions within the model’s attention mechanism.
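The construction above can be sketched as follows. The module name `ACGBuilder`, the single-layer GRU, and the realization of $f_3$ as one linear layer followed by a sigmoid are assumptions made for illustration.

```python
# Illustrative sketch of anchor-centered graph (ACG) construction (Xu et al., 2021).
# The GRU depth, shapes, and the form of f_3 are assumptions.
import torch
import torch.nn as nn


class ACGBuilder(nn.Module):
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)  # unidirectional
        self.f3 = nn.Linear(d_model, 1)                         # membership head

    def forward(self, ocr_tokens: torch.Tensor, anchor: torch.Tensor):
        """ocr_tokens: (M, d) token embeddings T; anchor: (d,) selected anchor."""
        # initialize the hidden state with the anchor to get view-specific contexts
        h0 = anchor.view(1, 1, -1)                              # (num_layers, B=1, d)
        T_graph, _ = self.rnn(ocr_tokens.unsqueeze(0), h0)      # (1, M, d)
        T_graph = T_graph.squeeze(0)                            # (M, d)
        s_graph = torch.sigmoid(self.f3(T_graph)).squeeze(-1)   # (M,) membership scores
        # tokens with score > 0.5 join the anchor-centered graph
        node_mask = s_graph > 0.5
        graph_nodes = torch.cat([anchor.unsqueeze(0), T_graph[node_mask]], dim=0)
        return graph_nodes, s_graph
```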
3. Caption Generation Architectures
AnCM-based captioning architectures are typically composed of separate visual-specific and text-specific modules. The Anchor-Captioner’s AnCM (Xu et al., 2021) includes:
- Visual-specific captioner (AnCM-v): A Transformer decoder that receives global visual features and previous output tokens, producing intermediate hidden states.
- Text-specific captioner (AnCM-t): A distinct Transformer decoder ingesting the ACG node set, the forward hidden states from AnCM-v, and preceding tokens. The output scoring fuses a standard vocabulary head with a dynamic-pointer mechanism for copying OCR tokens.
The generation process runs in parallel for each anchor-derived ACG, producing distinct captions at inference. In the Anchor-Augment approach (Wang et al., 2022), a CLIP-based feature vector and anchor tokens are concatenated as the prefix to a GPT-2–style Transformer LLM. Standard positional encoding and multi-head self-attention enable generated tokens to reference both the global feature and all anchors.
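The Anchor-Augment prefix construction can be sketched as below, assuming a learned linear projection (`clip_proj`) from the 512-d ViT-B/32 CLIP feature to the GPT-2 embedding space and a simple whitespace join of the detector labels; neither detail is taken verbatim from the paper.

```python
# Hedged sketch of the Anchor-Augment prefix (Wang et al., 2022): a CLIP image
# feature plus detected anchor labels are prepended to a GPT-2-style decoder.
# The projection layer and label formatting are assumptions.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
d_gpt = gpt2.config.n_embd                       # 768 for GPT-2-small

# assumed: a learned projection from CLIP ViT-B/32 feature space (512-d)
clip_proj = nn.Linear(512, d_gpt)

def build_prefix(clip_feature: torch.Tensor, anchor_labels: list) -> torch.Tensor:
    """clip_feature: (512,) global image feature; anchor_labels: detector labels."""
    image_embed = clip_proj(clip_feature).view(1, 1, d_gpt)             # (1, 1, d)
    anchor_ids = tokenizer(" ".join(anchor_labels), return_tensors="pt").input_ids
    anchor_embeds = gpt2.transformer.wte(anchor_ids)                    # (1, A, d)
    # prefix = [image feature] + [anchor tokens]; caption tokens follow at decode time
    return torch.cat([image_embed, anchor_embeds], dim=1)

prefix = build_prefix(torch.randn(512), ["dog", "frisbee", "grass"])
logits = gpt2(inputs_embeds=prefix).logits       # attention sees feature + all anchors
```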
Architectural Summary Table
| Paper/Method | Input Focus | Decoder | Anchor Usage |
|---|---|---|---|
| (Xu et al., 2021) Anchor-Captioner | OCR tokens, vision | Dual Transformer | Anchor as ACG, multi-head attention |
| (Wang et al., 2022) Anchor-Augment | CLIP features, anchor labels | GPT-2 Transformer | Anchor tokens prepended to decoder |
Both approaches demonstrate that anchor conditioning substantially enhances either diversity (multi-view) or grounding (zero-shot) relative to baseline methods.
4. Training Objectives and Loss Functions
The Anchor-Captioning Module (Xu et al., 2021) incorporates multiple objectives, combined as a weighted sum
$\mathcal{L} = \mathcal{L}_{\rm anchor} + \mathcal{L}_{\rm graph} + \mathcal{L}_{\rm v} + \mathcal{L}_{\rm t}$
with term weights set in practice. The components involve:
- $\mathcal{L}_{\rm anchor}$: binary cross-entropy for selecting correct anchors.
- $\mathcal{L}_{\rm graph}$: binary cross-entropy for node selection in ACGs.
- $\mathcal{L}_{\rm v}$ and $\mathcal{L}_{\rm t}$: cross-entropy losses for the visual and text caption outputs.
No explicit diversity loss is used; content diversity emerges naturally through anchor sampling. In contrast, Anchor-Augment (Wang et al., 2022) requires no auxiliary objectives beyond standard maximum likelihood estimation (MLE) over the next-token prediction, owing to the frozen CLIP encoders and the explicit anchor-injection strategy.
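A compact sketch of how these terms might be combined is shown below; equal weighting and the function name `ancm_loss` are assumptions, since the paper's weighting coefficients are not reproduced here.

```python
# Minimal sketch of the multi-task objective described for the Anchor-Captioner
# (Xu et al., 2021). Equal weighting of the four terms is an assumption.
import torch
import torch.nn.functional as F

def ancm_loss(anchor_scores, anchor_targets,         # (M,), (M,) in {0, 1}
              graph_scores, graph_targets,           # (M,), (M,) in {0, 1}
              vis_logits, txt_logits, caption_ids):  # (L, V), (L, V), (L,)
    l_anchor = F.binary_cross_entropy(anchor_scores, anchor_targets)  # anchor selection
    l_graph = F.binary_cross_entropy(graph_scores, graph_targets)     # ACG node selection
    l_vis = F.cross_entropy(vis_logits, caption_ids)                  # visual-specific captioner
    l_txt = F.cross_entropy(txt_logits, caption_ids)                  # text-specific captioner
    return l_anchor + l_graph + l_vis + l_txt
```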
5. Implementation and Efficiency
In (Xu et al., 2021), canonical feature dimensions, Transformer depths for the fusion, visual, and text modules, and batch sizes are prescribed. Visual and OCR token counts are fixed, and training uses the Adamax optimizer. At inference, the selected anchors yield parallel captions.
(Wang et al., 2022) leverages a ViT-B/32 CLIP encoder and GPT-2-small (12 layers). Anchor extraction uses Faster R-CNN with a confidence threshold, and anchor dropout regularizes dependence on anchors during training. End-to-end inference time (including object detection and caption generation) is approximately $1.7$ seconds per image, substantially faster than prior methods such as ZeroCap and MAGIC.
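Anchor dropout itself is simple to sketch; the helper `drop_anchors` and the example probability below are illustrative choices, not the paper's setting.

```python
# Hedged sketch of anchor dropout as described for Anchor-Augment (Wang et al., 2022):
# during training, each anchor is independently dropped with probability p so the
# decoder does not over-rely on any single anchor. The value of p is not reproduced here.
import random

def drop_anchors(anchor_labels, p, training=True):
    if not training:
        return anchor_labels
    # keep each anchor with probability 1 - p
    return [a for a in anchor_labels if random.random() > p]

# usage: randomly thin the detector-provided anchors before building the prefix
print(drop_anchors(["dog", "frisbee", "grass"], p=0.3))
```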
6. Empirical Impact and Ablation Findings
Empirical evidence demonstrates substantial gains due to AnCM. The Anchor-Captioner (Xu et al., 2021) achieves state-of-the-art performance and unique multi-view caption diversity for text-based image captioning. Anchor-Augment (Wang et al., 2022) yields marked improvements over prior zero-shot captioning methods, with increases in BLEU-4, METEOR, and CIDEr metrics across multiple benchmarks. Anchor dropout and detector threshold ablations indicate that anchor-based injection is robust to variations—so long as some anchors remain, zero-shot performance is preserved, but omitting anchors leads to pronounced degradation. Cross-domain transfer performance is also improved via anchors, as measured by BLEU-4 increases.
A plausible implication is that anchor-based conditioning acts as a crucial antidote to “contextual language prior,” explicitly grounding generation on visually or semantically derived cues.
7. Connections and Significance
AnCMs instantiate a general architectural pattern for modular attention focusing and interpretable, controllable captioning. They provide principled solutions to (i) the underdetermination of global-caption models in complex scenes, and (ii) the tendency of CLIP and similar models to rely on contextually-biased language priors in the absence of explicit visual grounding.
By adopting anchors as mid-level bridging elements between perception and description, these modules offer a versatile, extensible building block for multimodal understanding and zero-shot language generation tasks. Further, both the Anchor-Captioner and Anchor-Augment instantiations demonstrate compatibility with widely adopted vision-language backbones (e.g., CLIP, GPT-2, Transformers), making them readily amenable to integration within next-generation captioning pipelines (Xu et al., 2021, Wang et al., 2022).