Self-Generated Cross-Modal Alignment

Updated 5 July 2025
  • Self-Generated Cross-Modal Alignment is an approach that uses internally generated signals, such as sequential human gaze, to align visual and linguistic modalities.
  • It sequentially guides each step of image captioning by conditioning word generation on preceding gaze fixations, thereby enhancing descriptive clarity and natural speech patterns.
  • By incorporating a recurrent model (LSTM) for gaze encoding, the strategy outperforms static attention methods and offers valuable insights for cognitive and computational research.

A self-generated cross-modal alignment strategy is an approach in which a learning system autonomously constructs or exploits internal signals, representations, or data flows that promote consistent, temporally-ordered, or conceptually-related mappings between multiple modalities without reliance on extensive manual annotation or externally imposed alignments. In the context of the paper "Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze" (2011.04592), this strategy leverages the sequential structure of human gaze fixations to guide the stepwise alignment of visual and linguistic modalities during image description generation. The framework translates temporal dynamics in visual attention into actionable attention guidance for natural language generation, aiming to emulate and analyze human-like cross-modal alignment in spoken language production.

1. Sequential Cross-Modal Alignment Using Human Gaze

The central contribution is a model where the process of image captioning is formulated not as a simple mapping from image to text, but as a temporally sequenced alignment between visual inputs and linguistic outputs, guided by experimentally recorded human gaze patterns during spoken description. Rather than encoding the whole image with static pooling strategies, the system utilizes the actual scanpath—the ordered sequence of fixations—recorded as speakers view the image and concurrently generate an utterance.

For each word step in the generated caption, the system leverages the corresponding preceding gaze fixation or saliency mask, thus modulating the pool of visual information available for that specific word generation. This mechanism effectively injects evidence from human visual attention directly into the temporal unfolding of language production, mimicking how speakers fixate on scene elements before mentioning them. The result is a form of self-generated alignment, where the guidance signal for language is "generated" by the human's own visual-attentional encoding.
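The core mechanism can be sketched in a few lines of NumPy: a fixation mask derived from the gaze recorded just before the current word is normalized and used to pool the spatial CNN features into a per-step visual context vector. The function name, array shapes, and mask format below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def gaze_conditioned_features(feature_map: np.ndarray, fixation_mask: np.ndarray) -> np.ndarray:
    """Pool a CNN feature map under a human fixation mask (illustrative sketch).

    feature_map:   (H, W, D) spatial grid of visual features.
    fixation_mask: (H, W) non-negative saliency derived from the fixation(s)
                   preceding the current word, assumed already aligned in time.
    Returns a (D,) visual context vector for the current word step.
    """
    # Normalize the mask so it acts as a spatial attention distribution.
    weights = fixation_mask / (fixation_mask.sum() + 1e-8)
    # Weighted average over spatial positions: regions the speaker looked at
    # just before speaking dominate the visual evidence for this word.
    return np.einsum("hw,hwd->d", weights, feature_map)

# Example: a 7x7 grid of 512-d features and a mask peaked on one region.
feats = np.random.rand(7, 7, 512)
mask = np.zeros((7, 7)); mask[2, 3] = 1.0
context_t = gaze_conditioned_features(feats, mask)  # shape (512,)
```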

2. Model Variants and Their Use of Gaze Data

Several architectures are defined to explore different strategies for cross-modal alignment:

  • No-gaze baseline: The captioning model uses mean-pooled image features, ignoring fixation data entirely.
  • Gaze-agg: Aggregated static saliency maps (averaged over all gaze data for an image) are used, which enhances salient regions but is not temporally ordered.
  • Gaze-seq: Sequential, participant-specific gaze masks are applied at each word time step, meaning that the word generated at time step t is conditioned on the scene regions fixated immediately prior.
  • Gaze-2seq: Beyond using temporally varying masks, this model includes a recurrent LSTM component specifically for gaze signals, with its hidden states (h_t^g) influencing generation. This captures dependencies and "memory" across the sequence of fixations, rather than treating each mask independently.

Sequential models thus amplify the alignment between generated language and the actual dynamics of attention recorded in behavioral data, moving beyond static or merely saliency-driven mappings.
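The differences between the variants can be made concrete with a schematic helper that builds the visual input for word step t. The names, shapes, and the assumption that gaze masks are precomputed per word step are illustrative, not the authors' code.

```python
import numpy as np

def visual_input(variant: str, feature_map: np.ndarray, gaze_masks: list, t: int) -> np.ndarray:
    """Build the visual input for word step t under each model variant (sketch).

    feature_map: (H, W, D) CNN features for the image.
    gaze_masks:  list of (H, W) masks, one per word step (participant-specific).
    """
    if variant == "no-gaze":
        # Mean pooling: fixation data is ignored entirely.
        return feature_map.mean(axis=(0, 1))
    if variant == "gaze-agg":
        # One static saliency map, averaged over all gaze data for the image.
        agg = np.mean(gaze_masks, axis=0)
        w = agg / (agg.sum() + 1e-8)
        return np.einsum("hw,hwd->d", w, feature_map)
    if variant in ("gaze-seq", "gaze-2seq"):
        # Time-varying mask: step t is conditioned on the immediately preceding fixation.
        w = gaze_masks[t] / (gaze_masks[t].sum() + 1e-8)
        return np.einsum("hw,hwd->d", w, feature_map)
    raise ValueError(f"unknown variant: {variant}")
```

In gaze-2seq, the per-step vectors produced this way are additionally passed through a recurrent gaze encoder, as discussed in Section 5.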

3. Gaze-Driven Attention as Direct Alignment Signal

The approach is underpinned by the notion of gaze-driven attention: learned or manually crafted saliency is replaced by observed human gaze as the guiding signal for where attention should be focused in the image. This differs from bottom-up, feature-based saliency and from purely learned softmax attention distributions; the system is explicitly told where "real" attention was, linking vision and language steps. Such an attention mechanism has major implications:

  • It enhances the model's capacity to generate descriptions that mention small, subtle, or scene-specific objects (e.g., "donut").
  • It admits natural disfluencies and referential expressions typical in authentic speech, reflecting the incremental, attention-driven nature of real language production.

This mechanism not only grounds the linguistic output in perceptual evidence but also provides a computational avenue for analyzing and possibly quantifying attentional guidance during language use.
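The contrast between learned and gaze-driven attention can be illustrated with a small sketch in which an observed gaze mask, when available, directly supplies the attention distribution that would otherwise be inferred from learned scores. The function name and shapes are illustrative assumptions, and a real model would learn the scoring projection rather than use a raw dot product.

```python
import numpy as np

def attention_weights(query: np.ndarray, keys: np.ndarray, gaze_mask: np.ndarray = None) -> np.ndarray:
    """Return a spatial attention distribution over image regions (sketch).

    keys:  (R, D) region features.
    query: (D,) decoder state.
    If a gaze mask (R,) is given, observed human attention replaces the
    learned attention; otherwise attention is inferred from scores.
    """
    if gaze_mask is not None:
        # Gaze-driven attention: the model is told where "real" attention was.
        return gaze_mask / (gaze_mask.sum() + 1e-8)
    # Standard learned attention: dot-product scores passed through a softmax.
    scores = keys @ query
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()
```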

4. Cognitive and Computational Insights

Sequential gaze-to-word alignment provides empirical support for theories in psycholinguistics and cognitive neuroscience regarding the incremental and referential nature of spoken language; in particular, speakers tend to look at objects before mentioning them. The explicit, stepwise alignment between gaze and word enables analysis of both the specificity and diversity of descriptions:

  • Descriptions that follow actual gaze trajectories are often more natural and diverse.
  • The strength of cross-modal correspondence (assessed by correlating scanpath similarity and sentence similarity) predicts the uniqueness and descriptive richness of generated captions.

A specialized metric, Semantic and Sequential Distance (SSD), was introduced to combine semantic similarity (via word embeddings) and sequential word order to quantify the quality of alignment between generated and human descriptions. Lower SSD values indicate captions closer to human examples in both content and temporal structure.
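Since the exact formulation is not reproduced here, the following is only an illustrative sketch of how semantic distance (from word embeddings) and sequential position might be combined into an SSD-style score; the weighting, matching scheme, and normalization are assumptions and may differ from the paper's definition.

```python
import numpy as np

def ssd_sketch(gen_vecs: list, ref_vecs: list, alpha: float = 0.5) -> float:
    """Illustrative SSD-style distance between two descriptions (lower is better).

    gen_vecs, ref_vecs: lists of unit-normalized word embeddings for the
    generated and reference descriptions. alpha trades off semantic vs.
    sequential distance; this mixing is an assumption, not the paper's formula.
    """
    n, m = len(gen_vecs), len(ref_vecs)
    total = 0.0
    for i, g in enumerate(gen_vecs):
        # Semantic distance to every reference word (1 - cosine similarity).
        sem = np.array([1.0 - float(np.dot(g, r)) for r in ref_vecs])
        # Sequential distance: difference in relative position within the sentence.
        seq = np.array([abs(i / max(n - 1, 1) - j / max(m - 1, 1)) for j in range(m)])
        # Match each generated word to its cheapest reference word.
        total += np.min(alpha * sem + (1.0 - alpha) * seq)
    return total / n
```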

5. Integration of Recurrent Gaze Encoding

The addition of an LSTM to process gaze sequences (gaze-2seq model) is significant computationally. At each word generation step t, the LSTM encodes the gaze-conditioned visual features g_t into a dynamic hidden state h_t^g, which is used by the caption generator network. This architecture allows temporal dependencies in attentional focus (e.g., the order in which scene elements are processed) to inform language production beyond the current fixation, capturing historical context and sequential dependencies much like human short-term memory does in sequential decision-making.
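A minimal PyTorch sketch of such a recurrent gaze encoder is given below. The class name, dimensions, and the way a caption decoder would consume h_t^g are assumptions in the spirit of the gaze-2seq description, not the published architecture.

```python
import torch
from torch import nn

class GazeEncoder(nn.Module):
    """Sketch of a recurrent gaze encoder in the spirit of gaze-2seq.

    At each word step t, a gaze-conditioned visual vector g_t is folded into
    a running hidden state h_t^g, which the caption decoder can attend to or
    concatenate with its own state. Sizes and wiring are assumptions.
    """
    def __init__(self, visual_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        self.cell = nn.LSTMCell(visual_dim, hidden_dim)

    def forward(self, gaze_feats: torch.Tensor) -> torch.Tensor:
        # gaze_feats: (T, B, visual_dim) sequence of per-step gaze-masked features.
        T, B, _ = gaze_feats.shape
        h = gaze_feats.new_zeros(B, self.cell.hidden_size)
        c = gaze_feats.new_zeros(B, self.cell.hidden_size)
        states = []
        for t in range(T):
            h, c = self.cell(gaze_feats[t], (h, c))  # h_t^g carries fixation history
            states.append(h)
        return torch.stack(states)  # (T, B, hidden_dim), one h_t^g per word step
```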

This recurrent component enables abstraction from raw gaze to higher-level attentional patterns, improving both alignment and the model's ability to compress or generalize over repeated or similar visual-linguistic events in a scene.

6. Experimental Evaluation and Empirical Findings

Experiments were conducted on the Dutch Image Description and Eye-Tracking Corpus (DIDEC), containing paired spoken descriptions and gaze data. Evaluation used established image captioning benchmarks (BLEU-4, CIDEr) alongside SSD, which more directly measures the semantic and temporal alignment of human-like descriptions.
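For the n-gram metric, sentence-level BLEU-4 can be computed with NLTK as in the snippet below; this is a generic illustration of the metric with hypothetical Dutch sentences, not the paper's evaluation pipeline (which additionally reports CIDEr and SSD).

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "er ligt een donut op een bordje".split()   # hypothetical reference description
hypothesis = "een donut ligt op een bord".split()       # hypothetical generated caption

# BLEU-4: uniform weights over 1- to 4-grams, with smoothing for short sentences.
bleu4 = sentence_bleu([reference], hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {bleu4:.3f}")
```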

Main findings:

  • Sequentially processed gaze models (especially gaze-2seq) achieved lower SSD values, indicating better human-aligned captioning, compared to static attention or mean-pooling models.
  • On traditional metrics, all gaze-informed models outperformed the no-gaze baseline, but the greatest advantage in naturalness and uniqueness was observed in the models using full sequential gaze.
  • Qualitative analysis shows enhanced mention of specific, sometimes subtle, image details and greater similarity to authentic human speech patterns.

In summary, integrating sequential, self-generated gaze-driven alignment—particularly via a recurrent (LSTM) processing unit—not only improves standard captioning accuracy but also produces descriptions closely matching the specificity, temporal structure, and variability of human language.

7. Implications and Future Directions

This research demonstrates that self-generated cross-modal alignment strategies grounded in behavioral data can significantly enhance both the utility and human-likeness of language generation systems. It provides a model for leveraging rich, temporally structured attentional data—not just for improved performance but also for deeper cognitive modeling. These findings open avenues for:

  • More robust models of descriptive language in real environments.
  • Quantification of cross-modal structure for cognitive neuroscience, with broader applications in incremental language understanding, grounded robotics, human-computer interaction, and adaptive educational technologies.
  • New evaluation metrics (like SSD) that better reflect human communication dynamics.

This framework offers a foundation for future models that incorporate various forms of sequential behavioral feedback, including eye movements, hand gestures, or embodied sensorimotor traces, to drive alignment across multimodal learning and generation tasks.

References

  • Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze (arXiv:2011.04592).