CMPM: Cross-Modal Perception-Trace Model
- The paper presents a novel model that integrates human eye-tracking data to construct non-sequential, cross-modal perception traces for embedding learning.
- It utilizes pre-trained word and image features reduced via auto-encoders, followed by skip-gram optimization, significantly boosting performance on semantic-similarity benchmarks.
- The model’s layout sensitivity and cross-modal anchoring enable coherent concept clustering, offering new avenues for multi-modal representation learning.
The Cross-Modal Perception-Trace Model (CMPM) is a computational framework for representation learning that leverages human-inspired, cross-modal, non-sequential traces through multi-modal documents, with the goal of producing richer vector-space embeddings. Departing from conventional, heuristic context definitions (such as fixed word windows in text), CMPM directly models empirical perception sequences—so-called perception traces—derived from eye-tracking of human readers engaging with documents containing both text and images. By integrating this perceptually grounded ordering into a skip-gram–like embedding architecture, CMPM achieves substantial gains in semantic similarity tasks and concept categorization, demonstrating that human-driven, layout- and modality-sensitive context captures cognitive salience missing from prior work (Rettinger et al., 2019).
1. Architecture and Representation of CMPM
A multimedia document in CMPM is defined as a triple $D = (H, I, S)$, where $H$ is the sequence of headline text entities, $I$ the set of image regions, and $S = (S_1, \dots, S_m)$ the summary sentences, each a sequence of entities. The central notion is the perception trace $P$, the ordered, human-observed interleaving of text ($e^t$) and image-region ($r^i$) entities as encountered during reading. CMPM constructs $P$ using gaze data:
Algorithm BuildPerceptionTrace(D = (H, I, S)):
    Initialize P ← [] and SeenRegions ← ∅
    For each text entity e in H do
        Append e^t to P
        Let r = region in I aligned to e (if any)
        If r ∉ SeenRegions then Append r^i to P; SeenRegions ∪= {r}
    For j = 1 to m (over summary sentences) do
        For each text entity e in S_j do
            Append e^t to P
            If aligned region r exists and r ∉ SeenRegions then
                Append r^i to P; SeenRegions ∪= {r}
    Return P
Empirical scanpaths typically follow the pattern: headline text → aligned image region → summary sentence text → next new image region, etc., with possible revisits.
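The trace-construction procedure above can be sketched in Python. The document structure and the text-to-region alignment map (`aligned`) are hypothetical stand-ins; in the paper, alignments come from gaze data over AOIs.

```python
# Sketch of BuildPerceptionTrace. Entities are tagged "t" (text, e^t) or
# "i" (image region, r^i); each region enters the trace only on first visit.

def build_perception_trace(headline, sentences, aligned):
    """headline: list of text entities; sentences: list of lists of text
    entities; aligned: dict mapping a text entity to its image region."""
    trace, seen = [], set()

    def visit(entity):
        trace.append((entity, "t"))           # append e^t to P
        region = aligned.get(entity)          # region in I aligned to e, if any
        if region is not None and region not in seen:
            trace.append((region, "i"))       # first visit: append r^i
            seen.add(region)

    for e in headline:                        # headline entities first
        visit(e)
    for sentence in sentences:                # then each summary sentence
        for e in sentence:
            visit(e)
    return trace
```

With `build_perception_trace(["dog"], [["ball"]], {"dog": "r1", "ball": "r2"})`, the result follows the modal pattern headline text → aligned region → sentence text → new region.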
Text entities are initialized from pre-trained word2vec vectors ($300$-d); image entities from Inception-V3 features. Dimensionality reduction is performed via separate auto-encoders (one hidden layer, 100 units, tanh), yielding unified $100$-dimensional embeddings for all entities in the combined text/image vocabulary. The trace $P$ is then treated as a sentence for skip-gram training over this joint vocabulary.
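A minimal numpy sketch of one such per-modality auto-encoder follows: a single tanh hidden layer of 100 units compressing pre-trained features into the shared space. Plain gradient descent stands in for the Adam optimizer reported in the paper, and the input features are synthetic.

```python
import numpy as np

# One-hidden-layer tanh auto-encoder reducing 300-d inputs to a 100-d code.
# Data, initialization scale, and learning rate are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 300))            # e.g. 64 word2vec-style vectors

d_in, d_hid, lr = X.shape[1], 100, 0.01
W1 = rng.normal(scale=0.05, size=(d_in, d_hid)); b1 = np.zeros(d_hid)
W2 = rng.normal(scale=0.05, size=(d_hid, d_in)); b2 = np.zeros(d_in)

for epoch in range(5):                    # 5 epochs, as in the paper
    H = np.tanh(X @ W1 + b1)              # encoder: 100-d code
    X_hat = H @ W2 + b2                   # linear decoder
    err = X_hat - X                       # reconstruction-error gradient
    gW2 = H.T @ err
    gb2 = err.sum(axis=0)
    dH = (err @ W2.T) * (1 - H ** 2)      # backprop through tanh
    gW1 = X.T @ dH
    gb1 = dH.sum(axis=0)
    for p, g in ((W1, gW1), (b1, gb1), (W2, gW2), (b2, gb2)):
        p -= lr * g / len(X)              # averaged gradient step

embeddings = np.tanh(X @ W1 + b1)         # unified 100-d entity vectors
```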
2. Training Procedure and Optimization
The model applies a skip-gram objective to perception traces, defining the conditional probability

$$p(c \mid w) = \frac{\exp(v'_c{}^{\top} v_w)}{\sum_{c' \in V} \exp(v'_{c'}{}^{\top} v_w)}$$

for all observed (center, context) pairs $(w, c)$ in $P$ within a context window of size $k$. The global loss is

$$L = -\sum_{(w, c)} \log p(c \mid w).$$
Training proceeds as follows:
- Separate auto-encoding of the text and image modalities (Adam optimizer, $5$ epochs), yielding the shared $100$-dimensional entity space.
- Skip-gram optimization over perception traces using full soft-max (no negative sampling) and RMSProp ($10$ epochs).
- Final embeddings are computed by summing the target and context vectors for each entity, as in GloVe: $v_e = v_e^{\text{target}} + v_e^{\text{context}}$.
No explicit regularization (dropout, weight penalties, etc.) is used; the principal inductive bias is the human-driven ordering implicit in the perception trace $P$.
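The training loop can be sketched over a toy trace and vocabulary. Plain SGD stands in for RMSProp, and the window size and learning rate are illustrative; the full soft-max and the GloVe-style target-plus-context sum follow the procedure described above.

```python
import numpy as np

# Skip-gram with full soft-max over a perception trace treated as one
# sentence over the joint text/image vocabulary. Trace entities here are
# hypothetical; "^t" marks text entities, "^i" image regions.
trace = ["dog^t", "r1^i", "ball^t", "r2^i", "dog^t"]
vocab = sorted(set(trace))
idx = {e: i for i, e in enumerate(vocab)}
V, d, k, lr = len(vocab), 100, 2, 0.1

rng = np.random.default_rng(0)
T = rng.normal(scale=0.1, size=(V, d))    # target vectors
C = rng.normal(scale=0.1, size=(V, d))    # context vectors

for epoch in range(10):                   # 10 epochs, as in the paper
    for i, center in enumerate(trace):
        w = idx[center]
        for j in range(max(0, i - k), min(len(trace), i + k + 1)):
            if j == i:
                continue
            c = idx[trace[j]]
            logits = C @ T[w]             # scores over the full vocabulary
            p = np.exp(logits - logits.max())
            p /= p.sum()                  # soft-max p(. | center)
            p[c] -= 1.0                   # gradient of -log p(c | center)
            gT = C.T @ p                  # d loss / d T[w] (uses old C)
            C -= lr * np.outer(p, T[w])   # d loss / d C
            T[w] -= lr * gT

embeddings = T + C                        # final vectors: target + context sum
```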
3. Human Perception Data and Statistical Scanpath Patterns
The perception traces are based on eye-tracking experiments with Visual Genome images, for which headlines and summary sentences are extracted and paired with image regions. Participants viewed $16$ documents, each presented with text to the left or right of the image. The eye-tracker provides fixation sequences mapped to named entities (text AOIs) and visual regions (region AOIs).
Key scanpath statistics:
- Most scanpaths exhibit the modal interleaving: headline → aligned image region → sentence → new image region, and so on.
- Traces average on the order of $15$ entities.
- Some traces include returns to previously seen AOIs, but the interleaving structure dominates.
This empirical human data distinguishes CMPM from models using heuristically defined, sequential, or unimodal contexts.
4. Evaluation and Empirical Results
4.1 Semantic Similarity Benchmarks
Performance is assessed on MEN, WordSim-353, and SimLex-999, with evaluation restricted to entity pairs found in Visual Genome. Spearman's $\rho$ is used to compare human similarity ratings with the cosine similarity of the learned vectors. CMPM substantially outperforms both word2vec and GloVe baselines:
| Model | Spearman's $\rho$ (avg.) |
|---|---|
| CMPM (100-d) | 0.755 |
| word2vec (300-d) | 0.565 |
| GloVe (100-d) | 0.451 |
On SimLex-999, CMPM again outperforms word2vec, while GloVe's correlation is negative; the gap corresponds to a large effect size (Rettinger et al., 2019).
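The evaluation protocol (cosine similarity of learned vectors compared with human ratings via Spearman's $\rho$) can be sketched with synthetic stand-ins for the benchmark pairs:

```python
import numpy as np

# Rank correlation between human similarity ratings and model cosine
# similarities. Embeddings, pairs, and ratings below are illustrative.

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def spearman(a, b):
    # rank-transform (no tie correction), then Pearson correlation of ranks
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean(); rb -= rb.mean()
    return (ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb))

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=100) for w in "abcdef"}
pairs = [("a", "b"), ("c", "d"), ("e", "f")]        # hypothetical entity pairs
human = np.array([9.0, 5.0, 1.0])                   # hypothetical ratings
model = np.array([cosine(emb[x], emb[y]) for x, y in pairs])
rho = spearman(human, model)
```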
4.2 Geometry and Concept Clustering
Projection to 3D via PCA shows that CMPM's embedding space has higher intrinsic structure: its first three principal components capture a larger share of total variance than those of GloVe or word2vec, and they correspond to interpretable conceptual axes.
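The variance-in-first-3-PCs measure can be computed as follows; the embedding matrix here is synthetic.

```python
import numpy as np

# Share of total variance captured by the first three principal components
# of an embedding matrix (rows = entities, columns = dimensions).

def top3_variance_ratio(E):
    X = E - E.mean(axis=0)                  # center the embeddings
    s = np.linalg.svd(X, compute_uv=False)  # singular values
    var = s ** 2                            # proportional to PC variances
    return var[:3].sum() / var.sum()

rng = np.random.default_rng(0)
# synthetic 100-d embeddings with decaying per-dimension scale
E = rng.normal(size=(200, 100)) @ np.diag(np.linspace(3, 0.1, 100))
ratio = top3_variance_ratio(E)              # value in (0, 1]
```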
Cluster validity (Affinity Propagation, HDBSCAN) is measured by Silhouette Coefficient (SK) and Variance Ratio Criterion (VRC):
| Model | AP (SK / VRC) | HDBSCAN (SK / VRC) |
|---|---|---|
| CMPM | 0.2346 / 9.6993 | 0.1833 / 6.0923 |
| GloVe | 0.1108 / 3.8113 | 0.0445 / 2.0810 |
| word2vec | 0.0402 / 2.3022 | 0.0350 / 1.8790 |
CMPM forms denser, more coherent clusters with less noise.
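One of the two validity measures, the Variance Ratio Criterion (the Calinski-Harabasz index), can be sketched directly; the toy data and labels below stand in for the embeddings and the Affinity Propagation / HDBSCAN assignments.

```python
import numpy as np

# VRC: between-cluster dispersion over within-cluster dispersion,
# scaled by (n - k) / (k - 1). Higher = denser, better-separated clusters.

def vrc(X, labels):
    n, mu = len(X), X.mean(axis=0)
    clusters = np.unique(labels)
    k = len(clusters)
    between = within = 0.0
    for c in clusters:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        between += len(Xc) * np.sum((mc - mu) ** 2)
        within += np.sum((Xc - mc) ** 2)
    return (between / (k - 1)) / (within / (n - k))

rng = np.random.default_rng(0)
# two tight, well-separated toy clusters
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
score = vrc(X, labels)   # well-separated clusters give a high score
```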
5. Theoretical and Qualitative Advantages
Three distinct advantages of CMPM are identified:
- Cross-modal anchoring: Direct coupling of textual and visual referents, so embeddings span modality boundaries and enforce joint semantics.
- Layout sensitivity: Perceptually salient elements (e.g., headlines) disproportionately influence the context window, supporting rare or otherwise underrepresented entities.
- Non-sequential, revisiting context: Human scanpaths yield interleaved context graphs, surpassing strict left-to-right windows by encoding returns and complex attention shifts.
These mechanisms enable CMPM to surpass prior art, none of which directly mixes text and image tokens during context construction or encodes human scanpaths.
6. Limitations and Future Research Directions
Limitations include:
- Restriction to small-scale, controlled (headline + summary) documents.
- Absence of negative sampling or hierarchical softmax, precluding scalability to web-scale corpora.
- Only two modalities (text, static images) are modeled; video and audio remain unexplored.
- No significance tests reported despite large effect sizes.
Future directions anticipated are:
- Scaling CMPM to web-scale data using automatically extracted perception traces from browser logs.
- Implementing learned, Transformer-style cross-modal attention.
- Expanding to additional modalities (e.g., video, audio, interactive elements).
- Incorporating user-goal–driven trace alignment (task-driven rather than pure-layout–driven traces) (Rettinger et al., 2019).
In summary, CMPM operationalizes human-inspired, cross-modal, and non-sequential context modeling by substituting hand-crafted word windows with empirically derived perception traces, leading to measurable advances in semantic and clustering benchmarks. The approach underlines the promise of perception-driven context for representation learning in multi-modal domains.