
CMPM: Cross-Modal Perception-Trace Model

  • The paper presents a novel model that integrates human eye-tracking data to construct non-sequential, cross-modal perception traces for embedding learning.
  • It utilizes pre-trained word and image features reduced via auto-encoders, followed by skip-gram optimization, significantly boosting performance on semantic similarity benchmarks.
  • The model’s layout sensitivity and cross-modal anchoring enable coherent concept clustering, offering new avenues for multi-modal representation learning.

The Cross-Modal Perception-Trace Model (CMPM) is a computational framework for representation learning that leverages human-inspired, cross-modal, non-sequential traces through multi-modal documents, with the goal of producing richer vector-space embeddings. Departing from conventional, heuristic context definitions (such as fixed word windows in text), CMPM directly models empirical perception sequences—so-called perception traces—derived from eye-tracking of human readers engaging with documents containing both text and images. By integrating this perceptually grounded ordering into a skip-gram–like embedding architecture, CMPM achieves substantial gains in semantic similarity tasks and concept categorization, demonstrating that human-driven, layout- and modality-sensitive context captures cognitive salience missing from prior work (Rettinger et al., 2019).

1. Architecture and Representation of CMPM

A multimedia document in CMPM is defined as a triple $(H, I, S)$, where $H = [h_1, \dots, h_n]$ are headline text entities, $I = \{r_1, \dots, r_M\}$ are image regions, and $S = [S_1, \dots, S_m]$ are summary sentences, each a sequence of entities. The central notion is the perception trace $P$, the ordered, human-observed interleaving of text ($e^t$) and image-region ($e^i$) entities as encountered during reading. CMPM constructs $P$ using gaze data:

Algorithm BuildPerceptionTrace(D=(H,I,S)):
  Initialize P ← [] and SeenRegions ← ∅
  For each text entity e in H do
    Append e^t to P
    Let r = region in I aligned to e (if any)
    If r ∉ SeenRegions then Append r^i to P; SeenRegions ∪= {r}
  For j=1 to m (over sentences):
    For each text entity e in S_j do
      Append e^t to P
      If aligned region r exists and r ∉ SeenRegions then Append r^i; SeenRegions ∪= {r}
  Return P

Empirical scanpaths typically follow the pattern: headline text → aligned image region → summary sentence text → next new image region, etc., with possible revisits.
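The construction above can also be written as a short, runnable Python sketch (the entity lists and the alignment lookup are assumed inputs; all names are illustrative, not the authors' implementation):

def build_perception_trace(H, I, S, aligned_region):
    """Build the perception trace P for a document D = (H, I, S).

    H: headline text entities (list)
    I: image regions (referenced only through aligned_region)
    S: summary sentences, each a list of text entities
    aligned_region: dict mapping a text entity to its aligned region in I, if any
    """
    P, seen_regions = [], set()

    def visit(entity):
        P.append((entity, "t"))                    # text entity enters the trace
        region = aligned_region.get(entity)
        if region is not None and region not in seen_regions:
            P.append((region, "i"))                # aligned region, first visit only
            seen_regions.add(region)

    for e in H:                                    # headline entities first
        visit(e)
    for sentence in S:                             # then summary sentences in order
        for e in sentence:
            visit(e)
    return P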

Text entities $e \in E_t$ are initialized from pre-trained word2vec vectors ($\mathbb{R}^{300}$); image entities $e \in E_i$ from Inception-V3 features ($\mathbb{R}^{2048}$). Dimensionality reduction is performed via separate auto-encoders (one hidden layer, 100 units, tanh), yielding unified $100$-dimensional embeddings $v_{e^m} \in \mathbb{R}^{100}$ for all entities, where $m \in \{t, i\}$. The sequence $P$ is then treated as a sentence for skip-gram training over the combined vocabulary $E = E_t \cup E_i$.
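A minimal sketch of this reduction step, assuming a PyTorch implementation (the one-hidden-layer tanh structure follows the description above; class and variable names are illustrative):

import torch.nn as nn

class EntityAutoEncoder(nn.Module):
    """One tanh hidden layer of 100 units; the hidden activation is the reduced embedding v_{e^m}."""
    def __init__(self, input_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 100), nn.Tanh())
        self.decoder = nn.Linear(100, input_dim)

    def forward(self, x):
        z = self.encoder(x)            # 100-d reduced embedding
        return self.decoder(z), z      # reconstruction and embedding

# One auto-encoder per modality: 300-d word2vec inputs, 2048-d Inception-V3 inputs.
text_ae, image_ae = EntityAutoEncoder(300), EntityAutoEncoder(2048)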

2. Training Procedure and Optimization

The model applies a skip-gram objective to perception traces, defining the conditional probability:

p(e_o \mid e_c) = \frac{\exp(v_{e_o}^\top v_{e_c})}{\sum_{e' \in E} \exp(v_{e'}^\top v_{e_c})}

for all observed (center, context) pairs in $P$ with a context window $w = 5$. The global loss is

\mathcal{L} = -\sum_{P} \sum_{c=1}^{|P|} \sum_{\substack{-w \le j \le w \\ j \neq 0}} \log p\left(P_{c+j} \mid P_c\right).

Training proceeds as follows:

  • Uniform auto-encoding on text and image modalities (Adam optimizer, $5$ epochs), yielding $V^{\mathrm{reduced}} \in \mathbb{R}^{|E| \times 100}$.
  • Skip-gram optimization over $55{,}237$ perception traces using full softmax (no negative sampling) and RMSProp ($\alpha = 10^{-3}$, $10$ epochs).
  • Final embeddings are computed by summing the target and context vectors for each entity, as in GloVe: $v_{e}^{\mathrm{CMPM}} = v_{e}^{(\mathrm{target})} + v_{e}^{(\mathrm{context})}$.

No explicit regularization (dropout, $\ell_2$, etc.) is used; the principal inductive bias is the human-driven ordering implicit in $P$.
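As a concrete illustration, a minimal PyTorch sketch of the full-softmax skip-gram step over a single perception trace, together with the final target-plus-context sum (the training loop and all names are illustrative, not the authors' implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGram(nn.Module):
    """Skip-gram with full softmax over the joint text/image vocabulary E."""
    def __init__(self, vocab_size, dim=100):
        super().__init__()
        # In CMPM both tables would be initialized from the auto-encoded vectors V^reduced.
        self.target = nn.Embedding(vocab_size, dim)     # center-entity vectors
        self.context = nn.Embedding(vocab_size, dim)    # context-entity vectors

    def trace_loss(self, trace, w=5):
        """Negative log-likelihood over all (center, context) pairs within window w."""
        loss = torch.tensor(0.0)
        for c, center in enumerate(trace):              # trace: list of entity indices
            v_c = self.target(torch.tensor(center))
            logits = self.context.weight @ v_c          # scores over all of E (no negative sampling)
            log_probs = F.log_softmax(logits, dim=0)
            for j in range(max(0, c - w), min(len(trace), c + w + 1)):
                if j != c:
                    loss = loss - log_probs[trace[j]]
        return loss

model = SkipGram(vocab_size=10_000)                     # vocabulary size is illustrative
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)
# After training: v_e^CMPM = model.target.weight[e] + model.context.weight[e]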

3. Human Perception Data and Statistical Scanpath Patterns

The perception traces are based on eye-tracking experiments with Visual Genome images, for which headlines and summary sentences are extracted and paired with image regions. Participants ($n = 28$) view $16$ documents, each presented with text to the left or right of the image. The eye-tracker provides fixation sequences mapped to named entities (text AOIs) and visual regions (region AOIs).

Key scanpath statistics:

  • Over $80\%$ of traces exhibit the modal interleaving: headline → aligned image region → sentence → new image region, and so on.
  • Average trace length is $|P| \approx 10$–$15$ entities.
  • Around $20\%$ exhibit returns to previously seen AOIs, but the interleaving structure dominates.

This empirical human data distinguishes CMPM from models using heuristically defined, sequential, or unimodal contexts.

4. Evaluation and Empirical Results

4.1 Semantic Similarity Benchmarks

Performance is assessed on MEN, WordSim-353, and SimLex-999, with evaluation restricted to entity pairs found in Visual Genome ($\approx 38$ pairs per set). Spearman's $\rho$ is used to compare human similarity ratings with cosine similarity of the learned vectors. CMPM substantially outperforms both word2vec and GloVe baselines:

Model              $\rho$ (avg.)
CMPM (100-d)       0.755
word2vec (300-d)   0.565
GloVe (100-d)      0.451

On SimLex-999, CMPM achieves $\rho = 0.462$ vs. word2vec's $\rho = 0.205$ (GloVe is negative). Averaged over the benchmarks, the gain over word2vec corresponds to a large effect size ($\Delta\rho \approx +0.19$) (Rettinger et al., 2019).
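A minimal sketch of this evaluation protocol (the filtered benchmark pairs, human ratings, and an entity-to-vector lookup are assumed inputs; names are illustrative):

import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(pairs, human_scores, vectors):
    """Spearman's rho between human ratings and cosine similarities of learned vectors.

    pairs: list of (entity_a, entity_b) tuples restricted to Visual Genome entities
    human_scores: human similarity rating for each pair
    vectors: dict mapping an entity to its embedding (numpy array)
    """
    cosines = []
    for a, b in pairs:
        va, vb = vectors[a], vectors[b]
        cosines.append(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
    rho, _ = spearmanr(human_scores, cosines)
    return rho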

4.2 Geometry and Concept Clustering

Projection to 3D via PCA shows that CMPM's embedding captures more intrinsic structure (variance explained by the first 3 PCs: CMPM $65.03\%$, GloVe $32.63\%$, word2vec $24.44\%$). CMPM's principal components correspond to interpretable conceptual axes.

Cluster validity (Affinity Propagation, HDBSCAN) is measured by Silhouette Coefficient (SK) and Variance Ratio Criterion (VRC):

Model      AP (SK / VRC)       HDBSCAN (SK / VRC)
CMPM       0.2346 / 9.6993     0.1833 / 6.0923
GloVe      0.1108 / 3.8113     0.0445 / 2.0810
word2vec   0.0402 / 2.3022     0.0350 / 1.8790

CMPM forms denser, more coherent clusters with less noise.
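These cluster-validity numbers can be reproduced along the following lines with scikit-learn (a sketch assuming an embedding matrix X with one row per entity; scikit-learn's HDBSCAN requires version 1.3 or later):

from sklearn.cluster import AffinityPropagation, HDBSCAN
from sklearn.metrics import silhouette_score, calinski_harabasz_score

def cluster_validity(X):
    """Silhouette Coefficient (SK) and Variance Ratio Criterion (VRC) for both clusterings."""
    results = {}
    for name, model in [("AP", AffinityPropagation()), ("HDBSCAN", HDBSCAN())]:
        labels = model.fit_predict(X)
        mask = labels != -1                      # HDBSCAN labels noise points with -1
        results[name] = (silhouette_score(X[mask], labels[mask]),
                         calinski_harabasz_score(X[mask], labels[mask]))
    return results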

5. Theoretical and Qualitative Advantages

Three distinct advantages of CMPM are identified:

  • Cross-modal anchoring: Direct coupling of textual and visual referents, so embeddings span modality boundaries and enforce joint semantics.
  • Layout sensitivity: Perceptually salient elements (e.g., headlines) disproportionately influence the context window, supporting rare or otherwise underrepresented entities.
  • Non-sequential, revisiting context: Human scanpaths yield interleaved context graphs, surpassing strict left-to-right windows by encoding returns and complex attention shifts.

These mechanisms enable CMPM to surpass prior art, none of which directly mixes text and image tokens during context construction or encodes human scanpaths.

6. Limitations and Future Research Directions

Limitations include:

  • Restriction to small-scale, controlled (headline + summary) documents.
  • Absence of negative sampling or hierarchical softmax, precluding scalability to web-scale corpora.
  • Only two modalities (text, static images) are modeled; video and audio remain unexplored.
  • No significance tests reported despite large effect sizes.

Future directions anticipated are:

  • Scaling CMPM to web-scale data using automatically extracted perception traces from browser logs.
  • Implementing learned, Transformer-style cross-modal attention.
  • Expanding to additional modalities (e.g., video, audio, interactive elements).
  • Incorporating user-goal–driven trace alignment (task-driven rather than pure-layout–driven traces) (Rettinger et al., 2019).

In summary, CMPM operationalizes human-inspired, cross-modal, and non-sequential context modeling by substituting hand-crafted word windows with empirically derived perception traces, leading to measurable advances in semantic and clustering benchmarks. The approach underlines the promise of perception-driven context for representation learning in multi-modal domains.
