CMPM: Cross-Modal Perception-Trace Model
- The paper presents a novel model that integrates human eye-tracking data to construct non-sequential, cross-modal perception traces for embedding learning.
- It utilizes pre-trained word and image features reduced via auto-encoders, followed by skip-gram optimization, significantly boosting performance on semantic-similarity benchmarks.
- The model’s layout sensitivity and cross-modal anchoring enable coherent concept clustering, offering new avenues for multi-modal representation learning.
The Cross-Modal Perception-Trace Model (CMPM) is a computational framework for representation learning that leverages human-inspired, cross-modal, non-sequential traces through multi-modal documents, with the goal of producing richer vector-space embeddings. Departing from conventional, heuristic context definitions (such as fixed word windows in text), CMPM directly models empirical perception sequences—so-called perception traces—derived from eye-tracking of human readers engaging with documents containing both text and images. By integrating this perceptually grounded ordering into a skip-gram–like embedding architecture, CMPM achieves substantial gains in semantic similarity tasks and concept categorization, demonstrating that human-driven, layout- and modality-sensitive context captures cognitive salience missing from prior work (Rettinger et al., 2019).
1. Architecture and Representation of CMPM
A multimedia document in CMPM is defined as a triple $D = (H, I, S)$, where $H$ is the sequence of headline text entities, $I$ the set of image regions, and $S = (S_1, \dots, S_m)$ the summary sentences, each a sequence of entities. The central notion is the perception trace $P$, the ordered, human-observed interleaving of text ($e^t$) and image-region ($r^i$) entities as encountered during reading. CMPM constructs $P$ using gaze data:
Algorithm BuildPerceptionTrace(D = (H, I, S)):
    Initialize P ← [] and SeenRegions ← ∅
    For each text entity e in H do
        Append e^t to P
        Let r = region in I aligned to e (if any)
        If r ∉ SeenRegions then Append r^i to P; SeenRegions ∪= {r}
    For j = 1 to m (over summary sentences) do
        For each text entity e in S_j do
            Append e^t to P
            If aligned region r exists and r ∉ SeenRegions then
                Append r^i to P; SeenRegions ∪= {r}
    Return P
Empirical scanpaths typically follow the pattern: headline text → aligned image region → summary sentence text → next new image region, etc., with possible revisits.
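The trace-construction procedure above can be sketched in Python. The document structure and the text-to-region alignment map (`aligned`) are hypothetical stand-ins; in the paper, alignments come from gaze data over AOIs.

```python
# Sketch of BuildPerceptionTrace. Entities are tagged "t" (text, e^t) or
# "i" (image region, r^i); each region enters the trace only on first visit.

def build_perception_trace(headline, sentences, aligned):
    """headline: list of text entities; sentences: list of lists of text
    entities; aligned: dict mapping a text entity to its image region."""
    trace, seen = [], set()

    def visit(entity):
        trace.append((entity, "t"))           # append e^t to P
        region = aligned.get(entity)          # region in I aligned to e, if any
        if region is not None and region not in seen:
            trace.append((region, "i"))       # first visit: append r^i
            seen.add(region)

    for e in headline:                        # headline entities first
        visit(e)
    for sentence in sentences:                # then each summary sentence
        for e in sentence:
            visit(e)
    return trace
```

With `build_perception_trace(["dog"], [["ball"]], {"dog": "r1", "ball": "r2"})`, the result follows the modal pattern headline text → aligned region → sentence text → new region.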
Text entities are initialized from pre-trained word2vec vectors ($300$-d); image entities from Inception-V3 features. Dimensionality reduction is performed via separate auto-encoders (one hidden layer, 100 units, tanh), yielding unified $100$-dimensional embeddings for all entities in the combined text/image vocabulary. The trace $P$ is then treated as a sentence for skip-gram training over this joint vocabulary.
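A minimal numpy sketch of one such per-modality auto-encoder follows: a single tanh hidden layer of 100 units compressing pre-trained features into the shared space. Plain gradient descent stands in for the Adam optimizer reported in the paper, and the input features are synthetic.

```python
import numpy as np

# One-hidden-layer tanh auto-encoder reducing 300-d inputs to a 100-d code.
# Data, initialization scale, and learning rate are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 300))            # e.g. 64 word2vec-style vectors

d_in, d_hid, lr = X.shape[1], 100, 0.01
W1 = rng.normal(scale=0.05, size=(d_in, d_hid)); b1 = np.zeros(d_hid)
W2 = rng.normal(scale=0.05, size=(d_hid, d_in)); b2 = np.zeros(d_in)

for epoch in range(5):                    # 5 epochs, as in the paper
    H = np.tanh(X @ W1 + b1)              # encoder: 100-d code
    X_hat = H @ W2 + b2                   # linear decoder
    err = X_hat - X                       # reconstruction-error gradient
    gW2 = H.T @ err
    gb2 = err.sum(axis=0)
    dH = (err @ W2.T) * (1 - H ** 2)      # backprop through tanh
    gW1 = X.T @ dH
    gb1 = dH.sum(axis=0)
    for p, g in ((W1, gW1), (b1, gb1), (W2, gW2), (b2, gb2)):
        p -= lr * g / len(X)              # averaged gradient step

embeddings = np.tanh(X @ W1 + b1)         # unified 100-d entity vectors
```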
2. Training Procedure and Optimization
The model applies a skip-gram objective to perception traces, defining the conditional probability

$$p(c \mid w) = \frac{\exp(v'_c{}^{\top} v_w)}{\sum_{c' \in V} \exp(v'_{c'}{}^{\top} v_w)}$$

for all observed (center, context) pairs $(w, c)$ in $P$ within a context window of size $k$. The global loss is

$$L = -\sum_{(w, c)} \log p(c \mid w).$$
Training proceeds as follows:
- Separate auto-encoding of the text and image modalities (Adam optimizer, $5$ epochs), yielding the shared $100$-dimensional entity space.
- Skip-gram optimization over perception traces using full soft-max (no negative sampling) and RMSProp ($10$ epochs).
- Final embeddings are computed by summing the target and context vectors for each entity, as in GloVe: $v_e = v_e^{\text{target}} + v_e^{\text{context}}$.
No explicit regularization (dropout, weight penalties, etc.) is used; the principal inductive bias is the human-driven ordering implicit in the perception trace $P$.
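The training loop can be sketched over a toy trace and vocabulary. Plain SGD stands in for RMSProp, and the window size and learning rate are illustrative; the full soft-max and the GloVe-style target-plus-context sum follow the procedure described above.

```python
import numpy as np

# Skip-gram with full soft-max over a perception trace treated as one
# sentence over the joint text/image vocabulary. Trace entities here are
# hypothetical; "^t" marks text entities, "^i" image regions.
trace = ["dog^t", "r1^i", "ball^t", "r2^i", "dog^t"]
vocab = sorted(set(trace))
idx = {e: i for i, e in enumerate(vocab)}
V, d, k, lr = len(vocab), 100, 2, 0.1

rng = np.random.default_rng(0)
T = rng.normal(scale=0.1, size=(V, d))    # target vectors
C = rng.normal(scale=0.1, size=(V, d))    # context vectors

for epoch in range(10):                   # 10 epochs, as in the paper
    for i, center in enumerate(trace):
        w = idx[center]
        for j in range(max(0, i - k), min(len(trace), i + k + 1)):
            if j == i:
                continue
            c = idx[trace[j]]
            logits = C @ T[w]             # scores over the full vocabulary
            p = np.exp(logits - logits.max())
            p /= p.sum()                  # soft-max p(. | center)
            p[c] -= 1.0                   # gradient of -log p(c | center)
            gT = C.T @ p                  # d loss / d T[w] (uses old C)
            C -= lr * np.outer(p, T[w])   # d loss / d C
            T[w] -= lr * gT

embeddings = T + C                        # final vectors: target + context sum
```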
3. Human Perception Data and Statistical Scanpath Patterns
The perception traces are based on eye-tracking experiments with Visual Genome images, for which headlines and summary sentences are extracted and paired with image regions. Participants viewed $16$ documents, each presented with text to the left or right of the image. The eye-tracker provides fixation sequences mapped to named entities (text AOIs) and visual regions (region AOIs).
Key scanpath statistics:
- Most scanpaths exhibit the modal interleaving: headline → aligned image region → sentence → new image region, and so on.
- Traces average on the order of $15$ entities.
- Some traces include returns to previously seen AOIs, but the interleaving structure dominates.
This empirical human data distinguishes CMPM from models using heuristically defined, sequential, or unimodal contexts.
4. Evaluation and Empirical Results
4.1 Semantic Similarity Benchmarks
Performance is assessed on MEN, WordSim-353, and SimLex-999, with evaluation restricted to entity pairs found in Visual Genome. Spearman's $\rho$ is used to compare human similarity ratings with the cosine similarity of the learned vectors. CMPM substantially outperforms both word2vec and GloVe baselines:
| Model | Spearman's $\rho$ (avg.) |
|---|---|
| CMPM (100-d) | 0.755 |
| word2vec (300-d) | 0.565 |
| GloVe (100-d) | 0.451 |
On SimLex-999, CMPM again outperforms word2vec, while GloVe's correlation is negative; the gap corresponds to a large effect size (Rettinger et al., 2019).
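The evaluation protocol (cosine similarity of learned vectors compared with human ratings via Spearman's $\rho$) can be sketched with synthetic stand-ins for the benchmark pairs:

```python
import numpy as np

# Rank correlation between human similarity ratings and model cosine
# similarities. Embeddings, pairs, and ratings below are illustrative.

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def spearman(a, b):
    # rank-transform (no tie correction), then Pearson correlation of ranks
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean(); rb -= rb.mean()
    return (ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb))

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=100) for w in "abcdef"}
pairs = [("a", "b"), ("c", "d"), ("e", "f")]        # hypothetical entity pairs
human = np.array([9.0, 5.0, 1.0])                   # hypothetical ratings
model = np.array([cosine(emb[x], emb[y]) for x, y in pairs])
rho = spearman(human, model)
```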
4.2 Geometry and Concept Clustering
Projection to 3D via PCA shows that CMPM's embedding space has higher intrinsic structure: its first three principal components capture a larger share of total variance than those of GloVe or word2vec, and they correspond to interpretable conceptual axes.
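The variance-in-first-3-PCs measure can be computed as follows; the embedding matrix here is synthetic.

```python
import numpy as np

# Share of total variance captured by the first three principal components
# of an embedding matrix (rows = entities, columns = dimensions).

def top3_variance_ratio(E):
    X = E - E.mean(axis=0)                  # center the embeddings
    s = np.linalg.svd(X, compute_uv=False)  # singular values
    var = s ** 2                            # proportional to PC variances
    return var[:3].sum() / var.sum()

rng = np.random.default_rng(0)
# synthetic 100-d embeddings with decaying per-dimension scale
E = rng.normal(size=(200, 100)) @ np.diag(np.linspace(3, 0.1, 100))
ratio = top3_variance_ratio(E)              # value in (0, 1]
```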
Cluster validity (Affinity Propagation, HDBSCAN) is measured by Silhouette Coefficient (SK) and Variance Ratio Criterion (VRC):
| Model | AP (SK / VRC) | HDBSCAN (SK / VRC) |
|---|---|---|
| CMPM | 0.2346 / 9.6993 | 0.1833 / 6.0923 |
| GloVe | 0.1108 / 3.8113 | 0.0445 / 2.0810 |
| word2vec | 0.0402 / 2.3022 | 0.0350 / 1.8790 |
CMPM forms denser, more coherent clusters with less noise.
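One of the two validity measures, the Variance Ratio Criterion (the Calinski-Harabasz index), can be sketched directly; the toy data and labels below stand in for the embeddings and the Affinity Propagation / HDBSCAN assignments.

```python
import numpy as np

# VRC: between-cluster dispersion over within-cluster dispersion,
# scaled by (n - k) / (k - 1). Higher = denser, better-separated clusters.

def vrc(X, labels):
    n, mu = len(X), X.mean(axis=0)
    clusters = np.unique(labels)
    k = len(clusters)
    between = within = 0.0
    for c in clusters:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        between += len(Xc) * np.sum((mc - mu) ** 2)
        within += np.sum((Xc - mc) ** 2)
    return (between / (k - 1)) / (within / (n - k))

rng = np.random.default_rng(0)
# two tight, well-separated toy clusters
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
score = vrc(X, labels)   # well-separated clusters give a high score
```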
5. Theoretical and Qualitative Advantages
Three distinct advantages of CMPM are identified:
- Cross-modal anchoring: Direct coupling of textual and visual referents, so embeddings span modality boundaries and enforce joint semantics.
- Layout sensitivity: Perceptually salient elements (e.g., headlines) disproportionately influence the context window, supporting rare or otherwise underrepresented entities.
- Non-sequential, revisiting context: Human scanpaths yield interleaved context graphs, surpassing strict left-to-right windows by encoding returns and complex attention shifts.
These mechanisms enable CMPM to surpass prior art, none of which directly mixes text and image tokens during context construction or encodes human scanpaths.
6. Limitations and Future Research Directions
Limitations include:
- Restriction to small-scale, controlled (headline + summary) documents.
- Absence of negative sampling or hierarchical softmax, precluding scalability to web-scale corpora.
- Only two modalities (text, static images) are modeled; video and audio remain unexplored.
- No significance tests reported despite large effect sizes.
Future directions anticipated are:
- Scaling CMPM to web-scale data using automatically extracted perception traces from browser logs.
- Implementing learned, Transformer-style cross-modal attention.
- Expanding to additional modalities (e.g., video, audio, interactive elements).
- Incorporating user-goal–driven trace alignment (task-driven rather than pure-layout–driven traces) (Rettinger et al., 2019).
In summary, CMPM operationalizes human-inspired, cross-modal, and non-sequential context modeling by substituting hand-crafted word windows with empirically derived perception traces, leading to measurable advances in semantic and clustering benchmarks. The approach underlines the promise of perception-driven context for representation learning in multi-modal domains.