NICL: Next-item Enhanced Cross-modal Contrastive Learning
- The paper introduces NICL, a novel contrastive learning objective that fuses cross-modal alignment with next-item transition modeling at the encoder level for recommender systems.
- NICL constructs positive pairs from both same-item and next-item examples, while treating all items of other users in the batch, in both modalities, as in-batch negatives to enforce a unified embedding space.
- Empirical ablations show that NICL enhances HR@10 and cross-domain transferability, effectively mitigating cold-start issues without relying on ID-based methods.
Next-item Enhanced Cross-modal Contrastive Learning (NICL) is a contrastive learning objective explicitly designed for multi-modal recommender systems to fuse cross-modal alignment and sequential transition modeling at the encoder level. NICL is introduced in the PMMRec architecture within "Multi-Modality is All You Need for Transferable Recommender Systems" (Li et al., 2023) and is central to achieving high transferability and robustness in recommendation tasks without reliance on ID-based paradigms. The objective unifies text and vision modalities in a shared embedding space while explicitly endowing the item encoders themselves with transition (next-item) modeling capability.
1. Mathematical Formulation
NICL operates on a mini-batch of $B$ user behavior sequences. Each sequence $s_u = [i_1^u, \dots, i_L^u]$ of length $L$ for user $u$ consists of items, and each item $i_l^u$ is represented by modality-specific encoders:
- $t_l^u$: text encoder output (ℓ₂-normalized)
- $v_l^u$: vision encoder output (ℓ₂-normalized)

The similarity function is the dot product of the normalized embeddings,
$$\operatorname{sim}(a, b) = a^\top b,$$
with no learnable or tunable temperature ($\tau = 1$).

Given anchor $t_l^u$ (with $l < L$), let
$$\mathcal{P}(t_l^u) = \{\, v_l^u,\; v_{l+1}^u,\; t_{l+1}^u \,\}$$
denote its positives, and let
$$\mathcal{N}(t_l^u) = \{\, t_{l'}^{u'},\; v_{l'}^{u'} \;:\; u' \neq u,\; 1 \le l' \le L \,\}$$
denote the in-batch negatives drawn from all other users $u' \neq u$ at all positions $l'$, in both modalities.

For each anchor $t_l^u$, two symmetric losses are applied:
$$\mathcal{L}^{T,V}_{i_l^u} = -\log \frac{\sum_{p \in \mathcal{P}(t_l^u)} \exp\!\big(\operatorname{sim}(t_l^u, p)\big)}{\sum_{p \in \mathcal{P}(t_l^u)} \exp\!\big(\operatorname{sim}(t_l^u, p)\big) + \sum_{n \in \mathcal{N}(t_l^u)} \exp\!\big(\operatorname{sim}(t_l^u, n)\big)},$$
with $\mathcal{L}^{V,T}_{i_l^u}$ defined analogously for the vision-modality anchor $v_l^u$ and positives $\{t_l^u,\; t_{l+1}^u,\; v_{l+1}^u\}$.

NICL is the average of these over both directions, summed over all users and positions:
$$\mathcal{L}^{\mathrm{NICL}} = \frac{1}{B(L-1)} \sum_{u=1}^{B} \sum_{l=1}^{L-1} \frac{1}{2}\Big(\mathcal{L}^{T,V}_{i_l^u} + \mathcal{L}^{V,T}_{i_l^u}\Big).$$
This formulation explicitly acknowledges both primary (same-item, inter-modality) and next-item (cross- and intra-modality) positives, with all items of other users in the batch serving as negatives in both modalities.
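A minimal PyTorch sketch of the per-anchor loss above is shown below. It is an illustration of the formulation, not the authors' implementation; the function name `nicl_anchor_loss` and its argument names are hypothetical, and all inputs are assumed to be ℓ₂-normalized embeddings.

```python
import torch

def nicl_anchor_loss(anchor, pos_same, pos_next_cross, pos_next_intra, negatives):
    """One direction of NICL for a single anchor (e.g. L^{T,V} for t_l^u).

    anchor:          (d,)   anchor embedding, e.g. t_l^u
    pos_same:        (d,)   same-item, other-modality positive, e.g. v_l^u
    pos_next_cross:  (d,)   next-item, other-modality positive, e.g. v_{l+1}^u
    pos_next_intra:  (d,)   next-item, same-modality positive, e.g. t_{l+1}^u
    negatives:       (N, d) in-batch negatives from other users, both modalities
    All inputs are assumed ℓ₂-normalized, so dot products are the similarities (tau = 1).
    """
    positives = torch.stack([pos_same, pos_next_cross, pos_next_intra])  # (3, d)
    pos_logits = positives @ anchor                                      # (3,)
    neg_logits = negatives @ anchor                                      # (N,)
    pos_sum = pos_logits.exp().sum()
    return -torch.log(pos_sum / (pos_sum + neg_logits.exp().sum()))
```

The symmetric vision-anchored term $\mathcal{L}^{V,T}$ is obtained by swapping the roles of the text and vision embeddings; averaging the two directions and then over all anchors yields $\mathcal{L}^{\mathrm{NICL}}$.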
2. Construction of Positive and Negative Pairs
NICL distinguishes itself by the systematic inclusion of next-item transition information into contrastive objectives. The positive and negative pairings are constructed as follows:
| Pair Type | Anchor (text / vision) | Paired Example |
|---|---|---|
| Inter-modality positive (same item) | $t_l^u$ / $v_l^u$ | $v_l^u$ / $t_l^u$ |
| Inter-modality positive (next item) | $t_l^u$ / $v_l^u$ | $v_{l+1}^u$ / $t_{l+1}^u$ |
| Intra-modality positive (next item) | $t_l^u$ / $v_l^u$ | $t_{l+1}^u$ / $v_{l+1}^u$ |
| Inter-modality negatives | $t_l^u$ / $v_l^u$ | all $v_{l'}^{u'}$ / $t_{l'}^{u'}$ for $u' \neq u$, $1 \le l' \le L$ |
| Intra-modality negatives | $t_l^u$ / $v_l^u$ | all $t_{l'}^{u'}$ / $v_{l'}^{u'}$ for $u' \neq u$, $1 \le l' \le L$ |
This mixture ensures that the shared representation space encodes both modality alignment and ordered behavioral transitions.
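To make the table concrete, the following sketch enumerates, for a single text-modality anchor, which in-batch items serve as positives and which as negatives. The helper `nicl_pairs` and its `(modality, user, position)` tuple convention are illustrative assumptions, not part of the paper.

```python
def nicl_pairs(u, l, B, L):
    """Enumerate NICL pair indices for the text-modality anchor t_l^u.

    Returns (positives, negatives) as lists of (modality, user, position) tuples,
    mirroring the table above; modality is 'T' (text) or 'V' (vision).
    Assumes 0-based indices and l <= L - 2 so that a next item exists.
    """
    positives = [
        ('V', u, l),       # inter-modality, same item
        ('V', u, l + 1),   # inter-modality, next item
        ('T', u, l + 1),   # intra-modality, next item
    ]
    negatives = [
        (m, u_other, l_other)
        for m in ('T', 'V')          # both modalities
        for u_other in range(B)      # every *other* user in the batch ...
        if u_other != u
        for l_other in range(L)      # ... at every position
    ]
    return positives, negatives
```

Note that items of the same user other than the three positives are simply excluded: under this construction they act neither as positives nor as negatives.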
3. Injection of Next-item Information
NICL's central innovation is its use of next-item positives, both cross-modally and intra-modally, thereby "front-loading" sequential transition knowledge into the item encoders themselves. Concretely:
- For anchor $t_l^u$, positives are $v_l^u$ (same-item, inter-modal), $v_{l+1}^u$ (next-item, cross-modal), and $t_{l+1}^u$ (next-item, intra-modal).
- Symmetric construction applies for anchors in the vision modality.
Unlike objectives that only align the same item or defer transition modeling to downstream sequence modules, NICL ensures that each modality-specific item encoder, via contrastive pairing, predicts the representation of immediate sequelae in the user's interaction sequence. This mechanism injects Markovian transition patterns directly into the learned representations.
4. Integration Within the PMMRec Architecture
NICL is executed within the PMMRec pretraining pipeline as a standalone loss involving only the outputs of the modality-specific encoders, prior to any fusion or user-level modeling. The core workflow is as follows:
```
for each pre-training step:
    # Sample user sequences
    { s_u = [i_1^u ... i_L^u] } for u = 1,...,B

    # Item encoding
    for u in 1...B:
        for l in 1...L:
            t_l^u = TextEncoder(text_tokens(i_l^u))       # ℓ₂-normalized
            v_l^u = VisionEncoder(image_patches(i_l^u))   # ℓ₂-normalized

    # Fusion and user-level autoregression (DAP)
    for u in 1...B, l in 1...L:
        e_l^u = FusionModule([t_l^u_tokens ; v_l^u_patches])
    {h_l^u} = UserTransformer([e_1^u + pos_1 ... e_L^u + pos_L])
    L^DAP = mean next-item autoregressive loss using h_l^u vs. e_{l+1}^u

    # NICL objective: uses only t_l^u, v_l^u before fusion
    L^NICL = 0
    for u in 1...B:
        for l in 1...L-1:
            compute L^{T,V}_{i_l^u} and L^{V,T}_{i_l^u}
            L^NICL += 0.5 * (L^{T,V}_{i_l^u} + L^{V,T}_{i_l^u})
    L^NICL /= (B * (L-1))

    # Add self-supervised denoising objectives (NID, RCL)
    # ...

    # Final multi-task loss and parameter updates
    L_total = L^DAP + L^NICL + L^NID + L^RCL
    update all parameters
```
NICL thus operates orthogonally to the downstream autoregressive and fusion-based objectives, focusing exclusively on aligning the raw modality encodings and enriching them with sequential structure.
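The NICL portion of the step above can also be computed in a vectorized fashion. The sketch below is an illustration under the stated formulation (the tensor layout, the helper name `nicl_loss`, and all intermediate names are assumptions, not the authors' code); it takes ℓ₂-normalized text and vision embeddings of shape (B, L, d) and returns the scalar $\mathcal{L}^{\mathrm{NICL}}$.

```python
import torch

def nicl_loss(t, v):
    """Vectorized NICL sketch for one mini-batch.

    t, v : (B, L, d) ℓ₂-normalized text / vision item embeddings.
    Returns L^NICL averaged over both directions and all B*(L-1) anchors.
    """
    B, L, d = t.shape
    # Candidate pool: all text embeddings followed by all vision embeddings.
    pool = torch.cat([t, v], dim=0).reshape(2 * B * L, d)

    # Index helpers into the flattened pool.
    def t_idx(u, l):
        return u * L + l
    def v_idx(u, l):
        return B * L + u * L + l

    n_anchor = B * (L - 1)
    users = torch.arange(B, device=t.device).repeat_interleave(L - 1)  # anchor user ids
    poss = torch.arange(L - 1, device=t.device).repeat(B)              # anchor positions

    # Negatives: every item of every *other* user, in both modalities.
    pool_user = torch.arange(B, device=t.device).repeat_interleave(L).repeat(2)
    neg_mask = users[:, None] != pool_user[None, :]                    # (n_anchor, 2*B*L)

    def one_direction(anchors, pos_cols):
        logits = anchors @ pool.T                           # dot-product sim, tau = 1
        pos = logits.gather(1, pos_cols).exp().sum(dim=1)   # three positives per anchor
        neg = (logits.exp() * neg_mask).sum(dim=1)
        return -torch.log(pos / (pos + neg))

    # T -> V direction: anchor t_l^u, positives v_l^u, v_{l+1}^u, t_{l+1}^u.
    t_anchors = t[:, :-1].reshape(n_anchor, d)
    pos_tv = torch.stack(
        [v_idx(users, poss), v_idx(users, poss + 1), t_idx(users, poss + 1)], dim=1)

    # V -> T direction: anchor v_l^u, positives t_l^u, t_{l+1}^u, v_{l+1}^u.
    v_anchors = v[:, :-1].reshape(n_anchor, d)
    pos_vt = torch.stack(
        [t_idx(users, poss), t_idx(users, poss + 1), v_idx(users, poss + 1)], dim=1)

    per_anchor = 0.5 * (one_direction(t_anchors, pos_tv) + one_direction(v_anchors, pos_vt))
    return per_anchor.mean()
```

As a quick shape check, calling `nicl_loss` on normalized random tensors of shape (4, 6, 32) for both modalities returns a single scalar, which would then be summed with the other pretraining losses as in the pseudocode above.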
5. Key Hyperparameters and Ablation Observations
Critical NICL hyperparameters and design choices are:
- Batch size: $B$ user sequences per GPU.
- Sequence length: $L$ is dataset-specific.
- Similarity temperature: no explicit temperature parameter ($\tau = 1$).
- Number of negatives: $(B-1)\,L$ per anchor, per modality (both inter- and intra-modality); see the worked example below.
- Averaging: the loss is averaged over all $B(L-1)$ anchors.
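For concreteness, a short back-of-the-envelope calculation with hypothetical values ($B = 64$ and $L = 10$ are illustrative assumptions, not figures from the paper):

```python
B, L = 64, 10                                  # hypothetical batch size and sequence length

anchors_per_direction = B * (L - 1)            # 576 anchors contribute per direction
negatives_per_modality = (B - 1) * L           # 630 in-batch negatives per anchor, per modality
negatives_total = 2 * negatives_per_modality   # 1260 negatives per anchor across both modalities

print(anchors_per_direction, negatives_per_modality, negatives_total)
```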
Ablations in the referenced work demonstrate:
- Removing NICL (keeping only next-item autoregression and denoising) reduces HR@10 by up to 1–2 points.
- Replacing NICL with vanilla cross-modal contrastive learning (without next-item positives) or with intra-modal enhanced contrastive learning (without cross-modal next-item positives) yields intermediate performance.
- Disabling either the next-item positives or the intra-modality negatives results in a 3–5% relative drop in performance, confirming that both elements are necessary for optimal transfer and recommendation accuracy.
6. Functional Role and Empirical Significance
NICL is largely responsible for two key attributes of PMMRec: the integration of cross-modal (text-image) alignment and the embedding of transition (next-item) behavior patterns at the item-encoder level. By blending inter- and intra-modality negatives with next-item positives, NICL:
- Aligns text and vision item representations into a unified latent space suitable for downstream recommendation.
- Encodes one-step Markovian transitions within the item encoders, without depending solely on higher-level user or sequence encoders.
- Empowers the modality encoders with transferability; after pretraining, these encoders generalize across domains without retraining, as demonstrated by consistent improvements over strong baselines in multi-domain benchmarks.
The mechanism further distinguishes itself from traditional contrastive objectives that either only align the same item across modalities or rely solely on sequence-level modeling for transition information.
7. Context and Implications for Transferable Recommender Systems
NICL is formulated to address two deficiencies inherent in ID-based and unimodal recommendation approaches: the lack of transferability and the cold start problem. By relying exclusively on multi-modal content and embedding the sequential inductive bias at the encoder level, PMMRec—via NICL—sidesteps ID dependence and exhibits strong cross-domain transfer adaptivity.
A plausible implication is that the introduction of structured next-item contrastive objectives generalizes to other multi-modal or sequential learning tasks beyond recommendation, wherever cross-modal alignment and temporal dependence are jointly required.
In sum, NICL is a core component enabling PMMRec to outperform prior state-of-the-art on transfer and recommendation metrics, while supporting decoupled pretraining and reusability of its constituent modules (Li et al., 2023).