Cross-Consistent Contrastive Loss (CCCL)

Updated 8 December 2025
  • The paper introduces CCCL, which regularizes intra-modal similarity by aligning text and motion distributions, addressing manifold collapse in low-data regimes.
  • CCCL enforces cross-consistency via softmaxed similarity distributions and symmetric KL divergence, integrating teacher-guided distillation for early stage alignment.
  • Empirical results on KITML and HumanML3D show significant gains, with improved retrieval metrics and reduced overfitting compared to traditional InfoNCE.

Cross-Consistent Contrastive Loss (CCCL) is a loss function for cross-modal retrieval that augments traditional contrastive losses in order to regularize the geometry of uni-modal subspaces in joint embedding models. CCCL is motivated by limitations of standard cross-modal contrastive objectives such as InfoNCE, which align matching pairs between modalities (e.g., text and motion) but leave the internal structure of each modality unconstrained. This leads to poor generalization and manifold collapse, especially in the low-data regimes typical of text-to-motion and motion-to-text retrieval tasks. CCCL introduces cross-consistency constraints by enforcing that the distributions of similarities within each modality reflect those of the cross-modal pairs, and it incorporates teacher-guided distillation to seed a consistent manifold geometry early in training (Messina et al., 2024).

1. Motivation and Conceptual Rationale

Standard cross-modal contrastive learning frameworks, such as InfoNCE, optimize an objective where matching text–motion pairs are brought together and non-matching pairs are pushed apart. However, these frameworks do not regularize the distribution of semantic relationships within modalities. For instance, semantically similar texts or motions may be encoded far apart in the joint embedding space if their cross-modal affinities differ, resulting in poor representation and loss of semantic structure. CCCL addresses this by introducing two cross-consistency families:

  • Uni-modal text–text consistency: Aligns the distribution of text–text similarities with the cross-modal text–motion similarity distribution.
  • Uni-modal motion–motion consistency: Imposes a similar constraint for motion–motion similarities.

By tying the ranking structure within modalities to cross-modal relationships, CCCL regularizes the geometry of both manifolds, which can reduce over-fitting and enhance robustness.

2. Mathematical Structure

Given a batch $\{(T_i, M_i)\}_{i=1}^{B}$ of text–motion pairs, CCCL utilizes the following encoded representations and similarity metrics:

  • Encoded text: $\mathbf{t}_i = \mathcal{E}_t(T_i)$
  • Encoded motion: $\mathbf{m}_i = \mathcal{E}_m(M_i)$
  • Similarity: $s(x, y) = x^\top y / (\|x\|\,\|y\|)$
  • Batch size: $B$
  • Temperature parameter: $\tau > 0$ (learned)

The standard cross-modal InfoNCE loss is:

$$\mathcal{L}_{\rm nce} = -\frac{1}{B} \sum_{i=1}^{B} \Bigl[ \log \frac{\exp\bigl(s(\mathbf{m}_i,\mathbf{t}_i)/\tau\bigr)}{\sum_{j=1}^{B}\exp\bigl(s(\mathbf{m}_i,\mathbf{t}_j)/\tau\bigr)} + \log \frac{\exp\bigl(s(\mathbf{m}_i,\mathbf{t}_i)/\tau\bigr)}{\sum_{j=1}^{B}\exp\bigl(s(\mathbf{m}_j,\mathbf{t}_i)/\tau\bigr)} \Bigr]$$
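
A minimal PyTorch sketch of this symmetric InfoNCE term, assuming the embeddings are already computed (tensor names and shapes are illustrative, not taken from the paper's code):

```python
import torch
import torch.nn.functional as F

def info_nce(t: torch.Tensor, m: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric cross-modal InfoNCE over a batch of (text, motion) embeddings.

    t, m: (B, D) embeddings; row i of t matches row i of m.
    """
    t = F.normalize(t, dim=-1)
    m = F.normalize(m, dim=-1)
    logits = m @ t.T / tau                                   # (B, B) cosine similarities / temperature
    targets = torch.arange(t.size(0), device=t.device)
    # Motion-to-text and text-to-motion cross-entropy, matching the two log terms above.
    return F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)
```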

For CCCL, four softmaxed similarity distributions over batch indices are introduced:

$$\begin{aligned} \mathcal{S}_j^{t2m} &= \mathrm{softmax}_i\, s(\mathbf{t}_j, \mathbf{m}_i) \\ \mathcal{S}_j^{m2t} &= \mathrm{softmax}_i\, s(\mathbf{t}_i, \mathbf{m}_j) \\ \mathcal{S}_j^{t2t} &= \mathrm{softmax}_i\, s(\mathbf{t}_i, \mathbf{t}_j) \\ \mathcal{S}_j^{m2m} &= \mathrm{softmax}_i\, s(\mathbf{m}_i, \mathbf{m}_j) \end{aligned}$$
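
A sketch of these four batch-level distributions in PyTorch; the explicit temperature argument is an assumption, since the definitions above are written without one:

```python
import torch
import torch.nn.functional as F

def score_distributions(t: torch.Tensor, m: torch.Tensor, tau: float = 1.0):
    """Row j of each output is a softmax over the batch index i, as defined above.

    t, m: (B, D) text and motion embeddings.
    """
    t = F.normalize(t, dim=-1)
    m = F.normalize(m, dim=-1)
    s_t2m = F.softmax(t @ m.T / tau, dim=1)   # S^{t2m}_j: text j vs. all motions i
    s_m2t = F.softmax(m @ t.T / tau, dim=1)   # S^{m2t}_j: motion j vs. all texts i
    s_t2t = F.softmax(t @ t.T / tau, dim=1)   # S^{t2t}_j: text j vs. all texts i
    s_m2m = F.softmax(m @ m.T / tau, dim=1)   # S^{m2m}_j: motion j vs. all motions i
    return s_t2m, s_m2t, s_t2t, s_m2m
```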

Score distribution consistency is enforced via symmetric KL divergence:

$$\mathrm{SymmKL}(P, Q) = \frac{1}{2}\bigl[\mathrm{KL}(P\,\|\,Q) + \mathrm{KL}(Q\,\|\,P)\bigr]$$
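
A direct helper implementing this, assuming each argument is a probability vector along its last dimension (the epsilon clamp is only a numerical-stability guard):

```python
import torch

def symm_kl(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """SymmKL(P, Q) = 0.5 * (KL(P||Q) + KL(Q||P)), computed along the last dimension."""
    p = p.clamp_min(eps)
    q = q.clamp_min(eps)
    kl_pq = (p * (p / q).log()).sum(dim=-1)
    kl_qp = (q * (q / p).log()).sum(dim=-1)
    return 0.5 * (kl_pq + kl_qp)
```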

Thus,

$$\mathcal{L}_{\rm cross\text{-}to\text{-}t2t} = \frac{1}{B} \sum_{j=1}^{B} \frac{1}{2} \Bigl[ \mathrm{SymmKL}\bigl(\mathcal{S}_j^{t2m}, \mathcal{S}_j^{t2t}\bigr) + \mathrm{SymmKL}\bigl(\mathcal{S}_j^{m2t}, \mathcal{S}_j^{t2t}\bigr) \Bigr]$$

$$\mathcal{L}_{\rm cross\text{-}to\text{-}m2m} = \frac{1}{B} \sum_{j=1}^{B} \frac{1}{2} \Bigl[ \mathrm{SymmKL}\bigl(\mathcal{S}_j^{t2m}, \mathcal{S}_j^{m2m}\bigr) + \mathrm{SymmKL}\bigl(\mathcal{S}_j^{m2t}, \mathcal{S}_j^{m2m}\bigr) \Bigr]$$

Combined cross-to-uni consistency:

$$\mathcal{L}_{\rm cross\text{-}to\text{-}uni} = \mathcal{L}_{\rm cross\text{-}to\text{-}t2t} + \mathcal{L}_{\rm cross\text{-}to\text{-}m2m}$$
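
Putting the two consistency terms together, reusing symm_kl and the (B, B) score distributions from the sketches above:

```python
import torch

def cross_to_uni_loss(s_t2m, s_m2t, s_t2t, s_m2m) -> torch.Tensor:
    """L_cross-to-uni = L_cross-to-t2t + L_cross-to-m2m, averaged over the batch index j."""
    l_t2t = 0.5 * (symm_kl(s_t2m, s_t2t) + symm_kl(s_m2t, s_t2t)).mean()
    l_m2m = 0.5 * (symm_kl(s_t2m, s_m2m) + symm_kl(s_m2t, s_m2m)).mean()
    return l_t2t + l_m2m
```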

Teacher-guided distillation is introduced by generating oracle uni-modal distributions, $\mathcal{S}^{\mathrm{textGT}}$, from a frozen reference encoder (MPNet):

$$\mathcal{L}^{\rm teacher\text{-}to\text{-}t2t} = \mathrm{KL}\bigl(\mathcal{S}^{\mathrm{textGT}} \,\|\, \mathcal{S}^{t2t}\bigr)$$

$$\mathcal{L}^{\rm teacher\text{-}to\text{-}m2m} = \mathrm{KL}\bigl(\mathcal{S}^{\mathrm{textGT}} \,\|\, \mathcal{S}^{m2m}\bigr)$$

$$\mathcal{L}^{\rm teacher\text{-}to\text{-}uni} = \mathcal{L}^{\rm teacher\text{-}to\text{-}t2t} + \mathcal{L}^{\rm teacher\text{-}to\text{-}m2m}$$
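
A sketch of the teacher terms; averaging the per-anchor KL over the batch is an assumption (the formulas above do not spell out the reduction), and the teacher distribution is detached so that no gradients flow into the frozen MPNet encoder:

```python
import torch

def teacher_to_uni_loss(s_text_gt: torch.Tensor,
                        s_t2t: torch.Tensor,
                        s_m2m: torch.Tensor,
                        eps: float = 1e-8) -> torch.Tensor:
    """KL(S^textGT || S^t2t) + KL(S^textGT || S^m2m), averaged over the batch.

    s_text_gt: (B, B) softmaxed text-text similarities from the frozen teacher.
    """
    s_text_gt = s_text_gt.detach().clamp_min(eps)
    kl_t2t = (s_text_gt * (s_text_gt / s_t2t.clamp_min(eps)).log()).sum(dim=-1).mean()
    kl_m2m = (s_text_gt * (s_text_gt / s_m2m.clamp_min(eps)).log()).sum(dim=-1).mean()
    return kl_t2t + kl_m2m
```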

The overall CCCL objective:

$$\mathcal{L} = \mathcal{L}_{\rm nce} + \lambda\,\mathcal{L}_{\rm cross\text{-}to\text{-}uni} + (1-\lambda)\,\mathcal{L}^{\rm teacher\text{-}to\text{-}uni}$$

The annealing parameter $\lambda$ is scheduled linearly from 0 (teacher supervision only) to 1 (self-consistency only), facilitating a transition from teacher guidance to self-distillation.
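
A sketch of the overall objective and the annealing; the epoch bounds $t_{\rm start} = 40$ and $t_{\rm end} = 100$ follow the hyperparameter table in Section 5, while the piecewise-linear shape of the schedule is an assumption based on the description above:

```python
def lambda_schedule(epoch: int, t_start: int = 40, t_end: int = 100) -> float:
    """Linear annealing: 0 (teacher only) before t_start, 1 (self-consistency only) after t_end."""
    if epoch <= t_start:
        return 0.0
    if epoch >= t_end:
        return 1.0
    return (epoch - t_start) / float(t_end - t_start)

def cccl_objective(l_nce, l_cross_to_uni, l_teacher_to_uni, lam: float):
    """L = L_nce + lam * L_cross-to-uni + (1 - lam) * L_teacher-to-uni."""
    return l_nce + lam * l_cross_to_uni + (1.0 - lam) * l_teacher_to_uni
```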

3. Model Integration and Training Pipeline

CCCL integrates into a joint-embedding network comprising:

  • ACTORStyleEncoder for text and MoT++ for motion, both transformer-based and outputting 256-dimensional mean vectors.
  • VAE-style decoder heads for both modalities; only the mean vector $\mu$ is used in CCCL.
  • CCCL is computed per batch in conjunction with a KL regularizer (VAE prior shaping) and a motion reconstruction loss.
  • Optimization via Adam (learning rate $5 \times 10^{-5}$, 250 epochs), with gradient clipping and a learned temperature $\tau$.
  • The teacher model MPNet is frozen and supplies $\mathcal{S}^{\mathrm{textGT}}$.

Motion data is down-sampled to 200 frames using joint grouping in MoT++.
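
A schematic training step combining the pieces above with the helpers from Section 2; the encoder and teacher call signatures are placeholders rather than the paper's actual code, and the VAE KL regularizer and reconstruction loss are only indicated in a comment:

```python
import torch
import torch.nn.functional as F

def training_step(text_encoder, motion_encoder, teacher, batch, optimizer, epoch, tau=0.07):
    """One schematic optimization step; encoders return 256-d mean vectors (mu)."""
    texts, motions = batch                        # raw text strings / motion tensors
    t = text_encoder(texts)                       # (B, 256) text mu
    m = motion_encoder(motions)                   # (B, 256) motion mu

    with torch.no_grad():
        t_teacher = F.normalize(teacher(texts), dim=-1)      # frozen MPNet sentence embeddings
        s_text_gt = F.softmax(t_teacher @ t_teacher.T, dim=1)

    s_t2m, s_m2t, s_t2t, s_m2m = score_distributions(t, m)
    lam = lambda_schedule(epoch)
    loss = cccl_objective(info_nce(t, m, tau),
                          cross_to_uni_loss(s_t2m, s_m2t, s_t2t, s_m2m),
                          teacher_to_uni_loss(s_text_gt, s_t2t, s_m2m),
                          lam)
    # In the full pipeline this is summed with the VAE KL regularizer and the
    # motion reconstruction loss before backpropagation.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```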

4. Empirical Results and Ablation

CCCL was empirically validated on the KIT Motion-Language (KITML) and HumanML3D datasets as well as cross-dataset protocols. Key findings:

  • On KITML, MoT++ with CCCL reached $R_{\mathrm{sum}} = 550.6$ vs. 477.6 for TMR with InfoNCE, an approximately 15% relative gain.
  • On HumanML3D, $R_{\mathrm{sum}}$ improved from approximately 489 to 495, with a 10–20% reduction in median rank.
  • Cross-dataset retrieval (trained on HumanML3D, tested on KITML): $R_{\mathrm{sum}}$ improved from 516.6 to 527.4.
  • Ablations demonstrated that self-consistent CCCL outperformed InfoNCE with filtering even without the teacher; a prolonged teacher schedule degraded performance.
  • Motion-to-motion retrieval mAP improved from 0.778 to approximately 0.785, reflecting enhanced uni-modal geometry.

Limitations include the dependence on a stable teacher such as MPNet, scheduling sensitivity of $\lambda$, and modest computational overhead from the added KL terms.

5. Hyperparameters and Operational Guidance

Table: Notable hyperparameter settings

| Hyperparameter | Typical values | Remarks |
|---|---|---|
| Batch size $B$ | 32–64 | $B = 32$ recommended for gradient stability |
| Temperature $\tau$ | Initialized at 0.07, learned | Learned parameter for the softmax distributions |
| $\lambda$ scheduling | $t_{\rm start} = 40$, $t_{\rm end} = 100$ (of 250 epochs) | Teacher early, self-consistency late |
| Motion sampling | 200 frames | Down-sampling in MoT++ |
| Optimizer | Adam, learning rate $5 \times 10^{-5}$ | Gradient clipping, scheduler warm-up |
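
A minimal optimizer setup consistent with the table; the clipping threshold is an illustrative assumption, since the text mentions gradient clipping but not its value:

```python
import torch

# Stand-in module for the joint encoders; the real model uses ACTORStyleEncoder and MoT++ (Section 3).
model = torch.nn.Linear(256, 256)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

loss = model(torch.randn(32, 256)).pow(2).mean()                   # placeholder loss, batch size B = 32
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # assumed clipping threshold
optimizer.step()
```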

Combining multiple datasets (KITML and HumanML3D) during training is shown to markedly improve model robustness.

6. Generalization and Broader Impact

CCCL generalizes to other cross-modal contrastive setups in low-data regimes. For image–text retrieval, CCCL can enforce image–image and text–text score alignment against the image–text distribution. Adaptations to audio–text, video–audio, or higher-order multi-modal fusion follow the same pattern: define softmaxed cross-modal score distributions, derive the corresponding uni-modal distributions, and enforce alignment via symmetric KL, as in the sketch below. The teacher-to-self annealing is functionally related to curriculum learning and is applicable to other self-distillation approaches.
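
The same construction written modality-agnostically; a hedged sketch reusing the symm_kl helper from Section 2, with the (hypothetical) modality encoders left to the caller:

```python
import torch
import torch.nn.functional as F

def generic_cross_consistency(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Cross-to-uni consistency for any two modalities x, y of shape (B, D)."""
    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    s_x2y = F.softmax(x @ y.T, dim=1)      # cross-modal score distributions
    s_y2x = F.softmax(y @ x.T, dim=1)
    s_x2x = F.softmax(x @ x.T, dim=1)      # uni-modal score distributions
    s_y2y = F.softmax(y @ y.T, dim=1)
    loss_x = 0.5 * (symm_kl(s_x2y, s_x2x) + symm_kl(s_y2x, s_x2x)).mean()
    loss_y = 0.5 * (symm_kl(s_x2y, s_y2y) + symm_kl(s_y2x, s_y2y)).mean()
    return loss_x + loss_y
```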

A plausible implication is stabilization of joint embeddings in multi-modal fusion models by systematically preventing modality collapse and improving retrieval performance across modalities.

7. Limitations and Future Directions

CCCL necessitates a reliable teacher for uni-modal initialization and careful tuning of the annealing schedule. The computational overhead from additional KL terms is modest but non-negligible. Empirical validation establishes robustness, but broader applicability to other modalities relies on analogous score-alignment strategies and stable teacher initialization. Extensions may involve curriculum learning variants of teacher annealing, and investigation into optimal cross-modal–uni-modal score alignment techniques for diverse architectures (Messina et al., 2024).
