EmoCapCLIP: Dual-Branch Emotion Recognition

Updated 16 March 2026
  • The paper introduces a dual-branch model that combines global and local contrastive learning to align facial features with detailed caption semantics.
  • It leverages the EmoCap100K dataset, featuring over 107K images with multi-level captions, significantly enhancing semantic granularity and model supervision.
  • Empirical results demonstrate substantial zero-shot and few-shot performance gains, outperforming FLIP baselines in both static and dynamic emotion recognition tasks.

EmoCapCLIP is a dual-branch contrastive learning framework for facial emotion representation, introduced together with the large-scale EmoCap100K dataset. It uses semantically rich natural language captions as supervisory signals, combining a joint global-local contrastive scheme with a cross-modal guided positive mining module to align facial image regions with multi-level caption semantics. EmoCapCLIP outperforms previous category-based and CLIP-family baselines in zero-shot and few-shot emotion recognition, on both static and dynamic facial expression benchmarks (Sun et al., 28 Jul 2025).

1. EmoCap100K Dataset Construction

The EmoCap100K dataset represents the foundational resource for EmoCapCLIP, capturing 107,134 facial images extracted from over 1,000 movies spanning a broad diversity of genres, poses, and emotional contexts. Each image is paired with a multi-level natural language caption averaging 267 words, which far exceeds the semantic bandwidth of existing facial emotion corpora (e.g., 18-word captions in MAFW). The captions decompose into:

  • Global sentence: A summary of the overall affective state (e.g., “A woman appears nervously apprehensive with furrowed brows and pursed lips”).
  • Local sentences: Each describes a distinctive facial region or action unit (e.g., “Her eyebrows draw together in a tight frown”).
  • Summary sentence: An integrative statement fusing global and local affective cues.
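
The three caption levels above can be pictured as a small record type. This is only an illustrative sketch: the `EmoCaption` class and its field names are hypothetical and do not reflect the dataset's actual file format. The example sentences are the ones quoted in the list, and the summary sentence is left empty because no example is given.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EmoCaption:
    """One multi-level caption (hypothetical container, not the dataset's schema)."""
    global_sentence: str
    local_sentences: List[str] = field(default_factory=list)
    summary_sentence: str = ""

    def all_sentences(self) -> List[str]:
        """Return global, local, then summary sentences, skipping empty ones."""
        parts = [self.global_sentence, *self.local_sentences, self.summary_sentence]
        return [s for s in parts if s]


caption = EmoCaption(
    global_sentence="A woman appears nervously apprehensive with furrowed brows and pursed lips",
    local_sentences=["Her eyebrows draw together in a tight frown"],
)
```

Keeping the levels in separate fields makes it straightforward to route the global sentence to one branch of a model and the local sentences to another.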

Dataset annotation harnesses Gemini-1.5-Flash for structured prompt-based caption generation. This automated process covers over 703 unique emotion-descriptive terms, including nuanced compound emotions (“anxiously excited”, “bitterly disappointed”). The resultant corpus offers unprecedented semantic diversity and granularity for facial affect supervision (Sun et al., 28 Jul 2025).

2. Dual-Branch Model Architecture

EmoCapCLIP adopts a CLIP-style dual-encoder architecture. Key architectural features include:

  • Image Encoder: Vision Transformer (ViT), producing patch-level embeddings $\mathbf{R}_I \in \mathbb{R}^{P\times D}$ and a global facial embedding $\mathbf{g}_I \in \mathbb{R}^D$ (from the [CLS] token).
  • Text Encoder: Transformer architecture, yielding global caption embeddings $\mathbf{g}_T \in \mathbb{R}^D$ and local embedding set $\{\mathbf{r}_T^{j}\}$ for each local sentence.

EmoCapCLIP bifurcates into two contrastive branches:

  • Global Branch: Aligns image-level (global) embeddings with global/summarized text.
  • Local Branch: For each of $M$ local descriptors per image, aligns local sentence embedding $\mathbf{r}_T^{j}$ with a regionally pooled image embedding:

$$\widehat{\mathbf{r}_I^{j}} = \mathrm{softmax}\left((\mathbf{r}_T^{j}W^Q)(\mathbf{R}_I W^K)^\top/\sqrt{D}\right)(\mathbf{R}_I W^V)$$

where cross-attention incorporates the local text as a query over the image patches (Sun et al., 28 Jul 2025).
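
As a rough illustration of the text-queried pooling formula, the following pure-Python sketch computes attention weights over image patches from one local sentence embedding. The function names (`matmul`, `softmax`, `text_guided_pool`) are hypothetical, and the single-head, unbatched form is an assumption made for brevity; it is not the authors' implementation.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def text_guided_pool(r_t, R_i, W_q, W_k, W_v):
    """Pool P patch embeddings (P x D) into one D-vector, using r_t (length D) as query."""
    d = len(r_t)
    q = matmul([r_t], W_q)[0]   # projected text query, length D
    K = matmul(R_i, W_k)        # projected patch keys, P x D
    V = matmul(R_i, W_v)        # projected patch values, P x D
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
    weights = softmax(scores)   # one attention weight per patch
    pooled = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(d)]
    return pooled, weights


# Toy check with identity projections: the patch most similar to the query dominates.
I2 = [[1.0, 0.0], [0.0, 1.0]]
pooled, weights = text_guided_pool([10.0, 0.0], I2, I2, I2, I2)
```

With identity projections the pooled vector is simply the attention-weighted mix of the patches, which makes the role of the text query easy to inspect.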

3. Joint Global–Local Contrastive Learning

The training objective is a combination of global and local InfoNCE losses, with further refinement through inter- and intra-sample local contrastive terms:

  • Global InfoNCE: Symmetric contrastive loss aligns global image and text vectors:

$$\mathcal{L}_{\mathrm{global}} = -\frac{1}{N}\sum_{i=1}^N\left[ \log\frac{\exp(\cos(\mathbf{g}_I^i,\mathbf{g}_T^i)/\tau)}{\sum_{n=1}^N\exp(\cos(\mathbf{g}_I^i,\mathbf{g}_T^n)/\tau)} + \log\frac{\exp(\cos(\mathbf{g}_T^i,\mathbf{g}_I^i)/\tau)}{\sum_{n=1}^N\exp(\cos(\mathbf{g}_T^i,\mathbf{g}_I^n)/\tau)} \right]$$
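
A minimal sketch of this symmetric loss in plain Python may help make the two directions concrete. The function names are hypothetical, embeddings are plain lists of floats, and the fixed `tau=0.07` stands in for the learnable temperature; this is an illustration of the formula, not the released training code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def log_softmax_at(logits, idx):
    """Log of the softmax probability assigned to position idx."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[idx] - lse


def global_info_nce(img, txt, tau=0.07):
    """Symmetric InfoNCE over N paired global image/text embeddings."""
    n = len(img)
    sim = [[cosine(img[i], txt[j]) / tau for j in range(n)] for i in range(n)]
    total = 0.0
    for i in range(n):
        image_to_text = log_softmax_at(sim[i], i)
        text_to_image = log_softmax_at([sim[j][i] for j in range(n)], i)
        total += image_to_text + text_to_image
    return -total / n


aligned = global_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = global_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch, correctly paired embeddings give a loss near zero, while swapping the captions gives a much larger loss.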

  • Local InfoNCE, at two levels:
    • Intra-sample: Uses $M$ local sentences/regions within an image as negatives.
    • Inter-sample: Uses corresponding local features across images.

The total loss includes weighted global and local terms with a learnable temperature $\tau$:

$$\mathcal{L}_{\mathrm{overall}} = \mathcal{L}_g + \alpha \left(\mathcal{L}_r^{\mathrm{intra}} + \mathcal{L}_r^{\mathrm{inter}}\right)$$

where $\mathcal{L}_g$ denotes the global loss $\mathcal{L}_{\mathrm{global}}$ defined above, $\mathcal{L}_r^{\mathrm{intra}}$ and $\mathcal{L}_r^{\mathrm{inter}}$ are the intra- and inter-sample local losses, and $\alpha$ controls the weight of the local branch (Sun et al., 28 Jul 2025).
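
The local terms and their combination with the global loss can be sketched as follows. Several things here are assumptions rather than details from the source: only the region-to-text direction is shown, the losses are averaged over all local pairs, and the function names and the `alpha` value in the example are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce_row(sims, pos, tau):
    """Negative log softmax probability of the positive at index pos."""
    logits = [s / tau for s in sims]
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[pos]


def local_losses(reg, txt, tau=0.07):
    """reg[i][j], txt[i][j]: pooled region / local sentence embedding j of sample i."""
    n, m = len(reg), len(reg[0])
    intra = inter = 0.0
    for i in range(n):
        for j in range(m):
            # Intra-sample: the other local sentences of the same image are negatives.
            intra += info_nce_row([cosine(reg[i][j], txt[i][k]) for k in range(m)], j, tau)
            # Inter-sample: the j-th local sentences of other images are negatives.
            inter += info_nce_row([cosine(reg[i][j], txt[k][j]) for k in range(n)], i, tau)
    return intra / (n * m), inter / (n * m)


def overall_loss(l_global, l_intra, l_inter, alpha):
    """Weighted sum of the global loss and the two local losses."""
    return l_global + alpha * (l_intra + l_inter)


# Toy batch: 2 samples, 2 local descriptors each, all mutually orthogonal.
emb = [[[1.0, 0, 0, 0], [0, 1.0, 0, 0]], [[0, 0, 1.0, 0], [0, 0, 0, 1.0]]]
l_intra, l_inter = local_losses(emb, emb)
```

For example, `overall_loss(1.0, 0.5, 0.25, alpha=2.0)` evaluates to `2.5`, since the local terms sum to `0.75` before weighting.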

4. Cross-Modal Guided Positive Mining (CMGPM)

Semantic heterogeneity across facial expressions implies that a rigid pairing of only strictly identical captions/faces as positives is insufficient: different textual descriptions (e.g., “joyful grin” vs. “broad smile”) may both aptly describe the same affective state. EmoCapCLIP addresses this with CMGPM:

  • For each image anchor, computes cosine similarities to all text encodings in the batch.
  • Determines positive sets $\mathcal{P}_T^i$ via threshold $\sigma$ ($s_p > \sigma$) and top-$K$ selection.
  • Weights each additional positive by similarity: $\lambda_T^{i,p} = s_p$.
  • Extends the InfoNCE losses, symmetrized for cross-modal mining.

After $t_0$ epochs (e.g., $t_0=10$), CMGPM is incorporated, ensuring robust alignment across semantically related descriptions and visual cues (Sun et al., 28 Jul 2025).
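
The mining step for a single image anchor might look like the sketch below. It is written under stated assumptions: the threshold and top-$K$ filters are applied together (candidates above $\sigma$, then at most $K$ of the most similar), the anchor's own caption is excluded because it is already the primary positive, and epochs are counted from zero. The names `mine_positives` and `cmgpm_active` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mine_positives(img, txts, anchor, sigma=0.8, top_k=5):
    """Return {text_index: weight} of additional positives for image `anchor`.

    A text qualifies if its similarity s_p exceeds sigma; at most top_k of the
    most similar texts are kept, and each weight equals its similarity s_p.
    """
    scored = [(cosine(img, t), p) for p, t in enumerate(txts) if p != anchor]
    kept = sorted((sp for sp in scored if sp[0] > sigma), reverse=True)[:top_k]
    return {p: s for s, p in kept}


def cmgpm_active(epoch, t0=10):
    """Enable mining only once t0 warm-up epochs have passed (0-indexed epochs assumed)."""
    return epoch >= t0


texts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]]
extra = mine_positives([1.0, 0.0], texts, anchor=0)
```

Here the texts at indices 1 and 3 are close to the anchor image and are returned as weighted positives, while the orthogonal text at index 2 is left as a negative.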

5. Training Protocol and Hyperparameters

  • Encoders: ViT-B/32 or ViT-L/14 (149M–428M parameters).
  • Optimization: AdamW with $\beta_1=0.9$, $\beta_2=0.999$, weight decay $0.05$, batch size $256$, initial learning rate $1\times 10^{-4}$ with linear decay over $30$ epochs.
  • Positive mining: $M=3$ local sentences per image, CMGPM threshold $\sigma=0.8$, top-$K$ with $K=5$.
  • Temperature: $\tau$ initialized to $0.07$, optimized end-to-end (Sun et al., 28 Jul 2025).
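
For reference, the listed hyperparameters can be gathered into one configuration object with a simple schedule helper. The `TrainConfig` class and `linear_decay_lr` function are hypothetical names, and the schedule assumes the learning rate is updated once per epoch and decays linearly to zero, which the source does not specify.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as listed above (container and field names are illustrative)."""
    lr: float = 1e-4
    betas: Tuple[float, float] = (0.9, 0.999)
    weight_decay: float = 0.05
    batch_size: int = 256
    epochs: int = 30
    num_local: int = 3      # M local sentences per image
    sigma: float = 0.8      # CMGPM similarity threshold
    top_k: int = 5          # CMGPM top-K
    tau_init: float = 0.07  # initial temperature


def linear_decay_lr(cfg: TrainConfig, epoch: int) -> float:
    """Learning rate at the start of `epoch` (0-indexed), assuming decay to zero."""
    return cfg.lr * max(0.0, 1.0 - epoch / cfg.epochs)


cfg = TrainConfig()
halfway = linear_decay_lr(cfg, 15)
```

Under these assumptions the rate halves by epoch 15 and reaches zero at epoch 30.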

6. Empirical Evaluation and Ablation

EmoCapCLIP demonstrates substantial improvements over both FLIP and Exp-CLIP on a broad suite of benchmarks:

| Benchmark | Baseline WAR (%) | EmoCapCLIP WAR (%) | Absolute Gain |
|---|---|---|---|
| RAF-DB | 49.3 (FLIP) | 67.0 | +17.7 |
| AffectNet-7 | 38.3 (FLIP) | 51.9 | +13.6 |
| FERPlus | 44.8 (FLIP) | 55.1 | +10.3 |

  • Zero-shot static FER: Absolute improvement of $+17.7\%$ WAR on RAF-DB; also large gains on AffectNet-7/8 and FERPlus.
  • Zero-shot dynamic FER: Outperforms all image/video CLIP baselines on datasets such as DFEW (42.2% vs 27.5%) and CREMA-D (48.8% vs 24.9%).
  • Few-shot FER: Excels in $k$-shot settings ($k=1,\dots,16$), with especially prominent gains in low-$k$ scenarios.
  • Cross-dataset generalization: Zero-shot accuracy matches or exceeds fully supervised models on datasets including AffectNet-7 and SFEW2.0.
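
To make the zero-shot protocol and the reported metric concrete, here is a small sketch. It assumes the usual CLIP-style procedure of picking the class whose text embedding is most similar to the image embedding, and it assumes WAR denotes weighted average recall, i.e. per-class recall weighted by class frequency, which coincides with overall accuracy. The function names and toy data are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def zero_shot_predict(img_emb, class_embs):
    """Return the index of the class text embedding closest to the image embedding."""
    sims = [cosine(img_emb, c) for c in class_embs]
    return max(range(len(sims)), key=sims.__getitem__)


def weighted_average_recall(y_true, y_pred):
    """Per-class recall weighted by class support (equals overall accuracy)."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum((hits[c] / n) * n for c, n in support.items()) / len(y_true)


class_embs = [[1.0, 0.0], [0.0, 1.0]]  # toy stand-ins for two class prompts
pred = zero_shot_predict([0.2, 0.9], class_embs)
war = weighted_average_recall([0, 0, 0, 1], [0, 0, 1, 1])
```

In the toy labels, three of four predictions are correct, so the weighted average recall is 0.75.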

Ablation reveals that each architectural element, including local contrast and CMGPM, incrementally improves accuracy, with local inter-sample mining and CMGPM providing key advances in semantic discrimination (Sun et al., 28 Jul 2025).

7. Analysis, Limitations, and Prospective Extensions

EmoCapCLIP substantiates the efficacy of richly structured, multi-level natural language supervision for capturing nuanced facial affect representations, enabling robust zero/few-shot generalization. The combined global-local architecture and semantic positive mining ensure alignment of both holistic and region-specific emotional features.

Key limitations include:

  • Dependence on a proprietary MLLM (Gemini-1.5-Flash) for captioning, which may impact reproducibility.
  • Local pooling granularity: The current local branch uses whole-patch pooling; direct attention to facial subregions (e.g., eyes, mouth) could further sharpen alignment.
  • Temporal modeling: Frame pooling suffices for videos, but explicit temporal encoding may benefit dynamic micro-expression recognition.
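
The frame pooling mentioned in the last limitation could be as simple as averaging per-frame embeddings into one clip-level vector. The sketch below assumes mean pooling over L2-normalized frame embeddings; the pooling operator actually used is not stated here, and the function names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_pool(frames):
    """Average a list of equal-length frame embeddings element-wise."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def video_embedding(frames):
    """Clip-level embedding: normalize each frame, average, then re-normalize."""
    return l2_normalize(mean_pool([l2_normalize(f) for f in frames]))


clip = video_embedding([[2.0, 0.0], [0.0, 2.0]])
```

Because the average discards frame order, a representation built this way cannot distinguish an expression's onset from its offset, which is the motivation for the explicit temporal encoding suggested above.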

Future directions include leveraging open-source captioners to broaden access, refining region-specific pooling, and integrating temporal modules for dynamic affect tracking. The approach also suggests further exploration of semantic transfer for continuous emotion regression and multi-modal affect analysis (Sun et al., 28 Jul 2025).


For implementation resources, code and dataset access are available at https://github.com/sunlicai/EmoCapCLIP (Sun et al., 28 Jul 2025).
