GSE: General Sticker Encoder
- General Sticker Encoder (GSE) is a vision-only model built around a formal definition of sticker semantic similarity as binary classification over the cosine similarity of sticker embeddings.
- It leverages foundation vision models with multi-source training and a human-validated Triple-S benchmark to improve sticker retrieval and emotion classification.
- GSE’s lightweight architecture built on a CLIP backbone and MLP projection outperforms baselines with higher accuracy, AUC, and retrieval metrics.
Stickers, widely employed as concise visual artifacts in digital communication, pose significant challenges for computational semantics due to their diversity, symbolism, and context dependence. The General Sticker Encoder (GSE) offers a formal, lightweight approach for embedding and evaluating semantic relations among stickers, grounded in the rigorous definition of sticker semantic similarity and deployed with a novel benchmark (Triple-S). By leveraging foundation vision models and multi-source training, GSE sets a new standard for sticker understanding, retrieval, and multimodal content generation.
1. Formal Definition of Sticker Semantic Similarity
Sticker semantic similarity is rigorously defined as a binary classification task operating on the raw pixel representations of a sticker pair $(I_1, I_2)$, producing a label $y \in \{0, 1\}$ that signifies semantic equivalence or distinction. The core mechanism involves an embedding model $f_\theta$, which maps each input to a $d$-dimensional semantic space. Similarity is quantified by cosine similarity:

$$s(I_1, I_2) = \frac{f_\theta(I_1) \cdot f_\theta(I_2)}{\lVert f_\theta(I_1) \rVert \, \lVert f_\theta(I_2) \rVert}$$
A binary prediction is subsequently derived by thresholding at $\tau$: $\hat{y} = 1$ if $s(I_1, I_2) \geq \tau$, and $0$ otherwise.
This formalization captures nuanced semantic phenomena, transcending superficial pixel similarity and incorporating elements such as emotion, expressive intent, and action. The approach is compatible with scalable embedding architectures and enables standardized downstream evaluation.
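As a concrete (unofficial) sketch of this definition, any image embedding model can play the role of $f_\theta$; below, a CLIP vision encoder from the open-source `transformers` library stands in for the embedding model, and the threshold value is an illustrative placeholder:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumption: a generic CLIP checkpoint stands in for the embedding model f_theta.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(image: Image.Image) -> torch.Tensor:
    """Map a sticker image to a d-dimensional embedding."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        return model.get_image_features(**inputs).squeeze(0)

def sticker_similarity(img_a: Image.Image, img_b: Image.Image) -> float:
    """Cosine similarity s(I1, I2) between two sticker embeddings."""
    z_a, z_b = embed(img_a), embed(img_b)
    return torch.nn.functional.cosine_similarity(z_a, z_b, dim=0).item()

def same_semantics(img_a: Image.Image, img_b: Image.Image, tau: float = 0.8) -> int:
    """Binary prediction: 1 if s(I1, I2) >= tau, else 0 (tau here is illustrative)."""
    return int(sticker_similarity(img_a, img_b) >= tau)
```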
2. Construction and Properties of the Triple-S Benchmark
Triple-S is introduced as the first human-validated benchmark addressing sticker semantic similarity. The annotation pipeline starts from a base pool of 1,116 stickers (StickerQueries), and annotators perform two separate annotation rounds: for each anchor sticker $q$, 20 candidate stickers are assessed for semantic similarity, yielding semantic sets $S_1(q)$ and $S_2(q)$.
Pair classification follows:
- Positive pairs: sticker pairs that co-occur in at least two semantic sets.
- Negative pairs: sticker pairs that appear jointly in candidate sets but in no semantic set, with additional filters that restrict textual overlap and enforce low similarity between the associated queries.
All automatically generated pairs are subjected to manual validation for correctness.
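A minimal sketch of this pair-construction logic as set operations over the per-anchor annotations is shown below; the data structure and the omission of the textual-overlap and query-similarity filters are assumptions, not the authors' exact pipeline:

```python
def build_pairs(annotations: dict) -> tuple[set, set]:
    """annotations maps each anchor sticker id to a dict with:
       'candidates': the 20 candidate sticker ids shown to annotators,
       'semantic':   two sets of ids judged similar, one per annotation round."""
    positives, negatives = set(), set()
    for anchor, ann in annotations.items():
        round1, round2 = ann["semantic"]
        # Positive pair: candidate marked semantically similar in both rounds.
        for cand in round1 & round2:
            positives.add(tuple(sorted((anchor, cand))))
        # Negative pair: shown as a candidate but never marked similar.
        # (The benchmark further filters by textual overlap and query
        #  similarity, and all pairs are manually validated; omitted here.)
        for cand in set(ann["candidates"]) - (round1 | round2):
            negatives.add(tuple(sorted((anchor, cand))))
    return positives, negatives
```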
Benchmark statistics:
| Split | Positive Pairs | Negative Pairs | Unique Stickers |
|---|---|---|---|
| Train | 394 | 371 | 477 |
| Test | 62 | 78 | 153 |
| Total | 453 | 449 | 630 |
With 905 annotated pairs contributed by a balanced pool of 49 annotators, Triple-S captures fine-grained distinctions, including subtle emotion/action matches and hard negative cases. This resource is pivotal for evaluating models' semantic generalization in a controlled, task-specific regime.
3. GSE Model Architecture and Training Protocol
GSE’s architecture is anchored to the CLIP vision backbone (e.g., ViT-B/32), generating 512-dimensional visual feature representations. These are transformed by a single-layer MLP projection head that preserves the dimensionality (512 → 512), favoring a lightweight, exclusively vision-based approach.
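A minimal PyTorch sketch of this architecture, assuming the CLIP ViT-B/32 vision tower from `transformers` and a single linear projection layer (the exact head configuration beyond the stated 512 → 512 mapping is an assumption):

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModelWithProjection

class GSESketch(nn.Module):
    """Vision-only encoder: CLIP ViT-B/32 backbone + single-layer MLP projection head."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.backbone = CLIPVisionModelWithProjection.from_pretrained(
            "openai/clip-vit-base-patch32"
        )
        # Single-layer projection preserving the 512-dimensional feature space.
        self.proj = nn.Linear(dim, dim)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(pixel_values=pixel_values).image_embeds  # (B, 512)
        z = self.proj(feats)
        return nn.functional.normalize(z, dim=-1)  # unit-norm sticker embeddings
```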
Training Regime
- Datasets:
- Triple-S: 905 human-curated pairs
- MultiChat: 603,351 positive/75,855 negative sticker–dialogue pairs annotated by intention
- Aggregated: 604,116 training pairs, 75,995 validation pairs
- Preprocessing: Resize to the model input resolution (224 × 224 for standard CLIP ViT-B/32), apply random crop, horizontal flip, and color jitter, and normalize by ImageNet mean/std.
- Optimization: The CLIP vision encoder is fine-tuned (the text encoder is invoked only for loss calculation, not at inference). AdamW optimizer (learning rate and weight decay selected by grid search against validation), batch size 32, 5 epochs on a single NVIDIA A100 GPU; see the sketch after this list.
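The preprocessing and optimization setup can be sketched as follows, reusing the `GSESketch` module from the architecture sketch above; the resize/crop sizes, jitter strengths, learning rate, and weight decay are illustrative placeholders rather than the paper's grid-searched values:

```python
import torch
from torchvision import transforms

# Training-time augmentation pipeline (parameter values are illustrative).
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics,
                         std=[0.229, 0.224, 0.225]),   # as stated above
])

model = GSESketch()  # architecture sketch defined earlier
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-5,            # placeholder; selected by grid search in the paper
    weight_decay=1e-2,  # placeholder; selected by grid search in the paper
)
```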
Objective Functions
Two losses are considered (both are sketched in code at the end of this subsection):
- Image–text contrastive loss (InfoNCE), computed over a batch of $N$ sticker–utterance pairs:

$$\mathcal{L}_{\mathrm{ITC}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\!\left(\mathrm{sim}(v_i, t_i)/T\right)}{\sum_{j=1}^{N} \exp\!\left(\mathrm{sim}(v_i, t_j)/T\right)}$$

where $v_i$ is the visual embedding of sticker $i$, $t_i$ denotes the CLIP text encoder output for that sticker's utterance, and $T$ is a temperature parameter.
- Binary cross-entropy loss (optional at test time), applied to the pair label $y \in \{0, 1\}$ with the similarity score passed through a sigmoid:

$$\mathcal{L}_{\mathrm{BCE}} = -\left[\, y \log \sigma\!\left(s(I_1, I_2)\right) + (1 - y) \log\!\left(1 - \sigma\!\left(s(I_1, I_2)\right)\right) \right]$$
At inference, the system operates purely on the visual encoder and similarity thresholding; no further loss computation is necessary.
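A hedged sketch of both objectives in PyTorch, assuming batched, L2-normalized image embeddings `v`, text embeddings `t`, raw pair scores `s`, and labels `y` (names and the temperature value are illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(v: torch.Tensor, t: torch.Tensor, temperature: float = 0.07):
    """Image-text contrastive loss over a batch of matched (sticker, utterance) pairs.
    v, t: (N, d) L2-normalized image and text embeddings."""
    logits = v @ t.T / temperature           # (N, N) cosine-similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(logits, targets)  # diagonal entries are the positives

def bce_pair_loss(s: torch.Tensor, y: torch.Tensor):
    """Binary cross-entropy on raw pair scores s (N,) and labels y in {0,1};
    the with_logits form applies the sigmoid internally."""
    return F.binary_cross_entropy_with_logits(s, y.float())
```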
4. Evaluation Paradigm and Empirical Results
GSE performance is scrutinized across three protocol axes: semantic similarity (Triple-S, WXChallenge), emotion classification (SER30K, MET-Meme), and sticker-to-sticker retrieval (SER30K, WXChallenge).
- Triple-S/WXChallenge (semantic similarity): Metrics include Accuracy, Precision, Recall, F1, ROC AUC.
- SER30K, MET-Meme (emotion classification): Nearest-prototype assignment via cosine similarity; metrics are Accuracy, Precision, Recall, F1.
- SER30K, WXChallenge (retrieval): Recall@K at multiple cutoffs (including K = 5, reported below).
Comparative metrics on WXChallenge semantic similarity (select examples):
| Model | Accuracy | AUC | Recall | F1 | Precision |
|---|---|---|---|---|---|
| CLIP | 0.607 | 0.651 | 0.668 | 0.496 | 0.394 |
| ViT | 0.325 | 0.556 | 0.971 | 0.454 | 0.297 |
| GSE | 0.665 | 0.706 | 0.642 | 0.526 | 0.446 |
- Downstream emotion classification (SER30K, zero-shot): GSE achieves 31.69% accuracy vs. best pretrained at 21.39%.
- Sticker-to-sticker retrieval (WXChallenge Recall@5): CLIP 0.371 vs. GSE 0.543, with relative improvements of 38–54% across retrieval settings.
This suggests that GSE's adaptation to sticker-specific semantics directly enhances both similarity classification and downstream retrieval/navigation tasks, surpassing previous foundations by substantial margins across several metrics.
5. Downstream Tasks, Ablations, and Implementation
GSE’s feature representations generalize to emotion classification and sticker retrieval. For emotion classification, a nearest-prototype approach is used over seven categories (e.g., joy, anger, sadness), demonstrating strong performance both in zero-shot and with trained multimodal heads (MGHFT+GSE).
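The nearest-prototype scheme can be sketched as follows; the full emotion list and the use of mean embeddings as class prototypes are assumptions beyond the three categories named above:

```python
import torch
import torch.nn.functional as F

# Assumption: seven emotion categories; only joy, anger, sadness are named in the text.
EMOTIONS = ["joy", "anger", "sadness", "fear", "surprise", "disgust", "neutral"]

def build_prototypes(embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Average the embeddings of each emotion class into a unit-norm prototype."""
    protos = torch.stack([embeddings[labels == c].mean(dim=0)
                          for c in range(len(EMOTIONS))])
    return F.normalize(protos, dim=-1)

def classify(query: torch.Tensor, prototypes: torch.Tensor) -> int:
    """Assign the emotion whose prototype has the highest cosine similarity."""
    sims = F.normalize(query, dim=-1) @ prototypes.T
    return int(sims.argmax())
```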
Sticker-to-sticker retrieval involves grouping by identical textual query or emotion label and ranking gallery stickers by cosine similarity. GSE achieves notably higher recall across both compact (SER30K) and large-scale (WXChallenge) datasets.
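A sketch of the retrieval protocol and Recall@K metric under the stated grouping; query/gallery embeddings and per-query relevance sets (same textual query or emotion label) are assumed to be given:

```python
import torch
import torch.nn.functional as F

def recall_at_k(queries: torch.Tensor, gallery: torch.Tensor,
                relevant: list[set[int]], k: int = 5) -> float:
    """Fraction of queries whose top-k cosine-similar gallery stickers
    contain at least one relevant item."""
    q = F.normalize(queries, dim=-1)
    g = F.normalize(gallery, dim=-1)
    topk = (q @ g.T).topk(k, dim=-1).indices          # (num_queries, k) indices
    hits = [bool(set(row.tolist()) & rel) for row, rel in zip(topk, relevant)]
    return sum(hits) / len(hits)
```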
Ablation Findings
Ablation studies emphasize contributions from both data sources:
- CLIP baseline (no fine-tuning): WXChallenge Acc 60.66 / F1 49.58 / Prec 39.42 / AUC 65.05.
- +MultiChat only: incremental improvement.
- +Triple-S only: moderate precision gain.
- Combined (GSE): 66.51 Acc / 52.60 F1 / 44.57 Prec / 70.61 AUC, outperforming all component and baseline models.
As noted, MultiChat provides scale whereas Triple-S delivers annotation quality, and their union optimally balances precision and recall.
Implementation
Training is feasible on a single NVIDIA A100 GPU. The CLIP image encoder is fine-tuned for 5 epochs; the threshold for similarity is optimized for F1 on validation data. The codebase, pretrained GSE weights, and all associated benchmarks are publicly available at https://anonymous.4open.science/r/triple-s-6E65/.
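The F1-optimal threshold selection mentioned above can be sketched as a simple sweep over candidate values on validation similarities (a minimal sketch, not the released code):

```python
import numpy as np
from sklearn.metrics import f1_score

def select_threshold(val_scores: np.ndarray, val_labels: np.ndarray) -> float:
    """Pick the similarity threshold tau that maximizes F1 on the validation split."""
    candidates = np.linspace(val_scores.min(), val_scores.max(), 200)
    f1s = [f1_score(val_labels, (val_scores >= t).astype(int)) for t in candidates]
    return float(candidates[int(np.argmax(f1s))])
```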
6. Significance and Future Directions
By both formalizing the sticker semantic similarity task and generating a benchmark with rich human annotation, GSE establishes the first rigorous framework for sticker understanding. The architecture—vision-only, CLIP-based, minimal in complexity—offers rapid adaptation for both retrieval and multimodal generation tasks. Empirical evidence attests to strong zero-shot generalization and state-of-the-art performance in emotion classification and semantic retrieval.
A plausible implication is that future advances in sticker semantics will require continued refinement of annotated resources (expanding or diversifying Triple-S), exploration of more sophisticated multimodal objectives, and adaptation to evolving sticker “languages” across platforms. This foundational work provides standardized tools and baselines for subsequent research and industry deployments in sticker-based communication systems.