GSE: General Sticker Encoder
- General Sticker Encoder (GSE) is a vision-only model built around a formal definition of sticker semantic similarity as binary classification over the cosine similarity of sticker embeddings.
- It leverages foundation vision models with multi-source training and a human-validated Triple-S benchmark to improve sticker retrieval and emotion classification.
- GSE’s lightweight architecture built on a CLIP backbone and MLP projection outperforms baselines with higher accuracy, AUC, and retrieval metrics.
Stickers, widely employed as concise visual artifacts in digital communication, pose significant challenges for computational semantics due to their diversity, symbolism, and context dependence. The General Sticker Encoder (GSE) offers a formal, lightweight approach for embedding and evaluating semantic relations among stickers, grounded in the rigorous definition of sticker semantic similarity and deployed with a novel benchmark (Triple-S). By leveraging foundation vision models and multi-source training, GSE sets a new standard for sticker understanding, retrieval, and multimodal content generation.
1. Formal Definition of Sticker Semantic Similarity
Sticker semantic similarity is rigorously defined as a binary classification task operating on the raw pixel representations of a sticker pair $(I_1, I_2)$, producing a label $y \in \{0, 1\}$ that signifies semantic equivalence or distinction. The core mechanism involves an embedding model $f_\theta$, which maps each input to a $d$-dimensional semantic space. Similarity is quantified by cosine similarity:

$$s(I_1, I_2) = \frac{f_\theta(I_1) \cdot f_\theta(I_2)}{\lVert f_\theta(I_1) \rVert \, \lVert f_\theta(I_2) \rVert}$$
A binary prediction is subsequently derived by thresholding at $\tau$: $\hat{y} = 1$ if $s(I_1, I_2) \geq \tau$, and $0$ otherwise.
This formalization captures nuanced semantic phenomena, transcending superficial pixel similarity and incorporating elements such as emotion, expressive intent, and action. The approach is compatible with scalable embedding architectures and enables standardized downstream evaluation.
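As a concrete (unofficial) sketch of this definition, any image embedding model can play the role of $f_\theta$; below, a CLIP vision encoder from the open-source `transformers` library stands in for the embedding model, and the threshold value is an illustrative placeholder:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumption: a generic CLIP checkpoint stands in for the embedding model f_theta.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(image: Image.Image) -> torch.Tensor:
    """Map a sticker image to a d-dimensional embedding."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        return model.get_image_features(**inputs).squeeze(0)

def sticker_similarity(img_a: Image.Image, img_b: Image.Image) -> float:
    """Cosine similarity s(I1, I2) between two sticker embeddings."""
    z_a, z_b = embed(img_a), embed(img_b)
    return torch.nn.functional.cosine_similarity(z_a, z_b, dim=0).item()

def same_semantics(img_a: Image.Image, img_b: Image.Image, tau: float = 0.8) -> int:
    """Binary prediction: 1 if s(I1, I2) >= tau, else 0 (tau here is illustrative)."""
    return int(sticker_similarity(img_a, img_b) >= tau)
```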
2. Construction and Properties of the Triple-S Benchmark
Triple-S is introduced as the first human-validated benchmark addressing sticker semantic similarity. The annotation pipeline starts from a base pool of 1,116 stickers (StickerQueries), and annotators perform two separate annotation rounds: for each anchor sticker $q$, 20 candidate stickers are assessed for semantic similarity, yielding semantic sets $S_1(q)$ and $S_2(q)$.
Pair classification follows:
- Positive pairs: sticker pairs that co-occur in at least two semantic sets.
- Negative pairs: sticker pairs that appear jointly in candidate sets but in no semantic set, with additional filters that restrict textual overlap and enforce low similarity between the associated queries.
All automatically generated pairs are subjected to manual validation for correctness.
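A minimal sketch of this pair-construction logic as set operations over the per-anchor annotations is shown below; the data structure and the omission of the textual-overlap and query-similarity filters are assumptions, not the authors' exact pipeline:

```python
def build_pairs(annotations: dict) -> tuple[set, set]:
    """annotations maps each anchor sticker id to a dict with:
       'candidates': the 20 candidate sticker ids shown to annotators,
       'semantic':   two sets of ids judged similar, one per annotation round."""
    positives, negatives = set(), set()
    for anchor, ann in annotations.items():
        round1, round2 = ann["semantic"]
        # Positive pair: candidate marked semantically similar in both rounds.
        for cand in round1 & round2:
            positives.add(tuple(sorted((anchor, cand))))
        # Negative pair: shown as a candidate but never marked similar.
        # (The benchmark further filters by textual overlap and query
        #  similarity, and all pairs are manually validated; omitted here.)
        for cand in set(ann["candidates"]) - (round1 | round2):
            negatives.add(tuple(sorted((anchor, cand))))
    return positives, negatives
```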
Benchmark statistics:
| Split | Positive Pairs | Negative Pairs | Unique Stickers |
|---|---|---|---|
| Train | 394 | 371 | 477 |
| Test | 62 | 78 | 153 |
| Total | 453 | 449 | 630 |
With 905 annotated pairs contributed by a balanced pool of 49 annotators, Triple-S captures fine-grained distinctions, including subtle emotion/action matches and hard negative cases. This resource is pivotal for evaluating models' semantic generalization in a controlled, task-specific regime.
3. GSE Model Architecture and Training Protocol
GSE’s architecture is anchored to the CLIP vision backbone (e.g., ViT-B/32), generating 512-dimensional visual feature representations. These are transformed by a single-layer MLP projection head that preserves the dimensionality (512 → 512), favoring a lightweight, exclusively vision-based approach.
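A minimal PyTorch sketch of this architecture, assuming the CLIP ViT-B/32 vision tower from `transformers` and a single linear projection layer (the exact head configuration beyond the stated 512 → 512 mapping is an assumption):

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModelWithProjection

class GSESketch(nn.Module):
    """Vision-only encoder: CLIP ViT-B/32 backbone + single-layer MLP projection head."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.backbone = CLIPVisionModelWithProjection.from_pretrained(
            "openai/clip-vit-base-patch32"
        )
        # Single-layer projection preserving the 512-dimensional feature space.
        self.proj = nn.Linear(dim, dim)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(pixel_values=pixel_values).image_embeds  # (B, 512)
        z = self.proj(feats)
        return nn.functional.normalize(z, dim=-1)  # unit-norm sticker embeddings
```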
Training Regime
- Datasets:
- Triple-S: 905 human-curated pairs
- MultiChat: 603,351 positive/75,855 negative sticker–dialogue pairs annotated by intention
- Aggregated: 604,116 training pairs, 75,995 validation pairs
- Preprocessing: Resize to the model input resolution (224 × 224 for standard CLIP ViT-B/32), apply random crop, horizontal flip, and color jitter, and normalize by ImageNet mean/std.
- Optimization: The CLIP vision encoder is fine-tuned (the text encoder is invoked only for loss calculation, not at inference). AdamW optimizer (learning rate and weight decay selected by grid search against validation), batch size 32, 5 epochs on a single NVIDIA A100 GPU; see the sketch after this list.
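The preprocessing and optimization setup can be sketched as follows, reusing the `GSESketch` module from the architecture sketch above; the resize/crop sizes, jitter strengths, learning rate, and weight decay are illustrative placeholders rather than the paper's grid-searched values:

```python
import torch
from torchvision import transforms

# Training-time augmentation pipeline (parameter values are illustrative).
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics,
                         std=[0.229, 0.224, 0.225]),   # as stated above
])

model = GSESketch()  # architecture sketch defined earlier
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-5,            # placeholder; selected by grid search in the paper
    weight_decay=1e-2,  # placeholder; selected by grid search in the paper
)
```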
Objective Functions
Two losses are considered (both are sketched in code at the end of this subsection):
- Image–text contrastive loss (InfoNCE), computed over a batch of $N$ sticker–utterance pairs:

$$\mathcal{L}_{\mathrm{ITC}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\!\left(\mathrm{sim}(v_i, t_i)/T\right)}{\sum_{j=1}^{N} \exp\!\left(\mathrm{sim}(v_i, t_j)/T\right)}$$

where $v_i$ is the visual embedding of sticker $i$, $t_i$ denotes the CLIP text encoder output for that sticker's utterance, and $T$ is a temperature parameter.
- Binary cross-entropy loss (optional at test time), applied to the pair label $y \in \{0, 1\}$ with the similarity score passed through a sigmoid:

$$\mathcal{L}_{\mathrm{BCE}} = -\left[\, y \log \sigma\!\left(s(I_1, I_2)\right) + (1 - y) \log\!\left(1 - \sigma\!\left(s(I_1, I_2)\right)\right) \right]$$
At inference, the system operates purely on the visual encoder and similarity thresholding; no further loss computation is necessary.
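A hedged sketch of both objectives in PyTorch, assuming batched, L2-normalized image embeddings `v`, text embeddings `t`, raw pair scores `s`, and labels `y` (names and the temperature value are illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(v: torch.Tensor, t: torch.Tensor, temperature: float = 0.07):
    """Image-text contrastive loss over a batch of matched (sticker, utterance) pairs.
    v, t: (N, d) L2-normalized image and text embeddings."""
    logits = v @ t.T / temperature           # (N, N) cosine-similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(logits, targets)  # diagonal entries are the positives

def bce_pair_loss(s: torch.Tensor, y: torch.Tensor):
    """Binary cross-entropy on raw pair scores s (N,) and labels y in {0,1};
    the with_logits form applies the sigmoid internally."""
    return F.binary_cross_entropy_with_logits(s, y.float())
```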
4. Evaluation Paradigm and Empirical Results
GSE performance is scrutinized across three protocol axes: semantic similarity (Triple-S, WXChallenge), emotion classification (SER30K, MET-Meme), and sticker-to-sticker retrieval (SER30K, WXChallenge).
- Triple-S/WXChallenge (semantic similarity): Metrics include Accuracy, Precision, Recall, F1, ROC AUC.
- SER30K, MET-Meme (emotion classification): Nearest-prototype assignment via cosine similarity; metrics are Accuracy, Precision, Recall, F1.
- SER30K, WXChallenge (retrieval): Recall@K at multiple cutoffs (including K = 5, reported below).
Comparative metrics on WXChallenge semantic similarity (select examples):
| Model | Accuracy | AUC | Recall | F1 | Precision |
|---|---|---|---|---|---|
| CLIP | 0.607 | 0.651 | 0.668 | 0.496 | 0.394 |
| ViT | 0.325 | 0.556 | 0.971 | 0.454 | 0.297 |
| GSE | 0.665 | 0.706 | 0.642 | 0.526 | 0.446 |
- Downstream emotion classification (SER30K, zero-shot): GSE achieves 31.69% accuracy vs. best pretrained at 21.39%.
- Sticker-to-sticker retrieval (WXChallenge Recall@5): CLIP 0.371 vs. GSE 0.543, with relative improvements of 38–54% across retrieval settings.
This suggests that GSE's adaptation to sticker-specific semantics directly enhances both similarity classification and downstream retrieval/navigation tasks, surpassing previous foundations by substantial margins across several metrics.
5. Downstream Tasks, Ablations, and Implementation
GSE’s feature representations generalize to emotion classification and sticker retrieval. For emotion classification, a nearest-prototype approach is used over seven categories (e.g., joy, anger, sadness), demonstrating strong performance both in zero-shot and with trained multimodal heads (MGHFT+GSE).
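The nearest-prototype scheme can be sketched as follows; the full emotion list and the use of mean embeddings as class prototypes are assumptions beyond the three categories named above:

```python
import torch
import torch.nn.functional as F

# Assumption: seven emotion categories; only joy, anger, sadness are named in the text.
EMOTIONS = ["joy", "anger", "sadness", "fear", "surprise", "disgust", "neutral"]

def build_prototypes(embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Average the embeddings of each emotion class into a unit-norm prototype."""
    protos = torch.stack([embeddings[labels == c].mean(dim=0)
                          for c in range(len(EMOTIONS))])
    return F.normalize(protos, dim=-1)

def classify(query: torch.Tensor, prototypes: torch.Tensor) -> int:
    """Assign the emotion whose prototype has the highest cosine similarity."""
    sims = F.normalize(query, dim=-1) @ prototypes.T
    return int(sims.argmax())
```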
Sticker-to-sticker retrieval involves grouping by identical textual query or emotion label and ranking gallery stickers by cosine similarity. GSE achieves notably higher recall across both compact (SER30K) and large-scale (WXChallenge) datasets.
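A sketch of the retrieval protocol and Recall@K metric under the stated grouping; query/gallery embeddings and per-query relevance sets (same textual query or emotion label) are assumed to be given:

```python
import torch
import torch.nn.functional as F

def recall_at_k(queries: torch.Tensor, gallery: torch.Tensor,
                relevant: list[set[int]], k: int = 5) -> float:
    """Fraction of queries whose top-k cosine-similar gallery stickers
    contain at least one relevant item."""
    q = F.normalize(queries, dim=-1)
    g = F.normalize(gallery, dim=-1)
    topk = (q @ g.T).topk(k, dim=-1).indices          # (num_queries, k) indices
    hits = [bool(set(row.tolist()) & rel) for row, rel in zip(topk, relevant)]
    return sum(hits) / len(hits)
```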
Ablation Findings
Ablation studies emphasize contributions from both data sources:
- CLIP baseline (no fine-tuning): WXChallenge Acc 60.66 / F1 49.58 / Prec 39.42 / AUC 65.05.
- +MultiChat only: incremental improvement.
- +Triple-S only: moderate precision gain.
- Combined (GSE): 66.51 Acc / 52.60 F1 / 44.57 Prec / 70.61 AUC, outperforming all component and baseline models.
As noted, MultiChat provides scale whereas Triple-S delivers annotation quality, and their union optimally balances precision and recall.
Implementation
Training is feasible on a single NVIDIA A100 GPU. The CLIP image encoder is fine-tuned for 5 epochs; the threshold for similarity is optimized for F1 on validation data. The codebase, pretrained GSE weights, and all associated benchmarks are publicly available at https://anonymous.4open.science/r/triple-s-6E65/.
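The F1-optimal threshold selection mentioned above can be sketched as a simple sweep over candidate values on validation similarities (a minimal sketch, not the released code):

```python
import numpy as np
from sklearn.metrics import f1_score

def select_threshold(val_scores: np.ndarray, val_labels: np.ndarray) -> float:
    """Pick the similarity threshold tau that maximizes F1 on the validation split."""
    candidates = np.linspace(val_scores.min(), val_scores.max(), 200)
    f1s = [f1_score(val_labels, (val_scores >= t).astype(int)) for t in candidates]
    return float(candidates[int(np.argmax(f1s))])
```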
6. Significance and Future Directions
By both formalizing the sticker semantic similarity task and generating a benchmark with rich human annotation, GSE establishes the first rigorous framework for sticker understanding. The architecture—vision-only, CLIP-based, minimal in complexity—offers rapid adaptation for both retrieval and multimodal generation tasks. Empirical evidence attests to strong zero-shot generalization and state-of-the-art performance in emotion classification and semantic retrieval.
A plausible implication is that future advances in sticker semantics will require continued refinement of annotated resources (expanding or diversifying Triple-S), exploration of more sophisticated multimodal objectives, and adaptation to evolving sticker “languages” across platforms. This foundational work provides standardized tools and baselines for subsequent research and industry deployments in sticker-based communication systems.