EmoLoom-2B: Lightweight Emotion Pipeline
- EmoLoom-2B is a pipeline that converts small-scale language models into joint emotion classifiers and VAD regressors using a strict JSON input-output contract.
- It employs integrated loss functions and semantic regularization, combining binary cross-entropy for emotion tags with Euclidean regression for VAD predictions.
- The framework uses controlled data augmentation and mixture sampling to optimize performance metrics while ensuring reproducibility and operational efficiency.
EmoLoom-2B is a lightweight, reproducible pipeline designed to convert small-scale LLMs (SLMs) with under 2 billion parameters into robust joint emotion classifiers and Valence-Arousal-Dominance (VAD) regressors. The framework centralizes protocol-faithful implementation, enforcing a strict JSON input-output contract, deterministic decoding procedures, and semantic regularization. Targeted toward rapid evaluation and budget-aware screening, EmoLoom-2B systematically eliminates common sources of avoidable variance while maximizing both coverage and format reliability in emotion understanding tasks (Li et al., 3 Jan 2026).
1. Protocol-True Pipeline Design
EmoLoom-2B is architected around a unified JSON interaction "contract" for both training and inference. Each input utterance is processed by a fixed prompt that requires the output to be a single line of JSON containing three elements: a multi-hot list of emotion labels ("labels"); a dictionary of VAD values ("vad"), each rounded to two decimals; and a concise English rationale.
Parse validity is rigorously enforced via tail-scanning for JSON structure; only outputs marked ParseOK=1 are included in downstream metric calculations (Macro-F1, Macro-R, 1–RMSE VAD), and the ParseOK rate itself is reported to quantify formatter reliability.
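The tail-scanning check described above can be sketched as follows. This is a minimal illustration, not the authors' parser: the VAD sub-keys `v`, `a`, `d` and the helper names `is_valid` and `parse_tail_json` are assumptions, since the text names only the "labels" and "vad" fields.

```python
import json

def is_valid(obj):
    """Check the contract: a list under "labels" and numeric v/a/d under "vad".
    The sub-keys "v", "a", "d" are assumed; the source does not spell them out."""
    if not isinstance(obj, dict):
        return False
    labels, vad = obj.get("labels"), obj.get("vad")
    if not isinstance(labels, list) or not isinstance(vad, dict):
        return False
    return all(isinstance(vad.get(k), (int, float)) for k in ("v", "a", "d"))

def parse_tail_json(text):
    """Scan backwards from the end of the text for the last JSON object that
    satisfies the contract. Returns (parse_ok, obj) with parse_ok in {0, 1}."""
    end = text.rfind("}")
    while end != -1:
        start = text.rfind("{", 0, end)
        while start != -1:
            try:
                obj = json.loads(text[start:end + 1])
            except json.JSONDecodeError:
                obj = None
            if is_valid(obj):
                return 1, obj
            # Widen the candidate span to the previous opening brace.
            start = text.rfind("{", 0, start)
        end = text.rfind("}", 0, end)
    return 0, None
```

Outputs for which this returns `parse_ok == 0` would be excluded from the downstream metrics, while still counting toward the reported ParseOK rate.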
All decoding utilizes the KV-off paradigm: use_cache=false, deterministic hyperparameters (temperature=0, top_p=1.0), and greedy next-token selection—the same schema for both self-distillation and inference. This approach neutralizes implementation-dependent discrepancies in key-value cache handling across hardware/software platforms, ensuring that measured improvements reflect model fidelity rather than artifacts of decoding strategy.
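As a rough illustration of these settings, the sketch below builds keyword arguments using the parameter names of the Hugging Face `generate` method. Using that library here is an assumption, as is the helper name `greedy_generate_kwargs`; the 64-token limit is the generation cutoff quoted in the experimental configuration. In that API greedy decoding is requested with `do_sample=False` rather than a literal temperature of zero.

```python
def greedy_generate_kwargs(temperature=0.0, top_p=1.0, max_new_tokens=64):
    """Translate the KV-off, deterministic protocol into generate() keywords.

    Hypothetical helper: it rejects any sampling setting other than the ones
    the protocol fixes, then returns a greedy, cache-free configuration.
    """
    if temperature != 0.0 or top_p != 1.0:
        raise ValueError("the protocol fixes temperature=0 and top_p=1.0")
    return {
        "use_cache": False,    # KV-off: recompute attention at every step
        "do_sample": False,    # greedy next-token selection
        "num_beams": 1,        # no beam search
        "max_new_tokens": max_new_tokens,
    }
```

With a loaded model and tokenized inputs (both hypothetical here), the call would look like `model.generate(**inputs, **greedy_generate_kwargs())`.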
2. Loss Architecture and Semantic Regularization
The foundational objective combines multi-label binary cross-entropy for emotion tags with Euclidean ($\ell_2$) regression for VAD prediction.
Three orthogonal semantic regularizers further constrain the model:
- VAD-Preserving Consistency: Outputs are mapped onto VAD space using NRC-VAD lexicon lookups for each token, and an aggregated text VAD vector is computed. An additional loss between this lexicon-derived vector and the predicted VAD enforces consistency between surface generation and numeric affect prediction.
- Lightweight Appraisal-Atom Verifier: A compact classifier predicts scores for discrete appraisal atoms (goal attainment, controllability, certainty, fairness) for each example. Targets are derived from gold labels or heuristics, and a regularization penalty is applied to the mismatch between predicted scores and targets. This is a soft constraint active only during training; it requires neither expanded explanations nor modification of the generator.
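- Valence Flip Symmetry: By constructing lexically mirrored pairs $(x, \tilde{x})$ that swap polarity-laden tokens (e.g., "great" ↔ "terrible"), the model is trained to output valence scores symmetric around 0.5, so that the predicted valences of a pair satisfy $\hat{v}(x) + \hat{v}(\tilde{x}) \approx 1$.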
The total objective is a weighted sum of the foundational objective and the three regularizers. Weights are selected via a development-set sweep.
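To make the structure of the objective concrete, here is a minimal pure-Python sketch of the pieces described above. The exact functional forms are assumptions: squared distances are used for the VAD terms and for the flip penalty, and the function names, the toy lexicon format, and the default weights of 0.1 are illustrative placeholders, not values from the source.

```python
import math

def bce(probs, targets, eps=1e-7):
    """Mean multi-label binary cross-entropy over emotion tags."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)

def sq_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lexicon_vad(tokens, lexicon):
    """Average the (v, a, d) entries of tokens found in a VAD lexicon.
    Returns None when no token has an entry."""
    hits = [lexicon[t] for t in tokens if t in lexicon]
    if not hits:
        return None
    return [sum(col) / len(hits) for col in zip(*hits)]

def flip_penalty(v_orig, v_flip):
    """Zero when two valences are symmetric around 0.5 (they sum to 1)."""
    return (v_orig + v_flip - 1.0) ** 2

def total_loss(base, l_vad, l_app, l_flip, w_vad=0.1, w_app=0.1, w_flip=0.1):
    """Weighted sum of the base objective and the three regularizers.
    The default weights are placeholders, not values from the source."""
    return base + w_vad * l_vad + w_app * l_app + w_flip * l_flip
```

In this sketch the base term would be `bce(...) + sq_l2(pred_vad, gold_vad)`, and the consistency term would be `sq_l2(pred_vad, lexicon_vad(tokens, lexicon))` whenever the lexicon lookup returns a vector.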
3. Data Augmentation and Mixture Sampling
EmoLoom-2B utilizes Valence Flip augmentation by introducing polarity-mirrored pairs through lexical substitution or scenario rewrite, simultaneously training on both the original utterance and its mirrored counterpart to enforce inversion behavior. This technique is found to mitigate valence drift and encourage robust polarity mapping.
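The lexical-substitution variant of this augmentation might look like the sketch below. The antonym map is a toy example invented for illustration, and `valence_flip` is a hypothetical helper; the source does not describe its substitution lexicon, and scenario rewrites are not covered here.

```python
import re

# Toy antonym map, invented for illustration; a real pipeline would need a
# curated polarity lexicon.
POLARITY_SWAPS = {"great": "terrible", "love": "hate", "happy": "sad"}
# Make the map symmetric so that flipping twice restores the original text.
POLARITY_SWAPS.update({v: k for k, v in list(POLARITY_SWAPS.items())})

def valence_flip(text):
    """Return the polarity-mirrored text, or None if no polarity token occurs."""
    changed = False

    def swap(match):
        nonlocal changed
        word = match.group(0)
        repl = POLARITY_SWAPS.get(word.lower())
        if repl is None:
            return word
        changed = True
        # Preserve initial capitalization of the replaced word.
        return repl.capitalize() if word[0].isupper() else repl

    flipped = re.sub(r"[A-Za-z']+", swap, text)
    return flipped if changed else None
```

Returning `None` for utterances without polarity tokens lets a caller skip them rather than train on an unchanged "pair".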
Training proceeds via A/B mixture sampling, interleaving GoEmotions (“A”) and EmpatheticDialogues (“B”) at controlled A:B ratios (20:80, 50:50, and 80:20 were studied). The selection probability for each batch depends on a running entropy-based confidence estimate and on a temperature that cools linearly with training step. Early epochs promote diversity through a high temperature, while late-phase training consolidates on the target ratio. Empirically, the 20:80 ratio yields the best trade-off between Macro-F1 and VAD accuracy.
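One way to realize a linearly cooled mixture sampler is sketched below. It is an assumption-heavy illustration: the source's selection formula is not reproduced here, the entropy-based confidence term is omitted because its form is not given, and the start and end temperatures (4.0 and 1.0) and all function names are hypothetical. The sketch only shows how a high temperature flattens the A/B probabilities toward 50:50 while a temperature of 1 recovers the target ratio.

```python
import math
import random

def linear_temperature(step, total_steps, t_start=4.0, t_end=1.0):
    """Temperature that cools linearly from t_start to t_end over training."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return t_start + (t_end - t_start) * frac

def mixture_probs(ratio_a, temperature):
    """Temperature-scaled softmax over the log target ratios for A and B."""
    logits = [math.log(ratio_a), math.log(1.0 - ratio_a)]
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def pick_corpus(step, total_steps, ratio_a=0.2, rng=random):
    """Choose corpus "A" or "B" for the next batch."""
    p_a, _ = mixture_probs(ratio_a, linear_temperature(step, total_steps))
    return "A" if rng.random() < p_a else "B"
```

Passing a seeded `random.Random` instance as `rng` keeps the batch schedule reproducible across runs.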
The supervised fine-tuning loop incorporates out-of-memory (OOM) remediation (max_len reduction, gradient accumulation bump) and is detailed algorithmically in the original publication.
4. Experimental Configuration and Backbone Selection
For backbone evaluation, two ~1.8B parameter models (Qwen-1.8B-Chat and InternLM2-1.8B-SFT) were screened under the identical KV-off protocol over 1 hour, with Qwen-1.8B-Chat selected for downstream use due to a +1σ lead in composite metrics. Training was conducted on a single GPU (≥24GB VRAM) using PyTorch 2.3.1, bf16/TF32 precision, gradient checkpointing, AdamW (β₁=0.9, β₂=0.95, weight_decay=0.1), batch size 1 with accumulation to effective batch ≈128, cosine-decay learning rate (peak ~1–2e-5), input truncation at 1536 tokens, and generation cutoff at 64 tokens. Deterministic seeds and full config/hashes ensure auditability and reproducibility across runs.
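The schedule and batch arithmetic in this configuration can be illustrated with a short sketch. The peak rate of 2e-5 is the upper end of the quoted range; the floor of zero, the absence of warmup, and the function names are assumptions, since the source does not specify them.

```python
import math

def cosine_lr(step, total_steps, peak_lr=2e-5, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps.
    Warmup is omitted because the source does not describe one."""
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def effective_batch(micro_batch, accum_steps):
    """Effective batch size under gradient accumulation on a single device."""
    return micro_batch * accum_steps
```

With a micro-batch of 1, an effective batch of 128 corresponds to 128 accumulation steps per optimizer update.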
5. Task Evaluation, Metrics, and Ablation
Development experiments (GoEmotions + EmpatheticDialogues) compared three A:B mixture ratios, with results summarized in Table 1 (JSON-valid outputs only):
| Mix Ratio | Macro-F1 | Macro-R | VAD (1–RMSE) |
|---|---|---|---|
| 20:80 | 0.3500† | 0.2693† | 0.9417† |
| 50:50 | 0.3470‡ | 0.2657‡ | 0.9337‡ |
| 80:20 | 0.3341 | 0.2509 | 0.9135 |
† denotes best; ‡ denotes second best.
The 20:80 mix converged to the lowest loss (≈0.1604 at step 90) and delivered the highest joint performance for multi-label classification and VAD regression.
Cross-corpus generalization was vetted in a one-hour budgeted run on DailyDialog, with the 20:80 model achieving Macro-F1≈0.3071, VAD(1–RMSE)≈0.8066, and ParseOK≈0.976 ($N$=6,261).
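For reference, the metrics reported above can be computed along the following lines. This is a sketch under stated assumptions: the record layout (`gold_labels`, `gold_vad`, `parsed`) is hypothetical, and the RMSE is pooled over all three VAD dimensions, which may differ from the aggregation used in the source.

```python
import math

def macro_f1_and_recall(gold, pred, labels):
    """Macro-averaged F1 and recall over label sets (one set per example)."""
    f1s, recalls = [], []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if lab in g and lab in p)
        fp = sum(1 for g, p in zip(gold, pred) if lab not in g and lab in p)
        fn = sum(1 for g, p in zip(gold, pred) if lab in g and lab not in p)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
        recalls.append(rec)
    return sum(f1s) / len(labels), sum(recalls) / len(labels)

def one_minus_rmse(gold_vad, pred_vad):
    """1 - RMSE pooled over every VAD dimension of every example."""
    sq = [(g - p) ** 2 for gv, pv in zip(gold_vad, pred_vad) for g, p in zip(gv, pv)]
    return 1.0 - math.sqrt(sum(sq) / len(sq))

def evaluate(records, labels):
    """Score only ParseOK=1 records; report the ParseOK rate separately.
    Each record holds gold_labels, gold_vad, and parsed (None if unparsable)."""
    ok = [r for r in records if r["parsed"] is not None]
    result = {"parse_ok": len(ok) / len(records), "macro_f1": 0.0,
              "macro_r": 0.0, "vad_1_minus_rmse": 0.0}
    if ok:
        gold = [set(r["gold_labels"]) for r in ok]
        pred = [set(r["parsed"]["labels"]) for r in ok]
        result["macro_f1"], result["macro_r"] = macro_f1_and_recall(gold, pred, labels)
        result["vad_1_minus_rmse"] = one_minus_rmse(
            [r["gold_vad"] for r in ok], [r["parsed"]["vad"] for r in ok])
    return result
```

Keeping the ParseOK rate separate from the content metrics means that a model cannot improve its F1 simply by failing to produce parsable output on hard examples without that showing up in the report.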
Ablation studies revealed the following impacts:
- Excluding the VAD-consistency loss: Increased VAD error (by 0.006–0.012) and decreased Macro-F1 (by 0.004–0.007), with more valence drift under flip augmentation.
- Excluding the appraisal-atom verifier loss: Decreased Macro-F1 by 0.010–0.015, with a pronounced effect on the fairness and controllability categories; VAD unchanged.
- Removing flip pairs: The flip-symmetry metric roughly doubled (≈0.06→0.11), and Macro-F1 dropped by ≈0.003.
- Deviations from linear temperature cooling reduced coverage or convergence on high-entropy samples.
- Mix ratio and semantic regularizer weights exhibited ranges of broad optimality, indicating robustness to these hyperparameters.
6. Operational Characteristics, Auditability, and Recommendations
EmoLoom-2B implements a rapid “quick-eval” audit mode, capping wall-clock time (e.g., 60 min) and recording ETA to enable standardized backbone comparison under strict protocol constraints. Full auditability is achieved by recording data splits in a manifest with SHA-1 checksums, together with deterministic seeds, OOM healing routines, KV-off decoding, and prompt standardization. Re-entrancy guarantees identical metrics, outputs, and logs across repeated runs.
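A manifest of this kind could be produced with the standard-library `hashlib` module, as in the sketch below. The manifest layout and the function names are assumptions for illustration; only the use of SHA-1 checksums and deterministic seeds comes from the text.

```python
import hashlib
import json

def sha1_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-1 so large splits need not fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(split_paths, seed, config):
    """Record the seed, a hash of the config, and one checksum per split.
    The layout is hypothetical; the source does not publish its schema."""
    config_bytes = json.dumps(config, sort_keys=True).encode("utf-8")
    return {
        "seed": seed,
        "config_sha1": hashlib.sha1(config_bytes).hexdigest(),
        "splits": {name: sha1_of_file(path)
                   for name, path in sorted(split_paths.items())},
    }
```

Serializing the config with `sort_keys=True` keeps the config hash stable regardless of dictionary insertion order, so two runs with the same settings produce the same manifest.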
Practically, EmoLoom-2B is recommended as an initial screening filter for backbone selection and preliminary validation of joint emotion/VAD capabilities, preceding investment in larger-scale or multimodal architectures. The semantic regularizers (VAD consistency, appraisal verifier, valence flip) add minimal computational burden yet provide measurable gains in robustness and format reliability.
In summary, EmoLoom-2B delivers a reproducible, minimal-overhead pipeline for small-model emotion understanding, integrating strict JSON protocols, fair decoding, lexicon-weak supervision, and semantic constraints. Its performance and operational traits position it as a dependable resource for constrained screening and prototyping in affective language research (Li et al., 3 Jan 2026).