EmoLoom-2B: Lightweight Emotion Pipeline

Updated 10 January 2026

EmoLoom-2B is a pipeline that converts small-scale language models into joint emotion classifiers and VAD regressors using a strict JSON input-output contract.
It employs integrated loss functions and semantic regularization, combining binary cross-entropy for emotion tags with Euclidean regression for VAD predictions.
The framework uses controlled data augmentation and mixture sampling to optimize performance metrics while ensuring reproducibility and operational efficiency.

EmoLoom-2B is a lightweight, reproducible pipeline designed to convert small language models (SLMs) with under 2 billion parameters into robust joint emotion classifiers and Valence-Arousal-Dominance (VAD) regressors. The framework centers on a protocol-faithful implementation: a strict JSON input-output contract, deterministic decoding procedures, and semantic regularization. Aimed at rapid evaluation and budget-aware screening, EmoLoom-2B systematically removes common sources of avoidable variance while maximizing both coverage and format reliability in emotion understanding tasks (Li et al., 3 Jan 2026).

1. Protocol-True Pipeline Design

EmoLoom-2B is architected around a unified JSON interaction "contract" for both training and inference. Each input utterance $x$ is processed by a fixed prompt requiring output as a single line of JSON containing three elements: a multi-hot list of emotion labels ("labels"); a dictionary of VAD values ("vad") where $v,a,d \in [0,1]$ rounded to two decimals; and a concise English rationale. Example output:

{"labels":["disgust"],"vad":{"v":0.42,"a":0.21,"d":0.49},"rationale":"tone of displeasure"}

Parse validity is rigorously enforced via tail-scanning for JSON structure; only outputs marked ParseOK=1 are included in downstream metric calculations (Macro-F1, Macro-R, 1–RMSE VAD), and the ParseOK rate itself is reported to quantify formatter reliability.
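The tail-scanning check can be illustrated with a minimal sketch. It assumes that "tail-scanning" means searching backward from the end of the raw model output for the last well-formed JSON object that satisfies the contract; the function names `parse_tail_json` and `is_valid` are hypothetical and not taken from the original publication.

```python
import json

VAD_KEYS = ("v", "a", "d")

def is_valid(obj):
    """Check the three-field contract: labels, vad in [0,1], rationale."""
    if not isinstance(obj, dict):
        return False
    labels, vad, rationale = obj.get("labels"), obj.get("vad"), obj.get("rationale")
    if not (isinstance(labels, list) and all(isinstance(l, str) for l in labels)):
        return False
    if not (isinstance(vad, dict) and set(vad) == set(VAD_KEYS)):
        return False
    for key in VAD_KEYS:
        value = vad[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        if not 0.0 <= value <= 1.0:
            return False
    return isinstance(rationale, str)

def parse_tail_json(text):
    """Return (parse_ok, obj), scanning from the end of the text backward."""
    end = text.rfind("}")
    while end != -1:
        start = text.rfind("{", 0, end)
        while start != -1:
            try:
                obj = json.loads(text[start:end + 1])
            except json.JSONDecodeError:
                obj = None
            if obj is not None and is_valid(obj):
                return 1, obj
            start = text.rfind("{", 0, start)
        end = text.rfind("}", 0, end)
    return 0, None
```

Under this sketch, a sample with `parse_ok == 0` would be counted toward the ParseOK rate but excluded from Macro-F1, Macro-R, and 1–RMSE.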

All decoding uses the KV-off paradigm: use_cache=false, deterministic hyperparameters (temperature=0, top_p=1.0), and greedy next-token selection. The same settings apply to both self-distillation and inference. This approach neutralizes implementation-dependent discrepancies in key-value cache handling across hardware/software platforms, ensuring that measured improvements reflect model fidelity rather than artifacts of decoding strategy.
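As a hedged illustration, the decoding settings could be collected in one place so that self-distillation and inference share them. The sketch below assumes a Hugging Face `transformers`-style `generate` call, which the source does not confirm; the helper name `kv_off_generation_kwargs` is hypothetical, and the 64-token limit is the generation cutoff reported in the experimental configuration.

```python
def kv_off_generation_kwargs(max_new_tokens=64):
    """Keyword arguments for a deterministic, cache-free greedy decode.

    Assumption: these are passed to a transformers-style
    model.generate(**inputs, **kwargs). With do_sample=False the
    sampling parameters (temperature, top_p) play no role, which
    corresponds to the temperature=0 / top_p=1.0 setting in the text.
    """
    return {
        "do_sample": False,       # greedy next-token selection
        "num_beams": 1,           # no beam search
        "use_cache": False,       # KV-off: recompute attention each step
        "max_new_tokens": max_new_tokens,
    }
```

Disabling the cache makes generation slower, so this trade-off only pays off when cross-platform comparability matters more than throughput.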

2. Loss Architecture and Semantic Regularization

The foundational objective combines multi-label binary cross-entropy for emotion tags and Euclidean ( $\ell_2$ ) regression for VAD prediction:

$L_\text{cls} = \frac{1}{K} \sum_{k=1}^{K} \left[ -y_k \log p_k - (1-y_k) \log(1-p_k) \right]$

$L_\text{reg} = \| v - \hat{v} \|_2^2$
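The two base terms can be written out directly. The following is a minimal pure-Python sketch of the formulas above for a single example; the clipping constant `eps` is an assumption added for numerical safety and is not part of the source.

```python
import math

def bce_multilabel(y, p, eps=1e-7):
    """L_cls: mean binary cross-entropy over K emotion labels."""
    total = 0.0
    for y_k, p_k in zip(y, p):
        p_k = min(max(p_k, eps), 1.0 - eps)  # avoid log(0)
        total += -y_k * math.log(p_k) - (1 - y_k) * math.log(1.0 - p_k)
    return total / len(y)

def squared_l2(v, v_hat):
    """L_reg: squared Euclidean distance between gold and predicted VAD."""
    return sum((a - b) ** 2 for a, b in zip(v, v_hat))
```

For example, `bce_multilabel([1, 0], [0.5, 0.5])` equals $\log 2$, the loss of an uninformative prediction.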

Three orthogonal semantic regularizers further constrain the model:

VAD-Preserving Consistency: The generated output $a$ is mapped onto VAD space by looking up each token in the NRC-VAD lexicon and aggregating the results into a text-level VAD vector $v_\text{text}(a)$ . The additional loss

$L_\text{vad} = \| v_\text{text}(a) - \hat{v} \|_2$

enforces continuity between surface generation and numeric affect prediction.
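A small sketch of this consistency term follows. It makes several assumptions not stated in the source: tokens are split on whitespace, aggregation is a plain mean over tokens found in the lexicon, and text with no lexicon hits falls back to a neutral vector. The lexicon entries are invented placeholders, not real NRC-VAD values.

```python
import math

# Invented toy entries for illustration only; not real NRC-VAD scores.
TOY_LEXICON = {
    "happy": (0.9, 0.6, 0.7),
    "angry": (0.1, 0.8, 0.6),
}

def text_vad(text, lexicon, neutral=(0.5, 0.5, 0.5)):
    """v_text(a): mean VAD of the tokens that appear in the lexicon."""
    hits = [lexicon[tok] for tok in text.lower().split() if tok in lexicon]
    if not hits:
        return neutral  # assumption: neutral fallback when nothing matches
    n = len(hits)
    return tuple(sum(h[i] for h in hits) / n for i in range(3))

def vad_consistency_loss(text, v_hat, lexicon):
    """L_vad: Euclidean distance between lexicon VAD and predicted VAD."""
    v_text = text_vad(text, lexicon)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v_text, v_hat)))
```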

Lightweight Appraisal-Atom Verifier: A compact classifier $f_\text{app}$ predicts scores $s \in [0,1]^M$ for discrete appraisal atoms (goal attainment, controllability, certainty, fairness) given $(x, a)$ . Targets $\tilde{s} \in \{0,1\}^M$ are derived from gold labels or heuristics; the regularization penalty is

$L_\text{app} = \frac{1}{M} \sum_{m=1}^M [ -\tilde{s}_m \log s_m - (1-\tilde{s}_m) \log(1-s_m) ]$

This is a soft constraint active only during training; it requires neither expanded explanations nor modification of the generator.

Valence Flip Symmetry: By constructing lexically mirrored pairs $(x, x')$ swapping polarity-laden tokens (e.g., "great" $\leftrightarrow$ "terrible"), the model is trained to output valence scores symmetric around 0.5:

$L_\text{flip} = \left| (v(x) - 0.5) + (v(x') - 0.5) \right|$
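As a worked instance of the formula above, the flip term is zero when the two valence predictions mirror each other around 0.5 and grows when they sit on the same side. The function name is hypothetical.

```python
def flip_loss(v_x, v_x_mirror):
    """L_flip: deviation from valence symmetry around 0.5."""
    return abs((v_x - 0.5) + (v_x_mirror - 0.5))
```

With predictions 0.8 and 0.2 the loss is (up to floating-point error) 0, whereas 0.8 and 0.8 give approximately 0.6.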

The total objective is a weighted sum:

$L = \lambda_\text{cls} L_\text{cls} + \lambda_\text{reg} L_\text{reg} + \lambda_\text{vad} L_\text{vad} + \lambda_\text{app} L_\text{app} + \lambda_\text{flip} L_\text{flip}$

Weights are selected via development set sweep (e.g. $\lambda_\text{vad} \approx 1.0$ , $\lambda_\text{app} \approx 0.5$ , $\lambda_\text{flip} \approx 0.3$ ).
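The weighted sum can be sketched as below. The values for $\lambda_\text{vad}$ , $\lambda_\text{app}$ , and $\lambda_\text{flip}$ follow the example figures above; setting $\lambda_\text{cls}$ and $\lambda_\text{reg}$ to 1.0 is an assumption, since the source does not report them.

```python
# vad/app/flip follow the example sweep values in the text;
# cls and reg are assumed to be 1.0 (not stated in the source).
DEFAULT_WEIGHTS = {"cls": 1.0, "reg": 1.0, "vad": 1.0, "app": 0.5, "flip": 0.3}

def total_loss(terms, weights=DEFAULT_WEIGHTS):
    """Combine the five per-term losses into the total objective L."""
    return sum(weights[name] * terms[name] for name in weights)
```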

3. Data Augmentation and Mixture Sampling

EmoLoom-2B utilizes Valence Flip augmentation by introducing polarity-mirrored pairs through lexical substitution or scenario rewrite, simultaneously training on both $x$ and $x'$ to enforce inversion behavior. This technique is found to mitigate valence drift and encourage robust polarity mapping.
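The lexical-substitution variant of this augmentation can be sketched as follows. The swap table is a tiny hypothetical example (only the "great"/"terrible" pair appears in the source), matching is on whitespace-split, lowercased tokens, and utterances without any polarity token yield no pair; all of these are simplifying assumptions.

```python
# Hypothetical swap table; only great <-> terrible comes from the text.
POLARITY_SWAPS = {
    "great": "terrible", "terrible": "great",
    "love": "hate", "hate": "love",
}

def mirror_utterance(text, swaps=POLARITY_SWAPS):
    """Return the polarity-mirrored x', or None if no token was swapped."""
    tokens = text.split()
    mirrored = [swaps.get(tok.lower(), tok) for tok in tokens]
    if mirrored == tokens:
        return None
    return " ".join(mirrored)

def flip_pairs(utterances, swaps=POLARITY_SWAPS):
    """Build (x, x') pairs for joint training on both members."""
    pairs = []
    for x in utterances:
        x_mirror = mirror_utterance(x, swaps)
        if x_mirror is not None:
            pairs.append((x, x_mirror))
    return pairs
```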

Training proceeds via A/B mixture sampling, interleaving GoEmotions (“A”) and EmpatheticDialogues (“B”) at controlled ratios ( $w_A:w_B$ studied for 20:80, 50:50, 80:20). Selection probability for each batch is governed by

$p(s) = \text{softmax} \left( \frac{w_s / \text{conf}_s}{T_t} \right), \quad s \in \{A,B\}$

where $\text{conf}_s$ is a running entropy-based confidence estimate, and temperature $T_t$ cools linearly with training step. Early epochs promote diversity through high $T_t$ , while late-phase training consolidates on the target ratio. Empirically, the 20:80 ratio yields optimal trade-offs for Macro-F1 and VAD.
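The selection rule can be written out literally, as in the sketch below. It assumes the mix weights are expressed as fractions (for example 0.2 and 0.8) and that the temperature falls linearly from a start value to an end value; the specific endpoints 2.0 and 0.5 and all function names are illustrative assumptions, and the confidence estimates are simply passed in rather than computed.

```python
import math
import random

def linear_temperature(step, total_steps, t_start=2.0, t_end=0.5):
    """T_t: linear cooling from t_start to t_end (assumed endpoints)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_start + (t_end - t_start) * frac

def source_probs(weights, conf, temperature):
    """p(s) = softmax((w_s / conf_s) / T_t) over the data sources."""
    logits = {s: (weights[s] / conf[s]) / temperature for s in weights}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {s: math.exp(v - top) for s, v in logits.items()}
    z = sum(exps.values())
    return {s: e / z for s, e in exps.items()}

def sample_source(probs, rng):
    """Draw the source ("A" or "B") for the next batch."""
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
```

A typical call would be `sample_source(source_probs({"A": 0.2, "B": 0.8}, {"A": 1.0, "B": 1.0}, linear_temperature(step, total_steps)), random.Random(0))`; a higher temperature pushes the two probabilities closer to uniform.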

The supervised fine-tuning loop incorporates out-of-memory (OOM) remediation (max_len reduction, gradient accumulation bump) and is detailed algorithmically in the original publication.
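A rough sketch of how such OOM remediation might look is given below. It is not the algorithm from the publication: halving `max_len`, doubling the accumulation count, the `min_len` floor, and the retry limit are all assumptions, and Python's built-in `MemoryError` stands in for whatever out-of-memory exception the training framework raises.

```python
def run_step_with_oom_healing(step_fn, max_len, grad_accum,
                              min_len=128, max_retries=4):
    """Retry a training step, shrinking max_len and raising accumulation.

    step_fn(max_len, grad_accum) is a hypothetical callable that runs one
    optimizer step and raises MemoryError when it runs out of memory.
    Returns (step result, final max_len, final grad_accum).
    """
    for _ in range(max_retries + 1):
        try:
            return step_fn(max_len, grad_accum), max_len, grad_accum
        except MemoryError:
            if max_len <= min_len:
                raise  # nothing left to shrink
            max_len = max(min_len, max_len // 2)  # assumed policy: halve
            grad_accum *= 2                       # assumed policy: double
    raise MemoryError("step still out of memory after retries")
```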

4. Experimental Configuration and Backbone Selection

For backbone evaluation, two ~1.8B parameter models (Qwen-1.8B-Chat and InternLM2-1.8B-SFT) were screened under the identical KV-off protocol over 1 hour, with Qwen-1.8B-Chat selected for downstream use due to a +1σ lead in composite metrics. Training was conducted on a single GPU (≥24GB VRAM) using PyTorch 2.3.1, bf16/TF32 precision, gradient checkpointing, AdamW (β₁=0.9, β₂=0.95, weight_decay=0.1), batch size 1 with accumulation to effective batch ≈128, cosine-decay learning rate (peak ~1–2e-5), input truncation at 1536 tokens, and generation cutoff at 64 tokens. Deterministic seeds and full config/hashes ensure auditability and reproducibility across runs.
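The learning-rate schedule and accumulation arithmetic can be illustrated with a short sketch. It assumes a plain cosine decay from the peak to a floor with no warmup (the source does not mention one) and uses 2e-5 as the peak, the upper end of the reported range; the function names are hypothetical.

```python
import math

def cosine_lr(step, total_steps, peak_lr=2e-5, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def accumulation_steps(effective_batch=128, micro_batch=1):
    """Micro-batches to accumulate before each optimizer step."""
    return effective_batch // micro_batch
```

With a micro-batch of 1, reaching an effective batch of about 128 means accumulating gradients over 128 forward/backward passes per optimizer step.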

5. Task Evaluation, Metrics, and Ablation

Development experiments (GoEmotions + EmpatheticDialogues) compared three A:B mixture ratios, with results summarized in Table 1 (JSON-valid outputs only):

Mix Ratio	Macro-F1	Macro-R	VAD (1–RMSE)
20:80	0.3500†	0.2693†	0.9417†
50:50	0.3470‡	0.2657‡	0.9337‡
80:20	0.3341	0.2509	0.9135

† denotes best; ‡ denotes second best.
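For orientation, the three reported metrics could be computed along the following lines, applied only to samples whose output parsed successfully. The sketch makes assumptions the source does not spell out: per-label precision, recall, and F1 default to 0 when undefined, and RMSE is pooled over all three VAD dimensions before subtracting from 1.

```python
import math

def macro_f1_and_recall(gold, pred, label_set):
    """gold/pred: lists of label sets, one per ParseOK sample."""
    f1s, recalls = [], []
    for label in label_set:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1s.append(2 * precision * recall / denom if denom else 0.0)
        recalls.append(recall)
    return sum(f1s) / len(f1s), sum(recalls) / len(recalls)

def one_minus_rmse(gold_vad, pred_vad):
    """1 - RMSE, pooled over samples and the v, a, d dimensions."""
    squared = [(g - p) ** 2
               for gv, pv in zip(gold_vad, pred_vad)
               for g, p in zip(gv, pv)]
    return 1.0 - math.sqrt(sum(squared) / len(squared))
```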

The 20:80 mix converged to the lowest loss (≈0.1604 at step 90) and delivered the highest joint performance for multi-label classification and VAD regression.

Cross-corpus generalization was vetted in a one-hour budgeted run on DailyDialog, with the 20:80 model achieving Macro-F1≈0.3071, VAD(1–RMSE)≈0.8066, and ParseOK≈0.976 ( $N_\text{val}$ =6,261).

Ablation studies revealed the following impacts:

Excluding $L_\text{vad}$ : Increased VAD error (by 0.006–0.012) and decreased Macro-F1 (by 0.004–0.007), with more valence drift under flip augmentation.
Excluding $L_\text{app}$ : Decreased Macro-F1 by 0.010–0.015, with pronounced effect on fairness and controllability categories; VAD unchanged.
Removing flip pairs: Flip-symmetry metric $S_\text{flip}$ doubled (≈0.06→0.11), Macro-F1 down by ≈0.003.
Deviations from linear temperature cooling reduced coverage or convergence on high-entropy samples.
Mix ratio and semantic regularizer weights exhibited ranges of broad optimality, indicating robustness to these hyperparameters.

6. Operational Characteristics, Auditability, and Recommendations

EmoLoom-2B implements a rapid “quick-eval” audit mode, capping wall-clock time (e.g., 60 min) and recording ETA to enable standardized backbone comparison under strict protocol constraints. Full auditability is achieved by recording data splits in a manifest with SHA-1 checksums, together with deterministic seeds, OOM healing routines, KV-off decoding, and prompt standardization. Re-entrancy guarantees identical metrics, outputs, and logs across repeated runs.
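A split manifest of this kind might be produced as in the sketch below. The manifest layout, the function name, and the choice to hash a canonical JSON serialization of each row are assumptions; only the use of SHA-1 checksums and deterministic seeds comes from the text.

```python
import hashlib
import json

def split_manifest(splits, seed):
    """Record the seed plus size and SHA-1 checksum of each data split.

    splits: dict mapping a split name to a list of JSON-serializable rows.
    """
    manifest = {"seed": seed, "splits": {}}
    for name, rows in splits.items():
        digest = hashlib.sha1()
        for row in rows:
            # sort_keys gives a canonical form so key order cannot change the hash
            line = json.dumps(row, sort_keys=True, ensure_ascii=False)
            digest.update(line.encode("utf-8"))
            digest.update(b"\n")
        manifest["splits"][name] = {"n": len(rows), "sha1": digest.hexdigest()}
    return manifest
```

Comparing two manifests then shows at a glance whether repeated runs used identical data and seeds.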

Practically, EmoLoom-2B is recommended as an initial screening filter for backbone selection and preliminary validation of joint emotion/VAD capabilities, preceding investment in larger-scale or multimodal architectures. The semantic regularizers (VAD consistency, appraisal verifier, valence flip) add minimal computational burden yet provide measurable gains in robustness and format reliability.

In summary, EmoLoom-2B delivers a reproducible, minimal-overhead pipeline for small-model emotion understanding, integrating strict JSON protocols, fair decoding, lexicon-weak supervision, and semantic constraints. Its performance and operational traits position it as a dependable resource for constrained screening and prototyping in affective language research (Li et al., 3 Jan 2026).

References (1)

EmoLoom-2B: Fast Base-Model Screening for Emotion Classification and VAD with Lexicon-Weak Supervision and KV-Off Evaluation (2026)