
EmoLoom-2B: Lightweight Emotion Pipeline

Updated 10 January 2026
  • EmoLoom-2B is a pipeline that converts small-scale language models into joint emotion classifiers and VAD regressors using a strict JSON input-output contract.
  • It employs integrated loss functions and semantic regularization, combining binary cross-entropy for emotion tags with Euclidean regression for VAD predictions.
  • The framework uses controlled data augmentation and mixture sampling to optimize performance metrics while ensuring reproducibility and operational efficiency.

EmoLoom-2B is a lightweight, reproducible pipeline designed to convert small-scale LLMs (SLMs) with under 2 billion parameters into robust joint emotion classifiers and Valence-Arousal-Dominance (VAD) regressors. The framework centers on protocol-faithful implementation, enforcing a strict JSON input-output contract, deterministic decoding, and semantic regularization. Targeted at rapid evaluation and budget-aware screening, EmoLoom-2B systematically eliminates common sources of avoidable variance while maximizing both coverage and format reliability in emotion understanding tasks (Li et al., 3 Jan 2026).

1. Protocol-True Pipeline Design

EmoLoom-2B is architected around a unified JSON interaction "contract" for both training and inference. Each input utterance x is processed by a fixed prompt requiring output as a single line of JSON containing three elements: a multi-hot list of emotion labels ("labels"); a dictionary of VAD values ("vad"), where v, a, d ∈ [0, 1] rounded to two decimals; and a concise English rationale. Example output:

{"labels":["disgust"],"vad":{"v":0.42,"a":0.21,"d":0.49},"rationale":"tone of displeasure"}

Parse validity is rigorously enforced via tail-scanning for JSON structure; only outputs marked ParseOK=1 are included in downstream metric calculations (Macro-F1, Macro-R, 1–RMSE VAD), and the ParseOK rate itself is reported to quantify formatter reliability.
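
For illustration, a minimal tail-scanning validator of this kind might look as follows (function and constant names are illustrative; the paper's reference parser is not reproduced here):

import json

REQUIRED_KEYS = {"labels", "vad", "rationale"}

def tail_parse(output: str):
    # Scan backwards from the last '}' for an opening '{' that yields a
    # contract-conforming JSON object; returns (parse_ok, obj).
    end = output.rfind("}")
    start = output.rfind("{", 0, end) if end != -1 else -1
    while start != -1:
        try:
            obj = json.loads(output[start:end + 1])
        except json.JSONDecodeError:
            start = output.rfind("{", 0, start)
            continue
        vad = obj.get("vad", {}) if isinstance(obj, dict) else {}
        if (isinstance(obj, dict) and REQUIRED_KEYS.issubset(obj)
                and all(isinstance(vad.get(k), (int, float)) and 0.0 <= vad[k] <= 1.0
                        for k in ("v", "a", "d"))):
            return 1, obj
        start = output.rfind("{", 0, start)
    return 0, None

Only outputs with parse_ok = 1 would enter the metric pool, and the mean of parse_ok over the evaluation set corresponds to the reported ParseOK rate.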

All decoding utilizes the KV-off paradigm: use_cache=false, deterministic hyperparameters (temperature=0, top_p=1.0), and greedy next-token selection—the same schema for both self-distillation and inference. This approach neutralizes implementation-dependent discrepancies in key-value cache handling across hardware/software platforms, ensuring that measured improvements reflect model fidelity rather than artifacts of decoding strategy.
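
With the Hugging Face Transformers API, the KV-off configuration corresponds roughly to the sketch below (model path, prompt handling, and dtype choices are assumptions; the reference implementation may differ):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen-1_8B-Chat"   # backbone named in the paper; exact path assumed

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval()

def generate_json(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1536)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=64,   # generation cutoff reported in the paper
            do_sample=False,     # greedy selection (temperature=0, top_p=1.0 equivalent)
            use_cache=False,     # "KV-off": disable key-value caching
        )
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)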

2. Loss Architecture and Semantic Regularization

The foundational objective combines multi-label binary cross-entropy for emotion tags and Euclidean (ℓ₂) regression for VAD prediction:

L_\text{cls} = \frac{1}{K} \sum_{k=1}^{K} \left[ -y_k \log p_k - (1-y_k) \log(1-p_k) \right]

L_\text{reg} = \| v - \hat{v} \|_2^2
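
In PyTorch terms, the two base terms might be computed as follows (tensor shapes and names are illustrative):

import torch
import torch.nn.functional as F

def base_losses(logits, y, vad_pred, vad_gold):
    # logits:   (batch, K) raw scores for the K emotion tags
    # y:        (batch, K) multi-hot gold labels in {0, 1}
    # vad_pred, vad_gold: (batch, 3) predicted / gold (v, a, d) in [0, 1]
    l_cls = F.binary_cross_entropy_with_logits(logits, y.float())   # mean BCE over tags
    l_reg = ((vad_pred - vad_gold) ** 2).sum(dim=-1).mean()         # ||v - v_hat||_2^2
    return l_cls, l_reg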

Three orthogonal semantic regularizers further constrain the model (a code sketch of all three follows the list):

  • VAD-Preserving Consistency: Outputs are mapped onto VAD space using NRC-VAD lexicon lookups for each token, and an aggregated text VAD vector v_text(a) is computed. The additional loss

L_\text{vad} = \| v_\text{text}(a) - \hat{v} \|_2

enforces continuity between surface generation and numeric affect prediction.

  • Lightweight Appraisal-Atom Verifier: A compact classifier f_app predicts scores s ∈ [0, 1]^M for discrete appraisal atoms (goal attainment, controllability, certainty, fairness) given (x, a). Targets s̃ ∈ {0, 1}^M are derived from gold labels or heuristics; the regularization penalty is

L_\text{app} = \frac{1}{M} \sum_{m=1}^{M} \left[ -\tilde{s}_m \log s_m - (1-\tilde{s}_m) \log(1-s_m) \right]

This is a soft constraint, active only during training; it requires neither expanded explanations nor modification of the generator.

  • Valence Flip Symmetry: By constructing lexically mirrored pairs (x, x') that swap polarity-laden tokens (e.g., "great" ↔ "terrible"), the model is trained to output valence scores symmetric around 0.5:

L_\text{flip} = \left| (v(x) - 0.5) + (v(x') - 0.5) \right|
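
A minimal sketch of the three regularizers follows, assuming lexicon hits are averaged to form v_text(a) and that the verifier is trained with standard BCE; both the aggregation rule and the target handling are simplifications relative to whatever the paper specifies:

import torch
import torch.nn.functional as F

def text_vad(tokens, nrc_vad, default=(0.5, 0.5, 0.5)):
    # Average the NRC-VAD entries of tokens found in the lexicon (averaging is an
    # assumed aggregation rule); fall back to a neutral point if nothing matches.
    hits = [nrc_vad[t] for t in tokens if t in nrc_vad]
    if not hits:
        return torch.tensor(default)
    return torch.tensor(hits, dtype=torch.float32).mean(dim=0)

def l_vad(rationale_tokens, nrc_vad, vad_pred):
    # L_vad = || v_text(a) - v_hat ||_2 for one generated answer a.
    return torch.linalg.vector_norm(text_vad(rationale_tokens, nrc_vad) - vad_pred)

def l_app(app_logits, app_targets):
    # L_app: mean BCE over the M appraisal atoms; a training-only soft constraint.
    return F.binary_cross_entropy_with_logits(app_logits, app_targets.float())

def l_flip(v_x, v_x_flipped):
    # L_flip = |(v(x) - 0.5) + (v(x') - 0.5)| for a polarity-mirrored pair.
    return torch.abs((v_x - 0.5) + (v_x_flipped - 0.5))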

The total objective is a weighted sum:

L = \lambda_\text{cls} L_\text{cls} + \lambda_\text{reg} L_\text{reg} + \lambda_\text{vad} L_\text{vad} + \lambda_\text{app} L_\text{app} + \lambda_\text{flip} L_\text{flip}

Weights are selected via a development-set sweep (e.g., λ_vad ≈ 1.0, λ_app ≈ 0.5, λ_flip ≈ 0.3).
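
Putting the pieces together (the λ_cls and λ_reg defaults below are assumptions; the paper reports only the swept regularizer weights):

def total_loss(l_cls, l_reg, l_vad, l_app, l_flip,
               lam_cls=1.0, lam_reg=1.0, lam_vad=1.0, lam_app=0.5, lam_flip=0.3):
    # Weighted sum of the five objectives; lam_cls / lam_reg values are assumed.
    return (lam_cls * l_cls + lam_reg * l_reg
            + lam_vad * l_vad + lam_app * l_app + lam_flip * l_flip)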

3. Data Augmentation and Mixture Sampling

EmoLoom-2B employs Valence Flip augmentation, introducing polarity-mirrored pairs through lexical substitution or scenario rewrites and training on both x and x' simultaneously to enforce inversion behavior. This technique is found to mitigate valence drift and to encourage robust polarity mapping.
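
A toy flip-pair constructor based on lexical substitution might look as follows (the mirror table is a placeholder, not the paper's substitution resource):

# Toy polarity-mirror table; the actual substitution resource is not specified here.
POLARITY_MIRROR = {
    "great": "terrible", "terrible": "great",
    "love": "hate", "hate": "love",
    "happy": "sad", "sad": "happy",
}

def flip_valence(text: str) -> str:
    # Build x' from x by swapping polarity-laden tokens; everything else is kept.
    return " ".join(POLARITY_MIRROR.get(tok.lower(), tok) for tok in text.split())

# Both x and flip_valence(x) enter the training stream, and L_flip ties their
# predicted valences symmetrically around 0.5.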

Training proceeds via A/B mixture sampling, interleaving GoEmotions (“A”) and EmpatheticDialogues (“B”) at controlled ratios (w_A : w_B studied at 20:80, 50:50, and 80:20). The selection probability for each batch is governed by

p(s) = \text{softmax} \left( \frac{w_s / \text{conf}_s}{T_t} \right), \quad s \in \{A, B\}

where conf_s is a running entropy-based confidence estimate and the temperature T_t cools linearly with the training step. Early epochs promote diversity through a high T_t, while late-phase training consolidates on the target ratio. Empirically, the 20:80 ratio yields the best trade-off between Macro-F1 and VAD accuracy.
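
The sampler can be sketched as follows (the linear cooling endpoints t_start and t_end are assumptions):

import math
import random

def source_probs(weights, conf, temperature):
    # Softmax over w_s / conf_s, scaled by the current temperature T_t.
    scores = {s: (weights[s] / conf[s]) / temperature for s in weights}
    z = sum(math.exp(v) for v in scores.values())
    return {s: math.exp(v) / z for s, v in scores.items()}

def sample_source(step, total_steps, weights, conf, t_start=2.0, t_end=0.5):
    # Temperature cools linearly with the training step; the endpoints are assumed.
    frac = step / max(total_steps, 1)
    temperature = t_start + frac * (t_end - t_start)
    probs = source_probs(weights, conf, temperature)
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

# Example: target 20:80 mix with running confidence estimates conf_A and conf_B.
# src = sample_source(step, total_steps, {"A": 0.2, "B": 0.8},
#                     {"A": conf_A, "B": conf_B})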

The supervised fine-tuning loop incorporates out-of-memory (OOM) remediation (max_len reduction, gradient accumulation bump) and is detailed algorithmically in the original publication.
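
A rough sketch of the OOM-healing retry logic (the step callback and retry policy are illustrative; the paper provides the exact algorithm):

import torch

def train_step_with_oom_healing(run_step, batch, max_len=1536, grad_accum=128, retries=2):
    # Retry the SFT step after a CUDA OOM, shrinking max_len and bumping
    # gradient accumulation; run_step is a placeholder for the real step function.
    for _ in range(retries + 1):
        try:
            return run_step(batch, max_len=max_len, grad_accum=grad_accum)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            max_len = max(256, max_len // 2)   # max_len reduction
            grad_accum *= 2                    # gradient-accumulation bump
    raise RuntimeError("step failed after OOM remediation attempts")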

4. Experimental Configuration and Backbone Selection

For backbone evaluation, two ~1.8B-parameter models (Qwen-1.8B-Chat and InternLM2-1.8B-SFT) were screened under the identical KV-off protocol within a one-hour budget; Qwen-1.8B-Chat was selected for downstream use due to a +1σ lead in composite metrics. Training was conducted on a single GPU (≥24 GB VRAM) using PyTorch 2.3.1 with bf16/TF32 precision, gradient checkpointing, AdamW (β₁=0.9, β₂=0.95, weight_decay=0.1), batch size 1 accumulated to an effective batch of ≈128, a cosine-decay learning rate (peak ~1–2e-5), input truncation at 1536 tokens, and a generation cutoff of 64 tokens. Deterministic seeds and full configs/hashes ensure auditability and reproducibility across runs.
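
Expressed with PyTorch and Transformers utilities, the reported optimizer and schedule correspond roughly to the sketch below (the warm-up length and scheduler helper are assumptions):

import torch
from transformers import get_cosine_schedule_with_warmup

def build_training_setup(model, total_steps, peak_lr=2e-5):
    # AdamW + cosine decay matching the reported settings; peak_lr sits in the
    # reported 1-2e-5 range, and the warm-up length (here 0) is an assumption.
    torch.backends.cuda.matmul.allow_tf32 = True     # TF32 matmuls
    torch.backends.cudnn.allow_tf32 = True
    model.gradient_checkpointing_enable()            # fit ~1.8B params in >=24 GB VRAM

    optimizer = torch.optim.AdamW(
        model.parameters(), lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.1
    )
    scheduler = get_cosine_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=total_steps
    )
    accum_steps = 128   # micro-batch 1 accumulated to an effective batch of ~128
    return optimizer, scheduler, accum_steps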

5. Task Evaluation, Metrics, and Ablation

Development experiments (GoEmotions + EmpatheticDialogues) compared three A:B mixture ratios, with results summarized in Table 1 (JSON-valid outputs only):

Mix Ratio    Macro-F1    Macro-R    VAD (1–RMSE)
20:80        0.3500†     0.2693     0.9417†
50:50        0.3470‡     0.2657‡    0.9337‡
80:20        0.3341      0.2509     0.9135

† denotes best; ‡ denotes second best.

The 20:80 mix converged to the lowest loss (≈0.1604 at step 90) and delivered the highest joint performance for multi-label classification and VAD regression.

Cross-corpus generalization was vetted in a one-hour budgeted run on DailyDialog, with the 20:80 model achieving Macro-F1 ≈ 0.3071, VAD (1–RMSE) ≈ 0.8066, and ParseOK ≈ 0.976 (N_val = 6,261).
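
Given the ParseOK-filtered predictions, the reported numbers reduce to standard multi-label and regression metrics; a sketch using scikit-learn follows (pooling RMSE across the three VAD dimensions is an assumption):

import numpy as np
from sklearn.metrics import f1_score, recall_score

def emotion_metrics(y_true, y_pred, vad_true, vad_pred):
    # y_true, y_pred: (N, K) multi-hot arrays over JSON-valid outputs only.
    # vad_true, vad_pred: (N, 3) arrays of (v, a, d) values in [0, 1].
    macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
    macro_r = recall_score(y_true, y_pred, average="macro", zero_division=0)
    rmse = np.sqrt(np.mean((np.asarray(vad_true) - np.asarray(vad_pred)) ** 2))
    return {"Macro-F1": macro_f1, "Macro-R": macro_r, "VAD (1-RMSE)": 1.0 - rmse}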

Ablation studies revealed the following impacts:

  • Excluding L_vad: VAD error increased (by 0.006–0.012) and Macro-F1 decreased (by 0.004–0.007), with more valence drift under flip augmentation.
  • Excluding L_app: Macro-F1 decreased by 0.010–0.015, with a pronounced effect on fairness and controllability categories; VAD was unchanged.
  • Removing flip pairs: The flip-symmetry metric S_flip roughly doubled (≈0.06 → 0.11), and Macro-F1 dropped by ≈0.003.
  • Deviations from linear temperature cooling reduced coverage or convergence on high-entropy samples.
  • Mix ratio and semantic regularizer weights exhibited ranges of broad optimality, indicating robustness to these hyperparameters.

6. Operational Characteristics, Auditability, and Recommendations

EmoLoom-2B implements a rapid “quick-eval” audit mode that caps wall-clock time (e.g., 60 min) and records ETA, enabling standardized backbone comparison under strict protocol constraints. Full auditability is achieved through a data-split manifest with SHA-1 checksums, deterministic seeds, OOM-healing routines, KV-off decoding, and prompt standardization. Re-entrancy guarantees identical metrics, outputs, and logs across repeated runs.
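
A budget-capped quick-eval wrapper with a SHA-1 split manifest might be sketched as follows (manifest layout and file handling are assumptions):

import hashlib
import json
import time
from pathlib import Path

def quick_eval(examples, evaluate_one, split_files=(), budget_minutes=60,
               manifest_path="manifest.json", seed=42):
    # Write a manifest (SHA-1 per split file plus the seed), then evaluate until
    # the wall-clock budget is exhausted, tracking an ETA estimate for the logs.
    manifest = {
        "seed": seed,
        "splits": {str(p): hashlib.sha1(Path(p).read_bytes()).hexdigest()
                   for p in split_files},
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))

    start, results, eta = time.time(), [], 0.0
    for i, ex in enumerate(examples, 1):
        results.append(evaluate_one(ex))
        elapsed = time.time() - start
        eta = elapsed / i * (len(examples) - i)   # remaining-time estimate
        if elapsed > budget_minutes * 60:         # hard wall-clock cap
            break
    return results, eta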

Practically, EmoLoom-2B is recommended as an initial screening filter for backbone selection and preliminary validation of joint emotion/VAD capabilities, preceding investment in larger-scale or multimodal architectures. The semantic regularizers (VAD consistency, appraisal verifier, valence flip) add minimal computational burden yet provide measurable gains in robustness and format reliability.

In summary, EmoLoom-2B delivers a reproducible, minimal-overhead pipeline for small-model emotion understanding, integrating strict JSON protocols, fair decoding, lexicon-weak supervision, and semantic constraints. Its performance and operational traits position it as a dependable resource for constrained screening and prototyping in affective language research (Li et al., 3 Jan 2026).
