
EmoLoom-2B: Lightweight Emotion Pipeline

Updated 10 January 2026
  • EmoLoom-2B is a pipeline that converts small-scale language models into joint emotion classifiers and VAD regressors using a strict JSON input-output contract.
  • It employs integrated loss functions and semantic regularization, combining binary cross-entropy for emotion tags with Euclidean regression for VAD predictions.
  • The framework uses controlled data augmentation and mixture sampling to optimize performance metrics while ensuring reproducibility and operational efficiency.

EmoLoom-2B is a lightweight, reproducible pipeline designed to convert small language models (SLMs) with under 2 billion parameters into robust joint emotion classifiers and Valence-Arousal-Dominance (VAD) regressors. The framework centers on protocol-faithful implementation, enforcing a strict JSON input-output contract, deterministic decoding, and semantic regularization. Targeted at rapid evaluation and budget-aware screening, EmoLoom-2B systematically eliminates common sources of avoidable variance while maximizing both coverage and format reliability in emotion understanding tasks (Li et al., 3 Jan 2026).

1. Protocol-True Pipeline Design

EmoLoom-2B is architected around a unified JSON interaction "contract" for both training and inference. Each input utterance $x$ is processed with a fixed prompt requiring output as a single line of JSON containing three elements: a multi-hot list of emotion labels ("labels"); a dictionary of VAD values ("vad") with $v, a, d \in [0,1]$ rounded to two decimals; and a concise English rationale. Example output:

{"labels": ["joy", "gratitude"], "vad": {"v": 0.82, "a": 0.56, "d": 0.61}, "rationale": "Warm appreciation for help received."}

Parse validity is rigorously enforced via tail-scanning for JSON structure; only outputs marked ParseOK=1 are included in downstream metric calculations (Macro-F1, Macro-R, VAD (1–RMSE)), and the ParseOK rate itself is reported to quantify formatter reliability.
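
The tail-scan check can be implemented in a few lines. The following is a minimal sketch, not the paper's code: the function name, the "rationale" key, and the balanced-brace heuristic are illustrative assumptions, and braces inside string values are not handled.

```python
import json

def tail_scan_json(generation: str):
    """Scan backward from the end of a generation for the last balanced
    {...} span and try to parse it against the contract.
    Returns (obj, parse_ok); braces inside string values are not handled."""
    end = generation.rfind("}")
    while end != -1:
        depth = 0
        for start in range(end, -1, -1):
            if generation[start] == "}":
                depth += 1
            elif generation[start] == "{":
                depth -= 1
                if depth == 0:
                    try:
                        obj = json.loads(generation[start:end + 1])
                        if {"labels", "vad"} <= set(obj):  # contract fields present
                            return obj, True
                    except json.JSONDecodeError:
                        pass
                    break
        end = generation.rfind("}", 0, end)
    return None, False

text = 'Sure! {"labels": ["joy"], "vad": {"v": 0.82, "a": 0.56, "d": 0.61}, "rationale": "Positive tone."}'
obj, parse_ok = tail_scan_json(text)
assert parse_ok  # only ParseOK=1 outputs enter the metric pool
```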

All decoding utilizes the KV-off paradigm: use_cache=false, deterministic hyperparameters (temperature=0, top_p=1.0), and greedy next-token selection—the same schema for both self-distillation and inference. This approach neutralizes implementation-dependent discrepancies in key-value cache handling across hardware/software platforms, ensuring that measured improvements reflect model fidelity rather than artifacts of decoding strategy.
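
A minimal decoding sketch under these settings, assuming the Hugging Face transformers API and Qwen/Qwen-1_8B-Chat as the backbone checkpoint (the actual prompt template is elided):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# KV-off deterministic decoding: greedy selection, no key-value cache,
# fixed generation cutoff.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-1_8B-Chat", torch_dtype=torch.bfloat16, trust_remote_code=True
).eval()

prompt = "..."  # fixed JSON-contract prompt plus the input utterance
inputs = tok(prompt, return_tensors="pt", truncation=True, max_length=1536)
with torch.no_grad():
    out = model.generate(
        **inputs,
        do_sample=False,    # greedy next-token selection (temperature=0, top_p=1.0)
        use_cache=False,    # KV-off: disable the key-value cache
        max_new_tokens=64,  # generation cutoff
    )
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```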

2. Loss Architecture and Semantic Regularization

The foundational objective combines multi-label binary cross-entropy for emotion tags and Euclidean ($\ell_2$) regression for VAD prediction:

L_\text{cls} = \frac{1}{K} \sum_{k=1}^{K} \left[ -y_k \log p_k - (1-y_k) \log(1-p_k) \right]

L_\text{reg} = \| v - \hat{v} \|_2^2
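
A PyTorch sketch of these two base terms, with assumed shapes (K label logits and a 3-dimensional VAD head per example):

```python
import torch
import torch.nn.functional as F

def base_losses(label_logits, vad_pred, y_multi_hot, vad_gold):
    """L_cls: multi-label BCE averaged over the K labels (and the batch).
    L_reg: squared Euclidean distance between predicted and gold VAD."""
    l_cls = F.binary_cross_entropy_with_logits(label_logits, y_multi_hot,
                                               reduction="mean")
    l_reg = ((vad_pred - vad_gold) ** 2).sum(dim=-1).mean()
    return l_cls, l_reg

K = 28  # e.g., the GoEmotions label inventory
l_cls, l_reg = base_losses(torch.randn(4, K), torch.rand(4, 3),
                           torch.randint(0, 2, (4, K)).float(), torch.rand(4, 3))
```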

Three orthogonal semantic regularizers further constrain the model:

  • VAD-Preserving Consistency: Outputs are mapped onto VAD space via NRC-VAD lexicon lookups over the tokens of the generated answer $a$, from which an aggregated text VAD vector $v_\text{text}(a)$ is computed. The additional loss

L_\text{vad} = \| v_\text{text}(a) - \hat{v} \|_2

enforces continuity between surface generation and numeric affect prediction.

  • Lightweight Appraisal-Atom Verifier: A compact classifier $f_\text{app}$ predicts scores $s \in [0,1]^M$ for discrete appraisal atoms (goal attainment, controllability, certainty, fairness) given $(x, a)$. Targets $s^\star$ are derived from gold labels or heuristics; the regularization penalty is

L_\text{app} = \| s - s^\star \|_2^2

This is a soft constraint active only during training; it requires neither expanded explanations nor modification of the generator.

  • Valence Flip Symmetry: By constructing lexically mirrored pairs $(x, x')$ that swap polarity-laden tokens (e.g., "great" $\leftrightarrow$ "terrible"), the model is trained to output valence scores symmetric around 0.5:

L_\text{flip} = \left( \hat{v}(x) + \hat{v}(x') - 1 \right)^2

The total objective is a weighted sum:

L = L_\text{cls} + \lambda_\text{reg} L_\text{reg} + \lambda_\text{vad} L_\text{vad} + \lambda_\text{app} L_\text{app} + \lambda_\text{flip} L_\text{flip}

Weights are selected via a development-set sweep over $\lambda_\text{reg}$, $\lambda_\text{vad}$, $\lambda_\text{app}$, and $\lambda_\text{flip}$.
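
A PyTorch sketch of the weighted total objective; the default lambda values below are illustrative placeholders, not the swept values from the paper.

```python
import torch

def flip_symmetry_loss(v_x: torch.Tensor, v_x_flip: torch.Tensor) -> torch.Tensor:
    # Valence of a mirrored pair should be symmetric around 0.5: v(x) + v(x') ≈ 1.
    return ((v_x + v_x_flip - 1.0) ** 2).mean()

def total_loss(l_cls, l_reg, l_vad, l_app, l_flip,
               lam_reg=1.0, lam_vad=0.1, lam_app=0.1, lam_flip=0.1):
    """Weighted sum of the base terms and the three semantic regularizers.
    The lambda defaults are illustrative, not the development-sweep values."""
    return (l_cls + lam_reg * l_reg + lam_vad * l_vad
            + lam_app * l_app + lam_flip * l_flip)
```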

3. Data Augmentation and Mixture Sampling

EmoLoom-2B utilizes Valence Flip augmentation by introducing polarity-mirrored pairs through lexical substitution or scenario rewrite, training simultaneously on both $x$ and its mirrored counterpart $x'$ to enforce inversion behavior. This technique is found to mitigate valence drift and encourage robust polarity mapping.
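
A minimal lexical-substitution sketch; the antonym table and whitespace tokenization are illustrative stand-ins for whatever substitution and rewrite resources the paper actually uses.

```python
# Illustrative polarity-swap table (not the paper's resource).
ANTONYMS = {"great": "terrible", "terrible": "great",
            "happy": "sad", "sad": "happy",
            "love": "hate", "hate": "love"}

def valence_flip(utterance: str) -> str:
    """Build the mirrored counterpart x' by swapping polarity-laden tokens."""
    return " ".join(ANTONYMS.get(tok.lower(), tok) for tok in utterance.split())

x = "I had a great day and I love this place"
x_flip = valence_flip(x)  # "I had a terrible day and I hate this place"
```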

Training proceeds via A/B mixture sampling, interleaving GoEmotions ("A") and EmpatheticDialogues ("B") at controlled A:B ratios (20:80, 50:50, and 80:20 were studied). Selection probability for each batch is governed by

P(s) = \frac{\exp\left( (\log \pi_s + c_s)/\tau \right)}{\sum_{s' \in \{A, B\}} \exp\left( (\log \pi_{s'} + c_{s'})/\tau \right)}

where $\pi_s$ is the target mixture weight for source $s \in \{A, B\}$, $c_s$ is a running entropy-based confidence estimate, and the temperature $\tau$ cools linearly with the training step. Early epochs promote diversity through high $\tau$, while late-phase training consolidates on the target ratio. Empirically, the 20:80 ratio yields the best trade-off between Macro-F1 and VAD accuracy.
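
A sampling sketch consistent with this description; the tempered-softmax form, schedule endpoints, and all names here are assumptions rather than the paper's exact algorithm.

```python
import math
import random

def sample_source(step: int, total_steps: int,
                  target_ratio=(0.2, 0.8),   # 20:80 A:B target mix
                  conf=(0.0, 0.0),           # running confidence estimates c_s
                  tau_start=5.0, tau_end=1.0) -> str:
    """Tempered choice between corpora A and B. High tau (early steps)
    flattens the distribution toward uniform; as tau cools linearly to
    tau_end, sampling consolidates on the confidence-adjusted target ratio."""
    tau = tau_start + (tau_end - tau_start) * step / total_steps
    logits = [(math.log(r) + c) / tau for r, c in zip(target_ratio, conf)]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    return random.choices(["A", "B"], weights=weights, k=1)[0]

batch_source = sample_source(step=0, total_steps=100)  # near-uniform early on
```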

The supervised fine-tuning loop incorporates out-of-memory (OOM) remediation (reducing max_len and increasing gradient accumulation) and is detailed algorithmically in the original publication.
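
A sketch of such an OOM-healing wrapper; train_one_step is a hypothetical callable standing in for the actual SFT step, and the halving/doubling policy is an assumed instance of the described remediation.

```python
import torch

def run_step_with_oom_healing(train_one_step, max_len=1536, grad_accum=128):
    """Retry a training step after CUDA OOM, shrinking the input length
    and bumping gradient accumulation each time."""
    while True:
        try:
            return train_one_step(max_len=max_len, grad_accum=grad_accum)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            max_len //= 2       # max_len reduction
            grad_accum *= 2     # gradient-accumulation bump
            if max_len < 128:   # give up below a sane floor
                raise
```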

4. Experimental Configuration and Backbone Selection

For backbone evaluation, two ~1.8B-parameter models (Qwen-1.8B-Chat and InternLM2-1.8B-SFT) were screened under the identical KV-off protocol within a one-hour budget; Qwen-1.8B-Chat was selected for downstream use on the strength of a +1σ lead in composite metrics. Training was conducted on a single GPU (≥24 GB VRAM) using PyTorch 2.3.1 with bf16/TF32 precision, gradient checkpointing, AdamW (β₁=0.9, β₂=0.95, weight_decay=0.1), batch size 1 with accumulation to an effective batch of ≈128, a cosine-decay learning rate (peak ~1–2e-5), input truncation at 1536 tokens, and a generation cutoff at 64 tokens. Deterministic seeds and full configuration hashes ensure auditability and reproducibility across runs.
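
The optimizer and schedule portion of this configuration might look as follows; the checkpoint id, total step count, and warmup fraction are placeholders.

```python
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-1_8B-Chat", torch_dtype=torch.bfloat16, trust_remote_code=True)
model.gradient_checkpointing_enable()           # trade compute for memory
torch.backends.cuda.matmul.allow_tf32 = True    # TF32 matmuls

optimizer = AdamW(model.parameters(), lr=2e-5,  # peak LR in the ~1-2e-5 range
                  betas=(0.9, 0.95), weight_decay=0.1)
total_steps = 1_000                             # placeholder
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.03 * total_steps),   # warmup fraction assumed
    num_training_steps=total_steps)
```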

5. Task Evaluation, Metrics, and Ablation

Development experiments (GoEmotions + EmpatheticDialogues) compared three A:B mixture ratios, with results summarized in Table 1 (JSON-valid outputs only):

Mix Ratio (A:B)    Macro-F1    Macro-R    VAD (1–RMSE)
20:80              0.3500†     0.2693     0.9417†
50:50              0.3470‡     0.2657‡    0.9337‡
80:20              0.3341      0.2509     0.9135

† denotes best; ‡ denotes second best.

The 20:80 mix converged to the lowest loss (≈0.1604 at step 90) and delivered the highest joint performance for multi-label classification and VAD regression.

Cross-corpus generalization was assessed in a one-hour budgeted run on DailyDialog, where the 20:80 model achieved Macro-F1 ≈ 0.3071, VAD (1–RMSE) ≈ 0.8066, and ParseOK ≈ 0.976 (N = 6,261).
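
A sketch of the metric computation over JSON-valid outputs, using scikit-learn; array shapes are assumed (multi-hot label matrices and N×3 VAD arrays), and the function name is illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score, mean_squared_error

def evaluate(y_true, y_pred, vad_true, vad_pred, parse_ok):
    """Compute Macro-F1, Macro-R, and VAD (1 - RMSE) over outputs with
    ParseOK=1 only, and report the ParseOK rate itself."""
    keep = parse_ok.astype(bool)
    macro_f1 = f1_score(y_true[keep], y_pred[keep], average="macro", zero_division=0)
    macro_r = recall_score(y_true[keep], y_pred[keep], average="macro", zero_division=0)
    rmse = np.sqrt(mean_squared_error(vad_true[keep], vad_pred[keep]))
    return {"Macro-F1": macro_f1, "Macro-R": macro_r,
            "VAD(1-RMSE)": 1.0 - rmse, "ParseOK": float(keep.mean())}
```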

Ablation studies revealed the following impacts:

  • Excluding $L_\text{vad}$: VAD error increased by 0.006–0.012 and Macro-F1 decreased by 0.004–0.007, with more valence drift under flip augmentation.
  • Excluding $L_\text{app}$: Macro-F1 decreased by 0.010–0.015, with pronounced effects on the fairness and controllability categories; VAD was unchanged.
  • Removing flip pairs: the flip-symmetry error roughly doubled (≈0.06 → 0.11) and Macro-F1 dropped by ≈0.003.
  • Deviations from linear temperature cooling reduced coverage or convergence on high-entropy samples.
  • Mix ratio and semantic regularizer weights exhibited ranges of broad optimality, indicating robustness to these hyperparameters.

6. Operational Characteristics, Auditability, and Recommendations

EmoLoom-2B implements a rapid "quick-eval" audit mode, capping wall-clock time (e.g., 60 min) and recording an ETA to enable standardized backbone comparison under strict protocol constraints. Full auditability is achieved through a data-split manifest with SHA-1 checksums, deterministic seeds, OOM-healing routines, KV-off decoding, and prompt standardization. Re-entrancy guarantees identical metrics, outputs, and logs across repeated runs.
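
A sketch of such an audit manifest, assuming split files on disk; the paths, manifest layout, and function names are illustrative.

```python
import hashlib
import json

def sha1_of_file(path: str) -> str:
    """Stream a file through SHA-1 in 1 MiB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(split_paths: dict, seed: int, out: str = "manifest.json"):
    """Record the seed and per-split SHA-1 checksums so a re-run can
    verify it consumed byte-identical inputs."""
    manifest = {"seed": seed,
                "splits": {name: sha1_of_file(p) for name, p in split_paths.items()}}
    with open(out, "w") as f:
        json.dump(manifest, f, indent=2)

# e.g. write_manifest({"train": "data/train.jsonl", "dev": "data/dev.jsonl"}, seed=42)
```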

Practically, EmoLoom-2B is recommended as an initial screening filter for backbone selection and preliminary validation of joint emotion/VAD capabilities, preceding investment in larger-scale or multimodal architectures. The semantic regularizers (VAD consistency, appraisal verifier, valence flip) add minimal computational burden yet provide measurable gains in robustness and format reliability.

In summary, EmoLoom-2B delivers a reproducible, minimal-overhead pipeline for small-model emotion understanding, integrating strict JSON protocols, fair decoding, lexicon-weak supervision, and semantic constraints. Its performance and operational traits position it as a dependable resource for constrained screening and prototyping in affective language research (Li et al., 3 Jan 2026).
