Emotion-Weighted Flan-T5 Embeddings
- The paper presents a hierarchical fusion framework that integrates Flan-T5 semantic embeddings with emotion lexicon projections to capture nuanced emotional cues.
- It employs multi-level attention and pooling strategies to combine phrase, sentence, and session-level information, ensuring robust affective representation.
- Applications in psychotherapy chatbots and affective music generation demonstrate its potential to improve emotionally congruent response generation and expressive control.
Emotion-weighted Flan-T5 text embeddings constitute a family of neural text representations designed to encode both semantic and affective information in a unified embedding space. These embeddings apply hierarchical fusion and emotion lexicon integration to Flan-T5–based semantic encodings, producing dense vectors sensitive to emotional cues at the phrase, sentence, and session/document levels. They can be directly injected into downstream models for contexts requiring affective nuance, such as psychotherapy chatbots or emotion-controllable music generation applications. Two influential works, "Emotion-Aware Embedding Fusion in LLMs (Flan-T5, LLAMA 2, DeepSeek-R1, and ChatGPT 4) for Intelligent Response Generation" (Rasool et al., 2 Oct 2024) and "SyMuPe: Affective and Controllable Symbolic Music Performance" (Borovik et al., 5 Nov 2025), present complementary frameworks for constructing and deploying these embeddings in practice, each leveraging Flan-T5 as a core semantic backbone and introducing specialized fusion, conditioning, and weighting strategies to achieve emotion awareness.
1. Hierarchical Emotion-Semantic Fusion Frameworks
Central to both cited works is the explicit decomposition of textual documents or queries into nested hierarchical levels, at which semantic features from models such as Flan-T5 are interleaved with projected emotion cues. For psychotherapy transcript modeling (Rasool et al., 2 Oct 2024), the process operates at three tiers: phrase (or word), sentence, and session. Each phrase or clause in a transcript receives:
- a semantic embedding via the pre-trained Flan-T5 encoder,
- an emotion embedding by aggregating and projecting scores from multiple lexica (NRC, VADER, WordNet, SentiWordNet).
These vectors are concatenated and passed through a dedicated fusion network at each hierarchy level:

$$h_{\text{phrase}} = \sigma\!\left(W_{\text{phrase}}\,[\,e_{\text{sem}}\,;\,e_{\text{emo}}\,] + b_{\text{phrase}}\right),$$

with analogous routines for sentence and session levels.
This approach generalizes naturally to other domains where documents have explicit or latent segmentations.
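A minimal PyTorch sketch of such a phrase-level fusion module is given below; the dimensions (`d_sem`, `d_emo`, `d_out`), the module name `PhraseFusion`, and the ReLU activation are illustrative assumptions, as the paper does not pin these choices down:

```python
import torch
import torch.nn as nn

class PhraseFusion(nn.Module):
    """Fuse a Flan-T5 semantic embedding with a projected emotion embedding."""

    def __init__(self, d_sem: int = 768, d_emo: int = 64, d_out: int = 768):
        super().__init__()
        # Concatenated input -> dense fused representation.
        self.fuse = nn.Sequential(nn.Linear(d_sem + d_emo, d_out), nn.ReLU())

    def forward(self, e_sem: torch.Tensor, e_emo: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([e_sem, e_emo], dim=-1))

# Analogous modules would be instantiated for the sentence and session tiers.
phrase_fusion = PhraseFusion()
h_phrase = phrase_fusion(torch.randn(1, 768), torch.randn(1, 64))
```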
2. Emotion Lexicon Augmentation and Projection
The emotion-weighted aspect is realized through the comprehensive integration of standardized affective lexica. At the word or phrase level, the framework consults:
- NRC Emotion Lexicon ($x_{\text{NRC}}$),
- VADER ($x_{\text{VADER}}$),
- WordNet synset features ($x_{\text{WN}}$, dimensionality unspecified),
- SentiWordNet ($x_{\text{SWN}}$).
The concatenated raw vector $x_{\text{lex}} = [\,x_{\text{NRC}}\,;\,x_{\text{VADER}}\,;\,x_{\text{WN}}\,;\,x_{\text{SWN}}\,]$ is mapped into a low-dimensional dense space via a linear layer and activation,

$$e_{\text{emo}} = \sigma\!\left(W_{\text{emo}}\, x_{\text{lex}} + b_{\text{emo}}\right),$$

where $\sigma$ is typically ReLU or tanh; the emotion embedding dimension $d_{\text{emo}}$ is not specified but defaults to a typical value such as $64$ or $128$.
This systematic lexicon fusion enables the embedding layer to encode nuanced affective signals beyond what contemporary LLM tokenizers supply by default.
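A compact sketch of this projection step, again assuming PyTorch; the per-lexicon lookup functions and their output dimensionalities are hypothetical stand-ins for real queries against the NRC, VADER, WordNet, and SentiWordNet resources:

```python
import torch
import torch.nn as nn

# Hypothetical per-lexicon scorers; real implementations would query the
# actual lexical resources. All dimensionalities below are assumptions.
def lookup_nrc(phrase: str) -> torch.Tensor:           # 10 emotion/sentiment dims
    return torch.zeros(10)

def lookup_vader(phrase: str) -> torch.Tensor:         # neg/neu/pos/compound
    return torch.zeros(4)

def lookup_wordnet(phrase: str) -> torch.Tensor:       # synset-derived features
    return torch.zeros(8)

def lookup_sentiwordnet(phrase: str) -> torch.Tensor:  # pos/neg/objectivity
    return torch.zeros(3)

class EmotionProjection(nn.Module):
    """Map concatenated raw lexicon scores to a dense emotion embedding."""

    def __init__(self, d_raw: int = 25, d_emo: int = 64):
        super().__init__()
        self.proj = nn.Linear(d_raw, d_emo)
        self.act = nn.ReLU()  # ReLU or tanh, per the paper

    def forward(self, phrase: str) -> torch.Tensor:
        x = torch.cat([lookup_nrc(phrase), lookup_vader(phrase),
                       lookup_wordnet(phrase), lookup_sentiwordnet(phrase)])
        return self.act(self.proj(x))
```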
3. Pooling, Attention, and Weighted Fusion Mechanisms
In the hierarchical fusion schema, variable-length sets (phrases, sentences) are aggregated using attention-based or pooling strategies. Self-attention, as operationalized at the token level,

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

serves to dynamically weight the contribution of emotionally salient tokens. Multi-head variants follow the standard Transformer formalism.
Phrase- and sentence-level pooling may adopt softmax-weighted summations,

$$s = \sum_i \alpha_i h_i, \qquad \alpha_i = \frac{\exp(w^{\top} h_i)}{\sum_j \exp(w^{\top} h_j)},$$

or simply average over the constituent vectors, enabling the model to propagate prominent emotional features upward in the hierarchy.
The final fused embedding, after combining the semantic ($h_{\text{sem}}$) and emotion ($h_{\text{emo}}$) summaries, is produced as

$$h_{\text{fused}} = \sigma\!\left(W_{\text{fuse}}\,[\,h_{\text{sem}}\,;\,h_{\text{emo}}\,] + b_{\text{fuse}}\right),$$

with the emotion component added into the original Flan-T5 embedding stack (token + position + emotion).
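The softmax-weighted pooling can be written compactly as below; the learned scoring vector and the PyTorch framing are illustrative assumptions rather than confirmed details of the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftmaxPooling(nn.Module):
    """Softmax-weighted summation over a variable-length set of vectors."""

    def __init__(self, d: int):
        super().__init__()
        self.score = nn.Linear(d, 1)  # learned scoring vector w

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (num_items, d) -> pooled summary of shape (d,)
        alpha = F.softmax(self.score(h).squeeze(-1), dim=0)
        return (alpha.unsqueeze(-1) * h).sum(dim=0)

pool = SoftmaxPooling(d=768)
h_sent = pool(torch.randn(12, 768))  # pool 12 phrase vectors into a sentence vector
```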
4. Applications in Psychotherapy and Symbolic Music Generation
Emotion-weighted Flan-T5 embeddings have been evaluated in divergent but affect-intensive domains. In psychotherapy chatbot construction (Rasool et al., 2 Oct 2024), they enable LLMs to retrieve, integrate, and generate responses that are simultaneously contextually relevant and affectively congruent, informed by similarity search over $L_2$-normalized embeddings indexed via FAISS.
In affective symbolic music performance (Borovik et al., 5 Nov 2025), SyMuPe/PianoFlow encodes emotion-driven control signals as follows:
- Defines a bank of thirty-three “prototype” Flan-T5 embeddings, each derived by averaging over sixteen prompt templates per emotion.
- At inference, an emotion classifier produces soft probability weights $p_k$ for each prototype per musical bar, yielding the weighted sum $e_{\text{emo}} = \sum_{k=1}^{33} p_k \, e_k$.
- The resulting embedding is concatenated with score- and performance-level control tokens, then injected at the mid-point of the transformer (between layers $4$ and $5$ of $8$).
- Classifier-free guidance and control dropout ($p = 0.2$) are used to prevent over-reliance on any control channel.
This design enables nuanced and context-appropriate expressive rendering, as confirmed by both qualitative listening and ablation studies.
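A schematic of the prototype mixing and control dropout follows; only the prototype count (33), the template averaging (16 per emotion), and the dropout rate ($p = 0.2$) come from the paper, while the tensors, dimensions, and function names here are placeholders:

```python
import torch

NUM_EMOTIONS, D = 33, 768  # 33 prototypes per the paper; D is an assumed dimension

# Prototype bank: each row would be the average Flan-T5 embedding of the
# sixteen prompt templates for one emotion (random placeholder here).
prototypes = torch.randn(NUM_EMOTIONS, D)

def emotion_condition(bar_probs: torch.Tensor, p_dropout: float = 0.2,
                      training: bool = True) -> torch.Tensor:
    """Per-bar soft mixing of emotion prototypes, with control dropout."""
    e = bar_probs @ prototypes                  # (num_bars, D) weighted sum
    if training and torch.rand(()) < p_dropout:
        e = torch.zeros_like(e)                 # drop the emotion control channel
    return e

bar_probs = torch.softmax(torch.randn(8, NUM_EMOTIONS), dim=-1)  # 8 bars
cond = emotion_condition(bar_probs)  # injected between transformer layers 4 and 5
```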
5. Training Objectives, Regularization, and Embedding Management
Both reference implementations fine-tune their models exclusively with standard next-token cross-entropy (for text) or conditional flow matching objectives (for sequence generation), with no auxiliary emotion-specific loss. In music generation, regularization is further supported by multi-mask sampling (random notes, bars, full sequences) to enhance robustness. Control dropout during training compels the model to generalize in both the presence and absence of emotional signals.
Embeddings used for retrieval or downstream integration are $L_2$-normalized prior to FAISS-based indexing, supporting high-performance dense similarity search at scale.
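A minimal FAISS sketch of this retrieval setup, with a random placeholder corpus; cosine similarity is realized as inner product over $L_2$-normalized vectors:

```python
import faiss
import numpy as np

d = 768                                               # embedding dimension (assumed)
corpus = np.random.rand(10_000, d).astype("float32")  # placeholder embeddings

faiss.normalize_L2(corpus)                  # in-place L2 normalization
index = faiss.IndexFlatIP(d)                # inner product == cosine after norm
index.add(corpus)

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)        # top-5 most similar embeddings
```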
6. Experimental Results and Empirical Insights
Performance metrics articulated in these studies include:
- For the music emotion classifier: top-1 accuracy 32%, top-3 57%, top-5 73% on a curated short-excerpt dataset (Borovik et al., 5 Nov 2025).
- In ablation, the absence of emotion-weighted embeddings reduces controllability to the unconditional baseline, indicating critical importance for stylistic diversity.
- In qualitative evaluations, prescribed emotion prompts (“anger”, “dreamy”, “thunderstorms and lightning”) are reflected in attribute-controllable generation (velocity, sustain, tempo).
No explicit numerical ablation is provided for the psychotherapy context, but the fusion architecture is empirically validated for producing empathetic, context-aware chatbot responses (Rasool et al., 2 Oct 2024).
7. Limitations, Unspecified Parameters, and Future Directions
Certain architectural and hyper-parameter choices are left unspecified (e.g., the emotion embedding dimension $d_{\text{emo}}$, the activation $\sigma$, exact pooling strategies at all hierarchy levels), with typical defaults suggested. No explicit regularization or auxiliary objectives tailored to emotion consistency are applied; a plausible implication is that further gains may accrue from such additions. Both works rely on the static coverage of emotion lexica and a fixed set of prompt templates, which may not exhaustively span all affective nuances in naturalistic data.
Continued work may investigate adaptive lexicon expansion, learnable template generation, and the integration of contrastive or consistency losses to further strengthen affectively grounded representations (see the sketch below). Cross-domain application, transferring embeddings between text, music, and other modalities, remains an active and promising area.
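As one illustration of that direction (not part of either cited work), a supervised contrastive emotion-consistency loss over fused embeddings might look as follows:

```python
import torch
import torch.nn.functional as F

def emotion_consistency_loss(z: torch.Tensor, labels: torch.Tensor,
                             tau: float = 0.1) -> torch.Tensor:
    """Pull same-emotion fused embeddings together, push others apart.

    z: (batch, d) L2-normalized embeddings; labels: (batch,) emotion ids.
    """
    sim = z @ z.T / tau                               # pairwise similarities
    pos = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-emotion mask
    pos.fill_diagonal_(False)                         # exclude self-pairs
    logits = sim - 1e9 * torch.eye(len(z), device=z.device)
    log_prob = F.log_softmax(logits, dim=1)
    per_anchor = -(log_prob * pos.float()).sum(1) / pos.sum(1).clamp(min=1)
    return per_anchor[pos.any(1)].mean()              # anchors with positives only
```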