Emotion-Weighted Flan-T5 Embeddings
- The paper presents a hierarchical fusion framework that integrates Flan-T5 semantic embeddings with emotion lexicon projections to capture nuanced emotional cues.
- It employs multi-level attention and pooling strategies to combine phrase, sentence, and session-level information, ensuring robust affective representation.
- Applications in psychotherapy chatbots and affective music generation demonstrate its potential to improve emotionally congruent response generation and expressive control.
Emotion-weighted Flan-T5 text embeddings constitute a family of neural text representations designed to encode both semantic and affective information in a unified embedding space. These embeddings apply hierarchical fusion and emotion lexicon integration to Flan-T5–based semantic encodings, producing dense vectors sensitive to emotional cues at the phrase, sentence, and session/document levels. They can be directly injected into downstream models for contexts requiring affective nuance, such as psychotherapy chatbots or emotion-controllable music generation applications. Two influential works, "Emotion-Aware Embedding Fusion in LLMs (Flan-T5, LLAMA 2, DeepSeek-R1, and ChatGPT 4) for Intelligent Response Generation" (Rasool et al., 2 Oct 2024) and "SyMuPe: Affective and Controllable Symbolic Music Performance" (Borovik et al., 5 Nov 2025), present complementary frameworks for constructing and deploying these embeddings in practice, each leveraging Flan-T5 as a core semantic backbone and introducing specialized fusion, conditioning, and weighting strategies to achieve emotion awareness.
1. Hierarchical Emotion-Semantic Fusion Frameworks
Central to both cited works is the explicit decomposition of textual documents or queries into nested hierarchical levels, at which semantic features from models such as Flan-T5 are interleaved with projected emotion cues. For psychotherapy transcript modeling (Rasool et al., 2 Oct 2024), the process operates at three tiers: phrase (or word), sentence, and session. Each phrase or clause in a transcript receives:
- a semantic embedding via the pre-trained Flan-T5 encoder,
- an emotion embedding by aggregating and projecting scores from multiple lexica (NRC, VADER, WordNet, SentiWordNet).
These vectors are concatenated and passed through a dedicated fusion network at each hierarchy level:

$$h_{\text{phrase}} = \sigma\!\left(W_{\text{phrase}}\,[\,e_{\text{sem}}\,;\,e_{\text{emo}}\,] + b_{\text{phrase}}\right),$$

with analogous routines for sentence and session levels.
This approach generalizes naturally to other domains where documents have explicit or latent segmentations.
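A minimal PyTorch sketch of such a phrase-level fusion module is given below; the dimensions (`d_sem`, `d_emo`, `d_out`), the module name `PhraseFusion`, and the ReLU activation are illustrative assumptions, as the paper does not pin these choices down:

```python
import torch
import torch.nn as nn

class PhraseFusion(nn.Module):
    """Fuse a Flan-T5 semantic embedding with a projected emotion embedding."""

    def __init__(self, d_sem: int = 768, d_emo: int = 64, d_out: int = 768):
        super().__init__()
        # Concatenated input -> dense fused representation.
        self.fuse = nn.Sequential(nn.Linear(d_sem + d_emo, d_out), nn.ReLU())

    def forward(self, e_sem: torch.Tensor, e_emo: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([e_sem, e_emo], dim=-1))

# Analogous modules would be instantiated for the sentence and session tiers.
phrase_fusion = PhraseFusion()
h_phrase = phrase_fusion(torch.randn(1, 768), torch.randn(1, 64))
```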
2. Emotion Lexicon Augmentation and Projection
The emotion-weighted aspect is realized through the comprehensive integration of standardized affective lexica. At the word or phrase level, the framework consults:
- NRC Emotion Lexicon ($x_{\text{NRC}}$),
- VADER ($x_{\text{VADER}}$),
- WordNet synset features ($x_{\text{WN}}$, dimensionality unspecified),
- SentiWordNet ($x_{\text{SWN}}$).
The concatenated raw vector $x_{\text{lex}} = [\,x_{\text{NRC}}\,;\,x_{\text{VADER}}\,;\,x_{\text{WN}}\,;\,x_{\text{SWN}}\,]$ is mapped into a low-dimensional dense space via a linear layer and activation,

$$e_{\text{emo}} = \sigma\!\left(W_{\text{emo}}\, x_{\text{lex}} + b_{\text{emo}}\right),$$

where $\sigma$ is typically ReLU or tanh; the emotion embedding dimension $d_{\text{emo}}$ is not specified but defaults to a typical value such as $64$ or $128$.
This systematic lexicon fusion enables the embedding layer to encode nuanced affective signals beyond what contemporary LLM tokenizers supply by default.
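A compact sketch of this projection step, again assuming PyTorch; the per-lexicon lookup functions and their output dimensionalities are hypothetical stand-ins for real queries against the NRC, VADER, WordNet, and SentiWordNet resources:

```python
import torch
import torch.nn as nn

# Hypothetical per-lexicon scorers; real implementations would query the
# actual lexical resources. All dimensionalities below are assumptions.
def lookup_nrc(phrase: str) -> torch.Tensor:           # 10 emotion/sentiment dims
    return torch.zeros(10)

def lookup_vader(phrase: str) -> torch.Tensor:         # neg/neu/pos/compound
    return torch.zeros(4)

def lookup_wordnet(phrase: str) -> torch.Tensor:       # synset-derived features
    return torch.zeros(8)

def lookup_sentiwordnet(phrase: str) -> torch.Tensor:  # pos/neg/objectivity
    return torch.zeros(3)

class EmotionProjection(nn.Module):
    """Map concatenated raw lexicon scores to a dense emotion embedding."""

    def __init__(self, d_raw: int = 25, d_emo: int = 64):
        super().__init__()
        self.proj = nn.Linear(d_raw, d_emo)
        self.act = nn.ReLU()  # ReLU or tanh, per the paper

    def forward(self, phrase: str) -> torch.Tensor:
        x = torch.cat([lookup_nrc(phrase), lookup_vader(phrase),
                       lookup_wordnet(phrase), lookup_sentiwordnet(phrase)])
        return self.act(self.proj(x))
```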
3. Pooling, Attention, and Weighted Fusion Mechanisms
In the hierarchical fusion schema, variable-length sets (phrases, sentences) are aggregated using attention-based or pooling strategies. Self-attention, as operationalized at the token level,

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

serves to dynamically weight the contribution of emotionally salient tokens. Multi-head variants follow the standard Transformer formalism.
Phrase- and sentence-level pooling may adopt softmax-weighted summations,

$$s = \sum_i \alpha_i h_i, \qquad \alpha_i = \frac{\exp(w^{\top} h_i)}{\sum_j \exp(w^{\top} h_j)},$$

or simply average over the constituent vectors, enabling the model to propagate prominent emotional features upward in the hierarchy.
The final fused embedding, after combining the semantic ($h_{\text{sem}}$) and emotion ($h_{\text{emo}}$) summaries, is produced as

$$h_{\text{fused}} = \sigma\!\left(W_{\text{fuse}}\,[\,h_{\text{sem}}\,;\,h_{\text{emo}}\,] + b_{\text{fuse}}\right),$$

with the emotion component added into the original Flan-T5 embedding stack (token + position + emotion).
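The softmax-weighted pooling can be written compactly as below; the learned scoring vector and the PyTorch framing are illustrative assumptions rather than confirmed details of the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftmaxPooling(nn.Module):
    """Softmax-weighted summation over a variable-length set of vectors."""

    def __init__(self, d: int):
        super().__init__()
        self.score = nn.Linear(d, 1)  # learned scoring vector w

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (num_items, d) -> pooled summary of shape (d,)
        alpha = F.softmax(self.score(h).squeeze(-1), dim=0)
        return (alpha.unsqueeze(-1) * h).sum(dim=0)

pool = SoftmaxPooling(d=768)
h_sent = pool(torch.randn(12, 768))  # pool 12 phrase vectors into a sentence vector
```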
4. Applications in Psychotherapy and Symbolic Music Generation
Emotion-weighted Flan-T5 embeddings have been evaluated in divergent but affect-intensive domains. In psychotherapy chatbot construction (Rasool et al., 2 Oct 2024), they enable LLMs to retrieve, integrate, and generate responses that are simultaneously contextually relevant and affectively congruent, informed by similarity search over $L_2$-normalized embeddings indexed via FAISS.
In affective symbolic music performance (Borovik et al., 5 Nov 2025), SyMuPe/PianoFlow encodes emotion-driven control signals as follows:
- Defines a bank of thirty-three “prototype” Flan-T5 embeddings, each derived by averaging over sixteen prompt templates per emotion.
- At inference, an emotion classifier produces soft probability weights $p_k$ for each prototype per musical bar, yielding the weighted sum $e_{\text{emo}} = \sum_{k=1}^{33} p_k \, e_k$.
- The resulting embedding is concatenated with score- and performance-level control tokens, then injected at the mid-point of the transformer (between layers $4$ and $5$ of $8$).
- Classifier-free guidance and control dropout ($p = 0.2$) are used to prevent over-reliance on any control channel.
This design enables nuanced and context-appropriate expressive rendering, as confirmed by both qualitative listening and ablation studies.
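A schematic of the prototype mixing and control dropout follows; only the prototype count (33), the template averaging (16 per emotion), and the dropout rate ($p = 0.2$) come from the paper, while the tensors, dimensions, and function names here are placeholders:

```python
import torch

NUM_EMOTIONS, D = 33, 768  # 33 prototypes per the paper; D is an assumed dimension

# Prototype bank: each row would be the average Flan-T5 embedding of the
# sixteen prompt templates for one emotion (random placeholder here).
prototypes = torch.randn(NUM_EMOTIONS, D)

def emotion_condition(bar_probs: torch.Tensor, p_dropout: float = 0.2,
                      training: bool = True) -> torch.Tensor:
    """Per-bar soft mixing of emotion prototypes, with control dropout."""
    e = bar_probs @ prototypes                  # (num_bars, D) weighted sum
    if training and torch.rand(()) < p_dropout:
        e = torch.zeros_like(e)                 # drop the emotion control channel
    return e

bar_probs = torch.softmax(torch.randn(8, NUM_EMOTIONS), dim=-1)  # 8 bars
cond = emotion_condition(bar_probs)  # injected between transformer layers 4 and 5
```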
5. Training Objectives, Regularization, and Embedding Management
Both reference implementations fine-tune their models exclusively with standard next-token cross-entropy (for text) or conditional flow matching objectives (for sequence generation), with no auxiliary emotion-specific loss. In music generation, regularization is further supported by multi-mask sampling (random notes, bars, full sequences) to enhance robustness. Control dropout during training compels the model to generalize in both the presence and absence of emotional signals.
Embeddings used for retrieval or downstream integration are $L_2$-normalized prior to FAISS-based indexing, supporting high-performance dense similarity search at scale.
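A minimal FAISS sketch of this retrieval setup, with a random placeholder corpus; cosine similarity is realized as inner product over $L_2$-normalized vectors:

```python
import faiss
import numpy as np

d = 768                                               # embedding dimension (assumed)
corpus = np.random.rand(10_000, d).astype("float32")  # placeholder embeddings

faiss.normalize_L2(corpus)                  # in-place L2 normalization
index = faiss.IndexFlatIP(d)                # inner product == cosine after norm
index.add(corpus)

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)        # top-5 most similar embeddings
```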
6. Experimental Results and Empirical Insights
Performance metrics articulated in these studies include:
- For the music emotion classifier: top-1 accuracy 32%, top-3 57%, top-5 73% on a curated short-excerpt dataset (Borovik et al., 5 Nov 2025).
- In ablation, the absence of emotion-weighted embeddings reduces controllability to the unconditional baseline, indicating critical importance for stylistic diversity.
- In qualitative evaluations, prescribed emotion prompts (“anger”, “dreamy”, “thunderstorms and lightning”) are reflected in attribute-controllable generation (velocity, sustain, tempo).
No explicit numerical ablation is provided for the psychotherapy context, but the fusion architecture is empirically validated for producing empathetic, context-aware chatbot responses (Rasool et al., 2 Oct 2024).
7. Limitations, Unspecified Parameters, and Future Directions
Certain architectural and hyper-parameter choices are left unspecified (e.g., the emotion embedding dimension $d_{\text{emo}}$, the activation $\sigma$, exact pooling strategies at all hierarchy levels), with typical defaults suggested. No explicit regularization or auxiliary objectives tailored to emotion consistency are applied; a plausible implication is that further gains may accrue from such additions. Both works rely on the static coverage of emotion lexica and a fixed set of prompt templates, which may not exhaustively span all affective nuances in naturalistic data.
Continued work may investigate adaptive lexicon expansion, learnable template generation, and the integration of contrastive or consistency losses to further strengthen affectively grounded representations (see the sketch below). Cross-domain application, transferring embeddings between text, music, and other modalities, remains an active and promising area.
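As one illustration of that direction (not part of either cited work), a supervised contrastive emotion-consistency loss over fused embeddings might look as follows:

```python
import torch
import torch.nn.functional as F

def emotion_consistency_loss(z: torch.Tensor, labels: torch.Tensor,
                             tau: float = 0.1) -> torch.Tensor:
    """Pull same-emotion fused embeddings together, push others apart.

    z: (batch, d) L2-normalized embeddings; labels: (batch,) emotion ids.
    """
    sim = z @ z.T / tau                               # pairwise similarities
    pos = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-emotion mask
    pos.fill_diagonal_(False)                         # exclude self-pairs
    logits = sim - 1e9 * torch.eye(len(z), device=z.device)
    log_prob = F.log_softmax(logits, dim=1)
    per_anchor = -(log_prob * pos.float()).sum(1) / pos.sum(1).clamp(min=1)
    return per_anchor[pos.any(1)].mean()              # anchors with positives only
```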