AlignCap: Aligning Speech Emotion Captioning to Human Preferences (2410.19134v1)
Abstract: Speech Emotion Captioning (SEC) has gradually become an active research task. The emotional content conveyed through human speech is often complex, and classifying it into fixed categories may not be enough to fully capture speech emotions. Describing speech emotions through natural language may be a more effective approach. However, existing SEC methods often produce hallucinations and lose generalization on unseen speech. To overcome these problems, we propose AlignCap, which aligns speech emotion captioning to human preferences on top of an LLM, with two properties: 1) Speech-Text Alignment, which minimizes the divergence between the LLM's response prediction distributions for speech and text inputs using knowledge distillation (KD) regularization; 2) Human Preference Alignment, where we design preference optimization (PO) regularization to eliminate factuality and faithfulness hallucinations. We also extract emotional clues as a prompt to enrich fine-grained information under KD regularization. Experiments demonstrate that AlignCap outperforms other state-of-the-art methods on the zero-shot SEC task.
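To make the two alignment objectives concrete, here is a minimal PyTorch sketch of how they could be implemented. This is an illustrative reconstruction, not the paper's code: it assumes the LLM exposes per-token logits for a speech-prompted and a text-prompted forward pass, and the function names (`kd_regularization`, `po_regularization`) and hyperparameters are hypothetical.

```python
import torch
import torch.nn.functional as F

def kd_regularization(speech_logits, text_logits, temperature=1.0):
    """Speech-Text Alignment (sketch): minimize the KL divergence between
    the LLM's next-token prediction distributions for speech vs. text
    inputs. Shapes: (batch, seq_len, vocab). The text-conditioned branch
    serves as the teacher and is detached so gradients only flow to the
    speech branch."""
    teacher_probs = F.softmax(text_logits.detach() / temperature, dim=-1)
    student_logp = F.log_softmax(speech_logits / temperature, dim=-1)
    # batchmean KL, scaled by T^2 as is standard in distillation
    return F.kl_div(student_logp, teacher_probs,
                    reduction="batchmean") * temperature ** 2

def po_regularization(policy_chosen_logp, policy_rejected_logp,
                      ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Human Preference Alignment (sketch): a DPO-style loss that pushes
    the policy to prefer the human-preferred caption over a rejected
    (hallucinated) one, relative to a frozen reference model. Inputs are
    summed sequence log-probabilities of shape (batch,)."""
    margin = (policy_chosen_logp - policy_rejected_logp) \
             - (ref_chosen_logp - ref_rejected_logp)
    return -F.logsigmoid(beta * margin).mean()
```

In training, these terms would presumably be added to the standard captioning cross-entropy, e.g. `loss = caption_ce + lambda_kd * kd + lambda_po * po`; the weighting scheme here is a placeholder, not a value taken from the paper.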