Fine-grained Cross-modal Control for Expressive Multi-speaker Dialogue
Develop models and training strategies that provide fine-grained cross-modal control over speech, vision, and text when generating expressive multi-speaker dialogue, so that interactional style can be controlled with nuance while the modalities remain aligned.
References
Although these advances establish the modality-specific foundations for conveying semantic information, achieving fine-grained cross-modal control across speech, vision, and text for expressive multi-speaker dialogue remains an open challenge.
— From Natural Alignment to Conditional Controllability in Multimodal Dialogue
(2603.29162 - Jin et al., 31 Mar 2026) in Section 2.2, Dialogue Generation from Multimodality