Fine-grained Cross-modal Control for Expressive Multi-speaker Dialogue

Develop models and training strategies that achieve fine-grained cross-modal control across speech, vision, and text for generating expressive multi-speaker dialogues, enabling nuanced control of interactional style while maintaining cross-modal alignment.
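The paper states the control problem but not an interface for it. A minimal sketch of what "fine-grained cross-modal control" could look like in practice is a per-utterance conditioning record that carries speech-side, vision-side, and text-side signals for each speaker turn; all names and fields below are illustrative assumptions, not anything defined by the paper.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class UtteranceControl:
    """Fine-grained, per-utterance control signals for one speaker turn.

    Every field here is a hypothetical example of a controllable axis;
    the source does not specify which controls MM-DIA exposes.
    """
    speaker_id: str
    text: str                      # textual content of the turn
    emotion: str = "neutral"       # speech-side style (e.g. "hurt", "amused")
    speaking_rate: float = 1.0     # prosodic control, 1.0 = natural pace
    shot_type: str = "medium"      # vision-side control (e.g. "close-up")
    gesture: str = "none"          # coarse visual behaviour cue

@dataclass
class DialogueSpec:
    """A multi-speaker dialogue as an ordered list of controlled turns."""
    turns: List[UtteranceControl] = field(default_factory=list)

    def speakers(self) -> List[str]:
        """Return speaker ids in order of first appearance."""
        seen: List[str] = []
        for t in self.turns:
            if t.speaker_id not in seen:
                seen.append(t.speaker_id)
        return seen

# Example: two speakers with diverging interactional styles.
spec = DialogueSpec(turns=[
    UtteranceControl("A", "You never told me about this.", emotion="hurt",
                     speaking_rate=0.9, shot_type="close-up"),
    UtteranceControl("B", "I was going to, I promise.", emotion="defensive",
                     speaking_rate=1.2, gesture="hands-raised"),
])
print(spec.speakers())  # ['A', 'B']
```

The point of such a structure is that controllability is specified per turn and per modality, so a generator conditioned on it must keep the speech, video, and text realizations of each turn aligned with one another.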

Background

The paper surveys recent advances in spoken dialogue generation and cinematic video generation, noting that while modality-specific capabilities have improved, coordinated control across modalities for dialogue generation remains difficult.

To address this gap, the authors introduce MM-DIA and MM-DIA-BENCH to facilitate research on conditional multimodal dialogue generation, but they explicitly acknowledge that achieving fine-grained cross-modal control remains an open problem.
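The section names MM-DIA-BENCH but does not describe its schema, so the following is only a hedged sketch of what a conditional multimodal dialogue benchmark record might contain: a conditioning side (transcript plus style tags) and a target side (speech and video references). Every field name and file path below is a hypothetical example.

```python
import json

# Hypothetical record layout for a conditional multimodal dialogue
# benchmark; all fields are assumptions for illustration, not the
# actual MM-DIA-BENCH format.
sample = {
    "dialogue_id": "example-0001",
    "condition": {                         # what the model is given
        "transcript": [
            {"speaker": "A", "text": "Where were you last night?"},
            {"speaker": "B", "text": "Working late. Why?"},
        ],
        "style_tags": {"A": "suspicious", "B": "evasive"},
    },
    "target": {                            # what the model must produce
        "speech": "example-0001_mix.wav",  # multi-speaker waveform
        "video": "example-0001.mp4",       # accompanying visual track
    },
}

def turns_consistent(record: dict) -> bool:
    """Minimal sanity check: every conditioned turn has a speaker and
    non-empty text, so speech and video can be aligned against it."""
    return all(t.get("speaker") and t.get("text")
               for t in record["condition"]["transcript"])

print(turns_consistent(sample))  # True
print(json.dumps(sample["condition"]["style_tags"]))
```

Separating the conditioning record from the target modalities makes the evaluation question explicit: given the same controls, how faithfully does a model realize them in each modality, and how well do those realizations stay aligned?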

References

Although these advances establish the modality-specific foundations for conveying semantic information, achieving fine-grained cross-modal control across speech, vision, and text for expressive multi-speaker dialogue remains an open challenge.

From Natural Alignment to Conditional Controllability in Multimodal Dialogue (Jin et al., arXiv:2603.29162, 31 Mar 2026), Section 2.2, Dialogue Generation from Multimodality.