
Do unpaired auxiliary modalities improve textual tasks?

Determine whether unpaired auxiliary modalities such as images and audio can provide useful information that improves performance on textual tasks within the Unpaired Multimodal Representation Learning (UML) framework introduced in this work.


Background

The paper proposes Unpaired Multimodal Representation Learning (UML), which shares model weights across modalities to leverage unpaired data and improve unimodal representations. Theoretical results show that unpaired auxiliary modalities can increase the Fisher information about shared latent variables, and extensive experiments demonstrate gains on image and audio classification.
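The core architectural idea, sharing one set of weights across modalities so that unpaired batches from different modalities all update the same parameters, can be illustrated with a minimal sketch. All dimensions, layer choices, and names below are hypothetical (not taken from the paper); the sketch only shows modality-specific input projections feeding a shared trunk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
d_img, d_audio, d_shared, d_out = 64, 32, 16, 8

# Modality-specific input projections map each modality into a common space.
W_img = rng.normal(size=(d_img, d_shared)) * 0.1
W_audio = rng.normal(size=(d_audio, d_shared)) * 0.1

# Shared trunk: the same weights process every modality, so unpaired
# image and audio batches would both contribute gradient updates to it.
W_shared = rng.normal(size=(d_shared, d_out)) * 0.1

def encode(x, W_in):
    """Project a modality-specific input, then apply the shared trunk."""
    h = np.maximum(x @ W_in, 0.0)   # modality-specific projection + ReLU
    return h @ W_shared             # weights shared across modalities

# Unpaired batches: the image and audio samples need not correspond.
z_img = encode(rng.normal(size=(4, d_img)), W_img)
z_audio = encode(rng.normal(size=(4, d_audio)), W_audio)
assert z_img.shape == z_audio.shape == (4, d_out)
```

The open question in this card amounts to adding a third input projection for text and asking whether the shared trunk, trained on unpaired image and audio data, improves the resulting text representations.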

However, while the empirical evaluation covers image and audio as target modalities, the authors explicitly note that it is not yet established whether auxiliary modalities can similarly benefit textual tasks. Resolving this would close a key loop: showing that the same unpaired cross-modal training paradigm can enhance language representations using non-text data.

References

Furthermore, we evaluate how multimodal data enhances image and audio classification; it remains to show if they can, in turn, offer useful information for textual tasks.

Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models (2510.08492 - Gupta et al., 9 Oct 2025), Section: Conclusions and Limitations