Open questions on generalizability to additional modalities beyond image, text, and audio

Determine whether the results on training-free missing modality prediction, established for image, text, and audio, generalize to additional modalities, specifically video, tabular data, and sensor streams.

Background

The study evaluates three modalities (image, text, audio) and demonstrates benefits from mining and verification components, but explicitly acknowledges limits in scope. The authors flag uncertainty about whether the observed findings and approaches extend to other salient modalities.

This open question is critical for real-world deployment, as many applications rely on modalities such as video, tabular data, and sensor streams, whose characteristics and constraints may differ from those examined.

References

Second, our evaluation is limited to three modalities, leaving open questions about generalizability to others such as video, tabular data, or sensor streams—modalities of growing importance in real-world applications.

— How Far Are We from Generating Missing Modalities with Foundation Models? (2506.03530 - Ke et al., 4 Jun 2025) in Discussion — Limitations
