Trade-off between multilingual parameter allocation and visual understanding

Determine how allocating parameters to additional languages in multilingual bidirectional vision–language encoders for document retrieval trades off against visual understanding, and quantify how much English retrieval performance degrades as the number of supported languages grows.

Background

The paper trains and evaluates on English only, while noting the value of releasing multilingual variants. Scaling to more languages creates a tension between model capacity spent on language coverage and capacity spent on visual understanding, which may reduce English retrieval quality.

Clarifying and quantifying this trade-off would inform architecture and data budgeting for multilingual visual document retrievers, guiding parameter distribution across languages without unduly harming performance in primary target languages.

References

While we expect the broad trends to generalize and see clear value in releasing multilingual variants, it remains unclear how allocating parameters to additional languages trades off against the understanding of the vision modality, and to what extent this penalizes English retrieval performance as the number of languages are scaled \citep{pmlr-v202-fernandes23a}.

— ModernVBERT: Towards Smaller Visual Document Retrievers (2510.01149 - Teiletche et al., 1 Oct 2025) in Conclusion, Future Work and Limitations
