Generalization of NanoVDR beyond visual document retrieval with text-only queries

Determine whether the NanoVDR asymmetric cross-modal distillation framework generalizes beyond first-stage visual document retrieval with text-only queries to other retrieval settings.

Background

The paper evaluates NanoVDR exclusively on visual document retrieval tasks where queries are text-only and the student encoder retrieves against document embeddings produced by a frozen vision-language teacher. The authors note that they have not tested the framework in other retrieval scenarios.

Testing generalization would show whether the proposed asymmetric distillation approach carries over to retrieval configurations other than the one studied (e.g., different query modalities or retrieval stages).
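
The studied setting, in which a small text-only student embeds the query and ranks it against document embeddings precomputed by a frozen teacher, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the vectors are hand-written placeholders standing in for real teacher and student outputs, cosine similarity is assumed as the scoring function, and the names `cosine`, `rank_documents`, and `doc_index` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_documents(query_embedding, doc_embeddings, top_k=2):
    """Return the top_k (doc_id, score) pairs, highest score first."""
    scored = [
        (doc_id, cosine(query_embedding, emb))
        for doc_id, emb in doc_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Placeholder document embeddings (assumption): in the studied setting these
# would be produced offline by the frozen vision-language teacher.
doc_index = {
    "page_a": [1.0, 0.0, 0.0],
    "page_b": [0.0, 1.0, 0.0],
    "page_c": [0.7, 0.7, 0.0],
}

# Placeholder query embedding (assumption): at query time this would come
# from the text-only student, which must share the teacher's embedding space.
query_embedding = [0.9, 0.1, 0.0]

results = rank_documents(query_embedding, doc_index)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

The asymmetry is that only the query side changes encoder. Testing generalization would amount to changing what feeds `query_embedding` (for example, a non-text query) or where `rank_documents` sits in the pipeline, and checking whether retrieval quality holds.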

References

We evaluate exclusively on visual document retrieval (first-stage ranking) with text-only queries; whether the framework generalizes to other retrieval settings is left to future investigation.

— NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval (2603.12824 - Liu et al., 13 Mar 2026) in Limitations
