
Balance Act: Mitigating Hubness in Cross-Modal Retrieval with Query and Gallery Banks (2310.11612v1)

Published 17 Oct 2023 in cs.LG

Abstract: In this work, we present a post-processing solution to address the hubness problem in cross-modal retrieval, a phenomenon where a small number of gallery data points are frequently retrieved, resulting in a decline in retrieval performance. We first theoretically demonstrate the necessity of incorporating both the gallery and query data for addressing hubness as hubs always exhibit high similarity with gallery and query data. Second, building on our theoretical results, we propose a novel framework, Dual Bank Normalization (DBNorm). While previous work has attempted to alleviate hubness by only utilizing the query samples, DBNorm leverages two banks constructed from the query and gallery samples to reduce the occurrence of hubs during inference. Next, to complement DBNorm, we introduce two novel methods, dual inverted softmax and dual dynamic inverted softmax, for normalizing similarity based on the two banks. Specifically, our proposed methods reduce the similarity between hubs and queries while improving the similarity between non-hubs and queries. Finally, we present extensive experimental results on diverse language-grounded benchmarks, including text-image, text-video, and text-audio, demonstrating the superior performance of our approaches compared to previous methods in addressing hubness and boosting retrieval performance. Our code is available at https://github.com/yimuwangcs/Better_Cross_Modal_Retrieval.
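
The two-bank normalization described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that); it assumes one plausible form of a dual inverted softmax, in which each gallery item's score is penalized by how strongly a query bank and a gallery bank are drawn to it. The function names, the temperatures `beta_q` and `beta_g`, and the toy similarity values are all hypothetical.

```python
import numpy as np

def _logsumexp_cols(x):
    """Column-wise log-sum-exp, computed stably."""
    m = x.max(axis=0)
    return m + np.log(np.exp(x - m).sum(axis=0))

def dual_inverted_softmax(sim, query_bank_sim, gallery_bank_sim,
                          beta_q=10.0, beta_g=10.0):
    """Illustrative dual-bank normalization (an assumed form, not the paper's exact formula).

    sim:              (n_queries, n_gallery) query-to-gallery similarities
    query_bank_sim:   (n_qbank, n_gallery)   query-bank-to-gallery similarities
    gallery_bank_sim: (n_gbank, n_gallery)   gallery-bank-to-gallery similarities
    Returns log-scores of shape (n_queries, n_gallery).
    """
    # A hub is similar to many bank items, so its two normalizers are large
    # and its score is pushed down relative to non-hubs.
    q_penalty = _logsumexp_cols(beta_q * query_bank_sim)
    g_penalty = _logsumexp_cols(beta_g * gallery_bank_sim)
    return beta_q * sim - q_penalty - g_penalty

# Toy data (made up): gallery item 0 is a hub that looks similar to everything.
sim = np.array([[0.80, 0.75, 0.10],    # query 0, true match is item 1
                [0.80, 0.10, 0.75]])   # query 1, true match is item 2
query_bank_sim = np.array([[0.80, 0.20, 0.10],
                           [0.80, 0.10, 0.20],
                           [0.80, 0.15, 0.15]])
gallery_bank_sim = np.array([[0.70, 0.20, 0.10],
                             [0.70, 0.10, 0.20]])

raw_top1 = sim.argmax(axis=1)
normalized_top1 = dual_inverted_softmax(
    sim, query_bank_sim, gallery_bank_sim).argmax(axis=1)
print("raw top-1:       ", raw_top1)
print("normalized top-1:", normalized_top1)
```

In this toy setup, the raw similarities rank the hub first for both queries, while the normalized scores rank each query's intended match first. Working in log space avoids overflow when the temperatures are large.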

Authors (3)
  1. Yimu Wang
  2. Xiangru Jian
  3. Bo Xue
Citations (7)
