
Retrieval-based Disentangled Representation Learning with Natural Language Supervision (2212.07699v2)

Published 15 Dec 2022 in cs.CL, cs.AI, and cs.CV

Abstract: Disentangled representation learning remains challenging because the underlying factors of variation in the data do not naturally exist. The inherent complexity of real-world data makes it infeasible to exhaustively enumerate and encapsulate all of its variations within a finite set of factors. However, most real-world data have linguistic equivalents, typically in the form of textual descriptions. These linguistic counterparts can represent the data and be effortlessly decomposed into distinct tokens. In light of this, we present Vocabulary Disentangled Retrieval (VDR), a retrieval-based framework that harnesses natural language as a proxy for the underlying data variation to drive disentangled representation learning. Our approach employs a bi-encoder model to represent both data and natural language in a vocabulary space, enabling the model to distinguish the dimensions that capture intrinsic characteristics within data through their natural language counterparts, thus facilitating disentanglement. We extensively assess the performance of VDR across 15 retrieval benchmark datasets, covering text-to-text and cross-modal retrieval scenarios, as well as human evaluation. Our experimental results demonstrate the superiority of VDR over previous bi-encoder retrievers with comparable model size and training cost, achieving an 8.7% improvement in NDCG@10 on the BEIR benchmark, a 5.3% increase on MS COCO, and a 6.0% increase on Flickr30k in terms of mean recall in the zero-shot setting. Moreover, the results from human evaluation indicate that the interpretability of our method is on par with SOTA captioning models.
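
The abstract describes a bi-encoder that maps both data and its textual counterpart into a vocabulary-sized space and scores relevance by inner product. The sketch below illustrates that general setup in PyTorch; the encoder architectures, feature dimensions, vocabulary size, and the in-batch contrastive objective are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a vocabulary-space bi-encoder for cross-modal retrieval.
# Both encoders project inputs into a |V|-dimensional space (one dimension
# per vocabulary token); relevance is the inner product of the two
# non-negative representations. Encoder bodies are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 30522  # assumed size, e.g. a BERT-style WordPiece vocabulary

class VocabSpaceEncoder(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, VOCAB_SIZE),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU keeps activations non-negative so each dimension can be read
        # as the weight of one vocabulary token; log1p dampens large values,
        # as is common in sparse lexical retrievers.
        return torch.log1p(F.relu(self.backbone(x)))

# Separate encoders for the two modalities (hypothetical feature dims).
image_encoder = VocabSpaceEncoder(input_dim=512)
text_encoder = VocabSpaceEncoder(input_dim=768)

image_feats = torch.randn(4, 512)  # stand-in image features
text_feats = torch.randn(4, 768)   # stand-in caption features

img_repr = image_encoder(image_feats)  # (4, VOCAB_SIZE), non-negative
txt_repr = text_encoder(text_feats)    # (4, VOCAB_SIZE)

scores = img_repr @ txt_repr.T          # (4, 4) relevance matrix
labels = torch.arange(4)                # matched pairs sit on the diagonal
loss = F.cross_entropy(scores, labels)  # in-batch contrastive objective
loss.backward()
```

Because every dimension of the representation corresponds to a vocabulary token, the highest-weighted dimensions of an image embedding can be read off as words, which is what makes this style of representation inspectable.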
