Pixel Sentence Representation Learning (2402.08183v1)

Published 13 Feb 2024 in cs.CL and cs.CV

Abstract: Pretrained language models have long been known to be subpar at capturing sentence- and document-level semantics. Though heavily investigated, transferring perturbation-based methods from unsupervised visual representation learning to NLP remains an unsolved problem, largely because the discrete subword units produced by tokenization prevent small input perturbations from forming semantics-preserving positive pairs. In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process. Drawing on cognitive and linguistic science, we introduce an unsupervised visual sentence representation learning framework that employs visually grounded text perturbations, such as typos and word-order shuffling, which resonate with human cognitive patterns and allow perturbations of text to be perceived as continuous. Our approach is further bolstered by large-scale unsupervised topical-alignment training and natural language inference supervision, achieving performance on semantic textual similarity (STS) comparable to existing state-of-the-art NLP methods. Additionally, we unveil the method's inherent zero-shot cross-lingual transferability and a unique leapfrogging pattern across languages during iterative training. To our knowledge, this is the first representation learning method for sentence and document semantics that is devoid of traditional language models, marking a step closer to human-like textual comprehension. Our code is available at https://github.com/gowitheflow-1998/Pixel-Linguist
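
The recipe the abstract describes, i.e. perturb a sentence in a visually grounded, meaning-preserving way, render both versions as images, and pull their encodings together contrastively, can be sketched in a few lines. The sketch below is illustrative only: the helper names, the toy PIL renderer, and the encoder-agnostic loss are our assumptions, not the paper's implementation (the actual pipeline, built on the PIXEL renderer and a ViT encoder, is in the linked repository).

```python
# Minimal sketch: build a semantics-preserving positive pair via a
# visually-grounded perturbation, render both variants to images, and
# train any image encoder with an in-batch contrastive (InfoNCE) loss.
import random

import torch
import torch.nn.functional as F
from PIL import Image, ImageDraw


def typo_perturb(sentence: str, p: float = 0.1) -> str:
    """Swap adjacent inner characters of some words, mimicking human typos
    (first and last letters stay fixed, per the letter-position literature)."""
    words = []
    for w in sentence.split():
        if len(w) > 3 and random.random() < p:
            i = random.randrange(1, len(w) - 2)
            w = w[:i] + w[i + 1] + w[i] + w[i + 2:]
        words.append(w)
    return " ".join(words)


def shuffle_perturb(sentence: str) -> str:
    """Shuffle word order; readers still recover 'good enough' meaning."""
    words = sentence.split()
    random.shuffle(words)
    return " ".join(words)


def render(sentence: str, size=(256, 32)) -> torch.Tensor:
    """Render a sentence to a grayscale image tensor.
    A toy stand-in for the PIXEL text renderer used by the paper."""
    img = Image.new("L", size, color=255)
    ImageDraw.Draw(img).text((2, 8), sentence, fill=0)
    pixels = torch.tensor(list(img.getdata()), dtype=torch.float32)
    return pixels.view(1, size[1], size[0]) / 255.0


def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss over (B, D) embedding batches: each anchor's
    positive is its own perturbed rendering; all other rows act as negatives."""
    sims = F.cosine_similarity(anchor.unsqueeze(1), positive.unsqueeze(0), dim=-1)
    sims = sims / temperature
    labels = torch.arange(sims.size(0), device=sims.device)
    return F.cross_entropy(sims, labels)


# Usage, given any image encoder `enc` mapping rendered sentences to embeddings:
#   imgs  = torch.stack([render(s) for s in batch])
#   pos   = torch.stack([render(typo_perturb(s)) for s in batch])
#   loss  = info_nce(enc(imgs), enc(pos))
```

Because negatives come for free from the off-diagonal of the similarity matrix, this is the standard in-batch contrastive setup (as in SimCLR/SimCSE, both cited by the paper); the novelty lies in obtaining continuous, semantics-preserving positives by perturbing and rendering text as pixels rather than editing discrete subword tokens.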

Authors (10)
  1. Chenghao Xiao (21 papers)
  2. Zhuoxu Huang (4 papers)
  3. Danlu Chen (8 papers)
  4. G Thomas Hudson (8 papers)
  5. Yizhi Li (43 papers)
  6. Haoran Duan (36 papers)
  7. Chenghua Lin (127 papers)
  8. Jie Fu (229 papers)
  9. Jungong Han (111 papers)
  10. Noura Al Moubayed (40 papers)
Citations (1)
