RETVec: Resilient and Efficient Text Vectorizer (2302.09207v3)
Abstract: This paper describes RETVec, an efficient, resilient, and multilingual text vectorizer designed for neural-based text processing. RETVec combines a novel character encoding with an optional small embedding model to embed words into a 256-dimensional vector space. The RETVec embedding model is pre-trained using pair-wise metric learning to be robust against typos and character-level adversarial attacks. In this paper, we evaluate and compare RETVec to state-of-the-art vectorizers and word embeddings on popular model architectures and datasets. These comparisons demonstrate that RETVec leads to competitive, multilingual models that are significantly more resilient to typos and adversarial text attacks. RETVec is available under the Apache 2 license at https://github.com/google-research/retvec.