Vectorizing string entries for data processing on tables: when are larger language models better?

Published 15 Dec 2023 in stat.ML and cs.LG (arXiv:2312.09634v1)

Abstract: There are increasingly efficient data processing pipelines that work on vectors of numbers, for instance most machine learning models, or vector databases for fast similarity search. These require converting the data to numbers. While this conversion is easy for simple numerical and categorical entries, databases are rife with text entries, such as names or descriptions. In the age of LLMs, what are the best strategies to vectorize table entries, bearing in mind that larger models entail more operational complexity? We study the benefits of LLMs in 14 analytical tasks on tables while varying the training size, as well as on a fuzzy-join benchmark. We introduce a simple characterization of a column that reveals two settings: 1) a dirty-categories setting, where strings share many similarities across entries, and conversely 2) a diverse-entries setting. For dirty categories, pretrained LLMs bring little to no benefit compared to simpler string models. For diverse entries, we show that larger LLMs improve data processing. For these, we investigate the complexity-performance tradeoffs and show that they reflect those of classic text embedding: larger models tend to perform better, but it is useful to fine-tune them for embedding purposes.
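
To make the two settings concrete, here is a minimal sketch in Python of how one might vectorize each kind of string column. The example data, the TF-IDF character n-gram encoder, and the `all-MiniLM-L6-v2` model are illustrative assumptions, not the paper's exact pipeline.

```python
# Sketch: lightweight string model for "dirty categories" vs. a
# pretrained embedding model for "diverse entries". All column contents
# and model choices below are illustrative assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

df = pd.DataFrame({
    # Hypothetical "dirty categories" column: near-duplicate strings.
    "position": ["Senior Analyst", "senior analyst II", "Police Officer III"],
    # Hypothetical "diverse entries" column: free text, little overlap.
    "description": [
        "Analyzes budget data for the finance department.",
        "Prepares quarterly forecasts and board reports.",
        "Patrols assigned districts and responds to calls.",
    ],
})

# Dirty categories: shared character n-grams already capture the
# similarity between near-duplicate entries, so a simple string model
# is often enough (no pretrained LLM needed).
ngram_vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X_position = ngram_vectorizer.fit_transform(df["position"])

# Diverse entries: strings share little surface form, so a pretrained
# embedding model helps (requires the sentence-transformers package;
# the model name is one example choice, not the paper's).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
X_description = embedder.encode(df["description"].tolist())

print(X_position.shape, X_description.shape)
```

The same embeddings can also serve the fuzzy-join use case the abstract mentions: index one table's vectors in a nearest-neighbor structure and match rows from the other table by cosine similarity.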
