LLMs with User-defined Prompts as Generic Data Operators for Reliable Data Processing (2312.16351v1)

Published 26 Dec 2023 in cs.DB and cs.AI

Abstract: Data processing is one of the fundamental steps in machine learning pipelines for ensuring data quality. The majority of applications adopt the user-defined function (UDF) design pattern for data processing in databases. Although the UDF design pattern offers flexibility, reusability, and scalability, the increasing demands of machine learning pipelines expose three shortcomings of this design pattern: it is not low-code, not dependency-free, and not knowledge-aware. To address these challenges, we propose a new design pattern in which LLMs serve as generic data operators (LLM-GDO) for reliable data cleansing, transformation, and modeling with human-comparable performance. In the LLM-GDO design pattern, user-defined prompts (UDPs) represent the data processing logic rather than implementations in a specific programming language. LLMs can be centrally maintained, so users do not have to manage dependencies at run time. Fine-tuning LLMs on domain-specific data can improve performance on domain-specific tasks, making data processing knowledge-aware. We illustrate these advantages with examples across different data processing tasks. Furthermore, we summarize the challenges and opportunities introduced by LLMs to provide a complete view of this design pattern for further discussion.
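To make the UDF-versus-UDP contrast concrete, here is a minimal Python sketch of the LLM-GDO idea as the abstract describes it: a user-defined prompt replaces the body of a traditional user-defined function, and the resulting operator is applied per row like any UDF. The `call_llm` stub, the prompt template, and the category list are illustrative assumptions, not artifacts from the paper.

```python
# Minimal sketch of the LLM-GDO design pattern: a user-defined prompt (UDP)
# carries the data processing logic instead of a language-specific UDF body.
# `call_llm` is a hypothetical stand-in for a centrally maintained LLM
# service; wire it to whatever provider or hosted model you use.

from typing import Callable

def call_llm(prompt: str) -> str:
    """Hypothetical client for a centrally maintained LLM endpoint."""
    raise NotImplementedError("connect this to your LLM provider")

def make_gdo(udp_template: str) -> Callable[[str], str]:
    """Turn a user-defined prompt template into a row-level data operator."""
    def operator(value: str) -> str:
        return call_llm(udp_template.format(value=value))
    return operator

# Example UDP: classify a raw product title into a category, a typical
# data cleansing/modeling step in the paper's setting (categories invented).
classify = make_gdo(
    "Classify the product title into one of "
    "[Electronics, Grocery, Apparel]. Title: {value}. "
    "Answer with the category name only."
)

# Used like a conventional UDF, e.g. mapped over a DataFrame column:
# df["category"] = df["title"].map(classify)
```

Note how the user writes only the prompt: the model, its dependencies, and any domain-specific fine-tuning live behind the centrally maintained endpoint, which is what makes the pattern low-code, dependency-free, and knowledge-aware.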

