- The paper’s main contribution is evaluating fine-tuned LLMs such as Llama3-70B, which reaches 91.9% and 96.5% accuracy on the twenty-class and seven-class 20 Newsgroups (20NG) classification tasks, respectively.
- It demonstrates that consolidating multiple tasks into one multi-task model can rival dual-model systems while reducing latency and resource consumption.
- A comparative analysis of encoder-only and decoder models underscores the potential of fine-tuned LLMs to enhance real-world AI applications.
Advancing Single- and Multi-Task Text Classification through LLM Fine-Tuning
The paper under review offers a thorough examination of the performance of large language models (LLMs) in both single- and multi-task text classification settings. It addresses a notable research gap by systematically comparing an encoder-only model, RoBERTa, with a broad spectrum of decoder models, including Llama2, Llama3, GPT-3.5, GPT-4, and GPT-4o, in both pre-trained and fine-tuned configurations.
The authors set forth a two-pronged contribution. First, they evaluate the efficacy of fine-tuned LLMs such as Llama3-70B against established models such as RoBERTa-large. Second, they explore consolidated multi-task models, showing that a single model can match the performance of separate dual-model systems on distinct tasks while reducing latency and resource usage.
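To make the single-task setup concrete, the following is a minimal sketch of how a decoder LLM can be fine-tuned to emit class labels as generated text. It assumes Hugging Face Transformers with LoRA adapters and an illustrative prompt template; the paper does not specify its exact training stack, checkpoint handling, or hyperparameters.

```python
# Minimal sketch (not the authors' exact pipeline): fine-tune a decoder LLM for
# single-task text classification by casting each example as an
# instruction-completion pair. Checkpoint name, prompt template, label subset,
# and LoRA/training hyperparameters are illustrative assumptions.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_NAME = "meta-llama/Meta-Llama-3-70B"  # assumed checkpoint; a 70B model needs quantization/QLoRA in practice
LABELS = ["sci.space", "rec.autos", "talk.politics.misc"]  # small 20NG subset for illustration

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

# Attach LoRA adapters so only a small fraction of the weights are trained.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

def to_features(example):
    # Classification as generation: the model learns to emit the label string
    # that follows the prompt.
    prompt = (f"Classify the newsgroup post into one of {LABELS}.\n"
              f"Post: {example['text']}\nLabel: ")
    return tokenizer(prompt + example["label"] + tokenizer.eos_token,
                     truncation=True, max_length=1024)

train_data = Dataset.from_list([
    {"text": "The shuttle launch was delayed by bad weather.", "label": "sci.space"},
])  # toy record; the paper trains on the full 20NG training split

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama3-20ng", per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=2e-4),
    train_dataset=train_data.map(to_features, remove_columns=["text", "label"]),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

At inference time, the fine-tuned model generates the label string given the prompt alone, and the generated text is mapped back to the class set.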
Key Findings and Numerical Highlights
The headline numerical results show that the fine-tuned Llama3-70B model delivers the best classification accuracy among the models evaluated. On the 20 Newsgroups (20NG) dataset, Llama3-70B achieves 91.9% accuracy on the twenty-class task and 96.5% on the seven-class task, surpassing RoBERTa-large. Similar trends hold on the MASSIVE en-US dataset, where the model reaches 90.8% accuracy on intent classification and an 86.0% F1-score on slot filling.
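For readers less familiar with these metrics, the snippet below illustrates how intent accuracy and span-level slot-filling F1 are commonly computed; the scikit-learn and seqeval calls are standard choices and not necessarily the authors' own evaluation code.

```python
# Illustration only: computing intent accuracy and slot-filling F1 the way
# these metrics are typically scored on MASSIVE-style data. The toy labels
# below are made up for the example.
from sklearn.metrics import accuracy_score
from seqeval.metrics import f1_score

# Intent classification: one gold/predicted label per utterance.
gold_intents = ["alarm_set", "weather_query", "play_music"]
pred_intents = ["alarm_set", "weather_query", "play_radio"]
print("intent accuracy:", accuracy_score(gold_intents, pred_intents))

# Slot filling: BIO tag sequences per utterance; seqeval scores at span level.
gold_slots = [["O", "B-date", "I-date", "O"], ["B-place_name", "O"]]
pred_slots = [["O", "B-date", "I-date", "O"], ["O", "O"]]
print("slot F1:", f1_score(gold_slots, pred_slots))
```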
When the researchers consolidated the tasks into a single multi-task model, performance remained comparable to the dual-model setup, suggesting that LLMs can handle aggregated task structures without a notable compromise in accuracy or F1.
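One plausible way to realize such consolidation is to route every task through a shared instruction format, as in the sketch below; the task tags and templates are illustrative assumptions rather than the authors' actual prompt design.

```python
# Hedged sketch: folding two tasks (20NG topic classification and MASSIVE
# intent detection) into one instruction-completion format so a single
# fine-tuned model serves both. Task tags and instructions are assumptions.

def build_example(task: str, text: str, target: str) -> dict:
    """Return one training record in a shared instruction-completion format."""
    if task == "topic":
        instruction = "Classify the newsgroup post into one of the 20NG topics."
    elif task == "intent":
        instruction = "Identify the user's intent from the MASSIVE intent set."
    else:
        raise ValueError(f"unknown task: {task}")
    return {
        "prompt": f"[{task}] {instruction}\nInput: {text}\nAnswer: ",
        "completion": target,
    }

# Both tasks flow through the same model and the same training loop; only the
# task tag and instruction differ.
mixed_batch = [
    build_example("topic", "The shuttle launch was delayed by bad weather.", "sci.space"),
    build_example("intent", "wake me up at seven tomorrow", "alarm_set"),
]
```

Serving one model for both tasks avoids loading a second set of weights and a routing hop between models, which is one way to read the latency and resource savings the paper attributes to consolidation.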
Implications and Future Directions
The implications of this research are significant for AI applications, particularly AI agents. The improved performance of fine-tuned LLMs indicates that such models can meaningfully raise the precision of systems that rely on accurate text classification, such as intent detection and automated customer support.
On a theoretical level, the paper advances our understanding of LLM capabilities in multi-task learning. The ability of fine-tuned LLMs to adapt to diverse tasks with minimal prompt tuning points toward fine-tuning and training regimes that preserve accuracy while improving computational efficiency.
Potential future research avenues outlined by the authors include expanding the range of text classification tasks to better understand LLM capabilities in more varied contexts, such as joint classification and generation tasks. Exploring additional task-combination strategies could also yield insights into optimizing multi-task learning frameworks.
Conclusion
In conclusion, this rigorous investigation marks a notable step forward in understanding the role of LLMs in text classification. By benchmarking a variety of models and configurations, the paper provides clear evidence that fine-tuned LLMs, particularly larger ones, can substantially outperform traditional encoder-only models. Furthermore, the successful consolidation of tasks within a single LLM represents a practical stride toward more efficient AI applications, paving the way for future research on large-scale LLM deployment in practical settings.