This paper (Basyal et al., 2023) explores the application of LLMs to text summarization, comparing the performance of three specific models: MPT-7b-instruct, Falcon-7b-instruct, and OpenAI's text-davinci-003. Its primary goal is a comparative analysis that helps researchers and practitioners understand how these models perform across datasets and evaluation metrics, informing the practical implementation of LLMs in NLP applications.
The paper frames text summarization as a crucial NLP task, particularly relevant in the age of big data for condensing large volumes of text. It distinguishes two main approaches: abstractive (generating new text) and extractive (selecting existing sentences); the paper focuses on the abstractive capabilities of LLMs. Summarization techniques can also be categorized as supervised (requiring labeled data) or unsupervised (relying on algorithms to surface relevant information). LLMs typically combine pre-training on vast amounts of text with fine-tuning for specific tasks, and so usually fall into the supervised or instruction-tuned category.
The paper evaluates the LLMs on two standard text summarization datasets:
- CNN/Daily Mail 3.0.0: Comprises news articles with journalist-written highlights (summaries). This dataset supports both extractive and abstractive summarization evaluation. Each entry includes an article body and corresponding highlights.
- XSum: Designed for "extreme summarization," where the goal is to produce a single-sentence summary capturing the core idea of a news article. Each entry contains a document and a concise summary.
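Both datasets are distributed on the Hugging Face Hub. The sketch below shows how they might be loaded and subsampled to the 25 test examples used in the paper; the dataset identifiers and field names are the standard Hub ones, not code taken from the paper.

```python
# Sketch: loading the two benchmark datasets from the Hugging Face Hub and
# taking 25 test samples each, as in the paper. Hub IDs and field names are
# the standard ones; this is not the authors' code.
from datasets import load_dataset

# CNN/Daily Mail 3.0.0: each record has an "article" and its "highlights".
cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="test")

# XSum: each record has a "document" and a one-sentence "summary".
xsum = load_dataset("xsum", split="test")

cnn_dm_sample = cnn_dm.select(range(25))
xsum_sample = xsum.select(range(25))

print(cnn_dm_sample[0]["highlights"])
print(xsum_sample[0]["summary"])
```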
To assess the quality of the generated summaries, the paper uses widely accepted evaluation metrics:
- BLEU Score: Measures the similarity between generated text and reference text based on n-gram overlap, commonly used for machine translation but also applicable to summarization quality assessment.
- ROUGE Score: Specifically designed for summarization, measuring the overlap of n-grams (ROUGE-N) and longest common subsequences (ROUGE-L) between the generated summary and reference summaries.
- BERTScore: Leverages contextual embeddings from the BERT model to evaluate similarity, aiming to capture semantic meaning beyond simple word overlap.
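All three metrics are available through Hugging Face's `evaluate` library; the snippet below is a minimal sketch of scoring generated summaries against references and is an assumption, not the authors' evaluation code.

```python
# Minimal sketch of scoring generated summaries with Hugging Face `evaluate`.
# This mirrors the metrics reported in the paper; the authors' exact
# evaluation pipeline is not shown here, so treat this as an assumption.
import evaluate

predictions = ["The cabinet approved the new climate bill on Tuesday."]
references  = ["On Tuesday the cabinet passed the new climate legislation."]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")          # reports ROUGE-1, ROUGE-2, ROUGE-L
bertscore = evaluate.load("bertscore")  # contextual-embedding similarity

bleu_res = bleu.compute(predictions=predictions,
                        references=[[r] for r in references])
rouge_res = rouge.compute(predictions=predictions, references=references)
bert_res = bertscore.compute(predictions=predictions, references=references,
                             lang="en")

print(f"BLEU:    {bleu_res['bleu']:.3f}")
print(f"ROUGE-L: {rouge_res['rougeL']:.3f}")
print(f"BERT F1: {sum(bert_res['f1']) / len(bert_res['f1']):.3f}")
```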
For the experiments, the authors used a controlled setup for inference:
- Models: falcon-7b-instruct, mpt-7b-instruct, and text-davinci-003.
- Sample Size: 25 test samples from each dataset (CNN/Daily Mail and XSum).
- Hyperparameters: A temperature of 0.1 (aiming for less randomness and more deterministic outputs) and a maximum token length of 100 for the generated summary were used.
- Implementation: The inference process utilized LangChain and Hugging Face pipelines for prompt engineering and model execution.
- Infrastructure: The experiments were run on custom Google Compute Engine (GCE) Virtual Machine instances equipped with NVIDIA T4 GPUs, highlighting the computational resources required for running these models.
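A minimal sketch of such a setup follows, written against the 2023-era LangChain and Transformers APIs (LangChain has since reorganized these classes into separate packages). The prompt wording and the choice of mpt-7b-instruct as the example model are assumptions; only the decoding settings come from the paper.

```python
# Sketch of the described inference setup: a Hugging Face text-generation
# pipeline wrapped in LangChain. The prompt text and the model choice are
# illustrative assumptions; temperature 0.1 and the 100-token cap come from
# the paper.
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

hf_pipe = pipeline(
    "text-generation",
    model="mosaicml/mpt-7b-instruct",
    trust_remote_code=True,   # MPT ships custom modeling code
    device_map="auto",        # place the model on the available GPU (e.g. a T4)
    max_new_tokens=100,       # cap the generated summary at 100 tokens
    temperature=0.1,          # low temperature for near-deterministic output
    do_sample=True,
)

llm = HuggingFacePipeline(pipeline=hf_pipe)

# Simple summarization prompt; the paper's exact prompt is not reproduced here.
prompt = PromptTemplate(
    input_variables=["article"],
    template=("Summarize the following article in a few sentences:\n\n"
              "{article}\n\nSummary:"),
)

chain = LLMChain(llm=llm, prompt=prompt)
summary = chain.run(article="<full news article text>")
print(summary)
```

Swapping in tiiuae/falcon-7b-instruct would only require changing the model identifier; the surrounding prompt and chain stay the same.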
The results, based on these metrics, indicated that text-davinci-003 consistently outperformed both MPT-7b-instruct and Falcon-7b-instruct on both datasets in ROUGE and BERTScore. The raw BLEU scores reported for the 7B models were extremely low, possibly an artifact of computing BLEU on very short outputs or of a length mismatch with the references, but the ROUGE and BERTScore results still show text-davinci-003 as clearly superior. The paper attributes this gap primarily to model scale: text-davinci-003 has roughly 175 billion parameters versus 7 billion for MPT and Falcon. Among the 7B models, MPT-7b-instruct performed slightly better than Falcon-7b-instruct.
These findings have practical implications for implementing text summarization systems:
- Model Selection: For achieving the highest quality abstractive summaries, larger, more powerful models like OpenAI's text-davinci-003 (or its successors via API) demonstrated the best results, albeit at potentially higher computational cost and API usage fees.
- Resource Requirements: Deploying open-source 7B models like MPT or Falcon for inference requires substantial GPU resources (e.g., NVIDIA T4, as used in the paper), which is a key consideration for practitioners setting up infrastructure.
- Frameworks: The use of libraries like LangChain and Hugging Face pipelines is demonstrated as a practical approach for integrating LLMs into summarization workflows, handling tasks like prompt engineering and model interaction.
- Performance Trade-offs: While smaller models (7B) might be more feasible to deploy on less powerful hardware or for self-hosting, they may yield lower summarization quality compared to state-of-the-art proprietary models.
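For contrast with the self-hosted setup sketched earlier, the snippet below shows the API-based path as it looked at the time of the paper, using OpenAI's legacy Completions interface under which text-davinci-003 was served. The model has since been deprecated and the prompt wording is an assumption, so this illustrates the trade-off rather than a current recipe.

```python
# Sketch of the API-based path using the 2023-era openai-python (< 1.0)
# Completions interface that served text-davinci-003. The model is now
# deprecated and the prompt wording is an assumption.
import openai  # expects OPENAI_API_KEY in the environment

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=("Summarize the following article in a few sentences:\n\n"
            "<full news article text>\n\nSummary:"),
    temperature=0.1,   # same decoding settings as the self-hosted models
    max_tokens=100,
)
print(response["choices"][0]["text"].strip())
```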
The paper suggests future enhancements including evaluating larger open-source models (e.g., MPT-30b-instruct, Falcon-40b-instruct), using larger datasets and sample sizes, exploring different hyperparameter settings (like varying temperature and output length), incorporating human evaluation metrics, and fine-tuning models on specific domains for improved performance.
In summary, the paper provides a comparative benchmark for text summarization using select LLMs, confirming the superior performance of a larger proprietary model (text-davinci-003) over smaller open-source alternatives (MPT-7b-instruct, Falcon-7b-instruct) in this task, while highlighting the practical tools and infrastructure required for their application.