How Effective are Generative Large Language Models in Performing Requirements Classification? (2504.16768v1)

Published 23 Apr 2025 in cs.CL, cs.AI, and cs.SE

Abstract: In recent years, transformer-based LLMs have revolutionised NLP, with generative models opening new possibilities for tasks that require context-aware text generation. Requirements engineering (RE) has also seen a surge in the experimentation of LLMs for different tasks, including trace-link detection, regulatory compliance, and others. Requirements classification is a common task in RE. While non-generative LLMs like BERT have been successfully applied to this task, there has been limited exploration of generative LLMs. This gap raises an important question: how well can generative LLMs, which produce context-aware outputs, perform in requirements classification? In this study, we explore the effectiveness of three generative LLMs (Bloom, Gemma, and Llama) in performing both binary and multi-class requirements classification. We design an extensive experimental study involving over 400 experiments across three widely used datasets (PROMISE NFR, Functional-Quality, and SecReq). Our study concludes that while factors like prompt design and LLM architecture are universally important, others, such as dataset variations, have a more situational impact, depending on the complexity of the classification task. This insight can guide future model development and deployment strategies, focusing on optimising prompt structures and aligning model architectures with task-specific needs for improved performance.

Summary

An Analysis of the Effectiveness of Generative LLMs in Requirements Classification

The paper "How Effective are Generative LLMs in Performing Requirements Classification?" investigates the application of generative LLMs to the task of requirements classification within Requirements Engineering (RE). This paper represents a significant contribution to understanding the potential and limitations of generative LLMs in this domain, typically dominated by non-generative models like BERT.

Experimental Design and Objectives

The research explores three generative LLMs (Bloom, Gemma, and Llama), evaluating their performance through extensive experimentation involving over 400 different conditions across datasets, models, prompts, and tasks. The primary aim is to determine how well these models perform relative to non-generative counterparts, which have previously shown strong results on RE tasks.
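
To make the factorial design concrete, the following is a minimal sketch of how such an experimental grid could be enumerated. The factor levels are illustrative placeholders only; the paper's exact prompt styles and dataset variants differ, and its full grid comprises over 400 conditions.

```python
from itertools import product

# Illustrative factor levels only; these are NOT the paper's exact settings.
models = ["bloom", "gemma", "llama"]
datasets = ["promise-nfr", "functional-quality", "secreq"]
tasks = ["binary", "multi-class"]
prompt_styles = ["assertion", "question", "instruction"]    # assumed styles
dataset_variants = ["original", "paraphrased", "truncated"]  # assumed variants

# Each tuple is one experimental condition in this illustrative grid;
# the real study evaluates more than 400 such combinations.
conditions = list(product(models, datasets, tasks, prompt_styles, dataset_variants))
print(f"{len(conditions)} conditions in this illustrative grid")
```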

Key Findings

  1. Performance Metrics:
    • Bloom demonstrated superior precision (up to 0.77), particularly in binary classification tasks.
    • Gemma excelled in recall (average of 0.51 in binary classification), indicating its strength in identifying requirements across varying conditions.
    • Llama provided balanced results, suggesting it is well-suited for both binary and multi-class tasks.
  2. Prompt Engineering:
    • Assertion-based prompts appeared most effective, underscoring the importance of prompt design in leveraging model capabilities fully (see the sketch after this list).
  3. Dataset Robustness:
    • Generative models maintained consistent performance despite variations in dataset format, highlighting their robustness in processing requirements inputs.
  4. Comparative Analysis with Non-Generative Models:
    • Generative models did not uniformly surpass non-generative models such as All-Mini and SBERT; in multi-class classification in particular, the non-generative models performed better.
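
As a rough illustration of the assertion-based prompting idea, the sketch below classifies a single requirement as security-related or not with an off-the-shelf generative model. The checkpoint name, prompt wording, and answer parsing are assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch of assertion-style prompting for binary requirements
# classification; checkpoint and prompt wording are assumed, not the paper's.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf")  # assumed checkpoint

requirement = "The system shall encrypt all stored user credentials."

# Assertion-based prompt: the model is asked to confirm or reject a statement
# about the requirement rather than answer an open-ended question.
prompt = (
    f'Requirement: "{requirement}"\n'
    "Assertion: This requirement is security-related.\n"
    "Is the assertion true or false? Answer:"
)

completion = generator(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"]
answer = completion[len(prompt):].strip().lower()
label = "security" if answer.startswith("true") else "non-security"
print(label)
```

In a multi-class setting, the same pattern can be repeated with one assertion per candidate class, selecting the class whose assertion the model confirms.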

Implications

The paper's findings indicate that while generative LLMs bring promising capabilities for requirements classification, the choice of model and prompt structure can significantly influence outcomes. These models exhibit strength in certain metrics (e.g., recall for Gemma) but haven't fully eclipsed non-generative models in broader settings, particularly in complex multi-class tasks.

Speculations and Future Directions

Given these results, continued exploration of generative LLMs should focus on optimizing prompts and refining architectural strategies that enhance multi-class discrimination. As transformer architectures continue to evolve, and as fine-tuning and task-adaptation methodologies advance, generative LLMs may integrate more seamlessly into AI-driven requirements engineering frameworks.

In summary, this paper contributes a foundational understanding of generative LLMs' role in requirements classification, offering groundwork for future research that could better integrate these models into RE practices.
