- The paper introduces two novel approaches, framing legal requirements traceability as a classification task with Sentence Transformers and as a prompting task with GPT-4o.
- The classification method (K) outperforms a baseline on HIPAA with a 35 pp F1-score gain but struggles with transferability when applied to GDPR.
- The GPT-4o prompting approach (Rice) achieves 84% recall and substantially reduces manual review effort, despite generating some false positives.
This paper addresses the challenge of Legal Requirements Traceability (LRT), which involves linking software requirements to provisions in legal regulations like GDPR or HIPAA. LRT is crucial for ensuring software compliance but is hindered by the complexity of legal text, the gap between legal and technical language, and the significant manual effort required. Existing automated approaches often rely on older NLP techniques, have limited evaluations, and struggle with transferability across different regulations or domains.
To overcome these limitations, the paper proposes and evaluates two novel automated approaches:
- K (trace linK identificAtion ... using sentence transFormers): This approach frames LRT as a classification problem. It uses Sentence Transformers (ST), pre-trained language models fine-tuned to produce semantically meaningful sentence embeddings, to compute the similarity between software requirements and legal provisions.
- Implementation: K involves preparing training data (requirement-provision pairs with trace links), selecting a base ST model (identified as ST29, `paraphrase-multilingual-mpnet-base-v2`, via experiments), fine-tuning it on an LRT dataset (HIPAA), preprocessing input requirements, computing cosine similarity scores between requirements and all provisions, and predicting trace links based on a threshold (θ).
- Thresholding Methods: Three variants were tested: K<sub>constant</sub> (fixed θ=0.5), K<sub>dynamic</sub> (threshold derived from similarity to negative examples), and K<sub>Δ</sub> (threshold placed at the maximum drop in the sorted similarity scores); a minimal sketch of the similarity computation and thresholding follows.
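The sketch below assumes the `sentence-transformers` library; the requirement and provision texts are hypothetical placeholders, and the fine-tuning step and the K<sub>dynamic</sub> variant are omitted for brevity.

```python
# Minimal sketch of K's similarity scoring with the K_constant and
# K_delta thresholding variants (texts are hypothetical placeholders).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")  # ST29

requirement = "The system shall encrypt personal health data at rest."
provisions = [
    "Implement technical safeguards for electronic protected health information.",
    "Provide individuals with access to their records.",
    "Retain required documentation for six years.",
]

req_emb = model.encode([requirement], convert_to_tensor=True)
prov_emb = model.encode(provisions, convert_to_tensor=True)
scores = util.cos_sim(req_emb, prov_emb)[0]  # similarity to every provision

# K_constant: link every provision whose similarity reaches a fixed theta.
theta = 0.5
links_constant = [i for i, s in enumerate(scores) if s >= theta]

# K_delta: sort the scores and cut at the largest drop between neighbours.
order = scores.argsort(descending=True)
sorted_scores = scores[order]
drops = sorted_scores[:-1] - sorted_scores[1:]
cut = int(drops.argmax()) + 1  # keep everything above the biggest gap
links_delta = order[:cut].tolist()
```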
- Rice: This approach leverages LLMs, specifically OpenAI's GPT-4o, through structured prompt engineering based on the Rice framework (Role, Instruction, Context, Examples).
- Implementation: A detailed prompt was designed incorporating context about the LRT task, few-shot examples (requirement, linked provisions, and rationale), specific instructions (consider all provisions, use reasoning for indirect links, prioritize recall, provide a rationale), and a defined output format. The prompt was applied to each requirement individually; a sketch of such a prompt appears below.
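The following sketch assumes the `openai` Python client; it paraphrases the four Rice components rather than reproducing the paper's exact prompt, and the requirement text and angle-bracket placeholders are hypothetical.

```python
# Rice-style prompt in miniature (Role, Instruction, Context, Examples);
# the wording and the requirement are illustrative, not the paper's prompt.
from openai import OpenAI

client = OpenAI()

system_msg = (
    "Role: You are a compliance analyst tracing software requirements "
    "to GDPR provisions."
)
user_msg = (
    "Instruction: Consider ALL provisions below; reason step by step about "
    "indirect links, prioritize recall, and give a rationale per link.\n"
    "Context: <the 26 GDPR provisions pertinent to software>\n"
    "Examples: <few-shot requirement / linked-provisions / rationale triples>\n"
    "Requirement: The system shall let users erase their account data.\n"
    "Output format: a JSON list of {provision, rationale} objects."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg},
    ],
    temperature=0,  # deterministic output aids reproducibility
)
print(response.choices[0].message.content)
```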
The paper also re-implements a Baseline (B) approach from prior work [cleland:2010, Guo:17], which uses a probabilistic method based on indicator terms (keywords) found in requirements and provisions.
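As a rough illustration of the baseline's core idea, not the exact probabilistic model of [cleland:2010, Guo:17], a requirement-provision pair can be scored by the indicator terms it shares; the term list below is hypothetical.

```python
# Crude indicator-term scoring; the term list is hypothetical and the
# ratio is a stand-in for the baseline's probabilistic link score.
INDICATOR_TERMS = {"disclose", "consent", "access", "notify", "safeguard"}

def indicator_score(requirement: str, provision: str) -> float:
    """Score a pair by the indicator terms appearing in both texts."""
    req_terms = INDICATOR_TERMS.intersection(requirement.lower().split())
    prov_terms = INDICATOR_TERMS.intersection(provision.lower().split())
    return len(req_terms & prov_terms) / len(INDICATOR_TERMS)

print(indicator_score(
    "Users must consent before the system may disclose records.",
    "Covered entities shall obtain consent prior to any disclosure.",
))  # 0.2 -- only "consent" matches; "disclose" vs. "disclosure" shows the
    # vocabulary gap that keyword methods miss and K/Rice aim to bridge
```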
Evaluation and Results:
The approaches were evaluated using the HIPAA benchmark dataset and a newly curated dataset comprising four requirements documents (shall-requirements and user stories) traced against 26 GDPR provisions pertinent to software.
- RQ1 (Best ST Model): Comparing 38 ST models in a zero-shot setting on HIPAA using AUC, ST29 (`paraphrase-multilingual-mpnet-base-v2`) yielded the best performance (AUC=0.859), highlighting that top-performing general NLP models are not always best for specific RE tasks.
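In miniature, this zero-shot comparison scores every requirement-provision pair with a candidate ST model and computes AUC against the gold trace links; the scores and labels below are hypothetical, with AUC from scikit-learn.

```python
# Zero-shot model comparison in miniature: rank candidate pairs by
# similarity and compute AUC against gold links (values hypothetical).
from sklearn.metrics import roc_auc_score

pair_scores = [0.71, 0.32, 0.55, 0.18]  # cosine similarities from one ST model
gold_labels = [1, 0, 1, 0]              # 1 = true trace link in the benchmark

print(roc_auc_score(gold_labels, pair_scores))  # 1.0 on this toy data
```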
- RQ2 (K vs. Baseline on HIPAA): Using leave-one-out cross-validation, K<sub>constant</sub> significantly outperformed the baseline B, achieving an average F1-score of ~57% compared to B's ~22% (+35 pp gain). K<sub>constant</sub> provided the best precision-recall balance among K variants, though K<sub>Δ</sub> achieved the highest recall (80%) at the cost of very low precision.
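The per-requirement F1 underlying RQ2 can be computed from predicted and gold provision sets, as in this sketch with hypothetical HIPAA-style provision identifiers.

```python
# Per-requirement F1 from predicted vs. gold provision sets; the
# provision identifiers are hypothetical.
def f1(predicted: set[str], gold: set[str]) -> float:
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

print(f1({"164.312(a)", "164.306(d)"}, {"164.312(a)"}))  # ~0.67
```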
- RQ3 (K on GDPR Dataset): When applied to the unseen GDPR dataset without re-tuning, K<sub>constant</sub>'s performance dropped sharply (average recall ≈ 15%, success rate ≈ 44%). While better than the zero-shot ST29 model, this showed the limits of K's ability to generalize to more complex regulations and diverse requirement types. K also lacks explanatory power, offering no rationale for its predictions.
- RQ4 (Rice vs. K on GDPR Dataset): The Rice approach using GPT-4o significantly outperformed K, achieving an average recall of 84% and a success rate of ~89% at the requirement level. Although Rice generated many false positives (FPs) and few exact matches (most were partial matches with 1-3 FPs), it successfully identified the vast majority of true trace links and provided rationales for its predictions. This drastically reduces the analyst's workload, requiring them to vet only ~12% of potential provisions while finding 84% of actual links.
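A back-of-the-envelope reading of that workload claim, assuming a hypothetical document of 100 requirements traced against the 26 GDPR provisions:

```python
# Workload reduction in miniature: Rice flags ~12% of candidate pairs
# for vetting while recovering 84% of true links (document size is
# hypothetical).
requirements, provisions = 100, 26
total_pairs = requirements * provisions  # 2600 pairs to inspect manually
flagged = round(0.12 * total_pairs)      # ~312 pairs Rice asks the analyst to vet
print(total_pairs, flagged)              # 2600 312
```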
Conclusions:
The paper concludes that while classifier-based approaches like K improve upon older methods for simpler LRT tasks (HIPAA), they struggle with the complexity and transferability needed for regulations like GDPR. LLM-based prompting, as implemented in Rice, demonstrates superior performance and practical utility for complex LRT by leveraging the model's reasoning capabilities and providing explanatory rationales, even without specific training data beyond the few-shot examples. Despite generating FPs, this approach significantly reduces manual effort, making it a more promising direction for future LRT research and practice.