- The paper introduces two novel approaches, framing legal requirements traceability as a classification task with Sentence Transformers and as a prompting task with GPT-4o.
- The classification method (K) outperforms a baseline on HIPAA with a 35 pp F1-score gain but struggles with transferability when applied to GDPR.
- The GPT-4o prompting approach (Rice) achieves 84% recall and substantially reduces manual review effort, despite generating some false positives.
This paper addresses the challenge of Legal Requirements Traceability (LRT), which involves linking software requirements to provisions in legal regulations like GDPR or HIPAA. LRT is crucial for ensuring software compliance but is hindered by the complexity of legal text, the gap between legal and technical language, and the significant manual effort required. Existing automated approaches often rely on older NLP techniques, have limited evaluations, and struggle with transferability across different regulations or domains.
To overcome these limitations, the paper proposes and evaluates two novel automated approaches:
- K (trace linK identificAtion ... using sentence transFormers): This approach frames LRT as a classification problem. It uses Sentence Transformers (ST), pre-trained language models fine-tuned to produce semantically meaningful sentence embeddings, to compute the similarity between software requirements and legal provisions.
- Implementation: K involves preparing training data (requirement-provision pairs with trace links), selecting a base ST model (identified as ST29, `paraphrase-multilingual-mpnet-base-v2`, via experiments), fine-tuning it on an LRT dataset (HIPAA), preprocessing input requirements, computing cosine similarity scores between requirements and all provisions, and predicting trace links based on a threshold (θ).
- Thresholding Methods: Three variants were tested: K<sub>constant</sub> (fixed θ=0.5), K<sub>dynamic</sub> (threshold derived from similarity to negative examples), and K<sub>Δ</sub> (threshold placed at the maximum drop in the sorted similarity scores); a minimal sketch of the similarity computation and thresholding follows.
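The sketch below assumes the `sentence-transformers` library; the requirement and provision texts are hypothetical placeholders, and the fine-tuning step and the K<sub>dynamic</sub> variant are omitted for brevity.

```python
# Minimal sketch of K's similarity scoring with the K_constant and
# K_delta thresholding variants (texts are hypothetical placeholders).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")  # ST29

requirement = "The system shall encrypt personal health data at rest."
provisions = [
    "Implement technical safeguards for electronic protected health information.",
    "Provide individuals with access to their records.",
    "Retain required documentation for six years.",
]

req_emb = model.encode([requirement], convert_to_tensor=True)
prov_emb = model.encode(provisions, convert_to_tensor=True)
scores = util.cos_sim(req_emb, prov_emb)[0]  # similarity to every provision

# K_constant: link every provision whose similarity reaches a fixed theta.
theta = 0.5
links_constant = [i for i, s in enumerate(scores) if s >= theta]

# K_delta: sort the scores and cut at the largest drop between neighbours.
order = scores.argsort(descending=True)
sorted_scores = scores[order]
drops = sorted_scores[:-1] - sorted_scores[1:]
cut = int(drops.argmax()) + 1  # keep everything above the biggest gap
links_delta = order[:cut].tolist()
```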
- Rice: This approach leverages LLMs, specifically OpenAI's GPT-4o, through structured prompt engineering based on the Rice framework (Role, Instruction, Context, Examples).
- Implementation: A detailed prompt was designed incorporating context about the LRT task, few-shot examples (requirement, linked provisions, and rationale), specific instructions (consider all provisions, use reasoning for indirect links, prioritize recall, provide a rationale), and a defined output format. The prompt was applied to each requirement individually; a sketch of such a prompt appears below.
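The following sketch assumes the `openai` Python client; it paraphrases the four Rice components rather than reproducing the paper's exact prompt, and the requirement text and angle-bracket placeholders are hypothetical.

```python
# Rice-style prompt in miniature (Role, Instruction, Context, Examples);
# the wording and the requirement are illustrative, not the paper's prompt.
from openai import OpenAI

client = OpenAI()

system_msg = (
    "Role: You are a compliance analyst tracing software requirements "
    "to GDPR provisions."
)
user_msg = (
    "Instruction: Consider ALL provisions below; reason step by step about "
    "indirect links, prioritize recall, and give a rationale per link.\n"
    "Context: <the 26 GDPR provisions pertinent to software>\n"
    "Examples: <few-shot requirement / linked-provisions / rationale triples>\n"
    "Requirement: The system shall let users erase their account data.\n"
    "Output format: a JSON list of {provision, rationale} objects."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg},
    ],
    temperature=0,  # deterministic output aids reproducibility
)
print(response.choices[0].message.content)
```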
The paper also re-implements a Baseline (B) approach from prior work [cleland:2010, Guo:17], which uses a probabilistic method based on indicator terms (keywords) found in requirements and provisions.
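As a rough illustration of the baseline's core idea, not the exact probabilistic model of [cleland:2010, Guo:17], a requirement-provision pair can be scored by the indicator terms it shares; the term list below is hypothetical.

```python
# Crude indicator-term scoring; the term list is hypothetical and the
# ratio is a stand-in for the baseline's probabilistic link score.
INDICATOR_TERMS = {"disclose", "consent", "access", "notify", "safeguard"}

def indicator_score(requirement: str, provision: str) -> float:
    """Score a pair by the indicator terms appearing in both texts."""
    req_terms = INDICATOR_TERMS.intersection(requirement.lower().split())
    prov_terms = INDICATOR_TERMS.intersection(provision.lower().split())
    return len(req_terms & prov_terms) / len(INDICATOR_TERMS)

print(indicator_score(
    "Users must consent before the system may disclose records.",
    "Covered entities shall obtain consent prior to any disclosure.",
))  # 0.2 -- only "consent" matches; "disclose" vs. "disclosure" shows the
    # vocabulary gap that keyword methods miss and K/Rice aim to bridge
```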
Evaluation and Results:
The approaches were evaluated using the HIPAA benchmark dataset and a newly curated dataset comprising four requirements documents (shall-requirements and user stories) traced against 26 GDPR provisions pertinent to software.
- RQ1 (Best ST Model): Comparing 38 ST models in a zero-shot setting on HIPAA using AUC, ST29 (`paraphrase-multilingual-mpnet-base-v2`) yielded the best performance (AUC=0.859), highlighting that top-performing general NLP models are not always best for specific RE tasks.
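In miniature, this zero-shot comparison scores every requirement-provision pair with a candidate ST model and computes AUC against the gold trace links; the scores and labels below are hypothetical, with AUC from scikit-learn.

```python
# Zero-shot model comparison in miniature: rank candidate pairs by
# similarity and compute AUC against gold links (values hypothetical).
from sklearn.metrics import roc_auc_score

pair_scores = [0.71, 0.32, 0.55, 0.18]  # cosine similarities from one ST model
gold_labels = [1, 0, 1, 0]              # 1 = true trace link in the benchmark

print(roc_auc_score(gold_labels, pair_scores))  # 1.0 on this toy data
```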
- RQ2 (K vs. Baseline on HIPAA): Using leave-one-out cross-validation, K<sub>constant</sub> significantly outperformed the baseline B, achieving an average F1-score of ~57% compared to B's ~22% (+35 pp gain). K<sub>constant</sub> provided the best precision-recall balance among K variants, though K<sub>Δ</sub> achieved the highest recall (80%) at the cost of very low precision.
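The per-requirement F1 underlying RQ2 can be computed from predicted and gold provision sets, as in this sketch with hypothetical HIPAA-style provision identifiers.

```python
# Per-requirement F1 from predicted vs. gold provision sets; the
# provision identifiers are hypothetical.
def f1(predicted: set[str], gold: set[str]) -> float:
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

print(f1({"164.312(a)", "164.306(d)"}, {"164.312(a)"}))  # ~0.67
```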
- RQ3 (K on GDPR Dataset): When applied to the unseen GDPR dataset without re-tuning, K<sub>constant</sub>'s performance dropped sharply (average recall ≈ 15%, success rate ≈ 44%). While better than the zero-shot ST29 model, this showed the limits of K's ability to generalize to more complex regulations and diverse requirement types. K also lacks explanatory power, offering no rationale for its predictions.
- RQ4 (Rice vs. K on GDPR Dataset): The Rice approach using GPT-4o significantly outperformed K, achieving an average recall of 84% and a success rate of ~89% at the requirement level. Although Rice generated many false positives (FPs) and few exact matches (most were partial matches with 1-3 FPs), it successfully identified the vast majority of true trace links and provided rationales for its predictions. This drastically reduces the analyst's workload, requiring them to vet only ~12% of potential provisions while finding 84% of actual links.
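A back-of-the-envelope reading of that workload claim, assuming a hypothetical document of 100 requirements traced against the 26 GDPR provisions:

```python
# Workload reduction in miniature: Rice flags ~12% of candidate pairs
# for vetting while recovering 84% of true links (document size is
# hypothetical).
requirements, provisions = 100, 26
total_pairs = requirements * provisions  # 2600 pairs to inspect manually
flagged = round(0.12 * total_pairs)      # ~312 pairs Rice asks the analyst to vet
print(total_pairs, flagged)              # 2600 312
```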
Conclusions:
The paper concludes that while classifier-based approaches like K improve upon older methods for simpler LRT tasks (HIPAA), they struggle with the complexity and transferability needed for regulations like GDPR. LLM-based prompting, as implemented in Rice, demonstrates superior performance and practical utility for complex LRT by leveraging the model's reasoning capabilities and providing explanatory rationales, even without specific training data beyond the few-shot examples. Despite generating FPs, this approach significantly reduces manual effort, making it a more promising direction for future LRT research and practice.