Cross-Domain Data Selection and Augmentation for Automatic Compliance Detection

Published 23 Apr 2026 in cs.CL and cs.LG | (2604.21469v1)

Abstract: Automating the detection of regulatory compliance remains a challenging task due to the complexity and variability of legal texts. Models trained on one regulation often fail to generalise to others. This limitation underscores the need for principled methods to improve cross-domain transfer. We study data selection as a strategy to mitigate negative transfer in compliance detection framed as a natural language inference (NLI) task. Specifically, we evaluate four approaches for selecting augmentation data from a larger source domain: random sampling, Moore-Lewis's cross-entropy difference, importance weighting, and embedding-based retrieval. We systematically vary the proportion of selected data to analyse its effect on cross-domain adaptation. Our findings demonstrate that targeted data selection substantially reduces negative transfer, offering a practical path toward scalable and reliable compliance automation across heterogeneous regulations.


Authors (2)

Summary

The paper demonstrates that principled data selection, especially importance weighting and Moore-Lewis methods, significantly enhances compliance detection accuracy across domains.
It models regulatory compliance as a Natural Language Inference task, classifying requirement-statement pairs as entailment, neutral, or contradiction to improve explainability.
Empirical findings reveal that optimal selection ratios prevent negative transfer, highlighting the need for careful calibration when augmenting source domain data.

Cross-Domain Data Selection and Augmentation for Automated Regulatory Compliance Detection

Introduction

Automatic detection of regulatory compliance within software systems presents considerable challenges due to the heterogeneity and complexity of legal texts. Existing models, while effective within a single regulatory domain, suffer significant performance degradation when transferred across domains, predominantly because of non-trivial shifts in legal terminology, structure, and reasoning styles. The paper "Cross-Domain Data Selection and Augmentation for Automatic Compliance Detection" (2604.21469) addresses this by systematically evaluating data selection strategies for augmenting compliance datasets, aiming to mitigate negative transfer and enhance cross-domain generalization. The study frames compliance detection as a Natural Language Inference (NLI) task and benchmarks several principled data selection methods to control the quality of cross-domain augmentation.

Figure 1: Pipeline illustrating the evaluation of multiple data selection methods to subset a large source domain (e.g., GDPR) for effective transfer to a smaller target domain (e.g., HIPAA) and prevent negative transfer.

Problem Formulation and Methodology

Compliance as Natural Language Inference

The compliance detection task is modeled as NLI: regulatory requirements serve as premises, and the corresponding statements from Data Processing Agreements (DPAs) or policy documents serve as hypotheses. Each premise-hypothesis pair is classified as entailment (compliance), neutral, or contradiction (non-compliance). This framing exploits the textual structure of legal requirements, letting the model generalize over explicit textual entailments and supporting data-driven explainability.
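
The framing above can be illustrated with a minimal data structure. This is only a sketch: the class name `CompliancePair`, the example sentences, and the outcome wording for the neutral class ("not addressed") are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass

# The three NLI classes come from the text; the outcome strings are
# illustrative assumptions, especially the reading of "neutral".
NLI_TO_COMPLIANCE = {
    "entailment": "compliant",
    "neutral": "not addressed",
    "contradiction": "non-compliant",
}

@dataclass(frozen=True)
class CompliancePair:
    premise: str      # regulatory requirement
    hypothesis: str   # statement from a DPA or policy document
    label: str        # one of the keys of NLI_TO_COMPLIANCE

    def __post_init__(self):
        # Reject labels outside the three-way NLI scheme.
        if self.label not in NLI_TO_COMPLIANCE:
            raise ValueError(f"unknown NLI label: {self.label!r}")

    def outcome(self) -> str:
        return NLI_TO_COMPLIANCE[self.label]

# Hypothetical example pair, written for this sketch.
pair = CompliancePair(
    premise="The processor shall notify the controller of a personal data breach without undue delay.",
    hypothesis="The vendor will inform the customer promptly after becoming aware of a data breach.",
    label="entailment",
)
print(pair.outcome())
```

Keeping the premise and hypothesis as separate fields mirrors the usual input format for encoder-based NLI fine-tuning, where the two texts are passed as a sentence pair.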

Data Selection Strategies

The paper explores four data selection strategies for cross-domain augmentation:

Random Sampling: Baseline method with uniform sampling from the source domain.
Moore-Lewis Cross-Entropy Difference: Ranks source data by the difference in cross-entropy under language models trained separately on the source and target domains, prioritizing samples that are statistically similar to the target.
Importance Weighting (Density Ratio): Trains a classifier to estimate the density ratio between target and source instances, then uses these estimates to select source samples with high target likelihood.
Embedding-Based Retrieval: Uses pretrained encoder representations (e.g., RoBERTa-large) to select source examples with high cosine similarity to target-domain instances.

Figure 2: Visualization of data selection scores and position rankings for the embedding similarity method.

Figure 3: Distribution of embedding similarities between source and target datasets using RoBERTa-large, illustrating the discriminative signal for neighbor selection.
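
Embedding-based retrieval can be sketched as follows, assuming the sentence embeddings have already been computed by an encoder. The toy two-dimensional vectors, the function names, and the choice to score each source example by its maximum similarity to any target example are assumptions for illustration; the paper may aggregate similarities differently.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def embedding_scores(source_vecs, target_vecs):
    """Score each source vector by its best match among the target vectors
    (max aggregation is an assumption of this sketch)."""
    return [max(cosine(s, t) for t in target_vecs) for s in source_vecs]

def top_fraction(scores, ratio):
    """Indices of the highest-scoring items, keeping ceil(ratio * n) of them."""
    k = max(1, math.ceil(ratio * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Toy stand-ins for encoder outputs (real embeddings would be much larger).
source_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
target_vecs = [[0.0, 1.0], [0.1, 0.9]]

scores = embedding_scores(source_vecs, target_vecs)
selected = top_fraction(scores, 0.5)  # indices of the two most target-like source vectors
print(selected)
```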

Random sampling and full (unfiltered) augmentation serve as controls for measuring the effect of indiscriminate data addition.
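
As a rough illustration of cross-entropy difference scoring, the sketch below replaces the language models with add-one smoothed unigram models so that it stays self-contained. The unigram models, the toy sentences, and all function names are assumptions for illustration and do not reproduce the paper's setup. A lower score means a source sentence looks more like the target domain.

```python
import math
from collections import Counter

def train_unigram(sentences, vocab):
    """Add-one smoothed unigram model over a fixed vocabulary."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    total = sum(counts[w] for w in vocab) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def cross_entropy(sentence, model):
    """Per-token cross-entropy in bits; tokens outside the vocabulary are skipped."""
    toks = [t for t in sentence.lower().split() if t in model]
    if not toks:
        return 0.0
    return -sum(math.log2(model[t]) for t in toks) / len(toks)

def moore_lewis_scores(source, target):
    """Score = H_target(s) - H_source(s) for each source sentence s (lower is better)."""
    vocab = {tok for s in source + target for tok in s.lower().split()}
    lm_target = train_unigram(target, vocab)
    lm_source = train_unigram(source, vocab)
    return [cross_entropy(s, lm_target) - cross_entropy(s, lm_source) for s in source]

# Hypothetical toy sentences standing in for GDPR (source) and HIPAA (target) text.
source = [
    "the processor shall encrypt patient health records",
    "the supervisory authority may issue a fine",
]
target = [
    "covered entities must encrypt patient health information",
    "patient records require access controls",
]

ml_scores = moore_lewis_scores(source, target)
ranked = sorted(range(len(source)), key=lambda i: ml_scores[i])  # most target-like first
print(ranked)
```

In this toy example the first source sentence shares vocabulary with the target sentences, so it receives the lower score and is ranked first.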

Experimental Design

Datasets

The experiments are anchored on a large GDPR-DPA dataset as the source domain and a much smaller, annotated HIPAA dataset as the target. The regulatory shift between GDPR (focused on data protection within the EU) and HIPAA (healthcare data privacy in the US) ensures a strong domain mismatch, making this a stringent test for cross-domain transfer approaches.

Evaluation Protocol

Two primary scenarios are considered:

In-domain: Models are trained and tested within the same domain, establishing an upper performance bound.
Cross-domain: Models are trained on GDPR-DPA (with selectable augmentation) and evaluated on HIPAA, measuring transfer effectiveness under various selection ratios (from 1% up to 90%).
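
The selection-ratio sweep can be sketched as below. The sketch assumes that the selected source subset is added to a small target training set and that a higher score means a better match; the intermediate ratio values, the placeholder data, and the function names are hypothetical, since the text gives only the 1% to 90% range.

```python
def select_top(source, scores, ratio):
    """Keep the highest-scoring fraction of source examples (at least one)."""
    k = max(1, round(ratio * len(source)))
    order = sorted(range(len(source)), key=lambda i: scores[i], reverse=True)
    return [source[i] for i in order[:k]]

def build_training_sets(target_train, source, scores, ratios):
    """Map each ratio to the target data plus the selected source subset."""
    return {r: list(target_train) + select_top(source, scores, r) for r in ratios}

# Hypothetical grid; only the 1% and 90% endpoints come from the text.
RATIOS = [0.01, 0.05, 0.25, 0.50, 0.75, 0.90]

# Placeholder examples and scores standing in for real NLI pairs.
source = [f"gdpr_pair_{i}" for i in range(200)]
scores = [float(i) for i in range(200)]
target_train = [f"hipaa_pair_{i}" for i in range(20)]

training_sets = build_training_sets(target_train, source, scores, RATIOS)
sizes = {r: len(examples) for r, examples in training_sets.items()}
print(sizes)
```

Each resulting training set would then be used to fine-tune a separate model that is evaluated on the held-out target test set. For scores where lower is better, the scores can be negated before calling `select_top`.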

Encoder-based models (BERT-large, RoBERTa-large, Legal-BERT) are trained via multi-class NLI fine-tuning; decoder-based LLMs (GPT-2-XL, Llama-3) are evaluated under zero- and one-shot prompting to quantify the gap with discriminative models in compliance scenarios.

Empirical Findings

In-domain Performance

Fine-tuned encoder models achieve strong F1 scores on the in-domain (GDPR-DPA) test set (Legal-BERT: 0.86, BERT: 0.85), but performance degrades sharply on the out-of-domain (HIPAA) test set (F1 ≤ 0.56 for all models). Decoder LLMs under zero- and one-shot prompting are substantially weaker, scoring barely above random chance on both domains. This underscores how difficult domain transfer is without explicit adaptation.

Cross-Domain Data Selection Results

Systematic augmentation using data selection methods delivers substantial improvement over target-only training, but the relationship between selection ratio and performance is non-monotonic. The most salient observations include:

Embedding Similarity: Achieves robust gains at both low (1%) and high (75%) selection ratios (validation F1 up to 0.93), but exhibits volatility, with notable negative transfer at intermediate ratios.
Importance Weighting: Peaks with an F1 of 0.97 (validation) at 5% selected data, substantially outperforming random or full-augmentation baselines, though performance collapses at larger ratios due to reintroduction of poorly matched samples.
Moore-Lewis Cross-Entropy: Delivers stable improvement, especially at higher selection ratios, with fewer sharp outliers and less exposure to severe negative transfer.
Random Sampling and Full Augmentation: Both suffer from pronounced negative transfer at intermediate to high ratios, with F1 scores often dropping below those of no augmentation.

Figure 4: Data selection scoring and ranking under the importance weighting method, illustrating which source instances are prioritized for augmentation.
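
The kind of scoring shown in Figure 4 can be approximated with a domain classifier. The sketch below is an assumption-laden illustration, not the paper's implementation: it uses bag-of-words features and a hand-written logistic regression, and it converts the classifier's logit z into a density ratio estimate via w(x) = (n_source / n_target) * exp(z), which equals the prior-corrected odds p(target | x) / p(source | x). All names and sentences are hypothetical.

```python
import math
from collections import Counter

def featurize(text, vocab):
    """Bag-of-words count vector over a fixed, ordered vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(w, b, x):
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_domain_classifier(X, y, epochs=200, lr=0.1):
    """Logistic regression by batch gradient descent; y = 1 marks target-domain text."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(logit(w, b, xi)) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def density_ratio_scores(source, target):
    """Estimate p_target(x) / p_source(x) for each source sentence."""
    vocab = sorted({tok for s in source + target for tok in s.lower().split()})
    X = [featurize(s, vocab) for s in source + target]
    y = [0] * len(source) + [1] * len(target)
    w, b = train_domain_classifier(X, y)
    prior = len(source) / len(target)  # corrects for the class imbalance
    return [prior * math.exp(logit(w, b, featurize(s, vocab))) for s in source]

# Hypothetical toy sentences standing in for GDPR (source) and HIPAA (target) text.
source = [
    "the processor shall encrypt patient health records",
    "the supervisory authority may issue a fine",
    "the controller shall appoint a data protection officer",
]
target = [
    "covered entities must encrypt patient health information",
    "patient records require access controls",
]

iw_scores = density_ratio_scores(source, target)
best = max(range(len(source)), key=lambda i: iw_scores[i])  # most target-like source sentence
print(best)
```

Source sentences with a high estimated ratio are the ones a selection step would keep first; with real data, a regularized classifier over encoder features would likely be a more robust choice than this toy model.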

The paper makes a strong empirical claim that augmenting with a small, carefully selected subset of source-domain data—using importance weighting or Moore-Lewis selection—substantially outperforms using the full source dataset or random sampling, with importance weighting yielding the best peak F1.

Negative Transfer Analysis

Qualitative review and lexical statistics show that the top-ranked source samples under each selection score align meaningfully with the target on domain-relevant concepts (e.g., mappings between inactivity timeouts in GDPR and HIPAA security controls). By contrast, as selection thresholds widen, generic or superficial legal phrasing contaminates the augmented dataset. This introduces spurious lexical overlap without true semantic transfer, which causes negative transfer and lowers performance. The analysis supports capping source augmentation at low percentiles and using selection signals that are sensitive to semantic alignment rather than surface overlap.

Implications and Future Directions

Practical Implications

Principled data selection is crucial to safe and effective cross-domain transfer in regulatory compliance detection tasks; naively increasing training set size with uncontrolled cross-domain data can degrade model reliability.
Embedding-based and density ratio techniques are comparatively robust, but their operational benefits are sensitive to selection ratios, and both require careful calibration.
These findings are relevant to compliance systems that must cover heterogeneous regulations with scarce annotated target data, and they point to a pathway toward scalable, explainable compliance verification.

Theoretical Implications and AI Research Outlook

This work reinforces the centrality of distributional alignment and domain adaptation theory within NLP for legal and regulatory text analysis. The observation of non-monotonic transfer curves and catastrophic negative transfer at particular augmentation ratios has direct theoretical relevance for transfer learning research and is applicable to broader multi-domain adaptation scenarios.

Future investigations should expand empirical analysis to other legal regimes (e.g., finance, safety standards), develop compliance-specific representations (such as domain-aware LMs), and explore synthetic augmentation strategies (e.g., paraphrasing, backtranslation, structure-guided generation). Further, analysis of internal model representations and explanation faithfulness is required for high-stakes compliance automation systems to enhance trust and traceability.

Conclusion

The study provides a comprehensive, data-driven framework for cross-domain compliance detection. It demonstrates that targeted selection of cross-domain data, especially via importance weighting and Moore-Lewis cross-entropy difference, improves model transferability while mitigating negative transfer. These insights have direct implications for the design of cross-regulatory compliance automation and transfer learning methodologies in NLP, and they suggest promising directions for principled, scalable, and trustworthy AI systems in high-stakes legal contexts.