Synthetic Data Engine for Fraud Research
- A Synthetic Data Engine is a highly configurable system that programmatically generates datasets mimicking real-world covariate distributions and complex network structures.
- It employs modular simulation workflows and statistical models like Poisson and Gamma GLMs to replicate policyholder, contract, and claim data with precise control over parameters.
- The engine supports benchmarking and method evaluation in fraud analytics by enabling controlled experiments on class imbalance, network features, and investigative errors.
A synthetic data engine is a highly configurable software system for programmatically generating datasets that replicate both the covariate distributions and structural complexities of real-world data, while allowing explicit experimental control over parameters, dependencies, and simulation processes. In the context of insurance fraud analytics, as described in Oskarsdóttir et al. (2022), the simulation engine produces synthetic datasets with heterogeneous data types and rich network structure that mimic the relational and covariate patterns of actual insurance fraud claims data. The system offers granular user control over population size, dependency structures, probability models, and the downstream fraud generation process, enabling rigorous evaluation and development of learning and detection strategies.
1. Modular Simulation Workflow and Data Model
The synthetic data engine’s workflow is sequential and modular, encompassing seven primary steps, each corresponding to a layer in the modeled data hierarchy:
- Policyholder Generation: Simulates demographic and behavioral attributes (e.g., age, gender, exposure, number of contracts) using distributions such as the normal and Poisson, together with empirical dependency models (e.g., copulas inducing a negative correlation between vehicle age and value).
- Contract Feature Simulation: Generates contract-level attributes (vehicle attributes, coverage type, bonus-malus score) using categorical, discrete, and continuous models with options for simulating dependencies through copulas (e.g., Ali–Mikhail–Haq, Frank).
- Claims Generation: Models claim occurrence frequency using a Poisson GLM as a function of policyholder and contract attributes, and claim severity with a gamma GLM (log link), parameterized by user-selected covariates and effect sizes.
- Social Network Construction: Builds a bipartite graph that links claims (one class of nodes) with “parties” (e.g., garages, brokers; another node class), forming the foundation for engineered network features and relational analytics.
- Fraud Label Generation: Generates ground-truth fraud labels for claims using a logistic regression that incorporates both traditional features (claim amount, claim age, policyholder age/contract count) and engineered network features (neighborhood sizes, second-order ratios, BiRank score).
- Business Rule Filter and Expert Investigation Simulation: Applies expert-system rules (e.g., flagging based on high claim amount relative to vehicle value or repeat claims within short intervals), then randomly determines investigation outcomes conditional on underlying (simulated) fraud. Provides both ground truth and noisy labels reflecting real-world investigation coverage and errors.
- Tabular Merging and Output: Integrates all simulation layers into a unified dataset containing all attributes, network links, fraud labels, and engineered covariates for downstream analysis.
This architecture supports extensibility; each module’s parameters—such as the number of policyholders, class-imbalance ratios, and network feature effect sizes—can be tuned for specific research or benchmarking needs.
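To make the sequential workflow concrete, the following minimal Python sketch chains the seven steps as plain functions. All names (`EngineConfig`, `generate_policyholders`, and so on) and the toy generators inside them are illustrative assumptions, not the actual API or models of the engine described above.

```python
# Minimal, illustrative skeleton of the seven-step workflow; every function
# name and every toy generator here is hypothetical, not the engine's API.
from dataclasses import dataclass
import numpy as np

@dataclass
class EngineConfig:
    n_policyholders: int = 10_000
    n_parties: int = 500              # garages, brokers, experts, ...
    target_fraud_rate: float = 0.015  # roughly 1-2% fraud
    seed: int = 42

def generate_policyholders(cfg, rng):
    return {"age": rng.normal(45, 12, cfg.n_policyholders).clip(18, 90)}

def simulate_contracts(policyholders, rng):
    n = len(policyholders["age"])
    return {"vehicle_age": rng.integers(0, 20, n)}

def generate_claims(contracts, rng):
    lam = np.exp(-2.0 + 0.02 * contracts["vehicle_age"])  # Poisson GLM, log link
    return {"n_claims": rng.poisson(lam)}

def build_network(claims, cfg, rng):
    # attach each claim to one party at random -> bipartite edge list
    n_claims = int(claims["n_claims"].sum())
    return list(zip(range(n_claims), rng.integers(0, cfg.n_parties, n_claims)))

def generate_fraud_labels(n_claims, cfg, rng):
    return rng.random(n_claims) < cfg.target_fraud_rate

def simulate_investigation(labels, rng, coverage=0.2):
    investigated = rng.random(len(labels)) < coverage
    return np.where(investigated, labels, np.nan)  # NaN = never investigated

def merge_output(*layers):
    return layers  # placeholder for the tabular merge step

cfg = EngineConfig()
rng = np.random.default_rng(cfg.seed)
policyholders = generate_policyholders(cfg, rng)
contracts = simulate_contracts(policyholders, rng)
claims = generate_claims(contracts, rng)
edges = build_network(claims, cfg, rng)
fraud = generate_fraud_labels(len(edges), cfg, rng)
observed = simulate_investigation(fraud, rng)
dataset = merge_output(policyholders, contracts, claims, edges, fraud, observed)
```

Because each step consumes only the outputs of earlier steps, individual modules can be swapped out or re-parameterized without touching the rest of the pipeline.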
2. Parameterized Data Generation and Engineered Features
Detailed data generation mechanisms provide flexibility and statistical fidelity:
- User Controls: Researchers specify the scale (e.g., 10,000 policyholders), type and number of parties, fraud rate (e.g., 1–2%), and effect size magnitudes for covariates.
- Covariate Simulation: Uses a suite of probabilistic generators:
- Contract count: Poisson distribution, with λ possibly conditioned on age (e.g., via a convex function).
- Vehicle and contract attributes: mix of categorical generators and copulas to match empirical dependencies (see the copula sketch at the end of this section).
- Claims Frequency: $N_i \sim \mathrm{Poisson}(\lambda_i)$ with $\log \lambda_i = \log e_i + \mathbf{x}_i^{\top}\boldsymbol{\beta}$ (see the frequency/severity sketch after this list).
- $e_i$ is the exposure; $\mathbf{x}_i$ are the covariates; $\boldsymbol{\beta}$ is the coefficient vector (user-controlled).
- Claim Severity: $Y_{ij}$ follows a gamma GLM with log link, $\log \mathbb{E}[Y_{ij}] = \mathbf{x}_i^{\top}\boldsymbol{\gamma}$.
- Optionally including the claim count $N_i$ as an additional covariate models correlation between claim frequency and severity.
- Coverage Type: Modeled by multinomial logistic regression for cases with more than two categorical outcomes.
- Social Network Analytics: For each claim, engineered features include:
- First- and second-order neighborhood statistics (e.g., the neighborhood sizes $n_1(c)$ and $n_2(c)$), the ratio of fraudulent claims in the second-order neighborhood ($r_2(c)$), and a fraud “score” from iterative BiRank ($\mathbf{p}^{(k+1)} = \alpha S^{\top}\mathbf{c}^{(k)} + (1-\alpha)\,\mathbf{p}^{0}$, $\mathbf{c}^{(k+1)} = \alpha S\,\mathbf{p}^{(k+1)} + (1-\alpha)\,\mathbf{c}^{0}$), with $S = D_c^{-1/2} W D_p^{-1/2}$ the symmetrically normalized adjacency matrix of the bipartite claims–parties graph.
- Normalization: Continuous features are $z$-score normalized ($z = (x - \mu)/\sigma$) or min-max scaled to $[0, 1]$ ($x' = (x - x_{\min})/(x_{\max} - x_{\min})$).
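The sketch below illustrates the frequency and severity step under the GLM forms given above; the covariates, coefficient values, and gamma shape parameter are assumptions chosen purely for the example.

```python
# Illustrative simulation of claim frequency (Poisson GLM, log link) and
# claim severity (gamma GLM, log link); all coefficients are made up.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

exposure = rng.uniform(0.2, 1.0, n)        # fraction of the year insured
age = rng.normal(45, 12, n).clip(18, 90)
bonus_malus = rng.integers(0, 23, n)

# Frequency: log(lambda_i) = log(e_i) + x_i' beta
beta = np.array([-2.0, -0.01, 0.05])       # intercept, age, bonus-malus
log_lam = np.log(exposure) + beta[0] + beta[1] * age + beta[2] * bonus_malus
n_claims = rng.poisson(np.exp(log_lam))

# Severity: log E[Y_ij] = x_i' gamma, with an optional claim-count term
gamma_coef = np.array([7.0, -0.005])       # intercept, age
shape = 2.0                                # assumed gamma shape parameter
mean_severity = np.exp(gamma_coef[0] + gamma_coef[1] * age + 0.1 * n_claims)
severity = np.where(n_claims > 0,
                    rng.gamma(shape, mean_severity / shape),  # scale = mean/shape
                    0.0)

print(f"mean claim rate: {n_claims.mean():.3f}, "
      f"mean positive severity: {severity[n_claims > 0].mean():.0f}")
```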
Such explicit modeling allows fine-grained matching of real data statistics and structure, subject to the quality of domain knowledge and observed relationships.
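For the copula-driven dependencies (such as the negative relation between vehicle age and vehicle value), the sketch below uses a Gaussian copula as a simple stand-in for the Archimedean copulas (Ali–Mikhail–Haq, Frank) mentioned earlier; the correlation value and the marginal distributions are illustrative assumptions.

```python
# Gaussian-copula sketch: draw correlated uniforms, then push them through
# chosen marginals to obtain negatively correlated vehicle age and value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 10_000
rho = -0.6                                   # assumed negative dependence

# Step 1: correlated standard normals -> uniforms via the normal CDF
cov = np.array([[1.0, rho], [rho, 1.0]])
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
u = stats.norm.cdf(z)

# Step 2: map uniforms through the desired marginals (illustrative choices)
vehicle_age = stats.poisson.ppf(u[:, 0], mu=8)                  # years
vehicle_value = stats.gamma.ppf(u[:, 1], a=2.0, scale=12_000)   # currency units

print("Spearman correlation:",
      round(stats.spearmanr(vehicle_age, vehicle_value)[0], 2))
```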
3. Addressing Methodological and Practical Challenges
The simulation engine explicitly targets multiple persistent challenges in fraud analytics:
- Class Imbalance: Since fraud rates in real datasets are typically 1–2%, the engine allows explicit control of class proportions and effect sizes to enable benchmarking under rare-event learning scenarios.
- Label Scarcity: Simulates limited investigation coverage with random errors, reflecting that only a small proportion of real claims is ever investigated and labeled while the rest remain unlabeled (see the sketch after this list).
- Complex Relational Structure: By simulating a bipartite network and computing advanced network metrics, the engine enables research into state-of-the-art fraud detection strategies leveraging graph-based learning and feature engineering as advocated in, e.g., Van Vlasselaer et al. (2016), Tumminello et al. (2023).
- Data Scarcity and Confidentiality: Synthetic datasets circumvent privacy barriers to data sharing, facilitating open benchmarking and research reproducibility.
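As a small illustration of the label-scarcity setting, the sketch below derives partially observed, noisy labels from simulated ground truth; the investigation rate and the false-negative rate are assumed values, not the engine's actual parameterization.

```python
# Ground-truth fraud labels versus what an insurer actually observes:
# only a fraction of claims is investigated, and investigations can err.
import numpy as np

rng = np.random.default_rng(2)
n_claims = 50_000
fraud_rate = 0.015            # class imbalance, roughly 1-2%
investigation_rate = 0.10     # assumed share of claims that get investigated
false_negative_rate = 0.20    # assumed chance an investigated fraud is missed

is_fraud = rng.random(n_claims) < fraud_rate
investigated = rng.random(n_claims) < investigation_rate
detected = is_fraud & investigated & (rng.random(n_claims) > false_negative_rate)

# Observed label: 1 = confirmed fraud, 0 = investigated and cleared, -1 = unlabeled
observed = np.full(n_claims, -1, dtype=int)
observed[investigated] = 0
observed[detected] = 1

print(f"true frauds: {is_fraud.sum()}, labeled claims: {(observed != -1).sum()}, "
      f"confirmed frauds: {(observed == 1).sum()}")
```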
4. Applications, Experimentation, and Research Enablement
Applications for the synthetic data engine include:
- Benchmarking Classification Algorithms: Enables evaluation of both classical (e.g., logistic regression, random forests) and advanced (deep or relational) machine learning models under known ground truth; a minimal benchmarking sketch follows this list.
- Method Comparison Under Controlled Variation: By varying the strength or inclusion of network features, claim severity (e.g., exaggeration as fraud proxy), or investigation intensity, the engine supports stress-testing of analytic pipelines.
- Study of Semi- and Unsupervised Methods: Simulated partial labeling and controlled label noise reproduce practical challenges analogous to those of real fraud investigation, making the engine suitable for research on semi-supervised, self-training, or active learning frameworks.
- Ablation and Impact Analysis: Researchers can ablate or amplify covariates, network links, or engineered feature effect sizes to directly quantify their importance in model performance.
- Exploring Cost-Sensitive and Practical Impacts: By simulating realistic investigation errors and business rules, utility analyses can be conducted on the value or risk of alternative fraud tackling strategies.
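A minimal benchmarking sketch under class imbalance is shown below; for self-containment it uses a scikit-learn stand-in for the engine's output, and the choice of average precision simply reflects the rare-event setting rather than any prescribed evaluation protocol.

```python
# Benchmarking sketch: fit a baseline classifier on imbalanced data with known
# ground truth and evaluate with average precision (area under the PR curve).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Stand-in for an engine-generated dataset with ~1.5% positives.
X, y = make_classification(n_samples=20_000, n_features=15, n_informative=6,
                           weights=[0.985], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]
print(f"average precision: {average_precision_score(y_te, scores):.3f}")
```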
5. Technical Implementation and Key Formulae
The engine relies on well-established statistical and network-theoretic approaches:
| Model/Algorithm | Formula/Description | Usage |
|---|---|---|
| Claim Frequency | Poisson GLM, log link: $\log \lambda_i = \log e_i + \mathbf{x}_i^{\top}\boldsymbol{\beta}$ | Claim count per contract |
| Claim Severity | Gamma GLM, log link: $\log \mathbb{E}[Y_{ij}] = \mathbf{x}_i^{\top}\boldsymbol{\gamma}$ | Simulates claim amount |
| Fraud Label GLM | Logistic regression: $\operatorname{logit}(p_j) = \mathbf{z}_j^{\top}\boldsymbol{\delta}$ on traditional and network features | Fraud probability per claim |
| Multinomial Logistic | $P(Y = k \mid \mathbf{x}) = \exp(\mathbf{x}^{\top}\boldsymbol{\beta}_k) / \sum_{l}\exp(\mathbf{x}^{\top}\boldsymbol{\beta}_l)$ | Simulating multi-level features |
| BiRank Propagation | $\mathbf{p}^{(k+1)} = \alpha S^{\top}\mathbf{c}^{(k)} + (1-\alpha)\mathbf{p}^{0}$, $\mathbf{c}^{(k+1)} = \alpha S\,\mathbf{p}^{(k+1)} + (1-\alpha)\mathbf{c}^{0}$ | Fraud score feature in network |
| Normalization | $z = (x - \mu)/\sigma$ or $x' = (x - x_{\min})/(x_{\max} - x_{\min})$ | Comparable continuous covariates |
This tight integration of statistical and network simulation, with explicitly documented formulas and parameters, supports a diverse array of downstream analytic strategies; a BiRank propagation sketch follows below.
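The BiRank propagation from the table can be sketched in a few lines; the toy bipartite incidence matrix, the damping factor, and the query vector below are assumptions for illustration only.

```python
# BiRank-style propagation on a toy claims-parties bipartite graph.
# Rows of W are claims, columns are parties; c0 flags known fraudulent claims.
import numpy as np

W = np.array([[1, 0, 0],    # claim 0 involves party 0
              [1, 1, 0],    # claim 1 involves parties 0 and 1
              [0, 1, 0],    # claim 2 involves party 1
              [0, 0, 1]],   # claim 3 involves party 2
             dtype=float)
alpha = 0.85                # assumed damping factor

# Symmetric normalization: S = Dc^{-1/2} W Dp^{-1/2}
dc = W.sum(axis=1)          # claim degrees
dp = W.sum(axis=0)          # party degrees
S = W / np.sqrt(np.outer(dc, dp))

c0 = np.array([1.0, 0.0, 0.0, 0.0])   # query vector: claim 0 is known fraud
p0 = np.zeros(W.shape[1])
c, p = c0.copy(), p0.copy()

for _ in range(100):        # fixed-point iteration, enough for this toy graph
    p = alpha * S.T @ c + (1 - alpha) * p0
    c = alpha * S @ p + (1 - alpha) * c0

print("claim fraud scores:", np.round(c, 3))
print("party fraud scores:", np.round(p, 3))
```

Claims that share parties with the flagged claim end up with higher scores, which is exactly the signal the engineered BiRank feature is meant to carry.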
6. Limitations, Interpretation, and Research Context
While the engine supports high realism with user control, the fidelity to real-world patterning is constrained by the accuracy of chosen models, dependency structures, and domain expertise. Mis-specification of distributions, covariate relationships, or business rules will propagate into the synthetic dataset. The approach relies on prior findings (Oskarsdóttir et al. 2022; Van Vlasselaer et al. 2016; Tumminello et al. 2023) regarding the efficacy of relational/network features for fraud detection.
Furthermore, the generated data is neither intended to be directly substitutable for real cases in operational systems nor to identify actual fraudsters; its intended use is methodological development, reproducibility benchmarking, and experimental analysis where the ground truth and the data-generating process must be fully known and controlled.
7. Significance and Future Directions
This synthetic data engine represents a comprehensive simulation environment bridging individual, contract, claim, and relational data generation for fraud research. It enables evaluation of model robustness, comparison of feature engineering strategies, and exploration of semi-supervised settings under realistic but shareable conditions. Future development may include extending the complexity of network evolution, introducing multi-period simulation, or integrating more granular claim timing effects. As the field moves toward graph learning and network-based analytics, such engines will continue to serve as foundational resources for methodological and applied research in financial crime analytics and insurance fraud detection (Campo et al., 2023).