Corporate Fraud Detection in Rich-yet-Noisy Financial Graph (2502.19305v2)

Published 26 Feb 2025 in cs.LG, cs.AI, q-fin.RM, and q-fin.ST

Abstract: Corporate fraud detection aims to automatically recognize companies that conduct wrongful activities such as fraudulent financial statements or illegal insider trading. Previous learning-based methods fail to effectively integrate rich interactions in the company network. To close this gap, we collect 18-year financial records in China to form three graph datasets with fraud labels. We analyze the characteristics of the financial graphs, highlighting two pronounced issues: (1) information overload: the dominance of (noisy) non-company nodes over company nodes hinders the message-passing process in Graph Convolution Networks (GCN); and (2) hidden fraud: there exists a large percentage of possible undetected violations in the collected data. The hidden fraud problem will introduce noisy labels in the training dataset and compromise fraud detection results. To handle such challenges, we propose a novel graph-based method, namely, Knowledge-enhanced GCN with Robust Two-stage Learning (${\rm KeGCN}_{R}$), which leverages Knowledge Graph Embeddings to mitigate the information overload and effectively learns rich representations. The proposed model adopts a two-stage learning method to enhance robustness against hidden frauds. Extensive experimental results not only confirm the importance of interactions but also show the superiority of ${\rm KeGCN}_{R}$ over a number of strong baselines in terms of fraud detection effectiveness and robustness.

Summary

The paper presents a novel model that integrates GNNs with knowledge graph embeddings to effectively mitigate information overload in fraud detection.
The paper employs a Multi-Path Weighted Convolution Network to enhance node representation and a two-stage learning scheme to improve resilience against label noise from hidden fraud.
The paper demonstrates superior detection performance with higher AUC scores over traditional and state-of-the-art methods on extensive Chinese stock market datasets.

Corporate Fraud Detection in Rich-yet-Noisy Financial Graph

The paper "Corporate Fraud Detection in Rich-yet-Noisy Financial Graph" (2502.19305) addresses the challenge of detecting corporate fraud in large-scale, complex financial graphs. By leveraging Graph Neural Networks (GNN) and Knowledge Graph Embeddings (KGE), the proposed model focuses on mitigating issues of information overload and hidden fraud inherent in the noisy datasets.

Introduction

The paper identifies two critical challenges in corporate fraud detection using financial graphs: information overload and hidden fraud. Information overload occurs due to the dominance of non-company nodes in the graph, creating noise that complicates message passing in GNNs. Hidden fraud refers to undetected fraudulent activities that introduce label noise and affect model accuracy.

The proposed model, Knowledge-enhanced GCN with Robust Two-stage Learning, incorporates KGEs to distill relevant information from support nodes, mitigating information overload. Additionally, it employs a two-stage robust learning method to enhance resilience against label noise from hidden frauds.

Methodology

Knowledge-Enhanced GCN

The model utilizes KGE to convert financial transactions and relationships into a feature space where GNN can effectively process them without succumbing to information overload. The knowledge graph is constructed with important financial entities and relationships, such as company-to-transaction links.
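To make the idea of a knowledge graph embedding concrete, the sketch below scores a (head, relation, tail) triple in the style of TransE. This is an illustration only: the summary does not say which KGE method the paper uses, and the embedding values and the names `company`, `involved_in`, `transaction`, and `unrelated` are hypothetical.

```python
import numpy as np

def transe_score(head, relation, tail):
    """TransE-style plausibility: negative L2 distance between head + relation and tail."""
    return -np.linalg.norm(head + relation - tail)

# Hypothetical 3-dimensional embeddings, chosen by hand for illustration.
company = np.array([1.0, 0.0, 0.0])
involved_in = np.array([0.0, 1.0, 0.0])
transaction = np.array([1.0, 1.0, 0.0])   # exactly company + involved_in
unrelated = np.array([-1.0, 0.0, 2.0])

good = transe_score(company, involved_in, transaction)  # distance 0, most plausible
bad = transe_score(company, involved_in, unrelated)     # distance 3, less plausible
```

A triple that fits the learned translation scores close to zero, while an implausible one scores lower. Embeddings trained this way can summarize a company's links to non-company nodes as a dense vector, so the GNN does not need to pass messages through every such node.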

Multi-Path Weighted Convolution Layers

To process the graph data, a Multi-Path Weighted Convolution Network (MW-GCN) is employed. This network accounts for the varying importance of different relational paths in the graph, enhancing node representation quality.
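As a rough sketch of what weighting relational paths can look like, the example below mixes one mean-aggregation adjacency per path using softmax weights and then applies a linear map and ReLU. It is a generic formulation written under assumptions, not the paper's exact layer; `row_normalize`, `multi_path_conv`, the two path types, and all numeric values are hypothetical.

```python
import numpy as np

def row_normalize(adj):
    """Divide each row by its degree so aggregation is a neighbor mean."""
    deg = adj.sum(axis=1, keepdims=True)
    return adj / np.maximum(deg, 1.0)  # isolated nodes keep an all-zero row

def multi_path_conv(features, path_adjs, path_logits, weight):
    """One layer: softmax-weighted mix of per-path aggregations, then linear + ReLU."""
    alphas = np.exp(path_logits - path_logits.max())
    alphas = alphas / alphas.sum()
    mixed = sum(a * row_normalize(adj) for a, adj in zip(alphas, path_adjs))
    return np.maximum(mixed @ features @ weight, 0.0), alphas

# Three companies with 2-dimensional features (hypothetical values).
features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# Path A (assumed: shared auditor) links companies 0 and 1.
adj_a = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
# Path B (assumed: related transaction) links companies 1 and 2.
adj_b = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])

out, alphas = multi_path_conv(features, [adj_a, adj_b], np.zeros(2), np.eye(2))
```

With equal logits both paths receive weight 0.5. In a trained model the logits would be learned, letting informative paths dominate the aggregation.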

Robust Two-Stage Learning

The model's robustness to hidden fraud is achieved through a two-stage learning framework. In the first stage, a transition model estimates the likelihood of hidden fraud, capitalizing on instance and neighborhood dependencies. In the second stage, the main model is optimized with a corrected loss function that accounts for the label noise introduced by hidden fraud.
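One common way to correct a loss for this kind of one-sided label noise is forward correction, sketched below. It assumes that a truly fraudulent company is recorded as clean with probability `hidden_rate` and that clean companies are never mislabeled. In the paper this probability would come from the first-stage transition model; here it is simply passed in, and the function name and values are hypothetical.

```python
import math

def corrected_bce(p_fraud, observed_label, hidden_rate, eps=1e-12):
    """Binary cross-entropy on the observed label after forward noise correction.

    p_fraud:      model's probability that the company is truly fraudulent
    hidden_rate:  assumed probability that a true fraud is recorded as clean
    """
    # Probability of *observing* a fraud label = P(true fraud) * P(it was detected).
    q = (1.0 - hidden_rate) * p_fraud
    q = min(max(q, eps), 1.0 - eps)  # keep the logarithm finite
    if observed_label == 1:
        return -math.log(q)
    return -math.log(1.0 - q)

# A company labeled clean, but the model believes it is 80% likely to be fraudulent.
plain = corrected_bce(0.8, 0, hidden_rate=0.0)   # ordinary cross-entropy
robust = corrected_bce(0.8, 0, hidden_rate=0.5)  # half of frauds assumed hidden
```

When `hidden_rate` is zero the loss reduces to ordinary cross-entropy. A higher `hidden_rate` reduces the penalty for predicting fraud on a company labeled clean, which is the behavior wanted when that label may hide an undetected violation.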

Experimental Results

Dataset

The model is evaluated on three datasets from the Chinese stock market, one each for the Main Board Market (MBM), the Growth Enterprise Market (GEM), and the Small and Medium Enterprise Board Market (SME). Each dataset draws on 18 years of historical financial records and combines company attributes with relational data.

Performance Comparison

The paper reports that the proposed model outperforms traditional methods such as XGBoost and DNN, as well as state-of-the-art GNN models like DAGNN and FastGTN. It consistently achieves higher Area Under the Curve (AUC) scores, which indicates that it ranks fraudulent companies above non-fraudulent ones more reliably than the baselines.
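For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen fraudulent company receives a higher score than a randomly chosen non-fraudulent one. The sketch below computes it directly from that definition. The labels and scores are made-up illustration values, not results from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (fraud, non-fraud) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = fraud) and model scores for five companies.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
auc = roc_auc(labels, scores)  # 5 of 6 fraud/non-fraud pairs are ordered correctly
```

Because AUC depends only on ranking, it is insensitive to the classification threshold, which makes it a common choice for imbalanced problems such as fraud detection.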

Figure 1: Relations (e.g. illegal transactions) are essential for corporate fraud detection. When a violation goes undetected in the historical record, it is referred to as a hidden fraud case. Such cases act as label noise that hinders the effectiveness of corporate fraud detection.

Implications and Future Work

The research demonstrates the efficacy of integrating KGE with GCNs to address the dual challenges of information overload and hidden fraud in fraud detection. The structural adaptations cater specifically to financial datasets, enabling more accurate detection in the presence of noisy data.

Future work could explore further refinements in handling class imbalance and the distinct characteristics of specific fraud schemes. Additionally, the adaptability of the model to other industries or graph-based problems represents a potential direction for extending its applicability.

Conclusion

The paper offers a solution for corporate fraud detection in complex and noisy financial graphs, improving detection performance by combining KGE with robust two-stage learning. The approach provides a foundation for tackling similar challenges in other domains involving relational data.