Community-Level Anomaly Detection for Anti-Money Laundering (1910.11313v1)

Published 24 Oct 2019 in cs.LG, cs.CR, and stat.ML

Abstract: Anomaly detection in networks often boils down to identifying an underlying graph structure on which the abnormal occurrence rests. Financial fraud schemes are one such example, where more or less intricate schemes are employed in order to elude transaction security protocols. We investigate the problem of learning graph structure representations using adaptations of dictionary learning aimed at encoding connectivity patterns. In particular, we adapt dictionary learning strategies to the specificity of network topologies and propose new methods that impose Laplacian structure on the dictionaries themselves. In one adaptation we focus on classifying topologies by working directly on the graph Laplacian and cast the learning problem to accommodate its 2D structure. We tackle the same problem by learning dictionaries which consist of vectorized atomic Laplacians, and provide a block coordinate descent scheme to solve the new dictionary learning formulation. Imposing Laplacian structure on the dictionaries is also proposed in an adaptation of the Single Block Orthogonal learning method. Results on synthetic graph datasets comprising different graph topologies confirm the potential of dictionaries to directly represent graph structure information.

Citations (5)

Summary

  • The paper introduces three innovative dictionary learning methods that embed graph Laplacian constraints to improve detection of anomalous patterns in financial networks.
  • It employs separable dictionary learning and orthonormal block strategies to reduce computational complexity while maintaining high classification accuracy.
  • Experimental results show accuracies over 90%, demonstrating the potential of these methods for effectively identifying suspicious transaction structures.

This paper introduces three novel dictionary learning (DL) methods designed to improve the detection of anomalous network structures, with a specific focus on anti-money laundering (AML) applications. The core idea is to incorporate graph structural information directly into the dictionary learning process, thereby enhancing the ability to distinguish between normal and abnormal connectivity patterns.

Financial transactions can be modeled as graphs where nodes are entities and edges are transactions. Money laundering schemes often create specific, sometimes complex, subgraph structures. The paper aims to identify these patterns by learning representations of graph structures, particularly Laplacians.
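To make the modeling step concrete, here is a minimal numpy sketch (the function name and toy edge list are our own illustration, not the paper's code) that builds the combinatorial Laplacian $\bm{L} = \text{Deg} - \bm{A}$ of a small transaction graph:

```python
import numpy as np

def laplacian_from_edges(num_nodes, edges):
    """Combinatorial graph Laplacian L = Deg - A of an undirected,
    unweighted transaction graph; nodes are 0-based indices."""
    A = np.zeros((num_nodes, num_nodes))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0           # symmetric adjacency
    return np.diag(A.sum(axis=1)) - A     # degree matrix minus adjacency

# Toy example: a 4-entity cycle of transactions
L = laplacian_from_edges(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
assert np.allclose(L @ np.ones(4), 0)     # Laplacian rows sum to zero
```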

The authors propose three distinct approaches:

1. Laplacian-structured Dictionary Learning

This method directly learns dictionaries whose atoms are constrained to be graph Laplacians. The input signals $\bm{Y}$ are vectorized Laplacian matrices. The goal is to find a dictionary $\bm{D}$ and sparse representations $\bm{X}$ such that $\bm{Y} \approx \bm{D}\bm{X}$. The key innovation is that each dictionary atom $\bm{D}_i$ must correspond to the vectorized form of a Laplacian matrix $\bm{L}^{(i)}$.

Problem Formulation:

The optimization problem is:

$$\min_{\bm{D},\bm{X},\bm{L}} \quad \|\bm{Y} - \bm{D}\bm{X}\|_F^2 + \frac{\rho}{2}\sum_{i=1}^{n}\left( \text{Tr}(\bm{L}^{(i)}) - m\right)^2$$

Subject to:

  • $\bm{D}_i = \text{vec}(\bm{L}^{(i)})$, for $1 \le i \le n$ (each atom is a vectorized Laplacian)
  • $\bm{L}^{(i)} \mathbf{1} = \mathbf{0}$ (rows sum to zero)
  • $\bm{L}^{(i)}_{kj} \le 0$, for $k \neq j$ (non-positive off-diagonal elements)
  • $\|\bm{X}_i\|_0 \le s$ (sparsity constraint on representations)

The term $\frac{\rho}{2}\sum_{i=1}^{n}\left( \text{Tr}(\bm{L}^{(i)}) - m\right)^2$ is a penalty that enforces the trace constraint $\sum_j \bm{L}^{(i)}_{jj} = m$ (which avoids trivial solutions) and makes the problem more amenable to block coordinate descent by relaxing a coupling constraint.
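For intuition, a naive way to keep a candidate atom feasible under these constraints is to clip positive off-diagonals, reset the diagonal, and rescale the trace. The sketch below is our own heuristic for illustration, not the exact projection used by the paper's solver:

```python
import numpy as np

def project_to_laplacian(M, trace_target=None):
    """Heuristic projection of a square matrix onto the constraint set
    above: non-positive off-diagonals, zero row sums, optional trace."""
    L = np.minimum(M - np.diag(np.diag(M)), 0.0)  # off-diagonals <= 0
    np.fill_diagonal(L, -L.sum(axis=1))           # rows now sum to zero
    if trace_target is not None and L.trace() > 0:
        L *= trace_target / L.trace()             # enforce Tr(L) = m
    return L
```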

Implementation - Optimization Strategy:

The problem is solved using an Alternating Minimization (AM) scheme:

  1. Compute Sparse Representations $\bm{X}$ (Sparse Coding): With $\bm{D}$ fixed, solve for $\bm{X}$. This is a standard $\ell_0$-constrained sparse coding problem, which can be addressed using algorithms like Orthogonal Matching Pursuit (OMP).

    $$\bm{X}^{k+1} = \arg\min_{\bm{X}} \quad \|\bm{Y} - \bm{D}^k \bm{X}\|_F^2 \quad \text{s.t.} \quad \|\bm{X}_i\|_0 \le s$$

  2. Compute Dictionary $\bm{D}$ (Dictionary Update): With $\bm{X}$ fixed, solve for $\bm{D}$ (and implicitly the $\bm{L}^{(i)}$). This subproblem is convex.

    $$\bm{D}^{k} = \arg\min_{\bm{D},\bm{L}} \quad \|\bm{Y} - \bm{D}\bm{X}^k\|_F^2 + \frac{\rho}{2}\sum_{i=1}^{n}\left(\text{Tr}(\bm{L}^{(i)}) - m\right)^2$$

    Subject to the Laplacian constraints (zero row sums, non-positive off-diagonals). Due to the problem scale ($\bm{Y}$ is $m^2 \times N$), a Block Coordinate Gradient Descent (BCGD) algorithm is proposed. It iteratively updates blocks (rows) of each atom $\bm{D}_i$.

    • A random atom $\bm{D}_i$ and a random $m$-sized block (representing a row of $\bm{L}^{(i)}$) are chosen.
    • A projected coordinate gradient descent step is performed. The step size uses an estimated Lipschitz constant $L_i = \|\bm{X}^i\|_F^2 + \rho$.
    • The projection is onto the set $\mathcal{X}_{\ell} = \{d \in \mathbb{R}^m : \mathbf{1}^\top d = 0,\; d_{\ell} \ge 0,\; d_j \le 0 \;\forall j \neq \ell\}$, which can be computed efficiently (e.g., using Kiwiel's algorithm in $\mathcal{O}(m \log m)$).

The per-iteration complexity of the BCGD dictionary update is roughly $\mathcal{O}(mn + m \log m)$ when precomputed terms are reused.
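Putting the two steps together, the following is a rough, self-contained sketch of one AM pass, assuming the signals are the columns of $\bm{Y} \in \mathbb{R}^{m^2 \times N}$. The fixed step size lr and the clip-based atom projection are simplifications of the paper's BCGD step with Kiwiel's exact projection:

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def am_pass(Y, D, s, lr=1e-3):
    """One alternating-minimization pass: OMP sparse coding, then a
    projected gradient step on the dictionary (simplified BCGD)."""
    m2, n = D.shape
    m = int(np.sqrt(m2))
    # Sparse coding: one s-sparse OMP problem per signal (column of Y)
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=s, fit_intercept=False)
    X = np.column_stack([omp.fit(D, y).coef_ for y in Y.T])
    # Dictionary update: gradient of ||Y - D X||_F^2 with respect to D
    D = D - lr * 2.0 * (D @ X - Y) @ X.T
    # Project every atom back onto (approximate) Laplacian structure
    for i in range(n):
        Li = D[:, i].reshape(m, m)
        Li = np.minimum(Li - np.diag(np.diag(Li)), 0.0)  # off-diag <= 0
        np.fill_diagonal(Li, -Li.sum(axis=1))            # zero row sums
        D[:, i] = Li.ravel()
    return D, X
```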

2. Separable Laplacian Classification

This approach leverages the 2D structure of Laplacian matrices by using separable dictionary learning. Instead of vectorizing the $m \times m$ Laplacian signals $\bm{Y}$, they are represented as $\bm{Y} \approx \bm{D}_1 \bm{X} \bm{D}_2^\top$, where $\bm{D}_1 \in \mathbb{R}^{m \times n_1}$ and $\bm{D}_2 \in \mathbb{R}^{m \times n_2}$ are two dictionaries and $\bm{X} \in \mathbb{R}^{n_1 \times n_2}$ is the sparse representation. This is equivalent to using a full dictionary $\bm{D} = \bm{D}_2 \otimes \bm{D}_1$.
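The Kronecker equivalence is easy to verify numerically; the short check below (dimensions are arbitrary) relies on the column-major vec identity $\text{vec}(\bm{D}_1 \bm{X} \bm{D}_2^\top) = (\bm{D}_2 \otimes \bm{D}_1)\,\text{vec}(\bm{X})$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n1, n2 = 5, 7, 6
D1 = rng.standard_normal((m, n1))
D2 = rng.standard_normal((m, n2))
X = rng.standard_normal((n1, n2))

lhs = (D1 @ X @ D2.T).ravel(order="F")       # vec of the separable model
rhs = np.kron(D2, D1) @ X.ravel(order="F")   # equivalent full-dictionary model
assert np.allclose(lhs, rhs)
```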

Implementation - Classification Scheme:

  1. Training: For each class $c$, train a pair of dictionaries $(\bm{D}_1^{(c)}, \bm{D}_2^{(c)})$ using only the training signals belonging to that class.
    • Sparse coding can use 2D OMP.
    • Dictionary update can use Pairwise Approximate K-SVD (alternately updating $\bm{D}_1$ and $\bm{D}_2$).
  2. Testing: For a new test signal $\bm{Y}_{\text{test}}$:
    • Compute its sparse representation $\bm{X}^{(c)}$ using each class-specific dictionary pair $(\bm{D}_1^{(c)}, \bm{D}_2^{(c)})$.
    • Calculate the reconstruction error $\epsilon_c = \|\bm{Y}_{\text{test}} - \bm{D}_1^{(c)} \bm{X}^{(c)} (\bm{D}_2^{(c)})^\top\|_F^2$.
    • Assign the test signal to the class $c$ that yields the minimum reconstruction error.

This method benefits from reduced complexity compared to vectorizing and working with a single large dictionary.
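A sketch of the resulting classifier, assuming any 2D sparse coder (e.g., 2D-OMP) is supplied as a callable; the function name and interface are our own illustration:

```python
import numpy as np

def classify_by_reconstruction(Y_test, dict_pairs, sparse_code_2d):
    """Assign Y_test to the class whose (D1, D2) pair reconstructs it
    best; dict_pairs holds one trained (D1, D2) per class."""
    errors = []
    for D1, D2 in dict_pairs:
        X = sparse_code_2d(Y_test, D1, D2)             # class-specific code
        errors.append(np.linalg.norm(Y_test - D1 @ X @ D2.T) ** 2)
    return int(np.argmin(errors))                      # minimum-error class
```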

3. Graph Orthonormal Blocks Classification

This method adapts the Single Block Orthogonal (SBO) algorithm. SBO structures the dictionary as a union of orthonormal blocks $\bm{D} = [\bm{Q}_1, \bm{Q}_2, \ldots, \bm{Q}_L]$, where each $\bm{Q}_j$ is an $m \times m$ orthogonal matrix ($\bm{Q}_j^\top \bm{Q}_j = \bm{I}$).

Implementation - SBO Adaptation for Laplacian Classification:

  1. Initialization: For each class $c$, initialize the orthonormal blocks $\bm{Q}^{(c)}_j$ using orthogonalized versions of true Laplacian matrices characteristic of that class.
  2. Training: Perform SBO training separately for each class using its signals.
    • Representation: For a signal $\bm{y}$, find the best block $\bm{Q}_j$ and compute the sparse representation $\bm{x} = \text{SELECT}(\bm{Q}_j^\top \bm{y}, s)$ (hard thresholding, which is optimal due to orthogonality). The best block is the one that maximizes the energy of the representation coefficients, per the paper's block-allocation proposition.
    • Dictionary Update: Each block $\bm{Q}_j$ is updated using the signals it best represents, by solving an orthogonal Procrustes problem, typically via an SVD of $\bm{X}\bm{Y}^\top$.
  3. Classification: Collect all trained blocks from all classes. For a new test signal, determine which block (and thus which class) best represents it using the same energy criterion.

SBO offers computational advantages, especially in the representation stage ($\mathcal{O}(m^2)$), over methods like K-SVD whose OMP-based sparse coding step is more expensive.
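A small sketch of the SBO representation step, under the assumption that blocks is a list of $m \times m$ orthonormal matrices. Because each $\bm{Q}_j$ is orthonormal, the residual satisfies $\|\bm{y} - \bm{Q}_j\bm{x}\|^2 = \|\bm{Q}_j^\top\bm{y}\|^2 - \|\bm{x}\|^2$ for a hard-thresholded $\bm{x}$, so maximizing retained coefficient energy is the same as minimizing the residual:

```python
import numpy as np

def sbo_represent(y, blocks, s):
    """SELECT step of SBO: hard-threshold Q.T @ y to its s largest
    entries, keeping the block that retains the most energy."""
    best = None
    for j, Q in enumerate(blocks):
        c = Q.T @ y                          # exact coefficients (Q orthonormal)
        idx = np.argsort(np.abs(c))[-s:]     # positions of the s largest |c_k|
        energy = float(np.sum(c[idx] ** 2))  # energy kept after thresholding
        if best is None or energy > best[0]:
            x = np.zeros_like(c)
            x[idx] = c[idx]
            best = (energy, j, x)
    return best[1], best[2]                  # chosen block index, sparse code
```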

Experiments and Results

Two synthetic experiments were conducted, focusing on anomaly detection scenarios where anomalies are rare.

Experiment 1: Anomalous Graph Laplacians

  • Data: Normal graphs drawn from a stochastic block model (50 nodes) and anomalous graphs containing an implanted 10-node Watts-Strogatz subgraph with circular structure. The Laplacians of these graphs were used as signals: 5000 normal and 500 anomalous samples.
  • Methods Compared:
    • L-structured DL (proposed)
    • Separable L-Class (proposed)
    • Standard DL Classification (SRC-like: separate dictionaries per class, vectorized Laplacians)
    • One-Class SVM (OC-SVM)
  • Results (Classification Accuracy %):

    Method              Accuracy (%)
    -----------------   ------------
    L-structured DL     91.31
    Separable L-Class   90.64
    DL Classification   89.55
    OC-SVM              81.1

    The proposed methods that incorporate Laplacian structure outperformed the standard DL and OC-SVM.

Experiment 2: Anomalous Signals on Graphs

  • Data: Signals generated to lie on graphs with different topologies (same as Experiment 1). The true dictionary is $\bm{D} = (\lambda \bm{I} + \bm{L})^{-1} \bm{D}_0$, ensuring the signals adhere to the graph structure $\bm{L}$ (see the generation sketch after this list). 6000 normal and 600 anomalous signals.
  • Methods Compared:
    • SBO adaptation (proposed, initialized with orthogonalized true Laplacians)
    • Standard DL Classification (SRC-like)
  • Results (Classification Accuracy %):
    • SBO adaptation: 99.70% (with 48 bases per class)
    • DL Classification: 99.77%
    • The SBO adaptation achieved comparable performance to standard DL but with significant computational advantages.
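For reference, a sketch of this signal-generation model, with a simple path-graph Laplacian standing in for the paper's stochastic-block-model and Watts-Strogatz topologies (all dimensions and $\lambda$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, s, lam = 50, 100, 4, 1.0
A = np.diag(np.ones(m - 1), 1)
A = A + A.T                                  # path-graph adjacency
L = np.diag(A.sum(axis=1)) - A               # its Laplacian
D0 = rng.standard_normal((m, n))             # unstructured base dictionary
D = np.linalg.inv(lam * np.eye(m) + L) @ D0  # graph-adapted true dictionary
x = np.zeros(n)
supp = rng.choice(n, size=s, replace=False)  # random s-sparse support
x[supp] = rng.standard_normal(s)
y = D @ x                                    # one synthetic on-graph signal
```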

Conclusions

The paper successfully demonstrates that incorporating graph structural information into dictionary learning algorithms improves performance in network classification tasks, particularly for anomaly detection.

  • Directly imposing Laplacian structure on dictionary atoms (L-structured DL) or exploiting the 2D nature of Laplacians (Separable L-Class) yielded better results than structure-agnostic DL and OC-SVM when signals are graph Laplacians.
  • Adapting SBO with Laplacian-initialized blocks for classifying signals residing on graphs showed performance comparable to standard DL but with lower computational complexity.

These methods show promise for AML by identifying unusual transaction patterns represented as anomalous graph structures. The focus on synthetic data means further validation on real-world financial transaction datasets would be necessary to fully assess their practical utility in AML.