Autoencoder-PRC-RF Model for Anomaly Detection
- The paper introduces a hybrid model that combines autoencoder filtering with PRC-RF classification to enhance anomaly detection in high-dimensional, imbalanced datasets.
- It employs autoencoder-based feature compression to remove noise and redundant information, maintaining data integrity for subsequent analysis.
- Precision-recall optimized decision trees in the random forest boost minority class detection in critical applications like fraud, network intrusion, and medical diagnostics.
The Autoencoder-PRC-RF Model is a hybrid ensemble methodology designed to address the dual challenges of extreme class imbalance and high dimensionality, particularly in anomaly detection tasks. By integrating unsupervised autoencoders for feature compression and denoising with ensembles of Precision-Recall Curve (PRC) Random Forests for supervised classification, the framework aims to enhance accuracy, interpretability, and robustness over previous approaches in domains such as fraud detection, network intrusion, and medical diagnostics (Miao et al., 6 Sep 2025).
1. Model Architecture and Workflow
The Autoencoder-PRC-RF architecture is explicitly structured as a two-stage ensemble:
- Autoencoder Filtering:
- An autoencoder, comprising an encoder $f_\phi$ and a decoder $g_\theta$, is trained to minimize the reconstruction loss, typically the mean squared error $\mathcal{L}(x) = \lVert x - g_\theta(f_\phi(x)) \rVert^2$.
- The encoder compresses high-dimensional input into a lower-dimensional latent representation. The decoder reconstructs the input from this latent code.
- During deployment, the reconstruction error is used as an anomaly score. Observations with reconstruction errors above a cross-validated threshold are considered anomalous and filtered out.
- PRC Random Forest Classification:
- The filtered (or denoised) data is fed into the PRC Random Forest (PRC-RF), an ensemble of decision trees that select split features based on area under the precision-recall curve (AUPRC).
- For each node, candidate features are evaluated; AUPRC is calculated via trapezoidal approximation:

$$\mathrm{AUPRC} = \sum_{k=1}^{K} (R_k - R_{k-1}) \cdot \frac{P_k + P_{k-1}}{2},$$

where $R_k$ and $P_k$ denote recall and precision at each possible split threshold.
- Each tree aggregates terminal-node class probabilities to score an input $x$.
- Trees are built on bootstrap samples using randomly selected feature subsets, similar to classical random forest bagging.
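The trapezoidal AUPRC computation can be sketched in a few lines of NumPy. `auprc_trapezoid` is an illustrative helper (not from the paper) that sweeps thresholds over the sorted anomaly scores and, for simplicity, assumes at least one positive label and ignores score ties:

```python
import numpy as np

def auprc_trapezoid(y_true, scores):
    """Area under the precision-recall curve via the trapezoidal rule.

    y_true: binary labels (1 = anomaly/minority); scores: anomaly scores.
    Precision and recall are computed at each cut of the descending
    score ranking, and the area is accumulated pairwise.
    """
    order = np.argsort(-np.asarray(scores))   # descending score order
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)                         # true positives at each cut
    fp = np.cumsum(1 - y)                     # false positives at each cut
    recall = tp / y.sum()
    precision = tp / (tp + fp)
    # Trapezoidal sum over consecutive (recall, precision) points,
    # starting from recall = 0.
    r = np.concatenate(([0.0], recall))
    p = np.concatenate(([precision[0]], precision))
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2))
```

A perfect ranking (all positives scored above all negatives) yields an AUPRC of 1.0, while a random ranking tends toward the positive-class prevalence.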
The complete algorithm follows the sequence:
- Train the autoencoder and use it to preprocess the raw input
- Filter observations whose reconstruction error exceeds the anomaly threshold
- Train the PRC-RF ensemble on the cleaned dataset
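The three-step sequence can be sketched end to end. As a stand-in for the paper's trained autoencoder, the example below uses a linear (PCA-style) encoder/decoder, and a fixed error quantile in place of the cross-validated threshold; `pca_autoencoder_filter` is a hypothetical name:

```python
import numpy as np

def pca_autoencoder_filter(X, latent_dim=2, error_quantile=0.95):
    """Filter anomalies via reconstruction error of a linear autoencoder.

    A PCA projection serves as a stand-in for a trained autoencoder:
    observations whose reconstruction error exceeds the chosen quantile
    (a stand-in for the cross-validated threshold) are flagged as
    anomalous and removed.
    """
    mu = X.mean(axis=0)
    Xc = X - mu
    # Encoder/decoder weights from the top `latent_dim` right singular vectors.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:latent_dim].T                 # encoder weights (d x latent_dim)
    Z = Xc @ W                            # latent representation
    X_hat = Z @ W.T + mu                  # reconstruction
    errors = np.sum((X - X_hat) ** 2, axis=1)
    threshold = np.quantile(errors, error_quantile)
    keep = errors <= threshold
    return X[keep], keep, errors

# The cleaned data returned here would then be used to train the PRC-RF ensemble.
```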
2. Functionality and Design Principles
The model’s hybrid integration serves two fundamental purposes:
- Mitigating Class Imbalance: PRC trees utilize precision-recall metrics (AUPRC and F1-score) at every split, focusing decision boundaries on the minority class. The baseline AUPRC is set explicitly to the positive-class prevalence $n_{+}/n$, directly quantifying the imbalance.
- Combating High Dimensionality: The autoencoder’s latent compression eliminates redundant and noisy features, making subsequent tree splits more stable and generalizable.
This design ensures that anomaly detection is both sensitive and scalable, maintaining high precision and recall for rare classes while handling large feature sets.
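As a quick illustration of the imbalance baseline: the AUPRC of a no-skill classifier equals the positive-class prevalence, so a 1%-minority dataset has a baseline of 0.01. The helper name below is purely illustrative:

```python
import numpy as np

def baseline_auprc(y):
    """Baseline AUPRC of a no-skill classifier: the positive-class prevalence."""
    return float(np.mean(np.asarray(y)))
```

Any useful PRC-based split criterion must therefore lift AUPRC above this prevalence floor rather than above the 0.5 baseline familiar from ROC analysis.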
3. Performance Metrics
Model performance is assessed using the following metrics, fundamental for imbalanced problems:
| Metric | Definition | Role |
|---|---|---|
| Recall | $TP/(TP+FN)$ | Sensitivity |
| Precision | $TP/(TP+FP)$ | Fraction correct |
| Specificity | $TN/(TN+FP)$ | Non-anomaly rate |
| Accuracy | $(TP+TN)/(TP+TN+FP+FN)$ | Overall correctness |
| F1 Score | $2 \cdot \mathrm{Precision} \cdot \mathrm{Recall} / (\mathrm{Precision} + \mathrm{Recall})$ | Balance metric |
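These metrics follow directly from the confusion-matrix counts; a minimal sketch (the helper name is illustrative):

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Compute the standard imbalanced-classification metrics (1 = anomaly)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision,
            "specificity": specificity, "accuracy": accuracy, "f1": f1}
```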
Experimental comparisons show that the Autoencoder-PRC-RF framework generally achieves higher F1 scores and accuracy than baseline PRC-RF models. This suggests that preprocessing with autoencoders improves anomaly identification and correct classification, especially for the minority class (Miao et al., 6 Sep 2025).
4. Experimental Protocols
Key attributes of evaluation include:
- Datasets: The model is tested on benchmark datasets from UCI and Kaggle covering credit default, financial distress, and breast cancer diagnosis, each with a different sample size, feature dimensionality, and minority-class proportion.
- Training: Autoencoder hyperparameters (layer structure, latent dimension, loss) are tuned per dataset. Reconstruction errors are thresholded via cross-validation to filter anomalies. The PRC-RF is then trained on the cleaned set; each ensemble tree is grown on a bootstrap sample with a random feature subset, with splits optimized using AUPRC and F1.
- Validation: Results are averaged over multiple random splits (e.g., 100 repetitions), ensuring statistical reliability of reported accuracy, specificity, recall, precision, and F1 metrics.
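Cross-validated threshold selection on the reconstruction error can be sketched as a grid search over candidate quantiles, scoring each by mean F1 across folds. The scheme below is a plausible reading of the protocol, not the paper's exact procedure, and the helper name is hypothetical:

```python
import numpy as np

def select_threshold_cv(errors, y, quantiles=np.linspace(0.80, 0.99, 20),
                        n_folds=5, seed=0):
    """Pick the reconstruction-error threshold (as a quantile) that
    maximizes mean F1 across validation folds."""
    errors, y = np.asarray(errors), np.asarray(y)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(errors))
    folds = np.array_split(idx, n_folds)
    best_q, best_f1 = None, -1.0
    for q in quantiles:
        f1s = []
        for val in folds:
            train = np.setdiff1d(idx, val)
            t = np.quantile(errors[train], q)      # threshold fit on train fold
            pred = (errors[val] > t).astype(int)   # flag high-error points
            tp = np.sum((y[val] == 1) & (pred == 1))
            fp = np.sum((y[val] == 0) & (pred == 1))
            fn = np.sum((y[val] == 1) & (pred == 0))
            denom = 2 * tp + fp + fn
            # Convention: no positives and none predicted counts as perfect.
            f1s.append(2 * tp / denom if denom > 0 else 1.0)
        mean_f1 = float(np.mean(f1s))
        if mean_f1 > best_f1:
            best_q, best_f1 = q, mean_f1
    return best_q, best_f1
```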
5. Applications and Use Cases
The Autoencoder-PRC-RF model is relevant in high-stakes environments characterized by imbalanced and high-dimensional data:
- Financial Fraud Detection: Improved isolation of rare fraudulent events by leveraging PRC-focused splits and denoising autoencoder preprocessing.
- Network Security and Intrusion Detection: Enhanced sensitivity to aberrant activity, suitable for real-time and streaming anomaly monitoring.
- Medical Diagnostics: Reliable positive case identification where miss rates are critical, e.g., breast cancer prediction, leveraging interpretability of PRC decision pathways.
The interpretability of PRC tree splits, which are based on AUPRC and F1-score, provides actionable explanations for domain experts, crucial in regulated and clinical contexts.
6. Comparison and Context
Relative to prior methods:
- PRC-RF alone: Does not address noise or redundant feature contamination in high-dimensional data. Adding the autoencoder filter enhances both accuracy and minority class detection.
- Classical random forests: Use impurity-based splitting (Gini, information gain), which may be suboptimal under extreme imbalance. PRC-based splitting directly optimizes for minority class sensitivity.
- Deep neural approaches: Can struggle with interpretability and may require larger sample sizes. The hybrid approach retains the tree model's clarity and the autoencoder's representational power.
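The weakness of impurity-based splitting under extreme imbalance can be seen numerically: with 1% positives, the Gini impurity of the parent node is only 2 · 0.01 · 0.99 ≈ 0.0198, so even a split that perfectly isolates the minority class yields a tiny absolute impurity decrease. A minimal sketch:

```python
import numpy as np

def gini(y):
    """Gini impurity of a binary label array."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    return 2 * p * (1 - p)

def gini_gain(y, mask):
    """Impurity decrease from splitting y by a boolean mask."""
    n = len(y)
    left, right = y[mask], y[~mask]
    return gini(y) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

# With 1% positives, even the split that isolates every positive sample
# yields an impurity decrease capped at the parent impurity (~0.0198),
# so impurity criteria provide weak signal under extreme imbalance.
y = np.array([1] * 10 + [0] * 990)
perfect = np.array([True] * 10 + [False] * 990)
```

PRC-based splitting sidesteps this ceiling by scoring candidate splits against the prevalence baseline rather than against the already-small parent impurity.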
Empirical results demonstrate that Autoencoder-PRC-RF consistently surpasses baseline PRC-RF and impurity-based forest models in F1 scores and overall detection accuracy on representative datasets (Miao et al., 6 Sep 2025).
7. Prospects and Limitations
The framework’s modularity, scalability, and interpretability position it well for deployment in operational anomaly detection settings. A plausible implication is that further refinement of autoencoder architectures or adaptive threshold selection could yield even greater sensitivity to subtle anomalies. However, as with all two-stage models, performance may depend on judicious tuning of the autoencoder’s reconstruction threshold: too aggressive a filter risks discarding minority-class signal, while too lenient a threshold retains contaminating outliers.
The ensemble structure is robust to parameter choices such as number of trees and feature subsets, but care is necessary to prevent overfitting during autoencoder training, especially given class imbalance in the raw data.
In summary, the Autoencoder-PRC-RF Model embodies a technically rigorous hybrid paradigm: it combines autoencoder-based unsupervised denoising with precision-recall optimized tree ensembles to deliver accurate, interpretable, and scalable anomaly detection in high-dimensional, imbalanced data domains (Miao et al., 6 Sep 2025).