ggRandomForests Package
- ggRandomForests is an R package that extracts and visualizes intermediate metrics from Random Forest models, including Random Survival Forests (RSF).
- It decouples data extraction from plot rendering using ggplot2, providing interpretable data frames and customizable graphics.
- The package supports practical diagnostics such as error curves, variable importance, minimal depth, partial dependence, and ROC analyses.
The ggRandomForests package is an R software extension that enables extraction and visualization of intermediate results from Random Forest models—especially Random Survival Forests (RSF)—grown using the randomForestSRC package. Built on the ggplot2 infrastructure, it provides modular, flexible tools for quantifying, analyzing, and interpreting key properties of fitted random forest models, including error curves, variable importance, variable selection, partial dependence, variable effects, ROC curves, and survival probability estimation. The design emphasizes the separation of data extraction and plot rendering, returning analyzable data frames and customizable graphics objects suitable for publication and further analysis (Ehrlinger, 2016).
1. Motivation, Framework, and Objectives
randomForestSRC implements generalizations of Random Forests (RF) for regression, classification, and time-to-event (survival) analysis, offering robust, nonparametric model fitting. While RF provides strong predictive performance, the fitted model is complex and hard to interpret, and the native post-hoc diagnostics and visualizations are limited and difficult to customize. ggRandomForests was developed to address this gap by decoupling the extraction of informative metrics (such as OOB error, variable importance, or minimal depth) from their graphical representation, and by covering the main diagnostic queries for forests trained in randomForestSRC across all supported families.
Key objectives include:
- Modularity: Extraction via “gg_” functions produces interpretable data frames (e.g., gg_error, gg_vimp, gg_partial) that can be further analyzed, filtered, or manipulated.
- Completeness: Support for compiling all significant RF/RSF summaries (error trajectory, VIMP, minimal depth, partial and variable dependence, survival and ROC curves).
- Flexibility: Output plots are ggplot2 objects, facilitating extension with custom themes, layers, and labeling by analytical need.
This modular approach enhances reproducibility by making analytic states explicit and amendable prior to visualization.
2. Package Architecture and Data Flow
The integration between randomForestSRC, ggRandomForests, and ggplot2 structures the typical usage workflow. Forest fitting and prediction employ randomForestSRC; data extraction and downstream plotting leverage the S3 API in ggRandomForests.
Functionality Partitioning
| Functional Layer | Primary Methods | Output (Class) |
|---|---|---|
| Forest Fitting | rfsrc(), var.select() | rfsrc, var.select |
| Data Extraction | gg_error(), gg_vimp(), gg_minimal_depth(), gg_partial(), gg_variable(), gg_survival(), gg_rfsrc(), gg_roc() | gg_<feature> |
| Plotting | plot.gg_error(), plot.gg_vimp(), etc. | ggplot object |
The process sequence is as follows:
- Model Fitting: Training models with `rfsrc()` using the flags `tree.err=TRUE` and `importance=TRUE` ensures compatibility with subsequent analysis.
- Data Extraction: Functions prefixed with “gg_” parse rfsrc (or post-processing) objects to yield structured summary data, such as OOB error curves (gg_error), variable importance lists (gg_vimp), or minimal depth rankings (gg_minimal_depth).
- Visualization: Plotting methods (S3 class plotters) take the output data frames and generate ggplot objects, enhancing extensibility and thematic control.
This design defines a reproducible, explicitly staged pipeline from forest fitting to sophisticated interpretive visualization (Ehrlinger, 2016).
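The three stages can be sketched on a small survival forest. This is a minimal illustration rather than the package author's own example: the `veteran` dataset (shipped with randomForestSRC), the seed, and `ntree = 200` are assumptions chosen to keep the run short.

```r
library(randomForestSRC)
library(ggRandomForests)

# Example data shipped with randomForestSRC (an assumption of this sketch)
data(veteran, package = "randomForestSRC")

set.seed(42)
# Stage 1: fit a survival forest and request variable importance
rf <- rfsrc(Surv(time, status) ~ ., data = veteran,
            ntree = 200, importance = TRUE)

# Stage 2: extract VIMP into a data frame that carries an extra S3 class
vimp_df <- gg_vimp(rf)
class(vimp_df)
head(vimp_df)

# Stage 3: plot() dispatches on that class and returns a ggplot object
p <- plot(vimp_df)
inherits(p, "ggplot")
```

Because `vimp_df` is an ordinary data frame, it can be inspected, filtered, or saved before any figure is drawn, which is the practical benefit of keeping extraction and rendering separate.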
3. Core Analytical Functions and Use Cases
Each principal aspect of random forest model inspection is addressed by one or more dedicated functions in the ggRandomForests ecosystem:
- gg_error: Extracts OOB error curves as the number of trees increases. Useful for evaluating sufficient ensemble size and for verifying stabilization of predictive error.
- gg_vimp: Captures permutation-based variable importance (VIMP), quantifying the marginal impact of each variable on predictive accuracy through OOB prediction perturbations.
- gg_minimal_depth: Summarizes minimal depth metrics from `var.select()` output, ranking variables by their average proximity to root splits and demarcating “important” variables via analytic thresholds.
- gg_minimal_vimp: Visualizes the correspondence between VIMP ranking and minimal depth ranking for comparison and cross-checking.
- gg_survival: Generates empirical survival curves (Kaplan–Meier or cumulative hazard) from data, stratified by covariates if needed.
- gg_rfsrc: Produces aggregated, forest-predicted survival curves—median and confidence intervals—per group or covariate level.
- gg_variable: Constructs variable dependence plots, mapping RF-predicted outcomes as functions of individual variables at specified time points.
- gg_partial: Computes risk-adjusted partial dependence curves, marginalizing predicted effects over empirical covariate distributions.
- gg_roc: Extracts ROC curve diagnostics from classification forests, supporting evaluation of model discriminative performance.
Example Syntax
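```r
library(randomForestSRC); library(ggRandomForests); library(ggplot2)
# Fit a survival forest; `mydata` must contain `time` and `status` columns
rf <- rfsrc(Surv(time, status) ~ ., data = mydata, importance = TRUE, tree.err = TRUE)
errD <- gg_error(rf)  # OOB error by number of trees, as a data frame
plot(errD) + labs(x = "Number of Trees", y = "OOB Error") + theme_minimal()
```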
This approach makes each plot both a diagnostic visual and an extractable step in a modular analytic protocol (Ehrlinger, 2016).
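The same extract-then-plot pattern applies to minimal depth. The sketch below is illustrative only and assumes the `veteran` dataset from randomForestSRC, an arbitrary seed, and a small `ntree`; results will vary with the installed package versions.

```r
library(randomForestSRC)
library(ggRandomForests)

data(veteran, package = "randomForestSRC")

set.seed(42)
rf <- rfsrc(Surv(time, status) ~ ., data = veteran,
            ntree = 200, importance = TRUE)

# var.select() computes minimal depth for each variable of the fitted forest
vs <- var.select(object = rf, verbose = FALSE)

# Minimal depth ranking with its selection threshold
md <- gg_minimal_depth(vs)
plot(md)

# Side-by-side comparison of the VIMP and minimal depth rankings
mv <- gg_minimal_vimp(vs)
plot(mv)
```

Variables that rank well on one measure but poorly on the other are the ones worth a closer look, since the two measures capture different aspects of a variable's role in the forest.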
4. Theoretical Underpinnings
Key theoretical summaries embedded in the package logic include:
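- Out-of-Bag (OOB) Error: Each bootstrap sample leaves out about 36.8% of cases, since $(1 - 1/n)^n \to e^{-1} \approx 0.368$. The OOB prediction $\hat{y}_i^{\mathrm{OOB}}$ for case $i$ uses only the trees that did not sample $i$, and the OOB error averages a loss $L$ over all $n$ cases:

$$\mathrm{Err}_{\mathrm{OOB}} = \frac{1}{n} \sum_{i=1}^{n} L\left(y_i, \hat{y}_i^{\mathrm{OOB}}\right)$$

- Variable Importance (VIMP): The OOB error after permuting variable $x_j$, written $\mathrm{Err}_{\mathrm{OOB}}^{(j)}$, minus the original OOB error, measures the predictive utility of $x_j$:

$$\mathrm{VIMP}(x_j) = \mathrm{Err}_{\mathrm{OOB}}^{(j)} - \mathrm{Err}_{\mathrm{OOB}}$$

- Minimal Depth: For a random forest of $B$ trees, the minimal depth $d_{j,b}$ of variable $x_j$ in tree $b$ (the depth of the first split on $x_j$, with the root at depth 0) is averaged over all trees, $\bar{d}_j = \frac{1}{B} \sum_{b=1}^{B} d_{j,b}$; the importance threshold is given by the average minimal depth across all variables.
- RSF Survival Estimation: Survival for subject $i$ is the mean of the terminal-node Kaplan–Meier estimates $\hat{S}_b(t \mid x_i)$ over the set $\mathcal{O}_i$ of trees for which $i$ is OOB:

$$\hat{S}^{\mathrm{OOB}}(t \mid x_i) = \frac{1}{|\mathcal{O}_i|} \sum_{b \in \mathcal{O}_i} \hat{S}_b(t \mid x_i)$$

- Partial Dependence: The average predicted effect for variable $x_j$ at value $x$, after marginalizing over the empirical distribution of the other covariates $x_{i,-j}$:

$$\tilde{f}_j(x) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}\left(x, x_{i,-j}\right)$$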
These principles are consistently applied in function design and interpretation (Ehrlinger, 2016).
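The roughly 36.8% out-of-bag share quoted above can be checked numerically. The base R sketch below draws bootstrap samples with replacement and compares the simulated share of left-out cases with the exact expectation $(1 - 1/n)^n$ and its limit $e^{-1}$; the sample size, replicate count, and seed are arbitrary choices.

```r
set.seed(1)
n <- 1000   # cases in the training set
B <- 500    # number of bootstrap samples (one per tree)

# For each bootstrap sample, record the share of cases that were never drawn
oob_share <- replicate(B, {
  in_bag <- sample.int(n, size = n, replace = TRUE)
  1 - length(unique(in_bag)) / n
})

mean_oob <- mean(oob_share)
theory   <- (1 - 1 / n)^n   # exact expected OOB share; tends to exp(-1)

c(simulated = mean_oob, theory = theory, limit = exp(-1))
```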
5. Application to Survival Data: PBC Case Study
A detailed demonstration on the Primary Biliary Cirrhosis (PBC) dataset (clinical trial data) illustrates core workflow and analytic outputs:
- Data Preparation: Filtering, deriving new features, and transforming time units.
- Model Fitting: RSF is trained using `rfsrc(Surv(years, status) ~ ., ...)`, with appropriate settings for variable sampling, split strategy, and missing data imputation.
- Diagnostics:
  - OOB error curves for selecting ensemble size.
  - VIMP and minimal depth plots for key predictor identification.
  - Comparison plots of VIMP vs. minimal depth orderings.
  - Survival curves by treatment, both empirical (Kaplan–Meier) and model-predicted (from forest aggregation).
  - Time-varying variable effect plots and risk-adjusted partial dependence for continuous predictors (e.g., bilirubin at 1 and 3 years).
  - ROC metrics shown for illustration in the classification context (using the Iris dataset).
The case study confirms the utility of ggRandomForests for visual and quantitative scrutiny of random forest survival models and substantiates the dual aims of prediction and information retrieval in time-to-event settings (Ehrlinger, 2016).
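A condensed version of this workflow might look as follows. It is a sketch under stated assumptions, not a reproduction of the published analysis: it assumes the `pbc` data shipped with randomForestSRC (with `days`, `status`, and `treatment` columns), and the values of `ntree`, `nsplit`, and the seed are illustrative. Recent randomForestSRC releases use `block.size = 1` to request per-tree error; on older releases the `tree.err = TRUE` flag named in the text plays that role.

```r
library(randomForestSRC)
library(ggRandomForests)

data(pbc, package = "randomForestSRC")

# Data preparation: keep randomized patients, convert follow-up to years
pbc_trial <- pbc[!is.na(pbc$treatment), ]
pbc_trial$years <- pbc_trial$days / 365.25
pbc_trial$days <- NULL

# Empirical Kaplan-Meier curves stratified by treatment arm
km <- gg_survival(interval = "years", censor = "status",
                  by = "treatment", data = pbc_trial)
plot(km)

# Random survival forest with random splitting and missing-data imputation
set.seed(42)
rf_pbc <- rfsrc(Surv(years, status) ~ ., data = pbc_trial,
                ntree = 500, nsplit = 10, na.action = "na.impute",
                importance = TRUE, block.size = 1)

# Diagnostics: OOB error trajectory and variable importance
plot(gg_error(rf_pbc))
plot(gg_vimp(rf_pbc))
```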
6. Recommended Practices and Interpretive Considerations
- For reproducibility and analytic robustness:
  - Always specify `tree.err=TRUE` and `importance=TRUE` during model training to support extraction of error and VIMP curves.
  - Employ large ensemble sizes so that the OOB error stabilizes.
  - Customize ggplot2 outputs via additional theming, faceting, and scaling for publication or domain adaptation.
- Analytic caveats:
  - VIMP can be biased for predictors with high missingness or many continuous splits.
  - Minimal depth and VIMP offer complementary perspectives; disagreement may pinpoint variables sensitive to forest growth stochasticity.
  - OOB variable dependence reflects raw marginal effects and may be confounded by uncontrolled dependencies among covariates.
  - Partial dependence adjusts risk predictions for covariate distributions but can extrapolate beyond the observed data, so interpretation should account for that risk.
  - ROC diagnostics apply only to classification forests, not to survival or regression contexts.
These guidelines facilitate critical, replicable use of ggRandomForests for model exploration, particularly in survival analysis, and align with best practices for transparent statistical modeling (Ehrlinger, 2016).
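To illustrate the last caveat, ROC extraction on a classification forest can be sketched with the built-in `iris` data. The seed, `ntree`, and the choice of the first class level are arbitrary assumptions of this example.

```r
library(randomForestSRC)
library(ggRandomForests)

set.seed(42)
# Classification forest: the response is a factor
rf_iris <- rfsrc(Species ~ ., data = iris, ntree = 200)

# ROC data for one class against the rest; which_outcome indexes the class level
roc_df <- gg_roc(rf_iris, which_outcome = 1)
plot(roc_df)

# Area under the extracted ROC curve
auc <- calc_auc(roc_df)
auc
```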