randomForestSRC Package Overview
- randomForestSRC is a unified R package that implements ensemble random forests for survival, regression, and classification analysis.
- It adapts Breiman’s nonparametric tree ensembles for right-censored data with integrated error estimation and variable selection techniques.
- The package provides practical tools such as OOB error metrics, variable importance (VIMP) measures, minimal depth analysis, and interactive visualizations.
The randomForestSRC package is a unified random forest implementation in R supporting survival, regression, and classification analysis based on ensemble decision tree methodology. Originating from Breiman’s nonparametric tree ensembles, with extensions for right-censored time-to-event data by Ishwaran and Kogalur, randomForestSRC provides a single functional and algorithmic framework for a variety of supervised learning scenarios, with extensive tools for error estimation, variable importance, interpretability, and missing data (Ehrlinger, 2016).
1. Core Functionality
The central function rfsrc() grows random forests for three response families:
- "surv": Time-to-event outcomes, using the Surv(time, status) response format.
- "regr": Continuous regression.
- "class": Categorical classification.
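A minimal sketch of one call per response family (assumes randomForestSRC is installed; the veteran data ships with the package, while iris and airquality come with base R):

```r
# One rfsrc() call per response family; rfsrc() detects the family
# from the left-hand side of the formula.
library(randomForestSRC)

# Survival: right-censored time-to-event outcome
data(veteran, package = "randomForestSRC")
fit.surv <- rfsrc(Surv(time, status) ~ ., data = veteran)

# Regression: continuous outcome
fit.regr <- rfsrc(Ozone ~ ., data = na.omit(airquality))

# Classification: categorical outcome
fit.class <- rfsrc(Species ~ ., data = iris)
```

Each returned object prints a family-appropriate summary (OOB error rate, sample sizes, forest settings) via print().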
Default parameterization is set by the detected response type. Defaults include:
- Split Rule: “logrank” for survival, “gini” for classification, “mse” (mean squared error) for regression.
- mtry: $\sqrt{p}$ for survival/classification, $p/3$ for regression (where $p$ is the number of covariates).
- ntree: 1000.
- nodesize: 1 (classification/regression), 3 (survival).
- nsplit (survival): 10, i.e., up to 10 randomly selected split points per candidate covariate.
- samptype: “swr” (sampling with replacement; bagging) or “swor” (sampling without replacement).
- na.action: “na.impute” (adaptive in-tree imputation) or “na.omit”.
Users are encouraged to tune:
- ntree until OOB error stabilizes (typically 500–2000).
- mtry: Lower values promote tree heterogeneity (diversity), higher values yield lower individual tree bias.
- nodesize: Smaller values permit deeper trees (greater variance, lower bias); for large data, moderate increases (5–10) may improve generalization.
- splitrule in survival forests can be “logrank,” “logrankscore,” or “random.”
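A tuning pass over these knobs might look like the following sketch (the settings are illustrative, not recommendations; assumes randomForestSRC and its bundled veteran data):

```r
# Illustrative tuning of the main rfsrc() hyperparameters.
library(randomForestSRC)
data(veteran, package = "randomForestSRC")

fit <- rfsrc(Surv(time, status) ~ ., data = veteran,
             ntree     = 1000,       # grow until the OOB error flattens
             mtry      = 3,          # fewer candidates -> more diverse trees
             nodesize  = 5,          # larger nodes -> shallower trees
             nsplit    = 10,         # random split points per covariate
             splitrule = "logrank")

plot(fit)   # OOB error rate as a function of the number of trees
```

The package also provides a tune() helper that searches over mtry and nodesize using OOB error.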
2. Algorithmic Foundations
randomForestSRC employs the classical random forest approach:
- Bootstrap & Aggregation: For $b = 1, \dots, B$:
  - Draw a bootstrap sample $\mathcal{D}_b$ (size $n$) from the full data $\mathcal{D}$ and fit tree $T_b$.
  - The OOB (out-of-bag) set, $\mathcal{O}_b = \mathcal{D} \setminus \mathcal{D}_b$, comprises the roughly 37% of cases not drawn into $\mathcal{D}_b$.
- Prediction:
  - Regression: OOB ensemble average over the trees for which case $i$ is out of bag,
$$\hat{y}^{\,\text{OOB}}(x_i) = \frac{1}{|\{b : i \in \mathcal{O}_b\}|} \sum_{b \,:\, i \in \mathcal{O}_b} T_b(x_i).$$
  - Classification: Majority vote from OOB predictions.
  - Survival: Forest survival estimate is the ensemble of terminal node Kaplan–Meier curves.
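The bootstrap/OOB bookkeeping above can be sketched in a few lines of base R; per tree, the expected OOB fraction is $(1 - 1/n)^n \approx e^{-1} \approx 0.368$:

```r
# Base-R sketch of per-tree bootstrap sampling and OOB set construction.
set.seed(42)
n <- 1000   # number of cases
B <- 200    # number of trees

oob_frac <- replicate(B, {
  inbag <- sample(n, n, replace = TRUE)   # bootstrap sample D_b
  oob   <- setdiff(seq_len(n), inbag)     # OOB set O_b = D \ D_b
  length(oob) / n
})

mean(oob_frac)   # close to exp(-1), about 0.368
```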
Splitting Criteria:
- Survival: Maximize the two-sample log-rank statistic for a candidate split into left and right daughters $L$ and $R$,
$$L(s) = \frac{\sum_{j=1}^{m} \left( d_{j,L} - Y_{j,L}\,\frac{d_j}{Y_j} \right)}{\sqrt{\sum_{j=1}^{m} \frac{Y_{j,L}}{Y_j}\left(1 - \frac{Y_{j,L}}{Y_j}\right)\left(\frac{Y_j - d_j}{Y_j - 1}\right) d_j}},$$
where $d_j$ ($d_{j,L}$) and $Y_j$ ($Y_{j,L}$) are the number of events and the number at risk at the $j$-th distinct event time overall (and within the left daughter), maximizing separation of event times post-split.
- Regression: Split yielding the largest impurity reduction, i.e., the greatest decrease in within-node mean squared error.
- Classification: Gini impurity or entropy reduction.
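As an illustration (not the package's internal C implementation), the standardized log-rank statistic can be computed directly from the definition above and cross-checked against survival::survdiff():

```r
# Standardized two-sample log-rank statistic for a candidate split.
logrank_stat <- function(time, status, in_left) {
  tj <- sort(unique(time[status == 1]))           # distinct event times
  num <- 0; den <- 0
  for (t in tj) {
    Y  <- sum(time >= t)                          # at risk overall
    Y1 <- sum(time >= t & in_left)                # at risk in left daughter
    d  <- sum(time == t & status == 1)            # events overall
    d1 <- sum(time == t & status == 1 & in_left)  # events in left daughter
    num <- num + (d1 - Y1 * d / Y)
    if (Y > 1)
      den <- den + (Y1 / Y) * (1 - Y1 / Y) * ((Y - d) / (Y - 1)) * d
  }
  num / sqrt(den)
}

# Tiny toy split: two cases per daughter, all events observed
time   <- c(1, 3, 2, 4)
status <- c(1, 1, 1, 1)
left   <- c(TRUE, TRUE, FALSE, FALSE)
L <- logrank_stat(time, status, left)
```

The squared statistic $L^2$ agrees with the chi-square reported by survival::survdiff() for a two-group comparison.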
Survival Functions:
- Kaplan–Meier within terminal node $h$:
$$\hat{S}_h(t) = \prod_{t_j \le t} \left( 1 - \frac{d_{j,h}}{Y_{j,h}} \right).$$
- Forest ensemble, averaging over the terminal node $h_b(x)$ containing $x$ in each tree:
$$\hat{S}(t \mid x) = \frac{1}{B} \sum_{b=1}^{B} \hat{S}_{h_b(x)}(t).$$
- The cumulative hazard via Nelson–Aalen is analogous: $\hat{H}_h(t) = \sum_{t_j \le t} d_{j,h}/Y_{j,h}$.
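The two node-level estimators above reduce to a short base-R loop (an illustration, not the package's internals):

```r
# Node-level Kaplan-Meier and Nelson-Aalen estimators at event times.
km_na <- function(time, status) {
  tj <- sort(unique(time[status == 1]))      # distinct event times
  S <- numeric(length(tj)); H <- numeric(length(tj))
  s <- 1; h <- 0
  for (k in seq_along(tj)) {
    Y <- sum(time >= tj[k])                  # number at risk
    d <- sum(time == tj[k] & status == 1)    # number of events
    s <- s * (1 - d / Y)                     # Kaplan-Meier product
    h <- h + d / Y                           # Nelson-Aalen sum
    S[k] <- s; H[k] <- h
  }
  data.frame(time = tj, surv = S, chf = H)
}

# Events at t = 1, 2, 4; censoring at t = 3
est <- km_na(time = c(1, 2, 3, 4), status = c(1, 1, 0, 1))
# est$surv is 0.75, 0.50, 0.00 at event times 1, 2, 4
```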
3. Error Estimation and Performance Metrics
Classification and regression: OOB error as the proportion of misclassified OOB samples (classification) or the OOB mean squared error (regression); tracked per tree via the $err.rate component.
Survival:
- Integrated Brier Score at time $t$:
$$\mathrm{BS}(t) = \frac{1}{n} \sum_{i=1}^{n} W_i(t)\left( \mathbb{1}\{T_i > t\} - \hat{S}(t \mid x_i) \right)^2,$$
with $W_i(t)$ as inverse-probability-of-censoring weights.
- Concordance index (C-index):
$$C = \frac{\#\{\text{concordant comparable pairs}\}}{\#\{\text{comparable pairs}\}},$$
quantifying agreement between predicted and observed event orderings; the survival OOB error rate is reported as $1 - C$.
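Harrell's C-index admits a direct (if quadratic-time) base-R sketch; a pair is comparable when the earlier time is an observed event, and tied risks count one half:

```r
# Harrell's C-index over comparable pairs.
cindex <- function(time, status, risk) {
  conc <- 0; comp <- 0
  n <- length(time)
  for (i in seq_len(n)) for (j in seq_len(n)) {
    if (time[i] < time[j] && status[i] == 1) {  # comparable pair
      comp <- comp + 1
      if (risk[i] > risk[j])       conc <- conc + 1     # concordant
      else if (risk[i] == risk[j]) conc <- conc + 0.5   # tied risk
    }
  }
  conc / comp
}

# Perfectly ordered risks give C = 1
cindex(time = c(1, 2, 3), status = c(1, 1, 1), risk = c(3, 2, 1))  # 1
```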
4. Variable Importance and Interpretability
Permutation Variable Importance (VIMP):
- For each variable $v$, permute its OOB values in each tree and record the resulting increase in OOB error.
- $\mathrm{VIMP}(v) = \mathrm{Err}^{\,\text{OOB}}_{\text{permuted}}(v) - \mathrm{Err}^{\,\text{OOB}}$.
- Values are stored in $importance within model objects and can be visualized.
- Minimal Depth:
- $d_b(v)$: depth of the rootmost split on variable $v$ in tree $b$ (root = 0).
- Minimal depth = average of $d_b(v)$ over all trees.
- Variables with small minimal depth generally have the greatest predictive relevance.
var.select() computes minimal depths and an analytic mean-depth threshold to guide variable selection.
- Interaction Assessment:
find.interaction() returns a $p \times p$ matrix indexed by variable pairs indicating candidate interactive effects, based on paired minimal depth.
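The permutation-VIMP idea can be demonstrated without a forest at all; the sketch below uses a linear model and a holdout set in place of trees and OOB data (a hypothetical setup, for illustration only):

```r
# Conceptual permutation-importance sketch: permute one covariate and
# measure the increase in holdout error.
set.seed(1)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 2 * x1 + rnorm(n)              # only x1 is predictive
dat   <- data.frame(y, x1, x2)
train <- dat[1:250, ]; test <- dat[251:500, ]

fit      <- lm(y ~ x1 + x2, data = train)
base_err <- mean((test$y - predict(fit, test))^2)

perm_vimp <- function(var) {
  perm <- test
  perm[[var]] <- sample(perm[[var]])                     # permute one column
  mean((perm$y - predict(fit, perm))^2) - base_err       # error increase
}

perm_vimp("x1")   # large positive: x1 matters
perm_vimp("x2")   # near zero: x2 is noise
```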
5. Handling Right-Censored Data and Missingness
- Censoring: Input via
Surv(time, status)response, with status 1 for event, 0 for censoring. - Split handling: Log-rank statistic and node-specific Kaplan–Meier estimators enable robust nonparametric management of right-censoring.
- Missing Data: The adaptive “na.impute” option imputes missing values at each node using draws from the in-node non-missing data during the split search, with final values filled by OOB aggregation.
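A brief sketch of both imputation routes (assumes randomForestSRC; here we punch random holes into the bundled veteran data for illustration):

```r
# Forest-based handling of missing covariate values.
library(randomForestSRC)
data(veteran, package = "randomForestSRC")

vet <- veteran
set.seed(7)
vet$karno[sample(nrow(vet), 10)] <- NA     # inject missingness

# Missing covariates are imputed adaptively while growing the forest
fit <- rfsrc(Surv(time, status) ~ ., data = vet, na.action = "na.impute")

# Standalone imputation of a data set via the impute() front end
vet.imputed <- impute(data = vet)
```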
6. Model Object Structure and Methodology
Objects returned by rfsrc() possess a standardized S3 structure:
- Primary slots: n, mtry, nodesize.
- Forest representation: per-tree node counts (ndbigtree), node data, split variables, and split points.
- Predictions and error rates: predicted.oob, survival.oob, chf (cumulative hazard), time.interest (unique event times), and err.rate.
- Variable selection and interaction: Integrated methods include var.select() and find.interaction() for in-depth model interrogation.
Methods for visualization and exploration include print.rfsrc(), plot.rfsrc(), variable importance and depth plots, and heatmaps of interaction metrics.
7. Representative R Usage and Workflow Examples
The canonical workflow involves growing a forest with rfsrc(), inspecting OOB error with print() and plot(), ranking variables with vimp() or var.select(), and predicting on new data with predict().
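A concrete version of this workflow, assuming randomForestSRC is installed, might look like:

```r
# Representative end-to-end survival workflow.
library(randomForestSRC)
data(veteran, package = "randomForestSRC")

# 1. Grow the survival forest
fit <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 1000)

# 2. Inspect OOB error and convergence
print(fit)
plot(fit)

# 3. Variable importance and minimal-depth selection
vi <- vimp(fit)
vs <- var.select(object = fit)

# 4. Pairwise interaction screening
fi <- find.interaction(fit)

# 5. Predict on new data (the training data here, for illustration)
pred <- predict(fit, newdata = veteran)
```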
The table below summarizes key object components:
| Component | Description | Applies to |
| --- | --- | --- |
| $err.rate | OOB error by tree | All |
| $survival | n × T matrix of ensemble survival estimates over the T unique event times | Survival |
| $forest | Underlying trees: node data, splits, memberships | All |
| var.select() | Computes minimal depth, analytic depth threshold | All |
| find.interaction() | Pairwise minimal depth interactions | All |

The randomForestSRC package thus operationalizes nonparametric ensemble learning with rigorous error control and interpretability, set within a unified R framework for categorical, continuous, and survival outcomes (Ehrlinger, 2016).
References
- Ehrlinger, J. (2016). ggRandomForests: Exploring Random Forest Survival. arXiv preprint arXiv:1612.08974.