
randomForestSRC Package Overview

Updated 23 December 2025
  • randomForestSRC is a unified R package that implements ensemble random forests for survival, regression, and classification analysis.
  • It adapts Breiman’s nonparametric tree ensembles for right-censored data with integrated error estimation and variable selection techniques.
  • The package provides practical tools such as OOB error metrics, variable importance (VIMP) measures, minimal depth analysis, and interactive visualizations.

The randomForestSRC package is a unified random forest implementation in R supporting survival, regression, and classification analysis based on ensemble decision tree methodology. Building on Breiman’s nonparametric tree ensembles, with extensions for right-censored time-to-event data by Ishwaran and Kogalur, randomForestSRC provides a single functional and algorithmic framework for a variety of supervised learning scenarios, together with extensive tools for error estimation, variable importance, interpretability, and missing-data handling (Ehrlinger, 2016).

1. Core Functionality

The central function rfsrc() grows random forests for three response families (see the example after this list):

  • "surv": Time-to-event outcomes, using the Surv(time,status) format.
  • "regr": Continuous regression.
  • "class": Categorical classification.

Default parameterization is set by the detected response type. Defaults include:

  • Split Rule: “logrank” for survival, “gini” for classification, “mse” (mean squared error) for regression.
  • mtry: $\lfloor\sqrt{p}\rfloor$ for survival/classification, $p/3$ for regression.
  • ntree: 1000.
  • nodesize: 1 (classification/regression), 3 (survival).
  • nsplit (survival): 10, i.e., 10 randomly selected split points per candidate covariate.
  • samptype: “swr” (sampling with replacement, i.e., bagging; the default) or “swor” (without replacement).
  • na.action: “na.omit” (the default) or “na.impute” (adaptive tree imputation).

Users are encouraged to tune the following (a short tuning sketch follows this list):

  • ntree until OOB error stabilizes (typically 500–2000).
  • mtry: Lower values promote tree heterogeneity (diversity), higher values yield lower individual tree bias.
  • nodesize: Smaller values permit deeper trees (greater variance, lower bias); for large data, moderate increases (5–10) may improve generalization.
  • splitrule in survival forests can be “logrank,” “logrankscore,” or “random.”
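
A minimal tuning sketch along these lines, assuming the bundled veteran data: grow forests over a small grid of mtry and nodesize values and compare the final OOB error (newer package releases also export a tune() helper that automates a similar search).

    library(randomForestSRC)
    data(veteran, package = "randomForestSRC")

    ## Small grid over mtry and nodesize; keep the setting with lowest OOB error
    grid <- expand.grid(mtry = c(2, 3, 4), nodesize = c(3, 5, 10))
    grid$oob.err <- apply(grid, 1, function(g) {
      fit <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 500,
                   mtry = g[["mtry"]], nodesize = g[["nodesize"]])
      tail(fit$err.rate, 1)   # OOB error after the last tree (1 - C-index here)
    })
    grid[which.min(grid$oob.err), ]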

2. Algorithmic Foundations

randomForestSRC employs the classical random forest approach:

  • Bootstrap & Aggregation: For b=1,,Bb=1, \dots, B:
    • Draw bootstrap sample DbD_b (size nn) from full data DD, fit tree TbT_b.
    • The OOB (out-of-bag) set, OOBbOOB_b, comprises 36.8%\sim 36.8\% of data not used in DbD_b.
  • Prediction:
    • Regression: OOB ensemble prediction

    \hat{f}_{\mathrm{RF}}(x_i) = \frac{1}{B_i}\sum_{b:\, i\in \mathrm{OOB}_b} \hat{f}_b(x_i)

    where $B_i$ is the number of trees for which observation $i$ is out-of-bag.
    • Classification: Majority vote from OOB predictions.
    • Survival: The forest survival estimate is the ensemble of terminal-node Kaplan–Meier curves.

  • Splitting Criteria:

    • Survival: Maximize two-sample log-rank statistic:

    LR(s) = \frac{\Big[\sum_{t\in\mathcal T}\big(d_L(t) - Y_L(t)\frac{d(t)}{Y(t)}\big)\Big]^2}{\sum_{t\in\mathcal T}\frac{Y_L(t)Y_R(t)d(t)[Y(t)-d(t)]}{Y(t)^2 [Y(t)-1]}}

    where $d(t)$ and $Y(t)$ are the number of events and the number at risk at time $t$ (subscripts $L$ and $R$ denote the daughter nodes), maximizing separation of event times post-split.
    • Regression: Split yielding the largest impurity reduction $\Delta I = \sum(y_i-\bar y)^2$.
    • Classification: Gini impurity or entropy reduction.

  • Survival Functions:

    • Kaplan–Meier within terminal nodes:

    \hat S_{b,j}(t) = \prod_{u\le t}\left(1-\frac{d_{b,j}(u)}{Y_{b,j}(u)}\right)

    Forest ensemble:

    \hat S_{\mathrm{RF}}(t\mid x_i) = \frac{1}{B}\sum_{b=1}^B \hat S_{b,j_b}(t)

    where $j_b$ is the terminal node of tree $b$ containing $x_i$. The cumulative hazard function, via the Nelson–Aalen estimator, is constructed analogously (see the sketch below).
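
The ensemble estimates above are stored directly on the fitted object (survival, survival.oob, chf, chf.oob, evaluated at time.interest); a brief sketch, assuming the bundled veteran data:

    library(randomForestSRC)
    data(veteran, package = "randomForestSRC")
    fit <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 500)

    ## Rows = subjects, columns = unique event times (fit$time.interest)
    dim(fit$survival.oob)

    ## OOB ensemble survival curves for the first 10 subjects
    matplot(fit$time.interest, t(fit$survival.oob[1:10, ]),
            type = "s", lty = 1,
            xlab = "Time", ylab = "OOB survival probability")

    ## Nelson-Aalen-based cumulative hazard, same layout
    fit$chf.oob[1:3, 1:5]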

3. Error Estimation and Performance Metrics

  • Classification: OOB error is the proportion of misclassified OOB samples, tracked per tree via err.rate.
  • Regression: OOB mean squared error:

    \mathrm{OOB\_MSE} = \frac{1}{n}\sum_i\big(y_i-\hat{f}_{\mathrm{OOB}}(x_i)\big)^2

  • Survival:
    • Integrated Brier Score at time $t$:

    BS(t) = \frac{1}{n} \sum_{i=1}^n w_i(t)\left(I\{T_i>t\} - \hat S_{\mathrm{RF}}(t\mid x_i)\right)^2

    with $w_i(t)$ the inverse-probability-of-censoring weights.
    • Concordance index (C-index), quantifying agreement between predicted and observed event orderings over all comparable pairs:

    C = \frac{\sum_{i \ne j} I\{T_i < T_j\}\, I\{\hat\eta_i > \hat\eta_j\}\, \delta_i}{\sum_{i \ne j} I\{T_i < T_j\}\, \delta_i}

    where $\hat\eta_i$ is the predicted risk (e.g., ensemble mortality) and $\delta_i$ the event indicator.

4. Variable Importance and Interpretability

  • Permutation Variable Importance (VIMP):
    • For variable $v$, permute its OOB values in each tree and measure the resulting increase in OOB error.
    • $\mathrm{VIMP}(v) = \mathrm{Err}_{\mathrm{permuted}}(v) - \mathrm{Err}_{\mathrm{original}}$.
    • Values are stored in importance within model objects and can be visualized.

  • Minimal Depth:
    • $\mathrm{depth}_b(v)$: level of the rootmost split on $v$ in tree $b$ (root = 0).
    • Minimal depth = average over all trees.
    • Variables with small minimal depth generally have greatest predictive relevance.
    • var.select() computes minimal depths and an analytic mean-depth threshold to guide variable selection.
  • Interaction Assessment: find.interaction() returns a $p \times p$ matrix indexed by variable pairs indicating candidate interactive effects, based on minimal depth (see the sketch below).
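
A short sketch pulling these error and importance diagnostics off a fitted survival forest (bundled veteran data assumed):

    library(randomForestSRC)
    data(veteran, package = "randomForestSRC")
    fit <- rfsrc(Surv(time, status) ~ ., data = veteran,
                 ntree = 500, importance = TRUE)

    ## OOB error by number of trees; for survival forests this is 1 - C-index
    plot(fit$err.rate, type = "l", xlab = "Trees", ylab = "OOB error")
    1 - tail(fit$err.rate, 1)                 # OOB concordance index

    ## Permutation VIMP stored on the object (vimp() recomputes on demand)
    sort(fit$importance, decreasing = TRUE)

    ## Minimal-depth selection and pairwise interaction screening
    vs <- var.select(fit)
    vs$topvars
    find.interaction(fit, method = "maxsubtree")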

5. Handling Right-Censored Data and Missingness

  • Censoring: Input via the Surv(time, status) response, with status 1 for an event and 0 for censoring.
  • Split handling: The log-rank statistic and node-specific Kaplan–Meier estimators enable robust nonparametric handling of right-censoring.
  • Missing Data: The adaptive “na.impute” option imputes missing entries at each node, using draws from in-node non-missing data during the split search, with final OOB-aggregated filling (see the sketch below).
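
A brief sketch of both routes, on-the-fly imputation via na.action = "na.impute" and standalone imputation with the package's impute() helper, using a copy of the veteran data with artificially deleted entries:

    library(randomForestSRC)
    data(veteran, package = "randomForestSRC")

    ## Delete some covariate values for illustration
    vet.na <- veteran
    set.seed(1)
    vet.na$karno[sample(nrow(vet.na), 20)] <- NA

    ## Route 1: impute adaptively while growing the forest
    fit <- rfsrc(Surv(time, status) ~ ., data = vet.na, na.action = "na.impute")

    ## Route 2: standalone forest imputation, then fit on the completed data
    vet.complete <- impute(Surv(time, status) ~ ., data = vet.na)
    fit2 <- rfsrc(Surv(time, status) ~ ., data = vet.complete)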

6. Model Object Structure and Methodology

Objects returned by rfsrc() possess a standardized S3 structure:

  • Primary slots: family, n, p, mtry, ntree, nodesize.
  • Forest representation: forest contains tree memberships (ndbigtree), node data, variables, and split points.
  • Predictions and error rates: predicted, predicted.oob, err.rate.
  • Survival-specific components: survival (an $n \times T$ matrix), chf (cumulative hazard), time.interest (unique event times).
  • Variable selection and interaction: Integrated methods include var.select() and find.interaction() for in-depth model interrogation.

Methods for visualization and exploration include print.rfsrc(), plot.rfsrc(), variable importance and depth plots, and heatmaps of interaction metrics.

7. Representative R Usage and Workflow Examples

Canonical workflow involves:
    ## Grow a survival forest on the pbc training subset. pbc.trial and pbc.test
    ## are the trial/test splits of the pbc data used in Ehrlinger (2016).
    library(randomForestSRC)
    set.seed(42)
    rf <- rfsrc(
      formula   = Surv(time, status) ~ .,
      data      = pbc.trial,
      ntree     = 1000,
      mtry      = floor(sqrt(ncol(pbc.trial) - 2)),  # p excludes time and status
      nodesize  = 3,
      nsplit    = 10,
      na.action = "na.impute",
      importance = TRUE
    )

    ## OOB error trajectory and permutation VIMP
    plot(rf)
    barplot(sort(rf$importance, decreasing = TRUE), main = "VIMP (permutation)")

    ## Minimal-depth variable selection
    varsel <- var.select(rf)
    varsel$topvars

    ## Ensemble survival curves for the first 50 subjects
    matplot(rf$time.interest, t(rf$survival[1:50, ]),
            type = "l", col = rainbow(50),
            xlab = "Time", ylab = "Survival probability")

    ## Predict on the held-out test set
    pred <- predict(rf, newdata = pbc.test, na.action = "na.impute", importance = TRUE)

    ## Harrell's C-index via survcomp (Bioconductor); concordance.index() is not in pec.
    ## Equivalently, 1 - tail(rf$err.rate, 1) gives the OOB C-index.
    library(survcomp)
    cidx <- concordance.index(
      x          = rowMeans(rf$chf),      # ensemble cumulative hazard as risk score
      surv.time  = rf$yvar[,1],
      surv.event = rf$yvar[,2]
    )
    print(cidx$c.index)

    ## Pairwise minimal-depth interaction matrix
    heatmap(find.interaction(rf), symm = TRUE)
The table below summarizes key object components:

| Component | Description | Applies to |
| --- | --- | --- |
| err.rate | OOB error by tree | All |
| importance | Permutation VIMP | All |
| survival | OOB survival estimates ($n \times T$) | Survival |
| chf | OOB cumulative hazard ($n \times T$) | Survival |
| forest | Underlying trees: node data, splits, memberships | All |
| var.select() | Computes minimal depth and analytic depth threshold | All |
| find.interaction() | Pairwise minimal-depth interactions | All |
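
These components can be inspected directly on the fitted object from the workflow above, for example:

    ## Stored components of the fitted survival forest
    names(rf)

    ## Survival-specific slots: subjects x unique event times
    dim(rf$survival)
    length(rf$time.interest)

    ## Error trajectory and top VIMP values
    tail(rf$err.rate, 1)
    sort(rf$importance, decreasing = TRUE)[1:5]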

The randomForestSRC package thus operationalizes nonparametric ensemble learning with rigorous error control and interpretability, set within a unified R framework for categorical, continuous, and survival outcomes (Ehrlinger, 2016).
