Balanced Gain Ratio (BGR) in Decision Trees
- Balanced Gain Ratio (BGR) is an impurity-based gain metric that adjusts the traditional gain function to correct bias in decision tree splits.
- The metric is computed as the information gain divided by (1 + split information), preserving the gain ratio's correction for high-arity splits while avoiding bias toward unbalanced splits of low predictive value.
- Empirical evaluations demonstrate that BGR produces shallower, more interpretable trees and improves model induction speed and accuracy across various benchmark datasets.
The Balanced Gain Ratio (BGR) is an alternative impurity-based gain function for decision tree induction that addresses split selection biases observed in earlier metrics such as information gain and Quinlan's information gain ratio in C4.5. BGR is specifically designed to preserve the corrective properties of the gain ratio regarding high-cardinality categorical splits, while simultaneously mitigating the tendency of the gain ratio to favor extremely unbalanced, low-predictive-value splits. Its implementation yields more balanced decision trees—both in terms of depth and branch distribution—without reintroducing bias towards features with numerous distinct values (Leroux et al., 2018).
1. Formal Definition and Motivation
Let $C$ denote a training sample set of size $|C|$ with $K$ classes. A candidate split $S$ divides $C$ into child nodes $C^1, \dots, C^J$. For each candidate split, the impurity reduction gain $G(S)$ (typically based on entropy) and the split information

$$SI(S) = -\sum_{j=1}^{J} p^j \log_2 p^j,$$

with $p^j = |C^j| / |C|$, are computed. Quinlan's traditional gain ratio is given by

$$GR(S) = \frac{G(S)}{SI(S)}.$$

The Balanced Gain Ratio is formulated as

$$BGR(S) = \frac{G(S)}{1 + SI(S)}.$$
The addition of $1$ to the denominator is designed to eliminate the artificial inflation of the score for splits where split information is near zero (characteristic of highly unbalanced splits). This adjustment maintains the bias correction for high-arity (many-way) splits but prevents favoring splits that isolate tiny, pure subsets—a known issue with Quinlan's original ratio.
2. Mathematical Derivation
For a node $C$, the impurity using entropy is

$$I(C) = -\sum_{k=1}^{K} p_k \log_2 p_k,$$

where $p_k$ is the empirical frequency of class $k$ in $C$. The standard information gain is

$$G(S) = I(C) - \sum_{j=1}^{J} p^j \, I(C^j).$$

The split information is

$$SI(S) = -\sum_{j=1}^{J} p^j \log_2 p^j.$$

Thus, the Balanced Gain Ratio is

$$BGR(S) = \frac{I(C) - \sum_{j=1}^{J} p^j \, I(C^j)}{1 - \sum_{j=1}^{J} p^j \log_2 p^j}.$$

If $SI(S)$ is small (highly uneven splits), the denominator approaches $1$, so the ratio closely reflects the raw information gain. For large $SI(S)$ (many-way splits), the behavior is similar to the original gain ratio, strongly penalizing high-cardinality splits.
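As a quick numerical check of these quantities, the following sketch (toy class counts, chosen purely for illustration) computes $G$, $SI$, and $BGR$ for a perfectly pure, perfectly balanced two-way split:

```python
import math

def entropy(counts):
    """Entropy I(C) over class counts, in bits (base-2)."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def split_scores(parent_counts, child_counts):
    """Return (gain G, split information SI, BGR) for one candidate split."""
    n = sum(parent_counts)
    props = [sum(c) / n for c in child_counts]
    gain = entropy(parent_counts) - sum(
        p * entropy(c) for p, c in zip(props, child_counts))
    si = -sum(p * math.log2(p) for p in props if p > 0)
    return gain, si, gain / (1 + si)

# Toy node: 8 positives, 8 negatives, split into two balanced pure children.
g, si, bgr = split_scores([8, 8], [[8, 0], [0, 8]])
# G = 1.0 (parent entropy 1, children pure), SI = 1.0, so BGR = 1 / (1 + 1) = 0.5
```

Note that Quinlan's gain ratio would score this same split $G / SI = 1.0$; BGR halves it, but applies the same halving to every balanced two-way split, so relative rankings among balanced splits are unchanged.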
3. Bias Corrections in Gain Functions
Standard Information Gain
The standard impurity reduction has a known bias toward splits on high-cardinality features because maximizing purity is trivial when each instance can be isolated. In the limit, features with one unique value per instance yield maximal gain, even absent predictive value.
Quinlan’s Gain Ratio
Quinlan’s ratio penalizes high-cardinality splits:

$$GR(S) = \frac{G(S)}{SI(S)}.$$

Here, $SI(S)$ grows with the number of partitions, discouraging many-way splits. However, when $SI(S)$ is very small (characteristic of splits producing one large and several tiny partitions), the denominator deflates, causing the gain ratio to be artificially large and biasing tree growth toward these unbalanced splits.
Balanced Gain Ratio
By redefining the denominator as $1 + SI$, BGR suppresses this bias. For large $SI$, the additive $1$ is negligible and BGR approximates the original gain ratio. For $SI = 0$, BGR equals $G$, precluding artifactual boosting of splits that simply isolate noise points.
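This correction can be illustrated numerically. The sketch below (toy class counts, not taken from the paper) scores two candidate splits of a 50/50 node: one that isolates a single pure point, and a weakly informative balanced split:

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gr_and_bgr(parent, children):
    """Quinlan gain ratio and BGR for one candidate split (class-count lists)."""
    n = sum(parent)
    props = [sum(c) / n for c in children]
    gain = entropy(parent) - sum(p * entropy(c) for p, c in zip(props, children))
    si = -sum(p * math.log2(p) for p in props if p > 0)
    return gain / si, gain / (1 + si)

parent = [50, 50]                    # 100 samples, two balanced classes
isolate = [[1, 0], [49, 50]]         # split off one pure point (tiny partition)
balanced = [[30, 20], [20, 30]]      # weakly informative balanced split

gr_iso, bgr_iso = gr_and_bgr(parent, isolate)
gr_bal, bgr_bal = gr_and_bgr(parent, balanced)
# The gain ratio prefers the near-useless isolating split (gr_iso > gr_bal),
# while BGR prefers the balanced one (bgr_bal > bgr_iso).
```

Here the isolating split has almost no gain, but its tiny $SI$ ($\approx 0.08$) inflates the gain ratio past the balanced split's score; adding $1$ to the denominator removes that inflation.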
4. Algorithmic Implementation
BGR integrates into top-down, greedy induction schemes such as C4.5 by substituting the gain function for split selection. The following pseudocode outlines the integration:
```
function BuildTree(nodeSamples C):
    if stoppingCriterion(C) then
        return Leaf(label = majorityClass(C))
    bestScore ← −∞
    bestSplit ← null
    for each feature f:
        for each candidate split S on f:
            Partition C into {C^1, …, C^J} according to S
            Compute p^j = |C^j| / |C|, j = 1…J
            Compute impurity I(C) and I(C^j) for all j
            G   ← I(C) − Σ_{j=1}^J p^j I(C^j)     # standard gain
            SI  ← −Σ_{j=1}^J p^j log(p^j)         # split information
            BGR ← G / (1 + SI)                    # balanced gain ratio
            if BGR > bestScore:
                bestScore ← BGR
                bestSplit ← S
    if bestSplit is null then
        return Leaf(label = majorityClass(C))
    create internal node with test bestSplit
    for each child partition C^j of bestSplit:
        attach BuildTree(C^j) as j-th branch
    return internal node
```
All other induction, threshold discovery, and pruning strategies in conventional C4.5 are preserved (Leroux et al., 2018).
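The scoring step of the pseudocode can be sketched in Python for categorical features. The `bgr_score` and `best_feature` helpers and the feature-dictionary layout below are illustrative assumptions, not part of the original algorithm:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def bgr_score(labels, feature_values):
    """BGR of the multiway split induced by a categorical feature."""
    groups = defaultdict(list)
    for v, y in zip(feature_values, labels):
        groups[v].append(y)
    n = len(labels)
    props = [len(g) / n for g in groups.values()]
    gain = entropy(labels) - sum(
        p * entropy(g) for p, g in zip(props, groups.values()))
    si = -sum(p * math.log2(p) for p in props)
    return gain / (1 + si)

def best_feature(X, y):
    """X: dict mapping feature name -> list of categorical values; y: labels."""
    return max(X, key=lambda f: bgr_score(y, X[f]))
```

On an ID-like feature with one value per instance, `bgr_score` still pays the full $\log_2 J$ split-information penalty, so an informative balanced feature wins the `max` even though the ID feature's raw gain is maximal.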
5. Theoretical Properties
- Monotonicity in Split Information: The denominator $1 + SI$ is strictly increasing in $SI$, ensuring that for constant gain, higher split information always reduces or preserves the BGR score.
- Limiting Behavior:
  - As $SI \to \infty$, $BGR \to 0$, strongly discouraging high-arity splits.
  - As $SI \to 0$, $BGR \to G$, removing any split information penalty and thus not artificially inflating scores for extremely unbalanced splits.
- Effect of Partition Arity: For $J$ branches, the maximal $SI$ is $\log_2 J$, attained by a uniform split. Large $J$ implies strong denominator growth and penalization.
- Dependence on Class Distribution: BGR, through $G$, remains sensitive to class mixtures in the partitions, but applies a tempered correction related to the balance of the split.
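These properties are straightforward to verify numerically; a minimal sketch (the fixed gain value is an arbitrary illustration):

```python
import math

def bgr(gain, si):
    """Balanced gain ratio from a precomputed gain and split information."""
    return gain / (1 + si)

# For a uniform J-way split the split information attains its maximum,
# SI = log2(J); holding the gain fixed, BGR shrinks as arity grows.
gain = 0.5  # arbitrary fixed gain, for illustration only
scores = []
for J in (2, 4, 16, 256):
    si = -sum((1 / J) * math.log2(1 / J) for _ in range(J))  # equals log2(J)
    scores.append(bgr(gain, si))

# scores is strictly decreasing: higher arity, stronger penalty.
# With SI = 0 the penalty vanishes entirely: bgr(gain, 0) == gain.
```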
6. Empirical Evaluation
BGR and Quinlan’s gain ratio were evaluated on ten benchmark UCI datasets using C4.5-style induction, pessimistic error pruning, and stratified cross-validation (10 iterations of 5-folds). The key results on average accuracy are summarized below:
| Dataset | GainRatio (%) | BGR (%) | Δ = BGR–GainRatio (%) |
|---|---|---|---|
| Glass | 64.02 | 65.42 | +1.40 |
| BUPA Liver | 60.58 | 66.67 | +6.09 |
| Heart | 78.15 | 77.78 | −0.37 |
| Balance Scale | 73.92 | 77.60 | +3.68 |
| Survival | 73.86 | 72.87 | −0.99 |
| PIMA | 72.79 | 75.26 | +2.47 |
| Wine | 94.38 | 94.38 | +0.00 |
| Bank Marketing | 89.87 | 90.23 | +0.36 |
| Income | 83.92 | 84.26 | +0.34 |
| Letter | 86.85 | 87.40 | +0.55 |
Notably, in large datasets (e.g., Letter), BGR produced trees with approximately 20 levels compared to ≈100 levels for the gain ratio, with a proportional speed-up in induction time. A plausible implication is that BGR's balanced bias leads to shallower, more interpretable models while maintaining or improving predictive accuracy.
7. Guidelines for Adoption
- Integration: Deployment in C4.5-style systems requires only substituting $SI$ with $1 + SI$ in the denominator of the gain function.
- Applicability: BGR is especially advantageous in high-dimensional settings and with features possessing many categorical levels. In small or shallow-tree scenarios, performance closely matches that of the gain ratio (typically within 1%).
- Parametric Stability: No additional hyperparameters are introduced; the correction is robust across diverse datasets.
- Extendibility: The $1+SI$ denominator adjustment generalizes to impurity-based gain metrics that use split information penalization, including those applied to multi-way numeric splits (Leroux et al., 2018).