Balanced Gain Ratio (BGR) in Decision Trees
- Balanced Gain Ratio (BGR) is an impurity-based gain metric that adjusts the traditional gain function to correct bias in decision tree splits.
- The metric is computed as the information gain divided by (1 + split information), preserving the gain ratio's correction for high-arity splits while avoiding bias toward unbalanced splits of low predictive value.
- Empirical evaluations demonstrate that BGR produces shallower, more interpretable trees and improves model induction speed and accuracy across various benchmark datasets.
The Balanced Gain Ratio (BGR) is an alternative impurity-based gain function for decision tree induction that addresses split selection biases observed in earlier metrics such as information gain and Quinlan's information gain ratio in C4.5. BGR is specifically designed to preserve the corrective properties of the gain ratio regarding high-cardinality categorical splits, while simultaneously mitigating the tendency of the gain ratio to favor extremely unbalanced, low-predictive-value splits. Its implementation yields more balanced decision trees—both in terms of depth and branch distribution—without reintroducing bias towards features with numerous distinct values (Leroux et al., 2018).
1. Formal Definition and Motivation
Let $C$ denote a training sample set of size $|C|$ with $K$ classes. A candidate split $S$ divides $C$ into child nodes $C^1, \dots, C^J$. For each candidate split, the impurity reduction gain $G(S)$ (typically based on entropy) and the split information

$$SI(S) = -\sum_{j=1}^{J} p^j \log_2 p^j,$$

with $p^j = |C^j| / |C|$, are computed. Quinlan's traditional gain ratio is given by

$$GR(S) = \frac{G(S)}{SI(S)}.$$

The Balanced Gain Ratio is formulated as

$$BGR(S) = \frac{G(S)}{1 + SI(S)}.$$
The addition of $1$ to the denominator is designed to eliminate the artificial inflation of the score for splits where split information is near zero (characteristic of highly unbalanced splits). This adjustment maintains the bias correction for high-arity (many-way) splits but prevents favoring splits that isolate tiny, pure subsets—a known issue with Quinlan's original ratio.
2. Mathematical Derivation
For a node $C$, the impurity using entropy is

$$I(C) = -\sum_{k=1}^{K} p_k \log_2 p_k,$$

where $p_k$ is the empirical frequency of class $k$ in $C$. The standard information gain is

$$G(S) = I(C) - \sum_{j=1}^{J} p^j \, I(C^j).$$

The split information is

$$SI(S) = -\sum_{j=1}^{J} p^j \log_2 p^j.$$

Thus, the Balanced Gain Ratio is

$$BGR(S) = \frac{I(C) - \sum_{j=1}^{J} p^j \, I(C^j)}{1 - \sum_{j=1}^{J} p^j \log_2 p^j}.$$

If $SI(S)$ is small (highly uneven splits), the denominator approaches $1$, so the ratio closely reflects the raw information gain. For large $SI(S)$ (many-way splits), the behavior is similar to the original gain ratio, strongly penalizing high-cardinality splits.
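As a quick numerical check of these quantities, the following sketch (toy class counts, chosen purely for illustration) computes $G$, $SI$, and $BGR$ for a perfectly pure, perfectly balanced two-way split:

```python
import math

def entropy(counts):
    """Entropy I(C) over class counts, in bits (base-2)."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def split_scores(parent_counts, child_counts):
    """Return (gain G, split information SI, BGR) for one candidate split."""
    n = sum(parent_counts)
    props = [sum(c) / n for c in child_counts]
    gain = entropy(parent_counts) - sum(
        p * entropy(c) for p, c in zip(props, child_counts))
    si = -sum(p * math.log2(p) for p in props if p > 0)
    return gain, si, gain / (1 + si)

# Toy node: 8 positives, 8 negatives, split into two balanced pure children.
g, si, bgr = split_scores([8, 8], [[8, 0], [0, 8]])
# G = 1.0 (parent entropy 1, children pure), SI = 1.0, so BGR = 1 / (1 + 1) = 0.5
```

Note that Quinlan's gain ratio would score this same split $G / SI = 1.0$; BGR halves it, but applies the same halving to every balanced two-way split, so relative rankings among balanced splits are unchanged.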
3. Bias Corrections in Gain Functions
Standard Information Gain
The standard impurity reduction has a known bias toward splits on high-cardinality features because maximizing purity is trivial when each instance can be isolated. In the limit, features with one unique value per instance yield maximal gain, even absent predictive value.
Quinlan’s Gain Ratio
Quinlan’s ratio penalizes high-cardinality splits:

$$GR(S) = \frac{G(S)}{SI(S)}.$$

Here, $SI(S)$ grows with the number of partitions, discouraging many-way splits. However, when $SI(S)$ is very small (characteristic of splits producing one large and several tiny partitions), the denominator deflates, causing the gain ratio to be artificially large and biasing tree growth toward these unbalanced splits.
Balanced Gain Ratio
By redefining the denominator as $1 + SI$, BGR suppresses this bias. For large $SI$, the additive $1$ is negligible and BGR approximates the original gain ratio. For $SI = 0$, BGR equals $G$, precluding artifactual boosting of splits that simply isolate noise points.
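This correction can be illustrated numerically. The sketch below (toy class counts, not taken from the paper) scores two candidate splits of a 50/50 node: one that isolates a single pure point, and a weakly informative balanced split:

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gr_and_bgr(parent, children):
    """Quinlan gain ratio and BGR for one candidate split (class-count lists)."""
    n = sum(parent)
    props = [sum(c) / n for c in children]
    gain = entropy(parent) - sum(p * entropy(c) for p, c in zip(props, children))
    si = -sum(p * math.log2(p) for p in props if p > 0)
    return gain / si, gain / (1 + si)

parent = [50, 50]                    # 100 samples, two balanced classes
isolate = [[1, 0], [49, 50]]         # split off one pure point (tiny partition)
balanced = [[30, 20], [20, 30]]      # weakly informative balanced split

gr_iso, bgr_iso = gr_and_bgr(parent, isolate)
gr_bal, bgr_bal = gr_and_bgr(parent, balanced)
# The gain ratio prefers the near-useless isolating split (gr_iso > gr_bal),
# while BGR prefers the balanced one (bgr_bal > bgr_iso).
```

Here the isolating split has almost no gain, but its tiny $SI$ ($\approx 0.08$) inflates the gain ratio past the balanced split's score; adding $1$ to the denominator removes that inflation.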
4. Algorithmic Implementation
BGR integrates into top-down, greedy induction schemes such as C4.5 by substituting the gain function for split selection. The following pseudocode outlines the integration:
```
function BuildTree(nodeSamples C):
    if stoppingCriterion(C) then
        return Leaf(label = majorityClass(C))
    bestScore ← −∞
    bestSplit ← null
    for each feature f:
        for each candidate split S on f:
            Partition C into {C^1, …, C^J} according to S
            Compute p^j = |C^j| / |C|, j = 1…J
            Compute impurity I(C) and I(C^j) for all j
            G   ← I(C) − Σ_{j=1}^J p^j I(C^j)     # standard gain
            SI  ← −Σ_{j=1}^J p^j log(p^j)         # split information
            BGR ← G / (1 + SI)                    # balanced gain ratio
            if BGR > bestScore:
                bestScore ← BGR
                bestSplit ← S
    if bestSplit is null then
        return Leaf(label = majorityClass(C))
    create internal node with test bestSplit
    for each child partition C^j of bestSplit:
        attach BuildTree(C^j) as j-th branch
    return internal node
```
All other induction, threshold discovery, and pruning strategies in conventional C4.5 are preserved (Leroux et al., 2018).
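The scoring step of the pseudocode can be sketched in Python for categorical features. The `bgr_score` and `best_feature` helpers and the feature-dictionary layout below are illustrative assumptions, not part of the original algorithm:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def bgr_score(labels, feature_values):
    """BGR of the multiway split induced by a categorical feature."""
    groups = defaultdict(list)
    for v, y in zip(feature_values, labels):
        groups[v].append(y)
    n = len(labels)
    props = [len(g) / n for g in groups.values()]
    gain = entropy(labels) - sum(
        p * entropy(g) for p, g in zip(props, groups.values()))
    si = -sum(p * math.log2(p) for p in props)
    return gain / (1 + si)

def best_feature(X, y):
    """X: dict mapping feature name -> list of categorical values; y: labels."""
    return max(X, key=lambda f: bgr_score(y, X[f]))
```

On an ID-like feature with one value per instance, `bgr_score` still pays the full $\log_2 J$ split-information penalty, so an informative balanced feature wins the `max` even though the ID feature's raw gain is maximal.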
5. Theoretical Properties
- Monotonicity in Split Information: The denominator $1 + SI$ is strictly increasing in $SI$, ensuring that for constant gain, higher split information always reduces or preserves the BGR score.
- Limiting Behavior:
  - As $SI \to \infty$, $BGR \to 0$, strongly discouraging high-arity splits.
  - As $SI \to 0$, $BGR \to G$, removing any split information penalty and thus not artificially inflating scores for extremely unbalanced splits.
- Effect of Partition Arity: For $J$ branches, the maximal $SI$ is $\log_2 J$, attained by a uniform split. Large $J$ implies strong denominator growth and penalization.
- Dependence on Class Distribution: BGR, through $G$, remains sensitive to class mixtures in the partitions, but applies a tempered correction related to the balance of the split.
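These properties are straightforward to verify numerically; a minimal sketch (the fixed gain value is an arbitrary illustration):

```python
import math

def bgr(gain, si):
    """Balanced gain ratio from a precomputed gain and split information."""
    return gain / (1 + si)

# For a uniform J-way split the split information attains its maximum,
# SI = log2(J); holding the gain fixed, BGR shrinks as arity grows.
gain = 0.5  # arbitrary fixed gain, for illustration only
scores = []
for J in (2, 4, 16, 256):
    si = -sum((1 / J) * math.log2(1 / J) for _ in range(J))  # equals log2(J)
    scores.append(bgr(gain, si))

# scores is strictly decreasing: higher arity, stronger penalty.
# With SI = 0 the penalty vanishes entirely: bgr(gain, 0) == gain.
```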
6. Empirical Evaluation
BGR and Quinlan’s gain ratio were evaluated on ten benchmark UCI datasets using C4.5-style induction, pessimistic error pruning, and stratified cross-validation (10 iterations of 5-folds). The key results on average accuracy are summarized below:
| Dataset | GainRatio (%) | BGR (%) | Δ = BGR–GainRatio (%) |
|---|---|---|---|
| Glass | 64.02 | 65.42 | +1.40 |
| BUPA Liver | 60.58 | 66.67 | +6.09 |
| Heart | 78.15 | 77.78 | −0.37 |
| Balance Scale | 73.92 | 77.60 | +3.68 |
| Survival | 73.86 | 72.87 | −0.99 |
| PIMA | 72.79 | 75.26 | +2.47 |
| Wine | 94.38 | 94.38 | +0.00 |
| Bank Marketing | 89.87 | 90.23 | +0.36 |
| Income | 83.92 | 84.26 | +0.34 |
| Letter | 86.85 | 87.40 | +0.55 |
Notably, in large datasets (e.g., Letter), BGR produced trees with approximately 20 levels compared to ≈100 levels for the gain ratio, with a proportional speed-up in induction time. A plausible implication is that BGR's balanced bias leads to shallower, more interpretable models while maintaining or improving predictive accuracy.
7. Guidelines for Adoption
- Integration: Deployment in C4.5-style systems requires only substituting $SI$ with $1 + SI$ in the denominator of the gain function.
- Applicability: BGR is especially advantageous in high-dimensional settings and with features possessing many categorical levels. In small or shallow-tree scenarios, performance closely matches that of the gain ratio (typically within 1%).
- Parametric Stability: No additional hyperparameters are introduced; the correction is robust across diverse datasets.
- Extendibility: The $1+SI$ denominator adjustment generalizes to impurity-based gain metrics that use split information penalization, including those applied to multi-way numeric splits (Leroux et al., 2018).