Quinlan's Gain Ratio (GR) in Decision Trees
- Quinlan's Gain Ratio is a normalized measure of information gain used for selecting attributes in decision trees, addressing information gain's bias toward high-arity (many-valued) attributes.
- It divides information gain by the split information to penalize splits that create too many branches, reducing the risk of overfitting and deep, skewed trees.
- Modern adaptations, including the Balanced Gain Ratio and its use in graph-based feature selection, underscore its relevance in both traditional and high-dimensional machine learning.
Quinlan’s Gain Ratio (GR) is a normalization of information gain used primarily for attribute selection in decision tree learning algorithms such as C4.5, designed to address the bias of information gain toward high-arity (multi-valued) attributes. By dividing information gain by the split information—the entropy of the attribute split itself—Gain Ratio penalizes useless splits with many branches and thus ameliorates overfitting due to arbitrarily fine-grained partitioning. It plays a critical role both in classical decision tree methodologies and, in modern research, as a feature selection criterion for high-dimensional and structured data, including graph-based learning.
1. Mathematical Definition and Formal Properties
Let $S$ be a set of training samples arriving at a decision tree node, and $A$ an attribute with $V$ possible outcomes $\{a_1, \dots, a_V\}$. Denote by $S_v$ those samples for which $A = a_v$.
- Entropy (Shannon entropy of the class distribution in $S$):
$$H(S) = -\sum_{c} p_c \log_2 p_c,$$
where $p_c$ is the fraction of class-$c$ samples in $S$.
- Information Gain (IG), quantifying the reduction in entropy after splitting by $A$:
$$\mathrm{IG}(S, A) = H(S) - \sum_{v=1}^{V} \frac{|S_v|}{|S|}\, H(S_v)$$
- Split Information ($\mathrm{SI}$), also called the intrinsic value, capturing the entropy of the partitioning induced by attribute $A$:
$$\mathrm{SI}(S, A) = -\sum_{v=1}^{V} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$
- Gain Ratio (GR), the normalized criterion:
$$\mathrm{GR}(S, A) = \frac{\mathrm{IG}(S, A)}{\mathrm{SI}(S, A)}$$
Dividing IG by SI ensures that attributes producing many small, homogeneous splits are not unjustifiably favored. An analogous definition is used for features that induce binary splits, as in graph feature selection based on adjacency matrices (Oishi et al., 2022).
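The definitions above translate directly into code. The following is a minimal sketch (function names and the guard for a zero split information are illustrative choices, not from any cited implementation) computing IG, SI, and GR for a categorical attribute:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Return (IG, SI, GR) for one attribute.

    values[i] is the attribute outcome and labels[i] the class of sample i.
    """
    n = len(labels)
    # Partition the labels by attribute value (the subsets S_v).
    parts = {}
    for v, y in zip(values, labels):
        parts.setdefault(v, []).append(y)
    weights = [len(p) / n for p in parts.values()]
    ig = entropy(labels) - sum(w * entropy(p) for w, p in zip(weights, parts.values()))
    si = -sum(w * math.log2(w) for w in weights)
    # Guard: SI = 0 for a trivial single-branch split, so GR is undefined there.
    return ig, si, (ig / si if si > 0 else 0.0)
```

On a balanced two-class sample, a perfectly informative binary attribute yields IG = SI = GR = 1 bit.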
2. Motivations and Bias Correction
Pure information gain tends to be maximized by attributes that finely partition the data, notably those with many distinct values, often resulting in branchings that overfit (e.g., singleton splits where $|S_v| = 1$ and $H(S_v) = 0$ for every $v$, yielding $\mathrm{IG}(S, A) = H(S)$ and overly deep trees). This occurs because IG does not penalize the dispersion of examples among many branches with negligible predictive value.
Split Information grows as the partition of $S$ becomes more uniform across many branches. The quotient $\mathrm{GR} = \mathrm{IG}/\mathrm{SI}$ penalizes high-arity splits and partially corrects the overfitting tendency of IG (Leroux et al., 2018). This reweights the preference for splits so that both the entropy reduction and the “cost” of finer splits are jointly optimized.
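A small worked example makes the bias concrete. On eight samples with a 50/50 class split, an ID-like attribute taking a unique value per sample achieves the maximal IG of 1 bit, but its SI of $\log_2 8 = 3$ bits drives GR down to $1/3$, below the GR of 1.0 earned by a genuinely informative binary attribute (a sketch with illustrative data and names):

```python
import math
from collections import Counter

def H(labels):
    """Shannon entropy (base 2) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def ig_si(values, labels):
    """Information gain and split information for one attribute."""
    n = len(labels)
    parts = {}
    for v, y in zip(values, labels):
        parts.setdefault(v, []).append(y)
    ig = H(labels) - sum(len(p) / n * H(p) for p in parts.values())
    si = -sum(len(p) / n * math.log2(len(p) / n) for p in parts.values())
    return ig, si

labels = ['+'] * 4 + ['-'] * 4        # H(S) = 1 bit
ids    = list(range(8))               # ID-like: a unique value per sample
useful = [0, 0, 0, 0, 1, 1, 1, 1]     # informative binary attribute

# ID attribute: IG = 1.0 (maximal) but SI = 3.0, so GR = 1/3.
# Binary attribute: IG = SI = 1.0, so GR = 1.0 and it is rightly preferred.
```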
3. Limitations and the Balanced Gain Ratio (BGR) Correction
While GR is effective in reducing high-arity bias, its denominator can excessively reward splits with small intrinsic value, since dividing by a near-zero SI inflates the ratio. This is particularly problematic for unbalanced splits (e.g., isolating a small, pure subset of samples in one branch), leading to:
- Deep, skewed trees that are less interpretable and computationally inefficient.
- Splits that might focus on rare or “fluke” samples rather than representing the main class distributions.
To address this, Leroux et al. (Leroux et al., 2018) introduced the Balanced Gain Ratio (BGR):
$$\mathrm{BGR}(S, A) = \frac{\mathrm{IG}(S, A)}{1 + \mathrm{SI}(S, A)}$$
The “$+1$” in the denominator moderates the SI penalty: for large $\mathrm{SI}$, BGR approximates GR; for small $\mathrm{SI}$, the correction is mild, and BGR approaches IG. This method avoids both overfitting from high-arity splits (as in IG) and excessive bias toward unbalanced splits (as can occur in standard GR).
Comparative Summary: GR vs. BGR
| Criterion | Correction Target | Empirical Impact |
|---|---|---|
| GR | High-arity bias | Can over-reward small-SI (unbalanced) splits, yielding deep trees |
| BGR | Both high-arity and unbalanced splits | Shallower trees, improved generalization (Leroux et al., 2018) |
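The flip in preference can be illustrated numerically. The sketch below assumes the BGR form $\mathrm{IG}/(1+\mathrm{SI})$ implied by the limiting behavior described above, and compares a skewed split that isolates one pure sample against a mildly informative balanced split; GR ranks the skewed split higher, while BGR ranks the balanced one higher:

```python
import math

def H(counts):
    """Entropy (base 2) of a class-count vector."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def ig_si(parent, branches):
    """IG and SI from parent class counts and per-branch class counts."""
    n = sum(parent)
    ig = H(parent) - sum(sum(b) / n * H(b) for b in branches)
    si = -sum(sum(b) / n * math.log2(sum(b) / n) for b in branches)
    return ig, si

parent   = [50, 50]                # 50 '+', 50 '-', so H(S) = 1 bit
skewed   = [[1, 0], [49, 50]]      # isolates a single pure sample
balanced = [[30, 20], [20, 30]]    # mildly informative balanced split

for name, branches in [("skewed", skewed), ("balanced", balanced)]:
    ig, si = ig_si(parent, branches)
    print(f"{name:8s} GR={ig / si:.3f} BGR={ig / (1 + si):.3f}")
# GR ranks the skewed split higher; BGR ranks the balanced split higher.
```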
4. Empirical Evaluations and Performance
Leroux et al. (Leroux et al., 2018) assessed both GR and BGR on ten UCI benchmark datasets using C4.5 with pessimistic pruning, applying repeated cross-validation. The BGR yielded accuracy improvements across most datasets, particularly on larger or higher-dimensional problems. For example:
| Dataset | GR Acc (%) | BGR Acc (%) | Diff |
|---|---|---|---|
| Glass | 64.02 | 65.42 | +1.40 |
| Bupa (Liver) | 60.58 | 66.67 | +6.09 |
| BalanceScale | 73.92 | 77.60 | +3.68 |
| Letter (large) | 86.85 | 87.40 | +0.55 |
| Heart | 78.15 | 77.78 | −0.37 |
On the “Letter” dataset, trees grown using GR reached depths near 100, whereas BGR-grown trees had a typical depth of ≈20 with marginally better accuracy. This suggests that BGR produces shallower, more interpretable, and computationally efficient trees. On small datasets, differences between GR and BGR are minor.
5. Extensions to Graph-Based Learning
Gain Ratio has been adapted to feature selection in graph neural network (GNN) pipelines (Oishi et al., 2022), where each candidate feature corresponds to a column of an $n$-hop adjacency matrix (a binary feature indicating the presence of an $n$-hop neighbor). The adapted Information Gain Ratio (IGR) is computed for each such structural feature:
- For each labeled node $v \in V_L$ and each candidate neighbor $u$, define
  - $S_1$ as the labeled nodes where $A^{(n)}_{vu} = 1$,
  - $S_0$ as those with $A^{(n)}_{vu} = 0$.
- Compute the entropy of the labels over $V_L$, the conditional entropies of the two splits, and the split information of the binary feature.
- The IGR is then the ratio of entropy reduction (from the split) to the binary split information.
This process allows selection, ranking, truncation, and filtering (e.g., discarding rare features, retaining the top-$k$ IGR columns) to construct high-utility structural representations for GNNs. MSI-GNN, which employs this procedure, outperforms standard GCN, H2GCN, and GCNII architectures in semi-supervised node classification, most notably when a moderate number of high-IGR features is used and rare features are filtered out (Oishi et al., 2022).
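The per-column computation can be sketched as follows (the function name, the list-of-rows matrix representation, and the dict-of-labels encoding are assumptions for illustration, not the MSI-GNN implementation):

```python
import math
from collections import Counter

def H(labels):
    """Shannon entropy (base 2) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def igr_per_column(A, labels, labeled):
    """IGR of each binary column of A, restricted to labeled nodes.

    A: list of binary rows (e.g., an n-hop adjacency matrix),
    labels: dict node -> class, labeled: list of labeled node indices.
    """
    base = H([labels[v] for v in labeled])
    n = len(labeled)
    scores = []
    for u in range(len(A[0])):
        s1 = [labels[v] for v in labeled if A[v][u] == 1]   # S_1
        s0 = [labels[v] for v in labeled if A[v][u] == 0]   # S_0
        w1, w0 = len(s1) / n, len(s0) / n
        ig = base - (w1 * H(s1) if s1 else 0) - (w0 * H(s0) if s0 else 0)
        si = -sum(w * math.log2(w) for w in (w1, w0) if w)
        scores.append(ig / si if si > 0 else 0.0)
    return scores
```

A column that perfectly separates the labeled classes scores 1.0; a column uncorrelated with the labels scores 0.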
6. Practical Considerations and Implementation
Key aspects in employing Gain Ratio for split or feature selection include:
- Entropy computation: Accurate estimation of class or label entropy is central; for high-arity or large feature sets, numerical instability may arise when the split information approaches zero.
- Feature filtering: Removal of features with minimal representation among labeled samples is essential for noise reduction and algorithmic stability.
- Trade-offs in truncation: Retaining only top-IGR features can omit useful information; empirical results indicate that conservative truncation (a moderate number of retained features) maintains performance, while aggressive pruning can be detrimental.
- Duplication and weighting: In architectures such as MSI-GNN, duplication and discounting of structural features (by hop distance) further influence learning dynamics.
Pseudocode for IGR-based selection is straightforward, involving calculation of split probabilities, relevant entropies, and sorting/filtering steps, as detailed in (Oishi et al., 2022).
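The ranking and filtering steps can be sketched as below; the thresholds `min_count` and `k` are hypothetical tuning parameters for illustration, not values from the paper:

```python
def select_features(scores, counts, min_count=2, k=16):
    """Drop rare columns, rank the rest by IGR, keep the top-k indices.

    scores[u] is the IGR of column u; counts[u] is how many labeled nodes
    have a 1 in column u. min_count and k are hypothetical parameters.
    """
    # Filter out rarely occurring features for noise reduction.
    kept = [u for u, c in enumerate(counts) if c >= min_count]
    # Rank the survivors by IGR, highest first, and truncate to k columns.
    kept.sort(key=lambda u: scores[u], reverse=True)
    return kept[:k]
```

For example, with scores `[0.9, 0.1, 0.5]` and counts `[5, 1, 3]`, `min_count=2` drops the rare column 1 and `k=2` keeps columns 0 and 2 in score order.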
7. Theoretical and Applied Significance
Quinlan’s Gain Ratio remains foundational in decision tree learning, providing a principled correction for information gain’s high-arity bias. The BGR further refines this correction, balancing tree depth and generalization. Recent adaptations, including their application as feature selectors in GNN pipelines, indicate the continued relevance of the principle in modern architectures for structured data. The use of Gain Ratio in constructing compact, informative representations exemplifies a broader methodological trend: leveraging classical statistical criteria for effective learning in complex, high-dimensional, and structured domains (Leroux et al., 2018, Oishi et al., 2022).