Quinlan's Gain Ratio (GR) in Decision Trees

Updated 20 March 2026
  • Quinlan's Gain Ratio is a normalized measure of information gain used for selecting attributes in decision trees, addressing biases toward high-arity branches.
  • It divides information gain by the split information to penalize splits that create too many branches, reducing the risk of overfitting and deep, skewed trees.
  • Modern adaptations, including the Balanced Gain Ratio and its use in graph-based feature selection, underscore its relevance in both traditional and high-dimensional machine learning.

Quinlan’s Gain Ratio (GR) is a normalization of information gain used primarily for attribute selection in decision tree learning algorithms such as C4.5, designed to address the bias of information gain toward high-arity (multi-valued) attributes. By dividing information gain by the split information—the entropy of the attribute split itself—Gain Ratio penalizes useless splits with many branches and thus ameliorates overfitting due to arbitrarily fine-grained partitioning. It plays a critical role both in classical decision tree methodologies and, in modern research, as a feature selection criterion for high-dimensional and structured data, including graph-based learning.

1. Mathematical Definition and Formal Properties

Let $S$ be a set of $n$ training samples arriving at a decision tree node, and let $A$ be an attribute with $J = |\mathrm{Values}(A)|$ possible outcomes. Denote by $S_v \subseteq S$ the subset of samples for which $A = v$.

  • Entropy (Shannon entropy of the class distribution in $S$):

$$H(S) = -\sum_{k=1}^{K} p_k \log_2 p_k,$$

where $p_k$ is the fraction of class-$k$ samples in $S$.

  • Information Gain (IG), quantifying the reduction in entropy after splitting $S$ by $A$:

$$IG(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v)$$

  • Split Information ($SI$), also called the intrinsic value, capturing the entropy of the partitioning induced by attribute $A$:

$$SI(S, A) = -\sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$

  • Gain Ratio (GR), the normalized criterion:

$$GR(S, A) = \frac{IG(S, A)}{SI(S, A)}$$

Dividing IG by SI ensures that attributes producing many small, homogeneous splits are not unjustifiably favored. An analogous definition is used for features that induce binary splits, as in graph feature selection based on adjacency matrices (Oishi et al., 2022).
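The four quantities above can be computed directly from counts. A minimal Python sketch (function names are illustrative, not from any cited implementation):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy H(S) of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Return (IG, SI, GR) for an attribute given parallel value/label lists."""
    n = len(labels)
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    # Information gain: H(S) minus the weighted entropy of each branch S_v.
    ig = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    # Split information: entropy of the partition itself.
    si = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    # A single-valued attribute has SI = 0; GR is undefined there.
    return ig, si, ig / si if si > 0 else float("inf")

# Toy example: a binary attribute that perfectly separates two classes.
ig, si, gr = gain_ratio(["a", "a", "b", "b"], [0, 0, 1, 1])
# IG = 1 bit, SI = 1 bit, GR = 1.0
```

The degenerate $SI = 0$ case (all samples take the same attribute value) must be handled explicitly in any implementation, as discussed in the practical considerations below.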

2. Motivations and Bias Correction

Pure information gain tends to be maximized by attributes that finely partition the data, notably those with many distinct values, often producing branchings that overfit: in the extreme of singleton splits, $|S_v| = 1$ and $H(S_v) = 0$ for every branch, yielding $IG = H(S)$ and overly deep trees. This occurs because IG does not penalize the dispersion of examples among many branches with negligible predictive value.

Split Information grows as the partition of $S$ becomes more uniform across many branches. The quotient $GR(S, A)$ penalizes high-arity splits and partially corrects the overfitting tendency of IG (Leroux et al., 2018). This reweights the preference among splits so that both the entropy reduction and the “cost” of finer partitioning are optimized jointly.
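The singleton-split pathology is easy to exhibit numerically: an identifier-like attribute ties a genuinely predictive binary attribute on raw IG, but Gain Ratio separates them. An illustrative sketch (helper names and data are invented for this example):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def ig_si(values, labels):
    """Information gain and split information of one attribute."""
    n = len(labels)
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    ig = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    si = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    return ig, si

labels = [0, 0, 0, 0, 1, 1, 1, 1]      # H(S) = 1 bit
ids    = list(range(8))                # unique per sample: singleton splits
good   = [0, 0, 0, 0, 1, 1, 1, 1]      # genuinely predictive binary attribute

ig_id, si_id = ig_si(ids, labels)      # IG = 1.0 (maximal), SI = log2(8) = 3.0
ig_g,  si_g  = ig_si(good, labels)     # IG = 1.0, SI = 1.0
# Raw IG ties the two attributes; GR prefers the binary one: 1/3 vs 1.0.
```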

3. Limitations and the Balanced Gain Ratio (BGR) Correction

While GR is effective in reducing high-arity bias, it can over-penalize splits with small intrinsic value. This is particularly problematic for unbalanced splits (e.g., isolating a small, pure subset of samples in one branch), leading to:

  • Deep, skewed trees that are less interpretable and computationally inefficient.
  • Splits that might focus on rare or “fluke” samples rather than representing the main class distributions.

To address this, Leroux et al. (Leroux et al., 2018) introduced the Balanced Gain Ratio (BGR):

$$\Gamma(S, A) = \frac{IG(S, A)}{1 + SI(S, A)}$$

The “$+1$” in the denominator moderates the SI penalty: for large $SI$, BGR approximates GR; for small $SI$, the correction is mild and BGR approaches IG. This method avoids both overfitting from high-arity splits (as in IG) and excessive bias toward unbalanced splits (as can occur in standard GR).
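The contrast between the two criteria can be made concrete with a contrived 100-sample example (the data here are invented for illustration): GR ranks a split isolating a single pure sample above a balanced, moderately informative split, while BGR reverses the ranking.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def ig_si(values, labels):
    n = len(labels)
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    ig = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    si = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    return ig, si

labels = [1] * 50 + [0] * 50                       # H(S) = 1 bit

# "Fluke" split: one branch isolates a single pure sample, so SI is tiny.
fluke = ["x"] + ["y"] * 99
# Balanced split whose two equal-size branches are each 70% pure.
balanced = ["A"] * 35 + ["B"] * 15 + ["A"] * 15 + ["B"] * 35

ig_f, si_f = ig_si(fluke, labels)
ig_b, si_b = ig_si(balanced, labels)

gr_f, gr_b = ig_f / si_f, ig_b / si_b              # GR prefers the fluke split
bgr_f, bgr_b = ig_f / (1 + si_f), ig_b / (1 + si_b)  # BGR prefers the balanced one
```

Here $SI$ of the fluke split is only about 0.08 bits, so dividing by it inflates GR despite a near-zero information gain; adding 1 to the denominator removes that inflation.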

Comparative Summary: GR vs. BGR

| Criterion | Correction Target | Empirical Impact |
|-----------|-------------------|------------------|
| GR | High-arity bias | May over-penalize small-$SI$ splits, yielding deep trees |
| BGR | Both high-arity and unbalanced splits | Shallower trees, improved generalization (Leroux et al., 2018) |

4. Empirical Evaluations and Performance

Leroux et al. (Leroux et al., 2018) assessed both GR and BGR on ten UCI benchmark datasets using C4.5 with pessimistic pruning, applying repeated cross-validation. The BGR yielded accuracy improvements across most datasets, particularly on larger or higher-dimensional problems. For example:

| Dataset | GR Acc (%) | BGR Acc (%) | Diff |
|---------|-----------:|------------:|-----:|
| Glass | 64.02 | 65.42 | +1.40 |
| Bupa (Liver) | 60.58 | 66.67 | +6.09 |
| BalanceScale | 73.92 | 77.60 | +3.68 |
| Letter (large) | 86.85 | 87.40 | +0.55 |
| Heart | 78.15 | 77.78 | −0.37 |

On the “Letter” dataset, trees grown using GR reached depths near 100, whereas BGR-grown trees had a typical depth of ≈20 with marginally better accuracy. This suggests that BGR produces shallower, more interpretable, and computationally efficient trees. On small datasets, differences between GR and BGR are minor.

5. Extensions to Graph-Based Learning

Gain Ratio has been adapted to feature selection in graph neural network (GNN) pipelines (Oishi et al., 2022), where each candidate feature corresponds to a column of an $i$-hop adjacency matrix (a binary feature indicating the presence of an $i$-hop neighbor). The adapted Information Gain Ratio (IGR) is computed for each such structural feature:

  • For each node $v$ among the labeled nodes $V_L$, and each candidate neighbor $u$, define
    • $V_{L,i}^{u}$ as the labeled nodes where $A_i[v, u] = 1$,
    • $V_{L,i}^{-u}$ as those with $A_i[v, u] = 0$.
  • Compute the entropy of $V_L$, the conditional entropies of the two subsets, and the split information of the binary feature.
  • The IGR is then the ratio of entropy reduction (from the split) to the binary split information.

This process allows selection, ranking, truncation, and filtering (e.g., discarding rare features, retaining the top-$t$ IGR columns) to construct high-utility structural representations for GNNs. MSI-GNN, which employs this procedure, outperforms standard GCN, H2GCN, and GCNII architectures in semi-supervised node classification, most notably when moderate numbers of high-IGR features ($t \approx 1000$) are used and rare features are filtered out (Oishi et al., 2022).
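The binary-split IGR described above reduces to a two-branch special case of the general formula. A minimal sketch, assuming the feature is given as a 0/1 column over the labeled nodes (function and variable names are illustrative, not from the paper):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits (0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def igr_binary(col, labels):
    """IGR of one binary structural feature against node labels."""
    n = len(labels)
    pos = [y for x, y in zip(col, labels) if x]      # nodes with the i-hop neighbor
    neg = [y for x, y in zip(col, labels) if not x]  # nodes without it
    ig = (entropy(labels)
          - (len(pos) / n) * entropy(pos)
          - (len(neg) / n) * entropy(neg))
    p = len(pos) / n
    si = -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)
    # A constant column has SI = 0 and carries no split information.
    return ig / si if si > 0 else 0.0
```

For example, a column that exactly tracks the labels, such as `igr_binary([1, 1, 0, 0], [1, 1, 0, 0])`, attains the maximum IGR of 1.0.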

6. Practical Considerations and Implementation

Key aspects in employing Gain Ratio for split or feature selection include:

  • Entropy computation: Accurate estimation of class or label entropy is central; for high-arity or large feature sets, numerical instability may arise when the split information approaches zero.
  • Feature filtering: Removal of features with minimal representation among labeled samples is essential for noise reduction and algorithmic stability.
  • Trade-offs in truncation: Retaining only the top-IGR features can omit useful information; empirical results indicate that conservative truncation (moderate $t$ values) maintains performance, while aggressive pruning can be detrimental.
  • Duplication and weighting: In architectures such as MSI-GNN, duplication and discounting of structural features (by hop distance) further influence learning dynamics.

Pseudocode for IGR-based selection is straightforward, involving calculation of split probabilities, relevant entropies, and sorting/filtering steps, as detailed in (Oishi et al., 2022).
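A selection pipeline along these lines can be sketched as follows. This is a simplified reconstruction under stated assumptions, not the paper's implementation: the rare-feature threshold `min_support` and the representation of columns as 0/1 lists are choices made here for illustration (only the default $t = 1000$ echoes the paper's reported setting).

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def igr_binary(col, labels):
    """IGR of one binary column against node labels (0.0 if constant)."""
    n = len(labels)
    pos = [y for x, y in zip(col, labels) if x]
    neg = [y for x, y in zip(col, labels) if not x]
    ig = (entropy(labels)
          - (len(pos) / n) * entropy(pos)
          - (len(neg) / n) * entropy(neg))
    p = len(pos) / n
    si = -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)
    return ig / si if si > 0 else 0.0

def select_columns(columns, labels, t=1000, min_support=2):
    """Drop rare columns (fewer than min_support active labeled nodes),
    score the rest by IGR, and return the indices of the top-t columns."""
    scored = [(igr_binary(col, labels), j)
              for j, col in enumerate(columns) if sum(col) >= min_support]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [j for _, j in scored[:t]]
```

Applied to three toy columns against labels `[1, 1, 0, 0]`, a perfectly predictive column ranks first, a column active on only one node is filtered out, and an uninformative column scores 0.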

7. Theoretical and Applied Significance

Quinlan’s Gain Ratio remains foundational in decision tree learning, providing a principled correction for information gain’s high-arity bias. The BGR further refines this correction, balancing tree depth and generalization. Recent adaptations, including their application as feature selectors in GNN pipelines, indicate the continued relevance of the principle in modern architectures for structured data. The use of Gain Ratio in constructing compact, informative representations exemplifies a broader methodological trend: leveraging classical statistical criteria for effective learning in complex, high-dimensional, and structured domains (Leroux et al., 2018, Oishi et al., 2022).
