Quinlan's Gain Ratio (GR) in Decision Trees
- Quinlan's Gain Ratio is a normalized measure of information gain used for selecting attributes in decision trees, addressing information gain's bias toward high-arity (many-valued) attributes.
- It divides information gain by the split information to penalize splits that create too many branches, reducing the risk of overfitting and deep, skewed trees.
- Modern adaptations, including the Balanced Gain Ratio and its use in graph-based feature selection, underscore its relevance in both traditional and high-dimensional machine learning.
Quinlan’s Gain Ratio (GR) is a normalization of information gain used primarily for attribute selection in decision tree learning algorithms such as C4.5, designed to address the bias of information gain toward high-arity (multi-valued) attributes. By dividing information gain by the split information—the entropy of the attribute split itself—Gain Ratio penalizes useless splits with many branches and thus ameliorates overfitting due to arbitrarily fine-grained partitioning. It plays a critical role both in classical decision tree methodologies and, in modern research, as a feature selection criterion for high-dimensional and structured data, including graph-based learning.
1. Mathematical Definition and Formal Properties
Let $S$ be a set of training samples arriving at a decision tree node, and $A$ an attribute with $V$ possible outcomes $\{a_1, \dots, a_V\}$. Denote by $S_v$ those samples for which $A = a_v$.
- Entropy (Shannon entropy of the class distribution in $S$):
$$H(S) = -\sum_{c} p_c \log_2 p_c,$$
where $p_c$ is the fraction of class-$c$ samples in $S$.
- Information Gain (IG), quantifying the reduction in entropy after splitting by $A$:
$$\mathrm{IG}(S, A) = H(S) - \sum_{v=1}^{V} \frac{|S_v|}{|S|}\, H(S_v)$$
- Split Information ($\mathrm{SI}$), also called the intrinsic value, capturing the entropy of the partitioning induced by attribute $A$:
$$\mathrm{SI}(S, A) = -\sum_{v=1}^{V} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$
- Gain Ratio (GR), the normalized criterion:
$$\mathrm{GR}(S, A) = \frac{\mathrm{IG}(S, A)}{\mathrm{SI}(S, A)}$$
Dividing IG by SI ensures that attributes producing many small, homogeneous splits are not unjustifiably favored. An analogous definition is used for features that induce binary splits, as in graph feature selection based on adjacency matrices (Oishi et al., 2022).
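The definitions above translate directly into code. The following is a minimal sketch (function names and the guard for a zero split information are illustrative choices, not from any cited implementation) computing IG, SI, and GR for a categorical attribute:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Return (IG, SI, GR) for one attribute.

    values[i] is the attribute outcome and labels[i] the class of sample i.
    """
    n = len(labels)
    # Partition the labels by attribute value (the subsets S_v).
    parts = {}
    for v, y in zip(values, labels):
        parts.setdefault(v, []).append(y)
    weights = [len(p) / n for p in parts.values()]
    ig = entropy(labels) - sum(w * entropy(p) for w, p in zip(weights, parts.values()))
    si = -sum(w * math.log2(w) for w in weights)
    # Guard: SI = 0 for a trivial single-branch split, so GR is undefined there.
    return ig, si, (ig / si if si > 0 else 0.0)
```

On a balanced two-class sample, a perfectly informative binary attribute yields IG = SI = GR = 1 bit.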
2. Motivations and Bias Correction
Pure information gain tends to be maximized by attributes that finely partition the data, notably those with many distinct values, often resulting in branchings that overfit (e.g., singleton splits where $|S_v| = 1$ and $H(S_v) = 0$ for every $v$, yielding $\mathrm{IG}(S, A) = H(S)$ and overly deep trees). This occurs because IG does not penalize the dispersion of examples among many branches with negligible predictive value.
Split Information grows as the partition of $S$ becomes more uniform across many branches. The quotient $\mathrm{GR} = \mathrm{IG}/\mathrm{SI}$ penalizes high-arity splits and partially corrects the overfitting tendency of IG (Leroux et al., 2018). This reweights the preference for splits so that both the entropy reduction and the “cost” of finer splits are jointly optimized.
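A small worked example makes the bias concrete. On eight samples with a 50/50 class split, an ID-like attribute taking a unique value per sample achieves the maximal IG of 1 bit, but its SI of $\log_2 8 = 3$ bits drives GR down to $1/3$, below the GR of 1.0 earned by a genuinely informative binary attribute (a sketch with illustrative data and names):

```python
import math
from collections import Counter

def H(labels):
    """Shannon entropy (base 2) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def ig_si(values, labels):
    """Information gain and split information for one attribute."""
    n = len(labels)
    parts = {}
    for v, y in zip(values, labels):
        parts.setdefault(v, []).append(y)
    ig = H(labels) - sum(len(p) / n * H(p) for p in parts.values())
    si = -sum(len(p) / n * math.log2(len(p) / n) for p in parts.values())
    return ig, si

labels = ['+'] * 4 + ['-'] * 4        # H(S) = 1 bit
ids    = list(range(8))               # ID-like: a unique value per sample
useful = [0, 0, 0, 0, 1, 1, 1, 1]     # informative binary attribute

# ID attribute: IG = 1.0 (maximal) but SI = 3.0, so GR = 1/3.
# Binary attribute: IG = SI = 1.0, so GR = 1.0 and it is rightly preferred.
```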
3. Limitations and the Balanced Gain Ratio (BGR) Correction
While GR is effective in reducing high-arity bias, its denominator can excessively reward splits with small intrinsic value, since dividing by a near-zero SI inflates the ratio. This is particularly problematic for unbalanced splits (e.g., isolating a small, pure subset of samples in one branch), leading to:
- Deep, skewed trees that are less interpretable and computationally inefficient.
- Splits that might focus on rare or “fluke” samples rather than representing the main class distributions.
To address this, Leroux et al. (Leroux et al., 2018) introduced the Balanced Gain Ratio (BGR):
$$\mathrm{BGR}(S, A) = \frac{\mathrm{IG}(S, A)}{1 + \mathrm{SI}(S, A)}$$
The “$+1$” in the denominator moderates the SI penalty: for large $\mathrm{SI}$, BGR approximates GR; for small $\mathrm{SI}$, the correction is mild, and BGR approaches IG. This method avoids both overfitting from high-arity splits (as in IG) and excessive bias toward unbalanced splits (as can occur in standard GR).
Comparative Summary: GR vs. BGR
| Criterion | Correction Target | Empirical Impact |
|---|---|---|
| GR | High-arity bias | Can over-reward small-SI (unbalanced) splits, yielding deep trees |
| BGR | Both high-arity and unbalanced splits | Shallower trees, improved generalization (Leroux et al., 2018) |
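The flip in preference can be illustrated numerically. The sketch below assumes the BGR form $\mathrm{IG}/(1+\mathrm{SI})$ implied by the limiting behavior described above, and compares a skewed split that isolates one pure sample against a mildly informative balanced split; GR ranks the skewed split higher, while BGR ranks the balanced one higher:

```python
import math

def H(counts):
    """Entropy (base 2) of a class-count vector."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def ig_si(parent, branches):
    """IG and SI from parent class counts and per-branch class counts."""
    n = sum(parent)
    ig = H(parent) - sum(sum(b) / n * H(b) for b in branches)
    si = -sum(sum(b) / n * math.log2(sum(b) / n) for b in branches)
    return ig, si

parent   = [50, 50]                # 50 '+', 50 '-', so H(S) = 1 bit
skewed   = [[1, 0], [49, 50]]      # isolates a single pure sample
balanced = [[30, 20], [20, 30]]    # mildly informative balanced split

for name, branches in [("skewed", skewed), ("balanced", balanced)]:
    ig, si = ig_si(parent, branches)
    print(f"{name:8s} GR={ig / si:.3f} BGR={ig / (1 + si):.3f}")
# GR ranks the skewed split higher; BGR ranks the balanced split higher.
```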
4. Empirical Evaluations and Performance
Leroux et al. (Leroux et al., 2018) assessed both GR and BGR on ten UCI benchmark datasets using C4.5 with pessimistic pruning, applying repeated cross-validation. The BGR yielded accuracy improvements across most datasets, particularly on larger or higher-dimensional problems. For example:
| Dataset | GR Acc (%) | BGR Acc (%) | Diff |
|---|---|---|---|
| Glass | 64.02 | 65.42 | +1.40 |
| Bupa (Liver) | 60.58 | 66.67 | +6.09 |
| BalanceScale | 73.92 | 77.60 | +3.68 |
| Letter (large) | 86.85 | 87.40 | +0.55 |
| Heart | 78.15 | 77.78 | −0.37 |
On the “Letter” dataset, trees grown using GR reached depths near 100, whereas BGR-grown trees had a typical depth of ≈20 with marginally better accuracy. This suggests that BGR produces shallower, more interpretable, and computationally efficient trees. On small datasets, differences between GR and BGR are minor.
5. Extensions to Graph-Based Learning
Gain Ratio has been adapted to feature selection in graph neural network (GNN) pipelines (Oishi et al., 2022), where each candidate feature corresponds to a column of an $n$-hop adjacency matrix (a binary feature indicating the presence of an $n$-hop neighbor). The adapted Information Gain Ratio (IGR) is computed for each such structural feature:
- For each labeled node $v \in V_L$ and each candidate neighbor $u$, define
  - $S_1$ as the labeled nodes where $A^{(n)}_{vu} = 1$,
  - $S_0$ as those with $A^{(n)}_{vu} = 0$.
- Compute the entropy of the labels over $V_L$, the conditional entropies of the two splits, and the split information of the binary feature.
- The IGR is then the ratio of entropy reduction (from the split) to the binary split information.
This process allows selection, ranking, truncation, and filtering (e.g., discarding rare features, retaining the top-$k$ IGR columns) to construct high-utility structural representations for GNNs. MSI-GNN, which employs this procedure, outperforms standard GCN, H2GCN, and GCNII architectures in semi-supervised node classification, most notably when a moderate number of high-IGR features is used and rare features are filtered out (Oishi et al., 2022).
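The per-column computation can be sketched as follows (the function name, the list-of-rows matrix representation, and the dict-of-labels encoding are assumptions for illustration, not the MSI-GNN implementation):

```python
import math
from collections import Counter

def H(labels):
    """Shannon entropy (base 2) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def igr_per_column(A, labels, labeled):
    """IGR of each binary column of A, restricted to labeled nodes.

    A: list of binary rows (e.g., an n-hop adjacency matrix),
    labels: dict node -> class, labeled: list of labeled node indices.
    """
    base = H([labels[v] for v in labeled])
    n = len(labeled)
    scores = []
    for u in range(len(A[0])):
        s1 = [labels[v] for v in labeled if A[v][u] == 1]   # S_1
        s0 = [labels[v] for v in labeled if A[v][u] == 0]   # S_0
        w1, w0 = len(s1) / n, len(s0) / n
        ig = base - (w1 * H(s1) if s1 else 0) - (w0 * H(s0) if s0 else 0)
        si = -sum(w * math.log2(w) for w in (w1, w0) if w)
        scores.append(ig / si if si > 0 else 0.0)
    return scores
```

A column that perfectly separates the labeled classes scores 1.0; a column uncorrelated with the labels scores 0.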
6. Practical Considerations and Implementation
Key aspects in employing Gain Ratio for split or feature selection include:
- Entropy computation: Accurate estimation of class or label entropy is central; for high-arity or large feature sets, numerical instability may arise when the split information approaches zero.
- Feature filtering: Removal of features with minimal representation among labeled samples is essential for noise reduction and algorithmic stability.
- Trade-offs in truncation: Retaining only top-IGR features can omit useful information; empirical results indicate that conservative truncation (a moderate number of retained features) maintains performance, while aggressive pruning can be detrimental.
- Duplication and weighting: In architectures such as MSI-GNN, duplication and discounting of structural features (by hop distance) further influence learning dynamics.
Pseudocode for IGR-based selection is straightforward, involving calculation of split probabilities, relevant entropies, and sorting/filtering steps, as detailed in (Oishi et al., 2022).
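The ranking and filtering steps can be sketched as below; the thresholds `min_count` and `k` are hypothetical tuning parameters for illustration, not values from the paper:

```python
def select_features(scores, counts, min_count=2, k=16):
    """Drop rare columns, rank the rest by IGR, keep the top-k indices.

    scores[u] is the IGR of column u; counts[u] is how many labeled nodes
    have a 1 in column u. min_count and k are hypothetical parameters.
    """
    # Filter out rarely occurring features for noise reduction.
    kept = [u for u, c in enumerate(counts) if c >= min_count]
    # Rank the survivors by IGR, highest first, and truncate to k columns.
    kept.sort(key=lambda u: scores[u], reverse=True)
    return kept[:k]
```

For example, with scores `[0.9, 0.1, 0.5]` and counts `[5, 1, 3]`, `min_count=2` drops the rare column 1 and `k=2` keeps columns 0 and 2 in score order.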
7. Theoretical and Applied Significance
Quinlan’s Gain Ratio remains foundational in decision tree learning, providing a principled correction for information gain’s high-arity bias. The BGR further refines this correction, balancing tree depth and generalization. Recent adaptations, including their application as feature selectors in GNN pipelines, indicate the continued relevance of the principle in modern architectures for structured data. The use of Gain Ratio in constructing compact, informative representations exemplifies a broader methodological trend: leveraging classical statistical criteria for effective learning in complex, high-dimensional, and structured domains (Leroux et al., 2018, Oishi et al., 2022).