
Binary Relevance in Multi-label Learning

Updated 23 June 2026
  • Binary Relevance (BR) is a multi-label learning method that converts each label prediction into an independent binary classification task.
  • It trains $k$ separate classifiers, typically with hinge loss and L2 regularization, which keeps the method modular and easy to parallelize.
  • Despite its simplicity, BR scales poorly with the number of labels, struggles with class imbalance, and ignores label correlations; these weaknesses have motivated enhancements such as joint transformations, stochastic sketching, and ranking-based hybrids.

Binary Relevance (BR) is a foundational approach in multi-label learning that decomposes the original multi-label problem into a collection of independent binary classification problems, one for each label. Despite its simplicity and modularity, BR has three significant limitations: it scales poorly, it cannot model label correlations, and it is vulnerable to class imbalance. Recent advances address these bottlenecks via scalable transformations, stochastic sketching, and hybrid approaches combining ranking and low-rank constraints.

1. Formal Definition and Learning Objective

Given an input space $X \subseteq \mathbb{R}^n$ and a set of labels $L = \{1, 2, \dots, k\}$, each instance $x \in X$ is associated with a binary label vector $y \in \{0, 1\}^k$. The training set is $D = \{(x_i, y_i)\}_{i=1}^m$, with $y_{i,\ell} = 1$ indicating that label $\ell$ is present for instance $i$.

Binary Relevance transforms multi-label learning into $k$ independent binary classification tasks. For each label $\ell \in L$, BR constructs a dataset $D_\ell = \{(x_i, \tilde{y}_{i,\ell})\}_{i=1}^m$ where $\tilde{y}_{i,\ell} = +1$ if $y_{i,\ell} = 1$ and $\tilde{y}_{i,\ell} = -1$ otherwise. Each classifier $f_\ell$ is trained by minimizing a regularized empirical risk, typically using binary losses such as the hinge loss $\max(0, 1 - \tilde{y}_{i,\ell} f_\ell(x_i))$ or squared hinge loss, with an additional $\ell_2$ regularization:

$$\min_{w_\ell} \; \frac{\lambda}{2} \|w_\ell\|_2^2 + \frac{1}{m} \sum_{i=1}^m \max\bigl(0, 1 - \tilde{y}_{i,\ell} f_\ell(x_i)\bigr)$$

where $f_\ell(x) = w_\ell^\top x$ and $\lambda > 0$ controls the regularization strength. The resulting $k$ classifiers predict for each test input, and the output vector aggregates their per-label binary decisions (Jambor et al., 2019, Wu et al., 2019, Gong et al., 2021).
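
The decomposition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the cited papers: it assumes linear scorers $f_\ell(x) = w_\ell^\top x$, the objective $\frac{\lambda}{2}\|w_\ell\|_2^2 + \frac{1}{m}\sum_i \max(0, 1 - \tilde{y}_{i,\ell} w_\ell^\top x_i)$, and plain subgradient descent. The function names, step size, and toy data are hypothetical.

```python
import numpy as np

def train_binary_relevance(X, Y, lam=0.01, lr=0.1, epochs=200):
    """Fit one linear hinge-loss classifier per label by subgradient descent.

    X: (m, n) feature matrix; Y: (m, k) label matrix with entries in {0, 1}.
    Returns W of shape (n, k); column l holds the weights for label l.
    """
    m, n = X.shape
    k = Y.shape[1]
    S = 2.0 * Y - 1.0                      # relabel {0, 1} -> {-1, +1}
    W = np.zeros((n, k))
    for l in range(k):                     # the k problems share nothing
        w = np.zeros(n)
        for _ in range(epochs):
            margins = S[:, l] * (X @ w)
            active = margins < 1.0         # examples with non-zero hinge loss
            signs = S[active, l][:, None]
            grad = lam * w - (signs * X[active]).sum(axis=0) / m
            w -= lr * grad
        W[:, l] = w
    return W

def predict_binary_relevance(X, W):
    """Stack the k independent sign decisions into a {0, 1} label matrix."""
    return (X @ W > 0).astype(int)

# Toy data (hypothetical): label 0 fires when x[0] > 0, label 1 when x[1] > 0.
X_demo = np.array([[2.0, 2.0], [2.0, -2.0], [-2.0, 2.0], [-2.0, -2.0]])
Y_demo = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])

W_demo = train_binary_relevance(X_demo, Y_demo)
Y_hat = predict_binary_relevance(X_demo, W_demo)
print(Y_hat)
```

Because each pass of the outer loop touches only one column of the label matrix, the loop body could be handed to any off-the-shelf binary learner without changing the rest of the sketch.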

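BR naturally optimizes Hamming loss, defined empirically as:

$$\mathrm{HL} = \frac{1}{mk} \sum_{i=1}^m \sum_{\ell=1}^k \mathbb{1}\bigl[\hat{y}_{i,\ell} \neq y_{i,\ell}\bigr]$$

where $\mathbb{1}[\cdot]$ is the indicator function and $\hat{y}_{i,\ell}$ is the predicted value of label $\ell$ for $x_i$ (Wu et al., 2019).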
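
A small worked example may help show why Hamming loss fits BR so well: it is the average of the per-label error rates, so lowering each label's error independently lowers the total. The matrices below are made up for illustration.

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of (instance, label) entries that are predicted incorrectly."""
    Y_true = np.asarray(Y_true)
    Y_pred = np.asarray(Y_pred)
    return float(np.mean(Y_true != Y_pred))

def per_label_error(Y_true, Y_pred):
    """Error rate of each label's binary classifier, taken on its own."""
    Y_true = np.asarray(Y_true)
    Y_pred = np.asarray(Y_pred)
    return np.mean(Y_true != Y_pred, axis=0)

# Two instances, three labels; two of the six entries are wrong.
Y_true = [[1, 0, 1],
          [0, 0, 1]]
Y_pred = [[1, 1, 1],
          [0, 0, 0]]

loss = hamming_loss(Y_true, Y_pred)          # 2 / 6
errors = per_label_error(Y_true, Y_pred)     # [0.0, 0.5, 0.5]
print(loss, errors, errors.mean())
```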

2. Computational Complexity and Scalability

BR’s primary computational costs are:

  • Training Cost: If training a single binary classifier requires $T_{\mathrm{train}}$ time, then the total cost is $O(k \cdot T_{\mathrm{train}})$. For linear models with $n$-dimensional features and $m$ samples, this cost becomes prohibitive as $k$ increases (Jambor et al., 2019).
  • Prediction Cost: $O(k \cdot T_{\mathrm{pred}})$ for an instance, where $T_{\mathrm{pred}}$ is the computation per classifier.

The method’s runtime grows linearly with both the number of labels $k$ and the number of instances $m$. This dependence is the key bottleneck when scaling BR to problems with millions of samples and thousands of labels (Gong et al., 2021).
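
Since the $k$ problems share no parameters, the linear-in-$k$ training cost can at least be spread across workers. The sketch below illustrates that with a thread pool; it swaps in a closed-form ridge-regression fit as a stand-in for "train one binary classifier" (an assumption made to keep the example short, not the hinge-loss learner described earlier), and all function names are hypothetical.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def fit_one_label(X, y, lam=1.0):
    """Ridge-regression stand-in for training one binary classifier."""
    n = X.shape[1]
    targets = 2.0 * y - 1.0                # {0, 1} -> {-1, +1}
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ targets)

def fit_all_labels(X, Y, workers=4):
    """Train the k per-label models concurrently; column l of the result is w_l."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        columns = list(pool.map(lambda l: fit_one_label(X, Y[:, l]),
                                range(Y.shape[1])))
    return np.column_stack(columns)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
Y = (rng.random((200, 8)) < 0.3).astype(float)

W_parallel = fit_all_labels(X, Y)
W_serial = np.column_stack([fit_one_label(X, Y[:, l]) for l in range(Y.shape[1])])
print(np.allclose(W_parallel, W_serial))
```

Parallelism divides the wall-clock time but not the total work, which still grows with $k$; that remaining cost is what the scalable alternatives discussed later try to reduce.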

3. Statistical Properties and Theoretical Guarantees

BR independently optimizes per-label risks, inheriting generalization properties from standard supervised learning:

  • Uniform Convergence: The decomposition into $k$ independent problems allows for margin-based and empirical risk-based bounds for each per-label classifier $f_\ell$ (Jambor et al., 2019).
  • Asymptotic Consistency: As sample size increases, the empirical Hamming loss of BR approaches the Bayes error for independent binary classification tasks under standard assumptions (Gong et al., 2021).

However, because BR minimizes Hamming loss one label at a time, it cannot directly optimize macro- or micro-averaged metrics, subset-based metrics, or the ranking losses common in multi-label settings.

4. Limitations: Class Imbalance and Label Correlations

BR exhibits two main deficiencies:

  • Class-Imbalance Issue: In multi-label data, positive examples for each label are often a small minority, making loss minimization dominated by negatives and leading to poor recall for rare labels. This issue is not addressed by the pointwise loss functions used in BR (Wu et al., 2019).
  • Neglect of Label Correlations: BR treats each label as an independent task; statistical dependencies, co-occurrence, or mutual exclusivity among labels are not modeled. This precludes leveraging label relationships for improved predictive performance, particularly for rare or correlated labels (Wu et al., 2019).
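
The class-imbalance problem is easy to reproduce numerically. In the made-up example below, every label is positive for only 2% of instances, and a trivial predictor that never assigns any label still reaches a Hamming loss of 0.02 while recalling nothing.

```python
import numpy as np

# Hypothetical label matrix: 1000 instances, 3 labels, 20 positives per label.
m = 1000
Y = np.zeros((m, 3), dtype=int)
Y[0:20, 0] = 1
Y[20:40, 1] = 1
Y[40:60, 2] = 1

# Trivial predictor: never predict any label.
Y_pred = np.zeros_like(Y)

hamming = float(np.mean(Y != Y_pred))         # 60 wrong entries out of 3000
true_pos = (Y & Y_pred).sum(axis=0)
recall = true_pos / Y.sum(axis=0)             # per-label recall

print(hamming, recall)
```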

Approaches such as RBRL incorporate pairwise ranking losses and low-rank coupling to address these defects—adding a ranking term to separate positive from negative labels and a trace-norm constraint on the label weight matrix to encourage shared structure across labels (Wu et al., 2019).

5. Scalable Alternatives: Joint Transformations and Stochastic Sketching

Several methods have been developed to overcome BR’s scalability bottlenecks:

  • Single-Binary Transformation (DiagT): Constructs a joint optimization by stacking the $k$ independent problems into a single sparse binary classification problem in a higher-dimensional space. Given $X \in \mathbb{R}^{m \times n}$ and $Y \in \{0, 1\}^{m \times k}$, the transformed data $\tilde{X}$ and label vector $\tilde{y}$ are constructed so that each (instance, label) pair $(x_i, \ell)$ becomes a single binary example. The weight vector $\tilde{w}$ is partitioned into $k$ label-specific components. DiagT allows the use of sparsity, feature hashing, and random under-sampling to dramatically reduce memory and compute costs while maintaining the interpretability and prediction decomposability of BR (Jambor et al., 2019).
  • Stochastic Sketching: Randomly projects the data to a lower dimension via subgaussian or Hadamard-based sketching. The sketched regression model is solved at a cost governed by the sketch size and attains a provable approximation to the BR optimum with high probability. At test time, the sketch-learned label embedding supports kNN inference in label space. This approach preserves generalization properties with provable statistical guarantees and produces speedups of 10×–100× over standard BR (Gong et al., 2021).

| Approach | Training Complexity | Scalability Features |
|---|---|---|
| Standard BR | $O(k \cdot T_{\mathrm{train}})$, with $T_{\mathrm{train}}$ the cost of training one binary classifier | Decoupled, parallelizable, no label sharing |
| DiagT | … (sparse) | Joint training, sparsity/hash speedups |
| Stochastic Sketch (SS+WH/GAU) | … | Small sketch size, theoretical guarantees |
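
One plausible reading of the single-binary transformation is a block-diagonal stacking, sketched below. This is an assumption about the construction rather than the cited paper's code: the function name `diag_transform` is hypothetical, and a dense Kronecker product is used only for readability (a practical version would keep the matrix sparse).

```python
import numpy as np

def diag_transform(X, Y):
    """Stack k per-label problems into one binary problem.

    X: (m, n) features; Y: (m, k) labels in {0, 1}.
    Returns X_tilde of shape (m*k, n*k) with k copies of X on the diagonal,
    and y_tilde of shape (m*k,) listing label 0's targets, then label 1's, etc.
    """
    k = Y.shape[1]
    X_tilde = np.kron(np.eye(k), X)
    y_tilde = Y.T.reshape(-1)
    return X_tilde, y_tilde

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
Y = (rng.random((4, 2)) < 0.5).astype(int)
W = rng.standard_normal((3, 2))              # column l = weights for label l

X_tilde, y_tilde = diag_transform(X, Y)
w_tilde = W.T.reshape(-1)                    # concatenate the per-label weights

# Scores of the single stacked model equal the k per-label scores.
stacked_scores = X_tilde @ w_tilde
per_label_scores = (X @ W).T.reshape(-1)
print(np.allclose(stacked_scores, per_label_scores))
```

The equality at the end is what keeps predictions decomposable: slicing `w_tilde` into $k$ chunks of length $n$ recovers an ordinary BR model.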

6. Empirical Performance and Comparative Evaluation

Empirical comparisons consistently demonstrate:

  • Precision and Speed: DiagT and its variants (with under-sampling or hashing) show higher precision@K and substantially faster training than standard BR in large-scale recommender tasks (e.g., 13.9s for DiagT-rus vs. 216.4s for BR, with 3–4% absolute precision@K improvements) (Jambor et al., 2019).
  • Competitive Accuracy: Stochastic sketching achieves Hamming loss and Example-F1 measures nearly matching full BR, trading a small accuracy loss for 10×–100× acceleration. FastXML and SLEEC exhibit lower accuracy despite more complex optimization (Gong et al., 2021).

| Dataset | BR+LIB Hamming Loss | SS+WH Hamming Loss | BR+LIB Train Time (s) | SS+WH Train Time (s) |
|---|---|---|---|---|
| corel5k | 0.0098 | 0.0102 | 7.20 | 0.20 |
| nus(vlad) | 0.0211 | 0.0225 | 222.2 | 20.2 |
| nus(bow) | 0.0215 | 0.0226 | 511.8 | 34.3 |
| rcv1x | 0.0017 | 0.00195 | 22607 | 55.9 |

On extremely large-scale datasets (hundreds of thousands of instances and a hundred or more labels), DiagT-like and sketching techniques are practical alternatives to classic BR (Jambor et al., 2019, Gong et al., 2021).
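
The speed-for-accuracy trade behind sketching can be illustrated with a generic "sketch-and-solve" least-squares example. This is not the algorithm of Gong et al. (2021); it is a textbook Gaussian sketch applied to a squared-loss regression onto the label matrix, with problem sizes and the sketch size `s` chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, s = 2000, 10, 5, 400                # s is the (hypothetical) sketch size

# Synthetic multi-label data: labels are noisy thresholded linear scores.
X = rng.standard_normal((m, n))
W_true = rng.standard_normal((n, k))
Y = (X @ W_true + 0.5 * rng.standard_normal((m, k)) > 0).astype(float)

# Full problem: least-squares fit of all k label columns on all m rows.
W_full, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Sketched problem: compress the m rows to s rows with a Gaussian matrix,
# then solve the much smaller least-squares problem.
S = rng.standard_normal((s, m)) / np.sqrt(s)
W_sketch, *_ = np.linalg.lstsq(S @ X, S @ Y, rcond=None)

# Compare both solutions on the original, unsketched objective.
res_full = np.linalg.norm(X @ W_full - Y)
res_sketch = np.linalg.norm(X @ W_sketch - Y)
ratio = res_sketch / res_full
print(ratio)
```

The ratio cannot fall below 1, because `W_full` minimizes the unsketched residual, and it moves closer to 1 as `s` grows; structured sketches such as Hadamard-based ones aim to make the projection step itself cheap.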

7. Extensions and Hybrid Methods

Hybrid formulations explicitly integrate BR’s Hamming loss minimization with other objectives:

  • Ranking SVM with Binary Relevance and Low-Rank Learning (RBRL): Augments the BR loss with a pairwise ranking term to address class imbalance and a trace-norm penalty to impose low-rank structure on the label weight matrix. The result is an optimization problem combining a pointwise squared hinge loss (BR’s surrogate), ranking constraints, and trace-norm regularization, solvable by proximal gradient methods. RBRL outperforms plain BR on datasets where label correlations and imbalance are prominent, at the cost of higher computational complexity per iteration (Wu et al., 2019).

The incorporation of ranking and low-rank modeling addresses BR’s limitations but introduces multi-objective trade-offs, additional hyperparameters, and heightened optimization costs.
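
Part of the extra per-iteration cost comes from the trace-norm term. In a proximal gradient method, a trace-norm penalty is typically handled by singular value soft-thresholding, which needs an SVD of the weight matrix at every step. The sketch below shows that standard operator in isolation; how RBRL embeds it in its full update is not covered here, and the function name is hypothetical.

```python
import numpy as np

def prox_trace_norm(W, tau):
    """Proximal operator of tau * ||W||_* (singular value soft-thresholding)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)      # shrink every singular value by tau
    return (U * s_shrunk) @ Vt

# Singular values 3.0, 1.0, 0.2 become 2.5, 0.5, 0.0 for tau = 0.5,
# so the smallest direction is removed and the rank drops from 3 to 2.
W = np.diag([3.0, 1.0, 0.2])
W_low = prox_trace_norm(W, 0.5)
print(np.round(W_low, 6), np.linalg.matrix_rank(W_low))
```

Zeroing small singular values is what pushes the per-label weight vectors toward a shared low-dimensional subspace, which is how the penalty couples labels that plain BR treats independently.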


Binary Relevance’s independence, modularity, and statistical transparency have secured its prominence as a baseline and component in scalable multi-label learning. While it is effective in moderate dimensions, recent works demonstrate that exploiting sparsity, randomized projections, and principled hybridization with ranking and low-rank modeling are necessary to match modern data scale and complexity (Jambor et al., 2019, Wu et al., 2019, Gong et al., 2021).
