Proxy-Based Biclustering Model Trees
- The paper introduces Oxytrees, which employ proxy-based compression and Kronecker product kernel regression to overcome scalability and generalization challenges in bipartite learning.
- Proxy-based biclustering model trees use an impurity function with proxy matrices to optimize split selection and enable efficient batch leaf-assignment for rapid predictions.
- Empirical evaluations show up to 30× faster training and 10× quicker predictions compared to traditional methods, with competitive performance across various biological interaction datasets.
Proxy-based biclustering model trees, exemplified by the Oxytrees algorithm, are a class of machine learning models designed to efficiently learn and predict interactions in bipartite learning scenarios. Bipartite learning involves estimating or predicting values in a large, partially observed interaction matrix, where each entry depends on a pair of feature vectors associated with distinct types of entities, such as drug–target or RNA–disease interactions. Oxytrees address the scalability and generalization limitations of previous biclustering and model-tree approaches by introducing proxy-based compression for split optimization, model-tree leaf learning via Kronecker product kernel regression, and efficient batch inference for large prediction tasks (Ilídio et al., 16 Nov 2025).
1. Problem Setting and Motivations
Bipartite learning is characterized by the presence of two separate feature matrices, one describing the row entities and one the column entities of a large, sparse interaction matrix. Many rows or columns may have only limited observed interactions, and the matrix is typically too large for direct modeling. Oxytrees aim to:
- Discover a biclustering structure in the interaction matrix.
- Fit a simple (linear-in-features) regression or classification model per bicluster.
- Enable fast prediction on novel dyads, critical for inductive and semi-inductive learning.
- Overcome the domain specificity and scalability bottlenecks of prior biclustering forests and constant-leaf model trees.
2. Impurity Function and Proxy Matrix Construction
Oxytrees utilize an impurity function that can be efficiently computed from sufficient statistics of the labels; for variance-based impurity, these statistics are per-row and per-column counts, sums, and sums of squares.
- This enables the construction of two proxy matrices at each node:
| Proxy Matrix | Content Description |
|---|---|
| Row proxy | Per-row aggregations (sufficient statistics) of the node's submatrix |
| Column proxy | Per-column aggregations (sufficient statistics) of the node's submatrix |
These proxies allow the impurity of any bicluster resulting from a candidate row or column split to be computed as a function of partial sums over the proxies, without full recomputation over the node's submatrix.
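The proxy construction for variance-based impurity can be sketched as follows. This is a minimal sketch assuming a dense label submatrix; `build_proxies` and `variance_impurity` are illustrative names, not the paper's API, and the actual implementation also handles missing entries and other impurity choices.

```python
import numpy as np

def build_proxies(Y):
    """Build row/column proxy matrices of sufficient statistics for a
    dense label submatrix Y (illustrative sketch).

    Each proxy stores, per row (or column), the count, sum, and sum of
    squares of the labels: enough to compute the variance impurity of
    any group of rows or columns from partial sums alone.
    """
    row_proxy = np.stack([
        np.full(Y.shape[0], Y.shape[1], dtype=float),  # dyad counts per row
        Y.sum(axis=1),                                  # label sums per row
        (Y ** 2).sum(axis=1),                           # squared-label sums per row
    ], axis=1)
    col_proxy = np.stack([
        np.full(Y.shape[1], Y.shape[0], dtype=float),
        Y.sum(axis=0),
        (Y ** 2).sum(axis=0),
    ], axis=1)
    return row_proxy, col_proxy

def variance_impurity(proxy_rows):
    """Variance of all labels in the bicluster spanned by the given
    proxy rows, computed only from the aggregated statistics."""
    n, s, s2 = proxy_rows.sum(axis=0)
    return s2 / n - (s / n) ** 2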
3. Efficient Split Search and Biclustering Criterion
Oxytrees condition splits on either row or column features. For a given node collecting a submatrix of the interaction matrix:
- Splitting on a subset of rows or columns yields two complementary child submatrices.
- The split is chosen to maximize the impurity reduction: the impurity of the parent submatrix minus the weighted average of the impurities of the two children, where the weights are the fractions of the parent's dyads falling into each child.

Proxy-based statistics permit evaluating every candidate split's impurity reduction from partial sums over the proxies after a single proxy build per node, rather than recomputing statistics over the full submatrix for each candidate. Split selection sweeps through the sorted feature values of the row and column feature matrices, updating the running partial sums incrementally at each candidate threshold.
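The sweep over sorted feature values can be sketched as follows, assuming variance impurity and a per-row proxy holding counts, sums, and sums of squares of the labels; all names are illustrative rather than the paper's actual implementation.

```python
import numpy as np

def best_row_split(X_rows, row_proxy):
    """Sweep sorted values of each row feature, scoring candidate splits
    by weighted-variance impurity reduction using cumulative proxy sums
    (illustrative sketch).

    X_rows    : (n_rows, n_features) row feature matrix
    row_proxy : (n_rows, 3) per-row [count, sum, sum-of-squares]
    Returns (feature index, threshold, impurity reduction).
    """
    def impurity(n, s, s2):
        return s2 / n - (s / n) ** 2

    n_tot, s_tot, s2_tot = row_proxy.sum(axis=0)
    parent = impurity(n_tot, s_tot, s2_tot)
    best = (None, None, 0.0)
    for j in range(X_rows.shape[1]):
        order = np.argsort(X_rows[:, j])
        stats = np.cumsum(row_proxy[order], axis=0)  # left-partition statistics
        for i in range(len(order) - 1):              # split between ranks i and i+1
            if X_rows[order[i], j] == X_rows[order[i + 1], j]:
                continue  # identical feature values cannot be separated
            nl, sl, s2l = stats[i]
            nr, sr, s2r = n_tot - nl, s_tot - sl, s2_tot - s2l
            gain = parent - (nl * impurity(nl, sl, s2l)
                             + nr * impurity(nr, sr, s2r)) / n_tot
            if gain > best[2]:
                thr = (X_rows[order[i], j] + X_rows[order[i + 1], j]) / 2
                best = (j, thr, gain)
    return best
```

Note that the cumulative sums make each candidate threshold a constant-time update once the proxy is built, which is the point of the proxy compression.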
4. Model-Tree Construction and Leaf Fitting with Kronecker Ridge Regression
The model tree structure recursively partitions the interaction matrix by alternating vertical (column) and horizontal (row) splits. Each leaf node receives all dyads falling into its corresponding bicluster and fits a regularized least-squares (RLS) model with a Kronecker product kernel (RLS-Kron):
- A kernel on the row features and a kernel on the column features define the joint kernel between dyads as the product of the two single-domain kernel values; across all dyads, this corresponds to the Kronecker product of the two kernel matrices.
- For a leaf with a set of row entities and a set of column entities, a row Gram matrix is formed by evaluating the row kernel on all pairs of the leaf's row feature vectors, and a column Gram matrix analogously on the column feature vectors.
- RLS-Kron optimizes a regularized least-squares objective over the leaf's dyads, with a ridge penalty induced by the Kronecker product kernel.

This objective admits a closed-form solution based on eigendecomposition of the two Gram matrices and elementwise operations, without materializing the Kronecker product. Prediction for new dyads multiplies the learned coefficient matrix on both sides by kernel feature matrices holding the kernel evaluations between the new row and column entities and the leaf's training entities.
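The closed-form solution can be sketched with the standard Kronecker RLS eigendecomposition shortcut, assuming a fully observed leaf label matrix; function names are illustrative, not the library's API.

```python
import numpy as np

def kron_rls_fit(Kr, Kc, Y, lam=1.0):
    """Closed-form Kronecker ridge regression (standard KronRLS sketch).

    Solves the regularized least-squares problem whose kernel matrix is
    the Kronecker product of Kc and Kr, using the eigendecompositions of
    the two Gram matrices and an elementwise division, never forming the
    Kronecker product explicitly. Returns the coefficient matrix A.
    """
    lr, Ur = np.linalg.eigh(Kr)   # row Gram eigendecomposition
    lc, Uc = np.linalg.eigh(Kc)   # column Gram eigendecomposition
    # Transform labels into the joint eigenbasis, rescale elementwise
    # by the regularized Kronecker eigenvalues, and map back.
    C = (Ur.T @ Y @ Uc) / (np.outer(lr, lc) + lam)
    return Ur @ C @ Uc.T

def kron_rls_predict(Kr_new, Kc_new, A):
    """Predict for new dyads: rows of Kr_new / Kc_new hold kernel values
    between new entities and the leaf's training entities."""
    return Kr_new @ A @ Kc_new.T
```

With a vanishing regularization weight and strictly positive-definite Gram matrices, the fitted values interpolate the training labels, which is a quick sanity check on the eigendecomposition algebra.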
5. Batch Leaf-Assignment and Fast Inference
Naïve application of a tree model to all test pairs would require one root-to-leaf traversal per test pair. Oxytrees introduce an optimized batch leaf-assignment algorithm:
- At each split, the relevant index set (rows or columns, depending on the split axis) is partitioned according to the split condition, and each branch receives the appropriate subset of row–column index tuples.
- At a leaf, all pairs reaching it receive predictions from the corresponding RLS-Kron model.
- Because each row or column index is routed through a split only once, the overall assignment and prediction cost grows far more slowly than one traversal per test pair, which is substantially faster for large test sets.
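The batch routing can be sketched as follows, assuming an illustrative dict-based tree representation rather than the paper's actual data structures.

```python
import numpy as np

def batch_assign(node, rows, cols, leaf_out):
    """Route all (row, column) test pairs through the tree at once
    (illustrative sketch of batch leaf-assignment).

    rows/cols : arrays of row and column feature vectors.
    node      : dict tree; internal nodes carry keys 'axis' ('row' or
                'col'), 'feature', 'threshold', 'left', 'right';
                leaves carry 'leaf' (an id).
    Each split partitions only one axis's index set, so every row or
    column index is routed once rather than once per pair.
    leaf_out maps leaf id -> (row indices, column indices) reaching it.
    """
    stack = [(node, np.arange(len(rows)), np.arange(len(cols)))]
    while stack:
        nd, ri, ci = stack.pop()
        if 'leaf' in nd:
            leaf_out[nd['leaf']] = (ri, ci)
            continue
        if nd['axis'] == 'row':
            mask = rows[ri, nd['feature']] <= nd['threshold']
            stack.append((nd['left'], ri[mask], ci))
            stack.append((nd['right'], ri[~mask], ci))
        else:
            mask = cols[ci, nd['feature']] <= nd['threshold']
            stack.append((nd['left'], ri, ci[mask]))
            stack.append((nd['right'], ri, ci[~mask]))
    return leaf_out
```

At each leaf, the full cross product of the arriving row and column indices would then be fed to that leaf's RLS-Kron model in one batched prediction.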
6. Empirical Results and Evaluation
Empirical evaluation on 15 biological interaction datasets demonstrates:
- Training speed: Up to 30× faster training than BICTR biclustering forests on large interaction matrices, with a lower observed empirical time complexity than BICTR.
- Prediction speed: Batch inference is up to 10× faster than BICTR per batch.
- Predictive performance:
- In the inductive (TT) setting, ensembles of Oxytrees yield superior or statistically tied AUPRC/AUROC versus BICTR and other baselines (RLS-Kron, NRLMF, WkNNIR), based on Friedman–Nemenyi tests.
- Advantages are especially pronounced when using RLS-Kron leaf models relative to constant-leaf alternatives.
- Competitive performance persists in semi-inductive (TL, LT), transductive (TD), and partially unlabeled (PU) scenarios.
- Ablation studies confirm the necessity of proxy-based split search, RLS-Kron leaf fitting, and batch inference to attain these improvements (Ilídio et al., 16 Nov 2025).
7. Software Implementation and Reproducibility
Oxytrees are provided with a Python API compatible with Scikit-Learn, enabling:
- Access to all 15 benchmark datasets and evaluation metrics used in the study.
- Reproducibility of experimental results.
- Practical deployment for large bipartite learning tasks in computational biology and beyond.
The code and datasets are available at https://github.com/pedroilidio/oxytrees2025.