DepoRanker: Phage Depolymerase Ranker

Updated 16 November 2025

DepoRanker is a computational tool that uses machine learning to identify and rank Klebsiella phage depolymerases, offering an efficient alternative to homology-based methods.
It employs a simple amino acid composition vector with an XGBoost ranking ensemble to effectively prioritize candidate proteins from phage proteomes for downstream wet-lab validation.
DepoRanker streamlines depolymerase discovery by reducing screening from tens of candidates to a concise top shortlist, accelerating phage-based antibacterial strategies.

DepoRanker is a web-based and open-source computational tool designed for the identification and ranking of Klebsiella phage depolymerases using machine learning methodologies. Developed in response to the limitations of homology-based searches in novel depolymerase discovery, DepoRanker prioritizes candidate proteins from newly sequenced Klebsiella phage proteomes for experimental follow-up, providing a streamlined alternative to traditional BLAST annotation pipelines.

1. Biological Motivation and Problem Definition

Carbapenem-resistant Klebsiella pneumoniae, designated a “priority pathogen” by the WHO, poses significant healthcare challenges due to its polysaccharide (K-type) capsule. This capsule facilitates biofilm formation, enhances virulence, and reduces the efficacy of both antibiotics and many bacteriophages. Phage-encoded depolymerases, enzymes that degrade bacterial capsular polysaccharides, can dismantle this protective barrier, either sensitizing Klebsiella to phage infection or acting as direct antibacterial agents. Genome-based depolymerase discovery has historically relied on sequence homology searches (e.g., BLAST for “tail-spikes” or “tail-fibres”), yet many experimentally validated depolymerases show low or negligible homology to database entries, which creates bottlenecks in wet-lab screening and functional annotation. DepoRanker addresses this by using machine learning to score every candidate protein in a proteome and return a concise, ranked shortlist for experimental verification.

2. Dataset Construction and Curation

DepoRanker’s model is trained on curated phage proteomes:

Positives (“known depolymerases”): 39 experimentally characterized proteins sourced from 24 unique Klebsiella phage proteomes, referenced in peer-reviewed studies (e.g., Hsu et al. 2013; Majkowska-Skrobek et al. 2016; Pan et al. 2017; Li et al. 2022).
Negatives (“non-depolymerases”): All other annotated proteins (n = 2,601) in these 24 proteomes, explicitly excluding the 39 positives.
External hold-out set: Five recently characterized depolymerases along with their complete parent proteomes (see Table 2 of the source manuscript).

Protein sequences were downloaded in FASTA format and parsed using libraries such as BioPython’s SeqIO. Aside from excluding previously labeled positives from the negative set, no sequence filtering (e.g., by length or manual curation) was performed.
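The parsing and labeling step can be sketched as follows. The source mentions BioPython's SeqIO; this sketch swaps in a small standard-library FASTA reader so that it runs without extra packages. The function names, the toy sequences, and the choice of the first header token as the protein ID are assumptions made for illustration.

```python
from io import StringIO

def read_fasta(handle):
    """Yield (protein_id, sequence) pairs from a FASTA-formatted text handle."""
    protein_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if protein_id is not None:
                yield protein_id, "".join(chunks)
            # Assumption: the ID is the first whitespace-delimited header token.
            tokens = line[1:].split()
            protein_id, chunks = (tokens[0] if tokens else ""), []
        else:
            chunks.append(line)
    if protein_id is not None:
        yield protein_id, "".join(chunks)

def split_labels(records, positive_ids):
    """Split records into positives and negatives; no length or quality filter is applied."""
    positives = [r for r in records if r[0] in positive_ids]
    negatives = [r for r in records if r[0] not in positive_ids]
    return positives, negatives

# Toy proteome (hypothetical IDs and sequences).
example = StringIO(">gp_001 hypothetical protein\nMKT\nAAL\n>gp_002\nMSDE\n")
records = list(read_fasta(example))
positives, negatives = split_labels(records, {"gp_001"})
```

Every protein that is not a labeled positive falls into the negative set, mirroring the rule described above.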

3. Protein Feature Representation

DepoRanker adopts a single, interpretable feature set: the non-normalized amino-acid composition vector. Each sequence $S$ of length $L$ is encoded as:

$x_i = \text{number of occurrences of amino acid } i \in \{A, C, D, \ldots, Y\} \text{ in } S,$

so that $x = (x_1, x_2, \ldots, x_{20}) \in \mathbb{N}^{20}$. No additional features, such as k-mer frequencies, predicted disorder, isoelectric point, or molecular weight, are incorporated in the current release. This design prioritizes transparency and generalizability within the Klebsiella phage context. As stated by the authors, “a simple amino acid composition of a protein is used as its feature representation.”
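A minimal sketch of this encoding is given below. The alphabetical ordering of the 20 residues and the decision to ignore non-standard symbols (such as X or B) are assumptions; the source does not say how such characters are handled.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, alphabetical

def composition_vector(sequence):
    """Return raw (non-normalized) counts of each standard amino acid."""
    counts = dict.fromkeys(AMINO_ACIDS, 0)
    for residue in sequence.upper():
        if residue in counts:  # assumption: non-standard symbols are skipped
            counts[residue] += 1
    return [counts[aa] for aa in AMINO_ACIDS]

vec = composition_vector("MKTAAL")
```

Because the counts are not divided by the sequence length, longer proteins yield larger vectors, so protein length is implicitly available to the model.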

4. Machine Learning Approach and Ranking Strategy

4.1 Learning Algorithm and Hyperparameters

DepoRanker implements XGBoost in its “rank:pairwise” mode. For each phage proteome $g$, protein sequences are segregated into:

$P_g$ : positives (known depolymerases)
$N_g$ : negatives (non-depolymerases)

The ranking model learns parameters $\theta$ such that, for all $(i, j) \in P_g \times N_g$,

$f(x_i;\theta) \gg f(x_j;\theta),$

where $f(\cdot;\theta)$ is the ensemble’s real-valued output score.
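To make the pairwise objective concrete, the sketch below evaluates a generic pairwise logistic loss over all positive/negative pairs within one proteome, together with the fraction of pairs ordered correctly. It illustrates the idea of pairwise ranking only; it is not XGBoost's internal implementation, and the function names are hypothetical.

```python
import math

def pairwise_logistic_loss(pos_scores, neg_scores):
    """Mean of log(1 + exp(-(s_i - s_j))) over all (positive, negative) pairs."""
    pairs = [(p, n) for p in pos_scores for n in neg_scores]
    return sum(math.log1p(math.exp(-(p - n))) for p, n in pairs) / len(pairs)

def fraction_correctly_ordered(pos_scores, neg_scores):
    """Share of (positive, negative) pairs in which the positive scores higher."""
    pairs = [(p, n) for p in pos_scores for n in neg_scores]
    return sum(p > n for p, n in pairs) / len(pairs)

# Hypothetical scores: one depolymerase and two other proteins from one proteome.
good = pairwise_logistic_loss([2.0], [-1.0, 0.0])
bad = pairwise_logistic_loss([-1.0], [2.0, 0.0])
```

The loss shrinks as positives are pushed further above negatives, which is the behaviour the inequality above asks for.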

Key hyperparameters, determined via cross-validation, are:

learning_rate = 0.1
subsample = 0.9
max_depth = 3
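The reported settings can be written as an XGBoost parameter dictionary, sketched below. Ranking objectives in XGBoost also need to know which rows belong to the same query; treating each proteome as one query group is an assumption that follows the per-proteome formulation above. The helper name is hypothetical, and the number of boosting rounds is not stated in the source, so it is left out.

```python
from itertools import groupby

# Hyperparameters as reported above; "rank:pairwise" selects pairwise ranking.
params = {
    "objective": "rank:pairwise",
    "learning_rate": 0.1,
    "subsample": 0.9,
    "max_depth": 3,
}

def group_sizes(proteome_ids):
    """Sizes of consecutive runs of identical proteome IDs.

    Rows must already be sorted so that each proteome's proteins are contiguous.
    """
    return [len(list(run)) for _, run in groupby(proteome_ids)]

# Hypothetical row labels: two proteins from phage A, one from B, three from C.
sizes = group_sizes(["A", "A", "B", "C", "C", "C"])
```

A list like `sizes` is the kind of input that XGBoost's `DMatrix.set_group` accepts before the dictionary is passed to `xgboost.train`.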

4.2 Cross-Validation and Model Deployment

To prevent train/test leakage, positives are clustered at 10% identity using CD-HIT, yielding 7 non-redundant clusters. Each fold of the 7-fold cross-validation holds out one cluster for testing and trains on the remainder, ensuring that “proteomes containing similar depolymerases [are] never split across train/test in the same fold.” Deployment then uses an ensemble of the seven resulting models, with the consensus score for a protein $x$ calculated as:

$f_{\text{ensemble}}(x) = \frac{1}{7} \sum_{k=1}^7 f_k(x).$
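The fold construction and the averaging step can be sketched as below. The sketch assumes each proteome has already been assigned to exactly one cluster, and it uses plain Python callables as stand-ins for trained models; both are simplifications, and all names are hypothetical.

```python
def leave_one_cluster_out(proteome_to_cluster):
    """Yield (held_out_cluster, train_proteomes, test_proteomes) per cluster."""
    clusters = sorted(set(proteome_to_cluster.values()))
    for held_out in clusters:
        test = sorted(p for p, c in proteome_to_cluster.items() if c == held_out)
        train = sorted(p for p, c in proteome_to_cluster.items() if c != held_out)
        yield held_out, train, test

def ensemble_score(x, models):
    """Unweighted mean of the fold models' scores, as in f_ensemble above."""
    return sum(model(x) for model in models) / len(models)

# Toy mapping (hypothetical): four proteomes assigned to three clusters.
folds = list(leave_one_cluster_out({"p1": 0, "p2": 0, "p3": 1, "p4": 2}))

# Stand-in "models": any callable that maps a feature value to a score.
toy_models = [lambda x: x, lambda x: 2 * x, lambda x: 0.0]
consensus = ensemble_score(2.0, toy_models)
```

With seven clusters, the same generator produces the seven folds, and `ensemble_score` reduces to the formula above with a divisor of 7.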

4.3 Scoring and Ranking

For each query proteome, all proteins are scored using $f_{\text{ensemble}}(x)$. They are then ranked in descending order of score, with the highest-ranking proteins designated as the most probable depolymerases.
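The ranking step itself is a sort by descending score, sketched below. The function name and the tie-breaking rule (ties keep input order) are assumptions, since the source does not describe how ties are resolved.

```python
def rank_proteins(scores):
    """Return (protein_id, score, rank) tuples sorted by descending score.

    Rank 1 is the most probable depolymerase. Ties keep their input order.
    """
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(pid, score, rank) for rank, (pid, score) in enumerate(ordered, start=1)]

# Hypothetical ensemble scores for three proteins of one query proteome.
ranking = rank_proteins({"a": 0.1, "b": 0.9, "c": 0.5})
```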

5. Model Evaluation and Comparative Performance

DepoRanker’s efficacy is substantiated using multiple performance metrics:

Rank of First Positive Prediction (RFPP): For each proteome, the RFPP is the rank at which the first true depolymerase appears. In cross-validation across 24 proteomes, the median RFPP was 1, and the maximal (100th-percentile) RFPP was 3. In contrast, BLAST-based ranking exhibited a maximal RFPP of 31, and a random baseline had a median of 52.

Phage Accession	Proteome Size	BLAST RFPP	DepoRanker RFPP
NC027399	540	1	1
AB797215	203	20	1
JF501022	77	31	2
OU509535	528	8	3

Receiver Operating Characteristic (ROC) and AUROC: DepoRanker achieved AUROC = 0.99, surpassing BLAST’s AUROC of 0.94.
Precision-Recall and AUCPR: DepoRanker’s area under the precision-recall curve was 0.42, compared to 0.37 for BLAST.
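The three metrics can be sketched in a few lines. AUROC is computed here as the fraction of positive/negative pairs ordered correctly, and average precision is used as a common step-wise estimate of the area under the precision-recall curve; how the source pooled scores across proteomes is not stated, so these are generic definitions rather than the authors' evaluation code. The toy scores are hypothetical.

```python
def rfpp(ranked_ids, positive_ids):
    """1-based rank of the first true positive in a ranked list, or None."""
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid in positive_ids:
            return rank
    return None

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))

def average_precision(scored_labels):
    """Mean precision at the rank of each positive, given (score, is_positive) pairs."""
    ordered = sorted(scored_labels, key=lambda t: t[0], reverse=True)
    hits, precisions = 0, []
    for k, (_, is_positive) in enumerate(ordered, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions)

# RFPP values for the four proteomes listed in the table above.
blast_rfpp = [1, 20, 31, 8]
deporanker_rfpp = [1, 1, 2, 3]
worst_blast = max(blast_rfpp)
worst_deporanker = max(deporanker_rfpp)

# Toy scores (hypothetical) to exercise the threshold-free metrics.
pos, neg = [0.9, 0.4], [0.8, 0.3, 0.1]
toy_auroc = auroc(pos, neg)
toy_ap = average_precision([(s, True) for s in pos] + [(s, False) for s in neg])
```

For the four listed proteomes, the worst-case RFPP is 31 for BLAST and 3 for DepoRanker, which agrees with the maxima quoted above.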

On the independent external hold-out of five newly characterized depolymerases, DepoRanker ranked the correct protein at positions {1,1,1,2,3}, for an average RFPP of 1.6. Although no formal statistical significance tests were reported, the separation from BLAST is considerable.

6. Software Architecture and Availability

DepoRanker is accessible both as a web server and as open-source software:

Web server: https://deporanker.dcs.warwick.ac.uk/
Source code: https://github.com/wgrgwrght/deporanker

6.1 Web and Backend Implementation

Frontend: Static HTML upload page for FASTA input, with a “Rank Proteome” button.
Backend: Python Flask microservice. The backend handles FASTA parsing, amino acid vector computation, model loading, scoring with the XGBoost ensemble, and output HTML rendering with a downloadable CSV.
Input: Complete protein sequences in FASTA format; there is no enforced genus restriction, though the model is Klebsiella-centric.
Output: CSV file with columns: ProteinID, Score, Rank, sorted by descending score.
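A sketch of the output step, assuming already ranked rows, is shown below. The column names come from the description above; the three-decimal score formatting is an assumption, and the function is illustrative rather than the server's actual code.

```python
import csv
from io import StringIO

def write_ranking_csv(rows, handle):
    """Write (protein_id, score, rank) rows under the header ProteinID, Score, Rank."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["ProteinID", "Score", "Rank"])
    for protein_id, score, rank in rows:
        # Assumption: scores are reported to three decimal places.
        writer.writerow([protein_id, f"{score:.3f}", rank])

buffer = StringIO()
write_ranking_csv([("gp_001", 0.924, 1), ("gp_102", 0.645, 2)], buffer)
csv_text = buffer.getvalue()
```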

6.2 Dependencies

Python 3.x, Flask
XGBoost Python package
BioPython, NumPy, Pandas
CD-HIT v4.8.1 (for training/extended clustering)
SHAP (for feature importance analysis)

7. Practical Application and Workflow

A typical application proceeds as follows:

Assemble a FASTA file (e.g., kleb_phage.fasta) containing all protein sequences from a newly sequenced Klebsiella phage.
Submit the FASTA file at https://deporanker.dcs.warwick.ac.uk/.
Download and open the results CSV, which includes a ranked list of proteins by likelihood of being depolymerases.
Prioritize the top 1–3 candidates for cloning, expression, and subsequent wet-lab assay validation.

Example output:

ProteinID	Score	Rank
gp_001	0.924	1
gp_102	0.645	2
gp_057	0.582	3

This workflow reduces the experimental burden, typically narrowing wet-lab screens from tens of candidates to three or fewer per proteome.
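Selecting the shortlist from a downloaded results file can be sketched as follows. The sketch assumes a comma-separated file with the three columns described earlier and reuses the values from the example output; the function name and the default cutoff are hypothetical.

```python
import csv
from io import StringIO

def top_candidates(handle, k=3):
    """Return the protein IDs of the k best-ranked rows in a results CSV."""
    rows = list(csv.DictReader(handle))
    rows.sort(key=lambda row: int(row["Rank"]))
    return [row["ProteinID"] for row in rows[:k]]

# The example output above, written as comma-separated text.
results = StringIO(
    "ProteinID,Score,Rank\n"
    "gp_001,0.924,1\n"
    "gp_102,0.645,2\n"
    "gp_057,0.582,3\n"
)
shortlist = top_candidates(results, k=2)
```

Sorting on the Rank column rather than relying on file order keeps the selection correct even if the file has been re-sorted in a spreadsheet.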

8. Impact, Limitations, and Future Directions

DepoRanker demonstrates marked gains over BLAST-based homology screens, both in ranking accuracy (RFPP) and operational efficiency. Its successful use of a simple 20-dimensional amino acid composition vector, despite the absence of higher-order sequence or structural descriptors, highlights the power of carefully tuned ranking ensembles in proteomic triage.

Limitations include:

Klebsiella-centric training: The model’s performance on distant genera is untested.
Restricted feature space: Only amino acid counts are currently employed; adding features such as dipeptide/tripeptide k-mers, secondary structure predictions, or domain profiles could offer further gains.
Retraining needs: Ongoing addition of experimentally verified depolymerases will require periodic retraining or model fine-tuning.
Extension potential: Adopting deeper sequence models (e.g., transformer architectures) could facilitate detection of more remote evolutionary relationships or nonlocal sequence signals.

DepoRanker offers the community a transparent, reproducible platform for accelerating depolymerase discovery, directly contributing to phage-based strategies targeting multidrug-resistant Klebsiella.
