DepoRanker: Phage Depolymerase Ranker

Updated 16 November 2025
  • DepoRanker is a computational tool that uses machine learning to identify and rank Klebsiella phage depolymerases, offering an efficient alternative to homology-based methods.
  • It employs a simple amino acid composition vector with an XGBoost ranking ensemble to effectively prioritize candidate proteins from phage proteomes for downstream wet-lab validation.
  • DepoRanker streamlines depolymerase discovery by reducing screening from tens of candidates to a short ranked list, accelerating phage-based antibacterial strategies.

DepoRanker is an open-source, web-based tool that uses machine learning to identify and rank Klebsiella phage depolymerases. Developed because homology-based searches often miss novel depolymerases, it prioritizes candidate proteins from newly sequenced Klebsiella phage proteomes for experimental follow-up, offering a streamlined alternative to traditional BLAST annotation pipelines.

1. Biological Motivation and Problem Definition

Carbapenem-resistant Klebsiella pneumoniae, designated a “priority pathogen” by the WHO, poses significant healthcare challenges due to its polysaccharide (K-type) capsule. This capsule facilitates biofilm formation, enhances virulence, and impedes the efficacy of both antibiotics and many bacteriophages. Phage-encoded depolymerases — enzymes capable of degrading bacterial capsular polysaccharides — have the potential to dismantle this protective barrier, thereby sensitizing Klebsiella to phage infection or acting as direct antibacterial agents. Historically, genome-based depolymerase discovery has relied on sequence homology searches (e.g., BLAST for “tail-spikes” or “tail-fibres”), yet many experimentally validated depolymerases exhibit low or negligible homology to database entries, causing significant bottlenecks in wet-lab screening and functional annotation. DepoRanker specifically addresses these challenges by employing machine learning to triage every candidate protein in a proteome, furnishing a concise, ranked shortlist for experimental verification.

2. Dataset Construction and Curation

DepoRanker’s model is trained on curated phage proteomes:

  • Positives (“known depolymerases”): 39 experimentally characterized proteins sourced from 24 unique Klebsiella phage proteomes, referenced in peer-reviewed studies (e.g., Hsu et al. 2013; Majkowska-Skrobek et al. 2016; Pan et al. 2017; Li et al. 2022).
  • Negatives (“non-depolymerases”): All other annotated proteins (n = 2,601) in these 24 proteomes, explicitly excluding the 39 positives.
  • External hold-out set: Five recently characterized depolymerases along with their complete parent proteomes (see Table 2 of the source manuscript).

Protein sequences were downloaded in FASTA format and parsed using libraries such as BioPython’s SeqIO. Aside from excluding previously labeled positives from the negative set, no sequence filtering (e.g., by length or manual curation) was performed.
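The labelling scheme above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it uses a small standard-library FASTA reader as a stand-in for BioPython's SeqIO, and the names `read_fasta`, `label_proteome`, and the toy record IDs are hypothetical.

```python
from typing import Iterable, Iterator, List, Set, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted lines."""
    record_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            # The ID is taken as the first whitespace-delimited token after '>'.
            header = line[1:].split()
            record_id, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def label_proteome(
    records: Iterable[Tuple[str, str]], positive_ids: Set[str]
) -> List[Tuple[str, str, int]]:
    """Label known depolymerases 1 and every other protein 0 (no other filtering)."""
    return [(rid, seq, int(rid in positive_ids)) for rid, seq in records]


# Toy proteome (made-up sequences) with one hypothetical known depolymerase.
example = """>gp_001 hypothetical tail spike
MKTAY
IAKQR
>gp_002 capsid
MSDNE
""".splitlines()
labelled = label_proteome(read_fasta(example), positive_ids={"gp_001"})
```

In practice `SeqIO.parse(path, "fasta")` would replace `read_fasta`; the labelling step stays the same.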

3. Protein Feature Representation

DepoRanker adopts a single, interpretable feature set: the non-normalized amino-acid composition vector. Each sequence $S$ of length $L$ is encoded as:

$$x_i = \text{number of occurrences of amino acid } i \in \{A, C, D, \ldots, Y\} \text{ in } S,$$

so that $x = (x_1, x_2, \ldots, x_{20}) \in \mathbb{N}^{20}$. No additional features (such as k-mer frequencies, predicted disorder, isoelectric point, or molecular weight) are incorporated in the current release. This design prioritizes transparency and generalizability within the Klebsiella phage context. As stated by the authors, “a simple amino acid composition of a protein is used as its feature representation.”
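As a rough sketch of this encoding, the raw count vector can be computed with the standard library. The alphabetical residue order and the choice to ignore non-standard symbols (such as X or B) are assumptions made here for illustration; the source does not specify either.

```python
from collections import Counter
from typing import List

# The 20 standard residues, ordered alphabetically by one-letter code (assumed order).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence: str) -> List[int]:
    """Return raw (non-normalized) counts of each standard amino acid."""
    counts = Counter(sequence.upper())
    # Symbols outside the 20-letter alphabet are simply not counted (assumption).
    return [counts[aa] for aa in AMINO_ACIDS]


vec = composition_vector("MKTAYIAKQR")  # made-up 10-residue sequence
```

Because the counts are not normalized, the entries sum to the sequence length whenever the sequence contains only standard residues, so protein length is implicitly part of the feature vector.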

4. Machine Learning Approach and Ranking Strategy

4.1 Learning Algorithm and Hyperparameters

DepoRanker implements XGBoost in its “rank:pairwise” mode. For each phage proteome $g$, protein sequences are segregated into:

  • $P_g$: positives (known depolymerases)
  • $N_g$: negatives (non-depolymerases)

The ranking model learns parameters $\theta$ such that, for all $(i, j) \in P_g \times N_g$,

$$f(x_i;\theta) \gg f(x_j;\theta),$$

where $f(\cdot;\theta)$ is the ensemble’s real-valued output score.

Key hyperparameters, determined via cross-validation, are:

  • learning_rate = 0.1
  • subsample = 0.9
  • max_depth = 3
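A training call with these settings might look like the sketch below. It assumes that each proteome forms one ranking group, that rows are sorted so each proteome's proteins are contiguous, and that 100 boosting rounds are used; the round count and the helper names `group_sizes` and `train_ranker` are illustrative choices, not values from the source.

```python
from itertools import groupby
from typing import Dict, List, Sequence

# Objective and hyperparameters as reported in the text.
RANK_PARAMS: Dict[str, object] = {
    "objective": "rank:pairwise",
    "learning_rate": 0.1,
    "subsample": 0.9,
    "max_depth": 3,
}


def group_sizes(proteome_ids: Sequence[str]) -> List[int]:
    """Sizes of consecutive runs of identical proteome IDs (rows sorted by proteome)."""
    return [len(list(run)) for _, run in groupby(proteome_ids)]


def train_ranker(features, labels, proteome_ids, num_boost_round: int = 100):
    """Fit one pairwise ranker; num_boost_round=100 is an assumed value."""
    import xgboost as xgb  # imported lazily so the helpers above work without it

    dtrain = xgb.DMatrix(features, label=labels)
    # Pairs are formed only within a group, i.e. within a single proteome.
    dtrain.set_group(group_sizes(proteome_ids))
    return xgb.train(RANK_PARAMS, dtrain, num_boost_round=num_boost_round)
```

Grouping by proteome matches the per-proteome definition of $P_g$ and $N_g$: a positive is only compared against negatives from its own phage.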

4.2 Cross-Validation and Model Deployment

To prevent train/test leakage, positives are clustered at 10% identity using CD-HIT, yielding 7 non-redundant clusters. Each fold in a 7-fold cross-validation uses one cluster for testing and the remainder for training, ensuring that “proteomes containing similar depolymerases [are] never split across train/test in the same fold.” The seven resulting models are deployed together as an ensemble, with the consensus score for a protein $x$ calculated as:

$$f_{\text{ensemble}}(x) = \frac{1}{7} \sum_{k=1}^{7} f_k(x).$$
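The cluster-wise split can be illustrated as follows. The sketch assumes each proteome has already been assigned a single cluster ID derived from the CD-HIT output; that mapping, the function name, and the toy phage names are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_cluster_out(
    proteome_cluster: Dict[str, int]
) -> Iterator[Tuple[int, List[str], List[str]]]:
    """Yield (held_out_cluster, train_proteomes, test_proteomes) for each cluster."""
    for held_out in sorted(set(proteome_cluster.values())):
        test = sorted(p for p, c in proteome_cluster.items() if c == held_out)
        train = sorted(p for p, c in proteome_cluster.items() if c != held_out)
        yield held_out, train, test


# Toy assignment: four proteomes in three clusters gives three folds.
toy = {"phage_A": 0, "phage_B": 0, "phage_C": 1, "phage_D": 2}
folds = list(leave_one_cluster_out(toy))
```

With seven clusters this produces seven folds, and training one model per fold gives the seven members of the deployed ensemble.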

4.3 Scoring and Ranking

For each query proteome, all proteins are scored using $f_{\text{ensemble}}(x)$. They are then ranked in descending order of score, with the highest-ranking proteins designated as the most probable depolymerases.
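Scoring and ranking reduce to averaging the fold models' outputs and sorting. In the sketch below each model is represented as a plain callable so the example runs without XGBoost; the toy models, feature vectors, and function names are placeholders rather than anything from the tool.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Scorer = Callable[[Sequence[float]], float]


def ensemble_score(models: Sequence[Scorer], x: Sequence[float]) -> float:
    """Mean of the individual model scores for one feature vector."""
    return sum(model(x) for model in models) / len(models)


def rank_proteome(
    models: Sequence[Scorer], proteome: Dict[str, Sequence[float]]
) -> List[Tuple[str, float, int]]:
    """Return (protein_id, score, rank) tuples sorted by descending score."""
    scored = [(pid, ensemble_score(models, x)) for pid, x in proteome.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(pid, score, rank) for rank, (pid, score) in enumerate(scored, start=1)]


# Three toy "models" standing in for the seven fold-specific rankers.
toy_models = [lambda x: x[0], lambda x: x[0] + x[1], lambda x: 2.0 * x[1]]
toy_proteome = {"gp_a": [1.0, 0.0], "gp_b": [0.0, 3.0], "gp_c": [2.0, 2.0]}
ranking = rank_proteome(toy_models, toy_proteome)
```

Ties keep their input order here because Python's sort is stable; how the tool itself breaks ties is not stated in the source.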

5. Model Evaluation and Comparative Performance

DepoRanker is evaluated with several performance metrics:

  • Rank of First Positive Prediction (RFPP): For each proteome, the RFPP is the rank at which the first true depolymerase appears. In cross-validation across 24 proteomes, the median RFPP was 1, and the maximal (100th-percentile) RFPP was 3. In contrast, BLAST-based ranking exhibited a maximal RFPP of 31, and a random baseline had a median of 52.
| Phage Accession | Proteome Size | BLAST RFPP | DepoRanker RFPP |
|---|---|---|---|
| NC027399 | 540 | 1 | 1 |
| AB797215 | 203 | 20 | 1 |
| JF501022 | 77 | 31 | 2 |
| OU509535 | 528 | 8 | 3 |
  • Receiver Operating Characteristic (ROC) and AUROC: DepoRanker achieved AUROC = 0.99, surpassing BLAST’s AUROC of 0.94.
  • Precision-Recall and AUCPR: DepoRanker’s area under the precision-recall curve was 0.42, compared to 0.37 for BLAST.

On the independent external hold-out of five newly characterized depolymerases, DepoRanker ranked the correct protein at positions {1,1,1,2,3}, for an average RFPP of 1.6. Although no formal statistical significance tests were reported, the separation from BLAST is considerable.
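The RFPP metric is simple to compute from per-protein scores and labels. The sketch below uses made-up scores for the toy case and the hold-out ranks quoted above to reproduce the reported average; the function name is illustrative.

```python
from statistics import mean, median
from typing import Sequence


def rfpp(scores: Sequence[float], labels: Sequence[int]) -> int:
    """Rank (1-based) of the highest-scoring positive within one proteome."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            return rank
    raise ValueError("proteome contains no positive protein")


# Toy proteome: the single positive has the second-highest score.
toy_rfpp = rfpp([0.2, 0.9, 0.7, 0.1], [0, 0, 1, 0])

# Hold-out ranks as reported in the text.
holdout_ranks = [1, 1, 1, 2, 3]
holdout_mean = mean(holdout_ranks)      # (1 + 1 + 1 + 2 + 3) / 5
holdout_median = median(holdout_ranks)
```

An RFPP of 1 means the top-ranked protein is a true depolymerase, so lower values translate directly into fewer candidates to test at the bench.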

6. Software Architecture and Availability

DepoRanker is accessible both as a web server and as open-source software:

6.1 Web and Backend Implementation

  • Frontend: Static HTML upload page for FASTA input, with a “Rank Proteome” button.
  • Backend: Python Flask microservice. The backend handles FASTA parsing, amino acid vector computation, model loading, scoring with the XGBoost ensemble, and output HTML rendering with a downloadable CSV.
  • Input: Complete protein sequences in FASTA format; there is no enforced genus restriction, though the model is Klebsiella-centric.
  • Output: CSV file with columns: ProteinID, Score, Rank, sorted by descending score.
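The output step could be written roughly as follows. Only the column names and the descending sort come from the description above; the comma delimiter, three-decimal score formatting, and function name are assumptions.

```python
import csv
import io
from typing import Iterable, Tuple


def write_ranking_csv(scored: Iterable[Tuple[str, float]], handle) -> None:
    """Write ProteinID, Score, Rank rows sorted by descending score."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["ProteinID", "Score", "Rank"])
    ordered = sorted(scored, key=lambda item: item[1], reverse=True)
    for rank, (protein_id, score) in enumerate(ordered, start=1):
        # Three decimal places is an assumed display precision.
        writer.writerow([protein_id, f"{score:.3f}", rank])


buffer = io.StringIO()
write_ranking_csv([("gp_057", 0.582), ("gp_001", 0.924), ("gp_102", 0.645)], buffer)
csv_text = buffer.getvalue()
```

Writing to a file-like handle keeps the function usable both for a downloadable file and for an in-memory response in a web backend.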

6.2 Dependencies

  • Python 3.x, Flask
  • XGBoost Python package
  • BioPython, NumPy, Pandas
  • CD-HIT v4.8.1 (for training/extended clustering)
  • SHAP (for feature importance analysis)

7. Practical Application and Workflow

A typical application proceeds as follows:

  1. Assemble a FASTA file (e.g., kleb_phage.fasta) containing all protein sequences from a newly sequenced Klebsiella phage.
  2. Submit the FASTA file at https://deporanker.dcs.warwick.ac.uk/.
  3. Download and open the results CSV, which includes a ranked list of proteins by likelihood of being depolymerases.
  4. Prioritize the top 1–3 candidates for cloning, expression, and subsequent wet-lab assay validation.

Example output:

| ProteinID | Score | Rank |
|---|---|---|
| gp_001 | 0.924 | 1 |
| gp_102 | 0.645 | 2 |
| gp_057 | 0.582 | 3 |

This workflow reduces the experimental burden, typically narrowing wet-lab screens from tens of candidates to three or fewer per proteome.
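Selecting the shortlist from the downloaded file can be scripted. This sketch assumes a comma-separated file with the ProteinID, Score, and Rank columns; the helper name and the in-memory stand-in for the downloaded file are illustrative.

```python
import csv
import io
from typing import List


def top_candidates(csv_handle, k: int = 3) -> List[str]:
    """Return the ProteinIDs of the k best-ranked proteins."""
    rows = list(csv.DictReader(csv_handle))
    rows.sort(key=lambda row: int(row["Rank"]))
    return [row["ProteinID"] for row in rows[:k]]


# In-memory stand-in for the results file, using the example rows above.
results = io.StringIO(
    "ProteinID,Score,Rank\n"
    "gp_001,0.924,1\n"
    "gp_102,0.645,2\n"
    "gp_057,0.582,3\n"
)
shortlist = top_candidates(results, k=2)
```

For a real run, `open("results.csv", newline="")` (file name hypothetical) would replace the `io.StringIO` object.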

8. Impact, Limitations, and Future Directions

DepoRanker demonstrates marked gains over BLAST-based homology screens, both in ranking accuracy (RFPP) and operational efficiency. Its successful use of a simple 20-dimensional amino acid composition vector—despite the absence of higher-order sequence or structural descriptors—highlights the power of carefully tuned ranking ensembles in proteomic triage.

Limitations include:

  • Klebsiella-centric training: The model’s performance on distant genera is untested.
  • Restricted feature space: Only amino acid counts are currently employed; adding features such as dipeptide/tripeptide k-mers, secondary structure predictions, or domain profiles could offer further gains.
  • Retraining needs: Ongoing addition of experimentally verified depolymerases will require periodic retraining or model fine-tuning.
  • Extension potential: Adopting deeper sequence models (e.g., transformer architectures) could facilitate detection of more remote evolutionary relationships or nonlocal sequence signals.
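As one example of the richer features mentioned above, dipeptide counts extend the 20-dimensional vector to 400 dimensions. This is a sketch of a possible extension, not part of the current DepoRanker release; the names and the handling of non-standard residues are assumptions.

```python
from itertools import product
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered residue pairs, in a fixed order.
DIPEPTIDES: List[str] = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
DIPEPTIDE_INDEX: Dict[str, int] = {dp: i for i, dp in enumerate(DIPEPTIDES)}


def dipeptide_counts(sequence: str) -> List[int]:
    """Count overlapping dipeptides; pairs with non-standard residues are skipped."""
    vec = [0] * len(DIPEPTIDES)
    seq = sequence.upper()
    for i in range(len(seq) - 1):
        idx = DIPEPTIDE_INDEX.get(seq[i:i + 2])
        if idx is not None:
            vec[idx] += 1
    return vec


dp_vec = dipeptide_counts("AAAC")  # overlapping pairs: AA, AA, AC
```

With only 39 positives in the training set, a 400-dimensional feature space raises the risk of overfitting, which may be one reason to prefer the simpler composition vector.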

DepoRanker offers the community a transparent, reproducible platform for accelerating depolymerase discovery, directly contributing to phage-based strategies targeting multidrug-resistant Klebsiella.
