Leveraging Retrieval-Augmented Generation and Large Language Models to Predict SERCA-Binding Protein Fragments from Cardiac Proteomics Data

Published 26 Feb 2025 in q-bio.QM | (2502.19574v1)

Abstract: Large language models (LLMs) have shown promise in various natural language processing tasks, and they have also been applied to proteomics data to classify protein fragments. In this study, we curated a limited mass spectrometry dataset of thousands of protein fragments from proteins that appear to be attached to the endoplasmic reticulum (ER) in cardiac cells; a fraction of these fragments was cloned and characterized for their impact on SERCA, an ER calcium pump. With this limited dataset, we sought to determine whether LLMs could correctly predict whether a new protein fragment could bind SERCA, based only on its sequence and a few biophysical characteristics, such as hydrophobicity, determined from that sequence. To do so, we generated random sequences based on cloned fragments, embedded the fragments into a retrieval-augmented generation (RAG) database to group them by similarity, then fine-tuned LLM prompts to predict whether a novel sequence could bind SERCA. We benchmarked this approach using multiple open-source LLMs, namely the Meta Llama series, and embedding functions commonly available in the Hugging Face repository. We then assessed how well this approach generalizes to classifying novel protein fragments from mass spectrometry that were not initially cloned for functional characterization. By further tuning the prompt to account for motifs, such as ER retention sequences, we improved the classification accuracy and identified several proteins predicted to localize to the endoplasmic reticulum and bind SERCA, including Ribosomal Protein L2 and selenoprotein S. Although our results are based on proteomics data from cardiac cells, our approach demonstrates the potential of LLMs to identify novel protein interactions and functions from very limited proteomic data.
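
The workflow outlined in the abstract (derive sequence features, retrieve similar labeled fragments, then prompt an LLM) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The overlapping k-mer count vector stands in for a learned embedding function, the Kyte-Doolittle scale is one common choice of hydrophobicity measure, the C-terminal KDEL check is one well-known ER retention signal, and every function name, sequence, and label below is a made-up placeholder rather than data from the study. The sketch stops at building the prompt text and does not call any LLM.

```python
import math
from collections import Counter

# Kyte-Doolittle hydropathy scale (standard published values).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def mean_hydrophobicity(seq):
    """Average Kyte-Doolittle hydropathy over the residues of seq."""
    return sum(KD[aa] for aa in seq) / len(seq)

def has_er_retention_motif(seq):
    """True if seq ends with the canonical KDEL ER retention signal."""
    return seq.endswith("KDEL")

def kmer_vector(seq, k=2):
    """Toy embedding (assumption): counts of overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, labeled, top_k=2):
    """Return the top_k labeled fragments most similar to query."""
    q = kmer_vector(query)
    ranked = sorted(
        labeled,
        key=lambda item: cosine(q, kmer_vector(item[0])),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(query, labeled, top_k=2):
    """Assemble a retrieval-augmented prompt for a binding question."""
    lines = ["Labeled fragments most similar to the query:"]
    for seq, binds in retrieve(query, labeled, top_k):
        lines.append(
            f"- {seq} | hydrophobicity={mean_hydrophobicity(seq):.2f}"
            f" | binds_SERCA={binds}"
        )
    lines.append(
        f"Query: {query} | hydrophobicity={mean_hydrophobicity(query):.2f}"
        f" | ER_retention_motif={has_er_retention_motif(query)}"
    )
    lines.append("Does the query bind SERCA? Answer yes or no.")
    return "\n".join(lines)

# Invented placeholder fragments and labels, for illustration only.
toy_labeled = [
    ("LLIVLLAVFLI", True),
    ("DEKRDEKRDE", False),
    ("AVLLIVGAKDEL", True),
]

if __name__ == "__main__":
    print(build_prompt("LLIVLL", toy_labeled))
```

In a fuller pipeline, the k-mer vector would likely be replaced by a pretrained embedding model and the resulting prompt would be sent to an LLM for a yes/no answer; both steps are left out here to keep the sketch self-contained.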
