In-Context Learning can distort the relationship between sequence likelihoods and biological fitness (2504.17068v1)

Published 23 Apr 2025 in cs.LG and q-bio.BM

Abstract: Language models have emerged as powerful predictors of the viability of biological sequences. During training, these models learn the rules of the grammar obeyed by sequences of amino acids or nucleotides. Once trained, these models can take a sequence as input and produce a likelihood score as an output; a higher likelihood implies adherence to the learned grammar and correlates with experimental fitness measurements. Here we show that in-context learning can distort the relationship between fitness and likelihood scores of sequences. This phenomenon manifests most prominently as anomalously high likelihood scores for sequences that contain repeated motifs. We use protein language models with different architectures trained on the masked language modeling objective for our experiments, and find transformer-based models to be particularly vulnerable to this effect. This behavior is mediated by a look-up operation in which the model infers the identity of the masked position by using the other copy of the repeated motif as a reference. This retrieval behavior can override the model's learned priors. The phenomenon persists for imperfectly repeated sequences, and extends to other kinds of biologically relevant features, such as reverse-complement motifs in RNA sequences that fold into hairpin structures.
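The look-up effect described in the abstract can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration and is not the paper's method or model: `toy_masked_probs` is a hypothetical stand-in for a masked language model (a uniform prior plus a simple copy-from-matching-context rule), `copy_weight` and `k` are arbitrary parameters, and the score is a pseudo-log-likelihood (mask one position at a time and sum the log-probabilities of the true residues), which is one common way to score a sequence with a masked model.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_masked_probs(seq, pos, k=3, copy_weight=0.9):
    """Hypothetical masked predictor: P(residue at `pos` | rest of `seq`).

    Learned prior: uniform over amino acids (an assumption, for simplicity).
    Look-up term: if the k residues left of `pos` also occur elsewhere,
    the residue that follows that other occurrence receives extra mass.
    """
    prior = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
    context = seq[max(0, pos - k):pos]
    if len(context) < k:
        return prior  # not enough left context to look anything up

    votes = {}
    for j in range(k, len(seq)):
        # Skip the masked position itself and any window that would
        # need to read the masked residue.
        if j == pos or (pos + 1 <= j <= pos + k):
            continue
        if seq[j - k:j] == context:
            votes[seq[j]] = votes.get(seq[j], 0) + 1

    if not votes:
        return prior
    total = sum(votes.values())
    return {
        a: (1.0 - copy_weight) * prior[a] + copy_weight * votes.get(a, 0) / total
        for a in AMINO_ACIDS
    }


def pseudo_log_likelihood(seq):
    """Sum of log P(true residue) with each position masked in turn."""
    return sum(math.log(toy_masked_probs(seq, i)[seq[i]]) for i in range(len(seq)))


motif = "MKTAYIAKQR"          # arbitrary illustrative motif
repeated = motif + motif       # contains a perfectly repeated motif
control = motif + motif[::-1]  # same composition, no repeated 3-mer context

print("repeated:", pseudo_log_likelihood(repeated))
print("control: ", pseudo_log_likelihood(control))
```

In this toy setting the repeated sequence scores higher than the control purely because the copy rule fires, even though nothing about either sequence's fitness has been modeled. That is the kind of decoupling between likelihood and fitness the abstract describes, here reproduced only by construction.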
