Effectiveness of unsupervised strategies for learning cis-regulatory logic

Establish whether restricting unsupervised genomic language model training to annotated regulatory regions, or incorporating evolutionary conservation, improves learning of the cis-regulatory logic that governs regulatory activity across cellular contexts.

Background

Unsupervised and self-supervised genomic language models often learn representations dominated by background sequence variation rather than by sparse regulatory features. Two strategies have been proposed to steer these models toward regulatory signals: training only on annotated regulatory regions, and incorporating evolutionary conservation. However, it remains unclear whether these interventions genuinely improve the learning of cis-regulatory rules relevant to predicting functional genomics readouts.
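
To make the two proposed strategies concrete, the following is a minimal sketch of what each might look like in a training pipeline. The source does not specify how either strategy is implemented, so everything here is an assumption for illustration: the function names, the half-open (start, end) interval format for annotations, the per-base conservation score in [0, 1], and the weighting scheme 1 + alpha * conservation are all hypothetical.

```python
from typing import List, Sequence, Tuple


def restrict_to_regions(sequence: str,
                        regions: Sequence[Tuple[int, int]]) -> List[str]:
    """Strategy 1 (sketch): keep only annotated regulatory intervals.

    Intervals are assumed to be 0-based and half-open, [start, end).
    Intervals that fall outside the sequence are skipped.
    """
    return [sequence[start:end]
            for start, end in regions
            if 0 <= start < end <= len(sequence)]


def conservation_weighted_loss(per_position_loss: Sequence[float],
                               conservation: Sequence[float],
                               alpha: float = 1.0) -> float:
    """Strategy 2 (sketch): up-weight the loss at conserved positions.

    Each position gets weight 1 + alpha * conservation, an assumed scheme;
    alpha = 0 recovers the plain unweighted mean.
    """
    if len(per_position_loss) != len(conservation):
        raise ValueError("loss and conservation tracks must have equal length")
    weights = [1.0 + alpha * c for c in conservation]
    total = sum(w * l for w, l in zip(weights, per_position_loss))
    return total / sum(weights)


# Toy inputs, invented for illustration only.
genome = "ACGTACGTTTGACCAATGCA"
annotated_regions = [(2, 8), (10, 16)]
training_examples = restrict_to_regions(genome, annotated_regions)
print(training_examples)

weighted = conservation_weighted_loss([1.0, 3.0], [0.0, 1.0], alpha=1.0)
print(round(weighted, 4))
```

Neither function says anything about whether the resulting model learns cis-regulatory logic better. That is exactly the open question, and answering it would require evaluating the trained representations on functional genomics readouts.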

References

Their effect on learning cis-regulatory logic, however, has not been rigorously established.

Toward Interpretable and Generalizable AI in Regulatory Genomics (2602.01230 - Nagai et al., 1 Feb 2026) in Section “Modeling Approaches for Regulatory Genomics”