Protein Autoregressive Modeling (PAR)
- Protein Autoregressive Modeling is a framework that factorizes protein sequences, structures, and trajectories into sequential conditional distributions for scalable generative modeling.
- It leverages advanced neural architectures, including transformers, diffusion models, and graph neural networks, to enable efficient sequence generation, structure backmapping, and dynamic simulation.
- By mitigating exposure bias and facilitating zero-shot generalization, PAR drives innovations in protein design, fitness prediction, and multi-task bioengineering applications.
Protein Autoregressive Modeling (PAR) is a probabilistic framework for modeling, generating, and predicting protein sequences, structures, and dynamics. Distinct from classical methods relying on holistic or energy-based modeling, PAR factorizes high-dimensional conditional distributions into a sequence of univariate or low-dimensional distributions, permitting efficient, scalable, and expressive handling of sequence, structure, and time-evolution in proteins. This approach has been applied to sequence generation, conformational ensemble modeling, side-chain packing, all-atom backmapping, and temporal synthesis of molecular trajectories, leveraging neural architectures such as transformers, diffusion models, graph neural networks, and normalizing flows. Recent PAR innovations provide strong performance on generative, predictive, and design tasks, with models capable of robust zero-shot generalization, transfer to out-of-distribution inputs, and combinatorial multi-task handling.
1. Formalism and Probabilistic Structure
The cornerstone of PAR is the autoregressive factorization of a target variable (sequence, structure, or trajectory) into conditionals, enabling left-to-right (or coarse-to-fine, temporal, or spatial) decomposition:
- Sequence-level PAR: Given a protein primary sequence $x = (x_1, \ldots, x_L)$,

$$P(x) = \prod_{i=1}^{L} P(x_i \mid x_1, \ldots, x_{i-1})$$
This factorization forms the basis for generative sequence modeling, as in arDCA (Trinquier et al., 2021), RITA (Hesslow et al., 2022), and Tranception (Notin et al., 2022).
- Structure-level PAR: For hierarchical or residue-wise all-atom reconstruction from coarse representations ($c$ as the Cα trace, $a$ as all-atom coordinates),

$$P(a \mid c) = \prod_{i=1}^{L} P(a_i \mid a_{<i}, c)$$

Each conditional $P(a_i \mid a_{<i}, c)$ can itself be a complex distribution, potentially implemented by diffusion models or flow-based models as in DiAMoNDBack (Jones et al., 2023) and PAR (Qu et al., 4 Feb 2026).
- Trajectory-level PAR: For time-ordered conformational ensembles $X^{(1)}, X^{(2)}, \ldots, X^{(T)}$,

$$P(X^{(1:T)}) = \prod_{t=1}^{T} P(X^{(t)} \mid X^{(1)}, \ldots, X^{(t-1)})$$

Temporal multi-scale PAR further decomposes each conditional $P(X^{(t)} \mid X^{(<t)})$ into slow and fast dynamical components, as in TEMPO (Xu et al., 24 Oct 2025) and ConfRover (Shen et al., 23 May 2025).
- Side-chain torsion-level PAR: For autoregression over the side-chain torsion angles $\chi_1, \ldots, \chi_4$,

$$P(\chi_1, \ldots, \chi_4 \mid \text{backbone}) = \prod_{k=1}^{4} P(\chi_k \mid \chi_1, \ldots, \chi_{k-1}, \text{backbone})$$

Each torsion angle $\chi_k$ is generated in sequence, conditioned on previously generated angles and the backbone (Zhang et al., 2023).
These decompositions enable tractable likelihood computation, exact ancestral sampling, and explicit manipulations for downstream design and fitness prediction.
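As a concrete illustration, the sequence-level factorization above supports exact likelihood evaluation and exact ancestral sampling. The following is a minimal sketch in which `toy_conditional` is a uniform, context-ignoring stand-in for a learned conditional — the function names and alphabet handling are illustrative assumptions, not any published model's API.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_conditional(context):
    """Stand-in for a learned conditional P(x_i | x_<i); here a uniform
    categorical that ignores its context."""
    return np.full(len(AMINO_ACIDS), 1.0 / len(AMINO_ACIDS))

def log_likelihood(seq, conditional):
    """Exact log P(x) = sum_i log P(x_i | x_1..x_{i-1}) under the AR factorization."""
    return sum(np.log(conditional(seq[:i])[AMINO_ACIDS.index(a)])
               for i, a in enumerate(seq))

def ancestral_sample(length, conditional, rng):
    """Exact ancestral sampling: draw x_1, then x_2 | x_1, and so on."""
    seq = ""
    for _ in range(length):
        probs = conditional(seq)
        seq += AMINO_ACIDS[rng.choice(len(AMINO_ACIDS), p=probs)]
    return seq

rng = np.random.default_rng(0)
sampled = ancestral_sample(10, toy_conditional, rng)
ll = log_likelihood(sampled, toy_conditional)
```

Replacing `toy_conditional` with a trained network's softmax output yields the tractable-likelihood and sampling protocols used by the sequence models cited above.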
2. Representative Architectures and Methodologies
Research in PAR has yielded a range of architectures tailored for protein sequence, structure, and dynamics:
- Autoregressive Sequence Transformers: Decoder-only transformer models (RITA (Hesslow et al., 2022), Tranception (Notin et al., 2022)), employing causal attention, rotary or ALiBi positional embeddings, and token-level softmax likelihoods, trained on UniRef-100 and related databases.
- Hierarchical and Multi-Scale Generative Models: Multi-scale architectures break down the generative process into successive resolutions or scales. In (Qu et al., 4 Feb 2026), structure generation proceeds by first synthesizing coarse backbones and successively refining to higher resolutions, with each stochastically sampled scale conditioned autoregressively on earlier (downsampled) representations using a transformer encoder and flow-based decoder.
- Diffusion-Driven Autoregressive Models:
- Structure Backmapping: DiAMoNDBack (Jones et al., 2023) applies residue-wise denoising diffusion probabilistic modeling, embedding local context via U-Net architectures and conditioning the denoising process on both the Cα trace and spatially local, previously decoded fragment atoms.
- Side-chain Packing: DiffPack (Zhang et al., 2023) applies VE-SDE-based diffusion on the torus for side-chain torsions, decomposing the full side-chain conformer into an autoregressive chain of conditional diffusion models.
- Temporal and Spatio-Temporal PAR:
- Trajectory Generation: TEMPO (Xu et al., 24 Oct 2025) and ConfRover (Shen et al., 23 May 2025) employ causal temporal models—GRUs or transformers—combined with spatial encoding (Invariant Point Attention, pair and single-residue features) and physically-informed SDE-based decoders. ConfRover combines a Llama-style causal transformer for latent temporal integration with an SE(3) diffusion decoder for frame-level 3D structure sampling.
- Multi-Task and Prompt-driven Frameworks: Prot2Token (Pourmirzaei et al., 26 May 2025) unifies classification, regression, and sequence-to-sequence tasks under a next-token autoregressive decoder, leveraging task-specific learned prompt tokens and cross-attention between a protein encoder and shared decoder.
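The coarse-to-fine logic behind the hierarchical models above can be sketched numerically. Here `refine` is a hypothetical stochastic upsampler standing in for a learned flow-based decoder, and the averaging `downsample` and Gaussian noise model are illustrative assumptions only — not the architecture of (Qu et al., 4 Feb 2026).

```python
import numpy as np

def downsample(coords, factor):
    """Coarsen an (N, 3) coordinate array by averaging groups of `factor` points."""
    n = (len(coords) // factor) * factor
    return coords[:n].reshape(-1, factor, 3).mean(axis=1)

def refine(coarse, factor, rng, noise=0.1):
    """Hypothetical stochastic upsampler: replicate each coarse point `factor`
    times and perturb it — a stand-in for a learned flow-based decoder."""
    fine = np.repeat(coarse, factor, axis=0)
    return fine + noise * rng.standard_normal(fine.shape)

def multiscale_generate(n_residues, scales, rng):
    """Coarse-to-fine AR generation: sample the coarsest scale, then condition
    each finer scale autoregressively on the previously generated one."""
    coords = rng.standard_normal((n_residues // int(np.prod(scales)), 3))
    for factor in scales:
        coords = refine(coords, factor, rng)
    return coords

rng = np.random.default_rng(0)
backbone = multiscale_generate(32, scales=[2, 2], rng=rng)
```

Each call to `refine` plays the role of one conditional in the scale-wise factorization: the finer representation is sampled given the coarser one.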
3. Training, Objectives, and Exposure-Bias Mitigation
Training paradigms across PAR frameworks leverage efficient conditional likelihood or score-based objectives:
- Cross-Entropy/Negative Log-Likelihood: Sequence models optimize the negative log-probability of observed residues (or tokens), either directly (Trinquier et al., 2021, Hesslow et al., 2022, Pourmirzaei et al., 26 May 2025) or via weighted objectives for multi-task learning.
- Diffusion and Score Matching Losses: When the generative process is diffusion-based (as in DiAMoNDBack (Jones et al., 2023), PAR (Qu et al., 4 Feb 2026), DiffPack (Zhang et al., 2023)), models minimize expected squared error between predicted and true noise in latent or Cartesian space, or torus-wrapped score-matching in angular coordinates.
- Scheduled Sampling and Noisy Context Learning: Autoregressive models are susceptible to exposure bias, where divergence between training (conditioning on ground-truth context) and inference (conditioning on model predictions) hurts quality. Modern PAR frameworks (e.g., (Qu et al., 4 Feb 2026)) employ noisy context learning by injecting noise into conditioning features during training and scheduled sampling by probabilistically replacing ground-truth context with predictions, resulting in improved robustness and inference performance.
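The two mitigation strategies just described can be sketched in a few lines. `scheduled_context` below is an illustrative helper (not from any cited codebase): with some probability it swaps each ground-truth context feature for the model's own prediction, and it injects Gaussian noise into the continuous conditioning features.

```python
import numpy as np

def scheduled_context(ground_truth, model_preds, p_replace, noise_std, rng):
    """Exposure-bias mitigation during training: mix ground-truth and
    model-predicted context (scheduled sampling), then perturb the
    resulting continuous features (noisy context learning)."""
    use_pred = rng.random(len(ground_truth)) < p_replace
    ctx = np.where(use_pred, model_preds, ground_truth)
    return ctx + noise_std * rng.standard_normal(ctx.shape)

rng = np.random.default_rng(0)
gt = np.linspace(0.0, 1.0, 8)    # ground-truth context features
pred = gt + 0.2                  # stand-in for the model's own predictions
ctx = scheduled_context(gt, pred, p_replace=0.5, noise_std=0.01, rng=rng)
```

In practice `p_replace` is annealed over training so the model gradually learns to condition on its own (imperfect) outputs, matching the inference-time distribution.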
4. Evaluation, Empirical Results, and Benchmarks
PAR methods are assessed using both standard generative modeling metrics and specialized evaluation for biological plausibility:
| Application | Key Metrics | State-of-the-art (selected works) |
|---|---|---|
| Sequence generation | Perplexity, cross-entropy, zero-shot fitness (ρ) | RITA: perplexity 5.48@XLarge, ρ=0.387 (Hesslow et al., 2022) |
| Structure backmapping | Bond %, Clash %, Diversity, sc-RMSD, FPSD, fS | DiAMoNDBack: Bond 99.18%, Clash 0.57% (Jones et al., 2023); PAR: sc-RMSD 1.01Å (Qu et al., 4 Feb 2026) |
| Ensemble/Trajectory | Pairwise RMSD, RMSF, PCA-W2, tICA, JSD, F1 | TEMPO: RMSD 1.78Å, clash 4.75% (Xu et al., 24 Oct 2025); ConfRover: multi-start Pearson 0.77 (Shen et al., 23 May 2025) |
| Side-chain packing | χ angle MAE, Acc@20°, atom RMSD | DiffPack: MAE 15.35°, Acc@20° 69.5% on CASP13 (Zhang et al., 2023) |
Experimental evidence shows PAR models achieving generative parity with, or superiority over, energy-based models and non-autoregressive deep VAEs. Notably, stochastic ensemble sampling and motif-prompted generalization are enabled directly by the AR construction, opening new protocols in design and modeling (Qu et al., 4 Feb 2026, Jones et al., 2023).
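As a side note on the side-chain metrics in the table, χ-angle MAE and Acc@20° must respect the periodicity of torsion angles (359° vs 1° is a 2° error, not 358°). A minimal sketch of such wrapped-angle evaluation, with illustrative function names:

```python
import numpy as np

def wrapped_deg_error(pred, true):
    """Angular error in degrees on the circle: the shorter arc between angles."""
    d = np.abs(np.asarray(pred) - np.asarray(true)) % 360.0
    return np.minimum(d, 360.0 - d)

def chi_metrics(pred, true, tol=20.0):
    """chi-angle MAE and Acc@tol: mean wrapped error, plus the fraction of
    angles recovered within `tol` degrees of the reference."""
    err = wrapped_deg_error(pred, true)
    return err.mean(), (err < tol).mean()

mae, acc = chi_metrics([10.0, 350.0, 180.0, 95.0],
                       [5.0,    5.0, 150.0, 90.0])
# errors are 5, 15, 30, 5 degrees -> MAE 13.75, Acc@20 = 0.75
```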
5. Applications and Impact
Protein Autoregressive Modeling underpins advances in:
- De novo sequence generation: Fast, scalable sampling of novel protein sequences for library design (Hesslow et al., 2022, Trinquier et al., 2021).
- Fitness and mutational effect prediction: Quantitative log-likelihood ratios of mutants vs wild-type, often matching deep MSA-based methods, and handling substitutions, indels, and multiple mutants (Notin et al., 2022, Hesslow et al., 2022, Trinquier et al., 2021).
- Structure backmapping and side-chain packing: High-fidelity restoration of atomistic detail from coarse input, enabling realistic physical simulations and structure completion (Jones et al., 2023, Zhang et al., 2023).
- Trajectory and conformational dynamics modeling: Generation of physically plausible conformational ensembles spanning both equilibrium and transition states, with support for path interpolation and time-independent sampling (Xu et al., 24 Oct 2025, Shen et al., 23 May 2025).
- Multi-task, high-throughput prediction: Unification of annotation, regression, and structure tasks in a single framework, supporting accelerated annotation and design (Pourmirzaei et al., 26 May 2025).
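The zero-shot fitness protocol above scores a mutant by the AR log-likelihood ratio against the wild type. A minimal sketch, with a hypothetical toy conditional (here biased toward alanine) in place of a trained model:

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"

def seq_log_prob(seq, conditional):
    """Sum of per-residue AR log-probabilities log P(x_i | x_<i)."""
    return sum(np.log(conditional(seq[:i])[AA.index(a)])
               for i, a in enumerate(seq))

def fitness_score(wild_type, mutant, conditional):
    """Zero-shot mutational-effect score: log P(mutant) - log P(wild type).
    Positive values mean the model assigns higher likelihood to the mutant."""
    return seq_log_prob(mutant, conditional) - seq_log_prob(wild_type, conditional)

def toy_conditional(context):
    """Hypothetical context-ignoring conditional that slightly prefers 'A'."""
    logits = np.array([1.0 if a == "A" else 0.0 for a in AA])
    e = np.exp(logits)
    return e / e.sum()

score = fitness_score("MKVL", "MKAL", toy_conditional)
```

With a trained AR model in place of `toy_conditional`, the same ratio handles substitutions, indels, and multiple simultaneous mutations, since both sequences are scored under the full factorized likelihood.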
A plausible implication is that as AR methods are further integrated with conditional design strategies (e.g., motif scaffolding, prompt-based synthesis), their composability and controllability will drive new experimental workflows in protein engineering.
6. Limitations, Challenges, and Future Directions
Current limitations and open research questions include:
- Integration of explicit physical constraints (force fields, solvent effects) directly into AR generative processes remains incomplete, particularly for long-timescale dynamics and side-chain conformational diversity (Xu et al., 24 Oct 2025).
- Generalization to very large proteins, multi-chain complexes, and rare or disordered topologies is still underexplored, due in part to dataset composition and model scalability constraints (Shen et al., 23 May 2025, Xu et al., 24 Oct 2025).
- Some diffusion-based AR models incur high computational costs, particularly with deep transformers and repeated reverse-diffusion steps (Shen et al., 23 May 2025).
- Sequence-only AR models lack direct integration of structure; hybrid approaches that close the loop between sequence, structure, and function are an area of active research (Pourmirzaei et al., 26 May 2025).
- Further mitigation of exposure bias, improvements in zero-shot motif conditionality, and adaptive multi-scale schemes (beyond current two- or three-scale approaches) are ongoing engineering and algorithmic challenges (Qu et al., 4 Feb 2026, Xu et al., 24 Oct 2025).
Future developments are expected to involve energy-guided AR sampling, broader integration of physics-based priors, active learning closed loops with molecular simulation data, and efficient AR-diffusion hybrids capable of scaling to larger complexes and sampling over broader biophysical landscapes.