
Future Alignment Score (FAS): Theory & Applications

Updated 4 July 2026
  • Future Alignment Score (FAS) is a context-dependent metric that evaluates a model’s current outputs by forecasting future states or proposal quality.
  • It spans varied formulations—from probabilistic runtime monitoring using proper scoring rules to LLM fine-tuning dynamics with alignment probability estimates.
  • In research proposal forecasting, FAS measures how well a generated proposal anticipates post-cutoff publications, highlighting both performance gains and limitations.

Future Alignment Score (FAS) denotes a future-oriented alignment quantity whose precise meaning depends on the evaluation setting. In recent arXiv work, the term is used explicitly for time-sliced evaluation of research proposals against post-cutoff publications, and it is also a natural mapping for two related constructs that were not originally named FAS: the alignment score monitored online for probabilistic systems and the alignment score governing LLM fine-tuning dynamics (Wang et al., 28 Mar 2026, Henzinger et al., 28 Jul 2025, Huang et al., 18 May 2026). Across these settings, the common object is prospective rather than retrospective: FAS measures how well a model’s current output distribution, current policy, or current proposal anticipates future outcomes that have not yet been observed.

1. Conceptual scope and formal variants

The literature does not present a single canonical FAS. One paper defines FAS directly for research proposal forecasting, whereas two others define alignment scores that can be mapped to FAS in a future-facing sense. This suggests that FAS is best understood as a context-dependent formalization of future alignment rather than a universally standardized metric.

| Setting | Formal quantity | Interpretation |
| --- | --- | --- |
| Probabilistic runtime monitoring | $FAS_t := E_{X_{t+1}\sim q_t}[S(p_t, X_{t+1})]$; cumulative $FAS_{1:t} := \frac{1}{t}\sum_{i=1}^t E_{X_i\sim q_i}[S(p_i, X_i)]$ | Prospective next-state predictive alignment |
| LLM fine-tuning dynamics | $S(\theta) := \mathbb{E}_{x \sim p(x)}\left[\sum_{y \in \mathcal{Y}} \mathbf{1}\{(x,y)\in\mathcal{A}\}\,\pi_\theta(y\mid x)\right]$ | Expected probability of producing an aligned completion |
| Research proposal forecasting | $\mathrm{FAS}(\hat{P}) = \max_{p \in \mathcal{R}_k(\hat{P})} s_{\text{LLM}}(\hat{P}, p)$ | Maximum semantic alignment to retrieved future papers |

These variants differ materially. In the probabilistic-systems setting, FAS is induced by a bounded proper scoring rule and is typically a loss-like quantity for which lower is better under many scores. In the fine-tuning and proposal-forecasting settings, the mapped or explicit FAS is higher when alignment is stronger. The shared feature is temporal orientation: each definition evaluates present model behavior by reference to future states, future completions, or future literature rather than only past fit (Henzinger et al., 28 Jul 2025, Huang et al., 18 May 2026, Wang et al., 28 Mar 2026).

2. Sequential future alignment in probabilistic systems

In the runtime-monitoring formulation, the state space $X$ is finite. At time $t$, a model issues a probabilistic forecast over the next state,

$$p_t(x) = P_{\text{model}}(X_{t+1}=x \mid \text{history}),$$

while the system or environment determines a true but unknown distribution

$$q_t(x) = P_{\text{env}}(X_{t+1}=x \mid \text{history}).$$

The environment chooses $q_t$ based only on past observations, before the model’s prediction, as required by the sequential forecasting guarantees. After predicting $p_t$, the monitor observes the next state $X_{t+1}$ (Henzinger et al., 28 Jul 2025).

The score function is a bounded proper scoring rule

[…]

with range width […]. Properness means that for all […],

[…]

Two examples used experimentally are the Brier score, bounded in […],

[…]

and the spherical score, bounded in […],

[…]

Under this formulation, the instantaneous future alignment quantity is the expected next-step score

$$FAS_t := E_{X_{t+1}\sim q_t}[S(p_t, X_{t+1})].$$

The cumulative version is the paper’s Average Expected Score (AES):

$$FAS_{1:t} := \frac{1}{t}\sum_{i=1}^t E_{X_i\sim q_i}[S(p_i, X_i)].$$

The supplied mapping identifies cumulative FAS with this AES. The definition is explicitly prospective: $FAS_t$ is the model’s expected score against the true next-state distribution before the next state is observed. Averaging these values over time produces a sustained measure of predictive alignment.
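As a minimal sketch of the instantaneous and cumulative quantities, the following Python evaluates both on a toy two-state system. It assumes one common form of the Brier score, $S(p,x) = \sum_y (p(y) - \mathbf{1}\{y = x\})^2$, treated as a loss with range $[0, 2]$; the paper's normalization may differ, and the function names and toy distributions are illustrative rather than taken from the source.

```python
from typing import Dict, List

Dist = Dict[str, float]  # finite distribution: state -> probability


def brier(p: Dist, x: str) -> float:
    """Multi-class Brier loss of forecast p on outcome x (assumed form, range [0, 2])."""
    return sum((prob - (1.0 if state == x else 0.0)) ** 2 for state, prob in p.items())


def instantaneous_fas(p: Dist, q: Dist) -> float:
    """Expected score of forecast p under the true next-state distribution q."""
    return sum(q_x * brier(p, x) for x, q_x in q.items())


def cumulative_fas(forecasts: List[Dist], truths: List[Dist]) -> float:
    """Time average of the instantaneous values over paired forecasts and truths."""
    values = [instantaneous_fas(p, q) for p, q in zip(forecasts, truths)]
    return sum(values) / len(values)
```

With `q` uniform over two states, forecasting `q` itself gives 0.5, while the overconfident forecast that puts all mass on one state gives 1.0. Under this loss-style convention the truthful forecast scores lower, which is what properness requires.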

This formulation is designed for settings in which formal verification depends on a probabilistic model whose validity may drift at runtime. Formal verification results are reliable only insofar as the model remains aligned with reality. Alignment monitoring supplies an online statistical check of that premise by continually testing whether the model’s successor distributions remain close to the system’s realized behavior. The weighted version further connects monitoring to verification-relevant aspects of system behavior, such as BSCC exits for safety or selected groups for fairness, although the paper does not prove that a high or low FAS directly implies a bound on downstream property error (Henzinger et al., 28 Jul 2025).

3. Estimation, confidence sequences, and monitor variants

The average monitor operates sequentially. At each time step it computes a forecast $p_t$, observes $X_{t+1}$, forms the realized score

[…]

and updates the empirical average

[…]

Using the paper’s indexing convention, this is written as

[…]

The monitor outputs a time-uniform high-probability interval

[…]

covering the true AES for all times simultaneously with probability at least […] (Henzinger et al., 28 Jul 2025).

The empirical-variance-adaptive confidence sequence uses

[…]

and the predictable variance process

[…]

The half-width is

[…]

The guarantee is

[…]

The assumptions are boundedness of $S$, non-anticipation of the environment, and full observability of the realized state. No stationarity or mixing assumption is required; the construction is nonparametric. The bounds are nonasymptotic and time-uniform, and the interval width shrinks with empirical-variance adaptation.
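The general shape of such a monitor can be illustrated with a deliberately simpler construction than the paper's. The sketch below keeps a running mean of realized scores and wraps it in a time-uniform interval built from the Azuma-Hoeffding inequality plus a union bound over time, with per-step error budget $\delta_t = \delta / (t(t+1))$. This is a swapped-in technique, not the empirical-variance-adaptive confidence sequence described above, and it is typically wider. The class name and parameters are hypothetical.

```python
import math
from typing import Tuple


class AverageMonitor:
    """Running-mean monitor with a Hoeffding-style time-uniform interval (illustrative)."""

    def __init__(self, score_range: float, delta: float) -> None:
        self.c = score_range  # width of the score's range, assumed known and finite
        self.delta = delta    # total error budget shared across all time steps
        self.t = 0
        self.mean = 0.0

    def update(self, realized_score: float) -> Tuple[float, float]:
        """Consume one realized score and return the current (lower, upper) interval."""
        self.t += 1
        # Incremental mean update: constant memory, no stored history.
        self.mean += (realized_score - self.mean) / self.t
        # Union bound over time: the budgets delta / (t (t + 1)) sum to delta.
        delta_t = self.delta / (self.t * (self.t + 1))
        half_width = self.c * math.sqrt(math.log(2.0 / delta_t) / (2.0 * self.t))
        return self.mean - half_width, self.mean + half_width
```

The same three assumptions carry over: scores must be bounded, the environment must not react to the current forecast, and every realized state must be observed. The state kept between steps is only the time index and the running mean.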

Two extensions generalize the monitor. The differential alignment monitor compares a model […] with a reference model […] through

[…]

Its per-step statistic is

[…]

with empirical mean […]. Because […], the range doubles to […]. If the upper confidence bound is below […], the tested model is better; if the lower bound is above […], the reference is better.

The weighted alignment monitor emphasizes task-critical regimes. It uses prediction weights […] determined by history, and an outcome-dependent weighted scoring rule constructed from a base proper score and an outcome weight […]. With

[…]

the weighted rule is

[…]

Weighted time is

[…]

the weighted empirical score is

[…]

and the weighted target is

[…]

The range constant becomes […], and the same confidence-sequence form is used with […] replaced by […].

The algorithms maintain only the running time index, running mean, and variance process; average and differential monitors use […] memory in […] and […] time per step to evaluate the scoring rule. Weighted monitoring remains […] per step when weights are Markovian, but may require […] time and memory when weights depend on full history.

The experiments used the PRISM DTMC benchmarks Brp(16,2), Conditional, Crowds(5,5), Crowds(4,3), Die, Leader(3,5), Nand(5,2), and Quantiles, each with a […] self-loop to the initial state to avoid BSCCs. On these benchmarks the monitors were reported as fast and memory-efficient and as detecting misalignment early. Time per iteration scaled linearly with support size, with approximately […]/iter at […] for Brier and approximately […]/iter at […]. Differential decisions often occurred within tens to a few hundreds of observations for strong corruptions; representative cases included […] steps for Die, Invert vs Env, and […] for Crowds(4,3), Invert vs Env, whereas several additive-noise cases remained indecisive at horizon […] when models were close (Henzinger et al., 28 Jul 2025).

4. Fine-tuning dynamics and alignment forecasting in LLMs

In the fine-tuning-dynamics formulation, the mapped FAS is the paper’s alignment score

$$S(\theta) := \mathbb{E}_{x \sim p(x)}\left[\sum_{y \in \mathcal{Y}} \mathbf{1}\{(x,y)\in\mathcal{A}\}\,\pi_\theta(y\mid x)\right]$$

where $x$ is a prompt, $y$ is an autoregressive completion, and $\mathcal{A}$ is the aligned set induced by human preference or a calibrated judge. This quantity is the expected probability that the model produces an aligned completion under the prompt distribution. The paper derives its first-order update during fine-tuning and thereby turns alignment into a forecastable dynamical variable (Huang et al., 18 May 2026).
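To make the definition concrete, here is a small sketch that evaluates the alignment score exactly on a toy problem where prompts and completions can be enumerated. The dictionaries standing in for $p(x)$, $\pi_\theta(y \mid x)$, and the aligned set are illustrative assumptions; for a real LLM the inner sum is intractable and would have to be estimated by sampling completions and judging them.

```python
from typing import Dict, Set, Tuple


def alignment_score(
    prompt_dist: Dict[str, float],
    policy: Dict[str, Dict[str, float]],
    aligned: Set[Tuple[str, str]],
) -> float:
    """Expected probability mass the policy places on aligned completions."""
    return sum(
        p_x * sum(prob for y, prob in policy[x].items() if (x, y) in aligned)
        for x, p_x in prompt_dist.items()
    )
```

For two equally likely prompts where the policy assigns probability 0.8 and 0.4 to the aligned completion, the score is 0.5 × 0.8 + 0.5 × 0.4 = 0.6.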

The formalism introduces the prefix probability

[…]

and the future alignment potential for choosing token […] at step […],

[…]

For a small policy change, the first-order alignment variation is

[…]

The paper then defines outcome-conditioned posteriors after a prefix:

[…]

and

[…]

[…]

If […] are logits and

[…]

is the softmax Jacobian, then

[…]

and the Bayes identity

[…]

converts future-dependent alignment potentials into local alignment contrasts between outcome-conditioned token posteriors.

Under the empirical neural tangent kernel framework and the Relatively Stable Kernel assumption, fine-tuning updates logits as

[…]

For supervised fine-tuning with cross-entropy, […], where […] is the target next-token distribution induced by the training data. The resulting alignment evolution decomposes into a Rebound Force and a Driving Force:

[…]

The Driving Force is

[…]

and the Rebound Force is

[…]

The interpretation is that the Driving Force is determined by how the training distribution aligns with the token-level contrast between aligned and non-aligned posteriors, while the Rebound Force is an intrinsic self-interaction governed jointly by the current alignment state and posterior narrowness. In the identity-kernel simplification,

[…]

making explicit that posterior narrowness enters through sum-of-squares terms such as

[…]
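The softmax Jacobian mentioned above has a standard closed form, $J_{ij} = \pi_i(\delta_{ij} - \pi_j)$ for $\pi = \mathrm{softmax}(z)$, so a small logit change $\Delta z$ moves token probabilities by approximately $J\,\Delta z$. The sketch below implements that textbook form in plain Python; the notation is generic and may not match the paper's symbols.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(z: List[float]) -> List[List[float]]:
    """J[i][j] = d softmax(z)[i] / d z[j] = pi_i * (delta_ij - pi_j)."""
    pi = softmax(z)
    n = len(pi)
    return [
        [pi[i] * ((1.0 if i == j else 0.0) - pi[j]) for j in range(n)]
        for i in range(n)
    ]
```

Each row of the Jacobian sums to zero, reflecting that the logits can only redistribute probability mass across tokens, not create it.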

More concentrated posteriors have larger quadratic forms and therefore stronger rebound. The same framework explains the Rehearsal Priming Effect: prior alignment can leave a latent posterior imprint that increases the later Driving Force under re-exposure, yielding faster re-alignment.

The empirical study covered safety alignment, emergent misalignment, and IMDb sentiment, using Llama-3.1-8B, Gemma-2-2B, and Qwen3-8B, with a three-stage supervised fine-tuning paradigm: Stage 1 forward alignment or misalignment, Stage 2 reverse fine-tuning, and Stage 3 re-exposure. Each stage used […] samples. The reported findings were consistent with the theory: alignment reversal under reverse fine-tuning, stronger rebound under lower-diversity data that induced narrower posteriors, and accelerated re-alignment in Stage 3 after stronger Stage 1 priming. Controlled score matching in Stage 3 used tolerances of […] for Safety Rate and Positive Score and […] for Misalignment Rate (Huang et al., 18 May 2026).

5. Time-sliced scientific forecasting for research proposals

The explicit use of FAS appears in a proposal-evaluation framework that treats research ideation as a time-sliced scientific forecasting task. Given a research question and inspiring papers available before a cutoff time […], a model generates a structured proposal $\hat{P}$, and evaluation asks whether that proposal anticipates directions that appear in papers published after the cutoff. The future corpus is

[…]

candidate papers are retrieved by embedding similarity,

[…]

and the score is

$$\mathrm{FAS}(\hat{P}) = \max_{p \in \mathcal{R}_k(\hat{P})} s_{\text{LLM}}(\hat{P}, p).$$

No additional normalization, weighting, or macro/micro variants are defined (Wang et al., 28 Mar 2026).
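The retrieve-then-score structure can be sketched as follows. This is an illustrative outline only: embeddings are passed in as plain vectors, and `judge` is a hypothetical callable standing in for the LLM judge $s_{\text{LLM}}$. None of the names below come from the paper's code.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def future_alignment_score(
    proposal_text: str,
    proposal_vec: Sequence[float],
    future_papers: List[Tuple[str, Sequence[float]]],  # (text, embedding) pairs
    judge: Callable[[str, str], float],                 # hypothetical stand-in for the LLM judge
    k: int,
) -> float:
    """Maximum judge score over the k post-cutoff papers most similar to the proposal."""
    ranked = sorted(
        future_papers,
        key=lambda paper: cosine(proposal_vec, paper[1]),
        reverse=True,
    )
    return max(judge(proposal_text, text) for text, _ in ranked[:k])
```

Because the maximum is taken only over retrieved candidates, a future paper that the judge would rate highly but that falls outside the top $k$ by embedding similarity cannot contribute to the score. This is one reason absolute scores depend on retrieval depth.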

The generated proposal has a fixed schema with five fields: Research Question, Hypothesis, Proposed Method, Novelty Claims, and Experimental Details. Retrieval uses the full proposal text. The semantic judge returns a […]–[…] score for each of those five fields and an overall […]–[…] score, using a five-level rubric: […] unrelated, […] same broad area, […] some overlap, […] very similar, and […] nearly identical. Overall FAS is the maximum overall score across the retrieved future papers, and component-level FAS is obtained from the same retrieve-then-score procedure.

The evaluation pipeline uses text-embedding-3-large for embeddings, cosine similarity, top-$k$ retrieval with […], and GPT-4.1-mini as the judge at temperature […]. Robustness checks varied […] to […], replaced text-embedding-3-large with text-embedding-3-small, and replaced GPT-4.1-mini with GPT-4o-mini; rankings of systems remained consistent. In a robustness subset of […] re-evaluations, the baseline configuration produced scores of […] for the tuned model, […] for the untuned model, and […] for Chain-of-Ideas (CoI), with Pearson […] and Spearman […] under […] relative to the baseline, and stable model ordering under embedding-model and judge-model substitutions.

The dataset is time-consistent and contains […] machine learning papers from NeurIPS, ICML, and ICLR. Papers from […] were used to build training supervision and papers from […] formed the post-cutoff future evaluation targets. The experiments sampled […] training instances and […] evaluation instances. For each target paper […], the research question […] was extracted in a leakage-controlled way, inspiring papers […] were selected from references through a two-stage pipeline, and the proposal target […] was synthesized as a forward-looking structured proposal. The shortlist stage retained the top […] references using recency-weighted scoring with a full boost for references within two years of the target, linear decay up to five years, exclusion of references with fewer than […] citations, and a small citation-count tiebreaker. GPT-5-mini then selected […] references most likely to have been direct inspirations.
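The shortlist rule can be sketched as below, under several stated assumptions: the recency weight is taken to decay linearly from 1 at two years to 0 at five years, the citation count is used only to break ties between equal recency weights, and `top_n` and `min_citations` are hypothetical parameters whose actual values are not given here. The paper's exact scoring may differ.

```python
from typing import List, Tuple

Reference = Tuple[str, int, int]  # (title, publication year, citation count)


def recency_weight(age_years: float, full_until: float = 2.0, zero_at: float = 5.0) -> float:
    """Full weight up to `full_until` years, linear decay to zero at `zero_at` years."""
    if age_years <= full_until:
        return 1.0
    if age_years >= zero_at:
        return 0.0
    return (zero_at - age_years) / (zero_at - full_until)


def shortlist(
    references: List[Reference],
    target_year: int,
    top_n: int,          # hypothetical: shortlist size
    min_citations: int,  # hypothetical: exclusion threshold
) -> List[Reference]:
    """Drop low-citation references, then rank by recency weight with citations as tiebreaker."""
    eligible = [ref for ref in references if ref[2] >= min_citations]
    eligible.sort(
        key=lambda ref: (recency_weight(target_year - ref[1]), ref[2]),
        reverse=True,
    )
    return eligible[:top_n]
```

Sorting on the pair (recency weight, citations) guarantees that citation count never overrides recency, which is one reading of "small tiebreaker"; an additive bonus with a small coefficient would be another.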

Training used supervised fine-tuning rather than direct optimization of FAS, because FAS is non-differentiable. The supervision consisted of time-consistent […] pairs and optional citation-grounded reasoning traces teaching gap identification and inspiration borrowing. Three variants were compared: Direct SFT without reasoning traces, CoT SFT with a single monolithic reasoning block, and Stepwise CoT SFT interleaving problem identification, method design, and experiment design reasoning blocks. The models were Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct, and Qwen2.5-14B-Instruct, fine-tuned with LoRA rank […], LoRA […], […] epochs, maximum sequence length […], per-device batch size […], gradient accumulation […], learning rate […], and bf16 precision, primarily on […] H100 GPUs.

The main results reported higher overall FAS for future-aligned tuning than for prompting alone. Llama-3.1-8B-Instruct improved from […] to […], Qwen2.5-7B-Instruct from […] to […], and Qwen2.5-14B-Instruct from […] to […], corresponding to gains of […], […], and up to […]. Gains were concentrated in Hypothesis and Proposed Method, while Experimental Details improved less consistently. The adapted AI-Researcher and Chain-of-Ideas baselines scored lower FAS than the full method and, in the reported comparisons, lower even than the prompting baseline. Citation-removal ablations showed drops in overall and component-level FAS for Background, Method, and Benchmark citation types, with Novelty the most sensitive component.

Human evaluation used […] pairwise comparisons judged by three domain-expert graduate students. Against human-derived proposals, the Stepwise CoT system obtained […] win/tie/loss overall and was described as competitive; against prompting-only proposals it obtained […]. An additional LLM-based proposal-quality metric over Resource Validity, Task–Method Consistency, and Task–Experiment Consistency showed the strongest average score for Stepwise CoT at […], compared with […] for prompting. Two case studies implemented model-generated proposals with a code agent: a strategy-search prompting method yielded a […] accuracy gain on MATH, and a model-merging method, MALS, gave consistent improvements on reasoning tasks, including ARC-Easy […] versus […] for TIES-Merging in the excerpted results (Wang et al., 28 Mar 2026).

6. Assumptions, limitations, and interpretive issues

A central interpretive issue is terminological. Only the proposal-forecasting paper defines the name “Future Alignment Score” directly. In the other two cases, the term is a natural mapping supplied for future-oriented use: the runtime-monitoring paper defines an alignment score and AES, and the fine-tuning paper defines an alignment score $S(\theta)$ with a closed-form update. This means that FAS is not yet a standardized cross-domain metric, even though the three formulations share a future-facing semantics (Henzinger et al., 28 Jul 2025, Huang et al., 18 May 2026, Wang et al., 28 Mar 2026).

The probabilistic-systems formulation requires bounded scoring rules, non-anticipating environments, and full observability of realized outcomes. If the environment conditions $q_t$ on the current forecast $p_t$, the supermartingale-based confidence sequence may fail. Partial observability, delayed or noisy observations, and unbounded scoring rules such as unclipped log score require new estimators or restrictions. The paper also emphasizes that different proper scores emphasize different aspects of forecast quality, affecting bias, variance, and convergence speed.

The fine-tuning-dynamics formulation depends on first-order analysis, the empirical NTK framework, and the Relatively Stable Kernel assumption. The closed-form update is most accurate for small learning rates and short horizons; for longer runs or larger parameter shifts, recomputation or adaptation of kernels and posteriors is recommended. Estimating […], outcome-conditioned posteriors, and kernels is computationally expensive, and judge noise can bias the aligned set $\mathcal{A}$. The paper further identifies a trade-off: higher diversity in alignment data can slow initial alignment gains while reducing posterior narrowness and weakening rebound.

The proposal-forecasting formulation is domain-limited to machine-learning conferences with their own citation practices and publication tempo. Its objective is similarity to future published work, not necessarily novelty or correctness, so genuinely novel ideas that do not later appear in the corpus may be under-rewarded. The supervision pipeline also uses LLMs for inspiring-paper selection and reasoning-trace synthesis, which can introduce bias. Although robustness tests showed stable rankings across retrieval depth, embedding model, and judge model, absolute scores depend on those choices. The paper explicitly notes that time filtering and semantic judging mitigate but do not eliminate the possibility of gaming.

A common misconception would be to treat FAS as a direct proof of downstream property validity, durable safety, or scientific merit. None of the three formulations makes that claim. In runtime monitoring, FAS supplies anytime intervals on predictive alignment rather than direct property-error bounds. In fine-tuning dynamics, FAS predicts alignment trajectories under modeled update assumptions rather than guaranteeing global robustness. In proposal forecasting, FAS is a verifiable surrogate for proposal quality rather than a direct measure of novelty, correctness, or impact. What the three lines of work jointly establish is narrower but precise: future-oriented alignment can be formalized, estimated, and, in some settings, forecasted with explicit statistical or dynamical structure.
