Future Alignment Score (FAS): Theory & Applications

Updated 4 July 2026

Future Alignment Score (FAS) is a context-dependent metric that evaluates a model’s current outputs against future states, future completions, or future publications.
It spans several formulations, from probabilistic runtime monitoring with proper scoring rules to LLM fine-tuning dynamics expressed through alignment probabilities.
In research proposal forecasting, FAS measures how well a generated proposal anticipates post-cutoff publications; the reported results show both performance gains and limitations.

Future Alignment Score (FAS) denotes a future-oriented alignment quantity whose precise meaning depends on the evaluation setting. In recent arXiv work, the term is used explicitly for time-sliced evaluation of research proposals against post-cutoff publications, and it is also a natural mapping for two related constructs that were not originally named FAS: the alignment score monitored online for probabilistic systems and the alignment score governing LLM fine-tuning dynamics (Wang et al., 28 Mar 2026, Henzinger et al., 28 Jul 2025, Huang et al., 18 May 2026). Across these settings, the common object is prospective rather than retrospective: FAS measures how well a model’s current output distribution, current policy, or current proposal anticipates future outcomes that have not yet been observed.

1. Conceptual scope and formal variants

The literature does not present a single canonical FAS. One paper defines FAS directly for research proposal forecasting, whereas two others define alignment scores that can be mapped to FAS in a future-facing sense. This suggests that FAS is best understood as a context-dependent formalization of future alignment rather than a universally standardized metric.

Setting	Formal quantity	Interpretation
Probabilistic runtime monitoring	$FAS_t := E_{X_{t+1}\sim q_t}[S(p_t, X_{t+1})]$ ; cumulative $FAS_{1:t} := \frac{1}{t}\sum_{i=1}^t E_{X_i\sim q_i}[S(p_i, X_i)]$	Prospective next-state predictive alignment
LLM fine-tuning dynamics	$S(\theta) := \mathbb{E}_{x \sim p(x)}\left[\sum_{y \in \mathcal{Y}} \mathbf{1}\{(x,y)\in\mathcal{A}\}\,\pi_\theta(y\mid x)\right]$	Expected probability of producing an aligned completion
Research proposal forecasting	$\mathrm{FAS}(\hat{P}) = \max_{p \in \mathcal{R}_k(\hat{P})} s_{\text{LLM}}(\hat{P}, p)$	Maximum semantic alignment to retrieved future papers

These variants differ materially. In the probabilistic-systems setting, FAS is induced by a bounded proper scoring rule and is typically a loss-like quantity for which lower is better under many scores. In the fine-tuning and proposal-forecasting settings, the mapped or explicit FAS is higher when alignment is stronger. The shared feature is temporal orientation: each definition evaluates present model behavior by reference to future states, future completions, or future literature rather than only past fit (Henzinger et al., 28 Jul 2025, Huang et al., 18 May 2026, Wang et al., 28 Mar 2026).

2. Sequential future alignment in probabilistic systems

In the runtime-monitoring formulation, the state space $X$ is finite. At time $t$, a model issues a probabilistic forecast over the next state,

$p_t(x) = P_{\text{model}}(X_{t+1}=x \mid \text{history}),$

while the system or environment determines a true but unknown distribution

$q_t(x) = P_{\text{env}}(X_{t+1}=x \mid \text{history}).$

The environment chooses $q_t$ based only on past observations, before the model’s prediction, as required by the sequential forecasting guarantees. After predicting $p_t$, the monitor observes the next state $X_{t+1} \sim q_t$ (Henzinger et al., 28 Jul 2025).

The score function is a bounded proper scoring rule

$S : \Delta(X) \times X \to [a, b],$

with range width $c := b - a$, where $\Delta(X)$ is the set of probability distributions over $X$. Properness means that for all $p, q \in \Delta(X)$,

$E_{x \sim q}[S(q, x)] \le E_{x \sim q}[S(p, x)],$

so in the loss-like orientation, forecasting the true distribution minimizes the expected score.

Two examples used experimentally are the Brier score, bounded in $[0, 2]$,

$S_{\text{Brier}}(p, x) = \sum_{y \in X} \left(p(y) - \mathbf{1}\{y = x\}\right)^2,$

and the spherical score, which is built from the normalized probability of the realized state,

$\frac{p(x)}{\|p\|_2} \in [0, 1],$

negated when a loss orientation is required.

Under this formulation, the instantaneous future alignment quantity is the expected next-step score

$FAS_t := E_{X_{t+1}\sim q_t}[S(p_t, X_{t+1})].$

The cumulative version is the paper’s Average Expected Score (AES):

$FAS_{1:t} := \frac{1}{t}\sum_{i=1}^t E_{X_i\sim q_i}[S(p_i, X_i)].$

The supplied mapping identifies cumulative FAS with this AES. The definition is explicitly prospective: $FAS_t$ is the model’s expected score against the true next-state distribution before the next state is observed. Averaging these values over time produces a sustained measure of predictive alignment.
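The two quantities above can be computed directly when the true next-state distribution is known, which is only the case in a simulation. The following is a minimal sketch under that assumption; the distributions `q`, `p_good`, and `p_bad` and all function names are hypothetical and chosen purely for illustration.

```python
import numpy as np

def brier_score(p, x):
    """Brier loss of forecast p when state x is realized; lies in [0, 2]."""
    onehot = np.zeros_like(p)
    onehot[x] = 1.0
    return float(np.sum((p - onehot) ** 2))

def expected_score(p, q, score=brier_score):
    """FAS_t = E_{X ~ q}[S(p, X)] for a known next-state distribution q."""
    return float(sum(q[x] * score(p, x) for x in range(len(q))))

def cumulative_fas(forecasts, truths, score=brier_score):
    """FAS_{1:t}: the average of the per-step expected scores."""
    per_step = [expected_score(p, q, score) for p, q in zip(forecasts, truths)]
    return float(np.mean(per_step))

# Hypothetical three-state example (values are illustrative, not from the paper).
q = np.array([0.7, 0.2, 0.1])        # true next-state distribution
p_good = q.copy()                    # forecast that matches the environment
p_bad = np.array([0.1, 0.2, 0.7])    # forecast with the mass inverted

fas_good = expected_score(p_good, q)                   # 0.46
fas_bad = expected_score(p_bad, q)                     # 1.18
fas_avg = cumulative_fas([p_good, p_bad], [q, q])      # 0.82
```

Because the Brier score is used here as a loss, the aligned forecast receives the lower value, which matches the lower-is-better reading of this variant.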

This formulation is designed for settings in which formal verification depends on a probabilistic model whose validity may drift at runtime. Formal verification results are reliable only insofar as the model remains aligned with reality. Alignment monitoring supplies an online statistical check of that premise by continually testing whether the model’s successor distributions remain close to the system’s realized behavior. The weighted version further connects monitoring to verification-relevant aspects of system behavior, such as BSCC exits for safety or selected groups for fairness, although the paper does not prove that a high or low FAS directly implies a bound on downstream property error (Henzinger et al., 28 Jul 2025).

3. Estimation, confidence sequences, and monitor variants

The average monitor operates sequentially. At each time step it computes a forecast $p_t$, observes $X_{t+1}$, forms the realized score

$s_t := S(p_t, X_{t+1}),$

and updates the empirical average

$\hat{\mu}_t := \frac{1}{t}\sum_{i=1}^t s_i.$

Using the paper’s indexing convention, this is written as

$\hat{\mu}_t = \frac{1}{t}\sum_{i=1}^t S(p_i, X_i).$

The monitor outputs a time-uniform high-probability interval

$[\hat{\mu}_t - \epsilon_t,\ \hat{\mu}_t + \epsilon_t]$

covering the true AES for all times simultaneously with probability at least $1 - \delta$ (Henzinger et al., 28 Jul 2025).

The empirical-variance-adaptive confidence sequence combines the realized scores with a predictable variance process, that is, a running variance estimate whose term at step $t$ depends only on scores observed before step $t$. Together these determine a half-width $\epsilon_t$ around the empirical average $\hat{\mu}_t$ of the realized scores.

The guarantee is

$P\left(\forall t \ge 1:\ |\hat{\mu}_t - FAS_{1:t}| \le \epsilon_t\right) \ge 1 - \delta.$

The assumptions are boundedness of $S$, non-anticipation of the environment, and full observability of the realized state. No stationarity or mixing assumption is required; the construction is nonparametric. The bounds are nonasymptotic and time-uniform, and the interval width adapts to the empirical variance, tightening when the realized scores vary little.
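To make the sequential loop concrete, the sketch below implements an average monitor with a simpler interval than the one described above: a Hoeffding-style bound made time-uniform by a union bound over steps. This is a swapped-in technique, not the paper’s empirical-variance-adaptive construction, and it is wider in practice. It relies only on the scores being bounded with range width `score_range`; the class name, the simulated environment, and all numeric settings are assumptions for illustration.

```python
import math
import numpy as np

def brier_score(p, x):
    """Brier loss of forecast p when state x is realized; lies in [0, 2]."""
    onehot = np.zeros_like(p)
    onehot[x] = 1.0
    return float(np.sum((p - onehot) ** 2))

class AverageMonitor:
    """Running mean of realized scores with a time-uniform Hoeffding-style interval."""

    def __init__(self, score_range, delta):
        self.c = score_range      # width b - a of the score range
        self.delta = delta        # total failure probability over all steps
        self.t = 0
        self.mean = 0.0

    def update(self, realized_score):
        self.t += 1
        self.mean += (realized_score - self.mean) / self.t
        # Spend delta_t = 6 * delta / (pi^2 * t^2) at step t; these sum to delta.
        log_term = math.log(math.pi ** 2 * self.t ** 2 / (3.0 * self.delta))
        half_width = self.c * math.sqrt(log_term / (2.0 * self.t))
        return self.mean - half_width, self.mean + half_width

# Hypothetical stationary environment, used only so the target is known.
q = np.array([0.7, 0.2, 0.1])     # true next-state distribution
p = np.array([0.6, 0.3, 0.1])     # slightly misaligned forecast
true_aes = float(sum(q[x] * brier_score(p, x) for x in range(3)))

rng = np.random.default_rng(0)
monitor = AverageMonitor(score_range=2.0, delta=0.05)
intervals = []
for _ in range(2000):
    x = rng.choice(3, p=q)                        # environment emits the next state
    intervals.append(monitor.update(brier_score(p, x)))
```

Only the step counter and the running mean are stored, so memory does not grow with the number of observations. A variance-adaptive sequence would additionally track a running variance term and would typically yield narrower intervals.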

Two extensions generalize the monitor. The differential alignment monitor compares a model $p$ with a reference model $p'$ through

$\Delta FAS_{1:t} := \frac{1}{t}\sum_{i=1}^t E_{X_i\sim q_i}\left[S(p_i, X_i) - S(p'_i, X_i)\right].$

Its per-step statistic is

$d_t := S(p_t, X_{t+1}) - S(p'_t, X_{t+1}),$

with empirical mean $\hat{d}_t := \frac{1}{t}\sum_{i=1}^t d_i$. Because $d_t \in [-c, c]$, where $c$ is the range width of $S$, the range doubles to $2c$. If the upper confidence bound is below $0$, the tested model is better; if the lower bound is above $0$, the reference is better.
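The decision rule for the differential monitor can be sketched as follows. As a simplifying assumption, the interval is again a Hoeffding-style union-bound sequence rather than the paper’s variance-adaptive one, applied to score differences whose range is twice that of the base score. Both forecasts are held fixed over time, and the function name, distributions, and horizon are hypothetical.

```python
import math
import numpy as np

def brier_score(p, x):
    """Brier loss of forecast p when state x is realized; lies in [0, 2]."""
    onehot = np.zeros_like(p)
    onehot[x] = 1.0
    return float(np.sum((p - onehot) ** 2))

def differential_monitor(p_model, p_ref, q_env, score_range, delta, horizon, seed=0):
    """Return (verdict, steps) for tested model vs reference under a loss-like score."""
    rng = np.random.default_rng(seed)
    mean_diff = 0.0
    for t in range(1, horizon + 1):
        x = rng.choice(len(q_env), p=q_env)
        d = brier_score(p_model, x) - brier_score(p_ref, x)   # lies in [-c, c]
        mean_diff += (d - mean_diff) / t
        log_term = math.log(math.pi ** 2 * t ** 2 / (3.0 * delta))
        half_width = 2.0 * score_range * math.sqrt(log_term / (2.0 * t))  # range 2c
        if mean_diff + half_width < 0.0:
            return "model better", t
        if mean_diff - half_width > 0.0:
            return "reference better", t
    return "undecided", horizon

q = np.array([0.7, 0.2, 0.1])          # hypothetical environment
inverted = np.array([0.1, 0.2, 0.7])   # strongly corrupted forecast

verdict, steps = differential_monitor(q, inverted, q, 2.0, 0.05, 2000)
swapped, _ = differential_monitor(inverted, q, q, 2.0, 0.05, 2000)
tied, tied_steps = differential_monitor(q, q, q, 2.0, 0.05, 2000)
```

When the two forecasts are identical every difference is zero, so the interval always straddles zero and the monitor stays undecided for the whole horizon.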

The weighted alignment monitor emphasizes task-critical regimes. It uses prediction weights $w_t$ determined by history, and an outcome-dependent weighted scoring rule $S^{v}$ constructed from a base proper score $S$ and an outcome weight $v$.

Weighted time is

$T_t := \sum_{i=1}^t w_i,$

the weighted empirical score is

$\hat{\mu}^{w}_t := \frac{1}{T_t}\sum_{i=1}^t w_i\, S^{v}(p_i, X_i),$

and the weighted target is

$FAS^{w}_{1:t} := \frac{1}{T_t}\sum_{i=1}^t w_i\, E_{X_i\sim q_i}\left[S^{v}(p_i, X_i)\right].$

The range constant is rescaled by the weighting, and the same confidence-sequence form is used with $t$ replaced by $T_t$.

The algorithms maintain only the running time index, running mean, and variance process; average and differential monitors use $O(1)$ memory in $t$ and $O(|X|)$ time per step to evaluate the scoring rule. Weighted monitoring keeps the same per-step cost when weights are Markovian, but may require time and memory that grow with $t$ when weights depend on the full history.

In experiments on the PRISM DTMC benchmarks (Brp(16,2), Conditional, Crowds(5,5), Crowds(4,3), Die, Leader(3,5), Nand(5,2), and Quantiles, each modified with a self-loop to the initial state to avoid BSCCs), the monitors were reported as fast and memory-efficient and as detecting misalignment early. Time per iteration scaled linearly with support size. Differential decisions often occurred within tens to a few hundreds of observations for strong corruptions, with Die and Crowds(4,3) under the Invert vs Env comparison given as representative cases, whereas several additive-noise cases remained indecisive at the experimental horizon when models were close (Henzinger et al., 28 Jul 2025).

4. Fine-tuning dynamics and alignment forecasting in LLMs

In the fine-tuning-dynamics formulation, the mapped FAS is the paper’s alignment score

$S(\theta) := \mathbb{E}_{x \sim p(x)}\left[\sum_{y \in \mathcal{Y}} \mathbf{1}\{(x,y)\in\mathcal{A}\}\,\pi_\theta(y\mid x)\right],$

where $x$ is a prompt, $y$ is an autoregressive completion, and $\mathcal{A}$ is the aligned set induced by human preference or a calibrated judge. This quantity is the expected probability that the model produces an aligned completion under the prompt distribution. The paper derives its first-order update during fine-tuning and thereby turns alignment into a forecastable dynamical variable (Huang et al., 18 May 2026).
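For a small discrete policy the alignment score can be evaluated by direct enumeration. The sketch below assumes a toy setting with two prompts and two whole-completion outcomes; the prompt names, probabilities, and aligned set are hypothetical and serve only to show how the expectation is assembled.

```python
def alignment_score(prompt_dist, policy, aligned):
    """S = E_x[ sum_y 1{(x, y) in aligned} * pi(y | x) ] by enumeration."""
    total = 0.0
    for x, px in prompt_dist.items():
        aligned_mass = sum(prob for y, prob in policy[x].items() if (x, y) in aligned)
        total += px * aligned_mass
    return total

# Hypothetical toy policy: each prompt has two possible completions.
prompt_dist = {"q1": 0.5, "q2": 0.5}
policy = {
    "q1": {"safe": 0.9, "unsafe": 0.1},
    "q2": {"safe": 0.6, "unsafe": 0.4},
}
aligned = {("q1", "safe"), ("q2", "safe")}   # stand-in for a judge's aligned set

score = alignment_score(prompt_dist, policy, aligned)   # 0.5*0.9 + 0.5*0.6 = 0.75
```

In a real model the sum over completions is intractable, so the score would have to be estimated by sampling completions and querying the judge.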

The formalism introduces the prefix probability

$\pi_\theta(y_{<t} \mid x) = \prod_{s<t} \pi_\theta(y_s \mid x, y_{<s})$

and the future alignment potential for choosing token $v$ at step $t$,

$A_t(v \mid x, y_{<t}) := \sum_{y_{>t}} \mathbf{1}\{(x, y_{<t}\, v\, y_{>t}) \in \mathcal{A}\}\, \pi_\theta(y_{>t} \mid x, y_{<t}, v),$

that is, the probability that the completion ends up aligned once $v$ is appended to the prefix.

For a small policy change $\delta\pi_\theta$, the first-order alignment variation is

$\delta S = \mathbb{E}_{x \sim p(x)}\left[\sum_{t}\sum_{y_{<t}} \pi_\theta(y_{<t} \mid x) \sum_{v} A_t(v \mid x, y_{<t})\, \delta\pi_\theta(v \mid x, y_{<t})\right].$

The paper then defines outcome-conditioned posteriors after a prefix:

$\pi^{+}(v \mid x, y_{<t}) := P\left(y_t = v \mid x, y_{<t}, (x,y) \in \mathcal{A}\right)$

and

$\pi^{-}(v \mid x, y_{<t}) := P\left(y_t = v \mid x, y_{<t}, (x,y) \notin \mathcal{A}\right),$

together with the prefix-level alignment probability

$S_t := \sum_{v} \pi_\theta(v \mid x, y_{<t})\, A_t(v \mid x, y_{<t}).$

If $z$ are logits and

$J = \mathrm{diag}(\pi_\theta) - \pi_\theta \pi_\theta^{\top}$

is the softmax Jacobian, then

$\delta\pi_\theta = J\,\delta z,$

and the Bayes identity

$\pi_\theta(v)\,A_t(v) = S_t\,\pi^{+}(v), \quad\text{hence}\quad J A_t = S_t(1 - S_t)\left(\pi^{+} - \pi^{-}\right),$

converts future-dependent alignment potentials into local alignment contrasts between outcome-conditioned token posteriors.
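The conversion from alignment potentials to a contrast between posteriors can be checked numerically at a single prefix. The sketch below assumes a five-token vocabulary with randomly drawn logits `z` and alignment potentials `A` (the probability of ending up aligned after each token); these values and all variable names are illustrative assumptions. It verifies that the softmax Jacobian applied to the potentials equals $S(1-S)(\pi^{+} - \pi^{-})$, where $S$ is the alignment probability at that prefix.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=5)      # hypothetical logits at one prefix
A = rng.uniform(size=5)     # hypothetical alignment potentials in [0, 1)

pi = softmax(z)
S = float(pi @ A)                        # alignment probability at this prefix
pi_pos = pi * A / S                      # P(token | aligned), by Bayes' rule
pi_neg = pi * (1.0 - A) / (1.0 - S)      # P(token | not aligned)

J = np.diag(pi) - np.outer(pi, pi)       # softmax Jacobian d pi / d z
lhs = J @ A                              # gradient of S with respect to the logits
rhs = S * (1.0 - S) * (pi_pos - pi_neg)  # contrast between the two posteriors
```

The identity follows from writing $\pi_\theta = S\pi^{+} + (1 - S)\pi^{-}$ and substituting into $JA$; the code confirms it only for this toy instance.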

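Under the empirical neural tangent kernel framework and the Relatively Stable Kernel assumption, fine-tuning updates logits as

$\Delta z = -\eta\, K\, \nabla_z \mathcal{L},$

where $\eta$ is the learning rate, $K$ is the empirical NTK, and sums over training prompts and prefixes are suppressed. For supervised fine-tuning with cross-entropy, $\nabla_z \mathcal{L} = \pi_\theta - \pi^{\text{data}}$, where $\pi^{\text{data}}$ is the target next-token distribution induced by the training data. The resulting alignment evolution decomposes into a Rebound Force and a Driving Force:

$\Delta S \approx F_{\text{drive}} + F_{\text{rebound}}.$

Writing $\pi^{+}$ and $\pi^{-}$ for the aligned and non-aligned token posteriors and $S$ for the current alignment probability, the Driving Force is

$F_{\text{drive}} = \eta\, S(1 - S)\, (\pi^{+} - \pi^{-})^{\top} K\, \pi^{\text{data}},$

and the Rebound Force is

$F_{\text{rebound}} = -\eta\, S(1 - S)\, (\pi^{+} - \pi^{-})^{\top} K\, \pi_\theta.$

The interpretation is that the Driving Force is determined by how the training distribution aligns with the token-level contrast between aligned and non-aligned posteriors, while the Rebound Force is an intrinsic self-interaction governed jointly by the current alignment state and posterior narrowness. In the identity-kernel simplification, using $\pi_\theta = S\pi^{+} + (1 - S)\pi^{-}$,

$F_{\text{rebound}} = -\eta\, S(1 - S)\left[S\,\|\pi^{+}\|_2^2 - (1 - S)\,\|\pi^{-}\|_2^2 + (1 - 2S)\,\langle \pi^{+}, \pi^{-}\rangle\right],$

making explicit that posterior narrowness enters through sum-of-squares terms such as

$\|\pi^{+}\|_2^2 = \sum_{v} \pi^{+}(v)^2.$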

More concentrated posteriors have larger quadratic forms and therefore stronger rebound. The same framework explains the Rehearsal Priming Effect: prior alignment can leave a latent posterior imprint that increases the later Driving Force under re-exposure, yielding faster re-alignment.
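One way to see the two forces at work is a single-prefix toy calculation with an identity kernel, which is a strong simplification of the kernelized dynamics. In the sketch below, one cross-entropy gradient step moves the logits by `lr * (pi_data - pi)`, and the first-order change in alignment is split into a data-driven term and a self-interaction term. The split, the function names, and all numeric values are assumptions made for illustration, not the paper’s exact expressions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def alignment(z, A):
    """Alignment probability at one prefix: sum_v pi(v) * A(v)."""
    return float(softmax(z) @ A)

def force_split(z, A, pi_data, lr):
    """First-order change in alignment, split into driving and rebound terms."""
    pi = softmax(z)
    S = float(pi @ A)
    contrast = pi * A / S - pi * (1.0 - A) / (1.0 - S)    # pi_pos - pi_neg
    driving = lr * S * (1.0 - S) * float(contrast @ pi_data)
    rebound = -lr * S * (1.0 - S) * float(contrast @ pi)
    return driving, rebound

z = np.array([0.5, 0.0, -0.5, 0.2])          # hypothetical logits
A = np.array([0.9, 0.8, 0.1, 0.2])           # hypothetical alignment potentials
pi_data = np.array([0.0, 0.0, 1.0, 0.0])     # training target on a low-alignment token
lr = 1e-3

driving, rebound = force_split(z, A, pi_data, lr)
z_new = z + lr * (pi_data - softmax(z))      # one gradient step on cross-entropy
actual = alignment(z_new, A) - alignment(z, A)
```

With training data concentrated on a low-alignment token, the driving term is negative, so this step lowers alignment. The sum of the two terms matches the actual change up to second order in the learning rate.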

The empirical study covered safety alignment, emergent misalignment, and IMDb sentiment, using Llama-3.1-8B, Gemma-2-2B, and Qwen3-8B, with a three-stage supervised fine-tuning paradigm: Stage 1 forward alignment or misalignment, Stage 2 reverse fine-tuning, and Stage 3 re-exposure. Each stage used the same number of samples. The reported findings were consistent with the theory: alignment reversal under reverse fine-tuning, stronger rebound under lower-diversity data that induced narrower posteriors, and accelerated re-alignment in Stage 3 after stronger Stage 1 priming. Controlled score matching in Stage 3 used one tolerance for Safety Rate and Positive Score and a separate tolerance for Misalignment Rate (Huang et al., 18 May 2026).

5. Time-sliced scientific forecasting for research proposals

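The explicit use of FAS appears in a proposal-evaluation framework that treats research ideation as a time-sliced scientific forecasting task. Given a research question and inspiring papers available before a cutoff time $T$, a model generates a structured proposal $\hat{P}$, and evaluation asks whether that proposal anticipates directions that appear in papers published after the cutoff. The future corpus is

$\mathcal{F}_T := \{p : \text{date}(p) > T\},$

candidate papers are retrieved by embedding similarity,

$\mathcal{R}_k(\hat{P}) := \operatorname{top-}k_{\,p \in \mathcal{F}_T}\ \cos\left(e(\hat{P}), e(p)\right),$

where $e(\cdot)$ is the embedding model, and the score is

$\mathrm{FAS}(\hat{P}) = \max_{p \in \mathcal{R}_k(\hat{P})} s_{\text{LLM}}(\hat{P}, p).$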

No additional normalization, weighting, or macro/micro variants are defined (Wang et al., 28 Mar 2026).
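The retrieve-then-score procedure can be sketched in a few lines. The code below assumes precomputed embedding vectors and a `judge` callable that stands in for the LLM judge by returning a score for a retrieved paper index; the toy vectors, the score table, and the function names are hypothetical.

```python
import numpy as np

def cosine_top_k(query_vec, corpus_vecs, k):
    """Indices of the k corpus rows most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(-sims)[:k]

def future_alignment_score(proposal_vec, future_vecs, judge, k):
    """FAS = max judge score over the top-k retrieved future papers."""
    retrieved = cosine_top_k(proposal_vec, future_vecs, k)
    return max(judge(int(i)) for i in retrieved), retrieved

# Toy post-cutoff corpus of four papers with 2-d embeddings (illustrative only).
proposal_vec = np.array([1.0, 0.0])
future_vecs = np.array([[1.0, 0.1], [0.0, 1.0], [0.9, 0.5], [-1.0, 0.0]])
toy_scores = {0: 3, 1: 5, 2: 4, 3: 1}    # stand-in for LLM judge outputs

fas, retrieved = future_alignment_score(proposal_vec, future_vecs, toy_scores.get, k=2)
```

In this toy corpus paper 1 has the highest judge score but is not among the two nearest neighbours, so it does not contribute. The score therefore depends on retrieval depth as well as on the judge.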

The generated proposal has a fixed schema with five fields: Research Question, Hypothesis, Proposed Method, Novelty Claims, and Experimental Details. Retrieval uses the full proposal text. The semantic judge returns a score for each of those five fields and an overall score, all on a five-level rubric that runs from unrelated through same broad area, some overlap, and very similar to nearly identical. Overall FAS is the maximum overall score across the retrieved future papers, and component-level FAS is obtained from the same retrieve-then-score procedure.

The evaluation pipeline uses text-embedding-3-large for embeddings, cosine similarity, top-$k$ retrieval, and GPT-4.1-mini as the judge at a fixed temperature. Robustness checks varied $k$, replaced text-embedding-3-large with text-embedding-3-small, and replaced GPT-4.1-mini with GPT-4o-mini; rankings of systems remained consistent. On a robustness subset of re-evaluations, the tuned model, the untuned model, and Chain-of-Ideas (CoI) were scored under the baseline configuration, Pearson and Spearman correlations were computed between scores under the alternative retrieval depth and the baseline, and model ordering was stable under embedding-model and judge-model substitutions.

The dataset is time-consistent and consists of machine learning papers from NeurIPS, ICML, and ICLR. Papers from the earlier period were used to build training supervision, and papers from the later period formed the post-cutoff future evaluation targets. Training and evaluation instances were sampled from these two pools. For each target paper, the research question was extracted in a leakage-controlled way, inspiring papers were selected from its references through a two-stage pipeline, and the proposal target was synthesized as a forward-looking structured proposal. The shortlist stage retained the top-ranked references using recency-weighted scoring with a full boost for references within two years of the target, linear decay up to five years, exclusion of references below a minimum citation count, and a small citation-count tiebreaker. GPT-5-mini then selected the references most likely to have been direct inspirations.
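The shortlist scoring described above might look like the following sketch. Several details are assumptions made for illustration rather than values from the paper: the decay is taken to fall linearly to zero at five years, and `top_n`, `min_citations`, and the tiebreaker scale `tiebreak` are placeholder parameters.

```python
def recency_weight(age_years, full_boost_years=2.0, cutoff_years=5.0):
    """1.0 within the full-boost window, then linear decay to 0.0 at the cutoff."""
    if age_years <= full_boost_years:
        return 1.0
    if age_years >= cutoff_years:
        return 0.0
    return (cutoff_years - age_years) / (cutoff_years - full_boost_years)

def shortlist(references, target_year, top_n, min_citations, tiebreak=1e-6):
    """Rank references by recency weight plus a small citation-count tiebreaker."""
    scored = []
    for ref in references:
        if ref["citations"] < min_citations:
            continue                      # drop weakly cited references
        age = target_year - ref["year"]
        score = recency_weight(age) + tiebreak * ref["citations"]
        scored.append((score, ref["title"]))
    scored.sort(reverse=True)
    return [title for _, title in scored[:top_n]]

# Hypothetical reference list for a target paper published in 2024.
refs = [
    {"title": "A", "year": 2023, "citations": 50},
    {"title": "B", "year": 2021, "citations": 500},
    {"title": "C", "year": 2024, "citations": 2},
    {"title": "D", "year": 2018, "citations": 1000},
    {"title": "E", "year": 2022, "citations": 80},
]
picked = shortlist(refs, target_year=2024, top_n=3, min_citations=5)
```

Here E and A both receive the full boost and are separated only by the citation tiebreaker, B is partially decayed, C is excluded by the citation floor, and D is too old to score above zero on recency.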

Training used supervised fine-tuning rather than direct optimization of FAS, because FAS is non-differentiable. The supervision consisted of time-consistent pairs of inputs (research question and inspiring papers) and target proposals, with optional citation-grounded reasoning traces teaching gap identification and inspiration borrowing. Three variants were compared: Direct SFT without reasoning traces, CoT SFT with a single monolithic reasoning block, and Stepwise CoT SFT interleaving problem identification, method design, and experiment design reasoning blocks. The models were Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct, and Qwen2.5-14B-Instruct, fine-tuned with LoRA in bf16 precision, primarily on H100 GPUs.

The main results reported higher overall FAS for future-aligned tuning than for prompting alone for each of Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct, and Qwen2.5-14B-Instruct. Gains were concentrated in Hypothesis and Proposed Method, while Experimental Details improved less consistently. The adapted AI-Researcher and Chain-of-Ideas baselines scored lower FAS than the full method and, in the reported comparisons, lower even than the prompting baseline. Citation-removal ablations showed drops in overall and component-level FAS when Background, Method, or Benchmark citations were removed, with Novelty the most sensitive component.

Human evaluation used pairwise comparisons judged by three domain-expert graduate students. Against human-derived proposals, the Stepwise CoT system was described as competitive on overall win/tie/loss counts; it was also compared against prompting-only proposals. An additional LLM-based proposal-quality metric over Resource Validity, Task–Method Consistency, and Task–Experiment Consistency showed the strongest average score for Stepwise CoT, ahead of prompting. Two case studies implemented model-generated proposals with a code agent: a strategy-search prompting method yielded an accuracy gain on MATH, and a model-merging method, MALS, gave consistent improvements on reasoning tasks, including a higher ARC-Easy score than TIES-Merging in the excerpted results (Wang et al., 28 Mar 2026).

6. Assumptions, limitations, and interpretive issues

A central interpretive issue is terminological. Only the proposal-forecasting paper defines the name “Future Alignment Score” directly. In the other two cases, the term is a natural mapping supplied for future-oriented use: the runtime-monitoring paper defines an alignment score and AES, and the fine-tuning paper defines an alignment score $S(\theta)$ with a closed-form update. This means that FAS is not yet a standardized cross-domain metric, even though the three formulations share a future-facing semantics (Henzinger et al., 28 Jul 2025, Huang et al., 18 May 2026, Wang et al., 28 Mar 2026).

The probabilistic-systems formulation requires bounded scoring rules, non-anticipating environments, and full observability of realized outcomes. If the environment conditions $q_t$ on the current forecast $p_t$, the supermartingale-based confidence sequence may fail. Partial observability, delayed or noisy observations, and unbounded scoring rules such as unclipped log score require new estimators or restrictions. The paper also emphasizes that different proper scores emphasize different aspects of forecast quality, affecting bias, variance, and convergence speed.

The fine-tuning-dynamics formulation depends on first-order analysis, the empirical NTK framework, and the Relatively Stable Kernel assumption. The closed-form update is most accurate for small learning rates and short horizons; for longer runs or larger parameter shifts, recomputation or adaptation of kernels and posteriors is recommended. Estimating the alignment score $S(\theta)$, outcome-conditioned posteriors, and kernels is computationally expensive, and judge noise can bias the aligned set $\mathcal{A}$. The paper further identifies a trade-off: higher diversity in alignment data can slow initial alignment gains while reducing posterior narrowness and weakening rebound.

The proposal-forecasting formulation is domain-limited to machine-learning conferences with their own citation practices and publication tempo. Its objective is similarity to future published work, not necessarily novelty or correctness, so genuinely novel ideas that do not later appear in the corpus may be under-rewarded. The supervision pipeline also uses LLMs for inspiring-paper selection and reasoning-trace synthesis, which can introduce bias. Although robustness tests showed stable rankings across retrieval depth, embedding model, and judge model, absolute scores depend on those choices. The paper explicitly notes that time filtering and semantic judging mitigate but do not eliminate the possibility of gaming.

A common misconception would be to treat FAS as a direct proof of downstream property validity, durable safety, or scientific merit. None of the three formulations makes that claim. In runtime monitoring, FAS supplies anytime intervals on predictive alignment rather than direct property-error bounds. In fine-tuning dynamics, FAS predicts alignment trajectories under modeled update assumptions rather than guaranteeing global robustness. In proposal forecasting, FAS is a verifiable surrogate for proposal quality rather than a direct measure of novelty, correctness, or impact. What the three lines of work jointly establish is narrower but precise: future-oriented alignment can be formalized, estimated, and, in some settings, forecasted with explicit statistical or dynamical structure.

References

Wang et al. (2026). Learning to Predict Future-Aligned Research Proposals with Language Models.

Henzinger et al. (2025). Alignment Monitoring.

Huang et al. (2026). Alignment Dynamics in LLM Fine-Tuning.
