Future Alignment Score (FAS): Theory & Applications
- Future Alignment Score (FAS) is a context-dependent metric that evaluates a model’s current outputs by forecasting future states or proposal quality.
- It spans varied formulations—from probabilistic runtime monitoring using proper scoring rules to LLM fine-tuning dynamics with alignment probability estimates.
- In research proposal forecasting, FAS measures how well a generated proposal anticipates post-cutoff publications, highlighting both performance gains and limitations.
Future Alignment Score (FAS) denotes a future-oriented alignment quantity whose precise meaning depends on the evaluation setting. In recent arXiv work, the term is used explicitly for time-sliced evaluation of research proposals against post-cutoff publications, and it is also a natural mapping for two related constructs that were not originally named FAS: the alignment score monitored online for probabilistic systems and the alignment score governing LLM fine-tuning dynamics (Wang et al., 28 Mar 2026, Henzinger et al., 28 Jul 2025, Huang et al., 18 May 2026). Across these settings, the common object is prospective rather than retrospective: FAS measures how well a model’s current output distribution, current policy, or current proposal anticipates future outcomes that have not yet been observed.
1. Conceptual scope and formal variants
The literature does not present a single canonical FAS. One paper defines FAS directly for research proposal forecasting, whereas two others define alignment scores that can be mapped to FAS in a future-facing sense. This suggests that FAS is best understood as a context-dependent formalization of future alignment rather than a universally standardized metric.
| Setting | Formal quantity | Interpretation |
|---|---|---|
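| Probabilistic runtime monitoring | Expected next-step score under a bounded proper scoring rule; cumulative Average Expected Score (AES) | Prospective next-state predictive alignment |
| LLM fine-tuning dynamics | Alignment score of the fine-tuned policy | Expected probability of producing an aligned completion |
| Research proposal forecasting | Maximum judge score over retrieved post-cutoff papers | Maximum semantic alignment to retrieved future papers |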
These variants differ materially. In the probabilistic-systems setting, FAS is induced by a bounded proper scoring rule and is typically a loss-like quantity for which lower is better under many scores. In the fine-tuning and proposal-forecasting settings, the mapped or explicit FAS is higher when alignment is stronger. The shared feature is temporal orientation: each definition evaluates present model behavior by reference to future states, future completions, or future literature rather than only past fit (Henzinger et al., 28 Jul 2025, Huang et al., 18 May 2026, Wang et al., 28 Mar 2026).
2. Sequential future alignment in probabilistic systems
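In the runtime-monitoring formulation, the state space $\mathcal{X}$ is finite. At time $t$, a model issues a probabilistic forecast over the next state,
$$q_t \in \Delta(\mathcal{X}),$$
while the system or environment determines a true but unknown distribution
$$p_t \in \Delta(\mathcal{X}).$$
The environment chooses $p_t$ based only on past observations, before the model’s prediction, as required by the sequential forecasting guarantees. After predicting $q_t$, the monitor observes the next state $x_{t+1} \sim p_t$ (Henzinger et al., 28 Jul 2025).
The score function is a bounded proper scoring rule
$$S : \Delta(\mathcal{X}) \times \mathcal{X} \to [a, b]$$
with range width $c = b - a$. Properness, in the loss orientation where lower is better, means that for all $p, q \in \Delta(\mathcal{X})$,
$$\mathbb{E}_{x \sim p}\big[S(p, x)\big] \le \mathbb{E}_{x \sim p}\big[S(q, x)\big].$$
Two examples used experimentally are the Brier score, bounded in $[0, 2]$,
$$S_{\mathrm{Brier}}(q, x) = \sum_{y \in \mathcal{X}} \big(q(y) - \mathbb{1}[y = x]\big)^2,$$
and the spherical score, bounded in $[-1, 0]$ when written as a loss,
$$S_{\mathrm{sph}}(q, x) = -\frac{q(x)}{\lVert q \rVert_2}.$$
Under this formulation, the instantaneous future alignment quantity is the expected next-step score
$$\mathrm{FAS}_t = \mathbb{E}_{x \sim p_t}\big[S(q_t, x)\big].$$
The cumulative version is the paper’s Average Expected Score (AES):
$$\mathrm{AES}_T = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{x \sim p_t}\big[S(q_t, x)\big].$$
The supplied mapping identifies cumulative FAS with this AES. The definition is explicitly prospective: $\mathbb{E}_{x \sim p_t}[S(q_t, x)]$ is the model’s expected score against the true next-state distribution before the next state is observed. Averaging these values over time produces a sustained measure of predictive alignment.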
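The scoring rules and the averaged expected score can be sketched in a few lines. This is a minimal illustration, assuming forecasts and true distributions are given as probability vectors over a finite state space and that the spherical score is written as a loss (negated); the function names are illustrative and do not come from the paper.

```python
import numpy as np

def brier_score(q, x):
    """Multiclass Brier loss of forecast q at outcome index x; lies in [0, 2]."""
    q = np.asarray(q, dtype=float)
    onehot = np.zeros_like(q)
    onehot[x] = 1.0
    return float(np.sum((q - onehot) ** 2))

def spherical_loss(q, x):
    """Negated spherical score -q[x] / ||q||_2 (assumed loss orientation); lies in [-1, 0]."""
    q = np.asarray(q, dtype=float)
    return float(-q[x] / np.linalg.norm(q))

def expected_score(score, q, p):
    """Expected score of forecast q when the outcome is drawn from p."""
    return float(sum(p[x] * score(q, x) for x in range(len(p))))

def average_expected_score(score, forecasts, truths):
    """Average of per-step expected scores over a sequence of (q_t, p_t) pairs."""
    return float(np.mean([expected_score(score, q, p)
                          for q, p in zip(forecasts, truths)]))
```

In practice the true distributions are unknown, so this quantity cannot be computed directly at runtime; it is the target that the monitor's confidence interval is meant to cover.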
This formulation is designed for settings in which formal verification depends on a probabilistic model whose validity may drift at runtime. Formal verification results are reliable only insofar as the model remains aligned with reality. Alignment monitoring supplies an online statistical check of that premise by continually testing whether the model’s successor distributions remain close to the system’s realized behavior. The weighted version further connects monitoring to verification-relevant aspects of system behavior, such as BSCC exits for safety or selected groups for fairness, although the paper does not prove that a high or low FAS directly implies a bound on downstream property error (Henzinger et al., 28 Jul 2025).
3. Estimation, confidence sequences, and monitor variants
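The average monitor operates sequentially. At each time step $t$ it computes a forecast $q_t$, observes the next state $x_{t+1}$, forms the realized score
$$s_t = S(q_t, x_{t+1}),$$
and updates the empirical average
$$\hat{\mu}_t = \hat{\mu}_{t-1} + \frac{s_t - \hat{\mu}_{t-1}}{t}.$$
In summation form, this is
$$\hat{\mu}_T = \frac{1}{T} \sum_{t=1}^{T} S(q_t, x_{t+1}).$$
The monitor outputs a time-uniform high-probability interval
$$\big[\hat{\mu}_T - \varepsilon_T,\ \hat{\mu}_T + \varepsilon_T\big]$$
covering the true AES for all times simultaneously with probability at least $1 - \delta$ (Henzinger et al., 28 Jul 2025).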
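The empirical-variance-adaptive confidence sequence is built from the realized scores and a predictable variance process, that is, a running variance estimate that at each step depends only on scores observed earlier. The half-width $\varepsilon_T$ is computed from this variance process, the range width $c$ of the scoring rule, and the confidence parameter $\delta$; the closed-form expressions are given in Henzinger et al. (28 Jul 2025). With $\hat{\mu}_T$ the empirical average score after $T$ steps and $\mathrm{AES}_T$ the Average Expected Score, the guarantee is
$$\Pr\big(\forall T \ge 1:\ \lvert \hat{\mu}_T - \mathrm{AES}_T \rvert \le \varepsilon_T\big) \ge 1 - \delta.$$
The assumptions are boundedness of the scoring rule $S$, non-anticipation of the environment, and full observability of the realized state. No stationarity or mixing assumption is required; the construction is nonparametric. The bounds are nonasymptotic and time-uniform, and the interval width shrinks with empirical-variance adaptation.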
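The monitoring loop can be sketched as follows. This is not the paper's empirical-variance-adaptive confidence sequence: as a stand-in, the sketch uses a simpler and wider Azuma-Hoeffding bound combined with a union bound over time, which is also time-uniform and needs only bounded scores and a non-anticipating environment. The class name and its parameters are illustrative assumptions.

```python
import math

class AverageScoreMonitor:
    """Running mean of realized scores with a time-uniform Hoeffding-style interval.

    Stand-in construction (not the paper's): at step t the Azuma-Hoeffding
    inequality is applied at level 6*delta/(pi^2 * t^2), so the failure
    probabilities summed over all t are at most delta.
    """

    def __init__(self, score_range, delta):
        self.c = float(score_range)   # range width of the scoring rule
        self.delta = float(delta)     # total allowed failure probability
        self.t = 0                    # number of observations so far
        self.mean = 0.0               # running mean of realized scores

    def half_width(self):
        if self.t == 0:
            return math.inf
        log_term = math.log(math.pi ** 2 * self.t ** 2 / (3.0 * self.delta))
        return self.c * math.sqrt(log_term / (2.0 * self.t))

    def update(self, realized_score):
        """Consume one realized score and return the current interval."""
        self.t += 1
        self.mean += (realized_score - self.mean) / self.t
        eps = self.half_width()
        return self.mean - eps, self.mean + eps
```

The state is constant-size (step count and running mean), which mirrors the memory profile described for the average monitor; a variance-adaptive bound would additionally track a running variance and would typically give tighter intervals when scores vary little.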
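Two extensions generalize the monitor. The differential alignment monitor compares a model with forecasts $q_t$ against a reference model with forecasts $q'_t$ through the score difference
$$D(q, q', x) = S(q, x) - S(q', x).$$
Its per-step statistic is
$$d_t = S(q_t, x_{t+1}) - S(q'_t, x_{t+1}),$$
with empirical mean $\bar{d}_T = \frac{1}{T} \sum_{t=1}^{T} d_t$. Because $d_t \in [-c, c]$ for a scoring rule of range width $c$, the range doubles to $2c$. If the upper confidence bound is below $0$, the tested model is better; if the lower bound is above $0$, the reference is better.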
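The decision rule of the differential monitor can be written down directly. The sketch below assumes a loss-oriented scoring rule passed in as a function and a confidence interval on the mean score difference supplied by some external confidence-sequence routine; the function names and return labels are illustrative.

```python
def score_difference(score, q_test, q_ref, outcome):
    """Per-step statistic: tested model's score minus the reference model's score."""
    return score(q_test, outcome) - score(q_ref, outcome)

def differential_decision(lower, upper):
    """Decide from a confidence interval [lower, upper] on the mean score difference.

    With loss-oriented scores, a negative mean difference favors the tested model.
    """
    if upper < 0.0:
        return "tested model better"
    if lower > 0.0:
        return "reference better"
    return "undecided"
```

The "undecided" outcome corresponds to the cases in which two models are too close to separate within the available horizon.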
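The weighted alignment monitor emphasizes task-critical regimes. It uses prediction weights $w_t \ge 0$ determined by history, and an outcome-dependent weighted scoring rule $S_v$ constructed from a base proper score $S$ and an outcome weight $v$; the explicit construction is given in Henzinger et al. (28 Jul 2025). Weighted time is
$$W_T = \sum_{t=1}^{T} w_t,$$
the weighted empirical score is
$$\hat{\mu}^{w}_T = \frac{1}{W_T} \sum_{t=1}^{T} w_t\, S_v(q_t, x_{t+1}),$$
and the weighted target is
$$\mathrm{AES}^{w}_T = \frac{1}{W_T} \sum_{t=1}^{T} w_t\, \mathbb{E}_{x \sim p_t}\big[S_v(q_t, x)\big].$$
The range constant becomes that of the weighted rule, and the same confidence-sequence form is used with elapsed time $T$ replaced by weighted time $W_T$.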
The algorithms maintain only the running time index, running mean, and variance process; average and differential monitors use $O(1)$ memory in the elapsed time and $O(\lvert\mathcal{X}\rvert)$ time per step to evaluate the scoring rule. Weighted monitoring remains $O(1)$ per step when weights are Markovian, but may require $O(t)$ time and memory when weights depend on full history. In experiments on the PRISM DTMC benchmarks (Brp(16,2), Conditional, Crowds(5,5), Crowds(4,3), Die, Leader(3,5), Nand(5,2), and Quantiles, each with an added self-loop to the initial state to avoid BSCCs), the monitors were reported as fast and memory-efficient and as detecting misalignment early. Time per iteration scaled linearly with support size. Differential decisions often occurred within tens to a few hundreds of observations for strong corruptions, with Die (Invert vs Env) and Crowds(4,3) (Invert vs Env) as representative cases, whereas several additive-noise cases remained indecisive at the end of the experimental horizon when models were close (Henzinger et al., 28 Jul 2025).
4. Fine-tuning dynamics and alignment forecasting in LLMs
In the fine-tuning-dynamics formulation, the mapped FAS is the paper’s alignment score
$$\mathcal{J}(\theta) = \mathbb{E}_{x \sim \mathcal{D}}\Big[\sum_{y \in \mathcal{A}(x)} \pi_\theta(y \mid x)\Big],$$
where $x$ is a prompt, $y$ is an autoregressive completion, and $\mathcal{A}(x)$ is the aligned set induced by human preference or a calibrated judge. This quantity is the expected probability that the model produces an aligned completion under the prompt distribution. The paper derives its first-order update during fine-tuning and thereby turns alignment into a forecastable dynamical variable (Huang et al., 18 May 2026).
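The formalism introduces the prefix probability
$$\pi_\theta(y_{<t} \mid x) = \prod_{s < t} \pi_\theta(y_s \mid x, y_{<s}),$$
and the future alignment potential for choosing token $v$ at step $t$,
$$Q_t(v) = \Pr_{\pi_\theta}\big(y \in \mathcal{A}(x) \mid x, y_{<t}, y_t = v\big),$$
where $\mathcal{A}(x)$ is the aligned set for prompt $x$. For a small policy change, the first-order variation of the alignment score $\mathcal{J}$ is
$$\delta \mathcal{J} = \mathbb{E}_{x}\Big[\sum_{t} \sum_{y_{<t}} \pi_\theta(y_{<t} \mid x) \sum_{v} \delta\pi_\theta(v \mid x, y_{<t})\, Q_t(v)\Big].$$
The paper then defines outcome-conditioned posteriors after a prefix:
$$p^{+}_t(v) = \Pr_{\pi_\theta}\big(y_t = v \mid x, y_{<t}, y \in \mathcal{A}(x)\big)$$
and
$$p^{-}_t(v) = \Pr_{\pi_\theta}\big(y_t = v \mid x, y_{<t}, y \notin \mathcal{A}(x)\big),$$
together with the prefix-level alignment probability
$$a_t = \Pr_{\pi_\theta}\big(y \in \mathcal{A}(x) \mid x, y_{<t}\big) = \sum_{v} \pi_\theta(v \mid x, y_{<t})\, Q_t(v).$$
Writing $\pi_\theta(v)$ for $\pi_\theta(v \mid x, y_{<t})$, if $z_t$ are logits and
$$\frac{\partial \pi_\theta(v)}{\partial z_t(u)} = \pi_\theta(v)\big(\mathbb{1}[u = v] - \pi_\theta(u)\big)$$
is the softmax Jacobian, then
$$\sum_{v} \delta\pi_\theta(v)\, Q_t(v) = \sum_{u} \delta z_t(u)\, \pi_\theta(u)\big(Q_t(u) - a_t\big),$$
and the Bayes identity
$$\pi_\theta(u)\big(Q_t(u) - a_t\big) = a_t (1 - a_t)\big(p^{+}_t(u) - p^{-}_t(u)\big)$$
converts future-dependent alignment potentials into local alignment contrasts between outcome-conditioned token posteriors.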
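The step from alignment potentials to posterior contrasts can be checked numerically on a toy vocabulary. The sketch below assumes a single prefix context with a token distribution `pi` and per-token alignment potentials `Q` (the probability that the completion ends up aligned if that token is chosen); these names, and the helper itself, are illustrative rather than the paper's notation.

```python
import numpy as np

def alignment_contrast(pi, Q):
    """Return (a, p_plus, p_minus, grad) for one prefix context.

    a       : probability of an aligned completion from this prefix
    p_plus  : token posterior given an aligned outcome (Bayes rule)
    p_minus : token posterior given a non-aligned outcome (Bayes rule)
    grad    : gradient of a with respect to the logits, via the softmax Jacobian
    """
    pi = np.asarray(pi, dtype=float)
    Q = np.asarray(Q, dtype=float)
    a = float(pi @ Q)
    p_plus = pi * Q / a
    p_minus = pi * (1.0 - Q) / (1.0 - a)
    jacobian = np.diag(pi) - np.outer(pi, pi)  # softmax Jacobian (symmetric)
    grad = jacobian @ Q
    return a, p_plus, p_minus, grad
```

For any valid inputs, `grad` coincides with `a * (1 - a) * (p_plus - p_minus)`, which is the sense in which a future-dependent quantity reduces to a local contrast between two token posteriors.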
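Under the empirical neural tangent kernel framework and the Relatively Stable Kernel assumption, fine-tuning updates logits as
$$\Delta z_t = -\eta\, K\, \nabla_z \mathcal{L},$$
where $\eta$ is the learning rate and $K$ is the empirical NTK coupling the training example to the evaluated context. For supervised fine-tuning with cross-entropy, $\nabla_z \mathcal{L} = \pi_\theta - \pi^{\ast}$, where $\pi^{\ast}$ is the target next-token distribution induced by the training data. The resulting alignment evolution decomposes into a Rebound Force and a Driving Force:
$$\Delta \mathcal{J} \approx F_{\mathrm{drive}} + F_{\mathrm{rebound}}.$$
Let $a_t$ denote the probability of an aligned completion after the current prefix and $p^{+}_t$, $p^{-}_t$ the token posteriors conditioned on an aligned or non-aligned outcome. Written for a single prefix context (the full update averages over prompts and prefixes), the Driving Force is
$$F_{\mathrm{drive}} = \eta\, a_t (1 - a_t)\, \big(p^{+}_t - p^{-}_t\big)^{\top} K\, \pi^{\ast},$$
and the Rebound Force is
$$F_{\mathrm{rebound}} = -\eta\, a_t (1 - a_t)\, \big(p^{+}_t - p^{-}_t\big)^{\top} K\, \pi_\theta, \qquad \pi_\theta = a_t\, p^{+}_t + (1 - a_t)\, p^{-}_t.$$
The interpretation is that the Driving Force is determined by how the training distribution aligns with the token-level contrast between aligned and non-aligned posteriors, while the Rebound Force is an intrinsic self-interaction governed jointly by the current alignment state and posterior narrowness. In the identity-kernel simplification,
$$F_{\mathrm{rebound}} = -\eta\, a_t (1 - a_t) \Big[a_t \lVert p^{+}_t \rVert_2^2 - (1 - a_t) \lVert p^{-}_t \rVert_2^2 + (1 - 2 a_t) \langle p^{+}_t, p^{-}_t \rangle\Big],$$
making explicit that posterior narrowness enters through sum-of-squares terms such as
$$\lVert p^{+}_t \rVert_2^2 = \sum_{v} p^{+}_t(v)^2.$$
More concentrated posteriors have larger quadratic forms and therefore stronger rebound. The same framework explains the Rehearsal Priming Effect: prior alignment can leave a latent posterior imprint that increases the later Driving Force under re-exposure, yielding faster re-alignment.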
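The decomposition can be illustrated for a single prefix context under the identity-kernel simplification. The sketch below assumes one gradient step on logits of the form `eta * (target - pi)`, a fixed vector of alignment potentials `Q`, and the sign convention that the rebound term carries the minus sign; all names are illustrative, and the single-context form is a simplification of the full update.

```python
import numpy as np

def sft_alignment_forces(pi, Q, target, eta):
    """First-order change in alignment probability for one prefix, identity kernel.

    pi     : current token distribution
    Q      : probability of an aligned completion after each token (held fixed)
    target : next-token distribution induced by the fine-tuning data
    eta    : learning rate
    Returns (driving, rebound, a, p_plus, p_minus).
    """
    pi, Q, target = (np.asarray(v, dtype=float) for v in (pi, Q, target))
    a = float(pi @ Q)
    p_plus = pi * Q / a                    # posterior given aligned outcome
    p_minus = pi * (1.0 - Q) / (1.0 - a)   # posterior given non-aligned outcome
    contrast = a * (1.0 - a) * (p_plus - p_minus)
    driving = eta * float(contrast @ target)   # pull from the training data
    rebound = -eta * float(contrast @ pi)      # self-interaction of the policy
    return driving, rebound, a, p_plus, p_minus
```

For small `eta`, `driving + rebound` matches the actual change in `pi @ Q` after applying the logit step and renormalizing with a softmax, up to second-order terms. The rebound term grows in magnitude as `p_plus` or `p_minus` becomes more concentrated, which is the narrowness effect described above.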
The empirical study covered safety alignment, emergent misalignment, and IMDb sentiment, using Llama-3.1-8B, Gemma-2-2B, and Qwen3-8B, with a three-stage supervised fine-tuning paradigm: Stage 1 forward alignment or misalignment, Stage 2 reverse fine-tuning, and Stage 3 re-exposure. Each stage used the same number of samples. The reported findings were consistent with the theory: alignment reversal under reverse fine-tuning, stronger rebound under lower-diversity data that induced narrower posteriors, and accelerated re-alignment in Stage 3 after stronger Stage 1 priming. Controlled score matching in Stage 3 applied one tolerance for Safety Rate and Positive Score and a separate tolerance for Misalignment Rate (Huang et al., 18 May 2026).
5. Time-sliced scientific forecasting for research proposals
The explicit use of FAS appears in a proposal-evaluation framework that treats research ideation as a time-sliced scientific forecasting task. Given a research question and inspiring papers available before a cutoff time $t_c$, a model generates a structured proposal $\hat{y}$, and evaluation asks whether that proposal anticipates directions that appear in papers published after the cutoff. The future corpus is
$$\mathcal{F} = \{d : \mathrm{date}(d) > t_c\},$$
candidate papers are retrieved by embedding similarity,
$$\mathcal{R}_k(\hat{y}) = \operatorname*{top\text{-}k}_{d \in \mathcal{F}}\ \cos\big(e(\hat{y}), e(d)\big),$$
and the score is
$$\mathrm{FAS}(\hat{y}) = \max_{d \in \mathcal{R}_k(\hat{y})} \mathrm{Judge}(\hat{y}, d).$$
No additional normalization, weighting, or macro/micro variants are defined (Wang et al., 28 Mar 2026).
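The retrieve-then-score procedure can be sketched as follows. The sketch assumes embeddings are already available as NumPy arrays and that the semantic judge is supplied as a callable mapping a future-paper index to a score; in the paper these roles are played by an embedding model and an LLM judge, which are not called here. The function name and signature are illustrative.

```python
import numpy as np

def future_alignment_score(proposal_vec, future_vecs, judge, k):
    """Maximum judge score over the top-k future papers by cosine similarity.

    proposal_vec : 1-D embedding of the full proposal text
    future_vecs  : 2-D array, one row per post-cutoff paper
    judge        : callable taking a paper index and returning a numeric score
    k            : retrieval depth
    Returns (score, retrieved_indices).
    """
    p = np.asarray(proposal_vec, dtype=float)
    F = np.asarray(future_vecs, dtype=float)
    p = p / np.linalg.norm(p)
    F = F / np.linalg.norm(F, axis=1, keepdims=True)
    sims = F @ p                                  # cosine similarities
    retrieved = np.argsort(-sims)[:k]             # indices of the k most similar papers
    score = max(judge(int(i)) for i in retrieved)
    return score, [int(i) for i in retrieved]
```

Because the judge only sees retrieved candidates, a highly aligned paper that falls outside the top-k does not contribute, which is one reason absolute scores depend on the retrieval depth and embedding model.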
The generated proposal has a fixed schema with five fields: Research Question, Hypothesis, Proposed Method, Novelty Claims, and Experimental Details. Retrieval uses the full proposal text. The semantic judge returns a score for each of those five fields and an overall score, using a five-level ordered rubric: unrelated, same broad area, some overlap, very similar, and nearly identical. Overall FAS is the maximum overall score across the retrieved future papers, and component-level FAS is obtained from the same retrieve-then-score procedure.
The evaluation pipeline uses text-embedding-3-large for embeddings, cosine similarity, top-$k$ retrieval with a fixed $k$, and GPT-4.1-mini as the judge at a fixed temperature. Robustness checks varied $k$, replaced text-embedding-3-large with text-embedding-3-small, and replaced GPT-4.1-mini with GPT-4o-mini; rankings of systems remained consistent. On a robustness subset of re-evaluations, scores were reported for the tuned model, the untuned model, and CoI under the baseline configuration, together with Pearson and Spearman correlations relative to the baseline under the alternative retrieval setting, and model ordering was stable under embedding-model and judge-model substitutions.
The dataset is time-consistent and consists of machine learning papers from NeurIPS, ICML, and ICLR. Papers from an earlier period were used to build training supervision, and papers from a later period formed the post-cutoff future evaluation targets. The experiments sampled separate sets of training and evaluation instances. For each target paper, the research question was extracted in a leakage-controlled way, inspiring papers were selected from its references through a two-stage pipeline, and the proposal target was synthesized as a forward-looking structured proposal. The shortlist stage retained the top-ranked references using recency-weighted scoring with a full boost for references within two years of the target, linear decay up to five years, exclusion of references below a minimum citation count, and a small citation-count tiebreaker. GPT-5-mini then selected the references most likely to have been direct inspirations.
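The shortlist stage can be sketched as a simple scoring function. Only the two-year full boost and the linear decay up to five years come from the description above; the decay reaching zero at five years, the tiebreaker weight `tie_eps`, the record fields, and the parameters `top_n` and `min_citations` are hypothetical choices made for illustration, since the actual values are not stated here.

```python
def recency_weight(age_years):
    """Full weight up to two years old, linear decay to zero at five years (assumed endpoint)."""
    if age_years <= 2:
        return 1.0
    if age_years < 5:
        return (5 - age_years) / 3.0
    return 0.0

def shortlist_references(refs, target_year, top_n, min_citations, tie_eps=1e-6):
    """Rank references by recency weight with a small citation-count tiebreaker.

    refs is a list of dicts with hypothetical keys 'id', 'year', and 'citations'.
    References below min_citations are excluded before ranking.
    """
    eligible = [r for r in refs if r["citations"] >= min_citations]
    def score(r):
        return recency_weight(target_year - r["year"]) + tie_eps * r["citations"]
    ranked = sorted(eligible, key=score, reverse=True)
    return ranked[:top_n]
```

A second stage, in the paper an LLM selection step, would then pick the likely direct inspirations from this shortlist.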
Training used supervised fine-tuning rather than direct optimization of FAS, because FAS is non-differentiable. The supervision consisted of time-consistent pairs mapping a research question and its inspiring papers to a proposal, with optional citation-grounded reasoning traces teaching gap identification and inspiration borrowing. Three variants were compared: Direct SFT without reasoning traces, CoT SFT with a single monolithic reasoning block, and Stepwise CoT SFT interleaving problem identification, method design, and experiment design reasoning blocks. The models were Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct, and Qwen2.5-14B-Instruct, fine-tuned with LoRA under fixed settings for rank, scaling factor, number of epochs, maximum sequence length, per-device batch size, gradient accumulation, and learning rate, in bf16 precision, primarily on H100 GPUs.
The main results reported higher overall FAS for future-aligned tuning than for prompting alone, with improvements for each of Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct, and Qwen2.5-14B-Instruct. Gains were concentrated in Hypothesis and Proposed Method, while Experimental Details improved less consistently. The adapted AI-Researcher and Chain-of-Ideas baselines scored lower FAS than the full method and, in the reported comparisons, lower even than the prompting baseline. Citation-removal ablations showed drops in overall and component-level FAS when Background, Method, or Benchmark citations were removed, with Novelty the most sensitive component.
Human evaluation used pairwise comparisons judged by three domain-expert graduate students. Against human-derived proposals, the overall win/tie/loss record of the Stepwise CoT system was described as competitive; a win/tie/loss record was also reported against prompting-only proposals. An additional LLM-based proposal-quality metric over Resource Validity, Task–Method Consistency, and Task–Experiment Consistency showed the strongest average score for Stepwise CoT, ahead of prompting. Two case studies implemented model-generated proposals with a code agent: a strategy-search prompting method yielded an accuracy gain on MATH, and a model-merging method, MALS, gave consistent improvements on reasoning tasks, including on ARC-Easy relative to TIES-Merging in the excerpted results (Wang et al., 28 Mar 2026).
6. Assumptions, limitations, and interpretive issues
A central interpretive issue is terminological. Only the proposal-forecasting paper defines the name “Future Alignment Score” directly. In the other two cases, the term is a natural mapping supplied for future-oriented use: the runtime-monitoring paper defines an alignment score and AES, and the fine-tuning paper defines an alignment score with a closed-form first-order update. This means that FAS is not yet a standardized cross-domain metric, even though the three formulations share a future-facing semantics (Henzinger et al., 28 Jul 2025, Huang et al., 18 May 2026, Wang et al., 28 Mar 2026).
The probabilistic-systems formulation requires bounded scoring rules, non-anticipating environments, and full observability of realized outcomes. If the environment conditions its next-state distribution $p_t$ on the current forecast $q_t$, the supermartingale-based confidence sequence may fail. Partial observability, delayed or noisy observations, and unbounded scoring rules such as unclipped log score require new estimators or restrictions. The paper also emphasizes that different proper scores emphasize different aspects of forecast quality, affecting bias, variance, and convergence speed.
The fine-tuning-dynamics formulation depends on first-order analysis, the empirical NTK framework, and the Relatively Stable Kernel assumption. The closed-form update is most accurate for small learning rates and short horizons; for longer runs or larger parameter shifts, recomputation or adaptation of kernels and posteriors is recommended. Estimating the alignment potentials, outcome-conditioned posteriors, and kernels is computationally expensive, and judge noise can bias the aligned set. The paper further identifies a trade-off: higher diversity in alignment data can slow initial alignment gains while reducing posterior narrowness and weakening rebound.
The proposal-forecasting formulation is domain-limited to machine-learning conferences with their own citation practices and publication tempo. Its objective is similarity to future published work, not necessarily novelty or correctness, so genuinely novel ideas that do not later appear in the corpus may be under-rewarded. The supervision pipeline also uses LLMs for inspiring-paper selection and reasoning-trace synthesis, which can introduce bias. Although robustness tests showed stable rankings across retrieval depth, embedding model, and judge model, absolute scores depend on those choices. The paper explicitly notes that time filtering and semantic judging mitigate but do not eliminate the possibility of gaming.
A common misconception would be to treat FAS as a direct proof of downstream property validity, durable safety, or scientific merit. None of the three formulations makes that claim. In runtime monitoring, FAS supplies anytime intervals on predictive alignment rather than direct property-error bounds. In fine-tuning dynamics, FAS predicts alignment trajectories under modeled update assumptions rather than guaranteeing global robustness. In proposal forecasting, FAS is a verifiable surrogate for proposal quality rather than a direct measure of novelty, correctness, or impact. What the three lines of work jointly establish is narrower but precise: future-oriented alignment can be formalized, estimated, and, in some settings, forecasted with explicit statistical or dynamical structure.