
Proof of Time: A Benchmark for Evaluating Scientific Idea Judgments

Published 12 Jan 2026 in cs.CL and cs.AI | (2601.07606v1)

Abstract: LLMs are increasingly being used to assess and forecast research ideas, yet we lack scalable ways to evaluate the quality of models' judgments about these scientific ideas. Towards this goal, we introduce PoT, a semi-verifiable benchmarking framework that links scientific idea judgments to downstream signals that become observable later (e.g., citations and shifts in researchers' agendas). PoT freezes a pre-cutoff snapshot of evidence in an offline sandbox and asks models to forecast post-cutoff outcomes, enabling verifiable evaluation when ground truth arrives, scalable benchmarking without exhaustive expert annotation, and analysis of human-model misalignment against signals such as peer-review awards. In addition, PoT provides a controlled testbed for agent-based research judgments that evaluate scientific ideas, comparing tool-using agents to non-agent baselines under prompt ablations and budget scaling. Across 30,000+ instances spanning four benchmark domains, we find that, compared with non-agent baselines, higher interaction budgets generally improve agent performance, while the benefit of tool use is strongly task-dependent. By combining time-partitioned, future-verifiable targets with an offline sandbox for tool use, PoT supports scalable evaluation of agents on future-facing scientific idea judgment tasks.

Summary

  • The paper introduces a time-partitioned framework using pre-cutoff evidence and post-cutoff outcomes to benchmark scientific idea judgments.
  • It employs evidence freezing and offline sandboxing to eliminate contamination, enabling scalable, semi-verifiable future impact predictions.
  • Empirical results reveal task- and model-dependent agentic benefits, substantial gains from test-time compute scaling, and diagnostic insights from execution traces.

Proof of Time: Semi-Verifiable Benchmarking for Scientific Idea Judgment in AI

Introduction and Motivation

The evaluation of scientific ideas—especially their future impact—remains a central challenge in metascience and AI for Science initiatives. Existing infrastructures rely predominantly on immediate peer judgments or static benchmarks, which are both time-constrained and subject to knowledge contamination. The "Proof of Time" (PoT) framework addresses gaps in the verifiable, scalable, and future-oriented assessment of scientific idea judgments by LLMs and agents. The framework operationalizes a time-partitioned benchmarking paradigm: solvers are given only pre-cutoff evidence and must forecast signals (citations, awards, SOTA shifts, and research agendas) observable only at a later point (Figure 1).

Figure 1: Workflow overview for the PoT benchmark, including dataset creation, task family construction, and the agentic offline sandbox protocol.

Benchmark Design and Task Formalization

PoT instances are formalized as tuples $(\mathcal{E}_{\leq t_0}, q, y_{t_1})$, comprising pre-cutoff evidence $\mathcal{E}_{\leq t_0}$, a query $q$ over candidates (e.g., papers, researchers), and a post-cutoff label $y_{t_1}$ (a minimal representation sketch follows the list below). Key methodological innovations include:

  • Evidence freezing and offline sandboxing, eliminating web-based contamination and making gains from tool use interpretable.
  • Use of naturally emerging, post-cutoff signals rather than static, human-annotated targets, achieving semi-verifiability at scale.
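
The tuple formalization above maps naturally onto a small data structure. The following is a minimal sketch of one way to represent an instance; the field names and helper are illustrative assumptions, not the paper's released schema:

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class PoTInstance:
    """One benchmark instance: pre-cutoff evidence, a query, and a post-cutoff label."""
    evidence: dict[str, Any]  # frozen snapshot E_{<= t0}: paper metadata, leaderboards, author records
    query: str                # question over candidates, e.g. "which of these papers will be cited most?"
    label: Any                # outcome y_{t1}; only observable after the cutoff
    cutoff: str               # t0, ISO date at which the evidence snapshot is frozen
    horizon: str              # t1, ISO date at which the label becomes checkable

def is_scorable(instance: PoTInstance, today: str) -> bool:
    """The post-cutoff label may be used for scoring only once the horizon has passed."""
    return today >= instance.horizon
```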

PoT encompasses four primary task families, each targeting distinct facets of scientific impact and reasoning:

  1. Impact Prediction: Citation forecasting on papers in top conferences, isolating parametric forecasting from evidence-based judgments.
  2. Peer-Review Award Prediction: Anticipation of award tiers as operationalized by official conference outcomes.
  3. Research Evolution: Forecasting researchers' field focus and publication attribution, requiring longitudinal reasoning.
  4. Technological Frontier (SOTA): Extrapolation of benchmark progression and SOTA scores, emphasizing robustness to prompt and metric variations.

Experimental Setup

The evaluation suite covers 30,000+ instances distributed over these task families. Solvers include:

  • Zero-shot models with only prompt-based inference.
  • Agentic models leveraging an offline toolset (including Python, shell, text editors) under strict evidence constraints.
  • Agentic models with an additional structured prompt, designed to maximize evidence extraction and protocol adherence.

Test-time budgets are modulated via message limits (15, 30, 50), quantifying the relationship between agentic interaction and performance gains. The model pool comprises recent foundation models spanning the Anthropic Claude 4.5 series, Google Gemini lineage, and OpenAI GPT-5 family.
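
A hedged sketch of how a message-limit budget might gate a ReAct-style agent loop; the `agent` and `tools` interfaces here are placeholders, not the paper's actual harness:

```python
def run_agent(instance, agent, tools, message_limit: int = 30):
    """ReAct-style loop: act in the offline sandbox until an answer is given or the budget runs out."""
    history = [{"role": "user", "content": instance.query}]
    for turn in range(1, message_limit + 1):
        step = agent.step(history)                   # propose a tool call or a final answer
        if step.is_final_answer:
            return step.answer, turn                 # answered within budget
        observation = tools.execute(step.tool_call)  # local-only tools: python, shell, file readers
        history.append({"role": "assistant", "content": step.content})
        history.append({"role": "tool", "content": observation})
    # Budget exhausted (the "incomplete" routing outcome): force an answer from the gathered evidence.
    return agent.finalize(history), message_limit
```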

Results and Analysis

Scaling with Test-Time Compute

The analysis demonstrates that increased message limits yield marked accuracy improvements across most agentic configurations, with family-specific differences in scaling efficiency. Notably, Claude models benefit substantially from greater test-time compute, with accuracy gains exceeding 25 percentage points from lowest to highest budget. Gemini models display high initial performance but reduced marginal scaling, while GPT variants present moderate but saturating improvements (Figure 2).

Figure 2: (A) Compute scaling across models; (B) Task-family agentic vs. zero-shot performance; (C) Effect of structured prompting.


Figure 3: Model-wise scaling gains from increased compute budgets (Acc@50 – Acc@15).


Figure 4: Family-level scaling trends highlighting systematic differences in compute utility.
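
The scaling gain in Figure 3 is simply the accuracy difference between the highest and lowest budgets. A minimal sketch, using illustrative numbers rather than the paper's reported values:

```python
def scaling_gain(acc_by_budget: dict[int, float]) -> float:
    """Gain from test-time compute: accuracy at the highest message limit minus at the lowest (e.g., Acc@50 - Acc@15)."""
    lo, hi = min(acc_by_budget), max(acc_by_budget)
    return acc_by_budget[hi] - acc_by_budget[lo]

# Illustrative numbers only, not values reported in the paper:
print(round(scaling_gain({15: 0.41, 30: 0.55, 50: 0.68}), 2))  # 0.27, i.e. a 27-point gain
```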

Agentic Versus Zero-Shot Performance

Agentic solvers outperform zero-shot solvers when tasks require evidence exploration and aggregation, with the Faculty task family exhibiting the largest boost—from near random to ~66% accuracy. Conversely, Peer-Review Award prediction shows negligible aggregate improvement, reflecting the weak signal in the available pre-cutoff evidence and the subjectivity of human labels (Figure 5).

Figure 5: Direct comparison of zero-shot vs. agentic performance across task-model pairs.

Prompt Engineering and Robustness

The structured prompt, designed to enforce explicit agentic protocols and local-only operation, yields variable improvements. For some model families (notably Claude), such structuring amplifies agentic gains, but for others, especially GPT, it can be detrimental or neutral—indicating model-dependent optimal prompting regimes and suggesting risks associated with over-prescription at the policy level.

Post-Cutoff versus Pre-Cutoff Evaluation

PoT's value is underscored by its capacity to reveal shifts in model ranking when evaluation is anchored on post-cutoff, contamination-resistant signals. For example, OpenAI and Gemini models display pronounced changes, gaining or dropping up to roughly 25 percentage points, when shifting from pre-cutoff to post-cutoff signals. This highlights the practical necessity of time-partitioned benchmarking for robust evaluation.
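
One simple way to quantify such ranking shifts is to compare model orderings under pre-cutoff and post-cutoff scoring. A small sketch with hypothetical accuracies (not the paper's numbers):

```python
def rank_shift(pre: dict[str, float], post: dict[str, float]) -> dict[str, int]:
    """Change in rank per model when moving from pre-cutoff to post-cutoff scoring.

    Positive values mean the model moves up once post-cutoff outcomes are used.
    """
    def ranks(scores: dict[str, float]) -> dict[str, int]:
        ordered = sorted(scores, key=scores.get, reverse=True)
        return {model: i for i, model in enumerate(ordered, start=1)}
    r_pre, r_post = ranks(pre), ranks(post)
    return {model: r_pre[model] - r_post[model] for model in pre}

# Hypothetical accuracies, for illustration only:
pre  = {"model_a": 0.72, "model_b": 0.65, "model_c": 0.60}
post = {"model_a": 0.55, "model_b": 0.70, "model_c": 0.62}
print(rank_shift(pre, post))  # {'model_a': -2, 'model_b': 1, 'model_c': 1}
```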

Failure Modes and Agentic Diagnostics

Analysis of 1,759 agentic execution traces reveals three primary routing outcomes: complete-correct, complete-wrong, and incomplete (budget-exhausted). Completion errors are largely attributed to reasoning mistakes (37.7%) and retrieval errors (36.3%), while incompleteness is dominated by looping or thrashing behaviors (up to 36%), demonstrating that agentic architecture and protocol discipline are critical determinants of success. Even with correct outcomes, many “lucky” paths and redundant explorations were observed, underscoring the need for trace-based evaluation and process audits beyond aggregate scores (Figure 6).

Figure 6: Distribution and analysis of execution traces that exhaust the allowed interaction/message budget.
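
A minimal sketch of how annotated traces could be aggregated into the failure-mode shares reported above; the field names are assumptions, not the paper's trace schema:

```python
from collections import Counter

def failure_breakdown(traces: list[dict]) -> dict[str, dict[str, float]]:
    """Share of each annotated failure category within each routing outcome.

    Expects per-trace dicts with an "outcome" field (complete-correct, complete-wrong,
    incomplete) and an optional "failure_category" (e.g., reasoning, retrieval, looping).
    """
    by_outcome: dict[str, Counter] = {}
    for trace in traces:
        counts = by_outcome.setdefault(trace["outcome"], Counter())
        counts[trace.get("failure_category", "none")] += 1
    return {
        outcome: {cat: n / sum(counts.values()) for cat, n in counts.items()}
        for outcome, counts in by_outcome.items()
    }
```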

Cost, Efficiency, and Practical Trade-Offs

Agentic operation entails substantial compute and token overhead—one to two orders of magnitude higher than zero-shot inference. Efficiency frontiers plotted as accuracy gain versus token overhead reveal that such expenses are only justified in evidence-intensive, information-sparse settings (Faculty, Citations). Routine use of high-budget agents is not recommended unless required by the prospective cost of downstream errors or task importance (Figure 7).

Figure 7: Model-level efficiency frontier depicting the non-linear trade-off between agentic accuracy gains and token overhead.


Figure 8: Task-family efficiency comparison, highlighting greatest agentic payoffs for evidence-rich domains.
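
A hedged sketch of how the points on such an efficiency frontier could be computed, assuming per-configuration accuracy and token totals are available; the field names are illustrative:

```python
def efficiency_points(runs: list[dict]) -> list[tuple[float, float]]:
    """(token overhead ratio, accuracy gain over zero-shot) for each agentic configuration.

    Each run dict is assumed to hold agent and zero-shot accuracy plus total token counts.
    """
    points = []
    for run in runs:
        overhead = run["agent_tokens"] / max(run["zeroshot_tokens"], 1)
        gain = run["agent_accuracy"] - run["zeroshot_accuracy"]
        points.append((overhead, gain))
    return sorted(points)  # sorted by overhead, ready to plot as a frontier
```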

Theoretical and Practical Implications

The PoT benchmark presents a scalable, refreshable framework for time-indexed model evaluation, uniquely supporting:

  • Semi-verifiable, future-facing evaluation through external post-cutoff signals.
  • Isolation of contamination effects, essential in the context of ever-expanding LLM pre-training corpora.
  • Detailed agentic diagnostic auditing, relevant for agent alignment research and robust tool-use protocol design.

Empirically, the findings challenge a uniform “agent always helps” assumption. Agentic gains are sharply task- and model-dependent. The negligible improvement on Peer-Review Award tasks raises questions about the adequacy of current automated proxies for subjective scientific value and highlights the inherent limitations imposed by proxy noise and label design.

Future Directions

Key targets for future development include:

  • Expanding benchmark coverage beyond NLP and AI research venues, integrating multi-domain metascientific signals as they become available.
  • Exploring architectural and interface improvements to agentic solvers that mitigate looping and budget exhaustion.
  • Investigating adaptive budget allocation strategies, dynamic prompt calibration, and more expressive interaction policies.
  • Integrating process-oriented diagnostics in evaluation protocols, moving beyond outcome-only metrics.
  • Formalizing protocols for minimizing benchmark data contamination as model and data scale increases.

Conclusion

The "Proof of Time" framework provides a principled, semi-verifiable, and contamination-resilient foundation for the evaluation of scientific idea judgments by LLMs and agentic systems (2601.07606). Through time-partitioned design and post-cutoff outcome linkage, PoT enables direct measurement of future-predictive capability—a critical requirement for AI systems deployed within scientific workflows. The observed empirical heterogeneity across tasks, models, and prompting strategies underscores the necessity of granular, task-specific evaluation for agentic systems and paves the way for more comprehensive, evidence-grounded assessment protocols in metascience and AI for Science.


Explain it Like I'm 14

Proof of Time (PoT): A Simple Guide

What this paper is about

This paper introduces Proof of Time (PoT), a new way to test how well AI systems can judge scientific ideas. In other words, it checks whether AI can look at information available today and make smart predictions about which research ideas will matter in the future.

The big questions the paper asks

The authors wanted to know:

  • Can AI predict future signals of a research idea’s importance (like how many citations a paper will get or which papers will win awards)?
  • Is there a fair, scalable way to test these predictions without relying on lots of human grading?
  • Do “agent” AIs (which can use tools to analyze files and data) make better judgments than regular AIs that just answer directly?
  • How does giving an AI more time/steps to think (a bigger “message budget”) change its performance?

How they tested it (in everyday terms)

Think of this like a “time capsule” game for AI:

  1. Freeze time at a certain date (the “cutoff”). Give the AI only the information that existed before that date (titles, abstracts, old leaderboards, past publications, etc.).
  2. Ask the AI to make predictions about what will happen after that date (for example, which paper will get the most citations next year).
  3. Wait until the real results are known, then check if the AI was right. This makes the test fair and verifiable.

To keep things controlled, the AI works in an “offline sandbox.” That means:

  • No internet.
  • It can only use the files provided (the frozen evidence).
  • In “agent” mode, the AI can use simple tools (like reading files or running small Python scripts) to explore and analyze the evidence before answering.

They tested four kinds of future-focused tasks:

  • Citations (Impact): predicts which new papers will be cited more; a proxy for which ideas influence future research.
  • Awards (Peer Review): predicts which papers will win conference awards; measures alignment with expert judgments.
  • Research Evolution (Faculty): predicts how a professor’s research will shift (topics, authorship); tests understanding of research trajectories.
  • SOTA Forecasting (Technological Frontier): predicts how model performance on benchmarks will improve; checks whether the AI can read trends and extrapolate.

They compared:

  • Zero-shot: the AI answers directly from the prompt (no tools).
  • Agentic: the AI explores the evidence with tools inside the sandbox.
  • Agentic + structured prompt: same as above, but with more detailed instructions on how to use tools.
  • Different “message limits” (like 15, 30, or 50 steps), which is like giving the AI more time to think and work.

What they found (in clear terms)

Here are the main results:

  • More thinking time helps: Allowing the agent more interaction steps (a bigger message budget) generally improved results. The AI used the extra steps to find and check more evidence.
  • Agents help most when evidence matters: Tool-using agents did much better than direct answers on tasks where exploring the provided files helps a lot—especially the Faculty (research evolution) tasks. They also improved moderately on Citations.
  • Not all tasks benefit equally: For Awards, agent tools didn’t help much on average. For SOTA (predicting benchmark performance in broad buckets), both direct and agent methods were already near the top, so there wasn’t much room to improve.
  • Prompts aren’t magic: Adding a structured prompt (extra instructions) sometimes helped, sometimes didn’t. It depended on the model.
  • Real-time labels change the story: When they switched from testing on pre-cutoff data to testing on post-cutoff (future) outcomes, some models’ rankings changed a lot. This shows why judging future-focused tasks with future data is important.
  • Why agents fail: When agents got things wrong, it was mostly due to:
    • Reasoning mistakes (misinterpreting evidence).
    • Retrieval/tool issues (finding or parsing the wrong file/data).
    • Looping or running out of steps before finishing.

Why this matters

  • Fairer, future-focused testing: PoT avoids “cheating” from memorized training data by using post-cutoff outcomes. It tests true forecasting ability, not recall.
  • Scalable and refreshable: Because the “answers” come from real-world future signals (like citations and awards), you don’t need huge teams of experts to label everything.
  • Practical guidance: The results give a realistic view of when to use agentic AIs. Use them (and give them more steps) for evidence-heavy tasks where careful checking matters; keep it simple for easy or already-saturated tasks.
  • Understanding limits: Since proxies like citations or awards aren’t perfect measures of “idea quality,” the framework is “semi-verifiable.” Still, it’s a strong step toward evaluating how well AI can judge scientific ideas over time.

In short

PoT is like asking AI to predict tomorrow using only yesterday’s information, then grading it when “tomorrow” actually arrives. It shows that tool-using agents can be powerful for research judgment—especially when they have time to explore evidence—but their benefits depend on the task, the instructions, and the model. This benchmark gives researchers a fair, scalable way to measure and improve AI’s ability to assess which scientific ideas will stand the test of time.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, concrete list of what remains missing, uncertain, or unexplored, phrased so future researchers can act on it.

  • Validate proxy constructs: systematically quantify how well each proxy (citations, award tiers, SOTA buckets, faculty topic shifts) maps to “idea quality” across subfields and time, using field-normalized metrics, altmetrics, expert ratings, and long-horizon outcomes.
  • Clarify temporal design: explicitly specify $t_0$ and $t_1$ per task, horizon lengths, and update cadence; run sensitivity analyses to show how performance and conclusions change with different cutoffs/horizons.
  • Expand domain coverage: extend beyond NLP/AI conferences (ACL/NAACL/EMNLP) to diverse disciplines (e.g., biomedicine, physics, social sciences), journals vs. conferences, and multilingual corpora to assess generalizability.
  • Improve ranking evaluation: for “Ranking” tasks, replace exact-match accuracy with rank-aware metrics (Spearman/Kendall, NDCG), confidence intervals, and calibration curves to capture ordering quality and uncertainty.
  • Elicit probabilistic forecasts: require and score probability distributions (e.g., citation buckets, award tiers) using Brier score, log-likelihood, and expected calibration error to assess calibration and risk-aware judgments (a scoring sketch follows this list).
  • Control for contamination: design counterfactual/perturbed variants (title/abstract paraphrases, masked identifiers, synthetic decoys) and measure how model performance shifts to disentangle parametric memorization from evidence-grounded inference.
  • Disentangle style vs. substance: in award-tier prediction, ablate text signals (e.g., mask stylistic markers) to test whether models predict awards based on writing style/venue norms versus core contribution.
  • Normalize citation signals: apply field/year normalization (e.g., z-scores within venue-year, co-author network controls) to reduce exposure effects and reveal genuine impact forecasting.
  • Address award taxonomy consistency: clarify whether “Findings” is treated as an award tier or track; align tiers to true award categories (e.g., Best Paper, Outstanding Paper) and handle track-specific eligibility cleanly.
  • Strengthen faculty-task ground truth: replace or validate LLM-derived keywords/field labels with authoritative sources (ORCID, DBLP, institutional pages) and conduct human validation for author–field mappings.
  • Identity disambiguation: implement robust author disambiguation (name variants, common names) and measure its error rate’s effect on faculty and citation tasks.
  • Long-horizon impact: add multi-year horizons (e.g., 3–5+ years) to capture slow-burn impact and measure temporal stability of early forecasts.
  • Benchmark refresh governance: define versioning, periodic refresh schedules, data sources (OpenAlex/Crossref vs. Google Scholar), and reproducible joining heuristics; publish change logs and drift diagnostics.
  • Trace annotation reliability: replace single LLM-as-judge with multi-judge ensembles, report inter-rater reliability, and include human adjudication for a sampled subset of agent traces.
  • Failure-to-fix studies: for each failure mode (reasoning, retrieval/tooling, looping), run targeted interventions (e.g., last-mile verification steps, better search formulation, parsing guards) and quantify error reduction.
  • Agent architecture ablations: compare single-agent ReAct against planning agents, multi-agent debate, memory-augmented agents, and learned tool-use policies (e.g., RL fine-tuning) under identical sandbox constraints.
  • Adaptive budget policies: develop cost-aware inference strategies (e.g., early-exit criteria, bandit-based budget allocation per instance) and measure accuracy–cost efficiency frontiers in tokens, wall-clock time, and energy.
  • Tooling diversity: test richer offline tools (vector search over snapshot, lightweight citation graphs, structured metadata indices) and measure their incremental value while preserving isolation from post-cutoff data.
  • Decoding settings and robustness: standardize and report temperature, top-p, seed settings; run robustness sweeps to ensure conclusions are not artifacts of sampling policies.
  • Human–model misalignment analysis: quantify where models diverge from peer-review outcomes (topic areas, institutions, novelty classes), and study whether divergences correlate with documented biases or known failure modes.
  • Bias and fairness audits: instrument tasks with demographic/institutional covariates (when ethically feasible), measure disparate performance, and test mitigation strategies (reweighting, fairness-aware objectives).
  • SOTA forecasting granularity: move beyond coarse buckets to fine-grained performance deltas, uncertainty intervals, and distributional forecasts; account for benchmark evolution (dataset changes, metric redefinitions).
  • Open-world realism: complement sandbox evaluation with controlled retrieval-enabled settings to estimate the gap between constrained and realistic workflows and identify where up-to-date information is critical.
  • Evidence attribution requirements: require agents to cite specific snapshot files/rows used for decisions, and audit attributions to ensure answers are grounded in the offline evidence rather than parametric priors.
  • Distribution shift diagnostics: compare pre- vs. post-cutoff splits for composition and difficulty shifts; report stratified performance (topic, venue, length) and control for confounders.
  • Token-to-performance tradeoffs: report standardized token accounting per configuration and instance, and study diminishing returns at higher budgets across models and task families.
  • Data licensing and reproducibility: assess licensing, stability, and coverage differences between Google Scholar and open bibliographic sources; quantify label drift and replicate runs across sources.
  • Negative/corrective signals: incorporate “negative” impact indicators (retractions, errata) and topic corrections to test whether models avoid forecasting impact for problematic or debunked ideas.
  • Manipulation resilience: explore whether public knowledge of benchmark instances enables gaming (e.g., citation padding, leaderboard overfitting) and design defenses (hidden splits, randomized targets).
  • Error bars and significance: provide bootstrap confidence intervals, multiple-comparison corrections, and significance tests for model/ranking differences; publish per-task sample sizes and variance estimates.
  • Cross-lingual evaluation: add non-English papers and multilingual faculty to test whether agents generalize across languages and translation noise.
  • Ethical safeguards: study potential feedback loops where model forecasts influence researcher behavior, and propose guardrails (disclosure, non-deployment policies, impact reviews) for responsible use.
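
As a concrete companion to the probabilistic-forecasting item above, here is a minimal sketch of the standard Brier score and expected calibration error for bucketed forecasts; these are textbook formulas, not code from the paper:

```python
import numpy as np

def brier_score(probs: np.ndarray, labels: np.ndarray) -> float:
    """Mean squared error between predicted class probabilities and one-hot outcomes.

    probs: (n, k) predicted distribution over k buckets; labels: (n,) true bucket indices.
    """
    onehot = np.eye(probs.shape[1])[labels]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))

def expected_calibration_error(probs: np.ndarray, labels: np.ndarray, bins: int = 10) -> float:
    """Gap between confidence and accuracy, averaged over equal-width confidence bins."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, bins + 1)
    ece, n = 0.0, len(labels)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.sum() / n * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)
```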

Practical Applications

Immediate Applications

The following applications can be piloted or deployed now using the PoT benchmark, code, and design patterns (time-partitioned targets, offline sandbox, agent-vs-zero-shot comparisons, message-budget scaling).

  • Model procurement and benchmarking for AI research assistants (Industry, Software)
    • Use case: Evaluate vendor models/agents marketed for literature review, paper triage, and “AI for Science” assistance using PoT’s frozen-evidence, post-cutoff scoring to avoid contamination and judge subjectivity.
    • Tools/products/workflows: Internal “PoT runner” service; offline Docker sandbox with read-only corpora; dashboards showing accuracy vs. token cost vs. message budget; contract acceptance criteria tied to PoT scores on tasks like citation prediction and SOTA trajectory buckets.
    • Dependencies/assumptions: Access to domain-relevant pre-cutoff corpora; agreement on target horizons; acceptance that citation/award signals are imperfect proxies.
  • Evidence-grounded agent design and cost-performance tuning (Industry, Software, Academia)
    • Use case: Optimize agent loops, toolkits, and message limits by measuring accuracy gains vs. token overhead on PoT’s tasks; deploy auto-tuners that pick low/medium/high budgets depending on task family (e.g., higher budgets for evidence-heavy Faculty/Citations tasks).
    • Tools/products/workflows: “Budget tuner” module; trace logging and failure taxonomies (retrieval errors, looping) to guide prompt/tool revisions; structured prompts when helpful.
    • Dependencies/assumptions: Comparable sandbox/tooling across models; cost constraints for high-budget runs.
  • Contamination-aware evaluation for live ML systems (Industry, Research labs)
    • Use case: Add PoT-style time splits to internal eval suites to reduce benchmark data contamination and retroactive memorization; quantify post-cutoff performance shifts that affect model ranking.
    • Tools/products/workflows: Quarterly “post-cutoff refresh” jobs; auto-scoring pipelines linked to new leaderboards and citation updates; regression alerts when rankings shift after cutoff.
    • Dependencies/assumptions: Stable data pipelines (e.g., Google Scholar, leaderboards); governance for periodic dataset refreshes.
  • Metascience analyses of human–model alignment (Academia, Publishing)
    • Use case: Compare model forecasts vs. peer-review award outcomes to study alignment, bias, and variance; identify venues or tasks where agentic exploration adds value.
    • Tools/products/workflows: Misalignment reports by venue/year; reviewer calibration workshops using PoT instances; reproducible notebooks operating entirely offline.
    • Dependencies/assumptions: Ethical use of award and review signals; awareness that awards reflect community processes and may be biased.
  • Conference/journal tooling for submission triage and audit (Academia, Publishing)
    • Use case: Pilot PoT-style “forecasting tracks” where models/agents predict post-cutoff indicators for accepted papers; use aggregate trends as a diagnostic (not a decision rule) to audit review consistency across areas.
    • Tools/products/workflows: Offline submission bundles per track; dual-mandate dashboards (forecast vs. realized signals); no-influence safeguards so forecasts can’t affect acceptance decisions.
    • Dependencies/assumptions: Clear policy boundaries to avoid automating merit decisions; legal/ethical approval for post-hoc analysis.
  • R&D portfolio horizon scanning (Industry, Finance)
    • Use case: Use SOTA trajectory buckets and citation forecasts to inform portfolio bets (e.g., which benchmarks/directions are accelerating) with explicit uncertainty bands.
    • Tools/products/workflows: Quarterly “frontier watch” reports; offline, field-specific snapshots (e.g., vision, robotics); cross-task ensemble forecasts.
    • Dependencies/assumptions: Proxies can over/underestimate real-world adoption; domain shift when moving beyond NLP/AI benchmarks.
  • Domain adaptation to biomedicine for literature triage (Healthcare)
    • Use case: Build a PoT-like offline sandbox on PubMed (pre-cutoff metadata, abstracts) to triage which new papers to prioritize for journal clubs, guideline committees, or systematic reviews.
    • Tools/products/workflows: Hospital firewall-deployed PoT; agents that summarize evidence linking past cohorts to predicted influence or topic drift; token budget set lower for routine triage.
    • Dependencies/assumptions: IRB and privacy review; biomedical-specific proxies (e.g., clinical guideline mentions) may be preferable to raw citations.
  • Internal knowledge-base impact forecasting (Industry)
    • Use case: Predict which internal tech notes/whitepapers are likely to drive downstream adoption; plan documentation and enablement accordingly.
    • Tools/products/workflows: Offline corpora snapshots; quarterly impact buckets; feedback loop with enablement teams.
    • Dependencies/assumptions: Need internal engagement metrics as better proxies than public citations; author disambiguation.
  • Curriculum and training for “AI for Science” (Academia, Education)
    • Use case: Classroom labs where students build and evaluate agents under PoT constraints to learn evidence-grounded forecasting, agentic tool-use, and scientific metrology.
    • Tools/products/workflows: Course Docker images; assignments comparing zero-shot vs. agentic vs. structured prompts; trace-based failure analysis.
    • Dependencies/assumptions: Compute quotas for message-limited agents; up-to-date pre-cutoff packages per term.
  • MLOps and compliance checklists for offline evaluation (Industry, Policy)
    • Use case: Standardize “no-net” offline evaluation as a compliance step for models making future-facing claims; generate auditable traces to support procurement and regulatory reviews.
    • Tools/products/workflows: Signed evaluation manifests (cutoff dates, sources, tools); artifact locking; third-party reproducibility scripts.
    • Dependencies/assumptions: Agreement on acceptable proxies; versioned evidence snapshots; legal readiness for audits.

Long-Term Applications

These applications require further research, cross-domain scaling, governance, or infrastructure maturation before high-stakes deployment.

  • Decision support in grantmaking and peer review (Policy, Academia)
    • Use case: Use PoT-style forecasts as a calibrated, audited secondary signal to help panels surface overlooked proposals, stress-test hype, and track long-term review calibration.
    • Tools/products/workflows: “Forecast overlay” in reviewer dashboards; periodic post-cutoff audit panels; bias diagnostics by field/demographics.
    • Dependencies/assumptions: Strong governance to avoid automating acceptance decisions; validated domain-specific proxies beyond citations/awards; community acceptance.
  • RL/learning-to-search from delayed, verifiable signals (Software, ML research)
    • Use case: Train agents using delayed outcomes (e.g., next-year citation buckets, benchmark progress) as reward to improve long-horizon evidence gathering and commitment strategies.
    • Tools/products/workflows: Offline policy optimization with frozen snapshots; counterfactual evaluation; off-policy safety constraints.
    • Dependencies/assumptions: Sparse, delayed rewards; proxy misspecification risk; need for stable, refreshable datasets.
  • Cross-domain expansion to patents, standards, and regulation (Industry, Policy, IP law)
    • Use case: Forecast patent citations/claims, standards adoption, or regulatory inflection points using PoT’s time-partitioned design.
    • Tools/products/workflows: Patent office snapshots; standards body minutes; policy tracker sandboxes; “tech readiness” bucket forecasts.
    • Dependencies/assumptions: Data licensing; robust entity resolution; different lag structures than academic citations.
  • Clinical and translational science forecasting (Healthcare)
    • Use case: Predict which preclinical findings will translate to clinical trials or guideline updates; prioritize replication and funding.
    • Tools/products/workflows: Multimodal pre-cutoff snapshots (preprints, trial registries); outcome proxies (trial initiation, Phase progression); conservative, human-in-the-loop workflows.
    • Dependencies/assumptions: Very high cost of error; medical ethics; better proxies than citations; long horizons.
  • National technology foresight dashboards (Policy, National labs)
    • Use case: Government horizon scanning across strategic areas (AI, energy, materials), combining SOTA trajectory forecasts with trend acceleration alerts.
    • Tools/products/workflows: Federated PoT instances per domain; uncertainty-aware visual analytics; periodic public reports with audit trails.
    • Dependencies/assumptions: Data-sharing agreements; standard taxonomies across agencies; governance to avoid policy overreliance.
  • Editorial and venue process redesign (Publishing)
    • Use case: Post-accept forecasting challenges with delayed scoring to inform awards, tracks, and reviewer calibration; detect systematic mismatches between novelty and eventual impact.
    • Tools/products/workflows: Venue-integrated PoT pipelines; anonymization protocols; delayed feedback loops to program chairs.
    • Dependencies/assumptions: Community buy-in; careful framing to avoid “gaming” future metrics; fairness safeguards.
  • Enterprise innovation underwriting and risk pricing (Finance, Insurance)
    • Use case: Use future-verifiable proxies to underwrite R&D risk or price innovation-linked instruments (e.g., milestone financing keyed to benchmark progress).
    • Tools/products/workflows: PoT-based scorecards; instrument triggers tied to post-cutoff signals; independent verification services.
    • Dependencies/assumptions: Legal structuring; avoidance of self-fulfilling loops; robust anti-manipulation controls.
  • Research career planning and advising tools (Academia, Education)
    • Use case: Advisors and scholars use PoT-style topic-drift and agenda continuity forecasts to plan collaborations, grants, and reading priorities.
    • Tools/products/workflows: Privacy-preserving faculty sandboxes; interactive “trajectory what-ifs”; integration with ORCID/GS profiles.
    • Dependencies/assumptions: Consent and privacy; risk of narrowing exploration; improved field taxonomies.
  • Safety and governance audits for agentic systems (Policy, Standards bodies)
    • Use case: Create certification schemes where future-facing claims by agents must be backed by PoT-like, time-indexed, offline evaluations with verifiable post-cutoff scoring.
    • Tools/products/workflows: Compliance test suites; third-party attestation; standardized reporting (cutoff dates, tools, budgets).
    • Dependencies/assumptions: Standards harmonization; reproducibility infrastructure; enforcement mechanisms.
  • Marketplaces and leaderboards for “evidence-grounded” agents (Industry, Platforms)
    • Use case: Platform ratings for agents that must demonstrate accuracy-cost trade-offs on PoT tasks under offline constraints, with continuous post-cutoff updates.
    • Tools/products/workflows: Public leaderboards with delayed ground truth; token-efficiency badges; model cards with post-cutoff deltas.
    • Dependencies/assumptions: Sustainable refresh cadence; anti-contamination policies; clear licensure of evaluation datasets.

Notes on feasibility across applications:

  • Proxy risk: Citations, awards, and benchmarks are imperfect proxies for “quality”; sectors may need domain-specific alternatives (e.g., clinical guideline adoption, patent grants).
  • Data and horizons: Quality of results depends on stable pre-/post-cutoff data pipelines and appropriate horizons ($t_1 - t_0$) for each domain.
  • Offline realism: Sandbox constraints improve auditability but may understate performance when live retrieval is allowed; dual-mode eval (offline + controlled retrieval) may be needed.
  • Cost-performance trade-offs: Agentic gains depend on message budgets; operational costs and latency must be managed in production.
  • Ethics and governance: Avoid using model forecasts as sole decision criteria in high-stakes settings; ensure transparency, fairness audits, and human oversight.

Glossary

  • Ablation: Systematic removal or variation of components to test their impact on performance. "controlled ablations over tool access, offline prompting, and message-budget scaling."
  • Agent-native evaluation: An evaluation setup tailored to agent systems that measures tool use and interaction under controlled constraints. "Agent-native evaluation: we introduce an offline sandbox protocol that makes tool use measurable and supports controlled ablations over tool access, structured offline prompting, and message-budget (test-time) scaling"
  • Agent overhead: The extra cost and complexity introduced by running agentic processes at inference time. "where agent overhead yields diminishing returns."
  • Agentic: Pertaining to autonomous, tool-using model behavior and reasoning via multi-step interactions. "higher interaction budgets generally improve agentic performance"
  • Award tier: A categorical level of conference recognition (e.g., Findings, Main, Outstanding, Best). "the task is to predict the paper's award tier."
  • Benchmark Data Contamination (BDC): Leakage of benchmark content into training data that inflates evaluation scores. "Web-scale pretraining makes static benchmarks increasingly vulnerable to benchmark data contamination (BDC)"
  • Benchmark trajectories: The evolution of benchmark performance or state-of-the-art over time. "technological frontier forecasting (SOTA benchmark trajectories)"
  • Bibliometric features: Metadata-derived variables (e.g., citation counts, author networks) used to study scientific impact. "using bibliometric features, topic signals, and network structure"
  • Budget exhaustion: Running out of the allotted interaction or message budget before completing a task. "Budget exhaustion"
  • Contamination-resistant: Designed to prevent or reduce training-data leakage into evaluation sets. "contamination-resistant variants of classic test sets"
  • Downstream signals: Later-observed outcomes used as proxies for impact or quality, such as citations or awards. "links scientific idea judgments to downstream signals that become observable later"
  • Efficiency frontiers: Curves characterizing the trade-off between performance gains and resource costs. "efficiency frontiers that show accuracy gain over zeroshot against token overhead"
  • Exact-match accuracy: A strict scoring metric that counts a prediction as correct only if it exactly matches the gold label. "We report exact-match accuracy for all tasks."
  • Evidence snapshot: A frozen, pre-cutoff set of artifacts and metadata available to the solver. "where $\mathcal{E}_{\le t_0}$ is the evidence snapshot available up to cutoff $t_0$"
  • Exposure effects: Differences in outcomes due to visibility or timing rather than intrinsic quality. "so comparisons are not driven by natural exposure effects such as earlier publication or venue-wide visibility."
  • Frontier models: The latest, most capable models available at evaluation time. "covering the latest frontier models from Anthropic, Google, and OpenAI."
  • Gold labels: Ground-truth outcomes used for scoring that can be verified externally. "gold labels are not subjective annotations but outcomes that can be checked later."
  • Headroom: Remaining potential for improvement on a task given current performance levels. "limited headroom for tool use on this coarse bucketed variant."
  • Interaction budget: The allowed number of agent interaction steps during inference. "higher interaction budgets generally improve agentic performance"
  • Leaderboard: A ranked listing of model performance on a benchmark. "popular benchmarks and leaderboards as of October 2025"
  • LLM: A model trained on vast text corpora to perform language understanding and generation. "LLMs are increasingly being used to assess and forecast research ideas"
  • LLM-as-judge: Using an LLM to evaluate or classify agent traces or outputs. "using an LLM-as-judge protocol (Gemini 3 pro)"
  • Looping/thrashing: Repetitive agent behavior that fails to make progress and wastes budget. "Looping / thrashing"
  • MCQ (Multiple-Choice Question): A discrete-answer format where models select from predefined options. "In MCQ instances, the solver selects which candidate will have the most citations by the target horizon."
  • Message-budget scaling: Varying the number of allowed messages to study performance vs. compute. "message-budget (test-time) scaling"
  • Message limit: The cap on environment interaction turns before a final answer must be given. "we vary a message limit: the maximum number of environment interaction turns before the agent must finalize an answer."
  • Metascience: The study of how science is conducted and evolves. "This progress has renewed interest in metascience"
  • Misalignment analysis: Examining divergences between model assessments and human judgments. "supports misalignment analysis: because some outcomes reflect human judgment (e.g., peer-review awards)"
  • Network-isolated sandbox: A restricted environment without internet access used to control evidence exposure. "inside a network-isolated sandbox"
  • Non-convergence: Failure of an agent run to reach a final, correct answer within constraints. "failure modes such as non-convergence"
  • Offline constraints: Limitations arising from operating without live web access and using only pre-cutoff data. "under fixed offline constraints"
  • Offline sandbox: A local, network-disabled environment containing the frozen evidence for tasks. "placed in an offline sandbox"
  • Parametric-only forecasting: Making predictions using only the model’s internal parameters, without external tools. "This tests parametric-only forecasting ability."
  • PoT (Proof of Time): A time-partitioned, semi-verifiable benchmarking framework linking judgments to future signals. "We introduce PoT, a semi-verifiable benchmarking framework"
  • Post-cutoff evaluation: Scoring predictions against outcomes that occur after a designated time cutoff. "Post-cutoff evaluation can materially change conclusions about model performance"
  • Pre-cutoff: The time window before the benchmark’s cutoff when evidence is frozen. "pre-cutoff snapshot of evidence"
  • Prospectively verifiable: Targets that can be checked once future outcomes become available. "(ii) defining targets that are prospectively verifiable as time passes."
  • ReAct: An agent paradigm that interleaves reasoning and acting steps. "All agentic runs use a single-agent ReAct loop"
  • Scientometrics: Quantitative analysis of scientific literature and impact. "Scientometrics has long studied how papers accrue impact over time"
  • Semi-verifiable: Using verifiable signals as proxies for constructs that are not directly observable. "We call PoT semi-verifiable because the benchmark uses verifiable downstream outcomes as imperfect proxies for idea quality."
  • SOTA (State of the Art): The best-known performance on a benchmark at a given time. "The SOTA task measures whether solvers can reason about frontier model performance and the pace of benchmark progress."
  • State tracking: Maintaining and updating task-relevant information across multi-step interactions. "success depends on multi-step interaction, state tracking, and adherence to constraints."
  • Structured prompt: A prompt that prescribes explicit steps or policies for agent behavior. "the structured agent prompt has family-dependent effects."
  • Taxonomy: A structured categorization of fields or topics used for classification. "choose from a fixed taxonomy"
  • Test-time compute: The computational resources or interactions used during inference. "corresponding to low, medium, and high test-time compute."
  • Time-indexed: Tied to specific points in time, affecting what can be known or predicted. "Many consequential scientific questions are intrinsically time-indexed"
  • Time-partitioned: Split by time to separate training-era evidence from future outcomes. "we formalize PoT as a time-partitioned benchmark design"
  • Tool-using agents: Agents that invoke external tools (e.g., bash, python) to process evidence and reason. "comparing tool-using agents to non-agent baselines"
  • Zero-shot: Producing answers without task-specific examples, tools, or additional context. "Zero-shot (direct generation)."

