AI for Auto-Research: Roadmap & User Guide
Abstract: AI-assisted research is crossing a threshold: fully automated systems can now generate research papers for as little as $15, while long-horizon agents can execute experiments, draft manuscripts, and simulate critique with minimal human input. Yet this productivity frontier exposes a deeper integrity problem: under scientific pressure, even frontier LLMs still fabricate results, miss hidden errors, and fail to judge novelty reliably. Studying developments through April 2026, we present an end-to-end analysis of AI across the complete research lifecycle, organized into four epistemological phases: Creation (idea generation, literature review, coding & experiments, tables & figures), Writing (paper writing), Validation (peer review, rebuttal & revision), and Dissemination (posters, slides, videos, social media, project pages, and interactive agents). We identify a sharp, stage-dependent boundary between reliable assistance and unreliable autonomy: AI excels at structured, retrieval-grounded, and tool-mediated tasks, but remains fragile for genuinely novel ideas, research-level experiments, and scientific judgment. Generated ideas often degrade after implementation, research code lags far behind pattern-matching benchmarks, and end-to-end autonomous systems have not yet consistently reached major-venue acceptance standards. We further show that greater automation can obscure rather than eliminate failure modes, making human-governed collaboration the most credible deployment paradigm. Finally, we provide a structured taxonomy, benchmark suite, and tool inventory, cross-stage design principles, and a practitioner-oriented playbook, with resources maintained at our project page.
First 10 authors:
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Explain it Like I'm 14
What is this paper about?
Think of doing a school science fair project: you come up with an idea, read what others have done, run experiments, write a report, get feedback, and then make a poster or slides. This paper looks at how AI can help with each step of that “research journey,” not just one part. It explains where AI is already helpful, where it still struggles, and how people and AI can work together responsibly.
The authors organize the research journey into four phases:
- Creation: ideas, reading papers, writing code and running experiments, making tables and figures
- Writing: drafting the paper
- Validation: peer review and responding to comments (rebuttals and revisions)
- Dissemination: turning the paper into posters, slides, videos, websites, or social posts
What questions does the paper try to answer?
The paper focuses on a few simple but important questions:
- Where does AI genuinely help researchers, and where does it break down?
- Can AI handle the whole research process on its own, or does it need humans to stay in charge?
- What kinds of AI techniques are used across different stages of research?
- How should we evaluate and govern AI in research so that results are trustworthy?
How did the authors study this?
Instead of running one new experiment, the authors did a careful “map and measure” of the field from 2023 to early 2026:
- They built a roadmap of the full research lifecycle (idea → paper → review → presentation) and grouped tools by the phase and stage they target.
- They reviewed many systems and benchmarks that test AI at each step (for example, tools that help write code from papers, tools that check citations, tools that generate figures, tools that draft reviews, etc.).
- They summarized five common AI approaches in everyday terms:
- Prompt engineering: telling an AI exactly how to respond with clear instructions and examples.
- Retrieval-augmented generation (RAG): letting the AI “look things up” in papers, code, or databases while it answers, so it’s grounded in real sources.
- Agentic methods: AI that can plan, break tasks into steps, use tools (like code runners or search), remember, and iterate—like a self-organizing assistant.
- Training-based methods: teaching a model to be a specialist (e.g., better at peer reviews or scientific writing) by training it on lots of examples.
- Hybrid systems: mixing the above—e.g., an AI that plans steps, looks up sources, and uses specialist mini-models.
- They traced how the field moved from single-task helpers (just writing or just coding) to multi-step “research agents” that try to run a whole workflow.
Throughout, they explain technical terms with practical meanings (for example, “verification” = checking evidence is correct; “provenance” = where information came from; “long-horizon” = many-step tasks that take a while).
What did they find?
Here are the main takeaways, explained simply:
- There’s a sharp boundary between “safe help” and “risky autonomy.”
- AI is strongest when tasks are structured and checkable, like finding papers, cleaning up writing, formatting references, making basic plots, or turning text into slides.
- AI is much weaker when tasks are open-ended and require deep judgment, like inventing truly new ideas, running research-level experiments well, or deciding if a result is genuinely novel and important.
- Making stuff is easier than checking stuff—for AI.
- AI can quickly generate ideas, code, figures, and whole papers that look polished.
- But proving the ideas are new, the code implements the right thing, the results are correct, and the claims are supported is much harder. In short: generation outruns verification.
- The most reliable setup is human-governed collaboration, not full automation.
- AI can remove a lot of “friction” (searching, drafting, organizing, plotting) and can even help plan or run experiments.
- Humans still need to stay in charge of the core scientific parts: judging novelty, designing solid experiments, interpreting results, and taking responsibility for what is claimed.
- Good systems are layered, not just big.
- The best results come from combining planning, tool use, retrieval, and checks (for example, “look it up,” “run the code,” “plot the data,” “double-check the claim”).
- How you orchestrate steps and keep track of evidence matters as much as the size of the AI model.
- It’s a governance problem more than a detection problem.
- As AI use becomes normal, the big questions are: Did you disclose how AI was used? Can you show where information came from? Are claims accountable and reproducible? Who is responsible for mistakes?
They also share stage-by-stage insights:
- Idea generation: AI can suggest many creative-sounding ideas, but many look weaker once implemented (the “ideation–execution gap”).
- Literature review: AI is improving at finding and summarizing papers when it can “look things up,” but it can still miss key work or misrepresent sources.
- Coding and experiments: AI’s ability drops on truly new or research-level code; it may write code that runs but doesn’t implement the right algorithm.
- Tables and figures: Tools exist, but this area is less mature compared to others.
- Writing: AI is good at grammar, structure, and even drafting sections, but it can “smooth over” unsupported claims if not checked.
- Peer review and rebuttal: AI-generated reviews can sound reasonable but can be too gentle or inconsistent; rebuttals might promise fixes that aren’t delivered later.
- Dissemination: Turning papers into posters, slides, and videos is handy, but oversimplification or losing the nuance of evidence is a risk.
Finally, they contribute a taxonomy (organized map), a tool inventory, and a suite of benchmarks to help the community test and compare systems fairly.
Why do these results matter?
- For students and researchers: AI can be a powerful research assistant—like a fast, tireless teammate that helps you search, draft, code, and visualize. But it shouldn’t be the scientist in charge. You still need to think critically, check evidence, and make judgments.
- For the research community: Clear rules and norms (disclosure, attribution, and responsibility) are needed so that AI use strengthens science instead of weakening trust.
- For builders of AI tools: Focus on verification, provenance (showing your sources), reproducibility, and good “workflow design,” not just more generation. Layered systems that plan, look up, run, and check are the way forward.
In short
AI can already do a surprising amount across the whole research journey, sometimes even producing full papers cheaply and quickly. But there’s a big difference between creating research-like documents and doing real, reliable science. The safest and most productive path today is human-led teamwork with AI: let AI handle the mechanical and well-checked parts, and let humans handle the judgment, design, and accountability. This roadmap and its benchmarks are meant to help everyone—students, scientists, and toolmakers—use AI to make research faster and better, without losing what makes science trustworthy.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a focused list of concrete gaps the paper leaves unresolved; each item highlights a specific opportunity for future research or evaluation design.
- Quantifying the “stage-dependent reliability boundary”: no cross-phase, standardized metrics demonstrating where assistance remains reliable and where autonomy fails across Creation → Writing → Validation → Dissemination.
- Measuring the ideation–execution gap: lack of datasets linking proposed ideas to their implemented code, experiments, and outcomes to quantify degradation from “novel on paper” to “impactful in practice.”
- Robust novelty assessment: absence of time-split, leakage-safe benchmarks and methods that distinguish shallow recombinations from substantive, field-advancing ideas.
- Literature review coverage and fidelity: no gold-standard, multi-paper synthesis sets with citation-level provenance and metrics for completeness, contradiction detection, and version consistency (including preprint vs camera-ready drift).
- Citation faithfulness and paraphrase integrity: missing evaluation protocols that test whether summaries and attributions remain faithful at the sentence/claim level.
- Retrieval bias and corpus dynamics: no methods/benchmarks to quantify and mitigate venue, geography, language, or time biases in RAG pipelines as corpora evolve.
- Paper-to-code semantic alignment: limited techniques to verify that generated code implements the intended algorithm/method, beyond “it runs”; need spec-to-code semantic equivalence and test generation tied to claims.
- Reproducibility-by-default tooling: lack of standardized agent outputs that capture environment (containers, seeds, hardware, data versions) and enable third-party reruns without manual repair.
- Resource-aware experiment orchestration: missing frameworks for uncertainty-aware scheduling, compute budgeting, early stopping, and principled exploration–exploitation in long-horizon experiments.
- Tables/figures faithfulness: no benchmarks mapping raw data and method specs to canonical visualizations with automated checks for fabrication, cherry-picking, and misleading design.
- Writing-stage evidence grounding: need section-level audits that link every claim to verifiable evidence (citations, code, or runs), with measurable “argument coverage vs evidence coverage.”
- Peer review agent calibration: insufficient blinded studies comparing AI reviews with expert committees; no adversarial tests for susceptibility to persuasion, self-citation, or author manipulation.
- Rebuttal commitment tracking: lack of automated pipelines that diff rebuttal promises against camera-ready revisions and verify execution of promised experiments/analyses.
- Dissemination fidelity: missing quantitative measures of simplification error and misrepresentation across posters/slides/videos/social posts, plus guardrails for acceptable abstraction.
- End-to-end provenance graphs: no machine-checkable schemas that link manuscript claims to sources, code, data, run logs, and figures across all phases, embedded in published artifacts.
- Governance frameworks beyond rhetoric: absence of concrete, testable disclosure standards, attribution protocols, and accountability assignments piloted with real venues and audits.
- Cross-domain generalization: heavy CS/ML bias; limited evidence for wet-lab, clinical, and social sciences where safety, ethics, and experimental constraints differ substantially.
- Human–AI collaboration design: lack of controlled studies on oversight levels, workflow insertion points, and their effects on researcher learning, creativity, judgment, and deskilling.
- Cost/energy accounting: no standardized reporting of token/compute/carbon per auto-research run; no efficiency benchmarks or scaling laws for reliability vs spend.
- Robustness and security: limited defenses against mass paper spam, review gaming, data poisoning in retrieval corpora, and tool-chain compromise in agent workflows.
- Bias and inclusivity: inadequate multilingual and non-Western literature coverage; no metrics/interventions for equitable retrieval, synthesis, and credit across regions and venues.
- Temporal robustness: need evaluation protocols that respect publication time (avoiding hindsight contamination) and methods for continuous updates without leaking future knowledge.
- Interoperability and APIs: no widely adopted open schemas/APIs for experiment logs, provenance, tool interfaces, and dataset packaging to enable plug-and-play, reproducible pipelines.
- Legal/IP clarity: unresolved ownership/licensing of AI-generated text/code/figures, compliance with upstream licenses, and standards for citation of AI-assisted contributions.
- Formalizing “scientific judgment”: lack of operational definitions and benchmarks that require hypothesis formation, methodological choice, ablation design, and result interpretation under uncertainty.
- Acceptance-standard trials: no controlled, blinded A/B studies comparing AI-assisted vs human-only submissions with pre-registered acceptance criteria at major venues.
Practical Applications
Below is an overview of practical, real‑world applications implied by the paper’s findings, methods, and roadmap. Each item names likely sectors, sketches tools/products/workflows that could emerge, and notes key assumptions/dependencies that affect feasibility.
Immediate Applications
- AI-assisted literature review and evidence synthesis
- Sectors: academia, pharma/biotech R&D, enterprise R&D, finance research, government research units
- Tools/workflows: retrieval-augmented research copilots (e.g., PaperQA2-, STORM-, AutoSurvey-like), citation-graph traversal, “deep research” agents that iteratively gather and summarize evidence with source links
- Assumptions/dependencies: access to up-to-date corpora (incl. paywalled content), robust RAG pipelines, citation fidelity checks, human oversight for scope/coverage and interpretation
- Claim verification and citation provenance checking in manuscripts and reports
- Sectors: academic publishing, journals/conferences, corporate technical communications, standards bodies
- Tools/workflows: claim–evidence matchers and checkers (e.g., ClaimCheck-like) integrated into writing suites; automatic “evidence cards” with doc/figure/code provenance
- Assumptions/dependencies: trustworthy retrieval indices, versioned sources, standardized claim–evidence formats; acceptance of AI tooling in editorial pipelines
- Paper drafting and editing with grounded assistance
- Sectors: academia, industrial labs, think tanks, policy institutes
- Tools/workflows: section-level drafting (Introduction/Related Work/Methods) with embedded citations (CycleResearcher/ScholarCopilot/XtraGPT-like), structured templates, grammar/style polishing
- Assumptions/dependencies: disclosure policies; human-authored argumentation and interpretation; venue-specific style/citation constraints
- Paper-to-code scaffolding and experiment orchestration for ML/data science
- Sectors: software/ML teams, applied research groups, analytics/BI teams
- Tools/workflows: paper-to-code translators and coding agents (PaperCoder-, AIDE-, R&D-Agent-like) plus MLOps integration (MLflow/Weights & Biases) for automating baselines, ablations, and sweeps
- Assumptions/dependencies: clear task specs and test harnesses; limited novelty in target code; compute budget; guardrails to catch “code runs but wrong algorithm” failure modes
- Autonomous experiment runners and research MLOps
- Sectors: ML product teams, A/B testing teams, research labs
- Tools/workflows: agentic pipelines (CURIE/MLGym-like) for planning experiments, launching jobs, tracking metrics, and auto-generating result summaries and plots
- Assumptions/dependencies: secure infra access (GPUs, clusters), cost controls, robust experiment tracking, human-in-the-loop promotion criteria
- Automated tables, figures, and scientific visualization
- Sectors: academia, technical marketing, internal analytics
- Tools/workflows: Matplotlib/LaTeX diagram synthesis (MatPlotAgent/AutoFigure/DeTikZify-like), benchmark tables with auto-citation, method schematics with editable vector output
- Assumptions/dependencies: data cleanliness, style guides, manual verification for faithfulness and readability
- Peer review support and editorial triage
- Sectors: journals, conferences, preprint servers
- Tools/workflows: structured-review drafting, reviewer–paper matching (MARG-like), review quality screening and meta-review support (DeepReviewer-like), AI-review use disclosure and detection
- Assumptions/dependencies: strict human oversight; bias and conflict-of-interest controls; venue policy alignment
- Rebuttal triage and revision planning
- Sectors: academia, industrial research labs
- Tools/workflows: rebuttal planners (RebuttalAgent/Paper2Rebuttal-like) that parse reviewer comments, link each point to required evidence, generate checklists and edit plans, and track fulfilled commitments
- Assumptions/dependencies: accurate comment classification and evidence mapping; audit trails to prevent unfulfilled commitments
- Paper2X dissemination pipelines (slides/posters/videos/webpages/social posts)
- Sectors: academia, corporate R&D communications, education/outreach
- Tools/workflows: automated generation of slides/posters (PPTAgent/SlideGen/Paper2Poster-like), faithful video summaries (Paper2Video-like), project pages/social threads with figures and captions
- Assumptions/dependencies: fidelity constraints and disclaimers; institutional branding/templates; human review for over-claim prevention
- Competitive and technology landscape intelligence
- Sectors: enterprise strategy, venture/market research, policy think tanks
- Tools/workflows: trend-detection agents (e.g., Nova-like) for emerging topics, citation-graph heatmaps, competitor-benchmark tables with linked evidence
- Assumptions/dependencies: comprehensive corpora; recency and de-duplication controls; domain expert validation
- Research governance and provenance logging
- Sectors: universities, journals, funders, corporate R&D
- Tools/workflows: AI-use disclosures embedded in manuscripts; automated provenance logs for literature, code, data, and figures; checklists for integrity across phases
- Assumptions/dependencies: shared disclosure standards; integration with editorial and grant submission systems; acceptance by committees and IRBs
- Education and training for research skills
- Sectors: higher education, professional development
- Tools/workflows: interactive paper agents; guided literature reviews; peer-review practice sets; visualization labs; writing tutors grounded in sources
- Assumptions/dependencies: carefully curated corpora; guardrails against shortcut learning; assessment designs that incentivize critical thinking
Long-Term Applications
- End-to-end autonomous research agents that produce publishable, novel contributions
- Sectors: academia, industrial research, national labs
- Tools/workflows: multi-stage agents that ideate, implement, verify, write, and respond to critique with strong novelty and judgment; layered architectures with built-in verification
- Assumptions/dependencies: reliable scientific judgment, robust novelty assessment, reproducibility guarantees, governance frameworks for attribution and accountability
- Reliable AI peer reviewers and meta-reviewers with scientific judgment parity
- Sectors: academic publishing, standards bodies
- Tools/workflows: trained evaluators that ground critiques in evidence, detect over-claiming and hidden errors, and resist manipulation; reviewer assignment at scale
- Assumptions/dependencies: high-quality review/rebuttal datasets, bias mitigation, COI management, transparent policies for AI involvement
- Autonomous lab experimentation in the physical world
- Sectors: chemistry/materials, biology/biotech, robotics/automation
- Tools/workflows: agents integrated with lab robots and ELNs to design, execute, and analyze experiments; closed-loop hypothesis testing and iteration
- Assumptions/dependencies: safe hardware integration; regulatory and biosafety compliance; robust causal reasoning; high-fidelity simulation-to-reality transfer
- Formal verification and reproducibility-by-default for scientific claims
- Sectors: academia, industry R&D, publishers, funders
- Tools/workflows: claim-checking against code/data logs; automated replication pipelines; formal proofs where applicable; “executable papers” as standard
- Assumptions/dependencies: community standards for artifacts and provenance, compute resources for replication, incentives and credit for verification
- Cross-domain, generalist “ResearchOps” platforms
- Sectors: enterprise R&D across software, energy, advanced manufacturing
- Tools/workflows: orchestration suites combining RAG, agents, tool use, and verification across disciplines; modular plugins for domain tools and datasets
- Assumptions/dependencies: domain adapters and ontologies; secure data integration; scalable monitoring and auditability
- Living, interactive research objects (“paper agents”) as primary knowledge artifacts
- Sectors: academia, education, science communication
- Tools/workflows: papers that answer questions, run code snippets, regenerate figures from raw data, and reflect errata/updates automatically
- Assumptions/dependencies: standardized packaging of text/code/data; hosting and sandboxing; versioned DOIs and archival practices
- Healthcare-grade evidence synthesis and protocol design
- Sectors: healthcare, public health, regulators
- Tools/workflows: agents for systematic reviews, guideline drafting, and RCT protocol generation with rigorous audit trails and bias checks
- Assumptions/dependencies: regulatory approval (e.g., for clinical decision support), gold-standard datasets, explicit uncertainty calibration, continuous expert oversight
- Materials and energy discovery pipelines
- Sectors: energy storage, catalysts, semiconductors, clean tech
- Tools/workflows: closed-loop design using simulation + lab robots; cross-modal retrieval from patents/papers; multi-objective optimization with safety constraints
- Assumptions/dependencies: expensive compute and lab throughput; IP constraints; validated surrogate models; interdisciplinary teams
- Finance and policy research automation with auditability
- Sectors: finance research, regulatory agencies, policy institutes
- Tools/workflows: automated literature + data analysis for policy briefs or research notes with traceable sources; scenario generation and sensitivity analyses
- Assumptions/dependencies: strict provenance/audit trails; model risk management; legal and compliance alignment
- Education at scale via research-grade AI tutors and studio courses
- Sectors: higher education, online learning
- Tools/workflows: end-to-end research projects guided by agents that teach literature synthesis, experiment design, coding, visualization, and critique
- Assumptions/dependencies: pedagogy-aligned guardrails; assessments that measure understanding; institutional policies on AI assistance
- Governance, disclosure, and auditing infrastructure for AI in science
- Sectors: publishers, funders, universities, government
- Tools/workflows: automated disclosure capture across the lifecycle; integrity dashboards; grant/paper submission checks for provenance and replication readiness
- Assumptions/dependencies: policy consensus, interoperability standards, incentives for compliance, minimal burden on researchers
Notes on feasibility across applications:
- The paper identifies a stage-dependent reliability boundary: tools are strongest in structured, retrieval-grounded, tool-mediated tasks and weakest in tasks demanding novelty and scientific judgment. Immediate deployments should therefore emphasize human-governed collaboration and external verification.
- Automation can obscure error modes; layered designs that integrate planning, execution, and verification with provenance logging are a practical prerequisite for scaling.
- As usage becomes ubiquitous, governance (disclosure, attribution, accountability) matters more than detection—policy and workflow adoption will be decisive for long-term impact.
Glossary
- Agentic extensions: Add-on capabilities that let LLMs plan, use tools, and act autonomously across tasks. "LLMs and their agentic extensions are no longer limited to local writing or coding support;"
- Autonomous experiment orchestration: Automated planning, execution, and management of experiments by AI agents. "This stage includes code generation, paper-to-code translation, autonomous experiment orchestration, and result interpretation."
- Chain-of-thought reasoning: Prompting technique where a model generates intermediate reasoning steps before answers. "It includes direct prompting, chain-of-thought reasoning, role assignment, structured templates, rubric-based instructions, and output constraints."
- Citation-graph traversal: Navigating and analyzing networks of citations to find and relate relevant literature. "Modern systems span semantic retrieval, citation-graph traversal, survey generation, and deep research agents that iteratively explore the literature."
- Citation provenance: Tracking and verifying the origins and accuracy of cited claims and sources. "including phase-boundary faithfulness, scientific judgment, reproducibility, citation provenance, governance, cross-domain generalization, and cognitive ownership."
- Cognitive ownership: Attribution of ideas and intellectual contributions between humans and AI systems. "including phase-boundary faithfulness, scientific judgment, reproducibility, citation provenance, governance, cross-domain generalization, and cognitive ownership."
- Cross-domain generalization: The ability of a method to perform well across different research areas without domain-specific tuning. "including phase-boundary faithfulness, scientific judgment, reproducibility, citation provenance, governance, cross-domain generalization, and cognitive ownership."
- Domain foundation models: Large pretrained models specialized for a scientific domain that enable downstream tasks. "while domain foundation models such as AlphaFold~3 illustrated the broader potential of AI systems to transform specialized scientific discovery."
- Epistemological phases: Stages organized by how knowledge is created, validated, and communicated in research. "organized into four epistemological phases"
- Fidelity constraints: Requirements that derived artifacts (e.g., slides, posters, videos) remain faithful to the paper’s evidence. "Each output format targets a different audience and requires distinct design choices, fidelity constraints, and communication strategies."
- Governance: Policies, disclosures, and oversight mechanisms that ensure responsible AI use in research workflows. "AI use in research is becoming a governance problem rather than a detection problem"
- Human-in-the-loop: Workflows where humans guide or supervise AI systems during decision-making or generation. "IRIS uses MCTS in a human-in-the-loop ideation platform to allocate search as ideas converge,"
- Ideation--execution gap: The discrepancy where promising ideas degrade when implemented and evaluated. "yet suffers from an ideation--execution gap in which seemingly novel ideas often weaken after implementation."
- Instruction tuning: Fine-tuning models on instruction–response pairs to improve following task-specific directions. "They include supervised fine-tuning, instruction tuning, preference optimization, reinforcement learning, and domain-specific adaptation."
- Judge model: A model trained to score or evaluate generated ideas, plans, or outputs. "Spark combines retrieval-augmented generation with a judge model trained on $600$K OpenReview reviews"
- Knowledge-graph reasoning: Using graph-structured scientific knowledge (entities and relations) to derive new hypotheses. "knowledge-graph reasoning, and multi-agent collaboration for structured hypothesis formation."
- Meta-review: An oversight review that synthesizes individual reviews and assesses overall paper quality. "Generating structured reviews, matching reviewers to manuscripts, assessing review quality, and supporting meta-review decisions."
- MCTS (Monte Carlo Tree Search): A search algorithm that uses randomized simulations to guide planning in large spaces. "IRIS uses MCTS in a human-in-the-loop ideation platform"
- Multi-agent collaboration: Coordinated interaction among multiple AI agents to critique, refine, and synthesize research ideas. "knowledge-graph reasoning, and multi-agent collaboration for structured hypothesis formation."
- Next Idea Prediction: A training paradigm where models learn to predict the next plausible research idea from context. "DeepInnovator trains a $14$B model under a ``Next Idea Prediction'' paradigm"
- Orchestration: Coordinating tools, retrieval, models, and verification steps across a multi-stage research workflow. "orchestration, provenance, and feedback design are as important as model scale."
- Paper2X: Converting a paper into other formats (e.g., posters, slides, videos, project pages, agents). "research agents, writing assistants, scientific coding tools, automated reviewers, rebuttal systems, and Paper2X applications"
- Parametric knowledge: Information stored within a model’s learned parameters rather than retrieved from external sources. "Direct LLM generation is limited by the model's parametric knowledge"
- Preference optimization: Training methods that optimize models to align with human or rubric-based preferences. "They include supervised fine-tuning, instruction tuning, preference optimization, reinforcement learning, and domain-specific adaptation."
- Provenance: The documented origin and evidence trail supporting generated content or scientific claims. "without preserving evidence or provenance."
- Retrieval-augmented generation (RAG): Generating outputs grounded in retrieved external documents or data. "Retrieval-augmented generation (RAG) grounds model outputs in external sources"
- Retrieval-grounded: Outputs explicitly supported by retrieved evidence during generation. "AI excels at structured, retrieval-grounded, and tool-mediated tasks"
- Rubric-based instructions: Prompts that specify evaluation criteria to shape model outputs toward desired qualities. "It includes direct prompting, chain-of-thought reasoning, role assignment, structured templates, rubric-based instructions, and output constraints."
- Scientific judgment: Expert evaluation of novelty, validity, significance, and rigor of research contributions. "requiring novelty, implicit domain knowledge, long-horizon reasoning, or scientific judgment."
- Semantic retrieval: Retrieving documents by meaning using embeddings or semantic similarity rather than keyword match. "Modern systems span semantic retrieval, citation-graph traversal, survey generation, and deep research agents that iteratively explore the literature."
- Test-time compute: Adjusting the amount of inference-time reasoning or search to improve output quality. "adaptive test-time compute treats reasoning effort as a controllable resource."
- Tool-mediated: Tasks where AI leverages external tools (e.g., code runners, search, plotting) to achieve goals. "AI excels at structured, retrieval-grounded, and tool-mediated tasks"
Collections
Sign up for free to add this paper to one or more collections.