Closing the Loop to Discover Psychological Theories with an Automated Cognitive Scientist

Published 24 Jun 2026 in q-bio.NC and cs.AI | (2606.26448v1)

Abstract: Across the sciences, autonomous systems are increasingly being used in closed-loop discovery, proposing new theories and designing and running experiments to test them. This approach is yet to be applied in the field of cognitive science, where the central bottleneck is theory-building: the creative step of turning the accumulated failures of existing models into better ones. Theory generation has remained manual even as data collection, modeling, and experiment design have been automated. We present the Automated Cognitive Scientist (AutoCog), a fully autonomous agentic-AI system that closes this loop. Large-language-model agents advocate competing theories, each expressed as an executable cognitive model, design experiments that best discriminate them, collect behavioral data from participants recruited online, score theories against collected data based on their generative performance, diagnose why they fail, and synthesize a better successor. Repeating this cycle allows them to search the space of theories, models, and experiments. In the domain of decision-making, AutoCog recovered known decision-making strategies from simulated behavior, including unconventional ones, showing that its discoveries are ultimately driven by the data rather than strictly bound by the priors of the underlying LLMs. When run with human participants, it produced theories that outperformed the established theories it was seeded with and generalized to held-out studies in two different experimental settings. It also surfaced a novel theory of multi-cue decision-making in which choices show diminishing sensitivity to feature values. The distinctive predictions of this theory were confirmed in a preregistered study with new participants. AutoCog demonstrates how an automated discovery system can be used to turn cognitive theory-building into an explicit, executable, and cumulative science.


Authors (8)

Summary

The paper introduces AutoCog, an automated agent-based system that integrates theory formation, experiment design, behavioral data collection, and revision.
It demonstrates robust recovery of canonical and non-canonical psychological models, achieving low mean squared error under varied noise conditions.
Theories surfaced by the system outperform the established seed models on human data, and the logged closed-loop process supports scalable, auditable theory revision.

AutoCog: Autonomous Closed-Loop Discovery of Psychological Theories

Introduction and Motivation

The paper "Closing the Loop to Discover Psychological Theories with an Automated Cognitive Scientist" (2606.26448) introduces AutoCog, a fully autonomous agentic-AI framework for psychological theory discovery. Unlike prior systems constrained to automating data collection, hypothesis testing, or model fitting, AutoCog operationalizes closed-loop scientific discovery in cognitive psychology—integrating theory formation, experiment design, behavioral data collection, theory comparison, and revision. The system's automation extends to the generative, creative aspects of theory-building, conventionally considered resistant to mechanization in empirical disciplines such as psychology.

The approach is motivated by a persistent bottleneck: theory-building has remained manual because explaining human cognition requires open-ended conceptual search. Closed-loop discovery has made substantial progress in domains such as chemistry, materials science, and mathematics. Closing the loop in cognitive science additionally demands mechanisms that explore mechanistic theories and meaningfully update them in response to empirical falsification, rather than only fitting candidate models to pre-assembled datasets.

System Architecture and Methodology

AutoCog operates as a cyclic closed-loop system (Figure 1), in which each execution involves:

Experimental Design (Stage 1): LLM agents representing competing theories (encoded as executable models) propose experiments that are maximally discriminative. This follows an adversarial design principle—each agent constructs an experiment likely to empirically discredit its competitor. Proposed metrics for quantifying performance are generated alongside the experimental designs.
Behavioral Data Collection (Stage 2): Automated administration of the constructed experiments, yielding behavioral datasets from human participants, cognitive models, or foundation models.
Analysis and Arbitration (Stages 3–4): Theories are scored via forward simulation, not fitted post hoc, to evaluate generative adequacy against empirical data according to pre-registered metrics. An LLM-based arbitrator interprets model–data discrepancies and issues structured verdicts about theory revision or replacement.
Theory Revision (Stage 5): Based on arbitration, underperforming theories are either programmatically revised (via code synthesis) or replaced at both the description (natural language) and implementation levels. A candidate revision is verified by simulation and must improve fit before it replaces its predecessor.
Figure 1: Schematic of the AutoCog discovery loop, showing end-to-end automation from experiment design to theory revision across N cycles.

This agentic, multi-agent architecture emphasizes open-ended search over both experimental and model spaces. It is designed for generalizability, with theory and experiment modules specified through typed interface contracts that abstract away from the specifics of the decision task.
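
As a rough illustration of what a typed theory interface and one pass of the loop might look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names `Theory`, `ConstantTheory`, `run_cycle`, and the `toy_*` callables are hypothetical, and the design, collection, and revision steps are stubbed with plain functions where the paper uses LLM agents and human participants.

```python
from dataclasses import dataclass
from typing import Callable, Protocol, Sequence

# A trial is a pair of cue vectors, one per option (A, B). Hypothetical type.
Trial = tuple[tuple[int, ...], tuple[int, ...]]


class Theory(Protocol):
    """Interface contract: any theory must expose a name and P(choose B)."""
    name: str

    def p_choose_b(self, trial: Trial) -> float: ...


@dataclass
class ConstantTheory:
    """Toy theory that predicts the same P(B) on every trial."""
    name: str
    p_b: float

    def p_choose_b(self, trial: Trial) -> float:
        return self.p_b


def mse(theory: Theory, trials: Sequence[Trial], observed: Sequence[float]) -> float:
    """Mean squared error between predicted and observed P(B) per trial."""
    return sum((theory.p_choose_b(t) - o) ** 2 for t, o in zip(trials, observed)) / len(trials)


def run_cycle(a: Theory, b: Theory, design: Callable, collect: Callable, revise: Callable):
    """One cycle: design -> collect -> score -> revise the weaker theory."""
    trials = design(a, b)
    observed = collect(trials)
    winner, loser = (a, b) if mse(a, trials, observed) <= mse(b, trials, observed) else (b, a)
    candidate = revise(loser, trials, observed)
    # Keep the revision only if it actually fits better than the theory it replaces.
    if mse(candidate, trials, observed) < mse(loser, trials, observed):
        loser = candidate
    return winner, loser


# Stub components standing in for the LLM agents and the participant pool.
def toy_design(a, b):
    return [((1, 0), (0, 1)), ((1, 1), (0, 0))]

def toy_collect(trials):
    return [0.3, 0.1]  # made-up observed proportions of B choices

def toy_revise(loser, trials, observed):
    return ConstantTheory(loser.name + "-rev", sum(observed) / len(observed))

winner, successor = run_cycle(
    ConstantTheory("A", 0.2), ConstantTheory("B", 0.9),
    toy_design, toy_collect, toy_revise,
)
print(winner.name, successor.name)
```

Using a structural `Protocol` rather than a base class means any object with the right attributes can act as a theory, which is one plausible way to keep the loop independent of a particular decision task.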

Ground-Truth Recovery: Synthetic Case Study

To validate the discovery dynamics, AutoCog was tested on synthetic data generated from known canonical and non-canonical cognitive strategies (e.g., Take-the-Best [TTB], Tallying, Weighted Additive [WADD], Alternating, Perseveration, Random). The evaluation criterion was the mean squared error (MSE) between the discovered and the ground-truth model's probabilistic choice profile, measured on held-out diagnostic tasks.
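
The recovery metric can be sketched as follows. This is a minimal illustration, assuming the MSE is taken over per-trial probabilities of choosing option B and that the SEM is computed over the per-trial squared errors; the summary does not say how the SEM was aggregated, so that part is a guess. The function name and the example probabilities are hypothetical.

```python
import math
from statistics import mean, stdev


def mse_and_sem(p_discovered, p_true):
    """MSE between two P(B) profiles, plus the SEM of the squared errors."""
    squared_errors = [(d - t) ** 2 for d, t in zip(p_discovered, p_true)]
    sem = stdev(squared_errors) / math.sqrt(len(squared_errors))
    return mean(squared_errors), sem


# Made-up P(B) values on four held-out diagnostic trials.
p_true = [0.9, 0.1, 0.5, 0.8]
p_discovered = [0.85, 0.15, 0.5, 0.7]
mse_value, sem_value = mse_and_sem(p_discovered, p_true)
print(mse_value, sem_value)
```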

Key findings are:

Near-Exact Recovery in the Noise-Free Regime: For noiseless simulated datasets, the best surfaced theories achieved $MSE_{\hat{p}(B)}=0.0025$ ($SEM=0.0011$), statistically indistinguishable from the generating mechanism, despite initialization with confounding seed models.
Robustness to Noise: As action noise increased to $\epsilon=0.5$ and $\epsilon=0.75$, recovery performance degraded gracefully but remained substantially better than that of random or seed theories.
Diverse Mechanism Recovery: Non-canonical strategies were successfully surfaced (e.g., Alternating $M=0.003$, Perseveration $M=0.005$). For more challenging spurious strategies (e.g., single-cue or anti-majority), extended cycles were required to overcome the LLM's priors and converge on the generating rule.
Figure 2: Quantitative recovery of generating theories under increasing action noise, for canonical and non-canonical strategies.
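
To make the role of action noise concrete, here is a small sketch under one common reading of $\epsilon$: with probability $\epsilon$ the simulated agent ignores its strategy and picks uniformly at random. That lapse-style definition is an assumption, as are the function names.

```python
import random


def noisy_p_b(p_b, epsilon):
    """P(B) after mixing the strategy's P(B) with a uniform random choice."""
    return (1 - epsilon) * p_b + epsilon / 2


def simulate_choice(p_b, epsilon, rng):
    """Sample one choice: lapse with probability epsilon, else follow the strategy."""
    if rng.random() < epsilon:
        return rng.choice(["A", "B"])
    return "B" if rng.random() < p_b else "A"


# A strategy that always picks B looks far less decisive under heavy noise.
for eps in (0.0, 0.5, 0.75):
    print(eps, noisy_p_b(1.0, eps))
```

Under this reading, even a deterministic strategy yields choice proportions pulled toward 0.5, which is why recovery becomes harder as $\epsilon$ grows.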

An LLM-as-judge rubric confirmed high mechanism similarity between discovered and generating models for low-noise cases, and a principled degradation of mechanism similarity with increasing behavioral stochasticity.

Human Data: Discovery of Decision-Making Theories

The system was extended to behavioral data from human participants in a two-alternative multi-attribute decision-making task, configured in both binary and cardinal cue spaces. The key result is the progressive surfacing of population-invariant, task-invariant cognitive models that outperform canonical seeded theories, generalize to unseen task designs, and yield novel psychological hypotheses.

Trajectory of Model Improvement: Across five closed-loop cycles, $MSE_{\hat{p}(B)}$ for the winning model decreased from $0.093$ to $0.0097$ (SEM $=0.0018$), roughly an order-of-magnitude reduction.
Cumulative Theory Generalization: The winning model, a Non-linear Subjective Weighting Model, unified classical heuristics as special cases by positing a free exponent on subjective weighting of cue validities—a representational innovation not enforced by explicit parsimony or diversity constraints.
External Generalization: When challenged on diagnostic stimuli from Hilbig et al. (2014), the surfaced model achieved $MSE_{\hat{p}(B)}=0.0377$, outperforming TTB, Tallying, and linear WADD.
Figure 3: Discovery trajectory with theory lineage, model code, and fit metrics showing the stabilization on a non-linear subjective weighting model over cycles.
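
The way a single exponent can span the classical heuristics can be sketched as below. This is an illustrative reconstruction, not the model code from the paper: it assumes each cue weight is the validity raised to a free exponent `alpha`, and it uses a deterministic sign rule for the choice. The validities and cue patterns are made up.

```python
def weighted_evidence(cues_a, cues_b, validities, alpha):
    """Evidence for B over A, with each validity raised to the exponent alpha."""
    return sum((v ** alpha) * (b - a) for a, b, v in zip(cues_a, cues_b, validities))


def choose(cues_a, cues_b, validities, alpha):
    evidence = weighted_evidence(cues_a, cues_b, validities, alpha)
    if evidence > 0:
        return "B"
    if evidence < 0:
        return "A"
    return "tie"


validities = [0.9, 0.7, 0.6]
option_a = [1, 0, 0]  # favored only by the most valid cue
option_b = [0, 1, 1]  # favored by the two less valid cues

# alpha = 0 gives equal weights (tallying), alpha = 1 gives linear WADD,
# and a large alpha lets the most valid cue dominate (TTB-like).
for alpha in (0, 1, 10):
    print(alpha, choose(option_a, option_b, validities, alpha))
```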

Extension to Complex Decision Spaces and Prospective Validation

Expanding to cardinal-valued cues permitted discrimination among finer-grained utility transformations. The system autonomously discovered a Diminishing Returns WADD model, hypothesizing concave transformation of feature utilities—a curvature not predicted by seeded heuristics.
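
A minimal sketch of a diminishing-returns weighted additive model follows. The power-law utility, the logistic choice rule, and the parameters `gamma` and `beta` are assumptions chosen for illustration; the summary specifies only that feature utilities are transformed concavely.

```python
import math


def utility(x, gamma=0.5):
    """Concave transform of a cue value (gamma < 1 gives diminishing returns)."""
    return x ** gamma


def score(cues, validities, gamma=0.5):
    return sum(v * utility(x, gamma) for x, v in zip(cues, validities))


def p_choose_b(cues_a, cues_b, validities, gamma=0.5, beta=3.0):
    """Logistic choice probability for option B given the score difference."""
    diff = score(cues_b, validities, gamma) - score(cues_a, validities, gamma)
    return 1.0 / (1.0 + math.exp(-beta * diff))


validities = [0.8, 0.6]
# Same one-point advantage for B on the first cue, at a low and a high level.
low = p_choose_b((1, 3), (2, 3), validities)
high = p_choose_b((4, 3), (5, 3), validities)
print(low, high)
```

With `gamma=1.0` the two probabilities coincide, so any gap between `low` and `high` comes from the curvature alone.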

The theory was then validated prospectively in a preregistered human study that tested three nontrivial predictions:

Model Discrimination: The Diminishing Returns WADD model matched human choices at significantly higher rates than its competitors, including TTB.
Steep-vs-Flat Region Effects: Participants preferred low-range advantages (steep regions of the utility curve) over high-range (flat regions), consistent with concave transformation.
Range-Shift Effects: Preferences for a given cue structure were attenuated when shifted to higher absolute value regions, a non-linear effect captured by the model.

The distinctive predictions held in preregistered, prospectively collected data, underscoring the system's capacity for genuine theoretical innovation, not mere refitting.

Figure 4: Closed-loop discovery and external prospective validation of the Diminishing Returns WADD model in a cardinal-cue decision task.

Process Visualization and Systemic Auditability

The modular, agent-native architecture supports fine-grained process auditability and reproducible automation. Additional figures depict the explicit computational pipeline, including adversarial design, data collection, arbitration, and revision:

Figure 5: Adversarial experimental design with LLM-driven experiment proposals and verification.

Figure 6: Automated collection of behavioral data from humans or simulators, employing the same computational pipeline.

Figure 7: Forward simulation-based theory analysis and neutral arbitration, scoring all theories on all datasets.

Figure 8: Theory revision loop with program synthesis and aggregate-fit validation.

Practical and Theoretical Implications

AutoCog provides a generalized, extensible paradigm for fully autonomous theory discovery—turning computational/theoretical psychology into an explicit, executable, and cumulative enterprise. The integration of agentic proposal, self-verification, and accumulation of behavioral constraints induces pressure for generalizability and explanatory adequacy, rather than mere post hoc fit.

Practically, this architecture can support both in silico-theory preselection (with behavioral foundation models or cognitive simulators) and direct human-data-driven cumulative theory search. The open-ended, logged process enables human researchers to continuously audit, intervene, or modify discovery trajectories—shifting the creative role toward the specification of desiderata, evaluation criteria, and design/bias constraints.

From an AI perspective, AutoCog demonstrates the viability of fully agentic, program-synthesizing, multi-agent scientific discovery frameworks. The approach is designed to be domain-agnostic within psychology and, with appropriate representational scaffolding, may extend to other empirical sciences amenable to closed-loop experiment–theory iteration.

Future work may target explicit diversity objectives, alternative discovery paradigms beyond adversarial contest, richer human-in-the-loop oversight, and generalization to higher multiplicity or hierarchical theory spaces.

Conclusion

AutoCog realizes a closed-loop discovery system in cognitive science, automating not only modeling and experiment design but the generative process of theory-building from empirical behavioral evidence. It achieves strong empirical results in recovering and outperforming canonical psychological models, surfaces novel theories that withstand preregistered prospective testing, and establishes a foundation for scalable, auditable scientific automation in psychology and related fields (2606.26448).


Explain it Like I'm 14

Overview: What this paper is about

This paper introduces AutoCog, an “automated cognitive scientist.” It’s an AI system that does the full science cycle on its own to discover how people think and make decisions. Instead of just running one step (like collecting data), AutoCog:

suggests new theories about how people decide,
designs experiments to test those theories,
runs the experiments with real people online,
checks which theory best matches the results, and
improves or replaces the weaker theory.

By repeating this loop, the system learns better and better explanations of human decision-making.

The big questions the paper asked

Can an AI system invent and improve psychological theories, not just analyze data?
If given some starting ideas (like well-known decision strategies), can it discover the correct strategy that generated the data?
When tested with real people, can it create new theories that predict behavior better than classic ones?
Can it uncover genuinely new patterns in how people choose?

How AutoCog works (in everyday terms)

Think of science as a loop: plan → test → learn → improve → repeat. AutoCog automates that loop.

Here’s the idea using an analogy: imagine two coaches with different training plans (theories). AutoCog asks, “What drills (experiments) would make their predictions differ the most?” It runs those drills with players (participants), sees who performs as predicted (data), judges which plan fits better, and then helps the weaker coach redesign their plan (revise the theory). Then it repeats.

AutoCog’s loop has four steps:

Experimental design: It picks or creates tasks where the current two theories disagree the most. That makes the test informative.
Behavioral data collection: It runs the tasks with people online (sometimes also with computer models or AI stand-ins).
Analysis and arbitration: Instead of forcing models to fit the data, it “presses play” on each theory’s code to generate what it would do, then compares those predictions to what people actually did. An AI “arbiter” explains who did better and why.
Theory revision: The weaker theory is rewritten or replaced as executable code (a working program), using AI to synthesize a better idea that addresses the failures.
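
The design step can be pictured with a brute-force stand-in: enumerate candidate trials and keep those where two theories' predictions differ most. In the paper the proposals come from LLM agents, so the exhaustive search below is a substituted technique for illustration only, and the two toy theories are hypothetical.

```python
from itertools import product


def most_discriminating(theory_1, theory_2, n_cues, k=3):
    """Return the k binary-cue trials with the largest prediction gap."""
    patterns = list(product([0, 1], repeat=n_cues))
    trials = [(a, b) for a in patterns for b in patterns if a != b]
    ranked = sorted(trials, key=lambda t: abs(theory_1(*t) - theory_2(*t)), reverse=True)
    return ranked[:k]


def first_cue_only(a, b):
    """Toy theory: P(B) depends only on the first cue."""
    if a[0] == b[0]:
        return 0.5
    return 1.0 if b[0] > a[0] else 0.0


def count_cues(a, b):
    """Toy theory: P(B) depends on which option has more positive cues."""
    if sum(a) == sum(b):
        return 0.5
    return 1.0 if sum(b) > sum(a) else 0.0


top = most_discriminating(first_cue_only, count_cues, n_cues=3, k=1)
gap = abs(first_cue_only(*top[0]) - count_cues(*top[0]))
print(top[0], gap)
```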

Key terms in simple language:

Theory as code: A theory isn’t just words; it’s a small computer program that outputs choices. Think of it as a recipe you can run to see what it would choose on each trial.
Closed loop: The system completes the full plan–test–learn–improve cycle repeatedly, not just one part.
Generative comparison: Rather than bending a model to match data, you let the model act, then compare its actions to humans’ actions.

What experiments they ran

They focused on decision-making where people pick between two products described by several “cues” (like expert ratings). They tried two versions of the task:

Binary cues: Each cue is either 0 or 1 (e.g., no vs. yes).
Cardinal cues: Each cue is an integer like 0 to 5 (so size of differences matters).

They seeded AutoCog with classic strategies:

Take-The-Best (TTB): Look at the most important cue first and decide as soon as that cue differs.
Tallying: Count how many cues favor each option; pick the one with more.
WADD (Weighted Additive): Sum up cue values, multiplying each by how reliable that cue is (its “validity”).
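
The three seed strategies can be written as short programs, which is the sense in which a theory is "code". The versions below are a minimal sketch with deterministic choices and made-up validities; the paper's seed implementations may differ in details such as tie handling or noise.

```python
def ttb(a, b, validities):
    """Take-The-Best: check cues from most to least valid, stop at the first difference."""
    order = sorted(range(len(validities)), key=lambda i: validities[i], reverse=True)
    for i in order:
        if a[i] != b[i]:
            return "B" if b[i] > a[i] else "A"
    return "tie"


def tallying(a, b, validities):
    """Tallying: count cues favoring each option, ignoring validities."""
    favor_a = sum(1 for x, y in zip(a, b) if x > y)
    favor_b = sum(1 for x, y in zip(a, b) if y > x)
    if favor_a == favor_b:
        return "tie"
    return "B" if favor_b > favor_a else "A"


def wadd(a, b, validities):
    """Weighted Additive: sum cue values weighted by validity."""
    score_a = sum(v * x for v, x in zip(validities, a))
    score_b = sum(v * x for v, x in zip(validities, b))
    if score_a == score_b:
        return "tie"
    return "B" if score_b > score_a else "A"


validities = [0.9, 0.7, 0.6]
option_a, option_b = (1, 0, 0), (0, 1, 1)
for strategy in (ttb, tallying, wadd):
    print(strategy.__name__, strategy(option_a, option_b, validities))
```

On this trial the strategies disagree, which is the kind of stimulus that helps tell them apart.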

They tested AutoCog in two ways:

With simulated data: Could it rediscover the true strategy that generated the data (including some unusual strategies like alternating choices or sticking with the last choice)?
With real people online: Could it improve on the starting theories and find better ones that generalize to new studies?

Main findings and why they matter

Recovery of known and unusual strategies

With simulated data, AutoCog reliably rediscovered the ground-truth strategy—even when the data had noise (randomness).
It also learned less common patterns (like alternating choices or always picking the worse option) after enough cycles. This shows the system isn’t just biased toward famous theories; it can follow the data.

Why it matters: It proves the loop can “home in” on the correct explanation, not just repeat what it already knows.

Discovery with real people (binary cues)

AutoCog proposed a simple but powerful idea: people transform cue importance in a non-linear way (like raising it to a power) before summing—so the same framework can behave like tallying, WADD, or TTB depending on that one exponent.
This new model predicted human choices better than the seed theories and matched results from a classic, separate study—showing it generalized.

Why it matters: A single, compact theory can explain several well-known strategies as special cases.

Discovery with real people (cardinal cues) and a new psychological pattern

When cue values could be big or small (0–5), AutoCog found a new twist: people show diminishing sensitivity to cue size. In plain terms, an increase from 1 to 2 “feels” bigger than an equal increase from 4 to 5. That’s like a “diminishing returns” curve—steep at the start and flatter later.
AutoCog also surfaced a second, simpler idea: people might turn multi-level ratings into pass/fail based on a threshold, then do a weighted tally. Both models beat the seed theories, but the diminishing-returns model was chosen as best.
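
The threshold idea can be sketched in a few lines. The cutoff of 3 on a 0 to 5 scale, the function names, and the example values are all assumptions for illustration; the summary does not report the threshold the model used.

```python
def binarize(cues, threshold=3):
    """Turn multi-level ratings into pass (1) or fail (0)."""
    return [1 if x >= threshold else 0 for x in cues]


def threshold_score(cues, validities, threshold=3):
    """Weighted tally of the binarized cues."""
    return sum(v * c for v, c in zip(validities, binarize(cues, threshold)))


def linear_score(cues, validities):
    """Plain weighted sum of raw cue values, for comparison."""
    return sum(v * x for v, x in zip(validities, cues))


validities = [0.8, 0.6]
option_a, option_b = (5, 2), (3, 3)
# Linear WADD prefers A (one very high rating); the threshold model prefers B
# (both ratings pass).
print(linear_score(option_a, validities), linear_score(option_b, validities))
print(threshold_score(option_a, validities), threshold_score(option_b, validities))
```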

Why it matters: This “diminishing returns” pattern is a new, testable rule about human choice in this setting, similar in spirit to the famous idea from prospect theory that we feel changes less when we’re already at a high level.
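
As a worked example of diminishing sensitivity, assume a square-root utility $f(x)=x^{1/2}$. This functional form is an illustrative choice, not the fitted transform from the paper. Two equal one-point steps then produce unequal changes in subjective value:

```latex
\begin{align*}
f(2) - f(1) &= \sqrt{2} - 1 \approx 0.414, \\
f(5) - f(4) &= \sqrt{5} - 2 \approx 0.236.
\end{align*}
```

The step from 1 to 2 counts for nearly twice as much as the step from 4 to 5, even though both are one point on the rating scale.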

A preregistered, forward-looking test confirmed the new theory

The team preregistered three predictions (ahead of time) and ran a new study:
- On trials designed to split the models, people matched the diminishing-returns model more often than the classic models.
- Equal-sized steps mattered more in the low-value range than in the high-value range (steep-vs-flat effect).
- The same pattern “weakened” when the whole scale was shifted up (level-shift effect).

All three predictions were supported.

Why it matters: This is strong evidence that the newly discovered theory captures how people actually choose—not just in hindsight, but in fresh data planned in advance.

What this could mean for science

Faster, clearer theory-building: AutoCog turns the creative step of forming new theories into an explicit, executable process. Every step—experiments, code, data, decisions—is logged, so the discovery path is auditable and cumulative.
Broader and better testing: Because the loop designs targeted experiments where theories disagree the most, it can learn faster and waste less effort.
Human–AI partnership: People still set the goals and boundaries (what tasks and model forms to explore). The AI explores within those guardrails and proposes improvements.
Scalable discovery: In some domains, early loops might run “in-silico” (with AI stand-ins) before moving to real participants, saving time and cost—then confirming with humans where it counts.
New insights into the mind: The diminishing-returns pattern suggests people compress big values—an idea that could connect to how we judge money, time, risk, and more.

A few caveats:

The system can only search within the experiment and model spaces we give it. If the right idea lies outside those boundaries, it won’t find it.
It wasn’t explicitly rewarded for “simplicity” or “interpretability,” though many surfaced theories ended up being quite simple.
Results depend on good experimental design, careful participant recruitment, and fair comparisons—all of which the loop works to automate responsibly.

Bottom line

AutoCog shows that an AI can do more than analyze data—it can help invent, test, and refine psychological theories in a full, repeating loop. It rediscovered known strategies, found a compact model that unifies classic ideas, and uncovered a new, confirmed pattern in how people weigh information: diminishing sensitivity to larger cue values. This points toward a future where building theories about the mind becomes more transparent, testable, and steadily improving.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a consolidated list of what remains missing, uncertain, or unexplored in the paper, stated concretely to guide follow-up research.

Scope and generalization

The system is only demonstrated in two-alternative, multi-attribute choice; it is unknown whether AutoCog scales to multi-option choice sets, ranking, or continuous response settings.
Generalization to other cognitive domains (e.g., category learning, memory, language, reinforcement learning, planning, causal reasoning) is untested.
All human data come from online convenience samples; cross-cultural, developmental, clinical, or high-stakes contexts are not evaluated.
External validation uses a single held-out literature dataset; performance across a broader suite of standardized benchmarks is unknown.
The discovered “diminishing returns” on cue magnitudes is not positioned against established multi-attribute utility theory or prospect-theory-style models (e.g., concave attribute utilities, configural weighting); it is unclear whether the surfaced account is novel, equivalent to, or a re-derivation of known formulations.

Model design space and identifiability

The theory/model interface appears to support only trial-level choice probabilities with limited use of history; applicability to learning dynamics, attentional switching, sequential sampling, or memory-based process models is unclear.
No formal parameter estimation (individual or hierarchical) is performed; how parameters are set/aggregated for predictions is unspecified, and parameter uncertainty is not quantified.
Model comparison relies on pooled MSE of choice proportions; likelihood-based metrics, predictive log-loss, WAIC/LOO, or Bayes factors with complexity control are not used, risking favoring flexible models without penalties.
Pooled proportion metrics ignore within-subject patterns and heterogeneity; the system does not report individual-level predictive accuracy or hierarchical fits.
Identifiability among closely related candidates (e.g., Diminishing Returns WADD vs. Threshold-based Binarization/Satisficing WADD) is not fully probed; there is no preregistered head-to-head discrimination between these two finalists.
Non-additive cue interactions, context effects, and configural processing are not explicitly explored; it is unknown if the search space permits such mechanisms.
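
For contrast with pooled MSE on choice proportions, a likelihood-based score evaluates every individual choice. The sketch below shows mean predictive log-loss for binary choices; it is not a metric the paper reports using, and the function name and data are hypothetical.

```python
import math


def log_loss(p_b, chose_b, eps=1e-12):
    """Mean negative log-likelihood of observed choices (1 = chose B, 0 = chose A)."""
    total = 0.0
    for p, y in zip(p_b, chose_b):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p) if y else math.log(1 - p)
    return total / len(p_b)


choices = [1, 1, 0, 1]
confident = log_loss([0.9, 0.8, 0.2, 0.7], choices)
uninformed = log_loss([0.5, 0.5, 0.5, 0.5], choices)
print(confident, uninformed)
```

Because each trial contributes separately, this kind of score can be computed per participant, which pooled proportions cannot offer.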

Experimental design, metrics, and arbitration

Adversarial experimental design is LLM-driven; there is no comparison to formal optimal experimental design (e.g., information gain) or power analyses, nor guarantees on statistical efficiency.
LLM-proposed metrics are “self-verified,” but the risk of metric selection bias and miscalibration (Type I/II error) is not assessed; statistical properties of the metrics across cycles remain unspecified.
Arbitration and revision are performed by LLMs; stability across random seeds, prompt variants, and agent configurations, as well as agreement with human experts, is not reported.
The search uses two concurrent theories per cycle; the effect of beam width, exploration–exploitation trade-offs, and path dependence on the seeds is not quantified.
Focusing designs on current-theory disagreements may miss stimuli informative for discovering mechanisms outside the incumbents’ span; coverage of the broader stimulus space is not measured.

LLM dependence and robustness

Discovery may be biased by LLM priors; while some unconventional strategies are recovered, the magnitude and structure of this bias across domains and prompts are not characterized.
Only one foundation model (Gemini 3.1) is used; cross-LLM robustness (model family, temperature, system prompts) and sensitivity to prompt engineering are not evaluated.
The “LLM-as-judge” measure of mechanism similarity is not validated against human expert ratings or alternative automated criteria; its reliability and calibration are unknown.
Safety of LLM-generated code (sandboxing, dependency control) and protections against prompt injection or data leakage are not detailed.

Human experiments and data quality

Per-experiment sample sizes are small (n≈25); statistical power per cycle and resulting sensitivity to detect model differences are not established.
Attention/comprehension checks, exclusion criteria, and incentive structures are not described here; data-quality controls and their effects on results remain unclear.
Only choices are modeled; richer signals (reaction times, confidence, process tracing, eye-tracking) that could distinguish theories are not collected or analyzed.
Noise is treated as simple lapse/ε-greedy; more realistic structured noise models, individual variability, and trial-by-trial nonstationarities are not incorporated.

Validation, reproducibility, and efficiency

Human closed-loop results are shown for one run per setting; variability and reproducibility across independent runs are not reported.
Computational cost, wall-clock time, and resource requirements of the closed loop (LLM calls, synthesis, verification, data collection) are not provided, limiting assessment of scalability.
The availability of code, prompts, experiment logs, and data for full reproducibility is not specified in the main text.
There is no head-to-head comparison with alternative automated-discovery pipelines (e.g., GeCCo/ASMR) on identical tasks/datasets to quantify relative efficiency and quality.

Ethical and governance considerations

Procedures for ethical oversight (e.g., IRB, content moderation, participant risk mitigation) in fully autonomous study deployment are not detailed.
Governance mechanisms to prevent harmful, deceptive, or manipulative experiments/metrics and to enforce pre-registration and audit trails are not described.

Task assumptions and external validity

Experiments assume validities are explicitly shown to participants; generalization to settings where reliabilities must be learned or inferred over time is untested.
Only 4–5 cues and two options are used; behavior with larger attribute sets, sparsity, or correlated cues is not examined.
Normalization choices (e.g., validity normalization across experiments) may affect comparability; sensitivity analyses for these preprocessing steps are not reported.

Open methodological extensions

How to integrate explicit objectives for parsimony, unification, and interpretability into the loop (beyond predictive performance) remains open.
The potential of in-silico pre-screening (foundation models as participants) to reduce human data collection without biasing discoveries is not validated.
Methods for adaptive sample sizing and sequential experimental design within cycles (e.g., stopping rules, multi-armed bandit allocation) are not explored.
Meta-discovery—learning or expanding the experiment/model design spaces themselves—has not been attempted.
Incorporation of multi-modal constraints (e.g., neural/physiological data) to narrow theory space is not addressed.


Practical Applications

Below is a concise mapping from the paper’s findings and methods to practical, real-world applications. Each item notes sector(s), what can be done, and the main assumptions/dependencies that affect feasibility.

Immediate Applications

Sector: Software, e-commerce, marketing, UX research
- Use AutoCog-like workflows to propose discriminating A/B/n tests that pit concrete behavioral hypotheses against each other, recruit participants online, and select the winning design/copy/flow based on generative model predictions rather than post-hoc fitting.
- Bundle as a “Behavioral Experiment Copilot” integrated with Qualtrics/SurveyMonkey/Prolific for near-automatic study launch, arbitration, and iteration.
- Tools/workflows that might emerge:
- Adversarial experimental design module that auto-generates stimuli where candidate explanations diverge.
- Audit-trace dashboards that log the full loop for reproducibility and governance.
- Assumptions/dependencies:
- Ready access to online participant pools; IRB/ethics alignment for rapid experimentation; LLM reliability and guardrails; clear specification of experiment and model design spaces.
Sector: Software, search/recommendation, marketplaces, hiring, ad-tech
- Update scoring functions to apply a concave (diminishing-returns) transformation to attribute magnitudes before weighting by reliability, reflecting the paper’s “Diminishing Returns WADD” result.
- Prioritize improvements on low-rated attributes rather than over-optimizing already-high ones (e.g., product quality subscores, candidate screening attributes, ad quality metrics).
- Tools/workflows that might emerge:
- A drop-in “nonlinear weight transform” library: score_j = Σ_i w_i * f(x_ij) for each item j, with f concave (e.g., power law with exponent < 1) and a softmax over item scores where choice probabilities are needed.
- Offline replay and shadow-mode A/B to validate ranking changes and monitor fairness.
- Assumptions/dependencies:
- The discovered diminishing-sensitivity regularity generalizes to your domain; availability of reliable cue validities; fairness audits to ensure the transform doesn’t entrench disadvantage on protected groups.
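
A scoring function of this kind might look like the sketch below. It assumes a power-law transform with exponent `gamma` and a softmax over item scores; the names, weights, and attribute values are hypothetical, and whether the transform helps in a given product would need its own validation.

```python
import math


def concave(x, gamma=0.5):
    """Power-law transform; gamma < 1 compresses large attribute values."""
    return x ** gamma


def item_score(attributes, weights, gamma=0.5):
    return sum(w * concave(x, gamma) for x, w in zip(attributes, weights))


def softmax(scores):
    """Numerically stable softmax over a list of item scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


weights = (0.5, 0.5)
items = [(3, 3), (5, 1)]  # balanced item vs. lopsided item, equal linear totals
probs = softmax([item_score(item, weights) for item in items])
print(probs)
```

With a linear transform the two items tie, while the concave transform favors the balanced one, matching the advice to prioritize improvements on low-rated attributes.
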
Sector: Healthcare, finance, government services
- Redesign patient/investor/benefit choice interfaces to emphasize gains on low-range attributes (e.g., small reductions in co-pay or risk where baseline is low), which people weight more heavily than equal gains at high ranges.
- Present trade-offs using visualizations that highlight low-range improvements.
- Tools/workflows that might emerge:
- “Choice-Impact Calculator” that predicts user preference shifts under concave value assumptions; template libraries for steep-region-focused messaging.
- Assumptions/dependencies:
- Domain-specific validation and ethical review (e.g., avoiding manipulation); ability to estimate or elicit attribute validities and ranges; compliance with sector regulations.
Sector: Academia (cognitive science, behavioral science, HCI)
- Use AutoCog-style loops to propose, verify, and run discriminating experiments; synthesize executable models for theory revision; publish the machine-readable trace as a research artifact.
- Tools/workflows that might emerge:
- “Theory Synthesis IDE” for programmatic model generation and self-verification; preregistration templates auto-filled from the loop’s arbitration/metrics.
- Assumptions/dependencies:
- Institutional IRB acceptance of LLM-assisted design; computational budget for multiple cycles; standards to report LLM prompts/outputs for transparency.
Sector: ML/AI evaluation and safety
- Use the loop to create tasks that maximally discriminate between competing AI agent policies (e.g., prompting strategies, RL policies) without fitting to data.
- Tools/workflows that might emerge:
- “Behavioral Benchmark Generator” for LLMs/agents with generative arbitration and auto-metric proposal.
- Assumptions/dependencies:
- Clear specification of agent inputs/outputs; sandboxed evaluation; guardrails to avoid prompt leakage or test contamination.
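
As a rough illustration of picking tasks that discriminate between competing policies using predictions alone, the sketch below enumerates all binary-cue choice pairs and keeps those where two policies make different, non-tied predictions. The two policies (Weighted-Additive and Take-the-Best), the cue validities, and all function names are assumptions chosen for illustration; this is a brute-force stand-in, not AutoCog's design procedure.

```python
from itertools import product

VALIDITIES = (0.9, 0.6, 0.6)  # hypothetical cue validities, most valid first


def wadd(a, b, validities=VALIDITIES):
    """Weighted-additive policy: 0 if option a wins, 1 if b wins, None on a tie."""
    score_a = sum(v * x for v, x in zip(validities, a))
    score_b = sum(v * x for v, x in zip(validities, b))
    if abs(score_a - score_b) < 1e-9:
        return None
    return 0 if score_a > score_b else 1


def ttb(a, b, validities=VALIDITIES):
    """Take-the-Best policy: decide on the most valid cue that differs."""
    order = sorted(range(len(validities)), key=lambda i: -validities[i])
    for i in order:
        if a[i] != b[i]:
            return 0 if a[i] > b[i] else 1
    return None  # no cue discriminates


def discriminating_trials(policy_1, policy_2, n_cues=3):
    """Return all (a, b) pairs on which the two policies predict different choices."""
    profiles = list(product((0, 1), repeat=n_cues))
    trials = []
    for a, b in product(profiles, repeat=2):
        p1, p2 = policy_1(a, b), policy_2(a, b)
        if p1 is not None and p2 is not None and p1 != p2:
            trials.append((a, b))
    return trials


trials = discriminating_trials(wadd, ttb)
for a, b in trials:
    print(f"A={a} vs B={b}: WADD -> {wadd(a, b)}, TTB -> {ttb(a, b)}")
```

The same pattern extends to AI agent policies by swapping in functions that return each agent's predicted action for a candidate task.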
Sector: Market research, insights agencies
- Use behavioral foundation models as a first-pass stand-in for humans to down-select candidate experiments or messaging; confirm finalists with human samples.
- Tools/workflows that might emerge:
- “Hybrid Panel” workflow (foundation model → human) with confidence thresholds and divergence diagnostics.
- Assumptions/dependencies:
- Match between foundation-model behavior and target population; explicit calibration and rejection criteria to avoid over-reliance on synthetic responses.
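
One way to picture the "divergence diagnostics" in a hybrid panel is a simple routing rule: compare the foundation model's choice distribution with a small human pilot and escalate to humans when they diverge. The sketch below uses Jensen-Shannon divergence and a threshold of 0.05; the metric, the threshold, and the names `js_divergence` and `route_experiment` are illustrative assumptions, not part of the paper.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, range 0..1) between two discrete distributions."""
    def kl(x, y):
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def route_experiment(model_probs, human_pilot_probs, threshold=0.05):
    """Return a routing decision and the divergence that produced it."""
    d = js_divergence(model_probs, human_pilot_probs)
    decision = "use_model_screening" if d <= threshold else "escalate_to_humans"
    return decision, d


# Hypothetical choice proportions over two options.
print(route_experiment([0.70, 0.30], [0.68, 0.32]))
print(route_experiment([0.90, 0.10], [0.40, 0.60]))
```

A fixed threshold is a simplification; in practice the rejection criterion would need calibration against the target population.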
Sector: EdTech, assessment
- Auto-design item sets that separate candidate cognitive strategies (e.g., rule-based vs. weighted-integration) and use generative comparison to choose instructional adaptations.
- Tools/workflows that might emerge:
- Assessment item generator that emphasizes strategy-identifiability; teacher-facing dashboards with model-based diagnostics.
- Assumptions/dependencies:
- Alignment with curricular standards; student data privacy; validation of theory generalization from lab tasks to classroom contexts.
Sector: Policy communication and public services
- Use AutoCog-like loops with representative panels to compare message frames where behavioral predictions diverge and document the evidence trail.
- Tools/workflows that might emerge:
- “Policy Message Lab” with preregistration-by-default, audit logs, and harm-minimization checks.
- Assumptions/dependencies:
- Ethical oversight; demographic representativeness; constraints on automated human-subjects experimentation in public agencies.

Long-Term Applications

Sector: Cross-industry R&D, academia
- Always-on closed-loop systems that continuously propose theories, run discriminating experiments across panels, revise models, and update organizational knowledge bases.
- “Theory Graph Repositories” that track lineage, arbitration decisions, and generalization across tasks/populations.
- Assumptions/dependencies:
- Scalable participant recruitment with diversity guarantees; robust governance (consent, privacy, bias mitigation); budgeted compute and human-in-the-loop oversight.
Sector: EdTech, digital therapeutics, personalization
- Auto-discovered subpopulation- and individual-level theories driving adaptive tutoring, habit formation, or adherence support; real-time experiment selection to learn “what works for whom.”
- Assumptions/dependencies:
- Longitudinal data, privacy-preserving analytics, and regulatory clearance (e.g., for medical claims); reliable detection of strategy shifts over time.
Sector: Robotics, HRI, autonomous systems
- Robots and agents that discover domain-specific human decision heuristics (e.g., how operators trade off speed vs. safety) and adjust interaction policies accordingly.
- Assumptions/dependencies:
- Safe, real-world experiment integration; sample-efficient loops (human time is costly); standards for testing in safety-critical contexts.
Sector: Public policy and governance
- Closed-loop platforms that iteratively refine regulations, benefits enrollment flows, or crisis communications using discriminating experiments; audit trails that meet legislative transparency requirements.
- Assumptions/dependencies:
- New regulatory frameworks for autonomous experimentation; independent oversight boards; equity and harm assessments as hard constraints.
Sector: Healthcare research and clinical trials
- Automated theory building for how patients respond to multi-attribute trade-offs (cost, side effects, efficacy), informing trial design or digital therapeutic content.
- Assumptions/dependencies:
- FDA/EMA and IRB approvals; integration with EHRs and secure consent flows; rigorous preregistration and monitoring.
Sector: Finance, retail investing, insurance
- Theory-guided disclosures that account for diminishing sensitivity in multi-factor products (fees, returns, risk), tested in closed loops with suitable panels.
- Assumptions/dependencies:
- Regulatory guardrails on experimentation; suitability and fairness reviews; explainability requirements.
Sector: Energy and sustainability
- Iterative discovery of how households weigh multi-attribute offers (rebate, comfort, peak timing), with diminishing-return-aware presentations to boost uptake.
- Assumptions/dependencies:
- Utility partnerships, seasonal dynamics, and privacy-preserving data collection; evaluation across diverse communities.
Sector: ML/AI tooling and safety
- Synthesis pipelines with static analysis and formal methods to guarantee properties (e.g., monotonicity, boundedness) of generated models; red-teaming of experiment proposals.
- Assumptions/dependencies:
- Advances in code synthesis verification; standards for explainability and interpretability; integration with MLOps governance.
Sector: Cognitive science and psychology
- Expanded model/experiment spaces enabling discovery beyond decision making (e.g., learning rules, generalization patterns), with preregistered, prospective validations.
- Assumptions/dependencies:
- Richer task batteries, more diverse participant panels, domain-specific metrics, and ethics frameworks adapted to sensitive topics.
Sector: Standards and policy
- Standards for auditability, consent, preregistration, data use, and LLM-agent involvement; certification programs for “responsible automated experimentation.”
- Assumptions/dependencies:
- Multistakeholder consensus (academia, industry, regulators, civil society); incident reporting and enforcement mechanisms.

Notes on cross-cutting assumptions and dependencies

Model and experiment design spaces must be thoughtfully bounded; garbage-in yields poor theories regardless of automation.
LLMs need guardrails, verification, and iterative prompting; program synthesis outputs must be executable and sandboxed.
Human-subjects protections (IRB, consent, privacy) and domain-specific regulations (e.g., healthcare, finance) are essential constraints.
Foundation-model stand-ins require validation of human–model fidelity; use hybrid pipelines to mitigate mismatch.
Generalization is not guaranteed; replicate across tasks, populations, and contexts; use preregistration and holdouts as in the paper.
Fairness and equity considerations are first-class requirements whenever models impact people’s opportunities or outcomes.

Glossary

Adversarial experimental design: A paradigm that selects experiments to maximally distinguish competing models by targeting where they disagree. Example: "adversarial experimental design"
Agentic-AI system: An AI setup composed of autonomous agents that plan and act to achieve goals across a pipeline without human intervention. Example: "agentic-AI system"
Alternating (strategy): A non-canonical decision policy that alternates choices across trials regardless of stimuli. Example: "alternating"
Anti-majority (strategy): A non-canonical decision policy that chooses against the option supported by the majority of cues. Example: "anti-majority"
Arbitration (LLM-agent-based): A stage where an LLM agent interprets evidence and recommends which theory to revise or retain. Example: "LLM-agent-based arbitration procedure"
Behavior adapter: A component that presents the same task uniformly to humans, cognitive models, and foundation models for comparable data. Example: "Behavior adapter"
Behavioral foundation model: A large model used as an in-silico proxy for human behavior in cognitive tasks. Example: "behavioral foundation model"
Cardinal ratings: Feature values expressed on a multi-level numeric scale (e.g., 0 to r_max), reflecting magnitude. Example: "cardinal ratings"
Closed-loop discovery: An end-to-end scientific cycle where hypotheses, experiments, data collection, and theory revision are automated and iterated. Example: "closed-loop discovery"
Compensatory strategy: A decision approach that integrates information across all cues, allowing strengths on some dimensions to offset weaknesses on others. Example: "a compensatory strategy (Weighted Additive)"
Concave utility function: A transformation of feature values that yields diminishing marginal increases in subjective value as objective values grow. Example: "concave utility function"
Cue validity: The reliability or predictive accuracy of a cue, used to weight that cue’s influence on decisions. Example: "validities"
Diminishing Returns WADD: A weighted-additive model that applies a concave utility transform to cue values before validity-weighted integration. Example: "Diminishing Returns WADD"
Diminishing sensitivity: The property that differences between large feature values matter less than equal-sized differences between small values. Example: "diminishing sensitivity to feature values"
Epsilon-greedy rule: A noise model where, with probability ε, choices are replaced by random selections irrespective of model predictions. Example: "an $\epsilon$-greedy rule"
Foundation models: Large pretrained models (e.g., LLMs) used as general-purpose components or simulators within scientific pipelines. Example: "foundation models"
Generative behavior: Model-based simulated responses used to compare theories by forecasting full distributions of behavior. Example: "generative behavior"
Generative performance: How well a theory’s generative model reproduces observed data, used for theory scoring. Example: "generative performance"
Holm procedure: A multiple-comparisons correction that controls the family-wise error rate by adjusting p-values in a step-down sequence, from smallest to largest. Example: "Holm procedure"
In-silico: Conducted via computer simulation rather than with human participants or physical experiments. Example: "in-silico stand-in"
Large-language-model agents: Autonomous LLM-driven agents assigned roles such as theorist, experiment designer, or arbiter. Example: "Large-language-model agents"
Lapse rate: The proportion of trials on which choices are random or inattentive, independent of stimuli. Example: "high lapse rate"
Linear mixed-effects model: A statistical model combining fixed effects and random effects (e.g., random intercepts per experiment) to analyze repeated or grouped data. Example: "linear mixed-effects model"
Multi-attribute decision-making: Choice scenarios where options are described by multiple features (cues) that must be integrated to decide. Example: "multi-attribute decision-making"
Multi-attribute utility theory: A framework modeling choices by computing and combining utilities across multiple attributes, often with subjective value functions. Example: "multi-attribute utility theory"
Non-compensatory heuristic: A rule-based strategy (e.g., TTB) that focuses on the most important cue(s) and does not trade off across attributes. Example: "a non-compensatory heuristic (Take-The-Best)"
Non-linear Subjective Weighting Model: A model where cue validities are transformed nonlinearly (e.g., via a power function) before weighted summation. Example: "Non-linear Subjective Weighting Model"
Optimal experimental design: Methods that choose experiments to maximize expected information gain or model discrimination. Example: "optimal experimental design"
Perseveration (strategy): A non-canonical policy that repeats the previous choice regardless of current evidence. Example: "perseveration"
Probabilistic Strategy Mixture Model: A model positing that people stochastically switch between strategies (e.g., TTB and WADD) on each trial. Example: "Probabilistic Strategy Mixture Model with Flexible Compensatory Component"
Program synthesis: Automatically generating executable code (e.g., cognitive models) from specifications or natural language descriptions. Example: "program synthesis"
Preregistered study: A study whose hypotheses and analysis plans are registered before data collection to curb researcher degrees of freedom. Example: "preregistered study"
Satisficing: Choosing an option that meets a threshold rather than optimizing; here implemented by binarizing cardinal cues. Example: "Satisficing WADD"
Self-driving laboratories: Automated lab systems that design and run experiments without human-in-the-loop execution. Example: "self-driving laboratories"
Self-verifying: A practice where proposed models or experiments are checked against their own predictions before observing data. Example: "self-verifying"
Softmax choice: A probabilistic choice rule mapping option values to choice probabilities via a temperature-controlled exponential (softmax) function. Example: "softmax choice with lapse noise"
Take-the-worst (strategy): A non-canonical heuristic that selects the option with the worst value on the most important differing cue. Example: "take-the-worst"
Take-the-Best (TTB): A non-compensatory heuristic that bases decisions on the most valid cue that discriminates between options. Example: "Take-the-Best (TTB)"
Tallying: A simple heuristic that counts the number of favorable cues per option without weighting by validity. Example: "Tallying"
Threshold-based binarization: Converting multi-level ratings into binary indicators based on a satisficing threshold before applying WADD. Example: "Threshold-based Binarization (Satisficing WADD)"
Validity-weighted sum: Aggregating cues by weighting each by its validity and summing to compute an option’s score. Example: "validity-weighted sum"
Weighted-Additive (WADD): A compensatory model that sums cue values weighted by their (subjective or provided) validities. Example: "Weighted-Additive (WADD)"
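
Several glossary entries describe small, well-defined computations. The sketch below gives one plausible reading of three of them: softmax choice with lapse noise, the epsilon-greedy rule, and Holm-adjusted p-values. The exact parameterizations (for example, mixing the softmax with a uniform distribution at rate `lapse`) and the function names are assumptions for illustration; the paper's own implementations may differ.

```python
import math


def softmax_with_lapse(values, temperature=1.0, lapse=0.0):
    """Choice probabilities: (1 - lapse) * softmax(values / temperature) + lapse / n."""
    scaled = [v / temperature for v in values]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    n = len(values)
    return [(1.0 - lapse) * e / total + lapse / n for e in exps]


def epsilon_greedy_probs(predicted, n_options, epsilon):
    """With probability epsilon choose at random; otherwise take the predicted option."""
    probs = [epsilon / n_options] * n_options
    probs[predicted] += 1.0 - epsilon
    return probs


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


print(softmax_with_lapse([1.0, 0.0], temperature=0.5, lapse=0.1))
print(epsilon_greedy_probs(predicted=0, n_options=2, epsilon=0.2))
print(holm_adjust([0.01, 0.04, 0.03]))
```

Both noise models keep every option's probability above zero, which is what lets a theory be scored on choices it did not predict.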
