Emergent Mind
Open Problems (106)
[HC] Develop rigorous methods to ascertain and validate the internal input–output mappings and value functions learned by large language model chatbots despite their opacity.

[AI] Evaluate, in field settings, the effect of higher test-time compute and alternative elicitation strategies (e.g., multi-trajectory sampling, LLM judges, self-consistency, or agent scaffolding) on developer productivity with AI tools.

[LG] Determine whether increasing recursion depth (via larger T or n) improves performance on harder problems, and characterize the associated computational trade-offs in speed and memory.

[AI] Determine how cumulative experience with AI tools (e.g., Cursor usage hours) affects developer productivity on real tasks, and disentangle improvements in AI-assisted performance from possible atrophy in non-AI performance.

[HC] Prove real-world behavioural guarantees for large language model–based chatbots under deployment conditions, overcoming current limitations in mechanistic interpretability and reasoning-model analyses.

[HC] Determine whether chatbot-induced belief amplification dynamics predominantly affect individuals at elevated risk of mental health problems or also produce subtle belief drifts in the general population, and quantify their impacts on mental health and societal cohesion.

[HC] Establish the current prevalence of bidirectional belief amplification in human–chatbot interactions, and characterize how this prevalence changes with the emergence and deployment of more sophisticated, personalised chatbots.

[LG] Explain why deep recursion with supervision yields superior generalization compared to larger and deeper non-recursive networks, and develop a theoretical account beyond overfitting speculation.

[AI] Assess whether developers’ post-randomization choices of issue completion order induce systematic biases in measured implementation times across AI-allowed and AI-disallowed conditions, and quantify the magnitude of any such bias.

[LG] Derive scaling laws that specify how to parameterize recursive reasoning models (e.g., TRM/HRM), including model size, layer count, recursion depth, and compute, to achieve optimal generalization across datasets and data regimes.

[HC] Determine whether the theoretical mental health risks of prolonged human–chatbot interactions can be predicted before widespread general-population adoption, and, if so, construct methodologies to estimate their manifestation in advance.

[AI] Measure and characterize the variance of human performance on Vending-Bench through multiple human runs to enable rigorous comparison with model variability.

[LG] Determine the causal contributions of HRM components (hierarchical recursion, deep supervision, ACT, and the choice of latent features) to performance, and justify the use of exactly two latent features versus alternative configurations.

[AI] Determine the net effect of selecting shorter, better-scoped issues on measured AI speedup or slowdown, accounting both for potentially improved AI performance on clearer tasks and for potentially improved human performance on the same tasks.

[AI] Establish the causal effect of increasing the agent’s initial money balance (e.g., to $2,500) on units sold and overall performance in Vending-Bench, accounting for high variance across runs.

[LG] Determine conditions under which HRM’s recursive updates converge to fixed points that justify the use of the Implicit Function Theorem and the one-step gradient approximation, or else rigorously characterize regimes where such fixed points are not attained.

[AI] Establish whether, and to what extent, experimental artifacts or biases in the study design account for the measured slowdown when AI is allowed, and quantify their contribution relative to genuine effects of AI assistance on developer productivity.

[AI] Determine whether the reduced performance gains from increased time budgets observed for complex AI R&D tasks also hold for simpler, long-running tasks, in order to isolate and assess long-term coherence capabilities independently of task complexity.

[LG] Establish a formal theoretical understanding of Hierarchical Reasoning Models (HRM), and determine whether their recursive design and training scheme are optimal for supervised reasoning tasks, clarifying the justification for their architecture and potential suboptimality.

[AI] Develop and evaluate high-precision methods for detecting LLM-generated essays and attributing authorship, potentially via participant-specific writing fingerprints, using larger cohorts to overcome current sample-size limitations.