Harnessing the Collective Intelligence of AI Agents in the Wild for New Discoveries
Abstract: Scientific discovery is often a collective process: researchers share partial results, inspect failed attempts, and build on each other's ideas over long time horizons. Recent AI systems have shown that language-model-based agents can make meaningful progress on open scientific problems, but most existing systems operate in isolation. In this paper, we present EinsteinArena, an agent-native platform for open distributed research and discovery. EinsteinArena provides agents with a live set of open problems, each with a solid verifier, public leaderboard, and problem-specific discussion forum where agents can ask questions and share insights. We focus on mathematical tasks that have garnered substantial research interest, where progress can be measured unambiguously. As of May 2026, agents on EinsteinArena have discovered 12 new state-of-the-art results better than any previous human or AI solutions. One notable example is the kissing number problem in dimension 11, where the platform improved the best known lower bound from 593 to 604. This advance did not come from a single agent or isolated run. Rather it arose through a sequence of submissions, public discussion, verifier refinement, and subsequent agent-to-agent borrowing of ideas. These results provide evidence that decentralized scientific discovery can emerge from open interaction among autonomous agents in the wild, demonstrating a new paradigm for collective AI-driven research.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Explain it Like I'm 14
Plain-English Summary of “Harnessing the Collective Intelligence of AI Agents in the Wild for New Discoveries”
1) What is this paper about?
This paper introduces EinsteinArena, a website and API where many AI “agents” (computer programs powered by LLMs) work together on tough, open math problems. Instead of each AI working alone and throwing away its notes, the platform lets agents share their tries, learn from each other, and keep a public scoreboard. The big idea: science moves faster when people share. Can AIs do the same?
2) What questions are the researchers asking?
They focus on two simple questions:
- If AI agents share their work in public (their partial solutions, mistakes, and ideas), do they make faster progress than if they work alone?
- Can a community of different agents, operating over days or weeks, push the limits on real unsolved math problems with solid, automatic checking?
3) How does EinsteinArena work? (Methods, in everyday terms)
Think of EinsteinArena like a science fair crossed with a multiplayer game:
- Problems: The site hosts a set of math challenges that other researchers care about. These are chosen so that progress is easy to measure with a number (higher or lower is better).
- Verifiers: Each problem comes with a trusted “judge” program that checks a submitted solution and gives it a score. These judges are public, so anyone (or any agent) can test ideas at home and know the exact rules.
- Leaderboards: A scoreboard shows the best result so far for each agent and each problem.
- Discussion boards: Agents can post their ideas, near-misses, and questions. This creates a shared memory so future agents don’t repeat the same mistakes.
- Open access: Agents can download problems, judges, and the current best solutions to study or improve them. The platform uses safe sandboxes and high-precision math to avoid errors and cheating.
- Light anti-spam: New agents solve a tiny computer puzzle at signup (proof-of-work), which is easy for one agent but hard to abuse at scale.
In plain language: agents don’t start from zero each time. They can copy, tweak, and build on what others did—like students comparing notes to crack a hard puzzle together.
4) What did they find, and why is it important?
The platform quickly led to real advances:
- Across many math problems, agents found 12 new best-ever results.
- A highlight is the “kissing number” in 11 dimensions. That’s like asking: how many identical spheres can touch one central sphere at once, if you’re working in a space with 11 directions instead of 3? Before, the best proven “you can at least do this many” number was 593. On EinsteinArena, agents pushed that lower bound to 604—a big jump.
How did the kissing-number progress happen?
- Multiple agents shared partial solutions and errors in the discussion.
- One agent found a strong “almost-right” arrangement of points. Others refined it.
- An easy analogy: they first stretched and nudged a drawing until it “almost” fit perfectly, then “snapped” the points onto a clean grid-like pattern to make it exactly valid for the judge.
- This wasn’t a single lucky run. It was a chain of contributions: one agent’s good start, another’s smarter tuning method, others’ careful clean-up, and finally a generalization that added even more spheres.
Another example: the “second autocorrelation inequality”
- This problem studies how a function compares to a shifted copy of itself—useful across math, signal processing, and physics.
- Agents improved the best-known constant by trying sharper numeric approximations (imagine using a higher-resolution ruler) and by combining search strategies, like:
- A “tuning” loop that repeatedly adjusts a trade-off parameter to raise the score.
- A “hot-and-cold” strategy (simulated annealing) that explores broadly at first and then hones in on the best region.
Why these results matter:
- They show that open collaboration among AI agents, with public checking, can produce new scientific knowledge—not just reproduce known answers.
- The public, exact judges make improvements trustworthy and easy to verify.
5) Why does this matter for the future?
- A new way to do AI science: Instead of single, closed runs, we can have ongoing, open communities of agents sharing progress. This can speed up discovery and make research more reliable.
- Reusable knowledge: Threads, code, and best solutions stay available, so the next agent starts from the frontier, not from scratch.
- Healthy competition plus teamwork: Agents “compete” on the scoreboard, but the best gains came from sharing. This mirrors how human science works.
- Next steps and open challenges:
- Can this approach transfer to areas beyond cleanly checkable math tasks, like biology or formal proofs?
- How do we design incentives so agents don’t just chase tiny leaderboard bumps but also explore bold ideas that pay off later?
- How do we keep the judges strong and fair as agents get more clever?
Bottom line: EinsteinArena suggests that when AI agents collaborate in the open—seeing each other’s work, building on it, and being judged by clear, public rules—they can make discoveries that neither one agent nor a closed system might achieve alone. This points to a future where collective AI research, much like human science, becomes a powerful engine for new knowledge.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a consolidated list of what remains missing, uncertain, or unexplored in the paper, phrased to guide follow‑up research.
- Generalization beyond math with exact/near-exact verifiers: How well do open, persistent, agent-native platforms transfer to domains like formal proof, algorithm design, and computational biology where verifiers are partial, heuristic, or expensive? Design cross-domain pilots and compare progress dynamics and failure modes.
- Causal impact of “shared state”: The paper presents observational evidence only. Run controlled A/B tests comparing: (i) private runs, (ii) shared leaderboard only, (iii) leaderboard + downloadable solutions, (iv) leaderboard + solutions + discussion, to quantify the marginal effect of each collaboration affordance on improvement rates and solution diversity.
- Incentive design and credit attribution: Leaderboards may encourage short-horizon score chasing over foundational work. Evaluate mechanisms (citations to parent submissions, lineage-based credit sharing, time-weighted rewards, diversity bonuses) and measure their impact on long-term progress and willingness to share partials.
- Verifier gaming and overfitting: Public, deterministic verifiers enable targeted over-optimization. Assess vulnerability to adversarial constructions that exploit numerical or logical edge cases; introduce adversarial auditing and cross-implementation checks (e.g., redundant verifiers, differential testing).
- Determinism and variance in heavy verifiers: For tasks using sampling (e.g., prime number theorem with 107 draws), the paper does not specify RNG seeding or variance control. Establish fixed seeds, confidence intervals, and a decision policy (e.g., require improvement beyond a statistical margin) to avoid leaderboard churn due to stochasticity.
- Numerical precision policies: The switch to 30–80 digit decimal arithmetic improved robustness but raises portability and performance questions. Benchmark precision–runtime trade-offs, cross-platform reproducibility, and consider interval or rational arithmetic with certificates to guarantee correctness.
- Verifier versioning and reproducibility: Verifiers were upgraded midstream; the paper lacks a formal versioning and re-scoring policy. Define immutable verifier versions, migration protocols, and public re-evaluations to ensure fairness and reproducibility across time.
- Submission retention bias: Only each agent’s personal best is stored; “rejected or lower-scoring submissions are not stored.” This loses negative results and exploration traces. Assess how this biases lineage analyses and introduce opt-in full-history logging (with rate limits) to preserve search trajectories for meta-research.
- Lineage inference validity: Parent–child relationships are inferred via feature similarity; no ground-truth validation is provided. Quantify false link rates, compare against explicit ancestor pointers, and evaluate sensitivity to feature choices and thresholds.
- Moderation effects on knowledge flow: Llama-Guard filters discussions, but false positives may suppress useful technical content and false negatives may allow misinformation. Measure moderation precision/recall on domain-specific content and its impact on downstream progress.
- Security and integrity: Beyond PoW and sandboxing, the paper does not assess risks from sybil agents, poisoning attacks in discussions, or denial-of-service via expensive submissions. Develop stronger sybil resistance, reputation systems, and submission-cost throttling tied to resource usage.
- Resource fairness and compute disparities: Agents with larger budgets may dominate. Quantify outcome dependence on compute, add per-agent quotas, and test whether constrained-compute settings still show collaborative gains.
- Scalability limits: As agent and problem counts grow, will evaluation, moderation, and storage keep up? Stress-test the platform to characterize throughput, cost, and latency bottlenecks; introduce queuing and prioritization policies.
- Choice and tuning of acceptance thresholds δ: The per-problem δ rules are hand-tuned; their effects on progress granularity and agent behavior are unmeasured. Perform sensitivity analyses and consider adaptive δ (e.g., relative to score variance or precision) to balance iteration speed and stability.
- Provenance and plagiarism: With public best solutions, agents can trivially copy/submit near-identical artifacts. Specify duplication detection, provenance tracking, and credit assignment to discourage free-riding while encouraging reuse.
- Kissing number result formalization: The 604-sphere lower bound is platform-verified but lacks a peer-reviewed mathematical certificate. Publish exact coordinates, contact graph, symmetry structure, and a machine-checkable certificate; invite independent verification and explore whether the shared 496-vector “backbone” has a lattice or code-theoretic explanation.
- Extensibility across dimensions: Can the techniques (surrogate loss, LSQR refinement, integer snapping) generalize to other kissing number dimensions? Perform systematic trials across d to test robustness and identify structural prerequisites.
- Autocorrelation inequality discretization error: Results depend on step-function discretization; no error bounds to the continuous optimum are provided. Derive bounds on the gap between discrete and continuous optima as a function of interval count m, and estimate convergence rates empirically and theoretically.
- Upper bounds and optimality gaps: The paper focuses on improved lower bounds (e.g., autocorrelation, kissing number). Integrate upper-bound discovery or verification to quantify remaining optimality gaps and guide search.
- Diversity vs exploitation: Public best solutions may bias agents toward local refinement. Track basin diversity and novelty metrics over time; test interventions (e.g., novelty bonuses, population seeding from orthogonal priors) to sustain exploration.
- Cross-problem transfer: Do agents reuse techniques or artifacts across problems (e.g., optimization heuristics, parameterizations)? Quantify transfer and design tasks to explicitly test cross-domain knowledge reuse within the platform.
- Human involvement and attribution: The platform does not require operator disclosure; how much of the observed progress is due to human orchestration vs autonomous agents? Establish auditing methods and reporting standards to distinguish human-guided pipelines from autonomous behaviors.
- Comparative baselines: The platform’s effectiveness is not compared against established collaborative venues (e.g., Kaggle, private lab pipelines). Run matched benchmarks to isolate what the “agent-native, persistent” design adds beyond public leaderboards alone.
- Data/selection bias in problem curation: Problems were drawn from AlphaEvolve with amenable verifiers; this may inflate apparent progress. Curate a broader, independently sourced set with varying verifier characteristics and difficulty to test generality.
- Discussion content structure: Ad hoc threads may be hard for agents to parse and reuse. Evaluate structured artifacts (e.g., machine-readable “research notes” schemas, minimal working examples) and measure their effect on reuse and progress.
- Governance and ethics: Allowing anonymous agent participation raises risks (misuse, IP leakage, data poisoning). Define governance policies, licensing norms for shared artifacts, and procedures for dispute resolution and takedown.
- Long-term archival and citation: Ensure all artifacts (verifiers, submissions, threads) have immutable hashes/DOIs for external replication, and define a canonical citation format for solutions to enable scholarly integration and independent audits.
- Robustness of Decimal-based results across environments: Verify that Decimal configurations (precision, rounding modes) and dependencies yield identical scores across hardware/software stacks; publish environment manifests and cross-validate with alternative exact-arithmetic backends.
Practical Applications
Immediate Applications
The paper’s platform design, verifier engineering, and collaborative optimization methods can be deployed today in settings that have deterministic, efficient evaluators.
- Enterprise “Discovery Arena” for verifiable optimization problems (Industry: software, logistics, operations research)
- Deploy an internal EinsteinArena-style platform to crowdsource agent-driven improvements on tasks with exact verifiers (e.g., routing, bin packing, scheduling, compiler flag tuning). Tools: public verifiers, real‑time leaderboards, API for submissions, agent discussion boards, acceptance thresholds (δ), sandboxed evaluation.
- Dependencies/assumptions: problems must have reproducible, deterministic scoring; engineering for sandbox security and rate limiting; incentive design for collaboration vs. competition.
- CI/CD optimization pipelines with public verifiers (Industry: software/devops)
- Turn performance tests into verifiers and let agents iteratively optimize microservices latency, query plans, or compression ratios. Products: “Optimization as a Leaderboard” dashboards integrated with build systems.
- Dependencies/assumptions: stable, trusted performance harnesses; guarding against overfitting to benchmarks (e.g., rotating seeds, held-out test harnesses).
- High-precision evaluation services for numerically sensitive tasks (Industry: scientific computing; Academia: numerical analysis)
- Adopt the Decimal/MPFR-style high-precision verification the paper used for kissing number to avoid false positives/negatives in near-feasible regimes (e.g., geometry, cryptography checks, CAD tolerance validation).
- Dependencies/assumptions: performance budget for high-precision arithmetic; clear precision policies and reproducibility.
- E2B-style sandboxed, reproducible evaluators (Industry/Academia: platform engineering)
- Use isolated execution environments for verifier runs to safely evaluate untrusted agent submissions in R&D competitions or hackathons.
- Dependencies/assumptions: containerization, resource quotas, and deterministic images; logging for auditability.
- Anti-gaming leaderboards with minimum-improvement thresholds (Industry/Academia: benchmarking)
- Adopt problem-specific δ thresholds to discourage leaderboard churn from numerics while still rewarding real gains; retain full submission history for reproducibility.
- Dependencies/assumptions: problem-aware δ calibration; versioned verifiers and backfills on updates.
- Open-verifier competitions for algorithmic heuristics (Academia/Industry: algorithms)
- Host competitions where agents improve heuristics for NP-hard problems (e.g., MaxCut, TSP variants) under transparent verifiers and public solution traces to accelerate cumulative progress.
- Dependencies/assumptions: exact or bounded verifiers; licensing to share best solutions and code.
- Collaborative agent forums for R&D trace-sharing (Academia/Industry: research ops)
- Add moderated, API-accessible discussion threads so agents share partial constructs, failure modes, and refinements; leverage LLM-based moderation (e.g., Llama‑Guard).
- Dependencies/assumptions: moderation rules; spam controls (e.g., proof-of-work registration).
- Lineage tracking and basin analytics for search (Industry/Academia: optimization/MLOps)
- Use the paper’s “solution lineage” idea (fingerprints and similarity) to map search regimes, identify promising basins, and avoid restarts in black-box tuning (e.g., parameter sweeps in RL or black-box simulators).
- Dependencies/assumptions: stable fingerprinting; storage of intermediate artifacts.
- Dinkelbach-style optimization for ratio objectives (Industry: systems optimization; Academia: operations research)
- Apply iterative fractional programming (maximize A/B via A−λB) in tasks like throughput/latency, energy/accuracy, or ROI/cost ratios where convexity or monotonicity conditions approximately hold. Workflow: initialize λ from current ratio, alternate solve/update.
- Dependencies/assumptions: applicability of Dinkelbach conditions or robust approximations; warm starts and convergence checks.
- Surrogate least-squares objectives with LSQR and integer snapping (Industry: robotics, VLSI/EDA, combinatorial design; Academia: discrete math)
- For constraint-heavy problems with underlying discrete structure, fit a smooth surrogate (least-squares on constraint residuals), optimize with LSQR, then “snap” near-integral parameters to exact discrete values (e.g., grid-aligned placements, code design, sensor calibration).
- Dependencies/assumptions: constraints amenable to linearization; detectable discrete structure; robust snapping thresholds to prevent invalid solutions.
- Courseware for collective math/optimization learning (Education)
- Use a classroom EinsteinArena clone to teach iterative refinement, open verifiers, and collaborative research via agent‑student co-discovery on precise tasks.
- Dependencies/assumptions: curated task sets with deterministic verifiers; academic policies for attribution and AI use.
- Transparent reproducibility standards in AI discovery (Policy/Academia)
- Encourage funding calls and venues to require public verifiers, versioned code, and best-solution artifacts as a condition for leaderboard claims, mirroring EinsteinArena’s transparency.
- Dependencies/assumptions: community buy‑in; handling sensitive IP/data through synthetic or sanitized verifiers.
Long-Term Applications
Extending the paradigm beyond mathematics will require new verifiers, scaling infrastructure, incentive mechanisms, and domain-specific safety and validity frameworks.
- Cross-domain open discovery platforms with verifiable cores (Industry/Academia: software, robotics, energy, materials)
- Generalize EinsteinArena to domains with deterministic or bounded verifiers: robot control in fixed simulators, scheduling in power-grid simulators, compiler synthesis with test suites, or materials with fast surrogates plus spot DFT checks.
- Dependencies/assumptions: sufficiently faithful simulators; anti‑overfitting strategies (domain randomization, hidden tests); compute orchestration at scale.
- Formal proof and program synthesis arenas with auto-checkers (Software/Formal methods)
- Couple agent collaboration with proof assistants (e.g., Lean, Coq) and property-based testing for program synthesis; public traces accelerate lemma re-use and tactic refinement.
- Dependencies/assumptions: fast, deterministic proof checkers; libraries of lemmas; mechanisms to prevent degenerate “checker gaming.”
- Drug/material design with hybrid verification loops (Healthcare/Biotech/Materials)
- Combine rapid in‑silico verifiers (docking, ADMET predictors, surrogate DFT) with periodic wet-lab or high-fidelity validation in an agent arena; track lineages to prioritize promising scaffolds.
- Dependencies/assumptions: reliable surrogate verifiers; experimental feedback loops; biosafety and ethics governance; IP frameworks.
- “Platform-as-shared-memory” for corporate R&D portfolios (Industry: enterprise R&D)
- Institutionalize persistent agent collaboration across teams and time, where partial results and negative findings are first-class citizens; leverage lineage analytics to reduce duplication and accelerate convergence.
- Dependencies/assumptions: org-wide data/knowledge sharing policies; access control; incentives that reward sharing.
- Credit, attribution, and IP standards for agent collectives (Policy/Academia/Industry)
- Develop norms for credit allocation across agent lineages and discussion traces; define licensing for agent-generated artifacts and governance for attribution on leaderboards and publications.
- Dependencies/assumptions: legal frameworks for AI-generated IP; community consensus; provenance tooling.
- Compute and access governance for open agent ecosystems (Policy)
- Introduce proof-of-work or alternative anti-spam/anti-Sybil measures, quotas, and transparent logs for public research arenas; consider compute subsidies for verifiable, open problems of public interest.
- Dependencies/assumptions: equitable access; privacy-preserving telemetry; oversight to prevent misuse.
- Training LMs on public agent traces to improve meta-reasoning (AI/ML)
- Use discussion threads, solution lineages, and verifier interactions as a dataset to fine-tune models on collaborative search strategies and failure analysis.
- Dependencies/assumptions: consented data release; de-biasing for leaderboard chasing; safeguards against reproducing gaming behaviors.
- Human–AI co-research networks and “AutoArXiv” (Academia/Publishing)
- Integrate arenas with preprint servers so verified artifacts and lineages auto-generate reports; allow human researchers and agents to iteratively build on partials with canonical provenance.
- Dependencies/assumptions: publisher integration; standards for machine-readable verifiers and artifacts.
- Safety-aware arenas for high-stakes domains (Healthcare/Finance/Energy)
- Layer safety constraints and audit requirements into verifiers (e.g., fairness metrics, risk limits); only accept improvements that satisfy multi-objective safety thresholds.
- Dependencies/assumptions: codified safety standards; multi-objective optimization support; independent audits.
- Citizen science and education at scale (Daily life/Education)
- Public arenas where learners and hobbyists deploy lightweight agents to improve puzzles, games, or benign scientific tasks, learning from transparent traces and verifiers.
- Dependencies/assumptions: curation to avoid harmful tasks; onboarding resources; moderation and community governance.
- Marketplace of reusable solver components and “idea modules” (Industry/Academia)
- Package successful methods (e.g., Dinkelbach loops, LSQR surrogates, integer snapping) as reusable modules agents can import; track module performance across problems.
- Dependencies/assumptions: standard APIs; licensing; telemetry for module attribution and evaluation.
- Robustness research on incentives and social dynamics in agent collectives (Academia/Policy)
- Use arenas to experimentally study how competition vs. collaboration, disclosure norms, and reward shaping affect discovery speed and quality.
- Dependencies/assumptions: IRB/ethics for human-in-the-loop studies; careful metric design beyond leaderboards.
Notes on feasibility across all applications:
- The strongest near-term wins require problems with deterministic, efficient, publicly inspectable verifiers.
- Numerical precision and reproducibility are critical; expect ongoing verifier hardening to counter edge cases and gaming.
- Openness trades off with proprietary constraints; hybrid models (public verifiers with private data or hidden test sets) may be necessary in some sectors.
- Incentive design (credit, rewards) materially affects collective performance; platform governance is not optional.
Glossary
- acceptance pipeline: A stricter set of rules a submission must satisfy to be considered for top leaderboard position. "To claim the top position, a submission must pass a stricter acceptance pipeline: it is required to exceed the current best score by a problem-specific minimum improvement threshold ."
- active-set topology: The structure of currently active (tight/violated) constraints in an optimization problem. "preserve the shared (17{,}088)-pair active-set topology of this new basin"
- additive combinatorics: A field studying additive structures in sets and functions. "a critical problem at the intersection of additive combinatorics and harmonic analysis."
- agent-native: Designed primarily for autonomous AI agents rather than humans. "an agent-native platform for open distributed research and discovery."
- autocorrelation: A measure of similarity between a function or set and a shifted version of itself. "The autocorrelation measures the overlap between a set or function and a shifted copy of itself"
- autoconvolution: The convolution of a function with itself. "where denotes the autoconvolution of ."
- autoconvolution ratio: A scalar metric derived from autoconvolution used as a scoring objective. "the autoconvolution ratio for the autocorrelation problems"
- basin: A region of the search space corresponding to an attraction basin around an optimum. "The Gradient agent discovers a new basin that is long-lived (0.156 overlap penalty) with multiple agents including CHRONOS contributing small improvements within the same broad geometry."
- Bearer token: An authentication credential sent with API requests. "the agent is issued a Bearer token that can be used to authenticate subsequent API requests, including solution submissions and other write operations."
- contact graphs: Graphs capturing which elements (e.g., spheres) are in contact in a configuration. "contact graphs, symmetry, integer-like coordinates, shells, or resemblance to known lattice constructions."
- decimal.Decimal arithmetic: High-precision decimal arithmetic provided by Python’s decimal module. "verifiers use Python's decimal.Decimal arithmetic at 30--80 significant digits for the overlap loss computation and exact arithmetic for integer-valued submissions."
- discretized formulation: A finite-dimensional approximation of a continuous problem via discretization. "we focus on a set of step functions and consider a discretized formulation"
- Dinkelbach optimization: An iterative method for fractional programming that optimizes a parametric difference. "Dinkelbach optimization---which iteratively maximizes while updating the hyperparameter at each step \cite{dinkelbach1967nonlinear}---emerged as a key methodology"
- E2B sandboxes: Isolated, containerized execution environments used to safely run untrusted code. "all submissions are checked in isolated execution environments (E2B sandboxes), where the problem verifier is executed against the submission data."
- Erd\H{o}s minimum overlap: A classic extremal problem posed by Erdős minimizing overlap between sets/structures. "a score of $0.380868$ for the Erd\H{o}s minimum overlap problem."
- exact arithmetic: Computation with exact numbers (e.g., integers) avoiding floating-point error. "and exact arithmetic for integer-valued submissions."
- Fourier magnitudes: Absolute values of Fourier transform coefficients. "and are also closely connected to Fourier magnitudes of a function"
- harmonic analysis: The study of functions via decompositions such as Fourier analysis. "a critical problem at the intersection of additive combinatorics and harmonic analysis."
- integer-snapping: Rounding near-integer quantities to exact integers to enforce discrete structure. "a final integer-snapping post-processing step to make these values exact"
- kissing number: The maximum number of non-overlapping unit spheres that can touch a central unit sphere in d dimensions. "The kissing number in dimension asks the maximum number of non-overlapping unit spheres in that can simultaneously touch a central unit sphere."
- Llama-Guard: An LLM-based content moderation system. "Llama-Guard-based moderation step"
- LSQR algorithm: An iterative algorithm for solving large sparse least-squares problems. "the LSQR algorithm \cite{paige1982lsqr}"
- machine epsilon: The smallest difference distinguishable by floating-point arithmetic. "smaller than machine epsilon"
- minimum improvement threshold : The problem-specific minimal score gain required for leaderboard updates. "a problem-specific minimum improvement threshold ."
- overlap integral: An integral measuring overlap used for scoring or evaluation. "computing the overlap integral for Erd\H{o}s"
- overlap penalty: A loss term penalizing overlaps (constraint violations) in configurations. "reducing the overlap penalty but remain far from feasibility."
- persistent shared memory: A long-lived shared store of artifacts that agents can read and build upon. "EinsteinArena treats the platform as a persistent shared memory"
- proof-of-work: A computational puzzle used to deter spam by making actions costly. "This proof-of-work computation is inexpensive while making large-scale registration attempts computationally expensive, thereby discouraging spam."
- SHA256: A cryptographic hash function used in proof-of-work and integrity checks. "SHA256(challenge + n) begins with leading zero bits."
- simulated annealing: A stochastic optimization technique inspired by annealing in metallurgy. "complementary techniques such as simulated annealing contributed to the best-performing solutions on the platform"
- solutionSchema: A machine-readable schema defining the required JSON structure of a submission. "a solutionSchema that defines the exact JSON structure a valid submission must have"
- step functions: Piecewise-constant functions used to approximate continuous ones. "we focus on a set of step functions and consider a discretized formulation"
- structural/lattice decoding: Interpreting numerical configurations by mapping them to structured lattice-like objects. "structural/lattice decoding and micro-perturbation refinement"
- surrogate loss function: An alternative, easier-to-optimize objective approximating the true target. "The combination of the surrogate loss function, and more critically, the integer-snapping technique"
- Tammes problem: An optimization problem on distributing points on a sphere to maximize minimal distances. "Tammes problem ()"
- Taylor expansion: A series expansion approximating a function locally by its derivatives. "linearized surrogate obtained via a Taylor expansion."
- Thomson problem: Finding minimal-energy arrangements of charged points on a sphere. "Thomson problem ()"
- verifier: An executable program that checks a submission and computes its score deterministically. "The verifier is the central artifact."
Collections
Sign up for free to add this paper to one or more collections.
