Bolzano: Case Studies in LLM-Assisted Mathematical Research

Published 18 Apr 2026 in cs.CL, cs.AI, cs.LG, and cs.LO | (2604.16989v1)

Abstract: We report new results on six problems in mathematics and theoretical computer science, produced with the assistance of Bolzano, an open-source multi-agent LLM system. Bolzano orchestrates rounds of interaction between parallel prover agents and a verifier agent while maintaining a persistent knowledge base that is carried across rounds. Classified using the significance-autonomy taxonomy of Feng et al., four of the six results reach the level of publishable research, and three of the six were produced essentially autonomously by Bolzano. Our results provide evidence that LLMs can contribute meaningfully to mathematical research, complementing recent reports by Bubeck et al., Woodruff et al., and others.


Authors (7)

Summary

The paper presents Bolzano, a multi-agent LLM pipeline, and reports publishable mathematical results across several domains, some of them produced essentially autonomously.
It employs a cyclic interaction of prover, verifier, and summarizer agents to solve open problems in complexity, combinatorics, cryptography, and data structures.
The case studies suggest that combining agent diversity with human oversight can effectively augment traditional mathematical research methods.

Overview

This work presents six case studies of Bolzano, an open-source, multi-agent LLM-based system designed to support mathematical research. Bolzano runs rounds of interaction between multiple parallel "prover" agents and a distinct "verifier" agent, maintaining a persistent knowledge base throughout the research process. The problems span complexity theory, combinatorics, cryptography, and data structures, and the results were obtained with varying degrees of human intervention. Under the significance-autonomy taxonomy of Feng et al., four of the six results reach the level of publishable research and three were produced essentially autonomously, supporting the view that LLMs can meaningfully augment, and in some cases autonomously generate, new mathematical advances (2604.16989).

System Architecture

Bolzano's architecture is characterized by a cyclic research pipeline comprising prover, verifier, and summarizer agents, each instantiated as LLMs (including GPT, Gemini, Claude, etc.) guided by custom prompts. Each round involves provers independently proposing proof candidates, special cases, counterexamples, or constructions; the verifier critically evaluates these outputs, consolidates viable solutions, and updates the knowledge base. The process is further supported by a summarizer, which distills each round for human review and subsequent agent consumption. Model diversity across provers is supported, mitigating self-bias and expanding the solution space. Only the verifier can modify the persistent files (notes, proofs, summary), giving the system a controlled, rigorous workflow reminiscent of collaborative multi-agent frameworks. Human guidance is integrated optionally between rounds, often leading to stronger results, particularly in problem selection and strategic shifts.
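The round structure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Bolzano's actual code: the names `KnowledgeBase`, `run_round`, `research_loop`, and the toy agents are hypothetical, the agents are plain functions standing in for LLM calls, and the way human hints are passed (appended to the problem statement) is an assumption, since the text does not specify the mechanism.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class KnowledgeBase:
    """Persistent files named in the text: notes, proofs, summary."""
    notes: List[str] = field(default_factory=list)
    proofs: List[str] = field(default_factory=list)
    summary: str = ""


# Hypothetical agent signatures; a real agent would wrap an LLM call.
Prover = Callable[[str, KnowledgeBase], str]
Verifier = Callable[[List[str], KnowledgeBase], None]
Summarizer = Callable[[List[str], KnowledgeBase], str]
HumanHint = Callable[[int, str], Optional[str]]


def run_round(problem: str, provers: List[Prover], verifier: Verifier,
              summarizer: Summarizer, kb: KnowledgeBase) -> str:
    # Provers receive a copy, so only the verifier can change the real files.
    snapshot = copy.deepcopy(kb)
    candidates = [prover(problem, snapshot) for prover in provers]
    verifier(candidates, kb)  # the single writer of the knowledge base
    # The summarizer also reads a copy and returns a digest of the round.
    return summarizer(candidates, copy.deepcopy(kb))


def research_loop(problem: str, provers: List[Prover], verifier: Verifier,
                  summarizer: Summarizer, rounds: int,
                  human_hint: Optional[HumanHint] = None
                  ) -> Tuple[KnowledgeBase, List[str]]:
    kb = KnowledgeBase()
    digests: List[str] = []
    hints: List[str] = []
    for i in range(rounds):
        # Assumption: human guidance is appended to the problem statement.
        prompt = problem + "".join(f"\n[human] {h}" for h in hints)
        digests.append(run_round(prompt, provers, verifier, summarizer, kb))
        if human_hint is not None:
            hint = human_hint(i, digests[-1])
            if hint:
                hints.append(hint)
    return kb, digests


# Toy stand-ins for LLM agents, for illustration only.
def toy_prover_a(problem: str, kb: KnowledgeBase) -> str:
    return f"claim A (notes seen: {len(kb.notes)})"


def toy_prover_b(problem: str, kb: KnowledgeBase) -> str:
    return "flawed claim B"


def toy_verifier(candidates: List[str], kb: KnowledgeBase) -> None:
    for c in candidates:
        if "flawed" in c:
            kb.notes.append(f"rejected: {c}")
        else:
            kb.proofs.append(c)
    kb.summary = f"{len(kb.proofs)} accepted, {len(kb.notes)} rejected"


def toy_summarizer(candidates: List[str], kb: KnowledgeBase) -> str:
    return kb.summary


kb, digests = research_loop("P", [toy_prover_a, toy_prover_b],
                            toy_verifier, toy_summarizer, rounds=2)
```

Handing provers and the summarizer a copy of the knowledge base is one simple way to enforce the single-writer rule; a real system might instead rely on file permissions or prompt instructions.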

Case Studies of Solved Problems

1. Complexity Theory – PWPP and Oracle Separation

Bolzano contributes a black-box oracle separation establishing that the complexity class PWPP (Polynomial Weak Pigeonhole Principle) is not closed under adaptive Turing reductions. The system rapidly generated an instance (NestedCollision) requiring dependent collision finding, provided a core construction, and, after minimal expert steering, completed a formal proof. The counterexample demonstrates that, unlike PLS, PPA, and PPAD, adaptive and non-adaptive oracle access diverge for PWPP (2604.16989).

Result: The theorem formally asserts that black-box $\mathrm{PWPP}$ is not Turing-closed; the adaptive reduction cannot be realized via shallow collision-formulation.
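For background, PWPP is commonly defined through the task of finding a collision in a function that maps a larger domain into a smaller range, where the pigeonhole principle guarantees that a collision exists. The sketch below illustrates only that basic search problem on an explicit toy function. It is not the NestedCollision instance, whose definition is not given here, and the names `find_collision` and `compress` are hypothetical.

```python
from typing import Callable, Dict, Hashable, Iterable, Optional, Tuple


def find_collision(f: Callable[[int], Hashable],
                   domain: Iterable[int]) -> Optional[Tuple[int, int]]:
    """Return the first pair (x1, x2) with x1 != x2 and f(x1) == f(x2).

    If f has fewer possible outputs than there are inputs, the pigeonhole
    principle guarantees that such a pair exists.
    """
    seen: Dict[Hashable, int] = {}
    for x in domain:
        y = f(x)
        if y in seen:
            return seen[y], x
        seen[y] = x
    return None


def compress(x: int) -> int:
    # Toy shrinking map: 16 inputs, at most 8 distinct outputs.
    return (x * x) % 8


pair = find_collision(compress, range(16))
```

Exhaustive search like this takes time proportional to the domain size, which is exponential in the input length when the function is given as a circuit. The complexity-theoretic question concerns what can be achieved with oracle access to a collision finder, adaptively or non-adaptively.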

2. Additive Combinatorics – Structural Tilings

For translational monotilings of $\mathbb{R}^2$, Bolzano autonomously produced non-trivial tilings by axes-parallel polygonal tiles that probe the limits of current techniques. Notably, it constructed explicit tiles that admit decompositions into periodic tilings with distinct lattices for each partition, providing evidence of increased flexibility compared to discrete settings and contributing to the study of the weak periodic tiling conjecture.

Result: For irrational $\alpha$ in $(2/3,1)$, there exists an axes-parallel polygonal tile that is not a column tile, with a decomposition into distinct periodic tilings.

3. Cryptography – Special Soundness for KZG Batching

Bolzano delivered a formal proof of special soundness for multi-polynomial, multi-point batching in the univariate KZG commitment scheme, addressing a gap in standard-model security proofs. Given a streamlined base-case proof, the system extended the analysis to multi-point settings, generating a lengthy but rigorous technical argument.

Result: Multi-polynomial, multi-point KZG batching achieves $(m, L)$-special soundness for product-structured transcript trees, under standard falsifiable assumptions.

4. Data Structures – Working Set Properties in Heaps

Bolzano independently reproved and quantitatively strengthened the equivalence of "weak" and "strong" working set properties for heaps. The established result provides a tightened bound, showing that lifetime-based locality and strong locality costs for extracted elements are essentially equivalent up to sublinear additive terms, contrary to prior belief that the strong property was strictly stronger.

Result: For any $\varepsilon > 0$, the sum of logarithmic lifetimes for extracted elements is at most $(1+\varepsilon)$ times the sum of strong working set costs plus $O(m/\varepsilon)$, i.e., the two properties are equivalent up to a $(1+\varepsilon)$ factor and an additive $O(m/\varepsilon)$ term.
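Written as a single inequality, the bound reads as follows. The notation is an assumption introduced here for illustration: $X$ denotes the set of extracted elements, $\ell(x)$ the lifetime of $x$, and $\mathrm{ws}(x)$ its strong working set cost. The parameter $m$ is used as in the statement above and is not defined in this summary.

$$\sum_{x \in X} \log \ell(x) \;\le\; (1+\varepsilon) \sum_{x \in X} \mathrm{ws}(x) \;+\; O\!\left(\frac{m}{\varepsilon}\right) \qquad \text{for every } \varepsilon > 0.$$

Taking $\varepsilon$ small pushes the multiplicative factor toward 1 at the price of a larger additive term, so the two cost measures differ by a trade-off between these terms rather than by a fixed constant factor.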

5. Combinatorics – Partitioning Under Function Preimage Constraints

Bolzano disproved a conjecture from the KAMAK workshop on function-preimage partitions by constructing a counterexample for $k \geq 3$, showing that the original fiber-size condition is vacuously satisfied in high chromatic shift graphs. The system went further to suggest a corrected hypothesis ("pairwise $n$-boundedness"), and proved that this stronger constraint yields bounded chromatic number and therefore bounded partitions.

Result: Pairwise $n$-boundedness enables partitioning into a bounded number of parts, yielding strict disjointness across all function images within each part.

6. Cultural Dynamics – KKOS Optimization Complexity

Bolzano autonomously determined the complexity of optimization in the KKOS model, showing that finding a closest equilibrium distribution (minimizing cost from an arbitrary distribution) is NP-complete on general graphs, but solvable in polynomial time on forests (characterized by dissociation sets). Bolzano successfully corrected specification errors and provided structural results for chordal graphs.

Result: KKOS optimization is NP-complete generally, but for forests, feasible supports are precisely dissociation sets, enabling a polynomial-time algorithm.
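A dissociation set, in the standard graph-theoretic sense, is a vertex set whose induced subgraph has maximum degree at most 1, so that each chosen vertex has at most one chosen neighbor. The check below illustrates only this property. It is not the paper's algorithm for the KKOS model, and the function name and adjacency-dictionary representation are assumptions made for the example.

```python
from typing import Dict, Hashable, Iterable, Set

Graph = Dict[Hashable, Set[Hashable]]


def is_dissociation_set(graph: Graph, subset: Iterable[Hashable]) -> bool:
    """True if every chosen vertex has at most one chosen neighbor."""
    chosen = set(subset)
    return all(len(graph[v] & chosen) <= 1 for v in chosen)


# A path 0 - 1 - 2 - 3, which is a forest.
path: Graph = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}

ok = is_dissociation_set(path, {0, 1, 3})   # edge 0-1 plus isolated vertex 3
bad = is_dissociation_set(path, {0, 1, 2})  # vertex 1 has two chosen neighbors
```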

Implications and Future Directions

Bolzano's case studies substantiate the claim that LLMs, orchestrated in a persistent multi-agent architecture, can generate formal, publishable mathematical advances, both with human assistance and autonomously. The evidence complements prior reports from systems including OpenAI's GPT-5, Google's Gemini, and Aletheia, pointing to an emerging paradigm for AI-augmented mathematical research. The architecture's persistent knowledge base, agent diversity, and human-guided steering collectively facilitate deeper synthesis, proof generation, and error correction than single-session chatbot interactions, although the advantage remains unquantified.

Practically, this points to immediate deployment potential for AI-powered research assistants in mathematical and theoretical computer science domains. Theoretically, the demonstrated ability to construct counterexamples, extend proof templates, and establish separation results presages more generalized frameworks for automated mathematical discovery.

Future developments may include large-scale benchmarking across formal and informal proof domains, quantification of multi-agent versus single-agent performance, and further integration of formal verification tools to increase coverage and guarantee. The case studies highlight the ongoing relevance of human input in problem selection, verification, and high-level strategy, but suggest that agent architectures will increasingly take over the bulk of the technical work.

Conclusion

Bolzano, via its multi-agent LLM pipeline, establishes a robust methodology for AI-assisted and AI-autonomous mathematical research, producing significant results across a diverse set of hard open problems. The system's strengths in counterexample construction, proof synthesis, and cross-domain reasoning, combined with strategically applied human guidance, signal a shift in mathematical research workflows. The practical and theoretical implications invite further investigation into AI's role in formal sciences and automated reasoning.
