SATLUTION: Autonomous SAT Solver Evolution

Updated 10 September 2025

SATLUTION is an autonomous framework for repository-scale evolution of SAT solvers that employs LLM coding agents to dynamically update both code and evolutionary policies.
It implements an agent-based, iterative process where Planning and Coding agents execute a Generate → Compile → Test → Analyze → Evolve loop under strict correctness and performance guarantees.
Evolved SATLUTION solvers have outperformed top human-designed competitors by achieving superior PAR-2 scores and solving more SAT instances on established benchmarks.

SATLUTION is an autonomous framework for repository-scale code evolution in Boolean Satisfiability (SAT) solving, orchestrated by LLM coding agents. The system extends prior agentic frameworks from isolated kernel refinement to the evolution of entire SAT solver repositories—comprising hundreds of source files and tens of thousands of lines of code—under strict correctness guarantees and runtime feedback. SATLUTION self-evolves both solver code and its own evolutionary policies, producing solvers that have outperformed state-of-the-art, human-designed competition winners on canonical SAT benchmarks (Yu et al., 9 Sep 2025).

1. System Architecture and Code Evolution Paradigm

SATLUTION implements an agent-based, iterative code evolution architecture at repository scale. The framework consists of two principal agent types: the Planning Agent and the Coding Agent. The Planning Agent analyzes performance feedback and formulates improvement strategies, such as modifying heuristics, adjusting clause management policies, or refactoring architectural modules. The Coding Agent executes these plans, modifying C/C++ code directly within the repository, ensuring that all changes remain compilable and correctly integrated.

Each iteration produces a new solver version housed in its own directory (e.g., SATLUTION_x/), with mandatory ancillary files including HYPOTHESIS.md (listing improvement hypotheses), CHANGELOG.md (documenting modifications), and RESULTS.md (recording benchmark results), as well as updated configuration scripts. This repository structure supports full traceability, reproducibility, and rigorous lineage tracking across iterations.
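The documentation requirement lends itself to a mechanical check. The following is a minimal sketch, not the framework's actual tooling: the function name `missing_artifacts` is hypothetical, and only the three file names come from the description above.

```python
from pathlib import Path

# File names taken from the text; everything else here is illustrative.
REQUIRED_FILES = ("HYPOTHESIS.md", "CHANGELOG.md", "RESULTS.md")


def missing_artifacts(version_dir):
    """Return the required files that are absent from a solver version directory."""
    root = Path(version_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

A cycle could, for instance, refuse to register a version directory such as SATLUTION_x/ until this function returns an empty list.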

SATLUTION introduces a scalable, rule-driven evolution protocol: agent actions are gated both by static domain rules and by self-evolving dynamic policies, ensuring disciplined exploration of the vast solver design space. The process is schematized as a “Generate → Compile → Test → Analyze → Evolve” loop, with built-in mechanisms for continuous improvement and error correction.
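The control flow of the loop can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: every callable (`generate`, `compile_ok`, `smoke_ok`, `analyze`, `update_rules`) and the `CycleResult` record are hypothetical stand-ins for the agents and tooling described above.

```python
from dataclasses import dataclass


@dataclass
class CycleResult:
    version: int
    accepted: bool
    stage: str  # where the cycle ended: "compile", "test", or "analyze"


def evolve(generate, compile_ok, smoke_ok, analyze, update_rules, rules, iterations):
    """Run Generate -> Compile -> Test -> Analyze -> Evolve for a fixed number of cycles."""
    history = []
    for version in range(1, iterations + 1):
        candidate = generate(rules, version)          # Generate
        if not compile_ok(candidate):                 # Compile
            stage, accepted = "compile", False
        elif not smoke_ok(candidate):                 # Test
            stage, accepted = "test", False
        else:                                         # Analyze
            stage, accepted = "analyze", analyze(candidate)
        result = CycleResult(version, accepted, stage)
        history.append(result)
        rules = update_rules(rules, result)           # Evolve (policies included)
    return history, rules
```

Passing the rules through `update_rules` on every cycle mirrors the point that policies, not just code, change between iterations.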

2. Target Domain: Boolean Satisfiability (SAT) and NP-Completeness

The evolution target for SATLUTION is Boolean Satisfiability (SAT), the first problem proven NP-complete (Cook-Levin theorem). SAT is central to computational complexity theory and has broad applications in hardware verification, formal software analysis, cryptanalysis, and planning. Modern SAT solvers, benchmarked in the annual SAT Competitions, employ highly optimized algorithms such as CDCL (Conflict-Driven Clause Learning) with sophisticated restart and decision heuristics.

SATLUTION leverages SAT as both a high-impact real-world benchmark and a stringent domain for demonstrating autonomous code evolution. Candidate solvers are seeded from SAT Competition 2024 repositories, and evolved solvers are validated against the SAT Competition 2025 benchmark suite. Performance is tracked using standard metrics, chiefly the PAR-2 score.

3. Evolutionary Cycle: Agents, Feedback, and Verification

The iterative evolution process central to SATLUTION involves the interplay of agents and a rigorous two-stage verification pipeline. After each code revision:

Stage 1: Immediate checks validate that both compilation and a smoke test on simple CNF instances succeed, screening for critical failures (e.g., crashes, segmentation faults).
Stage 2: Full-scale evaluation executes the solver on comprehensive SAT benchmarks using a distributed evaluator (800-node CPU harness), with correctness verified by checking SAT-assignment validity and using DRAT proofs for UNSAT cases.
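The SAT-side correctness check in Stage 2 amounts to confirming that the reported assignment makes every clause true. Below is a minimal sketch, assuming clauses are given as lists of signed integers in the DIMACS style; the function names are illustrative, and DRAT proof checking for UNSAT results is not covered here.

```python
def model_from_literals(literals):
    """Turn signed literals such as [1, -2, 3] into a {variable: bool} mapping."""
    return {abs(lit): lit > 0 for lit in literals if lit != 0}


def literal_true(lit, assignment):
    """A literal holds only if its variable is assigned with the matching polarity."""
    value = assignment.get(abs(lit))
    return value is not None and value == (lit > 0)


def satisfies(clauses, assignment):
    """Check that every clause contains at least one true literal."""
    return all(
        any(literal_true(lit, assignment) for lit in clause)
        for clause in clauses
    )
```

Unassigned variables are deliberately treated as satisfying nothing, so an incomplete model cannot pass by accident.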

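The output metrics include the number of instances solved, runtime distributions, memory consumption, and the PAR-2 score: $\text{PAR-2} = \frac{1}{N} \left(\sum_{i=1}^{N} t_i + M \times (\# \text{unsolved instances})\right)$ where $t_i$ is the runtime for instance $i$ (capped at the timeout for unsolved instances), $M$ is the additional timeout penalty (5000s), and $N$ is the total number of instances. With a 5000s timeout, each unsolved instance therefore contributes 10000s, twice the timeout, and lower scores are better.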
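As a worked illustration of the metric, the sketch below computes PAR-2 under the usual convention that an unsolved instance is charged twice the timeout. The function name, the use of `None` to mark unsolved instances, and the default arguments are assumptions made for this example.

```python
def par2(runtimes, timeout=5000.0, penalty_factor=2.0):
    """Average runtime in seconds, charging penalty_factor * timeout per unsolved instance.

    runtimes: one entry per instance; None (or a value above the timeout)
    marks an instance that was not solved.
    """
    if not runtimes:
        raise ValueError("at least one instance is required")
    total = 0.0
    for t in runtimes:
        if t is None or t > timeout:
            total += penalty_factor * timeout  # unsolved: 2 x 5000s by default
        else:
            total += t
    return total / len(runtimes)
```

For three instances with runtimes of 100s, 300s, and one timeout, the score is (100 + 300 + 10000) / 3, roughly 3466.7s.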

Benchmarks and feedback are used not only to guide the next evolution step but also to update the agent rulebase—if recurring errors or performance anomalies are detected, the Planning Agent revises policies to avoid ineffective or detrimental modifications.

4. Self-Evolving Rulebase and Policy Adaptation

SATLUTION distinguishes itself by its dynamic rulebase: the set of rules governing code evolution is itself subject to autonomous adaptation. Beginning with expert-defined SAT domain rules (covering core heuristics, forbidden code patterns, mandatory correctness constraints), the framework augments and revises these policies in response to post-mortem analyses after each evolution cycle.

For example, rules on clause learning, branching heuristics, and DRAT generation for UNSAT outcomes may be expanded to prohibit error-prone code structures, or to prioritize strategies associated with performance gains. Modules such as “Rule 04 – Forbidden Patterns” and “Rule 05 – Automatic Rule Evolution” are updated automatically, improving agent guidance and enabling more robust and efficient solver evolution.
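One way to picture a forbidden-pattern rule and its automatic extension is shown below. This is purely a sketch: the source does not say that rules are stored as regular expressions, and the rule names, the post-mortem tuple format, and both function names are hypothetical.

```python
import re


def find_violations(source, forbidden):
    """Return the names of forbidden-pattern rules (name -> regex) matched by the source text."""
    return sorted(
        name for name, pattern in forbidden.items() if re.search(pattern, source)
    )


def evolve_rules(forbidden, postmortem):
    """Return a new rule set extended with patterns that a post-mortem tied to failures.

    postmortem: iterable of (rule_name, regex, caused_failure) tuples.
    Existing rules are never overwritten.
    """
    updated = dict(forbidden)
    for name, pattern, caused_failure in postmortem:
        if caused_failure:
            updated.setdefault(name, pattern)
    return updated
```

Returning a fresh dictionary rather than mutating the input keeps each cycle's rulebase available for lineage tracking.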

The evolving rulebase serves two functions: enforcing correctness guarantees and facilitating safe exploration, steering agents away from deleterious modifications. This adaptive protocol produces empirically smooth solver improvement trajectories and mitigates regression.

5. Performance Benchmarks and Comparative Analysis

The evolved SATLUTION solvers have demonstrated superior performance on official SAT Competition benchmarks. On the SAT Competition 2025 benchmark suite, SATLUTION-generated solvers solved more instances (344–347) than the competition winners (331–334), with lower PAR-2 scores indicating faster and more consistent runtimes. These improvements span both satisfiable and unsatisfiable instance classes and are supported by cactus plots and instance breakdowns presented in the source paper.

The evolved solvers outperform not only the 2024 seed codebases from which they started but also the hand-optimized human champion solvers from both the 2024 and 2025 competition cycles. These results indicate that agent-driven, feedback-directed evolution can produce algorithmic improvements beyond manual engineering in complex, high-stakes domains.

6. Verification Infrastructure and Technical Implementation

SATLUTION operates on an 800-node CPU cluster, allowing parallelized evaluation against hundreds of benchmark instances. The system tracks fine-grained performance metrics, with special attention to time thresholds, resource consumption, and correctness per DRAT and assignment checks. Multiple agent token types, including cache reads, output tokens, and rule-compliance signals, are fed into policy modules that guide agent operations.
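The fan-out of benchmark instances across workers can be sketched on a single machine as follows. This is an assumption-laden stand-in for the distributed harness: `evaluate_all` and `run_one` are hypothetical names, and a real `run_one` would launch the solver binary on one instance and record its outcome.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_all(run_one, instances, workers=8):
    """Apply run_one to every instance in parallel and map each instance to its result."""
    instances = list(instances)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        return dict(zip(instances, pool.map(run_one, instances)))
```

Threads are sufficient in this sketch when each task mostly waits on an external solver process; the cluster-scale scheduling described above is outside its scope.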

All evolutionary cycles require full documentation and build-system integrity, ensuring that each new solver version is both fully reproducible and independently verifiable. The framework’s technical architecture supports not only solver evolution but also the traceable evolution of its own codebase and policies.

7. Prospects and Generalization

Future work in SATLUTION may focus on extending agentic evolution to verification sub-modules themselves—eventually enabling autonomous construction and optimization of verification tools and procedures. The framework could also be generalized to cover broader NP-complete domains or optimization problems in electronic design automation, formal synthesis, and beyond.

A plausible implication is that SATLUTION’s scalable protocol for repository-wide code evolution, coupled with domain-aware self-updating policies, may accelerate progress in fields where correctness and performance are simultaneously paramount. The methodology demonstrated in Boolean Satisfiability is structurally applicable to other mission-critical areas, contingent on analogous benchmarking and verification protocols.

Summary Table: SATLUTION Features, Performance, and Infrastructure

Aspect	Details	Performance Metric
Scope	Full C/C++ SAT solver repositories (hundreds of files)	PAR-2 score, number of instances solved
Agents	Planning & Coding, rule-driven evolution	Runtime, memory usage distribution
Verification	Two-stage pipeline, DRAT proof checks	Correctness on SAT & UNSAT
Hardware	800-node CPU test harness	Speed of evaluation, parallelization
Documentation	Mandatory HYPOTHESIS.md, CHANGELOG.md, RESULTS.md	Reproducibility
Rulebase	Expert + self-evolving policies	Error reduction, rule compliance

SATLUTION demonstrates repository-scale autonomous code evolution guided by dynamic policy adaptation. Its success in outperforming leading human designs in Boolean Satisfiability highlights the viability of agentic frameworks for advanced algorithmic engineering and suggests opportunities for broader applications in complex, correctness-constrained computational domains (Yu et al., 9 Sep 2025).

References

Yu et al. (9 Sep 2025). Autonomous Code Evolution Meets NP-Completeness.
