Socratic-Solver-8B: Autonomous Math Reasoning LLM
- Socratic-Solver-8B is a large language model that uses an autonomous, agent-based co-evolutionary curriculum for advanced mathematical problem solving.
- It integrates a Teacher, Solver, and Generator in a closed-loop system that dynamically refines problems and trains the Solver on its own verified solutions via direct preference optimization.
- Empirical benchmarks show the model surpasses several state-of-the-art LLMs, with improvements up to 20.2 percentage points on diverse math tests.
Socratic-Solver-8B is an LLM trained for advanced mathematical reasoning via a fully autonomous, co-evolutionary curriculum framework. Unlike traditional training regimes that depend on static, massive, human-annotated datasets, Socratic-Solver-8B leverages an agent-based closed-loop system that generates, refines, and evaluates problems and solutions from minimal seed data. The result is a self-improving curriculum that enables the model to achieve strong empirical performance across standardized mathematical reasoning benchmarks and, notably, to surpass several state-of-the-art commercial LLMs in aggregate on challenging tasks (Wang et al., 29 Sep 2025).
1. Agent-Based Co-Evolutionary Framework
The Socratic-Zero framework consists of three interacting agents:
- Teacher Agent: Implemented as a fixed, high-capacity LLM, the Teacher executes two key functions: (i) a deterministic verifier, V(q, y), to determine solution correctness, and (ii) a problem refinement function, G(q, y_fail), to generate new (problem, solution) pairs targeting Solver weaknesses. The Teacher serves as an adaptive adversary, continuously identifying failure modes in the Solver’s reasoning.
- Solver Agent: This learner LLM (the subject of Socratic-Solver-8B) utilizes a policy πθ_S to map textual mathematical problems q to solution trajectories y. Solver training is conducted with direct preference optimization (DPO), explicitly reinforcing trajectories that pass Teacher verification and penalizing incorrect ones.
- Generator Agent: The Generator learns to emulate the Teacher’s refinement strategy, scaling curriculum generation via a value-weighted supervised fine-tuning (WSFT) objective. This ensures independent, large-scale synthesis of targeted problems, keeping the curriculum at an optimal challenge level for the evolving Solver.
The interaction loop is cyclical: Solver failures on current curriculum samples are used as seeds for the Teacher’s refinement, and these refined examples become new training data. The Generator internalizes this refinement process, autonomously producing new, high-fidelity problems (Wang et al., 29 Sep 2025).
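The following Python sketch illustrates one such round of the closed loop. It is a minimal illustration under assumed interfaces: the agent objects and method names (solver.generate, teacher.verify, teacher.refine, generator.train_wsft, generator.synthesize) and the sampling count are hypothetical stand-ins for exposition, not the authors' actual API.

```python
# Illustrative sketch of one Socratic-Zero co-evolution round.
# All object interfaces and names below are assumed for exposition only.

def coevolution_round(curriculum, solver, teacher, generator):
    preference_pairs = []     # (q, y_win, y_lose) pairs for Solver DPO (Section 3)
    refinement_triplets = []  # (q, y_fail, q') examples for Generator WSFT (Section 2)

    for q in curriculum:
        trajectories = solver.generate(q, num_samples=8)          # Solver policy samples
        verdicts = [teacher.verify(q, y) for y in trajectories]   # Teacher verifier V(q, y)

        correct = [y for y, ok in zip(trajectories, verdicts) if ok]
        failed = [y for y, ok in zip(trajectories, verdicts) if not ok]

        # DPO needs contrastive pairs: a Teacher-verified trajectory vs. a failing one.
        if correct and failed:
            preference_pairs.append((q, correct[0], failed[0]))

        # Teacher refinement G(q, y_fail) turns observed failures into harder problems.
        for y_fail in failed:
            refinement_triplets.append((q, y_fail, teacher.refine(q, y_fail)))

    solver.train_dpo(preference_pairs)
    generator.train_wsft(refinement_triplets)
    return generator.synthesize(n=len(curriculum))  # next round's curriculum
```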
2. Autonomous Data Generation and Curriculum Evolution
The initialization phase requires only 100 carefully selected, diverse mathematical problems, covering broad categories such as algebra, geometry, and number theory, with stringent quality controls to avoid ambiguity or triviality.
The data evolution proceeds as follows:
- Solver generates solution trajectories to curricular problems.
- Teacher agent (V) evaluates the solutions; incorrect or suboptimal attempts (y_fail) are used by the Teacher (G) to produce new, more challenging problems.
- Generator agent is trained on (q, y_fail, q′) triplets with value-weighted supervision (a minimal numerical sketch of the weighting follows this list):

$$\mathcal{L}_{\mathrm{WSFT}}(\theta_G) = -\,\mathbb{E}_{(q,\,y_{\mathrm{fail}},\,q')}\!\left[w(q')\,\log \pi_{\theta_G}(q' \mid q, y_{\mathrm{fail}})\right], \qquad w(q') = \exp\!\left(-\frac{\left(r(q') - \mu\right)^2}{2\sigma^2}\right)$$

where $w(q')$ is an unnormalized Gaussian utility of the Solver's success rate $r(q')$ on the refined problem, centered at a target Solver success rate µ = 0.5.
- Curriculum dynamically updates; as the Solver improves, the pipeline autogenerates more difficult tasks, maintaining optimal pacing (“zone of proximal development”) between stagnation and excessive failure.
This co-evolution requires no large human-annotated corpora beyond the 100 seed problems, with all generation, validation, and adaptation handled by the agents themselves.
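As a concrete illustration of the value-weighted supervision above, the snippet below computes the unnormalized Gaussian utility for a few candidate problems; the bandwidth sigma = 0.2 and the function name gaussian_utility are assumptions for this sketch, not values reported in the paper.

```python
import math

def gaussian_utility(success_rate, mu=0.5, sigma=0.2):
    """Unnormalized Gaussian utility over the Solver's empirical success rate.

    Peaks when the Solver solves a candidate problem about half the time
    (mu = 0.5, as in the paper) and decays for problems that are too easy
    (rate near 1) or too hard (rate near 0). sigma is an assumed bandwidth.
    """
    return math.exp(-((success_rate - mu) ** 2) / (2 * sigma ** 2))

# Candidate refined problems scored by the Solver's measured success rate.
candidates = {"too_easy": 0.95, "well_paced": 0.55, "too_hard": 0.05}
for name, rate in candidates.items():
    # Each weight would scale that example's term in the Generator's SFT loss.
    print(f"{name}: weight = {gaussian_utility(rate):.2f}")
# -> too_easy: 0.08, well_paced: 0.97, too_hard: 0.08
```

Problems near the target difficulty thus dominate the Generator's training signal, which is what keeps the synthesized curriculum inside the Solver's zone of proximal development.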
3. Direct Preference Optimization and Solver Training
Solver training employs a preference-based loss contrasting correct and incorrect solution trajectories. The DPO loss applied is:

$$\mathcal{L}_{\mathrm{DPO}}(\theta_S) = -\,\mathbb{E}_{(q,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_{\theta_S}(y_w \mid q)}{\pi_{\mathrm{ref}}(y_w \mid q)} - \beta \log \frac{\pi_{\theta_S}(y_l \mid q)}{\pi_{\mathrm{ref}}(y_l \mid q)}\right)\right]$$

where $y_w$ and $y_l$ are “winning” (correct) and “losing” (incorrect) trajectories as determined by the Teacher verifier, $\beta$ is a scaling parameter, and $\pi_{\mathrm{ref}}$ is a reference policy. This approach directly drives the Solver’s model distribution to favor solution paths validated by Teacher feedback over degenerate or failing reasoning chains, efficiently harnessing reward signals without human labeling.
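The following PyTorch sketch shows how a preference loss of this form can be computed from whole-trajectory log-probabilities; the function name, the default beta = 0.1, and the random toy inputs are illustrative assumptions, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_win, policy_logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Direct preference optimization loss on sequence log-probabilities.

    Each argument is a (batch,) tensor holding the summed token log-probability
    of a full solution trajectory under the Solver policy or the frozen
    reference policy. beta = 0.1 is a common default assumed for this sketch.
    """
    # Log-ratios of policy vs. reference for Teacher-verified ("winning")
    # and failing ("losing") trajectories.
    win_logratio = policy_logp_win - ref_logp_win
    lose_logratio = policy_logp_lose - ref_logp_lose
    # Push the winning log-ratio above the losing one, scaled by beta.
    return -F.logsigmoid(beta * (win_logratio - lose_logratio)).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
batch = 4
loss = dpo_loss(torch.randn(batch), torch.randn(batch),
                torch.randn(batch), torch.randn(batch))
print(float(loss))
```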
4. Benchmarking and Empirical Performance
Socratic-Solver-8B, trained via the described protocol, demonstrates strong performance across seven major mathematical reasoning benchmarks: AMC23, AIME-24, AIME-25, Olympiad, MATH-500, Minerva, and GSM8K. Notable metrics include:
| Benchmark | Socratic-Solver-8B Accuracy |
|---|---|
| AMC23 | 63.7% |
| Minerva | 52.4% |
| MATH-500 | 81.2% |
| GSM8K | 87.3% |
| Olympiad | 55.1% |
| AIME-25 | 24.6% |
| AIME-24 | 28.4% |
| Stage 3 Average | 56.1% |
The model achieves a +20.2 percentage point gain over the strongest prior data synthesis methods (measured at Stage 3). Synthetic data generated by Socratic-Generator-32B further enables student LLMs to attain average accuracy (37.72%) that matches or surpasses leading commercial models (including Qwen3-235B-A22B, DeepSeek-V3.1-671B, GPT-5, Gemini-2.5-Pro, Grok-4, and Claude-4.1-Opus) (Wang et al., 29 Sep 2025).
5. Innovations and Theoretical Contributions
Socratic-Solver-8B and the underlying Socratic-Zero pipeline introduce several departures from conventional LLM reasoning pipelines:
- Data Minimalism: The system is bootstrapped from 100 seeds, eschewing massive, static, human-curated datasets.
- Autonomous Curriculum Adaptation: Curriculum complexity self-adjusts in real time according to Solver learning curves, targeting weaknesses dynamically.
- Agentic Closed Loop: Teacher, Solver, and Generator are each modular, permitting independent upgrades, theoretical analysis of convergence, and injection of bespoke evaluation logic.
- Preference-Based Reinforcement: The DPO training scheme enables selection of preferable reasoning trajectories without explicit reward assignment.
- Efficient Scaling: Generator agent’s value-weighted supervision ensures scalable, high-fidelity problem synthesis, with utility shaping to match Solver’s learning phase.
- Benchmark Generality: Gains are robust across both Qwen3 and GLM4 families (Wang et al., 29 Sep 2025).
6. Implications and Prospects
The Socratic-Solver-8B method establishes a viable path toward reasoning-competent LLMs without heavy reliance on manual supervision. The fully autonomous co-evolution paradigm, bootstrapped from minimal seed data, not only reduces annotation costs but also provides natural support for curriculum adaptability, a key requirement for robust mathematical generalization.
Future research opportunities include extending this agentic protocol to other scientific or logic-intensive domains, theoretical analysis of the co-evolutionary dynamics, and further refinement of the curriculum shaping functions. Given its modular ecosystem, individual agent roles (e.g., Teacher verification criteria, Generator synthesis policies) can be independently advanced, offering a flexible template for next-generation LLMs targeting hard reasoning domains.
In conclusion, Socratic-Solver-8B leverages preference-based optimization and closed-loop agent interactions to produce a mathematical reasoner that sets a new empirical standard for autonomous LLM training methodology (Wang et al., 29 Sep 2025).