- The paper introduces Lang2Logic, a bi-level framework that decouples reasoning into high-level task abstraction and low-level logic generation for enhanced accuracy.
- It employs optimization-guided formalization and bi-level reinforcement learning, enabling modular design and systematic error traceability.
- Experimental results demonstrate over 10% accuracy gains on diverse benchmarks, highlighting improved interpretability and scalability in structured reasoning.
Bi-Level Structured Reasoning: From Language to Logic
"From Language to Logic: A Bi-Level Framework for Structured Reasoning" (Lang2Logic) introduces a principled approach to bridging the gap between unstructured natural language and formal logical reasoning in LLMs. The framework is motivated by the observation that existing reasoning paradigms—such as Chain-of-Thought (CoT), program-aided reasoning, and logic programming—either lack explicit structural modeling or are limited in generality and interpretability. Lang2Logic addresses these limitations by decomposing the reasoning process into two distinct, collaborative stages: high-level task abstraction and low-level logic generation.
Framework Overview
Lang2Logic consists of two specialized LLMs:
- Optimization-Guided Formalization (OGF) LLM: This component parses natural language queries into structured, formal models. Each model is represented as a five-tuple: problem overview, model type, variables, constraints, and objectives (a serialization sketch follows this list). This abstraction step is inspired by human problem-solving strategies in mathematics and operations research, where formal modeling precedes computation.
- Logic Generation (LG) LLM: Given the structured model, this component generates symbolic workflows or executable programs (primarily in Python), which are then executed to produce the final answer. The use of code as a universal symbolic representation enables modular, interpretable, and verifiable reasoning.
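To make the interface between the two components concrete, here is a minimal sketch of the five-tuple and a corresponding generated program. The dataclass fields and the toy problem are illustrative assumptions: the paper names the five components but does not prescribe a concrete serialization.

```python
from dataclasses import dataclass

@dataclass
class StructuredModel:
    """Illustrative container for the OGF LLM's five-tuple output;
    field names are assumptions, not the paper's exact schema."""
    overview: str      # natural-language restatement of the problem
    model_type: str    # e.g. "arithmetic", "constraint_satisfaction"
    variables: dict    # symbol -> description or domain
    constraints: list  # formal or semi-formal relations among variables
    objectives: list   # quantities to compute or optimize

# A toy word problem after formalization by the OGF LLM.
model = StructuredModel(
    overview="Ann has 3 boxes of 12 pencils and gives away 7; how many remain?",
    model_type="arithmetic",
    variables={"boxes": "3", "per_box": "12", "given_away": "7"},
    constraints=["total = boxes * per_box"],
    objectives=["remaining = total - given_away"],
)

# A program the LG LLM might emit for this model; executing it prints 29.
generated_code = """
boxes, per_box, given_away = 3, 12, 7
total = boxes * per_box
print(total - given_away)
"""
```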
The two components interact in a collaborative, bidirectional manner. If the LG LLM encounters errors or unsatisfactory outputs, it can trigger feedback to the OGF LLM for model refinement, supporting iterative improvement and error traceability.
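A minimal sketch of this loop follows, assuming hypothetical `ogf_formalize` and `lg_generate` wrappers around the two LLMs and a subprocess-based executor; the three-round refinement budget is an assumption, not a figure from the paper.

```python
import subprocess
import sys
from dataclasses import dataclass

@dataclass
class ExecResult:
    ok: bool
    answer: str = ""
    error: str = ""

def run_sandboxed(program: str) -> ExecResult:
    """Execute LG-generated Python in a subprocess, capturing output."""
    proc = subprocess.run([sys.executable, "-c", program],
                          capture_output=True, text=True, timeout=10)
    if proc.returncode == 0:
        return ExecResult(ok=True, answer=proc.stdout.strip())
    return ExecResult(ok=False, error=proc.stderr.strip())

def solve(query, ogf_formalize, lg_generate, max_rounds=3):
    """Collaborative loop: formalize, generate, execute; on failure, route
    the error trace back to the OGF LLM as feedback for model refinement."""
    feedback = None
    for _ in range(max_rounds):
        model = ogf_formalize(query, feedback)  # high-level task abstraction
        program = lg_generate(model)            # low-level logic generation
        result = run_sandboxed(program)
        if result.ok:
            return result.answer
        feedback = result.error
    return None  # unresolved within the refinement budget
```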
Training Methodology
Lang2Logic employs a two-stage training strategy:
- Model-Augmented Supervised Fine-Tuning: The OGF LLM is initially fine-tuned on a curated dataset of model-augmented samples. These samples are generated via best-of-N sampling from a strong LLM, filtered by code execution correctness, and ranked for clarity and conciseness. This ensures high-quality alignment with the formalization schema (see the sketch after this list).
- Bi-Level Reinforcement Learning (RL): The framework is further optimized using a bi-level RL algorithm based on Group Relative Policy Optimization (GRPO). The OGF LLM (high-level policy) samples candidate models, while the LG LLM (low-level policy) samples candidate solutions for each model. Rewards are computed based on answer correctness and output format, with normalized advantage functions guiding policy updates. Alternating optimization ensures coordinated improvement across both abstraction and execution layers.
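The sketch below illustrates both stages under stated assumptions: execution-filtered best-of-N curation for the SFT data, and GRPO-style group-normalized advantages with a correctness-plus-format reward. The 0.1 format weight and the length-based clarity ranking are illustrative choices, not the paper's exact ones.

```python
import statistics

def filter_best_of_n(samples, is_correct):
    """Stage 1 sketch: from best-of-N candidate formalizations, keep those
    whose generated code executes to the correct answer, then rank them
    (shorter model text used here as a proxy for clarity/conciseness)."""
    correct = [s for s in samples if is_correct(s)]
    return sorted(correct, key=lambda s: len(s["model_text"]))

def reward(sample):
    """Stage 2 sketch: reward combines answer correctness with
    output-format compliance; the 0.1 format weight is assumed."""
    return float(sample["answer_correct"]) + 0.1 * float(sample["format_ok"])

def grpo_advantages(rewards):
    """GRPO-style advantage: center and scale each reward within its
    sampled group, so no learned value network is required."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Usage: for one query, the OGF policy samples G candidate models, the LG
# policy samples solutions per model, and the normalized advantages weight
# the policy-gradient updates in alternating optimization steps.
group = [{"answer_correct": c, "format_ok": True} for c in (1, 0, 1, 1)]
advantages = grpo_advantages([reward(s) for s in group])
```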
Experimental Results
Lang2Logic is evaluated on a comprehensive suite of nine reasoning benchmarks spanning mathematical, logical, causal, spatial, and temporal domains. The experiments use two model scales (Qwen2.5-7B-Instruct and Qwen2.5-1.5B-Instruct) and compare against strong baselines, including CoT, Plan-and-Solve, Self-Refine, and PAL.
Key empirical findings:
- Substantial accuracy gains: Lang2Logic achieves average accuracy improvements exceeding 10% over the strongest baselines, with gains as high as 40% on certain tasks.
- Cross-domain generalization: The framework demonstrates robust performance across diverse reasoning types, including causal inference, logical puzzles, and spatio-temporal reasoning.
- Enhanced interpretability and error traceability: The explicit separation of abstraction and execution yields more transparent reasoning chains and facilitates systematic error analysis.
- Scalability to complex tasks: On challenging mathematical reasoning datasets (e.g., GSM-hard), Lang2Logic outperforms PAL by 14.4 percentage points, highlighting its effectiveness in multi-step logical operations.
The following table summarizes selected results (accuracy, %):
| Task / Model | CoT | Plan-and-Solve | PAL | Lang2Logic | Δ Gain |
| --- | --- | --- | --- | --- | --- |
| GSM-hard (7B) | 59.3 | 58.0 | 71.7 | 82.0 | +14.4 |
| AutoLogi (7B) | 55.8 | 54.1 | 55.4 | 63.0 | +12.9 |
| Test-of-Time (7B) | 61.0 | 41.3 | 75.0 | 83.5 | +11.3 |
| Next Step Prediction | 37.0 | 37.4 | 41.1 | 48.3 | +17.5 |
Implications and Future Directions
Practical Implications:
- Modular deployment: The bi-level architecture allows for independent scaling and optimization of abstraction and execution components, facilitating deployment in resource-constrained environments.
- Domain adaptation: The formalization schema is sufficiently general to support adaptation to new domains by retraining or fine-tuning the OGF LLM on domain-specific abstractions.
- Error analysis and debugging: The explicit intermediate representations enable systematic tracing of reasoning failures, which is critical for high-stakes applications in scientific discovery, law, and engineering.
Theoretical Implications:
- Cognitive alignment: The framework operationalizes a human-like "modeling and solving" paradigm, moving beyond surface-level pattern recognition toward deeper logical understanding.
- Hierarchical learning: The bi-level RL approach provides a template for hierarchical policy learning in complex, multi-stage reasoning tasks.
Future Developments:
- Integration with external tools: Extending the LG LLM to interface with domain-specific solvers (e.g., SAT, CSP, theorem provers) could further enhance reasoning capabilities; see the sketch after this list.
- Automated feedback loops: Developing more sophisticated mechanisms for automated error detection and model refinement could improve robustness and reduce human intervention.
- Scaling to larger models and datasets: Applying the framework to larger LLMs and more diverse datasets may yield further improvements in generalization and performance.
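As a hypothetical illustration of the first direction, the LG LLM could compile a structured model's variables and constraints into a call to the Z3 SMT solver rather than into plain Python. The Z3 Python bindings used below are real (`pip install z3-solver`), but the mapping from the five-tuple to solver calls is an assumption, not part of Lang2Logic.

```python
# Hypothetical target program: the LG LLM emits solver calls instead of
# imperative arithmetic, delegating search to a complete constraint engine.
from z3 import Int, Solver, sat

x, y = Int("x"), Int("y")        # variables from the structured model
s = Solver()
s.add(x + y == 10, x - y == 4)   # constraints from the structured model

if s.check() == sat:
    m = s.model()
    print(m[x], m[y])            # objective: report the solution -> 7 3
```

Offloading constraint solving to a dedicated engine would preserve the framework's separation of abstraction from execution while adding completeness guarantees on the fragments such solvers support.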
Conclusion
Lang2Logic represents a significant advance in structured reasoning with LLMs by explicitly decoupling task abstraction from logic generation and optimizing both via a bi-level learning paradigm. The empirical results demonstrate strong improvements in accuracy, interpretability, and adaptability across a wide range of reasoning tasks. The framework provides a foundation for future research on systematic, trustworthy, and scalable reasoning in AI systems.