- The paper presents ENVISIONS, a framework that reduces reliance on human annotation by employing neural-symbolic self-training for LLMs.
- It uses a three-stage process—self-exploration, self-refinement, and self-rewarding—to iteratively enhance symbolic reasoning capabilities.
- The method achieves performance improvements of up to 30%, outperforming traditional fine-tuning and RL-based self-training approaches.
Interactive Evolution: A Neural-Symbolic Self-Training Framework For LLMs
The paper "Interactive Evolution: A Neural-Symbolic Self-Training Framework For LLMs" addresses the challenge of reducing the reliance on human-annotated data for the fine-tuning of LLMs. The authors propose a novel framework named ENVISIONS, which aims to enhance the capabilities of LLMs by employing a neural-symbolic self-training methodology. This approach is designed to manage the scarcity of symbolic data and improve the proficiency of LLMs in processing symbolic language (SL).
Framework and Methodology
ENVISIONS is predicated on an "environment-guided" self-training strategy. The framework iteratively interacts with an embodied environment to gather training data, which alleviates the need for extensive human annotation. The process involves three main stages (a minimal Python sketch follows the list):
- Self-Exploration: The weak LLM generates multiple candidate symbolic solutions for a given task, which are then executed in the environment; the environment returns binary feedback on the correctness of each solution.
- Self-Refinement: Using the initial solutions as references, the LLM generates refined symbolic solutions, which are likewise executed and their feedback recorded. This step polishes the solutions toward higher accuracy.
- Self-Rewarding: A soft reward score is calculated for each solution based on its execution probability, without the need for an external reward model. This score reflects the quality of symbolic solutions, thereby aiding in the reinforcement of effective solutions.
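The loop below is a minimal sketch of one such iteration for a single task. The `llm` and `env` objects, their methods (`generate`, `sequence_logprob`, `execute`), and the use of a length-normalized generation probability as the soft reward are illustrative assumptions, not the authors' implementation.

```python
import math
from dataclasses import dataclass

@dataclass
class Trajectory:
    question: str
    solution: str        # candidate symbolic solution (e.g., a program)
    correct: bool        # binary feedback from the environment
    soft_reward: float   # soft score derived from the model itself

def compute_soft_reward(llm, question, solution):
    # Length-normalized sequence probability as a proxy for solution quality;
    # this formulation is an assumption, not the paper's exact definition.
    logp = llm.sequence_logprob(question, solution)
    return math.exp(logp / max(len(solution.split()), 1))

def explore_refine_reward(llm, env, question, k=5):
    trajectories = []

    # 1) Self-exploration: sample k candidate symbolic solutions.
    candidates = [llm.generate(question) for _ in range(k)]

    # 2) Self-refinement: condition on each draft to produce a revised solution.
    refinements = [llm.generate(question, reference=sol) for sol in candidates]

    # 3) Self-rewarding: score every solution with binary environment feedback
    #    plus a soft reward, without any external reward model.
    for sol in candidates + refinements:
        trajectories.append(Trajectory(
            question=question,
            solution=sol,
            correct=env.execute(question, sol),   # binary feedback
            soft_reward=compute_soft_reward(llm, question, sol),
        ))
    return trajectories
```

The sketch covers a single task; in the framework this process repeats across the unlabeled task set in every iteration, and only the flow of data from exploration through refinement to rewarding is intended to be faithful here.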
The framework then filters these rewarded trajectories using a combination of binary and soft rewards and uses them to update a candidate pool. Subsequent iterations optimize the policy model on the new data, combining supervised fine-tuning (SFT) with a specially designed RL-free loss function that lets the model learn from its mistakes.
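A hedged sketch of how the filtering and pool update might look, reusing the `Trajectory` objects from the previous snippet; the top-n selection and the positive/negative pairing are illustrative choices rather than the paper's exact recipe.

```python
# Hypothetical filtering step: keep the highest-scoring executable positives
# and, for contrastive training, matched negatives ranked by soft reward.
def update_candidate_pool(pool, trajectories, top_n=2):
    positives = sorted((t for t in trajectories if t.correct),
                       key=lambda t: t.soft_reward, reverse=True)[:top_n]
    negatives = sorted((t for t in trajectories if not t.correct),
                       key=lambda t: t.soft_reward, reverse=True)[:top_n]
    # Positives feed supervised fine-tuning; (positive, negative) pairs feed
    # an RL-free contrastive objective, sketched later in this summary.
    pool.setdefault("sft", []).extend(positives)
    pool.setdefault("pairs", []).extend(zip(positives, negatives))
    return pool
```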
Datasets and Baselines
The framework was extensively evaluated across three domains: web agent tasks (MiniWob++), math reasoning (e.g., GSM8K, MATH), and logical reasoning (e.g., ProofWriter, RuleTaker). Baseline comparisons included approaches such as Distill-then-Finetune using teacher models like GPT-4 and Claude-2, as well as RL-based iterative self-training methods.
Results and Analysis
The results underscore the effectiveness of ENVISIONS:
- LLaMA2-Chat (7B) and LLaMA2-Chat (13B) models demonstrated significant performance improvements with ENVISIONS, with average gains of approximately 30.00% and 24.95%, respectively.
- ENVISIONS outperformed the Distill-then-Finetune approach by 5.66%-7.13% and demonstrated superior sustainability and training efficiency compared to RL-based self-training methods.
Detailed analyses show that ENVISIONS strikes a balance between exploratory ability and stability that is crucial for effective self-training. Careful trajectory filtering, the self-reward mechanism, and the RL-free loss together maintain a clear margin between positive and negative solutions, which makes LLM optimization efficient.
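As one plausible instantiation of such an RL-free objective (not the paper's exact formulation), a margin-based loss over (positive, negative) solution pairs widens exactly that distinction; the soft-reward weighting below is an assumption.

```python
import torch
import torch.nn.functional as F

def rl_free_loss(pos_logprob, neg_logprob, pos_soft, neg_soft, beta=1.0):
    # Contrastive, RL-free objective on (positive, negative) solution pairs:
    # push the model's log-likelihood of the correct solution above that of
    # the incorrect one, with the gap weighted by the soft-reward difference.
    margin = pos_logprob - neg_logprob                 # per-pair likelihood gap
    weight = (pos_soft - neg_soft).clamp(min=0.0)      # emphasize confident pairs
    return -(weight * F.logsigmoid(beta * margin)).mean()
```

In practice this term would be added to the SFT loss on positive trajectories, so the model both imitates successful solutions and is penalized for assigning them less probability than failed ones.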
Implications and Future Directions
The implications of this research are substantial for both practical applications and theoretical advancements. Practically, the reduced dependency on human-annotated data and the ability to self-improve through interaction with environments make LLMs more scalable and cost-effective. Theoretically, the neural-symbolic integration presents pathways to enhance the reasoning capabilities of LLMs, enabling them to tackle more complex tasks.
Future research could explore the synergy between ENVISIONS and other self-training methods to further optimize performance or extend the framework to other domains, such as visual environments or robotic control.
Overall, the paper provides a robust contribution to the field of AI, presenting a viable and efficient method for evolving LLMs from weak to strong without extensive human-annotated training data. The insights from this research pave the way for further exploration and innovation in self-training methodologies.