Dynamic Evaluation of LLMs via Adaptive Reasoning Graph
The static evaluation paradigm for LLMs is rapidly becoming inadequate due to inherent limitations such as data contamination and a lack of alignment with the evolving capabilities of LLMs. To address these limitations, the paper presents DARG (Dynamic Evaluation of LLMs via Adaptive Reasoning Graph), a novel framework for generating test data with controlled complexity and diversity.
Methodology
The DARG framework introduces several key innovations:
- Reasoning Graph Extraction: The framework begins by extracting reasoning graphs from existing benchmarks. Nodes in these graphs represent basic reasoning units, while edges signify the relationships or operations between those units. This step leverages the in-context learning (ICL) capabilities of LLMs to ensure accurate graph representations.
- Graph Perturbation: The reasoning graphs are then perturbed to introduce varying degrees of complexity. Perturbations can modify numerical values as well as structural aspects such as the width and depth of the graph, creating new data points while preserving linguistic diversity similar to the original benchmarks.
- Graph-to-Text Decoding: The perturbed graphs are translated back into natural language using an LLM, guided by exemplars to preserve the linguistic style and coherence of the original data. To mitigate potential hallucinations by the LLM, the generated text then undergoes strict label verification by a code-augmented LLM agent.
- Evaluation and Analysis: The framework was applied to four distinct reasoning tasks - math (GSM8K), social (BBQ), spatial (BBH Navigate), and symbolic reasoning (BBH Dyck Language) - using 15 state-of-the-art LLMs. Performance was evaluated across the various complexity dimensions introduced by DARG, highlighting the LLMs' susceptibility to increased complexity.
Key Findings
The application of DARG yielded significant insights:
- Performance Degradation: As expected, almost all evaluated LLMs showed a decrease in performance with the increasing complexity of the generated test data. For instance, GPT-4 Turbo, although performing impressively on static benchmarks such as GSM8K, exhibited a substantial performance drop in more complex scenarios, underscoring the limitations of static benchmarking in capturing the true reasoning capabilities of LLMs.
- Bias Amplification: In the context of social reasoning tasks, like those found in BBQ, higher complexity data generated by DARG revealed an increase in biases, particularly against protected groups. LLMs like GPT-4 Turbo and Gemini-1.5-Pro exhibited heightened sensitivity and bias, choosing the "Cannot be determined" option even when evidence was clear, suggesting an over-alignment to ethical guidelines at the cost of accuracy.
- Model Size and Resilience: Larger models and those employing Mixture of Experts (MoE) architectures demonstrated better resilience to complexity increases. This was evident from models like Mixtral-8×22B outperforming vanilla models of similar size, suggesting that enhanced model architectures and scaling could be beneficial in tackling complex reasoning tasks.
- Training Data Utility: The paper also explored fine-tuning models on DARG-generated data, showing that models tuned on it handled increased complexity better than those tuned on traditional benchmark data. This highlights DARG's potential not just for evaluation but also for LLM enhancement.
Implications and Future Directions
The DARG framework provides a more nuanced and reliable assessment tool for LLM capabilities, offering several practical and theoretical implications:
- Dynamic Benchmarking: Transitioning to dynamic evaluation methods like DARG can offer a more accurate measure of an LLM's reasoning abilities, ensuring that benchmarks evolve in tandem with model capabilities.
- Bias and Fairness: The controlled perturbations in DARG can uncover latent biases in LLMs, providing researchers with valuable insights to inform the development of fairer and more ethical AI systems.
- Model Improvement: The utility of DARG-generated data for training indicates a promising direction for developing more robust model architectures capable of handling diverse and complex reasoning tasks.
Future research could extend the DARG framework to other domains beyond reasoning tasks, exploring its potential in natural language understanding and generation tasks. Additionally, further refinement of graph extraction and perturbation methods, possibly incorporating open-source models, could enhance the versatility and accessibility of DARG.
Conclusion
The DARG framework represents a significant advance in the dynamic evaluation of LLMs, addressing critical limitations of static benchmarks. By providing a means to generate controlled, diverse, and complex evaluation data, DARG offers a more accurate and comprehensive measure of LLM capabilities, thereby paving the way for further advancements in AI research and development.