- The paper introduces Dolphin, a framework automating scientific research through iterative idea generation, experimental verification, and feedback.
- It employs literature retrieval, LLM-based filtering, and exception-traceback-guided debugging to enhance research quality and code execution success.
- Evaluations on benchmarks such as ModelNet40 show that Dolphin-generated methods can reach performance comparable to the state of the art in tasks such as 3D point cloud classification.
Dolphin: Closed-Loop Auto-Research Through Thinking, Practice, and Feedback
The paper introduces Dolphin, a closed-loop, open-ended framework for automatic scientific research, which integrates idea generation, experimental verification, and results feedback (2501.03916). This framework aims to address limitations in existing AI-assisted research methods by automating the entire research process and incorporating experimental outcomes to refine subsequent ideas. Dolphin enhances research efficiency by improving the quality of generated ideas and the success rate of code execution, demonstrating its potential in tasks such as 3D point cloud classification.
Framework Overview
Dolphin's framework operates in a closed loop, mimicking the iterative nature of human scientific research (Figure 1).
Figure 1: Dolphin generates ideas, conducts experiments, and analyzes results in a closed-loop framework.
The framework begins by retrieving papers relevant to a given research topic, then filters them by topic and task-attribute relevance to support the generation of high-quality ideas. For each generated idea, Dolphin formulates an experimental plan, generates code, and debugs it through an exception-traceback-guided process. Finally, the experimental results are automatically analyzed and fed back into the idea generation phase to guide subsequent iterations.
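To make the loop concrete, here is a minimal Python sketch of one Dolphin-style iteration. The callables passed in (retrieve, filter_papers, generate_ideas, plan_and_code, run_experiment, analyze) are hypothetical placeholders for the LLM- and tool-backed steps described above, not the paper's actual API.

```python
# A minimal sketch of one Dolphin-style research loop. The injected callables
# are hypothetical placeholders for the LLM- and tool-backed steps described
# in the paper, not its actual API.

def auto_research_loop(topic, retrieve, filter_papers, generate_ideas,
                       plan_and_code, run_experiment, analyze, rounds=3):
    idea_bank = []                                          # memory shared across rounds
    for _ in range(rounds):
        papers = filter_papers(retrieve(topic), topic)      # retrieval + relevance filtering
        for idea in generate_ideas(topic, papers, idea_bank):
            code = plan_and_code(idea)                      # plan, then modify reference code
            result = run_experiment(code)                   # includes traceback-guided debugging
            feedback = analyze(result)                      # improvement / maintenance / decline
            idea_bank.append((idea, feedback))              # fed back into the next round
    return idea_bank
```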
Key Components and Methodologies
Idea Generation
Dolphin's idea generation emulates human researchers by first conducting a literature review and then formulating ideas grounded in the reviewed material. The process involves paper retrieval and ranking, followed by idea generation and filtering. Relevant papers are retrieved via the Semantic Scholar API and ranked by their relevance to the input topic and the alignment of their task attributes; an LLM scores each paper, and papers scoring below a threshold are discarded. Ideas are then generated by prompting an LLM to produce novel, non-redundant concepts, each comprising a title, an experimental plan, and a summary. Generated ideas undergo filtering to eliminate redundant or non-novel concepts: independence is checked by extracting sentence-level embeddings and computing cosine similarity against ideas already stored in an idea bank, discarding candidates that exceed a similarity threshold, while novelty is assessed by prompting an LLM to judge the idea against the retrieved papers.
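The independence check can be illustrated with a small sketch: embed each candidate idea's summary and discard it when its cosine similarity to any banked idea exceeds a threshold. The 0.8 threshold below is an illustrative assumption rather than a value taken from the paper, and the embeddings are assumed to come from any sentence-embedding model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_independent(candidate_embedding, banked_embeddings, threshold=0.8):
    """Keep an idea only if it is dissimilar to every idea already in the bank.

    `threshold` is an illustrative value, not one reported in the paper.
    """
    return all(cosine_similarity(candidate_embedding, banked) < threshold
               for banked in banked_embeddings)
```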
Experimental Verification
Experimental verification is crucial for validating AI-generated ideas, a step that many existing studies omit. Dolphin generates detailed experimental plans and modifies reference code accordingly. To improve the execution success rate, it employs an exception-traceback-guided debugging process that analyzes the local code structure implicated by the traceback. Dolphin extracts information from exception tracebacks, including function names, line numbers, and source lines, focusing on user-written code and excluding library calls. Guided by this information, the LLM reconstructs the relevant local code structure and makes the modifications needed for the code to run automatically (Figure 2).
Figure 2: Debugging is performed using traceback information to guide modifications to the local code structure.
This debugging process repeats until successful execution or a maximum number of attempts is reached.
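Python's standard traceback module gives a feel for the information this step works with: keep only frames that belong to the project's own files, record their function names, line numbers, and source lines, and hand them to an LLM repair step that retries execution up to a fixed number of attempts. The llm_fix callable and the in-process exec call below are simplifying assumptions; a real pipeline would more likely run the generated experiment as a separate process.

```python
import traceback

def extract_custom_frames(exc, project_root):
    """Collect (file, line, function, source) only for frames inside the project,
    skipping library frames, in the spirit of exception-traceback-guided debugging."""
    return [
        (f.filename, f.lineno, f.name, f.line)
        for f in traceback.extract_tb(exc.__traceback__)
        if f.filename.startswith(project_root)        # drop library function calls
    ]

def run_with_debugging(code, llm_fix, project_root=".", max_attempts=5):
    """Retry execution, asking `llm_fix` (an assumed LLM repair callable) to patch
    the code using only the traceback frames from the project's own files."""
    for _ in range(max_attempts):
        try:
            exec(code, {})                            # simplified execution step
            return code                               # success: return the working code
        except Exception as exc:
            code = llm_fix(code, extract_custom_frames(exc, project_root))
    return None                                       # gave up after max_attempts
```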
Results Feedback
Dolphin analyzes experimental results and categorizes them as improvement, maintenance, or decline relative to the reference code. The system discourages ideas that lead to stagnant or declining performance and promotes those that improve it. Embeddings of the summaries of ideas that maintain or improve performance are added to an idea bank, and new ideas that are too similar to previously ineffective ones are filtered out, avoiding redundant verification. Performance-enhancing ideas are integrated into the idea-generation prompt for subsequent loops.
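A minimal sketch of this feedback step, under simple assumptions: compare the new metric against the reference baseline, label the outcome, bank the summaries of non-declining ideas, and promote improving ideas into the next generation prompt. The tolerance value and the embed callable are illustrative, not taken from the paper.

```python
def categorize_result(new_score, baseline_score, tol=1e-3):
    """Label a run relative to the reference code. `tol` is an illustrative
    tolerance for treating two scores as equal, not a value from the paper."""
    if new_score > baseline_score + tol:
        return "improvement"
    if new_score < baseline_score - tol:
        return "decline"
    return "maintenance"

def feedback_step(idea, score, baseline, idea_bank, prompt_ideas, embed):
    """Bank the summaries of non-declining ideas and promote improving ones
    into the next generation prompt. `embed` is an assumed embedding callable."""
    category = categorize_result(score, baseline)
    if category in ("improvement", "maintenance"):
        idea_bank.append({"summary": idea["summary"],
                          "embedding": embed(idea["summary"])})
    if category == "improvement":
        prompt_ideas.append(idea["summary"])          # reused in subsequent loops
    return category
```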
Experimental Evaluation
The framework's effectiveness was evaluated on benchmark datasets including CIFAR-100, ModelNet40, and SST-2, covering 2D image classification, 3D point cloud classification, and sentiment classification, respectively. Results show that Dolphin generates ideas that improve performance over baselines such as WideResNet, PointNet, and BERT-base. For instance, methods Dolphin proposed on top of PointNet achieved performance comparable to state-of-the-art 3D classification methods. The quality of generated ideas also improved across loops through feedback, validating the effectiveness of the closed-loop design.
Case Studies
Case studies highlight Dolphin's ability to generate effective ideas and corresponding code (Figure 3).
Figure 3: The case studies compare ideas and codes generated by Dolphin with those of human researchers.
One example is a PointNet-based method that achieved an overall accuracy of 93.9% on ModelNet40, outperforming many human-designed methods and achieving performance comparable to state-of-the-art approaches such as GPSFormer. Further analysis showed that Dolphin's approach, PointNet-CSR, delivered better and faster performance than existing methods through a more concise architecture.
Conclusion
Dolphin presents a significant step toward automating scientific research by integrating idea generation, experimental verification, and results feedback in a closed-loop system. Its ability to generate effective ideas and improve performance across diverse tasks highlights its potential for advancing AI-driven scientific discovery. Future developments may involve integrating multi-disciplinary knowledge and building more powerful code models capable of understanding and modifying project-level code.