- The paper demonstrates that advanced LRMs achieve substantial extraction accuracy improvements when their prompts are optimized with MCTS, with argument classification (AC) F1 gains of up to +23%.
- The paper employs an iterative Monte Carlo Tree Search (MCTS) framework to refine prompts; under this framework, LRMs converge faster and perform more stably than general-purpose LLMs.
- The paper finds that LRMs, notably DeepSeek-R1, generate concise yet effective prompts that reduce common event extraction errors and outperform prompts produced by general-purpose LLMs.
This paper investigates whether advanced Large Reasoning Models (LRMs), like DeepSeek-R1 and OpenAI's o1, still require prompt optimization for complex tasks, specifically focusing on end-to-end Event Extraction (EE). The paper compares LRMs against general-purpose LLMs like GPT-4.5 and GPT-4o, evaluating them both as task models performing EE and as optimizers refining prompts (2504.07357).
Methodology:
The paper employs a Monte Carlo Tree Search (MCTS) framework, similar to PromptAgent [wang2024promptagent], to optimize prompts for EE. The EE task involves identifying event triggers and their arguments according to predefined schemas. Prompts are represented as Python code consisting of a task instruction and event schema guidelines. The optimization process iteratively refines both the task instruction and the event guidelines with an optimizer model (Mopt), guided by the errors the task model (Mtask) makes on batches of training data.
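To make this representation concrete, a minimal sketch of such a prompt is shown below; the instruction wording, class name, docstring guideline, and role comments are hypothetical illustrations, not the paper's actual prompt text.

```python
# Hypothetical sketch of a Python-code prompt for end-to-end EE:
# a natural-language task instruction paired with event schema guidelines
# expressed as annotated Python classes (names and wording are illustrative).

TASK_INSTRUCTION = """
Extract all events from the input text. For each event, identify the trigger
span, classify its event type, and fill the argument roles defined in the
event guidelines below. Return the result as instances of the classes below.
"""

class ConflictAttack:
    """Guideline: an Attack event is a violent act intended to cause harm.
    The trigger is the word that most directly expresses the attack."""
    attacker: list[str]    # the attacking or instigating agent(s)
    target: list[str]      # the person or object being attacked
    instrument: list[str]  # the device or weapon used in the attack
    place: list[str]       # where the attack takes place
```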
- Task Model (Mtask): An LRM or LLM performing zero-shot EE using a given prompt.
- Optimizer Model (Mopt): An LRM or LLM analyzing errors from Mtask and generating feedback to refine the prompt.
- Process (sketched in code after this list):
1. Mtask generates EE outputs for a batch of inputs using the current prompt.
2. Errors in the outputs are identified (e.g., parsing errors, incorrect spans).
3. Mopt analyzes errors and generates structured feedback.
4. Mopt uses the feedback to generate an updated prompt (task instruction + event guidelines).
5. The quality of the new prompt is evaluated on a development set (using average F1 across EE sub-tasks as reward).
- Dataset: Experiments use subsets of the ACE05 dataset (10 event types): ACE-low (15 training samples) and ACE-med (120 training samples), plus development and test sets.
- Evaluation: Performance is measured using F1 scores for Trigger Identification (TI), Trigger Classification (TC), Argument Identification (AI), and Argument Classification (AC).
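To illustrate the control flow of steps 1-5 above and the reward computation, here is a minimal sketch in Python; the helper callables (run_task_model, collect_errors, generate_feedback, rewrite_prompt, f1_scores) are hypothetical wrappers around the Mtask/Mopt calls and the dev-set scorer, not the paper's implementation.

```python
from statistics import mean
from typing import Callable, Sequence

def optimize_step(
    prompt: str,
    train_batch: Sequence[dict],
    dev_set: Sequence[dict],
    run_task_model: Callable[[str, dict], dict],        # Mtask: (prompt, input) -> EE output
    collect_errors: Callable[[list, Sequence[dict]], list],
    generate_feedback: Callable[[str, list], str],       # Mopt: (prompt, errors) -> feedback
    rewrite_prompt: Callable[[str, str], str],           # Mopt: (prompt, feedback) -> new prompt
    f1_scores: Callable[[str, Sequence[dict]], tuple],   # dev-set F1 for (TI, TC, AI, AC)
) -> tuple[str, float]:
    """One prompt-refinement step (a single MCTS node expansion), following steps 1-5."""
    # 1. Mtask performs zero-shot EE on the training batch with the current prompt.
    outputs = [run_task_model(prompt, example) for example in train_batch]
    # 2. Identify errors in the outputs (parsing failures, wrong spans, wrong types, ...).
    errors = collect_errors(outputs, train_batch)
    # 3. Mopt analyzes the errors and produces structured feedback.
    feedback = generate_feedback(prompt, errors)
    # 4. Mopt rewrites the task instruction and event guidelines using that feedback.
    new_prompt = rewrite_prompt(prompt, feedback)
    # 5. Reward = average F1 over the four EE sub-tasks (TI, TC, AI, AC) on the dev set.
    reward = mean(f1_scores(new_prompt, dev_set))
    return new_prompt, reward
```

In the full MCTS search, each such expansion yields a child prompt node whose reward guides which branches are explored further.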
Key Findings:
- LRMs Benefit from Prompt Optimization: Even advanced LRMs show substantial performance gains from prompt optimization on the complex EE task. On ACEmed, DeepSeek-R1 and o1 achieved AC F1 score improvements of approximately +23% after a single MCTS optimization step. These gains were generally larger than those observed for LLMs (GPT-4.5: +20%, GPT-4o: +14%). Optimized LRMs significantly outperformed their non-optimized versions and also tended to outperform optimized LLMs.
- LRM Performance under Full-Scale MCTS: Extending MCTS to depth 5 yielded further, albeit smaller, improvements. LRMs continued to benefit more than LLMs (e.g., DeepSeek-R1 gained an additional +4.26% AC, while LLMs gained ~1-2%). LRMs scaled more consistently, converged faster, and maintained their performance advantage on the test set.
- LRMs as Better Optimizers: When used as Mopt, LRMs (especially DeepSeek-R1) produced more effective prompts than LLMs. This was particularly evident in the low-resource setting (ACE-low). The prompts optimized by LRMs often contained more precise extraction rules, heuristics, and exception handling cases, resembling human annotation guidelines. DeepSeek-R1, in particular, generated shorter yet highly effective prompts.
- Efficiency and Stability of LRMs as Optimizers: LRMs, notably DeepSeek-R1, guided task models to peak performance more efficiently (at shallower MCTS depths) and with greater stability (lower variance across different optimization paths) compared to LLM optimizers like GPT-4.5.
Further Analysis:
- Prompt Quality: Survival plot analysis showed that DeepSeek-R1 as an optimizer generated a higher proportion of high-performing prompts compared to other optimizers.
- Prompt Length: DeepSeek-R1 achieved its best performance with significantly shorter prompts (~1750 tokens) compared to o1, GPT-4.5, and GPT-4o, suggesting a preference for concise instructions when acting as the task model.
- Feedback Adherence: DeepSeek-R1 demonstrated a more targeted approach to applying feedback, often refining only the specific event guidelines mentioned in the feedback and sometimes refusing edits if it deemed the original guideline adequate. Other models tended to rewrite larger portions of the guidelines irrespective of specific feedback.
- Error Analysis: Prompts optimized by LRMs helped reduce common EE errors, particularly trigger-related mistakes (identifying multiple or implicit events) and somewhat mitigated argument-level errors (coreferences, span overprediction).
Conclusion:
The paper concludes that prompt optimization remains highly beneficial even for advanced LRMs when tackling complex, structured tasks like event extraction. LRMs not only gain more from optimization than LLMs but also serve as superior prompt optimizers, generating effective, often concise, and robust prompts that lead to faster and more stable convergence. This underscores the continued importance of prompt engineering, even as model reasoning capabilities improve (2504.07357).