Revisiting Prompt Optimization with Large Reasoning Models - A Case Study on Event Extraction (2504.07357v1)

Published 10 Apr 2025 in cs.CL

Abstract: Large Reasoning Models (LRMs) such as DeepSeek-R1 and OpenAI o1 have demonstrated remarkable capabilities in various reasoning tasks. Their strong capability to generate and reason over intermediate thoughts has also led to arguments that they may no longer require extensive prompt engineering or optimization to interpret human instructions and produce accurate outputs. In this work, we aim to systematically study this open question, using the structured task of event extraction for a case study. We experimented with two LRMs (DeepSeek-R1 and o1) and two general-purpose LLMs (GPT-4o and GPT-4.5), when they were used as task models or prompt optimizers. Our results show that on tasks as complicated as event extraction, LRMs as task models still benefit from prompt optimization, and that using LRMs as prompt optimizers yields more effective prompts. Finally, we provide an error analysis of common errors made by LRMs and highlight the stability and consistency of LRMs in refining task instructions and event guidelines.

Summary

  • The paper demonstrates that advanced LRMs, when optimized with MCTS, achieve significant extraction accuracy improvements, with AC F1 gains up to +23%.
  • The paper employs an iterative Monte Carlo Tree Search framework to refine prompts, leading to faster convergence and more stable performance compared to general LLMs.
  • The paper finds that LRMs, notably DeepSeek-R1, generate concise and effective prompts that reduce key event extraction errors and outperform traditional LLM approaches.

This paper investigates whether advanced Large Reasoning Models (LRMs), like DeepSeek-R1 and OpenAI's o1, still require prompt optimization for complex tasks, specifically focusing on end-to-end Event Extraction (EE). The paper compares LRMs against general-purpose LLMs like GPT-4.5 and GPT-4o, evaluating them both as task models performing EE and as optimizers refining prompts (2504.07357).

Methodology:

The paper employs a Monte Carlo Tree Search (MCTS) framework, similar to PromptAgent [wang2024promptagent], to optimize prompts for EE. The EE task involves identifying event triggers and their arguments according to predefined schemas. Prompts are represented using Python code, consisting of a task instruction and event schema guidelines. The optimization process iteratively refines both the task instruction and the event guidelines using an optimizer model ($\mathcal{M}_{opt}$) based on errors made by the task model ($\mathcal{M}_{task}$) on batches of training data.
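
The exact prompt template is not reproduced in this summary, but since the prompts are Python code combining a task instruction with event-schema guidelines, a minimal sketch of such a prompt might look like the following. The class names (`Attack`, `Transport`), role attributes, and the `build_prompt` helper are illustrative assumptions modeled on ACE05 event types, not the paper's actual template.

```python
# Minimal sketch of a code-style EE prompt (the exact template is an assumption,
# not reproduced from the paper). The prompt has two editable parts: a task
# instruction and per-event-type guidelines, both of which the optimizer may rewrite.
import inspect

TASK_INSTRUCTION = (
    "You are an event extraction system. Given a passage, identify every event "
    "trigger, assign it one of the event types defined below, and fill each "
    "argument role with exact text spans from the passage."
)

class Attack:
    """Conflict.Attack: a violent physical act causing harm or damage.
    The trigger is the word that most directly expresses the attack."""
    attacker: list[str]   # entities carrying out the attack
    target: list[str]     # entities or objects being attacked
    place: list[str]      # location where the attack occurs

class Transport:
    """Movement.Transport: an artifact or person is moved from one place to another."""
    agent: list[str]
    artifact: list[str]
    origin: list[str]
    destination: list[str]

def build_prompt(passage: str) -> str:
    """Assemble the full prompt handed to the task model for one passage."""
    guidelines = "\n\n".join(inspect.getsource(cls) for cls in (Attack, Transport))
    return f"{TASK_INSTRUCTION}\n\n{guidelines}\n\nPassage: {passage}\nExtracted events:"
```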

  • Task Model ($\mathcal{M}_{task}$): An LRM or LLM performing zero-shot EE using a given prompt.
  • Optimizer Model ($\mathcal{M}_{opt}$): An LRM or LLM analyzing errors from $\mathcal{M}_{task}$ and generating feedback to refine the prompt.
  • Process:

    1. $\mathcal{M}_{task}$ generates EE outputs for a batch of inputs using the current prompt.
    2. Errors in the outputs are identified (e.g., parsing errors, incorrect spans).
    3. $\mathcal{M}_{opt}$ analyzes the errors and generates structured feedback.
    4. $\mathcal{M}_{opt}$ uses the feedback to generate an updated prompt (task instruction + event guidelines).
    5. The quality of the new prompt is evaluated on a development set, using the average F1 across EE sub-tasks as the reward. (A minimal sketch of one such iteration is given after this list.)

  • Dataset: Experiments use subsets of the ACE05 dataset (10 event types): ACE$_{\text{low}}$ (15 training samples) and ACE$_{\text{med}}$ (120 training samples), plus development and test sets.
  • Evaluation: Performance is measured using F1 scores for Trigger Identification (TI), Trigger Classification (TC), Argument Identification (AI), and Argument Classification (AC).
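
To make the five-step loop concrete, below is a minimal sketch of one refinement iteration. It assumes `task_model` and `optimizer_model` are callables wrapping the respective LLM/LRM APIs, and `find_errors`, `format_errors`, and `evaluate_f1` are hypothetical helpers for error detection, feedback formatting, and sub-task scoring; the paper's actual implementation explores many candidate prompts inside an MCTS search rather than this single linear pass.

```python
from statistics import mean


def find_errors(pred, gold):
    """Placeholder error check: in practice this would compare predicted and gold
    triggers/arguments and describe parsing errors, wrong spans, missed events, etc."""
    return [] if pred == gold else [f"predicted {pred!r}, expected {gold!r}"]


def format_errors(per_example_errors):
    """Placeholder: flatten per-example error lists into one readable block of text."""
    return "\n".join(err for errors in per_example_errors for err in errors)


def refine_once(prompt, train_batch, dev_set, task_model, optimizer_model, evaluate_f1):
    """One prompt-refinement iteration (illustrative sketch, not the paper's code;
    the paper wraps steps like this inside an MCTS search over candidate prompts).

    train_batch: list of (input_text, gold_annotation) pairs.
    task_model / optimizer_model: callables that send a prompt to the respective
    model and return its text output.
    evaluate_f1(prompt, dev_set, subtask): assumed scorer for one EE sub-task.
    """
    # 1. Task model performs zero-shot EE on a batch of training inputs.
    predictions = [task_model(prompt, text) for text, _ in train_batch]

    # 2. Identify errors in the outputs (parsing failures, incorrect spans, ...).
    errors = [find_errors(pred, gold) for pred, (_, gold) in zip(predictions, train_batch)]

    # 3. Optimizer model analyzes the errors and produces structured feedback.
    feedback = optimizer_model(
        "Analyze these event extraction errors and give structured feedback:\n"
        + format_errors(errors)
    )

    # 4. Optimizer model rewrites the task instruction and event guidelines.
    new_prompt = optimizer_model(
        "Revise the prompt below according to the feedback.\n\n"
        f"Prompt:\n{prompt}\n\nFeedback:\n{feedback}"
    )

    # 5. Reward = average F1 over the four EE sub-tasks on the development set.
    reward = mean(evaluate_f1(new_prompt, dev_set, sub) for sub in ("TI", "TC", "AI", "AC"))
    return new_prompt, reward
```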

Key Findings:

  1. LRMs Benefit from Prompt Optimization: Even advanced LRMs show substantial performance gains from prompt optimization on the complex EE task. On ACE$_{\text{med}}$, DeepSeek-R1 and o1 achieved AC F1 score improvements of approximately +23% after a single MCTS optimization step. These gains were generally larger than those observed for LLMs (GPT-4.5: +20%, GPT-4o: +14%). Optimized LRMs significantly outperformed their non-optimized versions and also tended to outperform optimized LLMs.
  2. LRM Performance under Full-Scale MCTS: Extending MCTS to depth 5 yielded further, albeit smaller, improvements. LRMs continued to benefit more than LLMs (e.g., DeepSeek-R1 gained an additional +4.26% AC, while LLMs gained ~1-2%). LRMs scaled more consistently, converged faster, and maintained their performance advantage on the test set.
  3. LRMs as Better Optimizers: When used as $\mathcal{M}_{opt}$, LRMs (especially DeepSeek-R1) produced more effective prompts than LLMs. This was particularly evident in the low-resource setting (ACE$_{\text{low}}$). The prompts optimized by LRMs often contained more precise extraction rules, heuristics, and exception-handling cases, resembling human annotation guidelines. DeepSeek-R1, in particular, generated shorter yet highly effective prompts.
  4. Efficiency and Stability of LRMs as Optimizers: LRMs, notably DeepSeek-R1, guided task models to peak performance more efficiently (at shallower MCTS depths) and with greater stability (lower variance across different optimization paths) compared to LLM optimizers like GPT-4.5.

Further Analysis:

  • Prompt Quality: Survival plot analysis showed that DeepSeek-R1 as an optimizer generated a higher proportion of high-performing prompts compared to other optimizers.
  • Prompt Length: DeepSeek-R1 achieved its best performance with significantly shorter prompts (~1750 tokens) compared to o1, GPT-4.5, and GPT-4o, suggesting a preference for concise instructions when acting as the task model.
  • Feedback Adherence: DeepSeek-R1 demonstrated a more targeted approach to applying feedback, often refining only the specific event guidelines mentioned in the feedback and sometimes refusing edits if it deemed the original guideline adequate. Other models tended to rewrite larger portions of the guidelines irrespective of specific feedback.
  • Error Analysis: Prompts optimized by LRMs helped reduce common EE errors, particularly trigger-related mistakes (identifying multiple or implicit events) and somewhat mitigated argument-level errors (coreferences, span overprediction).

Conclusion:

The paper concludes that prompt optimization remains highly beneficial even for advanced LRMs when tackling complex, structured tasks like event extraction. LRMs not only gain more from optimization than LLMs but also serve as superior prompt optimizers, generating effective, often concise, and robust prompts that lead to faster and more stable convergence. This underscores the continued importance of prompt engineering, even as model reasoning capabilities improve (2504.07357).
