ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization (2502.04306v1)

Published 6 Feb 2025 in cs.CL

Abstract: Recent research has leveraged LLM multi-agent systems for complex problem-solving while trying to reduce the manual effort required to build them, driving the development of automated agent workflow optimization methods. However, existing methods remain inflexible due to representational limitations, a lack of adaptability, and poor scalability when relying on discrete optimization techniques. We address these challenges with ScoreFlow, a simple yet high-performance framework that leverages efficient gradient-based optimization in a continuous space. ScoreFlow incorporates Score-DPO, a novel variant of the direct preference optimization method that accounts for quantitative feedback. Across six benchmarks spanning question answering, coding, and mathematical reasoning, ScoreFlow achieves an 8.2% improvement over existing baselines. Moreover, it empowers smaller models to outperform larger ones with lower inference costs. Project: https://github.com/Gen-Verse/ScoreFlow

Summary

  • The paper introduces ScoreFlow, a novel framework that automates and optimizes LLM agent workflows using gradient-based methods.
  • It presents Score-DPO, an enhanced preference optimization technique that integrates evaluation scores for improved training efficiency.
  • Empirical results demonstrate an 8.2% improvement over baselines, with smaller models outperforming larger ones at lower inference cost.

The paper introduces ScoreFlow, a framework for automated optimization of LLM agent workflows, addressing limitations in flexibility, adaptability, and scalability found in existing methods. ScoreFlow leverages gradient-based optimization in a continuous space and introduces Score-DPO, a variant of direct preference optimization, which incorporates quantitative feedback.

The authors highlight the following contributions:

  • The ScoreFlow framework for automated agentic workflow generation and optimization.
  • Score-DPO, an optimization method integrating quantitative evaluation feedback into the preference optimization process.
  • Evaluations on question answering, coding, and mathematical reasoning benchmarks, achieving an 8.2% improvement over baselines, with smaller models outperforming larger ones at a lower inference cost.

The paper discusses automated optimization of prompts and hyperparameters as well as automated optimization of workflow structure. Prior methods are limited by inflexible workflow representations. AFlow represents workflows as code and optimizes them with Monte Carlo Tree Search, but its reliance on discrete optimization and its tendency to converge quickly lead to suboptimal outcomes.

The paper then discusses learning from preferences for LLMs, covering Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO).

PPO

PPO refines a policy model $\pi_{\theta}$ by maximizing the reward assigned to its generated responses, while maintaining a soft KL divergence constraint to prevent degeneration. The objective is expressed as:

$\mathbb{E}_{x \sim D_{\pi},\, y \sim \pi_{\theta}(y \mid x)} \big[ R_{\phi}(x, y) \big] - \beta\, \mathbb{D}_{KL}\big(\pi_{\theta} \,\|\, \pi_{ref}\big)$

  • $R_{\phi}(x, y)$: Reward model
  • $\pi_{\theta}$: Policy model
  • $\pi_{ref}$: Reference policy
  • $\beta$: Hyperparameter controlling the KL penalty
  • $\mathbb{D}_{KL}$: KL divergence
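
For concreteness, here is a minimal PyTorch-style sketch of this KL-regularized objective (not the full clipped PPO surrogate; the function name and batched-tensor interface are illustrative assumptions, not the paper's code):

```python
import torch

def ppo_objective(reward, logp_policy, logp_ref, beta=0.1):
    """KL-regularized reward objective (to be maximized), one value per sampled response.

    reward:      R_phi(x, y) for each response   (1-D torch tensor)
    logp_policy: log pi_theta(y | x)             (1-D torch tensor)
    logp_ref:    log pi_ref(y | x)               (1-D torch tensor)
    """
    kl = logp_policy - logp_ref          # single-sample estimate of the KL penalty term
    return (reward - beta * kl).mean()   # E[R(x, y)] - beta * KL(pi_theta || pi_ref)
```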

DPO

Direct Preference Optimization (DPO) facilitates direct policy optimization using preference data, eliminating the need for explicit reward models or active policy sampling. The DPO loss is expressed as:

$- \mathbb{E}_{(x, y_w, y_l) \sim D_{R}} \big[ \log \sigma \big( r(x, y_w) - r(x, y_l) \big) \big]$,

where $r(x, y) := \beta \log \big( \pi_{\theta^{\star}}(y \mid x) / \pi_{ref}(y \mid x) \big)$. The paper proposes Score-DPO, which integrates evaluation scores into the training process to enhance performance.
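
For reference, a minimal sketch of the standard DPO loss above (Score-DPO's score-weighted variant is sketched later), assuming sequence-level log-probabilities of the preferred response $y_w$ and dispreferred response $y_l$ are already computed under both the policy and the reference model; the function name and interface are illustrative, not the paper's code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss with implicit reward r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x))."""
    r_w = beta * (logp_w - ref_logp_w)   # implicit reward of the preferred response y_w
    r_l = beta * (logp_l - ref_logp_l)   # implicit reward of the dispreferred response y_l
    return -F.logsigmoid(r_w - r_l).mean()
```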

ScoreFlow

The goal is to determine the optimal workflow $G(q)$ for a given task $q$, where $G$ is the workflow generator. A workflow function $W_f$ maps the pair of task $q$ and agent set $V$, $(q, V)$, to executed results $W_f(q, V)$. The agent set $V$ consists of agents characterized by system prompts and temperature settings. A workflow is defined as the combination of an agent set and a workflow function, $(V, W_f)$. The optimization objective is to identify the optimal workflow generator:

$G^{\star} = \argmax_{G:\, \operatorname{Im}(G) \subset \mathcal{W}}\ \mathbb{E}_{q \in D} \big[ S(q, G(q)) \big]$,

where $D$ represents the dataset of tasks, and $S$ is a third-party evaluator.

The paper uses code as the representation of the workflow function $W_f$. The agents in $V$ are operators, with system prompts customizable by the generator $G$. The input to the generator consists of the task $q$ together with guidance on generation, including format requirements and introductions to the available operators, all formatted as a guidance prompt.

ScoreFlow Overview

ScoreFlow collects preference data by generating multiple workflows with the generator $G$, evaluating their execution results to obtain evaluation scores, and deriving preference pairs from these scores. Score-DPO is then used to fine-tune the generator on this preference dataset, and the updated generator is employed in the subsequent iteration.
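
A minimal sketch of one such round, assuming `generate`, `execute`, `evaluate`, and `finetune` interfaces that stand in for the generator, the workflow executor, the evaluator $S$, and the Score-DPO update (these names are illustrative, not the project's API); `build_preference_pairs` is sketched in the next subsection:

```python
def scoreflow_round(generate, execute, evaluate, finetune, tasks, k=8):
    """One ScoreFlow iteration: sample k workflows per task, score their executions,
    build preference pairs, and fine-tune the generator with Score-DPO."""
    dataset = []
    for q in tasks:
        workflows = [generate(q) for _ in range(k)]                # candidate workflows g_1(q)..g_k(q)
        scores = [evaluate(q, execute(q, w)) for w in workflows]   # evaluation scores s_i in [0, 1]
        dataset += build_preference_pairs(q, workflows, scores)    # sketched in the next subsection
    return finetune(dataset)                                       # Score-DPO update of the generator
```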

Quantitative Labeling of Preference Workflows

For each task $q$, $k$ workflows $g_i(q)$ are generated, with evaluation scores $s_i \in [0, 1]$. Preference pairs are constructed for task $q$ in the form $D_q = \big\{ \big( (q, g_i(q)), (q, g_j(q)) \big) \mid s_i > s_j \big\}$, and aggregated to form the complete preference dataset $D_{pre} = \bigcup_{q \in D} D_q$.
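
A sketch of this pair construction (the dictionary fields are an illustrative data layout, not the paper's format):

```python
from itertools import combinations

def build_preference_pairs(q, workflows, scores):
    """D_q = {((q, g_i), (q, g_j)) : s_i > s_j}; scores are kept for Score-DPO's weighting."""
    pairs = []
    for i, j in combinations(range(len(workflows)), 2):
        if scores[i] == scores[j]:
            continue  # ties carry no preference signal
        w, l = (i, j) if scores[i] > scores[j] else (j, i)
        pairs.append({"task": q, "chosen": workflows[w], "rejected": workflows[l],
                      "s_w": scores[w], "s_l": scores[l]})
    return pairs
```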

Optimization via Score-DPO

Score-DPO is proposed to address the slow convergence and suboptimal performance observed when directly using DPO to fine-tune the generator on the collected preference data. Score-DPO incorporates evaluation scores into the ranking objective.

The score-based Bradley-Terry (BT) ranking objective is defined as $\sigma(r_w^{\star} - r_l^{\star})$, where $r_w^{\star} := f(s_w)\, r_w$, $r_l^{\star} := (1 - f(s_l))\, r_l$, and $f(x): [0, 1] \rightarrow [0, 1]$ is a strictly monotonically increasing function. The loss function of Score-DPO is:

$\mathcal{L}_{\text{Score-DPO}} = - \mathbb{E}_{(w, l) \sim P^{\star}} \big[ \log \sigma \big( r^{\star}_{w} - r^{\star}_{l} \big) \big]$.
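
A minimal sketch of this loss, assuming the same log-probability interface as the DPO sketch above; how pairs are drawn from $P^{\star}$ (and any weighting by $d(s_w, s_l)$) is not reproduced here:

```python
import torch
import torch.nn.functional as F

def score_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, s_w, s_l,
                   beta=0.1, f=lambda s: s):
    """Score-DPO loss: r_w* = f(s_w) * r_w and r_l* = (1 - f(s_l)) * r_l,
    where r = beta * (log pi_theta - log pi_ref) is the implicit DPO reward."""
    r_w = beta * (logp_w - ref_logp_w)       # implicit reward of the higher-scored workflow
    r_l = beta * (logp_l - ref_logp_l)       # implicit reward of the lower-scored workflow
    r_w_star = f(s_w) * r_w                  # scale winner reward by f(s_w)
    r_l_star = (1.0 - f(s_l)) * r_l          # scale loser reward by 1 - f(s_l)
    return -F.logsigmoid(r_w_star - r_l_star).mean()
```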

Analysis of Score-DPO

The per-sample influence is defined as:

$I(z) = \frac{\partial}{\partial r_z}\, \mathbb{E}_{(w, l) \sim P^{\star}} \left[ \log \sigma \big( r^{\star}_{w} - r^{\star}_{l} \big) \cdot \mathds{1}_{z \in \{w, l\}} \right]$.

The paper includes the following theorem:

Let $d(x, y): [0, 1]^2 \rightarrow [0, 1]$ be strictly monotonically increasing with respect to $x - y$, and let $f(x): [0, 1] \rightarrow [0, 1]$ be strictly monotonically increasing in $x$. The per-sample influence for a sample $z$ is given by:

$I(z) = \mathbb{E}_{(w, l) \sim P} [ d(s_w, s_l) \sigma (r^{\star}_{l} - r^{\star}_{w}) \big( f(s_w) \mathds{1}_{w = z} - (1 - f(s_l)) \mathds{1}_{l = z} \big) ]$,

which is strictly monotonically increasing in the score $s_z$ when $-(1 - f(s_z))^{-1} \le r_z \le f^{-1}(s_z)$ holds.

Experiments

The paper uses six public datasets, covering math problems, question-answering problems, and coding problems. The full datasets were used for HumanEval and MBPP. For GSM8K, the 1,319 data points in the test set were used. For the MATH dataset, problems with a difficulty level of 5 were selected from Combinatorics and Probability, Number Theory, Pre-algebra, and Pre-calculus. For DROP and HotpotQA, 1,000 samples were randomly selected from each dataset. The data was split into validation and test sets using a 1:4 ratio.

The manually designed static workflow baselines include: direct LLM invocation, Chain of Thought (CoT), Self-Consistency CoT (CoT SC), MedPrompt, MultiPersona Debate (MPD), and Self-Refine (SR). The automated workflow optimization methods ADAS and AFlow were also compared.

Llama-3.1-8B-Instruct was used as the base model for the generator, and GPT-4o-mini as the executor. All experiments were run on 2 A6000 GPUs using LoRA (Low-Rank Adaptation) fine-tuning.

GPT-4o-mini was used as the judge model for MATH, DROP, and HotpotQA to avoid format-inconsistency issues. In each of the 3 iterations of the optimization process, $k = 8$ workflows were generated for each problem and their evaluation scores were obtained; the judge model was not used at this stage to reduce cost and computational overhead. For the reported results, the F1 score was used as the evaluation metric for DROP and HotpotQA, and the solve rate (evaluated 3 times and averaged) for the remaining datasets. For Score-DPO, $f(x) = x$ and $d(x, y) = (x - y)^3$ were used as the default choices, as in the snippet below.
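
As an illustration, these default choices can be written as plain Python functions and plugged into the Score-DPO sketch above (again, illustrative code, not the paper's implementation):

```python
# Default weighting choices reported in the experiments.
f = lambda x: x                     # f(x) = x
d = lambda x, y: (x - y) ** 3       # d(x, y) = (x - y)^3

# e.g. loss = score_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, s_w, s_l, f=f)
```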

Main Results

ScoreFlow consistently outperforms all manually designed workflow methods as well as the automated workflow optimization baselines across all benchmarks, achieving an average solve rate of 85.3% and surpassing the baseline methods by a margin of 8.2%. Both automated workflow optimization baselines consistently underperform ScoreFlow.

The authors demonstrated the utility of Score-DPO by comparing it against supervised fine-tuning (SFT), proximal policy optimization (PPO), and direct preference optimization (DPO); Score-DPO augments DPO with evaluation-score ranking information.

The loss-gradient optimization method, combined with an adaptive framework, enhances scalability by maintaining high performance when applied to more diverse and larger problem datasets. AFlow employs a standard discrete optimization method to optimize a single workflow, where a few failed cases are randomly selected and fed into the optimizer LLM to refine the workflow in each iteration.

In ScoreFlow, the task information is provided to the generator to facilitate the creation of adaptive workflows. Specifically, the generator has the flexibility to select appropriate operators and adapt the complexity of the workflow structure based on the characteristics of the given problem.

Ablation studies were conducted on both the generator and the executor. For the generator ablation, Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct were used. For the executor ablation, GPT-4o-mini, GPT-4o, DeepSeek-V3, and DeepSeek-coder were employed.

API costs were analyzed during the inference stage for the different methods, across 4 different executor versions, focusing on the HumanEval task. ScoreFlow enables weaker executor models to achieve better cost-effectiveness than stronger ones, balancing performance and resource usage.

The consistent increase in test solve rate, followed by its eventual convergence, demonstrates the effectiveness of the iterative approach.

In conclusion, ScoreFlow is an automated, high-performance, and adaptive framework for optimizing multi-agent workflows. The framework leverages Score-DPO to achieve robust and efficient optimization. By replacing traditional discrete optimization algorithms with loss-gradient-based optimization, flexibility and scalability are enhanced. Score-DPO reduces inaccuracies and variances in collected data pairs, thereby improving overall performance by incorporating evaluation scores directly into the optimization process.