
Do We Always Need Query-Level Workflows? Rethinking Agentic Workflow Generation for Multi-Agent Systems

Published 16 Jan 2026 in cs.AI | (2601.11147v1)

Abstract: Multi-Agent Systems (MAS) built on LLMs typically solve complex tasks by coordinating multiple agents through workflows. Existing approaches generate workflows either at task level or query level, but their relative costs and benefits remain unclear. After rethinking and empirical analyses, we show that query-level workflow generation is not always necessary, since a small set of top-K best task-level workflows together already covers equivalent or even more queries. We further find that exhaustive execution-based task-level evaluation is both extremely token-costly and frequently unreliable. Inspired by the idea of self-evolution and generative reward modeling, we propose a low-cost task-level generation framework SCALE: Self prediction of the optimizer with few-shot CALibration for Evaluation, instead of full validation execution. Extensive experiments demonstrate that SCALE maintains competitive performance, with an average degradation of just 0.61% compared to existing approaches across multiple datasets, while cutting overall token usage by up to 83%.


Explain it Like I'm 14

Overview

This paper looks at how to make teams of AI “agents” work together more efficiently to solve problems like math, coding, and reading comprehension. These teams follow a “workflow,” which is basically a plan that tells each agent what to do and when to talk to others. The authors ask two simple questions: Do we really need a brand-new plan for every single question? And do we need to spend a lot of time and computer power testing every plan in full? They show that the answer to both is often “no,” and they introduce a cheaper method called SCALE to pick good plans without heavy testing.

What questions does the paper try to answer?

  • Do we always need a custom workflow for each question (query-level), or can a small set of general workflows for the whole task (task-level) do just as well?
  • Is it necessary to fully run and test every candidate workflow on a large validation set, which is very costly, to find the best one?

How did they study it? (Methods in simple terms)

Think of multi-agent AI as a team sport:

  • Agents are teammates with different roles (like a problem solver, a checker, a summarizer).
  • A workflow is the playbook that tells teammates how to pass the ball and in what order to act.

There are two styles of playbooks:

  • Task-level: one good playbook that works for most games in a season.
  • Query-level: a custom playbook designed fresh for every single game.

The authors do two things:

  1. Rethink the need for custom playbooks for every game. They compare:
    • The best single task-level workflow (“Top-1”).
    • A small set of top task-level workflows (“Top-5”).
    • Running the same best workflow several times (to see how randomness helps).
    • A true per-question workflow generator (query-level).
  2. Rethink expensive testing. Fully testing each candidate workflow on many validation questions uses a lot of “tokens” (the chunks of text an AI reads/writes, which cost time and money). They show this full testing is both very expensive and not very helpful once you’re already near top performance.
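The Top-1 vs. Top-5 comparison boils down to a simple coverage count: how many questions does at least one workflow in the pool answer correctly? A minimal sketch with made-up results (the workflow names and correctness values are hypothetical, not from the paper):

```python
# results[w][q] = True if workflow w answered query q correctly (made-up data)
results = {
    "wf_A": [True, True, False, True, False, True],
    "wf_B": [True, False, True, True, False, False],
    "wf_C": [False, True, True, False, True, False],
}

def coverage(workflows, results):
    """Fraction of queries solved by at least one workflow in the pool."""
    n_queries = len(next(iter(results.values())))
    covered = sum(
        any(results[w][q] for w in workflows) for q in range(n_queries)
    )
    return covered / n_queries

top1 = coverage(["wf_A"], results)                   # best single workflow
top3 = coverage(["wf_A", "wf_B", "wf_C"], results)   # small pool

print(f"Top-1 coverage: {top1:.2f}")   # → 0.67
print(f"Top-3 coverage: {top3:.2f}")   # → 1.00
```

This is the intuition behind the paper's finding: even when no single playbook wins every game, a small pool of complementary playbooks covers nearly everything a per-question generator would.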

Then they propose SCALE, which is like having the coach predict how well a new playbook will do before actually playing all the games, and then checking that prediction with a tiny number of real plays:

  • Warm-up: briefly do the usual full tests for a few rounds to collect some ground truth.
  • Self-prediction: ask the same AI that edits/generates workflows to predict a workflow’s score without running it fully.
  • Few-shot calibration: run the workflow on a small sample (about 1–3% of the validation set) to adjust the prediction.
  • Use this calibrated score instead of full testing to guide which workflows to keep improving.

Analogy: It’s like test-driving a car on a short route, not every road in the city, and combining that with an expert’s estimate to judge the car’s overall quality.
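The self-prediction-plus-calibration step can be sketched in a few lines. The paper's exact calibration formula isn't given here, so this shows one simple possibility (an additive offset with a hypothetical blend weight), with made-up numbers:

```python
# Sketch of SCALE-style score calibration. The exact formula in the paper
# may differ; this illustrates one simple form: blending the optimizer's
# self-prediction toward accuracy observed on a tiny real sample.

def calibrated_score(self_predicted, small_sample_results):
    """Adjust a self-predicted workflow score using a tiny real sample.

    self_predicted: the optimizer LLM's guessed accuracy (0-1).
    small_sample_results: correctness (True/False) on ~1-3% of validation.
    """
    sample_accuracy = sum(small_sample_results) / len(small_sample_results)
    offset = sample_accuracy - self_predicted
    return self_predicted + 0.5 * offset  # hypothetical blend weight

# Made-up example: the optimizer predicts 0.80, but only 6 of 10 sampled
# queries succeed, so the score is pulled down toward 0.60.
score = calibrated_score(0.80, [True] * 6 + [False] * 4)
print(round(score, 2))  # → 0.7
```

The key design point is that the expensive part (running workflows on validation queries) happens on only a handful of queries; the cheap part (the optimizer's own guess) carries the rest.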

What did they find, and why does it matter?

Here are the main takeaways across several benchmarks (math, coding, and reading tasks):

  • A few general workflows go a long way. A single top task-level workflow already works well for many questions. Keeping a small pool of the top 5 workflows covers even more questions—often as many as making a brand-new workflow for each question.
  • Re-running the same strong workflow multiple times covers nearly as many questions as query-level methods, showing that some of the gains come from randomness in the AI’s answers, not necessarily from designing a unique plan each time.
  • Full testing is very costly and often doesn’t help much. As they generate more candidate workflows, the cost skyrockets while performance barely improves. Even the top 5 workflows look very similar in quality, so heavy testing doesn’t clearly separate winners from almost-winners.
  • SCALE keeps performance high while slashing cost. Compared to a popular task-level method (AFlow), SCALE cuts token usage by up to 83% while losing only about 0.61% in accuracy on average. That’s a big savings for a tiny drop in performance.
  • The calibrated predictions are reliable. The combined self-prediction plus small-sample calibration tracks the true full-test scores closely and ranks workflows in a similar order, which is what you need to search effectively.
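"Ranks workflows in a similar order" can be checked with a rank correlation such as Spearman's. A small self-contained sketch with hypothetical scores (not the paper's numbers), assuming no tied values:

```python
# Check that calibrated predictions order candidate workflows roughly the
# same way full validation runs do, using Spearman rank correlation.

def ranks(values):
    """Rank positions (0 = lowest); no tie handling, for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman correlation via the classic 1 - 6*sum(d^2)/(n(n^2-1)) formula."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

full_test  = [0.81, 0.74, 0.69, 0.77, 0.62]  # hypothetical true scores
calibrated = [0.79, 0.73, 0.70, 0.75, 0.60]  # hypothetical predictions

print(spearman(full_test, calibrated))  # → 1.0 (identical ordering here)
```

For search, the ordering is what matters: as long as the calibrated score puts the genuinely better workflows ahead, the optimizer keeps improving the right candidates even if the absolute scores are slightly off.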

Why it matters: If you can solve problems just as well with far fewer tokens, you save money, speed up systems, and make multi-agent AI more practical in real-world settings.

What’s the bigger impact?

This work suggests a smarter, cheaper way to build and improve multi-agent AI:

  • You don’t always need a custom plan for each question. Reusing a small set of strong, general plans can be enough for many tasks.
  • You can avoid expensive full testing by letting the AI estimate its own workflow quality and verifying that estimate with a small sample. This keeps quality high while cutting costs dramatically.
  • In practice, this means faster development cycles, lower compute bills, and more sustainable AI systems that still perform well.

Looking ahead, combining the strengths of both worlds—general task-level plans and selective per-question tweaks—could push performance even further without bringing back the high costs.
