
Coordinator-Executor-State Tracker (CES)

Updated 4 December 2025
  • Coordinator-Executor-State Tracker (CES) is a modular framework that decouples high-level strategic planning from low-level perceptual execution in long-horizon GUI tasks.
  • Its three components—the Coordinator, Executor, and State Tracker—interact via clear API boundaries to improve planning accuracy and progress tracking.
  • Empirical evaluations show that CES significantly enhances GUI automation success rates and minimizes state-loss errors through staged reinforcement learning.

The Coordinator-Executor-State Tracker (CES) framework is a staged, modular multi-agent architecture designed to address long-horizon task automation problems, particularly in GUI environments. CES decouples high-level strategic scheduling and state management from low-level perceptual execution, resolving prevalent issues such as capability coupling, responsibility ambiguity, and loss of contextual awareness in single-agent models. CES comprises three principal components: the Coordinator (strategic task-decomposer), the Executor (pixel-level action performer), and the State Tracker (semantic memory compressor), each trained and evaluated with distinct algorithmic methodologies on compositional benchmarks. The resulting system yields robust improvements in planning accuracy, progress tracking, and fine-grained automation reliability, validated across several long-horizon GUI tasks (Deng et al., 27 Nov 2025).

1. Framework Architecture and Component Functions

CES is structured as a multi-agent, staged RL system with explicit inter-agent API boundaries. At each interaction step $t$, the Coordinator ingests a triple $(q, m^{t-1}, s^t)$, where $q$ is the original task specification, $m^{t-1}$ is the memory summary from the previous timestep, and $s^t$ is the current screen image. It emits an atomic instruction $l^t$ that expresses a single, contextually grounded sub-action, for example "click the 'Compose' button" or "rename meeting to 'Business Review'".
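
A minimal sketch of this per-step interface follows; the container and function names are illustrative assumptions, since the source does not prescribe a concrete API.

from dataclasses import dataclass
from typing import Callable

@dataclass
class CoordinatorInput:
    task_spec: str        # q: original natural-language task specification
    memory_summary: str   # m^{t-1}: State Tracker summary from the previous step
    screenshot: bytes     # s^t: current screen image

def coordinate(policy: Callable[[CoordinatorInput], str], x: CoordinatorInput) -> str:
    # Returns one atomic instruction, e.g. "click the 'Compose' button".
    return policy(x)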

The Executor, assumed to be either a frozen RL or supervised policy trained for generic GUI manipulation, translates $l^t$ and $s^t$ into a low-level chain-of-thought $u^t = (th^t, a^t)$, consisting of a reasoning trace and associated GUI primitives. Simultaneously, the State Tracker consumes $(q, m^{t-1}, u^t)$ and produces the next summary $m^t$, a unidirectional language condensation of intermediate progress.
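
The Executor-to-State-Tracker data flow can be pictured roughly as below; the field names and the `summarize` callable are assumptions for illustration, not the paper's API.

from dataclasses import dataclass
from typing import Callable

@dataclass
class ExecutorOutput:
    thought: str   # th^t: low-level reasoning trace
    action: dict   # a^t: GUI primitive, e.g. {"type": "click", "bbox": [0.42, 0.10, 0.55, 0.14]}

def update_memory(summarize: Callable[..., str], q: str, m_prev: str, u: ExecutorOutput) -> str:
    # State Tracker: compress (q, m^{t-1}, u^t) into the next summary m^t.
    return summarize(q, m_prev, u.thought, u.action)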

The components interact exclusively via sharply delimited API calls: Coordinator → Executor (atomic instruction), Executor → State Tracker (chain-of-thought + action), State Tracker → Coordinator (memory summary). Semantic tags (<think>, <answer>) are employed for message disambiguation and parsing.
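
As a rough illustration of how such tagged messages might be parsed (the exact tag grammar here is an assumption):

import re

def parse_tagged(message: str) -> tuple[str, str]:
    # Split a component's reply into its reasoning (<think>) and payload (<answer>).
    think = re.search(r"<think>(.*?)</think>", message, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", message, re.S)
    return (think.group(1).strip() if think else "",
            answer.group(1).strip() if answer else "")

reasoning, instruction = parse_tagged(
    "<think>The calendar app is open; the event still needs a title.</think>"
    "<answer>type 'Business Review' into the title field</answer>")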

2. Formal Model and Policy Definitions

CES formalizes GUI automation as a Markov Decision Process (MDP) with augmented multimodal state space and decomposable action space:

  • State: $x^t = (q, m^{t-1}, s^t) \in \mathcal{X}$
  • Action (Coordinator): $l^t \sim \pi_c(\cdot \mid x^t)$, with $l^t \in \mathcal{A}_c$
  • Memory Update (State Tracker): $m^t = \pi_s(q, m^{t-1}, u^t)$

The strategic task decomposition process is learned via a parameterized policy $\pi_c$ with parameters $\theta_c$, typically instantiated as a multimodal transformer with distinct heads for atomic-instruction generation and justification (<think>/<answer>). The State Tracker is separately realized as a natural-language LLM (Qwen3-4B) fine-tuned for summarization and progress retention.

An abstract policy-evaluation loop is:

# One episode of the CES interaction loop; pi_c, executor, and pi_s denote the
# Coordinator, Executor, and State Tracker policies, respectively.
for t in range(1, T + 1):
    x_t = (q, m_prev, s_t)             # task spec, previous summary, current screen
    l_t = pi_c(x_t)                    # Coordinator: atomic instruction
    u_t, s_next = executor(l_t, s_t)   # Executor: chain-of-thought + resulting screen
    m_t = pi_s(q, m_prev, u_t)         # State Tracker: compressed progress summary
    m_prev, s_t = m_t, s_next          # carry state forward to step t+1

3. Reinforcement-Learning Algorithm and Staged Training

CES employs staged reinforcement learning using Group Relative Policy Optimization (GRPO), a variant of PPO adapted for multi-action candidate selection. The Coordinator is first trained while the State Tracker supplies ground-truth summaries, optimizing the clipped surrogate objective:

$$J(\theta_c) = \frac{1}{N} \sum_{i=1}^{N} \min\!\Big( \rho_i(\theta_c)\,\widehat{A}_i,\; \mathrm{clip}\big(\rho_i(\theta_c),\, 1-\epsilon,\, 1+\epsilon\big)\,\widehat{A}_i \Big) - \beta\, D_{\mathrm{KL}}\!\big(\pi_{\theta_c} \,\|\, \pi_{\mathrm{ref}}\big)$$

where $\widehat{A}_i$ is the advantage, $\rho_i(\theta_c)$ is the importance ratio, $\epsilon$ is the PPO clipping threshold, and $D_{\mathrm{KL}}$ is the KL penalty against the reference policy $\pi_{\mathrm{ref}}$.
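
A condensed sketch of this objective in PyTorch, for one group of candidate instructions; the group-relative advantage normalization and the hyperparameter defaults are illustrative assumptions.

import torch

def grpo_loss(logp_new, logp_old, rewards, kl_to_ref, eps=0.2, beta=0.01):
    # Group-relative advantage: standardize rewards within the candidate group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old)                    # rho_i(theta_c)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    surrogate = torch.min(unclipped, clipped).mean()          # clipped PPO-style term
    return -(surrogate - beta * kl_to_ref)                    # negate J for gradient descent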

Rewards for each candidate instruction are a weighted sum:

$$r_i^t = \alpha_1 R_{\text{format}}(l_i^t) + \alpha_2 R_{\text{executor}}(a_i^t)$$

$$R_{\text{executor}}(a_i^t) = \gamma_1 R_{\text{type}}(a_i^t) + \gamma_2 R_{\text{param}}(a_i^t)$$

These quantify both correct interface syntax and executor action fidelity (type, argument bounding box, F1 text similarity).
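
For concreteness, the reward composition might look like the following; the weight values are placeholders, not reported settings.

def candidate_reward(r_format, r_type, r_param,
                     alpha1=0.1, alpha2=0.9, gamma1=0.5, gamma2=0.5):
    # R_executor combines action-type correctness and argument fidelity
    # (bounding box / F1 text similarity); the format reward checks interface syntax.
    r_executor = gamma1 * r_type + gamma2 * r_param
    return alpha1 * r_format + alpha2 * r_executor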

Once the Coordinator is trained, its parameters are frozen. The State Tracker is then trained with the Coordinator and Executor fixed, so that its output summaries maximize downstream instruction reward under the fixed top-level policy.

This staged training avoids responsibility confusion and ensures that both planning and context compression are tuned for actual execution utility, rather than internal metrics.
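
A schematic of the two stages, assuming PyTorch-style modules and a hypothetical `train_with_grpo` routine (neither is specified in the source):

def freeze(module):
    # Detach a component from optimization once its stage is complete.
    for p in module.parameters():
        p.requires_grad = False

def staged_training(coordinator, state_tracker, executor, train_with_grpo):
    freeze(executor)                                    # Executor stays fixed throughout
    # Stage 1: train the Coordinator while ground-truth summaries stand in for the State Tracker.
    train_with_grpo(coordinator, summaries="ground_truth")
    freeze(coordinator)
    # Stage 2: train the State Tracker so its summaries maximize downstream instruction reward.
    train_with_grpo(state_tracker, reward="downstream_instruction")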

4. State Tracker: Context Compression and Progress Management

The State Tracker module maintains $m^t$, a natural-language summary engineered to encapsulate GUI progress and anticipated next steps. Internally, this is realized as a unidirectional LLM trained on paired (prior summary, executor output) → next-summary exemplars, with a preference for memory-efficient, semantically rich compressions.

The summary $m^t$ enables the Coordinator to focus on strategic decomposition rather than perceptual recapitulation; the model has access to explicit progress markers ("calendar opened," "meeting named," "invite sent") rather than having to reconstruct history from pixel-level data.
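
An invented exemplar of the kind of (prior summary, executor output) → next-summary pair the State Tracker is trained on:

exemplar = {
    "task": "Schedule a 'Business Review' meeting and invite the team",
    "prior_summary": "Calendar opened; new event created but still untitled.",
    "executor_output": "Typed 'Business Review' into the event title field.",
    "next_summary": "Calendar opened; meeting named 'Business Review'; invites not yet sent.",
}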

Empirically, the absence of an effective State Tracker leads to "State-Loss" failures, in which the Coordinator issues substeps redundant with prior ones or omits critical progress details.

5. Empirical Evaluation and Performance Metrics

CES has been tested on three established long-horizon GUI task benchmarks: AITZ (mean 7.5 steps), AMEX (mean 12.8 steps), and GUI-Odyssey (mean 15.3 steps). Three metrics are evaluated: type accuracy (correct sub-task classification), grounding rate (correct low-level actuation), and overall success rate (SR).
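
A sketch of how these metrics could be computed from logged episodes; the per-step flags and episode structure are assumptions, not the benchmarks' actual formats.

def evaluate(episodes):
    steps = [s for ep in episodes for s in ep["steps"]]
    type_acc = sum(s["type_correct"] for s in steps) / len(steps)         # correct sub-task class
    grounding = sum(s["grounded"] for s in steps) / len(steps)            # correct low-level actuation
    success_rate = sum(ep["success"] for ep in episodes) / len(episodes)  # task-level SR
    return type_acc, grounding, success_rate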

Plugging the RL-trained Coordinator and State Tracker into a frozen GUI-R1 executor yields a 10–30 point SR improvement versus the executor alone. When these modules are instead implemented as generic prompted LLM components (GPT-5), gains are only ∼4 points, whereas RL specialization yields ∼20 points. Ablation studies show that removal of the Coordinator causes a ∼12-point SR drop; removal of the State Tracker, ∼11 points; and skipping RL fine-tuning, ∼6 points.

Error analysis reveals that “State-Loss” errors are reduced from 14% to 2%, and planning errors are halved; perception errors attributable to executor freezing remain static.

Model scaling is also observed to be critical: the Coordinator must have at least 7B parameters for fine-grained decomposition; smaller models (∼3B) underperform.

6. Analysis: Architectural Insights and Limitations

CES underscores the value of architecturally decoupling high-level and low-level execution agents. By separating strategic intent (Coordinator) from grounding and perceptual actions (Executor), and deploying a dedicated State Tracker, the system resolves the “responsibility coupling” and “capability conflict” bottlenecks of monolithic, end-to-end RL approaches. Empirically, staged RL and modular feedback ensure that learnable modules optimize toward verifiable executor rewards, not ungrounded surrogates.

A primary limitation observed is residual ambiguity in Coordinator instructions under ambiguous perceptual features (e.g., bubble icon misidentification), and occasional over-compression by the State Tracker (e.g., dropping an "enable security" step). Prospective work includes bidirectional clarification requests and joint Coordinator/State Tracker training once executor reliability is assured.

7. Practical Impact and Generalizability

CES is demonstrated to be compositional and plug-and-play: the Coordinator and State Tracker can operate with any executor as long as API interfaces are matched. The modules are generalizable across task domains and executor architectures, with RL fine-tuning yielding robust gains in long-horizon planning and progress maintenance (Deng et al., 27 Nov 2025). This modular decoupling and staged training paradigm is applicable to other sequential decision-making environments with hierarchical action decomposition and state tracking requirements.
