Pairwise Learning for Agent Self-Improvement

Updated 1 January 2026
  • Pairwise learning is a method where agents iteratively refine behavior using structured, ordered comparisons of outputs rather than scalar rewards.
  • It employs hard negative discovery and specialized roles such as self-critics, verifiers, and judges to provide actionable feedback for continuous improvement.
  • Empirical studies show that pairwise-guided systems enhance performance in tasks like text-to-image generation and harmful meme detection with minimal supervision.

Pairwise Learning Guided Agent Self-Improvement is a paradigm in which autonomous agents iteratively refine their behavior or representations using structured pairwise comparisons, often between self-generated outputs or trajectories. Leveraging pairwise preference signals—rather than scalar feedback or undifferentiated negative examples—these approaches enable agents to sharpen decision boundaries, extract nuanced semantic cues, and ultimately achieve self-improvement without relying exclusively on large-scale human supervision or explicit reward engineering.

1. Conceptual Foundations: Pairwise Preference as a Learning Signal

At the core of pairwise learning guided self-improvement is the use of ordered comparisons to drive optimization and discovery. Rather than relying solely on pointwise rewards (e.g., per-sample correctness or absolute scalar scores), agents learn by contrasting pairs of outputs (image–prompt pairs, behavioral trajectories, or instances from a pool) in which one element is preferred over the other according to an implicitly or explicitly defined utility function.

This pairwise comparative structure leverages the following properties:

  • Hard Negative Discovery: Pairs are often chosen so that the "negative" sample lies close to the boundary of success, which increases the information content of the learning signal for policy or model refinement (see the sketch after this list).
  • Preference Modeling: Utility is typically implicit, distilled via a judge or comparator (often an LLM or MLLM), which enacts a stochastic or deterministic ordering over pairs via latent representations or direct scoring.
  • Self-Evolving Signal: As the agent improves, the distribution of negatives and positives shifts, yielding an evolving curriculum that continuously pushes the model toward the success frontier (Wan et al., 12 Sep 2025, Jung et al., 27 Nov 2025, Lang et al., 25 Dec 2025).
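
As a concrete illustration of hard negative discovery, the following minimal Python sketch selects, from a pool of scored rollouts, the failed samples whose scores lie closest to a success threshold and pairs them with successes; the Rollout container, threshold, and scoring convention are illustrative assumptions rather than details from the cited papers.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    output: str      # agent-generated output or trajectory (placeholder)
    score: float     # task score from a judge or environment, in [0, 1]

def build_preference_pairs(rollouts, success_threshold=0.7, k=4):
    """Pair each success with the k failed rollouts closest to the threshold.

    Near-miss failures carry more information than clearly bad samples,
    so failures are ranked by their distance to the success boundary.
    """
    successes = [r for r in rollouts if r.score >= success_threshold]
    failures = [r for r in rollouts if r.score < success_threshold]
    hard_negatives = sorted(failures, key=lambda r: success_threshold - r.score)[:k]
    return [(pos, neg) for pos in successes for neg in hard_negatives]

if __name__ == "__main__":
    pool = [Rollout("A", 0.9), Rollout("B", 0.65), Rollout("C", 0.2), Rollout("D", 0.68)]
    for chosen, rejected in build_preference_pairs(pool):
        print(f"prefer {chosen.output} ({chosen.score}) over {rejected.output} ({rejected.score})")
```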

2. Canonical Architectures and Roles

Pairwise learning-guided frameworks are often instantiated via multi-agent orchestration, with specialized agent roles interacting in tightly coupled loops.

  • Self-Critics: Multiple agents or subnetworks specialize in evaluating distinct facets of a given output, providing not only a pass/fail or scored assessment but also structured, interpretable edit signals in natural language.
  • Verifier: Integrates critic feedback under explicit constraints. This agent ensures that proposed revisions respect foundational criteria such as user intent or fidelity to the source prompt.
  • Judge (Comparator): Executes pairwise comparisons between candidate outputs, typically implementing a preference model via a learned or zero-shot utility function, selecting the most promising candidate for the next iteration.
  • Generator: Responsible for synthesis according to evolving instructions or parameters, often as a black-box component without gradient access (Wan et al., 12 Sep 2025).

In the co-evolving agent paradigm, an explicit "failure agent" may also be instantiated: this agent generates and ranks near-miss or hard negative examples, furnishing informative counterexamples to drive positive agent improvement (Jung et al., 27 Nov 2025). In label-free detection settings, an LMM agent evolves its internal reference set by contrasting explicit positive/negative pairs, progressively refining its detection schema (Lang et al., 25 Dec 2025).
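
One way to express these roles as interfaces is sketched below; the Protocol names and method signatures are illustrative assumptions rather than the APIs of the cited systems, and the Maestro workflow in Section 3 shows how such roles are wired into a loop.

```python
from typing import Protocol, Sequence, Tuple

class Generator(Protocol):
    """Black-box synthesis component (e.g., a T2I model or LLM) without gradient access."""
    def generate(self, prompt: str) -> str: ...

class Critic(Protocol):
    """Evaluates one facet of an output; returns a score plus a natural-language edit signal."""
    def critique(self, prompt: str, output: str) -> Tuple[float, str]: ...

class Verifier(Protocol):
    """Consolidates critic edits into a revised prompt while enforcing user-intent constraints."""
    def consolidate(self, prompt: str, edit_suggestions: Sequence[str]) -> str: ...

class Judge(Protocol):
    """Pairwise comparator: returns True if candidate_a (prompt, output) beats candidate_b."""
    def prefer(self, candidate_a: Tuple[str, str], candidate_b: Tuple[str, str]) -> bool: ...
```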

3. Methodological Workflows

Maestro (Multi-Agent Orchestration for T2I Generation)

Maestro utilizes an iterative loop:

  1. Initialization: The user prompt is rewritten and decomposed into visual questions (DVQs).
  2. Generation: The black-box T2I model produces an image from the current prompt.
  3. Self-Critique: Each critic agent scores the image with respect to its DVQ; if a score falls below a threshold τ (e.g., 0.5), a natural-language edit suggestion is emitted.
  4. Verifier Integration: Edit suggestions are consolidated while enforcing semantic constraints, resulting in candidate prompt rewrites.
  5. Pairwise Judging: The judge agent runs binary tournaments, comparing newly generated image–prompt pairs against the incumbent, updating the current best.
  6. Iteration: The loop continues until the computational or patience budget is exhausted, at which point the top-performing pair is output (Wan et al., 12 Sep 2025).
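
The following sketch condenses steps 1–6 into runnable Python; the stub callables (rewrite_prompt, generate_image, score_dvq, consolidate_edits, judge_prefers) are hypothetical placeholders for the corresponding agents, not Maestro's actual interface.

```python
def maestro_style_loop(user_prompt, dvqs, rewrite_prompt, generate_image, score_dvq,
                       consolidate_edits, judge_prefers, max_iters=10, patience=3, tau=0.5):
    """Iterative prompt refinement for a black-box T2I model, driven by pairwise judging.

    Hypothetical stub signatures:
      rewrite_prompt(prompt) -> str
      generate_image(prompt) -> image
      score_dvq(image, dvq) -> (score, edit_suggestion)
      consolidate_edits(prompt, edits) -> str
      judge_prefers(candidate, incumbent) -> bool   # pairwise comparison
    """
    prompt = rewrite_prompt(user_prompt)              # 1. initialization (prompt rewrite + DVQs)
    best = (prompt, generate_image(prompt))           # 2. initial generation
    stale = 0
    for _ in range(max_iters):
        # 3. self-critique: low-scoring DVQs emit natural-language edit suggestions
        edits = [edit for score, edit in (score_dvq(best[1], q) for q in dvqs) if score < tau]
        if not edits:
            break                                     # every critic is satisfied
        # 4. verifier integration: constrained consolidation into a candidate rewrite
        new_prompt = consolidate_edits(best[0], edits)
        candidate = (new_prompt, generate_image(new_prompt))
        # 5. pairwise judging: the incumbent is replaced only if the judge prefers the candidate
        if judge_prefers(candidate, best):
            best, stale = candidate, 0
        else:
            stale += 1
        if stale >= patience:                         # 6. patience budget exhausted
            break
    return best
```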

Co-Evolving Agents from Hard Negatives

A target agent and a failure agent both generate trajectories. Preference pairs are constructed:

  • Expert vs. agent failures: expert trajectories are paired against the agent's failed attempts, anchoring what success looks like.
  • Agent vs. failure agent: the target agent's outputs are paired against the failure agent's near-miss trajectories, directly teaching the target to discriminate and improve on near-miss failure modes.

Both agents are optimized using direct preference optimization (DPO) loss, operating over their respective pairs, with hard negatives prioritized based on closeness to success. The joint optimization drives the target agent to internalize distinctions just beyond the current ability boundary (Jung et al., 27 Nov 2025).
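
A minimal sketch of how the two pair types could be assembled for a DPO-style objective is given below; the trajectory containers, reward function, and success threshold are assumptions for illustration rather than the authors' implementation.

```python
def build_coevolution_pairs(expert_trajs, agent_trajs, failure_trajs,
                            reward_fn, success_threshold=1.0, top_k=8):
    """Construct (chosen, rejected) preference pairs for the target agent.

    Two pair types, following the co-evolving setup:
      1. expert vs. agent failures: anchors what success looks like
      2. agent successes vs. failure-agent near misses: teaches fine discrimination
    Hard negatives are prioritized by closeness to the success threshold.
    """
    agent_fails = [t for t in agent_trajs if reward_fn(t) < success_threshold]
    agent_wins = [t for t in agent_trajs if reward_fn(t) >= success_threshold]
    failure_only = [t for t in failure_trajs if reward_fn(t) < success_threshold]
    # Rank failure-agent trajectories so that near misses come first.
    hard_negs = sorted(failure_only, key=lambda t: success_threshold - reward_fn(t))[:top_k]

    pairs = []
    pairs += [(e, f) for e in expert_trajs for f in agent_fails]   # type 1
    pairs += [(w, n) for w in agent_wins for n in hard_negs]       # type 2
    return pairs
```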

ALARM (Label-Free Meme Detection)

The label-free ALARM approach proceeds as follows:

  • Confidence-Based Identification: High-confidence ("explicit") memes are isolated and pseudo-labeled using an LMM decoder.
  • Contrastive Pair Construction: Explicit memes are paired harmful-with-benign via nearest-neighbor search in multimodal embedding space.
  • Experience Extraction: Chain-of-thought prompts elicit the agent's interpretation of the semantic differences in each pair.
  • Reference Refinement: An internal reference set is iteratively pruned and re-weighted using ADD, UPVOTE, DOWNVOTE, and EDIT operations (a minimal sketch follows this list).
  • Inference: The distilled references, encoding key detection principles, guide classification of more subtle, ambiguous input memes (Lang et al., 25 Dec 2025).
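
The sketch below shows one plausible data structure for the internal reference set and its ADD, UPVOTE, DOWNVOTE, and EDIT operations; the weighting and pruning rules are assumptions for illustration, not ALARM's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Reference:
    principle: str    # natural-language detection principle distilled from a contrastive pair
    weight: float = 1.0

@dataclass
class ReferenceSet:
    refs: list = field(default_factory=list)
    max_size: int = 50

    def add(self, principle: str):
        self.refs.append(Reference(principle))

    def upvote(self, idx: int, step: float = 0.5):
        self.refs[idx].weight += step

    def downvote(self, idx: int, step: float = 0.5):
        self.refs[idx].weight -= step

    def edit(self, idx: int, new_principle: str):
        self.refs[idx].principle = new_principle

    def prune(self):
        # Keep only the highest-weight principles; drop the rest.
        self.refs = sorted(self.refs, key=lambda r: r.weight, reverse=True)[: self.max_size]
```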

4. Mathematical Formulations

Pairwise learning systems formalize the learning signal via pairwise losses, preference margins, and explicit utility comparisons:

  • Pairwise Preference Loss (DPO):

$$L_\mathrm{DPO}(\theta) = - \mathbb{E}_{(u,e_+,e_-)}\left[ \log \sigma\left(\delta(u,e_+,e_-;\theta)\right) \right]$$

with

$$\delta(u,e_+,e_-;\theta) = \beta \left[ \log \pi_\theta(e_+ \mid u) - \log \pi_\mathrm{ref}(e_+ \mid u) - \left( \log \pi_\theta(e_- \mid u) - \log \pi_\mathrm{ref}(e_- \mid u) \right) \right]$$

where $\sigma$ is the logistic sigmoid and $\beta$ is a scaling hyperparameter (Jung et al., 27 Nov 2025).
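
A small numerical sketch of this loss is shown below (NumPy, with precomputed per-response log-probabilities standing in for the policy $\pi_\theta$ and frozen reference $\pi_\mathrm{ref}$); the toy values are arbitrary.

```python
import numpy as np

def dpo_loss(policy_logp_pos, policy_logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """DPO loss over a batch of (u, e_+, e_-) triples, given per-response log-probs."""
    delta = beta * ((policy_logp_pos - ref_logp_pos) - (policy_logp_neg - ref_logp_neg))
    # -log sigmoid(delta) == log(1 + exp(-delta)), computed stably with logaddexp
    return float(np.mean(np.logaddexp(0.0, -delta)))

# Toy batch: the policy favors the preferred responses slightly more than the reference does.
policy_logp_pos, ref_logp_pos = np.array([-10.0, -12.0]), np.array([-10.5, -12.5])
policy_logp_neg, ref_logp_neg = np.array([-11.0, -13.0]), np.array([-10.8, -12.7])
print(dpo_loss(policy_logp_pos, policy_logp_neg, ref_logp_pos, ref_logp_neg))
```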

  • Self-Evolving Utility in Generation:

$$P\left( (p_i, I_i) \succ (p_j, I_j) \right) = \sigma\left( f(I_i, p_i) - f(I_j, p_j) \right)$$

$$L_\mathrm{pair} = - \sum_{i, j} \mathbb{1}_{(i \succ j)} \cdot \log \sigma\left( f(I_i, p_i) - f(I_j, p_j) \right)$$

where $f(\cdot)$ is the learned or implicit utility and $\mathbb{1}_{(i \succ j)}$ indicates that candidate pair $i$ is preferred over candidate pair $j$ (Wan et al., 12 Sep 2025).
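
The corresponding pairwise utility loss can be computed over a candidate pool as in the sketch below (NumPy); the utilities and preference matrix are synthetic examples.

```python
import numpy as np

def pairwise_utility_loss(utilities, prefers):
    """L_pair = - sum_{i,j} 1[i > j] * log sigmoid(f_i - f_j).

    utilities : array of shape (n,), the utility f(I_k, p_k) for each candidate
    prefers   : boolean matrix of shape (n, n), prefers[i, j] = True if i beat j
    """
    diff = utilities[:, None] - utilities[None, :]      # f_i - f_j for all i, j
    log_sig = -np.logaddexp(0.0, -diff)                 # log sigmoid(f_i - f_j), stable
    return float(-np.sum(np.where(prefers, log_sig, 0.0)))

utilities = np.array([0.8, 0.3, 0.5])
prefers = np.array([[False, True,  True],
                    [False, False, False],
                    [False, True,  False]])
print(pairwise_utility_loss(utilities, prefers))
```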

  • Contrastive Pair Construction in ALARM:

$$\mathrm{sim}\left( \mathcal{M}_i, \mathcal{M}_j \right) = \cos\left( \Psi_v(\mathcal{I}_i), \Psi_v(\mathcal{I}_j) \right) + \cos\left( \Psi_t(\mathcal{T}_i), \Psi_t(\mathcal{T}_j) \right)$$

where $\Psi_v$ and $\Psi_t$ are the visual and textual encoders applied to each meme's image $\mathcal{I}$ and overlaid text $\mathcal{T}$, facilitating nearest-neighbor pairing for fine-grained semantic contrast (Lang et al., 25 Dec 2025).
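
For completeness, the similarity computation and nearest-neighbor pairing might look like the following sketch, where random vectors stand in for the $\Psi_v$ and $\Psi_t$ embeddings.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def multimodal_similarity(vis_i, txt_i, vis_j, txt_j):
    """sim(M_i, M_j) = cos(Psi_v(I_i), Psi_v(I_j)) + cos(Psi_t(T_i), Psi_t(T_j))."""
    return cosine(vis_i, vis_j) + cosine(txt_i, txt_j)

def nearest_counterpart(query, candidates):
    """Pair a (visual, textual) embedding of one meme with its most similar counterpart."""
    scores = [multimodal_similarity(query[0], query[1], v, t) for v, t in candidates]
    return int(np.argmax(scores))

# Toy example: one harmful meme paired against two benign candidates.
rng = np.random.default_rng(0)
harmful = (rng.normal(size=8), rng.normal(size=8))
benign = [(rng.normal(size=8), rng.normal(size=8)) for _ in range(2)]
print(nearest_counterpart(harmful, benign))
```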

5. Empirical Performance and Applications

Text-to-Image Generation

Maestro achieves consistent improvements on a black-box T2I model (Imagen 3) across the p2-hard and DSG-1K datasets:

  • DSGScore: 0.92 vs. 0.90 for the best baseline (OPT2I).
  • Pairwise AutoSxS Win-Rate: Maestro outperforms the next best in approximately 70–80% of trials.
  • Human Preference: >60% preference over LM-BBO.
  • Ablations: each role (critics, iterative editing, pairwise comparator, and verifier) yields additive benefits (Wan et al., 12 Sep 2025).

Task-Driven Agents

Co-evolving agents demonstrate robust performance on WebShop, ScienceWorld, and InterCodeSQL:

  • Average Final Reward (Llama-2-7B): 64.1 for the co-evolving approach vs. 58.3 for ETO, a +5.8-point absolute improvement.
  • Hard Negatives: Prioritizing near-success failures yields a stronger learning signal than simply increasing rollout volume, which produces only marginal gains (about +1 point) in DART-style baselines. Improvements generalize across in-domain and out-of-domain splits and are robust to the regularization scale (Jung et al., 27 Nov 2025).

Label-Free Harmful Meme Detection

ALARM attains performance surpassing even label-driven approaches on three meme detection datasets. Pairwise-guided reference refinement, built from high-confidence contrastive pairs, equips the LMM agent with an evolving set of detection principles capable of identifying subtle and novel forms of harmful content without gradient updates or additional annotation (Lang et al., 25 Dec 2025).

6. Extensions and Limitations

  • Strengths: Pairwise paradigms excel by leveraging the most informative negative examples, supporting dynamic competence expansion, and extracting transferable semantic cues. They are applicable across black-box and label-scarce settings and foster interpretable agent feedback mechanisms.
  • Limitations: Reward or utility functions must be sufficiently reliable, as errors may misclassify hard negatives. There is computational overhead from multi-agent rollouts and reference management. Purely synthetic or self-supervised curricula may risk mode collapse or inadequate diversity if not carefully regulated (Jung et al., 27 Nov 2025).
  • Extensions: Potential directions include (i) incorporating additional agent types for curriculum learning or targeted skill acquisition, (ii) leveraging step-level preferences for finer optimization, (iii) integrating occasional weak human feedback for high-precision calibration, and (iv) adapting to continuous-control or multimodal domains where reward signals are sparse or complex (Jung et al., 27 Nov 2025, Lang et al., 25 Dec 2025).

7. Significance and Outlook

Pairwise learning guided agent self-improvement represents a strategic advance in the design of autonomous, continually adapting systems, especially where annotations, explicit rewards, or full supervision are infeasible. By converting contrastive judgments—whether in the form of paired critiques, failure comparisons, or explainable pseudo-labels—into actionable learning signals, these systems enable progressive, robust improvement. Ongoing empirical results underline their impact across diverse generative, reasoning, and content-moderation tasks (Wan et al., 12 Sep 2025, Jung et al., 27 Nov 2025, Lang et al., 25 Dec 2025). A plausible implication is that such methodologies will be central to future agent designs where interpretability, adaptability, and weak-supervision scaling are paramount.
