Multimodal AEPO (ARES) Frameworks
- Multimodal AEPO and ARES are frameworks that integrate entropy-based exploration and adaptive inference to improve reasoning, robustness, and alignment in models that process text, images, and audio.
- It employs token-level entropy, sliding-window metrics, and dynamic reward shaping to trigger adaptive exploration and optimize inference processes across diverse benchmarks.
- Empirical evaluations reveal significant gains in sample efficiency, semantic alignment, and accuracy in tasks such as GUI grounding, multimodal reasoning, and open-set recognition.
Multimodal Adaptive Exploration Policy Optimization (AEPO) and ARES are frameworks designed to advance the reasoning, robustness, and alignment capabilities of Multimodal LLMs and related agents. These approaches introduce mathematically principled, entropy- or efficiency-based optimization schemes to enable dynamic, context-sensitive exploration and adaptive inference across tasks involving two or more modalities, such as image, text, and audio. The paradigms have demonstrated improved performance, sample efficiency, and generalization in diverse benchmarks ranging from GUI grounding and multimodal reasoning to robust test-time adaptation and open-set recognition.
1. Key Mathematical Foundations
Central to multimodal AEPO and ARES variants is the explicit quantification and optimization of exploration, confidence, and reasoning efficiency:
- Token-Level Entropy: For autoregressive models, the policy's distribution at generation step $t$ is $\pi_\theta(\cdot \mid x, y_{<t})$. The per-token Shannon entropy is
  $$H_t = -\sum_{v \in \mathcal{V}} \pi_\theta(v \mid x, y_{<t}) \, \log \pi_\theta(v \mid x, y_{<t}).$$
- Sliding-Window Entropy and High-Window-Entropy (HWE) Tokens: Window-averaged entropy over a window of length $w$ ending at $t$:
  $$\bar{H}_t = \frac{1}{w} \sum_{i=t-w+1}^{t} H_i.$$
  HWE tokens are those where $\bar{H}_t > \tau$, with $\tau$ a (trajectory-wise) high-percentile threshold.
- Exploration Efficiency: In GUI grounding, AEPO uses an efficiency metric $\eta = U / C$, where $U$ is the success utility ($+1$ if any candidate hits the target, $-1$ otherwise), and $C$ is the geometric mean of proposal and verification costs.
- Entropy-Aware Adaptation: For open-set adaptation, per-sample entropy $H(\hat{p}(x))$ is used to adaptively maximize or minimize uncertainty via weighted loss terms, amplifying entropy gaps between known and unknown-class samples.
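The entropy quantities above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from any of the referenced works: the function names, the shortened window at the start of a sequence, and the percentile rule for the threshold are all assumptions made here.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def window_entropy(entropies, w):
    """Mean entropy over the (up to) w steps ending at each position t.

    Assumption: near the start of the sequence the window is simply shorter.
    """
    out = []
    for t in range(len(entropies)):
        window = entropies[max(0, t - w + 1): t + 1]
        out.append(sum(window) / len(window))
    return out

def hwe_mask(entropies, w, percentile=0.8):
    """Flag positions whose window entropy exceeds a trajectory-wise threshold.

    Assumption: the threshold tau is the value at the given percentile of the
    window entropies of this one trajectory.
    """
    win = window_entropy(entropies, w)
    ranked = sorted(win)
    idx = min(len(ranked) - 1, int(percentile * len(ranked)))
    tau = ranked[idx]
    return [h > tau for h in win], tau
```

In practice the per-step distributions would come from a softmax over the model's logits; the list-based version here only shows how the three quantities relate.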
2. Core Algorithms and Training Pipelines
Multiple AEPO and ARES instantiations employ a staged (often two-phase) optimization process:
AEPO for GUI Grounding (Liu et al., 7 Aug 2025)
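- Multi-Answer Generation: At each RL step, the model produces $N$ candidate outputs in a single rollout, increasing the effective exploration space.
- Adaptive Exploration Reward (AER): the efficiency $\eta = U / C$ evaluated with proposal cost $N$ and verification cost $k$ (taking $k = N$ on failure):
  $$R_{\text{AER}} = \begin{cases} \dfrac{1}{\sqrt{N \cdot k}} & \text{if some candidate is correct,} \\[4pt] -\dfrac{1}{N} & \text{otherwise,} \end{cases}$$
  where $k$ is the index of the first correct candidate.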
- Objective: Policy gradients are computed with respect to the log-probability of the full candidate set. A collinear-point penalty discourages degenerate exploration.
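As a rough illustration of the efficiency-based reward, the sketch below scores one multi-answer rollout. It assumes the proposal cost is the number of candidates and the verification cost is the 1-based rank of the first correct candidate (all candidates on failure), so the cost is their geometric mean; the function name and the boolean-list input are hypothetical.

```python
import math

def adaptive_exploration_reward(hits):
    """Efficiency-style reward for one multi-answer rollout.

    hits: list of booleans, one per candidate, in the order proposed.
    Assumed costs: proposal cost = number of candidates n,
    verification cost = rank k of the first hit (n if nothing hits).
    """
    n = len(hits)
    if n == 0:
        raise ValueError("need at least one candidate")
    if True in hits:
        k = hits.index(True) + 1          # 1-based rank of the first correct candidate
        return 1.0 / math.sqrt(n * k)     # utility +1 over geometric-mean cost
    return -1.0 / n                       # utility -1 over sqrt(n * n) = n
```

Under these assumptions a single correct answer earns the maximum reward of 1, while padding the rollout with extra candidates or ranking the correct one late both shrink the reward.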
ARES for Adaptive Multimodal Reasoning (Chen et al., 9 Oct 2025)
- Stage 1: Adaptive Cold-Start (AdaCS): The model learns a mapping from problem difficulty (proxied by pass rates) to reasoning-trace length via supervised fine-tuning on data curated so that reasoning length scales with difficulty.
- Stage 2: Adaptive-Entropy Policy Optimization (AEPO):
- During rollout, high window-entropy tokens trigger exploration branches (“when to explore”).
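- Hierarchical reward shaping with a dynamic KL-regularized loss allocates more exploration (HWE tokens) to harder problems and penalizes overthinking on easy problems. Schematically, the objective has the token-wise KL-regularized form
  $$\mathcal{J}(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\Big[ R(x, y) - \sum_{t} \beta_t \, D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x, y_{<t}) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x, y_{<t})\big) \Big].$$
- The KL coefficient $\beta_t$ is dynamically relaxed in HWE-flagged sub-sequences (“thinking budget allocator”).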
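One way to picture the relaxed KL penalty is a per-token coefficient that is scaled down wherever a token is flagged as high-window-entropy. The sketch below is an assumption-laden illustration, not the published method: `beta_base`, `relax`, and the sampled-token log-ratio estimate of the KL term are choices made here.

```python
def kl_coefficients(hwe_flags, beta_base=0.05, relax=0.2):
    """Per-token KL weights: the base weight, scaled down by `relax` on HWE tokens.

    beta_base and relax are hypothetical hyperparameters.
    """
    return [beta_base * relax if flagged else beta_base for flagged in hwe_flags]

def kl_penalty(logp_policy, logp_ref, betas):
    """Weighted sum of per-token log-ratios on the sampled tokens.

    log pi(y_t) - log pi_ref(y_t) is a common single-sample estimate of the
    per-token KL divergence; it is used here only for illustration.
    """
    return sum(b * (lp - lr) for b, lp, lr in zip(betas, logp_policy, logp_ref))
```

A smaller coefficient on flagged tokens lets the policy drift further from the reference exactly where the model is uncertain, which is the intuition behind treating the coefficient as a thinking budget.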
Alternating RL and SFT (ARES) (Byun et al., 2024)
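- RL Stage: Optimize the policy against sentence-level AI feedback, schematically
  $$\max_\theta \; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\Big[ \sum_{j=1}^{J} r_j \Big],$$
  using PPO, where $r_j$ is the teacher's score for the $j$-th of $J$ rationale sentences.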
- SFT Stage: Correct errors and stabilize outputs by supervised fine-tuning on AI-proposed correction pairs.
- Algorithm Alternation: Repeat RL and SFT iterations to balance reward-driven exploration with robustness and fluency.
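The alternation itself is just a loop over two stages. The skeleton below is a sketch under the assumption that each stage can be wrapped as a function taking and returning a policy; `rl_step` and `sft_step` are hypothetical callables standing in for a PPO update with sentence-level rewards and a fine-tuning pass on correction pairs.

```python
def alternate_rl_sft(policy, rl_step, sft_step, rounds):
    """Run `rounds` iterations of an RL stage followed by an SFT stage.

    rl_step and sft_step are caller-supplied functions (policy -> policy);
    the returned history records the order in which stages ran.
    """
    history = []
    for r in range(rounds):
        policy = rl_step(policy)      # reward-driven exploration
        history.append(("rl", r))
        policy = sft_step(policy)     # correction-driven stabilization
        history.append(("sft", r))
    return policy, history
```

For example, `alternate_rl_sft(model, ppo_update, sft_update, rounds=2)` would run RL, SFT, RL, SFT in that order, with `ppo_update` and `sft_update` supplied by the training code.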
Adaptive Entropy-Aware Optimization for Multimodal Open-Set TTA (Dong et al., 23 Jan 2025)
- Unknown-Aware Adaptive Entropy (UAE): Weighted loss term that maximizes uncertainty on high-entropy (possibly unknown) samples, minimizes for confident (likely known) samples.
- Adaptive Modality Prediction Discrepancy (AMP): Encourages prediction consistency across modalities for known classes, and discrepancy for unknowns.
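- Total Loss: schematically,
  $$\mathcal{L} = \mathcal{L}_{\text{UAE}} + \lambda_1 \mathcal{L}_{\text{AMP}} + \lambda_2 \mathcal{L}_{\text{div}},$$
  where $\mathcal{L}_{\text{div}}$ is a batch-diversity regularizer and $\lambda_1, \lambda_2$ are weighting coefficients.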
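The unknown-aware entropy term can be illustrated with a signed weight that flips the optimization direction depending on how uncertain a sample is. The sketch below assumes a tanh weighting around a threshold `alpha` with sharpness `beta`, applied to entropy normalized by the log of the class count; these specifics are assumptions for illustration and may differ from the published loss.

```python
import math

def normalized_entropy(probs):
    """Entropy divided by log(num_classes), so the result lies in [0, 1].

    Requires at least two classes.
    """
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))

def uae_loss(batch_probs, alpha=0.5, beta=4.0):
    """Mean of -w * H over the batch, with w = tanh(beta * (H - alpha)).

    Minimizing this pushes entropy up where H > alpha (w > 0, likely unknown)
    and down where H < alpha (w < 0, likely known). In an autograd framework
    the weight w would normally be treated as a constant (detached).
    """
    total = 0.0
    for probs in batch_probs:
        h = normalized_entropy(probs)
        w = math.tanh(beta * (h - alpha))
        total += -w * h
    return total / len(batch_probs)
```

The sign change at `alpha` is what widens the entropy gap between confident and uncertain samples instead of uniformly sharpening every prediction.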
3. Architectural and Implementation Details
- Model Backbone: All frameworks build on transformer-based multimodal LLMs or fusion architectures. Visual features are extracted via frozen encoders (e.g., ViT), textual features via standard embeddings, and sequences are processed autoregressively.
- Policy Heads and Formatting: For AEPO (GUI), sequence heads are engineered to emit structured sets of candidate actions (e.g., multiple coordinates) as output tokens using special formatting.
- RL Optimization: REINFORCE with leave-one-out baselines or PPO is standard. KL regularization is frequently applied—sometimes dynamically via dual variables.
- Entropy Tracking: Entropy statistics are cached or computed post-logits, generally outside the base transformer, avoiding modifications to attention or core layers.
- Data and Prompting: Training regimes rely on curated and filtered datasets, often with additional annotation or feedback (automatic or model-derived). For some setups, “<think> …” style prompts are prepended to encourage explicit reasoning traces.
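The leave-one-out baseline mentioned under RL optimization is simple to write down: each sample's baseline is the mean reward of the other samples drawn for the same prompt. The sketch below is a generic illustration of that estimator with scalar sequence log-probabilities; the function names are chosen here and do not come from the referenced works.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for k >= 2 samples of the same prompt."""
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two samples")
    total = sum(rewards)
    # Baseline for sample i is the mean reward of the other k - 1 samples.
    return [r - (total - r) / (k - 1) for r in rewards]

def reinforce_loss(seq_logps, rewards):
    """Surrogate loss whose gradient is the REINFORCE estimate.

    seq_logps: summed log-probability of each sampled sequence. In an
    autograd framework the advantages are constants and only seq_logps
    carry gradients.
    """
    adv = rloo_advantages(rewards)
    return -sum(a * lp for a, lp in zip(adv, seq_logps)) / len(seq_logps)
```

When all samples for a prompt receive the same reward, every advantage is zero and the prompt contributes no gradient, which is one reason multi-answer or branching rollouts that diversify outcomes can help.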
4. Empirical Evaluation and Benchmark Results
AEPO for GUI Grounding (Liu et al., 7 Aug 2025)
- Benchmarks: MMBench-GUI, ScreenSpot-Pro, UI-Vision, UI-I2E-Bench, ScreenSpot-V2.
- Key Findings: InfiGUI-G1-7B achieved gains of up to +9.0% relative over RLVR on hard semantic tasks; gains are especially notable on icon-based (semantic) subtasks compared to purely spatial ones. AEPO consistently improved both Top-1 accuracy and exploration success rate across benchmarks.
ARES for Multimodal Adaptive Reasoning (Chen et al., 9 Oct 2025)
- Benchmarks: MathVerse-V, MathVision, LogicVista, MMMU-Pro, CharXIV, MMStar, GPQA, AIME24/25, MATH500, MMLU Pro.
- Performance: ARES-7B outperformed open-source 7B MLLMs by +9.7 pp (multimodal avg.), +27.2 pp (textual). Notably, it reached or exceeded proprietary systems (e.g., Gemini-2.5, Claude-4-Sonnet) at 1/10th inference cost, with CoT length dynamically optimized by difficulty.
Alternating RL/SFT ARES (Byun et al., 2024)
- Datasets: ScienceQA, A-OKVQA.
- Metrics: Rationale win rate (GPT-4o judge), answer accuracy.
- Results: ARES increased rationale win rate to 69–74%, improved answer accuracy by 2.5 pp on average, and ablation confirmed that alternating RL and SFT is superior to single-stage RL.
AEO for MM-OSTTA (Dong et al., 23 Jan 2025)
- Benchmarks: EPIC-Kitchens, HAC, Kinetics-100-C (action), nuScenes (3D segmentation).
- Metrics: Known-class accuracy (Acc, IoU), FPR95, AUROC, H-score.
- Results: AEO increased H-score by 13–27.7 pp across action recognition benchmarks and 2–3 pp in segmentation; enabled substantial FPR95 reduction in long-term and continual adaptation settings.
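For readers unfamiliar with the open-set metrics, the sketch below computes FPR95 from confidence scores and combines several metrics with a harmonic mean. It assumes higher scores mean "more likely known" and that the H-score is a harmonic mean of known-class accuracy, AUROC, and 1 − FPR95; that composition is an assumption here and should be checked against the benchmark definition.

```python
import math

def fpr_at_tpr(known_scores, unknown_scores, tpr=0.95):
    """False-positive rate on unknowns at the threshold accepting `tpr` of knowns.

    Assumes a higher score means the sample looks more like a known class.
    """
    ranked = sorted(known_scores, reverse=True)
    n_keep = math.ceil(tpr * len(ranked))   # knowns that must be accepted
    threshold = ranked[n_keep - 1]
    false_pos = sum(1 for s in unknown_scores if s >= threshold)
    return false_pos / len(unknown_scores)

def harmonic_mean(values):
    """Harmonic mean; returns 0.0 if any component is non-positive."""
    if any(v <= 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)
```

A call such as `harmonic_mean([acc, auroc, 1 - fpr95])` then yields a single score that is dragged down by whichever component is weakest.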
5. Significance, Limitations, and Outlook
- Principled Exploration: All frameworks directly address exploration-exploitation balance by using efficiency, entropy, or feedback-driven adaptive branching, mitigating the “confidence trap” and vanishing-gradient regimes endemic to naive RL or SFT.
- Semantics and Generalization: Explicit reward shaping and entropy modulation induce better semantic alignment and robustness, as evidenced by large gains in hard or open-set regimes.
- Data and Compute Efficiency: For instance, InfiGUI-G1 achieves strong GUI grounding with only 44K samples, orders of magnitude fewer than SFT-based alternatives, while ARES-7B matches proprietary systems with drastically lower compute.
- Limitations: Increased rollout compute for multi-answer or multi-branch generation, dependence on feedback quality for alternating RL/SFT, and ultimate accuracy constraints imposed by encoder or backbone quality.
- Future Directions: Prospects include dynamic per-instance exploration scheduling, integration with stronger vision encoders, value-function-based actor-critic variants for variance reduction, and further extension to OOD and continual learning scenarios.
6. Representative Algorithmic Schemes
| Framework | Exploration Trigger | Reward/Objective Shaping | Adaptive Inference |
|---|---|---|---|
| AEPO (GUI) | Multi-answer rollout ($N$ points) | Efficiency-based AER, collinear penalty | Candidate pruning, policy gradient RL |
| ARES (Reasoning) | HWE-token branching | Hierarchical entropy/KL shaping | Dynamic CoT length, thinking-budget allocation |
| ARES (RL/SFT) | Sentence-level teacher reward | Alternating PPO + SFT via AI corrections | Correction-aware stability |
| AEO (MM-OSTTA) | Per-sample entropy ($H(\hat{p}(x))$) | Entropy/discrepancy weighted loss | Per-batch parameter update at test time |
This table summarizes the mechanistic differences of multimodal AEPO/ARES/AEO frameworks as documented in the referenced works.
7. References
- InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization (Liu et al., 7 Aug 2025)
- ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping (Chen et al., 9 Oct 2025)
- ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback (Byun et al., 2024)
- Towards Robust Multimodal Open-set Test-time Adaptation via Adaptive Entropy-aware Optimization (Dong et al., 23 Jan 2025)