HERA: Hierarchical Forecasting Architecture
- The paper introduces HERA, a hierarchical architecture that decomposes temporal reasoning into encoder, refresher, and anticipator stages for multi-level event forecasting.
- HERA employs asynchronous cross-level messaging to capture nested event structures in complex processes like human activity and multi-agent coordination.
- Empirical evaluations reveal HERA’s superior performance over flat models, achieving higher F1 scores in both coarse and fine-level predictions.
The Hierarchical Encoder–Refresher–Anticipator (HERA) is a class of hierarchical neural architectures developed for modeling and forecasting temporally composite processes that are naturally structured as multi-level event hierarchies. HERA was introduced to address the limitations of conventional flat sequence models—such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs)—that do not explicitly encode the semantic relationships across different levels of abstraction in event-driven sequences. Two distinct domains exemplify HERA's utility: human activity forecasting in video and orchestrated multi-agent reasoning for complex question answering. The core paradigm in both cases is the explicit, asynchronous coordination between three modules—Encoder, Refresher, and Anticipator—linked via structured cross-level communication.
1. Architectural Principles
HERA decomposes temporal reasoning into a multi-stage, multi-level process with hierarchical abstraction. In its canonical two-level instantiation for activity forecasting, the levels represent coarse (e.g., "make-coffee") and fine (e.g., "pour-milk") actions, where fine actions are strictly nested within their coarse parent. Three main modules structure the pipeline:
- Encoder: Constructs multi-level summaries of observed sequential data. In video, this is realized via asynchronous GRUs operating at each abstraction layer, embedding information about completed events and their durations, modulated by cross-level messages—downward (coarse→fine) for subtask planning and upward (fine→coarse) for progress feedback.
- Refresher: Operates at sequence interruption points, bridging observation gaps by refreshing hidden states and explicitly inferring the remaining duration of interrupted actions for each level. This enables accurate resumption of partially observed event segments.
- Anticipator: Uses the refreshed initial states to perform rollout, forecasting the remainder of the event hierarchy in both coarse and fine granularity until task completion, again using cross-level message passing for synchronization.
Analogous principles have been applied in multi-agent retrieval-augmented generation (RAG), where the Encoder plans global agent orchestration, the Refresher evolves an "experience library" (semantic priors), and the Anticipator adapts local agent prompts via targeted credit assignment (Morais et al., 2020; Li et al., 1 Apr 2026).
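The asynchronous, two-way messaging in the activity-forecasting case can be illustrated with a small scheduling sketch. It is not the authors' implementation: the class and function names are hypothetical, the fine labels other than "pour-milk" are made up for illustration, and the learned GRU updates are replaced by a plain trace of which level would update or send a message at each point.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Event:
    label: str
    duration: float  # relative duration (illustrative values)


@dataclass
class CoarseEvent(Event):
    children: List[Event] = field(default_factory=list)


def async_update_order(activity: List[CoarseEvent]) -> List[Tuple[str, str]]:
    """Trace the order of updates and messages in a two-level hierarchy.

    Each level updates only when one of its own events completes, so the
    fine level ticks more often than the coarse level.
    """
    trace = []
    for coarse in activity:
        # Downward message (coarse -> fine): which subtask plan is active.
        trace.append(("down", coarse.label))
        for fine in coarse.children:
            # Fine-level state update when a fine event completes.
            trace.append(("fine_update", fine.label))
            # Upward message (fine -> coarse): progress feedback.
            trace.append(("up", fine.label))
        # Coarse-level state update once the whole coarse event is done.
        trace.append(("coarse_update", coarse.label))
    return trace


# Hypothetical example: one coarse activity with three fine actions.
activity = [
    CoarseEvent("make-coffee", 1.0, [
        Event("take-cup", 0.3),
        Event("pour-coffee", 0.4),
        Event("pour-milk", 0.3),
    ])
]
for step in async_update_order(activity):
    print(step)
```

The point of the sketch is the clocking: the coarse level updates once per coarse event, while the fine level updates once per fine event, with messages exchanged in between.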
2. Mathematical Formalism
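At each level (coarse or fine), the activity sequence is a list of events, each described by an event label and its (relative) duration. The accumulated duration within a parent event is the running total of the durations of the child events completed so far. Downward cross-level messages (coarse→fine) and upward messages (fine→coarse) align planning and progress.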
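As a small worked example of the accumulated duration, the sketch below converts the child durations of one parent event into relative durations and takes their running sum. Normalizing by the parent's total length is an assumption made here for illustration, and the function name is hypothetical.

```python
from itertools import accumulate


def relative_progress(child_seconds):
    """Running total of child durations, normalized by the parent's length.

    Assumption: 'relative' means relative to the parent event's total
    duration; other normalizations are possible.
    """
    total = sum(child_seconds)
    if total <= 0:
        raise ValueError("parent event must have positive duration")
    relative = [s / total for s in child_seconds]
    return list(accumulate(relative))


# Three fine actions lasting 5 s, 10 s and 5 s inside one coarse action.
print(relative_progress([5.0, 10.0, 5.0]))  # [0.25, 0.75, 1.0]
```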
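Encoder Updates:
Each level's GRU updates its hidden state when an event at that level completes, taking as input the completed event's label and duration together with the incoming cross-level message.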
Refresher:
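At an interruption, where observation stops partway through an event (identified by the index of the interrupted coarse event and the partial duration already elapsed), a rewinding step recomputes the hidden state at each level and infers the remaining duration of the interrupted action.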
Refreshed hidden vectors launch the Anticipator phase.
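To make the interruption point concrete, the following sketch locates the event cut by the observation boundary and returns its index, elapsed partial duration, and remaining duration. It computes these quantities from annotated durations (the kind of values a Refresher would be trained to infer); it is not the learned module itself, and the function name is hypothetical.

```python
def find_interruption(durations, observed):
    """Locate the event cut by the observation boundary.

    durations: event durations in temporal order (any consistent unit).
    observed:  total observed time, in the same unit.
    Returns (index, elapsed, remaining) for the interrupted event, or None
    if the boundary falls exactly on an event end or outside the sequence.
    """
    start = 0.0
    for i, d in enumerate(durations):
        end = start + d
        if start < observed < end:
            return i, observed - start, end - observed
        start = end
    return None


# Events last 4, 6 and 10 units; observation stops at t = 6.
print(find_interruption([4.0, 6.0, 10.0], 6.0))  # (1, 2.0, 4.0)
```

Applying the same function separately to the coarse and the fine sequence gives one interruption record per level.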
Anticipator Rollout:
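At each step, the Anticipator predicts the label and duration of the next event at each level from the current hidden states, exchanging cross-level messages as in the Encoder.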
The recursive process continues until the parent task/segment completes. In multi-agent contexts, the architecture generalizes to encode orchestrator distributions over execution topologies, with group sampling and reward-guided refinement (Li et al., 1 Apr 2026).
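The rollout loop can be sketched as follows for a single level. The stopping rule used here (stop once a fixed duration budget for the parent segment is used up) is an assumption for illustration, and `predict_next` and `toy_predictor` are hypothetical stand-ins for the learned model.

```python
def rollout(state, predict_next, remaining, max_steps=100):
    """Generate (label, duration) events until the time budget is used up.

    predict_next(state) must return (label, duration, new_state).
    remaining is the duration left in the parent segment (assumed known).
    """
    events = []
    used = 0.0
    for _ in range(max_steps):
        if used >= remaining:
            break
        label, duration, state = predict_next(state)
        if duration <= 0:
            break  # guard against predictions that do not advance time
        duration = min(duration, remaining - used)  # clip the last event
        events.append((label, duration))
        used += duration
    return events


def toy_predictor(state):
    """Hypothetical deterministic predictor cycling through a fixed plan."""
    plan = [("pour-milk", 0.25), ("stir", 0.5), ("drink", 0.5)]
    label, duration = plan[state % len(plan)]
    return label, duration, state + 1


print(rollout(0, toy_predictor, remaining=1.0))
# [('pour-milk', 0.25), ('stir', 0.5), ('drink', 0.25)]
```

In a two-level setting, one such loop would run per level, with the coarse loop advancing only when the fine loop reports that the current parent is complete.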
3. Training and Objective Functions
Training uses a multi-stage, multi-level loss, with one term per stage (Encoder, Refresher, Anticipator) and per level. For the Refresher, only the duration loss is used. Network parameters, including the task weights, are optimized under multi-task weighting. The total training loss sums over all stages and all levels. In orchestrated RAG, reward-guided sampling and token-cost incentives govern topology evolution and prompt updates, yielding dynamic adaptation without parameter updates to the underlying LLMs (Li et al., 1 Apr 2026).
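One way to write the structure of this objective is sketched below. The symbols are assumptions introduced here, not the original notation: $s$ indexes the stage (E, R, A for Encoder, Refresher, Anticipator), $l$ indexes the level, and $L$ is the number of levels. How the task weights enter each per-stage term is not specified above, so they are left out of the sketch.

```latex
\mathcal{L}_{\text{total}}
  = \sum_{s \in \{E,\, R,\, A\}} \; \sum_{l=1}^{L} \mathcal{L}_{s}^{\,l},
\qquad
\mathcal{L}_{R}^{\,l} = \mathcal{L}_{R,\text{dur}}^{\,l}
```

The second relation only restates that the Refresher terms consist of a duration loss.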
4. Specialized Datasets and Annotation Strategies
For hierarchical event modeling, a key resource is the re-annotated Breakfast Actions dataset. 30 coarse activities (e.g., "make-coffee") encompass 140 fine actions (e.g., "pour-cereal," "grab-milk"), with strict nesting enforced. The corpus comprises 1,717 videos, totaling ∼77 hours and 25,537 label–duration pairs, annotated in a verb–noun scheme. This schema enables comprehensive, coarse-to-fine, segment-level forecasting. Each fine action is entirely contained within a unique parent coarse action, supporting two-level rollouts and cross-level synchronization (Morais et al., 2020).
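The strict-nesting constraint can be checked mechanically. The sketch below assumes segments are stored as `(label, start, end)` tuples, which is a hypothetical representation rather than the dataset's actual file format; the coarse label "make-cereal" and all timings are made up for illustration.

```python
def check_strict_nesting(coarse, fine):
    """Map each fine segment to the single coarse segment containing it.

    Segments are (label, start, end) tuples. Raises ValueError if a fine
    segment is not fully inside exactly one coarse segment.
    """
    parents = []
    for f_label, f_start, f_end in fine:
        containing = [
            i for i, (_, c_start, c_end) in enumerate(coarse)
            if c_start <= f_start and f_end <= c_end
        ]
        if len(containing) != 1:
            raise ValueError(
                f"fine segment {f_label!r} [{f_start}, {f_end}] has "
                f"{len(containing)} containing coarse segments"
            )
        parents.append(containing[0])
    return parents


coarse = [("make-coffee", 0, 10), ("make-cereal", 10, 18)]
fine = [("pour-milk", 0, 4), ("pour-cereal", 10, 14), ("grab-milk", 14, 18)]
print(check_strict_nesting(coarse, fine))  # [0, 1, 1]
```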
5. Evaluation Protocols and Empirical Results
Evaluation employs segment-level F1@k (segments matched by intersection-over-union at threshold k) at each hierarchy level, which is preferred over frame-level statistics because the latter are sensitive to class imbalance. Protocols use 4-fold leave-one-person-out cross-validation, with variable observation fractions (20% or 30% of video observed; forecasting targets 70%–80%). Performance is reported at increasing anticipation horizons (e.g., 10%, 20%, 30%, 50% of the future sequence).
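A segment-level F1 with an intersection-over-union threshold can be computed as in the sketch below. This is one common formulation, written under assumptions: segments are `(label, start, end)` tuples, each ground-truth segment can be matched at most once, and the thresholds used in the example (0.5 and 0.7) are illustrative only. The exact matching rule used in the reported experiments may differ.

```python
def iou(a, b):
    """Temporal intersection-over-union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def segmental_f1(pred, truth, k):
    """F1 over segments at IoU threshold k.

    A predicted segment counts as a true positive if its best-overlapping,
    still-unmatched ground-truth segment of the same label has IoU >= k.
    """
    matched = [False] * len(truth)
    tp = 0
    for p_label, p_start, p_end in pred:
        best_iou, best_j = 0.0, -1
        for j, (t_label, t_start, t_end) in enumerate(truth):
            if t_label != p_label or matched[j]:
                continue
            score = iou((p_start, p_end), (t_start, t_end))
            if score > best_iou:
                best_iou, best_j = score, j
        if best_j >= 0 and best_iou >= k:
            matched[best_j] = True
            tp += 1
    fp = len(pred) - tp   # unmatched predictions
    fn = len(truth) - tp  # missed ground-truth segments
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


truth = [("a", 0, 10), ("b", 10, 20)]
pred = [("a", 0, 8), ("b", 14, 20), ("c", 20, 22)]
print(segmental_f1(pred, truth, k=0.5))  # 0.8
print(segmental_f1(pred, truth, k=0.7))  # 0.4
```

In the example, the two matching segments have IoU 0.8 and 0.6, so raising the threshold from 0.5 to 0.7 turns one true positive into a false positive plus a false negative.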
Key empirical findings:
- After observing 20% and forecasting 50% of a video:
  - Coarse-level F1@k: HERA ≈ 76%, outperforming Farha2 (≈70%) and Synced-Pair (≈59%).
  - Fine-level F1@k: HERA ≈ 40%, exceeding Farha2 (≈39%) and Joint-Single (≈26%).
- For long-range (predict 70%) forecasts, HERA maintains ∼35% fine-level F1, with baselines dropping below 25%.
Ablation results highlight the importance of asynchronous two-way messaging and the Refresher module:
- Removing bidirectional messages causes F1@k to fall from 0.65 to 0.36.
- Removing the Refresher reduces F1 to ≈0.64.
In multi-agent RAG, HERA achieves an average 38.69% improvement over related baselines, with robust gains in both accuracy and token efficiency (Li et al., 1 Apr 2026).
6. Comparative Analysis and Ablation Insights
HERA has been systematically compared to:
- Independent-Single-RNN (per-level flat GRUs)
- Joint-Single-RNN (combined input GRU)
- Synced-Pair-RNN (synchronous cross-level GRUs)
- Farha et al.’s vanilla and two-level anticipators
HERA’s explicit hierarchical state management, asynchronous cross-layer communication, and explicit treatment of incomplete events remain critical for long-horizon reliability, outperforming alternatives in both detailed forecasting and efficiency.
Ablation findings (from fine-level F1@k, observing 20%, predicting varying horizons):
- Asynchronous messaging, especially upward feedback and downward instruction, is essential for high-fidelity rollouts.
- The discrete label component in downward messages is beneficial but not critical.
- The Refresher mechanism substantially mitigates error propagation at observation boundaries.
For multi-agent orchestration, the experience library and role-aware prompt evolution modules each provide significant (6–30%) performance/efficiency gains, with ablation confirming the necessity of semantic priors and agent-specific adaptation for multi-step, multi-agent reasoning (Li et al., 1 Apr 2026).
7. Implementation Guidelines and Potential Extensions
Recommended network configurations in the activity prediction instantiation use GRU and MLP hidden states of 16 units each, with the Adam optimizer, batch size 512, and 20 training epochs (PyTorch implementation). Complexity scales linearly with the number of events per level, and asynchronous scheduling introduces minimal computational overhead.
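These settings could be collected in a small configuration object, sketched below in plain Python. The class and field names are hypothetical, and the learning rate is left unset so that it must be supplied explicitly rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HeraTrainConfig:  # hypothetical name, not from the original code
    gru_hidden: int = 16      # GRU hidden state size
    mlp_hidden: int = 16      # MLP hidden state size
    optimizer: str = "adam"
    batch_size: int = 512
    epochs: int = 20
    learning_rate: Optional[float] = None  # set explicitly before training

    def steps_per_epoch(self, n_samples: int) -> int:
        """Number of batches per epoch (ceiling division)."""
        return -(-n_samples // self.batch_size)


cfg = HeraTrainConfig()
print(cfg.steps_per_epoch(1000))  # 2
```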
HERA supports extension to deeper hierarchies (more than two levels) by stacking further asynchronous GRU layers. Integrating visual embeddings in the Encoder yields end-to-end video-to-forecast capabilities. The paradigm generalizes from video action forecasting to domains such as assistive robotics, autonomous driving, surveillance, multi-agent planning, and retrieval-augmented multi-agent systems, where efficient global policy optimization, experience consolidation, and role-aware local adaptation are critical.
HERA’s operation does not require modification of underlying base model parameters, facilitating deployment in systems where frozen (e.g., LLM-based) agents coordinate via in-context learning and orchestrated topological restructuring (Morais et al., 2020; Li et al., 1 Apr 2026).