FM-EAC: Feature Model-Based Enhanced Actor-Critic
- Feature Model-Based Enhanced Actor-Critic is an RL framework that integrates compact feature extraction with actor–critic methods to improve sample efficiency, generalization, and transferability.
- It tightly couples planning, acting, and learning by using modular components such as CNNs, GNNs, and PANs to transform high-dimensional states into informative, low-dimensional features.
- Empirical results show FM-EAC outperforms traditional methods on single- and multi-task benchmarks by leveraging feature-informed critics and Dyna-style planning for rapid policy adaptation.
The Feature Model-Based Enhanced Actor-Critic (FM-EAC) methodology designates a class of reinforcement learning (RL) architectures that integrate low-dimensional feature modeling with an augmented actor–critic framework to address the challenges of sample efficiency, generalization, and transferability in high-dimensional, dynamic environments. FM-EAC tightly couples planning (model-based), acting (model-free), and learning via shared feature representations, enabling performance competitive with or surpassing existing state-of-the-art RL methods on both single- and multi-task benchmarks (Boney et al., 2021, Zhou et al., 17 Dec 2025).
1. Model Architecture and Key Components
FM-EAC is built upon the integration of the following core modules:
- Feature Model (FM): A parameterized feature extractor, typically denoted $f_\psi$, that maps a high-dimensional raw environmental state $s$ to a compact, task-relevant vector $z = f_\psi(s)$ with $\dim(z) \ll \dim(s)$. FM can be instantiated using convolutional neural networks (CNNs), graph neural networks (GNNs), pre-trained point array networks (PANs), or arbitrary differentiable architectures.
- Enhanced Actor-Critic (EAC): Consists of one or more actors and multiple critics that use both raw states and extracted features. The critic’s inputs may be the triple $(s, z, a)$, enabling rich, feature-informed value estimation.
- Planning Module: Utilizes the FM for short-horizon imagination rollouts or model-based policy refinement, augmenting the off-policy RL loop with synthesized experience.
The overall loop alternates between real-world execution (sampling actions via actors, collecting environment transitions and features) and batch training (updating FM, actors, and critics from pooled real and synthetic data).
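This coupling can be made concrete with a minimal, self-contained PyTorch sketch; the dimensions and module definitions below are illustrative assumptions rather than the papers' reference implementation. The key point it demonstrates is that gradients from the critic objective reach the feature model, the actor, and the critic in a single backward pass:

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions, not from the papers).
S, Z, A = 16, 8, 4  # raw-state, feature, and action dimensionalities

feature_model = nn.Sequential(nn.Linear(S, 64), nn.ReLU(), nn.Linear(64, Z))
actor = nn.Sequential(nn.Linear(S + Z, 64), nn.ReLU(), nn.Linear(64, A), nn.Tanh())
critic = nn.Sequential(nn.Linear(S + Z + A, 64), nn.ReLU(), nn.Linear(64, 1))

def act(state):
    # Acting: the actor consumes the raw state and the FM features.
    z = feature_model(state)
    return actor(torch.cat([state, z], dim=-1))

def q_value(state, action):
    # Learning: the critic is feature-augmented with z = f(s).
    z = feature_model(state)
    return critic(torch.cat([state, z, action], dim=-1))

state = torch.randn(32, S)   # dummy batch standing in for pooled replay data
q = q_value(state, act(state))
q.mean().backward()          # one backward pass reaches FM, actor, and critic
```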
2. Feature Extraction Strategies
Feature extraction is central to both single-task vision-based RL (Boney et al., 2021) and multi-task, dynamic settings (Zhou et al., 17 Dec 2025). Two prominent instantiations:
a) Differentiable Feature-Point Bottleneck (Boney et al., 2021)
- A 4-layer convolutional encoder maps stacked input frames $o_t$ to $K$ spatial feature maps $F_1, \ldots, F_K \in \mathbb{R}^{h \times w}$.
- Each map is interpreted as an unnormalized log-probability over 2D coordinates, with $P_k = \operatorname{softmax}(F_k)$, from which:
  - Soft-argmax yields the spatial location $(\hat{x}_k, \hat{y}_k) = \sum_{u,v} P_k(u,v)\,(u,v)$ as the expectation over the induced probability.
  - Presence scalar: $\rho_k \in [0,1]$ (e.g., a squashed maximum of the map's activations), encoding object presence or feature confidence.
- The overall feature vector is $z_t = (\hat{x}_1, \hat{y}_1, \rho_1, \ldots, \hat{x}_K, \hat{y}_K, \rho_K)$.
- Temporal differences $z_t - z_{t-1}$ are concatenated to the feature input, ensuring velocity information is available (see the runnable sketch after this list).
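A minimal runnable sketch of the soft-argmax feature-point extraction described above; the presence formula here (a squashed maximum over each map) is one plausible choice, not necessarily the exact one used by Boney et al.:

```python
import torch
import torch.nn.functional as F

def feature_points(maps):
    """Soft-argmax feature points from conv maps of shape (B, K, H, W)."""
    b, k, h, w = maps.shape
    # Treat each map as unnormalized log-probabilities over pixel coordinates.
    probs = F.softmax(maps.flatten(2), dim=-1).view(b, k, h, w)
    ys = torch.linspace(-1.0, 1.0, h)        # normalized coordinate grids
    xs = torch.linspace(-1.0, 1.0, w)
    y = (probs.sum(dim=3) * ys).sum(dim=2)   # expected y-coordinate per map
    x = (probs.sum(dim=2) * xs).sum(dim=2)   # expected x-coordinate per map
    presence = torch.sigmoid(maps.amax(dim=(2, 3)))  # assumed presence score
    return torch.stack([x, y, presence], dim=-1)     # (B, K, 3)

# K = 32 maps over a 21x21 grid -> a 96-dimensional feature vector per frame.
z_t = feature_points(torch.randn(2, 32, 21, 21)).flatten(1)
```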
b) General Feature Modeling (Zhou et al., 17 Dec 2025)
- The feature model $f_\psi$ can be a GNN (graph convolution over agent or environment relationships), a PAN (capturing geometric structure), or another user-specified network.
- The FM may optionally model environment transitions and rewards in feature space: $\hat{z}_{t+1} = g_\omega(z_t, a_t)$ and $\hat{r}_t = r_\omega(z_t, a_t)$.
- These models enable the agent to synthesize transitions via imagination rollouts, for planning or for augmenting the replay buffer (a rollout sketch follows this list).
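The following sketch shows how such feature-space models support Dyna-style imagination; the network shapes, names (dynamics, reward_fn), and horizon are hypothetical stand-ins:

```python
import torch
import torch.nn as nn

Z, A = 8, 4  # feature and action dimensions (assumptions)

# Hypothetical learned models operating purely in feature space.
dynamics = nn.Sequential(nn.Linear(Z + A, 64), nn.ReLU(), nn.Linear(64, Z))
reward_fn = nn.Sequential(nn.Linear(Z + A, 64), nn.ReLU(), nn.Linear(64, 1))
policy = nn.Sequential(nn.Linear(Z, 64), nn.ReLU(), nn.Linear(64, A), nn.Tanh())

@torch.no_grad()
def imagine(z0, horizon=5):
    """Roll the policy through the learned model, yielding (z, a, r, z') tuples."""
    z, rollout = z0, []
    for _ in range(horizon):
        a = policy(z)
        za = torch.cat([z, a], dim=-1)
        z_next, r = dynamics(za), reward_fn(za)
        rollout.append((z, a, r, z_next))
        z = z_next
    return rollout  # synthetic transitions, e.g., appended to the replay buffer

synthetic = imagine(torch.randn(32, Z))
```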
3. Enhanced Actor-Critic Learning
FM-EAC leverages variants of off-policy RL with the following procedural characteristics:
- Critic(s): evaluate state–action pairs with feature augmentation for improved expressivity. Twin critics and clipped double Q-learning are standard for stability and bias reduction.
- Actor: outputs either a stochastic (Gaussian) or deterministic policy.
- Losses (a runnable loss-computation sketch follows this list):
  - Critic: $\mathcal{L}_Q(\theta_i) = \mathbb{E}\big[(Q_{\theta_i}(s, z, a) - y)^2\big]$
  - Target: $y = r + \gamma \min_{i=1,2} Q_{\bar{\theta}_i}(s', z', a')$ with $z' = f_\psi(s')$ and $a' \sim \pi_\phi(\cdot \mid s', z')$
  - Actor: $\mathcal{L}_\pi(\phi) = -\,\mathbb{E}\big[Q_{\theta_1}(s, z, \pi_\phi(s, z))\big]$ for a deterministic policy; the stochastic (SAC-style) variant adds an entropy term $\alpha \log \pi_\phi(a \mid s, z)$
- Parameter Update: End-to-end differentiation updates FM, EAC, and supporting networks via gradients from actor-critic objectives.
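Putting the pieces together, a runnable sketch of the feature-augmented clipped double-Q update; the hyperparameters, shapes, and deterministic-actor surrogate are illustrative assumptions:

```python
import torch
import torch.nn as nn

S, Z, A, GAMMA = 16, 8, 4, 0.99  # illustrative shapes and discount

def mlp(i, o):
    return nn.Sequential(nn.Linear(i, 64), nn.ReLU(), nn.Linear(64, o))

feature_model = mlp(S, Z)
actor = nn.Sequential(mlp(S + Z, A), nn.Tanh())
critics = [mlp(S + Z + A, 1) for _ in range(2)]   # twin critics
targets = [mlp(S + Z + A, 1) for _ in range(2)]   # Polyak-averaged copies

def q_all(nets, s, a):
    x = torch.cat([s, feature_model(s), a], dim=-1)
    return [net(x) for net in nets]

# Dummy replay batch standing in for pooled real + synthetic transitions.
s, a = torch.randn(32, S), torch.randn(32, A)
r, s2, done = torch.randn(32, 1), torch.randn(32, S), torch.zeros(32, 1)

with torch.no_grad():  # clipped double-Q target: y = r + gamma * min_i Q_i
    a2 = actor(torch.cat([s2, feature_model(s2)], dim=-1))
    y = r + GAMMA * (1 - done) * torch.min(*q_all(targets, s2, a2))

critic_loss = sum(((qi - y) ** 2).mean() for qi in q_all(critics, s, a))
pi = actor(torch.cat([s, feature_model(s)], dim=-1))
actor_loss = -q_all(critics, s, pi)[0].mean()  # deterministic (TD3-style) surrogate
(critic_loss + actor_loss).backward()          # gradients also reach the FM
```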
4. Network Customization and Modularity
FM-EAC accommodates significant architectural flexibility:
| Module | Candidate Networks (examples) | Use Case |
|---|---|---|
| Feature Model | GNN, PAN, BPN, Conv/FCN, handcrafted | General/structured features |
| Actor | SAC, TD3, PPO, hybrids | Discrete/continuous, hybrid |
| Critic | Standard Q, value, or distributional critics | Value estimation |
Sub-networks can be selected to exploit specific inductive biases: GNNs for relational reasoning (multi-UAV), PANs for geometric structure, BPNs for task-specific information (e.g., energy-aware policies), and so on. Actor–critic variants can be matched to task requirements (action type, policy shape).
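This plug-and-play structure can be expressed as a simple registry; the keys and constructors below are hypothetical stand-ins for the GNN/PAN/BPN variants named above, not real implementations of them:

```python
import torch.nn as nn

# Hypothetical registry of feature-model backbones.
FEATURE_MODELS = {
    "conv": lambda: nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten()),
    "fcn": lambda: nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 8)),
    # "gnn": relational reasoning (multi-UAV); "pan": geometric structure; ...
}

def build_agent(fm_key, z_dim=8, a_dim=4):
    """Swap the feature model without touching actor/critic definitions."""
    return {
        "feature_model": FEATURE_MODELS[fm_key](),
        "actor": nn.Linear(z_dim, a_dim),
        "critic": nn.Linear(z_dim + a_dim, 1),
    }

agent = build_agent("fcn")
```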
5. Empirical Performance and Sample Efficiency
FM-EAC demonstrates strong empirical performance across both image-based single-task RL (Boney et al., 2021) and multi-task environment benchmarks (Zhou et al., 17 Dec 2025):
- On DeepMind Control Suite tasks, feature-point FM-EAC variants match or nearly match state-based SAC and outperform pixel-based SAC, DrQ, CURL, SAC-AE, and Dreamer in sample efficiency and return.
- On urban and agricultural tasks (e.g., multi-UAV package delivery and sensing), customized FM-EAC variants (PAN-EAC, GNN-EAC) achieve the highest average rewards and the fastest convergence on unseen maps. Typical metrics include urban quality of service (QoS), agricultural age-of-information (AoI), and reward.
| Algorithm | Reward (std) | Online (ms) | Offline (ms) | Urban QoS | Agri AoI |
|---|---|---|---|---|---|
| TD3 | $15.2$ | $79.5$ | $8.26$ | $1.77$ | — |
| SAC | $17.3$ | $74.3$ | $7.47$ | $1.80$ | — |
| MBPO | $44.9$ | $73.5$ | $7.29$ | $1.97$ | — |
| PAN-EAC | $17.0$ | $69.5$ | $8.00$ | $1.10$ | — |
| GNN-EAC | $35.9$ | $36.3$ | $8.08$ | $1.27$ | — |
Ablations establish that feature-point count is not critical for asymptotic returns, but temporal features (e.g., explicit velocities) accelerate learning. Auxiliary losses or decoders are unnecessary, as the bottlenecked geometric representation suffices for effective control. Pretrained FM layers do not generalize well across environments compared to end-to-end adaptation.
6. Computational Complexity and Implementation Insights
FM-EAC can be adapted to available computational budgets via selection of base networks and rollout lengths. Computational complexity of core modules is determined by:
- GNNs: $\mathcal{O}(N^2)$ for dense adjacency operations ($N$ = number of nodes).
- PANs: $\mathcal{O}(P)$ in the number of points $P$, enabling lightweight execution.
- Actor–Critic: $\mathcal{O}(B \cdot |\theta|)$ per update, where $B$ is the batch size and $|\theta|$ denotes the number of parameters.
Efficient implementations utilize batch-optimized forward passes, separable softmax (for feature-point extraction), and Polyak-averaged target networks to ensure stability. Real-time transfer to field-scale urban and agricultural deployment is feasible by selecting minimal FM architectures or fixed feature matrices.
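For instance, the Polyak-averaged target update mentioned above takes only a few lines; the value of tau is a typical, assumed choice:

```python
import torch
import torch.nn as nn

online, target = nn.Linear(8, 1), nn.Linear(8, 1)
target.load_state_dict(online.state_dict())  # start from identical weights

@torch.no_grad()
def polyak_update(target_net, online_net, tau=0.005):
    """In-place update: theta_target <- (1 - tau) * theta_target + tau * theta."""
    for pt, p in zip(target_net.parameters(), online_net.parameters()):
        pt.mul_(1.0 - tau).add_(p, alpha=tau)

polyak_update(target, online)  # called after each gradient step for stability
```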
7. Generalizability, Transfer, and Applications
FM-EAC’s modularity and reliance on task-relevant features confer robust transferability across unseen task instances and environments:
- Generalization: The shared FM supports spatio-temporal pattern transfer (e.g., from urban to agricultural maps) without retraining.
- Sample Efficiency: Off-policy actor-critic loop with imagination rollouts supports Dyna-style planning, reducing real environment sample counts by 3×–10× compared to standard SAC on pixels.
- Applications: Includes multi-UAV package delivery with mobile edge computing, precision agriculture/wireless sensing, and extensions to multi-robot systems, autonomous driving fleets, and multiplayer games.
By comparison, meta-heuristics (e.g., ant colony optimization, genetic algorithms, particle swarm optimization) exhibit inferior sample efficiency and break down in sequential, dynamic domains.
FM-EAC unifies model-based planning and model-free actor-critic learning via a feature-centric representation. By bottlenecking high-dimensional observations to compact, disentangled features and propagating gradients through actor–critic losses, FM-EAC supports efficient learning, broad generalization, and modular extensibility for future RL applications in complex, dynamic environments (Boney et al., 2021, Zhou et al., 17 Dec 2025).