FM-EAC: Feature Model-Based Enhanced Actor-Critic

Updated 24 December 2025
  • Feature Model-Based Enhanced Actor-Critic is an RL framework that integrates compact feature extraction with actor–critic methods to improve sample efficiency, generalization, and transferability.
  • It tightly couples planning, acting, and learning by using modular components such as CNNs, GNNs, and PANs to transform high-dimensional states into informative, low-dimensional features.
  • Empirical results show FM-EAC outperforms traditional methods on single- and multi-task benchmarks by leveraging feature-informed critics and Dyna-style planning for rapid policy adaptation.

The Feature Model-Based Enhanced Actor-Critic (FM-EAC) methodology designates a class of reinforcement learning (RL) architectures that integrate low-dimensional feature modeling with an augmented actor–critic framework to address the challenges of sample efficiency, generalization, and transferability in high-dimensional, dynamic environments. FM-EAC tightly couples planning (model-based), acting (model-free), and learning via shared feature representations, enabling performance competitive with or surpassing existing state-of-the-art RL methods on both single- and multi-task benchmarks (Boney et al., 2021, Zhou et al., 17 Dec 2025).

1. Model Architecture and Key Components

FM-EAC is built upon the integration of the following core modules:

  • Feature Model (FM): A parameterized feature extractor, typically denoted $\phi_\varphi$, that maps a high-dimensional raw environmental state $s$ to a compact, task-relevant vector $\phi(s)$. The FM can be instantiated using convolutional neural networks (CNNs), graph neural networks (GNNs), pre-trained point array networks (PANs), or arbitrary differentiable architectures.
  • Enhanced Actor-Critic (EAC): Consists of one or more actors $\pi_\theta$ and multiple critics $Q_{\phi_i}$ that use both raw states and extracted features. The critic's inputs may be $(s, a, \phi(s))$, enabling rich, feature-informed value estimation.
  • Planning Module: Utilizes the FM for short-horizon imagination rollouts or model-based policy refinement, augmenting the off-policy RL loop with synthesized experience.

The overall loop alternates between real-world execution (sampling actions via actors, collecting environment transitions and features) and batch training (updating FM, actors, and critics from pooled real and synthetic data).
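
This alternation can be summarized in a short structural sketch. The following is illustrative pseudocode under assumed interfaces (the papers do not prescribe these signatures; `env`, `feature_model`, `act`, `imagine`, `update`, and `replay` are hypothetical names supplied by the surrounding training code):

```python
from typing import Any, Callable

def fm_eac_loop(
    env: Any,                 # gym-style: reset() -> s, step(a) -> (s', r, done)
    feature_model: Callable,  # phi: s -> phi(s), the Feature Model
    act: Callable,            # (s, phi(s)) -> a, sampled from the actor
    imagine: Callable,        # FM-based short-horizon rollouts -> synthetic transitions
    update: Callable,         # one batch update of FM, actors, and critics
    replay: Any,              # replay buffer with add() and sample()
    steps: int = 100_000,
) -> None:
    """Alternate real-world execution with batch training, Dyna-style."""
    s = env.reset()
    for _ in range(steps):
        # Acting: real-environment execution with feature-informed actions.
        a = act(s, feature_model(s))
        s_next, r, done = env.step(a)
        replay.add((s, a, r, s_next, done))

        # Planning: augment the buffer with imagined transitions from the FM.
        for transition in imagine(feature_model, replay):
            replay.add(transition)

        # Learning: joint batch update from pooled real and synthetic data.
        update(replay.sample())

        s = env.reset() if done else s_next
```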

2. Feature Extraction Strategies

Feature extraction is central to both single-task vision-based RL (Boney et al., 2021) and multi-task, dynamic settings (Zhou et al., 17 Dec 2025). Two prominent instantiations:

  • A 4-layer convolutional encoder $f$ maps stacked input frames ($s_t = [o_t, o_{t-1}]$) to $K$ spatial feature maps $h_t \in \mathbb{R}^{K \times H \times W}$.
  • Each map is interpreted as an unnormalized log-probability over 2D coordinates, from which:
    • Soft-argmax yields the spatial location $(x_k, y_k)$ as the expectation over the induced probability.
    • A presence scalar $m_k = \tanh\big((1/HW)\sum_{x,y} h_t[k](x, y)\big)$ encodes object presence or feature confidence.
    • The overall feature vector is $x_t = [(x_1, y_1, m_1), \ldots] \in \mathbb{R}^{K \times 3}$ (a minimal extraction sketch follows this list).
  • Temporal differencing $x_t - x_{t-1}$ is concatenated to the feature input, ensuring velocity information is available.
  • $\phi_\varphi$ can be a GNN (graph convolution over agent or environment relationships), a PAN (for capturing geometric structure), or another user-specified network.
  • The FM may optionally model environment transitions and rewards in feature space:
    • $p_\varphi(s' \mid s, a) \approx \mathcal{N}\big(\mu_\varphi(\phi(s), a), \Sigma_\varphi(\phi(s), a)\big)$
    • $r_\varphi(s, a) \approx R_\varphi(\phi(s), a)$
  • These models enable the agent to synthesize transitions via imagination rollouts for planning or for augmenting the replay buffer.
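
A minimal runnable sketch of the feature-point extraction above, assuming PyTorch and a batch of encoder output maps `h` of shape B × K × H × W (the function name and the normalization of coordinates to [-1, 1] are implementation choices, not prescribed by the papers):

```python
import torch
import torch.nn.functional as F

def feature_points(h: torch.Tensor) -> torch.Tensor:
    """Map K spatial log-probability maps to K (x, y, presence) triples."""
    B, K, H, W = h.shape
    # Treat each map as unnormalized log-probabilities over 2D coordinates.
    probs = F.softmax(h.reshape(B, K, H * W), dim=-1).reshape(B, K, H, W)

    # Coordinate grids, normalized to [-1, 1].
    ys = torch.linspace(-1.0, 1.0, H, device=h.device)
    xs = torch.linspace(-1.0, 1.0, W, device=h.device)

    # Soft-argmax: expected (x_k, y_k) under each map's distribution.
    x_k = (probs.sum(dim=2) * xs).sum(dim=-1)    # marginal over rows -> B x K
    y_k = (probs.sum(dim=3) * ys).sum(dim=-1)    # marginal over cols -> B x K

    # Presence scalar m_k = tanh((1/HW) * sum of raw activations).
    m_k = torch.tanh(h.mean(dim=(2, 3)))         # B x K

    return torch.stack([x_k, y_k, m_k], dim=-1)  # x_t in R^{B x K x 3}
```

Temporal differencing is then a concatenation such as `torch.cat([x_t, x_t - x_prev], dim=-1)`, so the policy sees velocities as well as positions.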

3. Enhanced Actor-Critic Learning

FM-EAC leverages variants of off-policy RL with the following procedural characteristics:

  • Critic(s): $Q_{\phi_i}(s, a, \phi(s))$ evaluate state–action pairs with feature augmentation for improved expressivity. Twin critics and clipped double Q-learning are standard for stability and bias reduction.
  • Actor: $\pi_\theta(a \mid s)$ outputs either a stochastic (Gaussian) or a deterministic policy.
  • Losses (a minimal implementation sketch follows this list):
    • Critic: $\mathcal{J}_{Q_i} = \mathbb{E}\big[(Q_{\phi_i}(s, a, \phi(s)) - Y(s, a, r, s'))^2\big]$
    • Target: $Y(s, a, r, s') = r + \gamma \min_{j=1,2} Q_{\phi'_j}(s', a', \phi(s')) - \alpha \log \pi_\theta(a' \mid s')$, with $a' \sim \pi_\theta(\cdot \mid s')$
    • Actor: $\mathcal{J}_\pi = \mathbb{E}_{s, a \sim \pi_\theta}\big[\alpha \log \pi_\theta(a \mid s) - \min_i Q_{\phi_i}(s, a, \phi(s))\big]$
  • Parameter Update: End-to-end differentiation updates FM, EAC, and supporting networks via gradients from actor-critic objectives.
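
A minimal sketch of these feature-augmented losses, assuming PyTorch, twin critics `q1`/`q2` with Polyak targets `q1_t`/`q2_t`, a feature model `phi`, and an `actor` module returning a sampled action and its log-probability (these interfaces are assumptions for illustration, not the papers' API):

```python
import torch

def critic_loss(batch, phi, q1, q2, q1_t, q2_t, actor, gamma=0.99, alpha=0.2):
    s, a, r, s_next, done = batch
    with torch.no_grad():
        f_next = phi(s_next)
        a_next, logp_next = actor(s_next, f_next)  # a' ~ pi(.|s')
        # Clipped double Q: min over the two target critics.
        q_next = torch.min(q1_t(s_next, a_next, f_next),
                           q2_t(s_next, a_next, f_next))
        # Y = r + gamma * (min_j Q'_j - alpha * log pi); the terminal mask is
        # the usual implementation detail, not shown in the equation above.
        y = r + gamma * (1.0 - done) * (q_next - alpha * logp_next)
    f = phi(s)  # gradients flow into the FM here (end-to-end training)
    return ((q1(s, a, f) - y) ** 2).mean() + ((q2(s, a, f) - y) ** 2).mean()

def actor_loss(batch, phi, q1, q2, actor, alpha=0.2):
    s = batch[0]
    f = phi(s)
    a, logp = actor(s, f)
    q = torch.min(q1(s, a, f), q2(s, a, f))
    # J_pi = E[alpha * log pi(a|s) - min_i Q_i(s, a, phi(s))]
    return (alpha * logp - q).mean()
```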

4. Network Customization and Modularity

FM-EAC accommodates significant architectural flexibility:

| Module | Candidate networks (examples) | Use case |
|---|---|---|
| Feature Model | GNN, PAN, BPN, Conv/FCN, handcrafted | General or structured features |
| Actor | SAC, TD3, PPO, hybrids | Discrete, continuous, or hybrid action spaces |
| Critic | Standard Q, value, or distributional critics | Value estimation |

Sub-networks can be selected to exploit specific inductive biases: GNNs for relational reasoning (multi-UAV), PANs for geometric structure, BPNs for task-specific information (e.g., energy-aware policies), and so on. Actor–critic variants can be matched to task requirements (action type, policy shape).
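
One way to picture this modularity is a configuration object that pairs a feature-model backbone with an actor-critic algorithm. The fields and values below are hypothetical examples of the table above, not a published API:

```python
from dataclasses import dataclass

@dataclass
class FMEACConfig:
    feature_model: str = "conv"   # "gnn" | "pan" | "bpn" | "conv" | "handcrafted"
    actor_algo: str = "sac"       # "sac" | "td3" | "ppo" | hybrid variants
    critic: str = "twin_q"        # "twin_q" | "value" | "distributional"
    rollout_horizon: int = 3      # imagination rollout length for planning

# Example: a GNN-backed variant for relational multi-UAV tasks.
gnn_eac = FMEACConfig(feature_model="gnn", actor_algo="sac", critic="twin_q")
```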

5. Empirical Performance and Sample Efficiency

FM-EAC demonstrates strong empirical performance across both image-based single-task RL (Boney et al., 2021) and multi-task environment benchmarks (Zhou et al., 17 Dec 2025):

  • On DeepMind Control Suite tasks, feature-point FM-EAC variants match or nearly match state-based SAC and outperform pixel-based SAC, DrQ, CURL, SAC-AE, and Dreamer in sample efficiency and return.
  • On urban and agricultural tasks (e.g., multi-UAV package delivery and sensing), customized FM-EAC variants (PAN-EAC, GNN-EAC) achieve the highest average rewards and minimal convergence times over unseen maps. Typical metrics include urban QoS, agricultural age-of-information (AoI), and reward.
| Algorithm | Reward ($\pm$ std) | Online (ms) | Offline (ms) | Urban QoS | Agri AoI |
|---|---|---|---|---|---|
| TD3 | $410 \pm 62$ | 15.2 | 79.5 | 8.26 | 1.77 |
| SAC | $446 \pm 50$ | 17.3 | 74.3 | 7.47 | 1.80 |
| MBPO | $134 \pm 80$ | 44.9 | 73.5 | 7.29 | 1.97 |
| PAN-EAC | $1392 \pm 62$ | 17.0 | 69.5 | 8.00 | 1.10 |
| GNN-EAC | $1400 \pm 59$ | 35.9 | 36.3 | 8.08 | 1.27 |

Ablations establish that the feature-point count $K$ is not critical for asymptotic returns, but temporal features (e.g., explicit velocities) accelerate learning. Auxiliary losses or decoders are unnecessary, as the bottlenecked geometric representation suffices for effective control. Pretrained FM layers do not generalize well across environments compared to end-to-end adaptation.

6. Computational Complexity and Implementation Insights

FM-EAC can be adapted to available computational budgets via selection of base networks and rollout lengths. Computational complexity of core modules is determined by:

  • GNNs: $O(N^2)$ for dense adjacency operations, where $N$ is the number of nodes.
  • PANs: $O(P)$ in the number of points $P$, enabling lightweight execution.
  • Actor–Critic: $O(B \cdot |\theta|)$, where $B$ is the batch size and $|\theta|$ denotes the number of parameters.

Efficient implementations utilize batch-optimized forward passes, separable softmax (for feature-point extraction), and Polyak-averaged target networks to ensure stability. Real-time transfer to field-scale urban and agricultural deployment is feasible by selecting minimal FM architectures or fixed feature matrices.
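
The Polyak-averaged target update mentioned above is a few lines in any framework; a minimal PyTorch version follows (the coefficient `tau` is the usual soft-update rate, e.g. 0.005):

```python
import torch

@torch.no_grad()
def polyak_update(target: torch.nn.Module, source: torch.nn.Module,
                  tau: float = 0.005) -> None:
    """target <- (1 - tau) * target + tau * source, parameter-wise."""
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.mul_(1.0 - tau).add_(sp, alpha=tau)
```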

7. Generalizability, Transfer, and Applications

FM-EAC’s modularity and reliance on task-relevant features confer robust transferability across unseen task instances and environments:

  • Generalization: Shared FM supports spatio-temporal pattern transfer (e.g., urban to agricultural maps) without retraining.
  • Sample Efficiency: Off-policy actor-critic loop with imagination rollouts supports Dyna-style planning, reducing real environment sample counts by 3×–10× compared to standard SAC on pixels.
  • Applications: multi-UAV package delivery with mobile edge computing, precision agriculture and wireless sensing, and extensions to multi-robot systems, autonomous driving fleets, and multiplayer games.

Meta-heuristics (e.g., ant colony optimization, genetic algorithms, particle swarm optimization) exhibit inferior sample efficiency and fail in sequential, dynamic domains compared to FM-EAC.


FM-EAC unifies model-based planning and model-free actor-critic learning via a feature-centric representation. By bottlenecking high-dimensional observations to compact, disentangled features and propagating gradients through actor–critic losses, FM-EAC supports efficient learning, broad generalization, and modular extensibility for future RL applications in complex, dynamic environments (Boney et al., 2021, Zhou et al., 17 Dec 2025).
