Free Energy Projective Simulation
- FEPS is a framework that models agents as interpretable, graph-based systems using active inference and the free energy principle for internal policy optimization.
- The methodology employs a clone-structured episodic memory and random walk deliberation to build explicit world models and compute expected free energy for decision-making.
- Empirical validations in tasks like grid navigation and timed-response paradigms demonstrate FEPS's potential in contextual learning and adaptive policy optimization.
Free Energy Projective Simulation (FEPS) is a framework that models agents as interpretable, graph-based systems performing active inference in partially observable environments. FEPS integrates the free energy principle (FEP) and active inference (AIF) with a structured memory architecture, enabling agents to derive optimal policies via internal reward mechanisms and explicit world models without relying on external scalar rewards or deep neural networks (Pazem et al., 2024).
1. World Model and Internal Representation
FEPS agents maintain an explicit world model rooted in a partially observable Markov decision process (POMDP). The model components are as follows:
- Belief-state space: $\mathcal{B}$ (with $b \in \mathcal{B}$ denoting "clone clips").
- Observation space: $\mathcal{S}$ (sensory states).
- Action space: $\mathcal{A}$.
- Transition function: trainable $h$-values, representing $p(b' \mid b, a)$, model transitions between belief states given actions.
- Emission (likelihood) function: $s(b)$, representing $p(s \mid b)$, associates belief states with observations.
- Internal reward: $R$, used for learning (see Section 3).
A key architectural feature is the clone-structured Episodic & Compositional Memory (ECM). Each observation $s$ is associated with $N_{\text{clone}}$ "clone clips" $b$, yielding $|\mathcal{B}| = N_{\text{clone}} \cdot |\mathcal{S}|$. The emission edges are deterministic ($p(s \mid b) = \delta_{s, s(b)}$). Transition edges carry trainable weights $h$ encoding $p(b' \mid b, a)$. Deliberation in FEPS corresponds to a random walk on the ECM, while policy selection is implemented via a bipartite graph from $\mathcal{B}$ to $\mathcal{A}$.
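The clone-structured ECM described above can be sketched in code. The following is a minimal, illustrative implementation, not the authors' code: the class name, the clip encoding as `(observation, clone_index)` tuples, and the uniform initialization of $h$-values at a baseline `h0` are assumptions for the example.

```python
# Hypothetical sketch of a clone-structured ECM: N_clone clone clips per
# observation, deterministic emission edges, and trainable h-values on
# transition edges (identifiers are illustrative, not from the paper's code).
from collections import defaultdict

class CloneStructuredECM:
    def __init__(self, observations, actions, n_clone=3, h0=1.0):
        self.actions = list(actions)
        self.h0 = h0
        # Each observation s gets n_clone clone clips b = (s, k).
        self.clips = [(s, k) for s in observations for k in range(n_clone)]
        # Deterministic emission: p(s | b) = 1 iff s == s(b).
        self.emission = {b: b[0] for b in self.clips}
        # Trainable transition weights h(b, a, b') encoding p(b' | b, a),
        # initialized uniformly at the baseline h0.
        self.h = defaultdict(lambda: h0)

    def p_prior(self, b, a):
        """Normalize h-values into the transition distribution p(b' | b, a)."""
        weights = {b2: self.h[(b, a, b2)] for b2 in self.clips}
        total = sum(weights.values())
        return {b2: w / total for b2, w in weights.items()}

ecm = CloneStructuredECM(observations=["light_on", "light_off"],
                         actions=["press", "wait"], n_clone=2)
assert len(ecm.clips) == 4                    # |B| = N_clone * |S|
probs = ecm.p_prior(("light_on", 0), "press")
assert abs(sum(probs.values()) - 1.0) < 1e-9  # valid distribution
```

A deliberative random walk then amounts to repeatedly sampling from `p_prior` and reading off the emission of the clip reached.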
The full joint distribution across time is
$$p(s_{1:T}, b_{1:T} \mid a_{1:T-1}) = p(b_1)\, p(s_1 \mid b_1) \prod_{t=2}^{T} p(b_t \mid b_{t-1}, a_{t-1})\, p(s_t \mid b_t).$$
2. Expected Free Energy and Policy Construction
Policy optimization in FEPS is regulated by minimizing expected free energy (EFE), in line with AIF:
- For current belief $b$ and candidate action $a$, the one-step predictive model is $p(s', b' \mid b, a) = p(s' \mid b')\, p(b' \mid b, a)$.
- The expected free energy for action $a$ is
$$G_b(a) = -\underbrace{\big(H[p(s' \mid b, a)] - \mathbb{E}_{p(b' \mid b, a)}\, H[p(s' \mid b')]\big)}_{\text{epistemic value}} - \underbrace{\mathbb{E}_{p(s' \mid b, a)}\big[\log \mathrm{pref}(s')\big]}_{\text{pragmatic value}},$$
with $p(s' \mid b, a) = \sum_{b'} p(s' \mid b')\, p(b' \mid b, a)$ and $H[\cdot]$ the conditional entropy.
EFE decomposes as:
- Epistemic value: Expected information gain, corresponding to entropy reduction.
- Pragmatic value: Expected utility for matching the preference distribution $\mathrm{pref}$.
The policy is determined by a softmax over the negative EFE:
$$\pi(a \mid b) = \frac{e^{-\zeta G_b(a)}}{\sum_{a'} e^{-\zeta G_b(a')}},$$
where a large inverse temperature $\zeta$ promotes exploitation (EFE minimization) and a small $\zeta$ promotes exploration.
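A compact sketch of the EFE computation and softmax policy follows. This is an illustrative rendering of the one-step decomposition above, not the paper's implementation; distributions are plain dicts and the two-belief example environment is invented.

```python
# Illustrative one-step EFE, G = -(information gain) - E[log pref(s')],
# and the softmax policy pi(a|b) ∝ exp(-zeta * G_b(a)).
import math

def entropy(p):
    return -sum(q * math.log(q) for q in p.values() if q > 0)

def efe(p_prior_b_a, p_s_given_b, pref):
    """EFE for one belief/action pair; p_prior_b_a is p(b' | b, a)."""
    # Predictive distribution p(s' | b, a) = sum_b' p(s'|b') p(b'|b,a)
    p_s = {}
    for b2, pb2 in p_prior_b_a.items():
        for s, ps in p_s_given_b[b2].items():
            p_s[s] = p_s.get(s, 0.0) + pb2 * ps
    info_gain = entropy(p_s) - sum(pb2 * entropy(p_s_given_b[b2])
                                   for b2, pb2 in p_prior_b_a.items())
    pragmatic = sum(ps * math.log(pref[s]) for s, ps in p_s.items() if ps > 0)
    return -info_gain - pragmatic

def softmax_policy(G, zeta=2.0):
    w = {a: math.exp(-zeta * g) for a, g in G.items()}
    z = sum(w.values())
    return {a: v / z for a, v in w.items()}

# Toy example: deterministic emissions, preference for "food".
p_s_given_b = {"b_food": {"food": 1.0}, "b_empty": {"empty": 1.0}}
pref = {"food": 0.9, "empty": 0.1}
G = {"go": efe({"b_food": 1.0}, p_s_given_b, pref),
     "stay": efe({"b_empty": 1.0}, p_s_given_b, pref)}
pi = softmax_policy(G)
assert G["go"] < G["stay"]       # preferred outcome has lower EFE
assert pi["go"] > pi["stay"]     # policy concentrates on the low-EFE action
```

With deterministic emissions the epistemic term vanishes and only the pragmatic term differentiates the actions, which is why `go` dominates here.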
3. Internal Rewards and Learning Dynamics
Standard RL paradigms rely on external scalar rewards provided by the environment. In contrast, FEPS agents use solely internal rewards driven by prediction accuracy:
- Each transition edge $(b \to b')$ is endowed with a confidence value $f$.
- During prediction, as long as the predicted observation $\hat{s}$ matches the actual observation $s_{\text{env}}$, confidence is incremented along the trajectory.
- Upon the first prediction error, reinforcement is distributed to each implicated transition's $h$-value:
$$h \leftarrow h - \gamma\,(h - h^0) + R \cdot f,$$
with $\gamma$ the forgetting rate and $h^0$ the baseline initialization for each edge.
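The update rule above can be sketched for a single edge. This is a minimal illustration of the stated rule; the loop over a trajectory of predictions and the parameter values are assumptions for the example.

```python
# Minimal sketch of the internal-reward update: confidence f accumulates while
# predictions are correct, and on the first error the implicated h-value is
# updated as h <- h - gamma*(h - h0) + R*f (names follow the rule in the text).
def update_h(h, f, h0=1.0, gamma=0.1, R=1.0):
    """Apply the FEPS reinforcement rule to one transition edge."""
    return h - gamma * (h - h0) + R * f

h = 1.0   # edge starts at the baseline h0
f = 0
for correct in [True, True, True, False]:  # three hits, then an error
    if correct:
        f += 1           # confidence grows along the trajectory
    else:
        h = update_h(h, f)  # reinforce with the accumulated confidence
        f = 0               # reset confidence after the error
# With h = h0 the forgetting term vanishes: h = 1.0 + 1.0*3 = 4.0
```

Longer error-free stretches thus translate into larger $h$ increments, so transitions that support accurate prediction are solidified fastest.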
4. FEPS Operational Algorithm
FEPS proceeds through episodic interaction and online model adaptation. The essential workflow is summarized as:
```text
for each observation s:
    create N_clone clone clips b with emission edge b → s
initialize all transition h-values to h^0 and confidences f = 0
initialize preference distribution pref

while not terminated:
    reset environment; receive s_env
    excite all b with s(b) = s_env        # initialize candidate set C
    while episode not finished:
        # Superposed belief-state estimation
        maintain candidate set C ⊆ B consistent with s_env
        # Compute EFE and sample policy
        for each b in C:
            compute G_b[a] for all a
            π(a|b) = softmax(-ζ G_b[a])
        mix π(a|b) across C to select a_t
        # Model-based prediction and update
        for each b in C:
            sample b′ ~ p_prior(b′ | b, a_t)
            predict ŝ = s(b′)             # via the emission edge of b′
        take action a_t; observe new s_env
        # Confidence update
        if ŝ == s_env:
            f(b→b′) += 1 for all edges in the trajectory
            C ← matching clones
        else:
            for each edge (b→b′) with accumulated f:
                h ← h − γ(h − h^0) + R·f
                f(b→b′) = 0
            reinitialize C to all clones of s_env
        # Policy update
        for each b:
            h_{b→a} = G_b[a]
            π(a|b) = softmax(-ζ G_b[a])
```
This workflow aligns model learning, credit assignment, and planning via EFE minimization using only internal signals.
5. Techniques for Interpretability and Robustness
Several explicit strategies enable FEPS interpretability and facilitate robust credit assignment in the presence of partial observability:
- Clone-Structured Representation: Each clone clip inherits the semantics of its source observation, and over training, clones of the same observation diverge, encoding distinct hidden-state contexts.
- Belief-State Superposition: The candidate set $C$ (the clones compatible with the current observation $s_{\text{env}}$) models superposed beliefs; prediction-consistent narrowing localizes the agent's true latent state.
- Long-Term Goals and Look-Ahead Preferences: The preference distribution $\mathrm{pref}$ is factorized into a component that targets specific goal observations and a look-ahead component that is propagated backward through the world model, akin to dynamic programming.
- Reducing Prediction Errors: Candidate-set narrowing rapidly recovers the true belief, while confidence-based reinforcement selectively solidifies reliable transitions.
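Belief-state superposition and prediction-consistent narrowing can be illustrated with a toy candidate set. The helper names and the scent data below are invented for the example; only the narrowing logic (keep prediction-consistent clones, reset on total failure) follows the text.

```python
# Toy illustration of belief-state superposition: the candidate set C starts
# as all clones of the current observation and is narrowed to the clones
# whose predicted next observation matched the environment.
def init_candidates(clips, s_env):
    """All clone clips that emit the current observation."""
    return {b for b in clips if b[0] == s_env}

def narrow(candidates, predictions, s_env, clips):
    """Keep prediction-consistent clones; on total failure, reset C."""
    surviving = {b for b in candidates if predictions.get(b) == s_env}
    return surviving if surviving else init_candidates(clips, s_env)

clips = [("scent_A", 0), ("scent_A", 1), ("scent_B", 0)]
C = init_candidates(clips, "scent_A")
assert C == {("scent_A", 0), ("scent_A", 1)}   # superposed belief
# Only clone 0 predicted the observation that actually occurred next:
predictions = {("scent_A", 0): "scent_B", ("scent_A", 1): "scent_A"}
C = narrow(C, predictions, "scent_B", clips)
assert C == {("scent_A", 0)}                   # latent state localized
```

Each narrowing step discards clones whose world-model predictions failed, so ambiguity shrinks exactly as fast as the model can discriminate contexts.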
6. Empirical Validation in Behavioral Paradigms
FEPS was validated in two behavioral-biology–inspired RL settings:
A. Timed-Response Task (Skinner Box analog):
- Environment: Hidden MDP featuring ambiguous observations (e.g., "light on, hungry" from two different latent states).
- Metrics: Error-free trajectory length, variational free energy (VFE), and EFE evolution.
- Results:
- Agents differentiate clones for ambiguous contexts.
- VFE undergoes sharp drops corresponding to elimination of impossible transitions and correct context separation.
- Look-ahead preferences enable the acquisition of correct multi-step policies.
B. Partially Observable Grid Navigation:
- Environment: a grid world with a hidden food goal and overlapping scent observations; agent actions: up/down/left/right; $N_{\text{clone}} = 3$ clones per scent.
- Metrics: Trajectory length, VFE, EFE, policy optimality, and median steps to goal.
- Results:
- For both preference settings, goal-directed ("task") and exploratory ("wandering"), the longest error-free trajectories were achieved in most trials.
- Superposed belief tracking maximizes error-free exploration.
- Post-training, clone-to-cell mapping recapitulates a cognitive map.
- Preference reconfiguration reuses the trained world model for new goals.
7. Limitations and Prospects for Extension
Noted constraints and open areas for FEPS include:
- Task specification is limited to sensory observations; preferences over hidden states cannot be expressed directly.
- Scalability: Application to large state/action spaces may require hierarchical or function-approximate augmentations.
- Exploration-exploitation: The softmax temperature currently requires manual tuning; integration of intrinsic motivators (e.g., boredom, novelty) is pending.
- Model expansion: Online cloning/structural growth and related mechanisms are not dynamically addressed.
- Physical embodiment: Incorporation of FEPS and ECM update rules in real-world or neuromorphic agents remains future work.
FEPS thus establishes an interpretable, fully-internal-reward, graph-based instantiation of active inference, providing cognitive mapping, distinct hidden-state contextualization, and preference-adaptive policies through EFE minimization (Pazem et al., 2024).