Free Energy Projective Simulation
- FEPS is a framework that models agents as interpretable, graph-based systems using active inference and the free energy principle for internal policy optimization.
- The methodology employs a clone-structured episodic memory and random walk deliberation to build explicit world models and compute expected free energy for decision-making.
- Empirical validations in tasks like grid navigation and timed-response paradigms demonstrate FEPS's potential in contextual learning and adaptive policy optimization.
Free Energy Projective Simulation (FEPS) is a framework that models agents as interpretable, graph-based systems performing active inference in partially observable environments. FEPS integrates the free energy principle (FEP) and active inference (AIF) with a structured memory architecture, enabling agents to derive optimal policies via internal reward mechanisms and explicit world models without relying on external scalar rewards or deep neural networks (Pazem et al., 2024).
1. World Model and Internal Representation
FEPS agents maintain an explicit world model rooted in a partially observable Markov decision process (POMDP). The model components are as follows:
- Belief-state space: $\mathcal{B}$ (with $b \in \mathcal{B}$ denoting "clone clips").
- Observation space: $\mathcal{S}$ (sensory states).
- Action space: $\mathcal{A}$.
- Transition function: trainable $h$-values, representing $p(b' \mid b, a)$, model transitions between belief states given actions.
- Emission (likelihood) function: $s(b)$, representing $p(s \mid b)$, associates belief states with observations.
- Internal reward: $R$, used for learning (see Section 3).
A key architectural feature is the clone-structured Episodic & Compositional Memory (ECM). Each observation $s$ is associated with $N_{\text{clone}}$ "clone clips" $b$, yielding $|\mathcal{B}| = N_{\text{clone}} \cdot |\mathcal{S}|$. The emission edges are deterministic ($p(s \mid b) = \delta_{s, s(b)}$). Transition edges carry trainable weights $h$ encoding $p(b' \mid b, a)$. Deliberation in FEPS corresponds to a random walk on the ECM, while policy selection is implemented via a bipartite graph from $\mathcal{B}$ to $\mathcal{A}$.
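The clone-structured ECM described above can be sketched in code. The following is a minimal, illustrative implementation, not the authors' code: the class name, the clip encoding as `(observation, clone_index)` tuples, and the uniform initialization of $h$-values at a baseline `h0` are assumptions for the example.

```python
# Hypothetical sketch of a clone-structured ECM: N_clone clone clips per
# observation, deterministic emission edges, and trainable h-values on
# transition edges (identifiers are illustrative, not from the paper's code).
from collections import defaultdict

class CloneStructuredECM:
    def __init__(self, observations, actions, n_clone=3, h0=1.0):
        self.actions = list(actions)
        self.h0 = h0
        # Each observation s gets n_clone clone clips b = (s, k).
        self.clips = [(s, k) for s in observations for k in range(n_clone)]
        # Deterministic emission: p(s | b) = 1 iff s == s(b).
        self.emission = {b: b[0] for b in self.clips}
        # Trainable transition weights h(b, a, b') encoding p(b' | b, a),
        # initialized uniformly at the baseline h0.
        self.h = defaultdict(lambda: h0)

    def p_prior(self, b, a):
        """Normalize h-values into the transition distribution p(b' | b, a)."""
        weights = {b2: self.h[(b, a, b2)] for b2 in self.clips}
        total = sum(weights.values())
        return {b2: w / total for b2, w in weights.items()}

ecm = CloneStructuredECM(observations=["light_on", "light_off"],
                         actions=["press", "wait"], n_clone=2)
assert len(ecm.clips) == 4                    # |B| = N_clone * |S|
probs = ecm.p_prior(("light_on", 0), "press")
assert abs(sum(probs.values()) - 1.0) < 1e-9  # valid distribution
```

A deliberative random walk then amounts to repeatedly sampling from `p_prior` and reading off the emission of the clip reached.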
The full joint distribution across time is
$$p(s_{1:T}, b_{1:T} \mid a_{1:T-1}) = p(b_1)\, p(s_1 \mid b_1) \prod_{t=2}^{T} p(b_t \mid b_{t-1}, a_{t-1})\, p(s_t \mid b_t).$$
2. Expected Free Energy and Policy Construction
Policy optimization in FEPS is regulated by minimizing expected free energy (EFE), in line with AIF:
- For current belief $b$ and candidate action $a$, the one-step predictive model is $p(s', b' \mid b, a) = p(s' \mid b')\, p(b' \mid b, a)$.
- The expected free energy for action $a$ is
$$G_b(a) = -\underbrace{\big(H[p(s' \mid b, a)] - \mathbb{E}_{p(b' \mid b, a)}\, H[p(s' \mid b')]\big)}_{\text{epistemic value}} - \underbrace{\mathbb{E}_{p(s' \mid b, a)}\big[\log \mathrm{pref}(s')\big]}_{\text{pragmatic value}},$$
with $p(s' \mid b, a) = \sum_{b'} p(s' \mid b')\, p(b' \mid b, a)$ and $H[\cdot]$ the conditional entropy.
EFE decomposes as:
- Epistemic value: Expected information gain, corresponding to entropy reduction.
- Pragmatic value: Expected utility for matching the preference distribution $\mathrm{pref}$.
The policy is determined by a softmax over the negative EFE:
$$\pi(a \mid b) = \frac{e^{-\zeta G_b(a)}}{\sum_{a'} e^{-\zeta G_b(a')}},$$
where a large inverse temperature $\zeta$ promotes exploitation (EFE minimization) and a small $\zeta$ promotes exploration.
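A compact sketch of the EFE computation and softmax policy follows. This is an illustrative rendering of the one-step decomposition above, not the paper's implementation; distributions are plain dicts and the two-belief example environment is invented.

```python
# Illustrative one-step EFE, G = -(information gain) - E[log pref(s')],
# and the softmax policy pi(a|b) ∝ exp(-zeta * G_b(a)).
import math

def entropy(p):
    return -sum(q * math.log(q) for q in p.values() if q > 0)

def efe(p_prior_b_a, p_s_given_b, pref):
    """EFE for one belief/action pair; p_prior_b_a is p(b' | b, a)."""
    # Predictive distribution p(s' | b, a) = sum_b' p(s'|b') p(b'|b,a)
    p_s = {}
    for b2, pb2 in p_prior_b_a.items():
        for s, ps in p_s_given_b[b2].items():
            p_s[s] = p_s.get(s, 0.0) + pb2 * ps
    info_gain = entropy(p_s) - sum(pb2 * entropy(p_s_given_b[b2])
                                   for b2, pb2 in p_prior_b_a.items())
    pragmatic = sum(ps * math.log(pref[s]) for s, ps in p_s.items() if ps > 0)
    return -info_gain - pragmatic

def softmax_policy(G, zeta=2.0):
    w = {a: math.exp(-zeta * g) for a, g in G.items()}
    z = sum(w.values())
    return {a: v / z for a, v in w.items()}

# Toy example: deterministic emissions, preference for "food".
p_s_given_b = {"b_food": {"food": 1.0}, "b_empty": {"empty": 1.0}}
pref = {"food": 0.9, "empty": 0.1}
G = {"go": efe({"b_food": 1.0}, p_s_given_b, pref),
     "stay": efe({"b_empty": 1.0}, p_s_given_b, pref)}
pi = softmax_policy(G)
assert G["go"] < G["stay"]       # preferred outcome has lower EFE
assert pi["go"] > pi["stay"]     # policy concentrates on the low-EFE action
```

With deterministic emissions the epistemic term vanishes and only the pragmatic term differentiates the actions, which is why `go` dominates here.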
3. Internal Rewards and Learning Dynamics
Standard RL paradigms rely on external scalar rewards provided by the environment. In contrast, FEPS agents use solely internal rewards driven by prediction accuracy:
- Each transition edge $(b \to b')$ is endowed with a confidence value $f$.
- During prediction, as long as the predicted observation $\hat{s}$ matches the actual observation $s_{\text{env}}$, confidence is incremented along the trajectory.
- Upon the first prediction error, reinforcement is distributed to each implicated transition's $h$-value:
$$h \leftarrow h - \gamma\,(h - h^0) + R \cdot f,$$
with $\gamma$ the forgetting rate and $h^0$ the baseline initialization for each edge.
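The update rule above can be sketched for a single edge. This is a minimal illustration of the stated rule; the loop over a trajectory of predictions and the parameter values are assumptions for the example.

```python
# Minimal sketch of the internal-reward update: confidence f accumulates while
# predictions are correct, and on the first error the implicated h-value is
# updated as h <- h - gamma*(h - h0) + R*f (names follow the rule in the text).
def update_h(h, f, h0=1.0, gamma=0.1, R=1.0):
    """Apply the FEPS reinforcement rule to one transition edge."""
    return h - gamma * (h - h0) + R * f

h = 1.0   # edge starts at the baseline h0
f = 0
for correct in [True, True, True, False]:  # three hits, then an error
    if correct:
        f += 1           # confidence grows along the trajectory
    else:
        h = update_h(h, f)  # reinforce with the accumulated confidence
        f = 0               # reset confidence after the error
# With h = h0 the forgetting term vanishes: h = 1.0 + 1.0*3 = 4.0
```

Longer error-free stretches thus translate into larger $h$ increments, so transitions that support accurate prediction are solidified fastest.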
4. FEPS Operational Algorithm
FEPS proceeds through episodic interaction and online model adaptation. The essential workflow is summarized as:
```text
for each observation s:
    create N_clone clone clips b with emission edge b → s
initialize all transition h-values to h^0 and confidences f = 0
initialize preference distribution pref

while not terminated:
    reset environment; receive s_env
    excite all b with s(b) = s_env        # initialize candidate set C
    while episode not finished:
        # Superposed belief-state estimation
        maintain candidate set C ⊆ B consistent with s_env
        # Compute EFE and sample policy
        for each b in C:
            compute G_b[a] for all a
            π(a|b) = softmax(-ζ G_b[a])
        mix π(a|b) across C to select a_t
        # Model-based prediction and update
        for each b in C:
            sample b′ ~ p_prior(b′ | b, a_t)
            predict ŝ = s(b′)             # via the emission edge of b′
        take action a_t; observe new s_env
        # Confidence update
        if ŝ == s_env:
            f(b→b′) += 1 for all edges in the trajectory
            C ← matching clones
        else:
            for each edge (b→b′) with accumulated f:
                h ← h − γ(h − h^0) + R·f
                f(b→b′) = 0
            reinitialize C to all clones of s_env
        # Policy update
        for each b:
            h_{b→a} = G_b[a]
            π(a|b) = softmax(-ζ G_b[a])
```
This workflow aligns model learning, credit assignment, and planning via EFE minimization using only internal signals.
5. Techniques for Interpretability and Robustness
Several explicit strategies enable FEPS interpretability and facilitate robust credit assignment in the presence of partial observability:
- Clone-Structured Representation: Each clone clip inherits the semantics of its source observation, and over training, clones of the same observation diverge, encoding distinct hidden-state contexts.
- Belief-State Superposition: The candidate set $C$ (the clones compatible with the current observation $s_{\text{env}}$) models superposed beliefs; prediction-consistent narrowing localizes the agent's true latent state.
- Long-Term Goals and Look-Ahead Preferences: The preference distribution $\mathrm{pref}$ is factorized into a component that targets specific goal observations and a look-ahead component that is propagated backward through the world model, akin to dynamic programming.
- Reducing Prediction Errors: Candidate-set narrowing rapidly recovers the true belief, while confidence-based reinforcement selectively solidifies reliable transitions.
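Belief-state superposition and prediction-consistent narrowing can be illustrated with a toy candidate set. The helper names and the scent data below are invented for the example; only the narrowing logic (keep prediction-consistent clones, reset on total failure) follows the text.

```python
# Toy illustration of belief-state superposition: the candidate set C starts
# as all clones of the current observation and is narrowed to the clones
# whose predicted next observation matched the environment.
def init_candidates(clips, s_env):
    """All clone clips that emit the current observation."""
    return {b for b in clips if b[0] == s_env}

def narrow(candidates, predictions, s_env, clips):
    """Keep prediction-consistent clones; on total failure, reset C."""
    surviving = {b for b in candidates if predictions.get(b) == s_env}
    return surviving if surviving else init_candidates(clips, s_env)

clips = [("scent_A", 0), ("scent_A", 1), ("scent_B", 0)]
C = init_candidates(clips, "scent_A")
assert C == {("scent_A", 0), ("scent_A", 1)}   # superposed belief
# Only clone 0 predicted the observation that actually occurred next:
predictions = {("scent_A", 0): "scent_B", ("scent_A", 1): "scent_A"}
C = narrow(C, predictions, "scent_B", clips)
assert C == {("scent_A", 0)}                   # latent state localized
```

Each narrowing step discards clones whose world-model predictions failed, so ambiguity shrinks exactly as fast as the model can discriminate contexts.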
6. Empirical Validation in Behavioral Paradigms
FEPS was validated in two behavioral-biology–inspired RL settings:
A. Timed-Response Task (Skinner Box analog):
- Environment: Hidden MDP featuring ambiguous observations (e.g., "light on, hungry" from two different latent states).
- Metrics: Error-free trajectory length, variational free energy (VFE), and EFE evolution.
- Results:
- Agents differentiate clones for ambiguous contexts.
- VFE undergoes sharp drops corresponding to elimination of impossible transitions and correct context separation.
- Look-ahead preferences enable the acquisition of correct multi-step policies.
B. Partially Observable Grid Navigation:
- Environment: a grid world with a hidden food goal and overlapping scent observations; agent actions: up/down/left/right; $N_{\text{clone}} = 3$ clones per scent.
- Metrics: Trajectory length, VFE, EFE, policy optimality, and median steps to goal.
- Results:
- For both preference settings, goal-directed ("task") and exploratory ("wandering"), the longest error-free trajectories were achieved in most trials.
- Superposed belief tracking maximizes error-free exploration.
- Post-training, clone-to-cell mapping recapitulates a cognitive map.
- Preference reconfiguration reuses the trained world model for new goals.
7. Limitations and Prospects for Extension
Noted constraints and open areas for FEPS include:
- Task specification is limited to sensory observations; preferences over hidden states cannot be expressed directly.
- Scalability: Application to large state/action spaces may require hierarchical or function-approximate augmentations.
- Exploration-exploitation: The softmax temperature currently requires manual tuning; integration of intrinsic motivators (e.g., boredom, novelty) is pending.
- Model expansion: Online cloning/structural growth and related mechanisms are not dynamically addressed.
- Physical embodiment: Incorporation of FEPS and ECM update rules in real-world or neuromorphic agents remains future work.
FEPS thus establishes an interpretable, fully-internal-reward, graph-based instantiation of active inference, providing cognitive mapping, distinct hidden-state contextualization, and preference-adaptive policies through EFE minimization (Pazem et al., 2024).