
Free Energy Projective Simulation

Updated 25 January 2026
  • FEPS is a framework that models agents as interpretable, graph-based systems using active inference and the free energy principle for internal policy optimization.
  • The methodology employs a clone-structured episodic memory and random walk deliberation to build explicit world models and compute expected free energy for decision-making.
  • Empirical validations in tasks like grid navigation and timed-response paradigms demonstrate FEPS's potential in contextual learning and adaptive policy optimization.

Free Energy Projective Simulation (FEPS) is a framework that models agents as interpretable, graph-based systems performing active inference in partially observable environments. FEPS integrates the free energy principle (FEP) and active inference (AIF) with a structured memory architecture, enabling agents to derive optimal policies via internal reward mechanisms and explicit world models without relying on external scalar rewards or deep neural networks (Pazem et al., 2024).

1. World Model and Internal Representation

FEPS agents maintain an explicit world model rooted in a partially observable Markov decision process (POMDP). The model components are as follows:

  • Belief-state space: $B = \{b\}$, with $b$ denoting "clone clips".
  • Observation space: $S = \{s\}$ (sensory states).
  • Action space: $A = \{a\}$.
  • Transition function: $T$, representing $p(b'|b, a)$, models transitions between belief states given actions.
  • Emission (likelihood) function: $L$, representing $p(s|b)$, associates belief states with observations.
  • Internal reward: $R$, used for learning (see Section 3).

A key architectural feature is the clone-structured Episodic & Compositional Memory (ECM). Each observation $s \in S$ is associated with $N_\text{clone}$ "clone clips" $b \in B$, yielding $|B| = N_\text{clone} \cdot |S|$. The emission edges $b \to s$ are deterministic ($p(s|b) = \delta_{s, s(b)}$). Transition edges $b \xrightarrow{a} b'$ carry trainable weights $h_{b,b'}^{(a)}$ encoding $p(b'|b, a)$. Deliberation in FEPS corresponds to a random walk on the ECM, while policy selection is implemented via a bipartite graph from $B$ to $A$.
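A minimal sketch of such a clone-structured memory may help make the counting concrete. All names here (`observations`, `press`/`wait`, the uniform initialization) are illustrative assumptions, not the paper's implementation:

```python
from itertools import product

import numpy as np

# Toy ECM: 2 observations, 2 clones each -> |B| = N_clone * |S| = 4 belief states.
observations = ["light_on", "light_off"]
n_clone = 2

# Each clone clip deterministically emits its source observation: p(s|b) = delta.
beliefs = [(s, k) for s, k in product(observations, range(n_clone))]
emission = {b: b[0] for b in beliefs}

# Trainable transition weights h_{b,b'}^{(a)}, initialized to a baseline h^0.
actions = ["press", "wait"]
h0 = 1.0
h = {(b, a, b2): h0 for b in beliefs for a in actions for b2 in beliefs}

def transition_probs(b, a):
    """Normalize h-values along outgoing edges into p(b'|b, a)."""
    weights = np.array([h[(b, a, b2)] for b2 in beliefs])
    return weights / weights.sum()

p = transition_probs(beliefs[0], "press")  # uniform before any learning
```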

The full joint distribution across time $t$ is:

$$p(B_{0:t}, A_{0:t-1}, S_{0:t}) = p(B_0, S_0) \prod_{\tau=1}^{t} \pi(A_{\tau-1} \mid B_{\tau-1})\, p(S_\tau \mid B_\tau)\, p(B_\tau \mid B_{\tau-1}, A_{\tau-1})$$
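This factorization supports ancestral sampling of trajectories: draw an action from the policy, a next belief from the transition model, and read off the deterministic emission. A toy sketch with assumed two-state numbers (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy generative model: 2 belief states, 2 actions, deterministic emission.
# p_trans[a][b] is the row p(.|b, a); all numbers are illustrative.
p_trans = np.array([[[0.9, 0.1], [0.2, 0.8]],   # action 0
                    [[0.5, 0.5], [0.5, 0.5]]])  # action 1
obs_of = [0, 1]                                  # s(b): emission per belief
pi = np.array([[0.7, 0.3], [0.4, 0.6]])         # pi(a|b)

def sample_trajectory(b0, steps):
    """Ancestral sampling: a_t ~ pi(.|b_t), b_{t+1} ~ p(.|b_t, a_t), s = s(b)."""
    b, traj = b0, [(b0, obs_of[b0])]
    for _ in range(steps):
        a = rng.choice(2, p=pi[b])
        b = rng.choice(2, p=p_trans[a][b])
        traj.append((b, obs_of[b]))
    return traj

traj = sample_trajectory(0, 5)  # 6 (belief, observation) pairs
```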

2. Expected Free Energy and Policy Construction

Policy optimization in FEPS is regulated by minimizing expected free energy (EFE), in line with AIF:

  • For current belief $b_t$ and candidate action $a$, the one-step predictive model is $p(B_{t+1}, S_{t+1} \mid b_t, a) = p(b_{t+1} \mid b_t, a)\, p(s_{t+1} \mid b_{t+1})$.
  • The expected free energy for action $a$ is

$$G_{b_t}[a] = \mathbb{E}_{b', s' \sim p(\cdot \mid b_t, a)}\left[\log p(b' \mid b_t, a) - \log \operatorname{pref}(s', b' \mid b_t, a)\right] = -H[B_{t+1} \mid b_t, a] + \mathbb{E}_{b', s'}\left[S^\text{pref}(s', b' \mid b_t, a)\right]$$

with $S^\text{pref}(s', b' \mid \cdots) = -\log \operatorname{pref}(s', b' \mid \cdots)$ and $H$ the conditional entropy.

EFE decomposes as:

  • Epistemic value: Expected information gain, corresponding to entropy reduction.
  • Pragmatic value: Expected utility for matching the preference distribution $\operatorname{pref}$.

The policy is determined by a softmax over the negative EFE:

$$\pi(a \mid b_t) = \mathrm{softmax}(-\zeta\, G_{b_t}[a])$$

where $\zeta < 0$ promotes exploitation (EFE minimization), and $\zeta > 0$ can promote exploration.
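The EFE and policy formulas above can be sketched directly. With deterministic emission, the expectation reduces to a sum over successor beliefs; the toy distributions below are assumptions for illustration only:

```python
import numpy as np

def expected_free_energy(p_next, pref, eps=1e-12):
    """One-step EFE for a single action: E[log p(b'|b,a) - log pref(b')].
    p_next[b'] = p(b'|b,a); emission is deterministic, so s' is fixed by b'."""
    p_next = np.asarray(p_next, float)
    pref = np.asarray(pref, float)
    return float(np.sum(p_next * (np.log(p_next + eps) - np.log(pref + eps))))

def policy(G, zeta):
    """pi(a|b) = softmax(-zeta * G[a])."""
    z = -zeta * np.asarray(G, float)
    z -= z.max()  # numerical stability
    w = np.exp(z)
    return w / w.sum()

# Two candidate actions over three successor beliefs; prefer the third outcome.
pref = np.array([0.1, 0.1, 0.8])
G = [expected_free_energy([0.80, 0.10, 0.10], pref),   # poor match to pref
     expected_free_energy([0.05, 0.05, 0.90], pref)]   # close match to pref
pi = policy(G, zeta=-3.0)  # zeta = -3 follows the sign convention in the text
```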

3. Internal Rewards and Learning Dynamics

Standard RL paradigms rely on external scalar rewards $r_t$ provided by the environment. In contrast, FEPS agents rely solely on internal rewards driven by prediction accuracy:

  • Each transition edge $b \xrightarrow{a} b'$ is endowed with a confidence value $f_{b\to b'}$.
  • During prediction, as long as the predicted observation $\hat{s}_{t+1}$ matches the actual $s_{t+1}^{\text{env}}$, the confidence $f$ is incremented along the trajectory.
  • Upon the first prediction error, a reinforcement $R \cdot f$ is distributed to each implicated transition's $h$-value:

$$h^{\text{new}}_{b,b'} = h^{\text{old}}_{b,b'} - \gamma\left(h^{\text{old}}_{b,b'} - h^{0}_{b,b'}\right) + R \cdot f_{b\to b'}$$

with $\gamma$ the forgetting rate and $h^0$ the baseline initialization for each edge.
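The update rule is a one-liner in code. The parameter values below are illustrative, not taken from the paper:

```python
def update_h(h_old, h0, gamma, R, f):
    """h_new = h_old - gamma*(h_old - h0) + R*f: decay toward the baseline h0
    plus confidence-weighted reinforcement of a reliable edge."""
    return h_old - gamma * (h_old - h0) + R * f

h0, gamma, R = 1.0, 0.1, 2.0   # assumed hyperparameters
h = 1.5                        # current weight on edge b --a--> b'
f = 3                          # three consecutive correct predictions via this edge
h_new = update_h(h, h0, gamma, R, f)
# with f = 0 the rule reduces to pure forgetting: h relaxes toward h0
```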

4. FEPS Operational Algorithm

FEPS proceeds through episodic interaction and online model adaptation. The essential workflow is summarized as:

for each observation s:
    create N_clone clone-clips b with emission edge b → s
initialize all transition h-values to h^0 and confidences f = 0
initialize preference distribution pref

while not terminated:
    reset environment; receive s_env
    excite all b with s(b) = s_env  # initialize candidate set C

    while episode not finished:
        # Superposed belief-state estimation
        maintain candidate set C ⊆ B consistent with s_env

        # Compute EFE and sample policy
        for each b in C:
            compute G_b[a] for all a
            π(a|b) = softmax(-ζ G_b[a])
        mix π(a|b) across C to select a_t

        # Model-based prediction
        for each b in C:
            sample b' ~ p(b'|b, a_t)
            predict ŝ = s(b')
        take action a_t; observe new s_env

        # Confidence update
        if ŝ == s_env:
            f(b → b')++ for all edges in trajectory
            C ← clones whose prediction matched
        else:
            for each edge (b → b') with nonzero f:
                h ← h - γ(h - h^0) + R·f
            reset f(b → b') = 0; reinitialize C to all clones for s_env

        # Policy update
        for each b:
            h_{b→a} ← G_b[a]
            π(a|b) = softmax(-ζ G_b[a])

This workflow aligns mechanism design, learning, and planning via EFE minimization using only internal signals.

5. Techniques for Interpretability and Robustness

Several explicit strategies enable FEPS interpretability and facilitate robust credit assignment in the presence of partial observability:

  • Clone-Structured Representation: Each clone-clip inherits the semantics of its source observation, and over training, clones of the same $s$ diverge, encoding distinct hidden-state contexts.
  • Belief-State Superposition: The candidate set $C$ (clones compatible with the current $s_{\text{env}}$) models superposed beliefs; prediction-consistent narrowing localizes the agent's true latent state.
  • Long-Term Goals and Look-Ahead Preferences: The preference $\operatorname{pref}(s', b' \mid b)$ is factorized as $\operatorname{pref}(s')\operatorname{pref}(b' \mid b)$, where $\operatorname{pref}(s')$ targets specific observations and $\operatorname{pref}(b' \mid b)$ is dynamically propagated via a look-ahead akin to dynamic programming:

$$v_0(b) = \operatorname{pref}(s(b)), \qquad v_n(b') = \max\left\{ v_{n-1}(b'') \cdot \beta^{n-1} \cdot p(b'' \mid b') \cdot \pi(a \mid b') \right\}$$

  • Reducing Prediction Errors: Candidate set narrowing rapidly recovers the true belief, while confidence-based reinforcement selectively solidifies reliable transitions.
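The candidate-set narrowing described above can be sketched as a simple filter. The clone labels and prediction map are hypothetical, chosen only to show the mechanism:

```python
def narrow(candidates, predictions, s_env):
    """Keep only the clones whose predicted next observation matched reality.

    candidates: set of belief states compatible with the last observation
    predictions: dict mapping each candidate to its predicted observation
    Returns the surviving set, or None on a prediction error (empty survivors),
    signalling that C must be reset to all clones of the new observation.
    """
    survivors = {b for b in candidates if predictions[b] == s_env}
    return survivors if survivors else None

# Two clones of the same observation disagree about what comes next.
C = {"light_on/0", "light_on/1"}
preds = {"light_on/0": "food", "light_on/1": "light_off"}
C = narrow(C, preds, "food")
# only the clone that predicted correctly survives: the latent state is localized
```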

6. Empirical Validation in Behavioral Paradigms

FEPS was validated in two behavioral-biology–inspired RL settings:

A. Timed-Response Task (Skinner Box analog):

  • Environment: Hidden MDP featuring ambiguous observations (e.g., "light on, hungry" from two different latent states).
  • Metrics: Error-free trajectory length, variational free energy (VFE), and EFE evolution.
  • Results:
    • Agents differentiate clones for ambiguous contexts.
    • VFE undergoes sharp drops corresponding to elimination of impossible transitions and correct context separation.
    • Look-ahead preferences enable the acquisition of correct multi-step policies.

B. Partially Observable Grid Navigation:

  • Environment: $3 \times 3$ grid, hidden food goal, overlapping scent observations; agent actions: up/down/left/right; 3 clones per scent.
  • Metrics: Trajectory length, VFE, EFE, policy optimality, and median steps to goal.
  • Results:
    • For $\zeta = -3$ (task) and $\zeta = +1$ (wandering), the longest error-free trajectories were achieved in most trials.
    • Superposed belief tracking maximizes error-free exploration.
    • Post-training, clone-to-cell mapping recapitulates a cognitive map.
    • Preference reconfiguration reuses the trained world model for new goals.

7. Limitations and Prospects for Extension

Noted constraints and open areas for FEPS include:

  • Task specification is limited to sensory observations; direct hidden state preferences cannot be represented.
  • Scalability: Application to large state/action spaces may require hierarchical or function-approximate augmentations.
  • Exploration-exploitation: The softmax temperature $\zeta$ currently requires manual tuning; integration of intrinsic motivators (e.g., boredom, novelty) is pending.
  • Model expansion: Online cloning/structural growth and related mechanisms are not dynamically addressed.
  • Physical embodiment: Incorporation of FEPS and ECM update rules in real-world or neuromorphic agents remains future work.

FEPS thus establishes an interpretable, fully-internal-reward, graph-based instantiation of active inference, providing cognitive mapping, distinct hidden-state contextualization, and preference-adaptive policies through EFE minimization (Pazem et al., 2024).
