Cognition-Inspired Meta-Action Framework (CINEMA)

Updated 19 January 2026

CINEMA is a cognition-inspired framework that integrates metacognitive particles and structured meta-actions to decompose complex reasoning processes.
It applies a hierarchical scaffold derived from human metacognition to structure visual understanding tasks, yielding competitive benchmark performance.
Reinforcement learning refinement improves agent adaptability, and stochastic dynamical models make sense of agency and separation measurable constructs in cognitive systems.

The Cognition-Inspired Meta-Action Framework (CINEMA) operationalizes structured reasoning in both cognitive systems and multimodal AI agents by imposing a hierarchical scaffold informed by human metacognitive processes and statistical physics. CINEMA comprises formal dynamical models of metacognitive particles, emphasizing explicit belief layers, mental action capabilities, and measurable constructs such as sense of agency and separation. In visual understanding tasks, CINEMA decomposes reasoning into sequential meta-actions inspired by human cognitive workflow, yielding high performance in multi-image, multi-frame, and single-image benchmarks (Sandved-Smith et al., 2024; Yin et al., 12 Jan 2026).

1. Foundations: Metacognitive Particle Formalism

CINEMA is grounded in the formalism of metacognitive particles. A particle is defined as a system whose states $x = (\eta, b, \mu)$ partition into external states $\eta$, Markov blanket $b$, and internal states $\mu$, with conditional independence enforced as $P(\eta, \mu \mid b) = P(\eta \mid b) P(\mu \mid b)$. Cognitive particles parameterize an approximate posterior $Q_\mu(\eta) \approx P(\eta \mid b)$ via their internal states. Metacognitive particles split these internal states into first-order ($\mu^{(1)}$) and second-order ($\mu^{(2)}$) statistics, where

$\mu^{(1)} \mapsto Q_{\mu^{(1)}}(\eta) = P(\eta \mid b), \quad \mu^{(2)} \mapsto Q_{\mu^{(2)}}(\eta, a, \mu^{(1)}) = P(\eta, a, \mu^{(1)} \mid b)$

Thus, metacognitive particles instantiate beliefs about beliefs, enabling meta-reasoning over their own cognitive states (Sandved-Smith et al., 2024).

2. Meta-Action Formulation in Visual Reasoning

In multimodal AI settings, CINEMA decomposes multi-image reasoning into five discrete meta-actions: Global, Focus, Hint, Think, and Answer. At each step $t$, the agent's state is $s_t = (Q, I, a_1, \dots, a_{t-1})$, comprising the textual question $Q$, the image set $I$, and the prior actions. The next meta-action $a_t$ and its content $c_t$ are then drawn from a factorized policy

$\pi_\theta(a_t, c_t \mid s_t) = \pi_\theta(a_t \mid s_t) \cdot \pi_\theta(c_t \mid a_t, s_t)$

Meta-action definitions are as follows:

Meta-Action	Purpose	Output Token and Format
Global	Survey images/question for dependencies	<Global> g </Global>
Focus	Zoom to key clue(s) in image subset	<Focus idx=i> f_i </Focus>
Hint	Summarize possible distractors/errors	<Hint> h </Hint>
Think	Internal compositional reasoning	<Think> τ </Think>
Answer	Final concise answer	<Answer> y </Answer>

Structural constraints enforce only one Global and one Answer, with at least one Focus, Hint, or Think token (Yin et al., 12 Jan 2026).
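The structural constraints can be checked mechanically. The following is a minimal sketch, assuming trajectories are emitted as plain text using the tags from the table above; the regular expression, the function names, and the reading of "only one" as "exactly one" are illustrative assumptions, not part of the published framework.

```python
import re

# Tag names come from the meta-action table; the pattern itself is an assumption.
TAG_PATTERN = re.compile(
    r"<(Global|Focus|Hint|Think|Answer)\b[^>]*>(.*?)</\1>", re.DOTALL
)

def parse_meta_actions(text):
    """Return the ordered (action, content) pairs found in a trajectory string."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_PATTERN.finditer(text)]

def is_valid_trajectory(text):
    """Check: exactly one Global, exactly one Answer, and at least one
    Focus, Hint, or Think (hypothetical reading of the constraints)."""
    names = [name for name, _ in parse_meta_actions(text)]
    has_intermediate = any(n in ("Focus", "Hint", "Think") for n in names)
    return (
        names.count("Global") == 1
        and names.count("Answer") == 1
        and has_intermediate
    )

# Hypothetical trajectory used only for illustration.
sample = (
    "<Global> two charts share an x-axis </Global>"
    "<Focus idx=1> peak at 2019 </Focus>"
    "<Think> the peak precedes the drop </Think>"
    "<Answer> B </Answer>"
)
```

Here `is_valid_trajectory(sample)` accepts the trajectory, while a string containing only a Global and an Answer block is rejected because it lacks an intermediate step.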

3. Learning Dynamics: Retrieval-Based Tree Sampling and RL

CINEMA leverages a tree database of correct trajectories to bootstrap initial supervision (cold-start phase). For a new input, a student model proposes a trajectory, which is refined by a teacher (e.g., GPT-4o), and a distinct correct trajectory is retrieved using edit-distance or embedding similarity metrics. Each cold-start instance includes two correct meta-action sequences.
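The edit-distance variant of this retrieval step might look like the sketch below. It assumes that trajectories are compared as sequences of meta-action names, that the database can be treated as a flat list rather than a tree, and that "distinct" means the nearest stored trajectory with non-zero distance; none of these details are specified in the source, and `retrieve_distinct` is a hypothetical helper.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of meta-action names."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def retrieve_distinct(query, database):
    """Return the stored trajectory nearest to `query` that is not identical
    to it, or None if every stored trajectory matches exactly."""
    scored = [(edit_distance(query, t), t) for t in database]
    scored = [(d, t) for d, t in scored if d > 0]
    if not scored:
        return None
    return min(scored, key=lambda pair: pair[0])[1]

# Hypothetical database of correct meta-action sequences.
database = [
    ["Global", "Focus", "Answer"],
    ["Global", "Focus", "Think", "Answer"],
    ["Global", "Hint", "Think", "Focus", "Answer"],
]
second = retrieve_distinct(["Global", "Focus", "Answer"], database)
```

Pairing the refined trajectory with `second` would give the two correct sequences per cold-start instance described above; an embedding-similarity retriever could replace `edit_distance` without changing the surrounding logic.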

Post-supervision, CINEMA applies pure RL with two phases:

Diversity-Preserving Strategy (DPS): Encourages exploration by penalizing trajectory homogeneity. The per-sample reward combines accuracy, format validity, and a penalty proportional to majority-pattern dominance:

$R = 0.5 \cdot [R_{\text{acc}} \cdot (R_{\text{acc}} - \frac{N-1}{G-1} \cdot 0.1)] + 0.5 \cdot R_{\text{fmt}}$
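The reward can be evaluated directly from the formula. The sketch below assumes that $G$ is the number of rollouts sampled per prompt and $N$ is the number of those rollouts sharing the most common meta-action pattern; the source does not define either symbol, so this reading, along with the function names, is an assumption.

```python
from collections import Counter

def count_majority(patterns):
    """Size of the largest group of identical meta-action patterns (assumed N)."""
    return Counter(patterns).most_common(1)[0][1]

def dps_reward(r_acc, r_fmt, n_majority, group_size, penalty=0.1):
    """R = 0.5 * [R_acc * (R_acc - (N-1)/(G-1) * 0.1)] + 0.5 * R_fmt."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    dominance = (n_majority - 1) / (group_size - 1)
    return 0.5 * (r_acc * (r_acc - dominance * penalty)) + 0.5 * r_fmt

# Hypothetical group of four rollouts, three of which share one pattern.
patterns = [("Global", "Focus", "Answer")] * 3 + [("Global", "Think", "Answer")]
n = count_majority(patterns)
reward = dps_reward(r_acc=1.0, r_fmt=1.0, n_majority=n, group_size=len(patterns))
```

Under this reading, a correct and well-formatted answer earns the full reward of 1.0 only when no pattern repeats ($N = 1$), and drops to 0.95 when every rollout follows the same pattern ($N = G$). An incorrect answer ($R_{\text{acc}} = 0$) is unaffected by the penalty.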

Annealed Exploitation (DAPO): Refines the policy through dynamic clipping and group-relative advantages, progressively annealing the clipping range to favor deterministic responses. At token $t$ in rollout $i$:

$r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q,o_{i,<t})}$

$\bar{A}_{i,t} = \frac{R_i - \text{mean}_j R_j}{\text{std}_j R_j}$

Objective:

$J_{\text{DAPO}}(\theta) = \mathbb{E}_{q, \{o_i\}} \Bigg[ \frac{1}{\sum_i |o_i|} \sum_i \sum_t \min(r_{i,t}(\theta) \bar{A}_{i,t}, \text{clip}(r_{i,t}(\theta), 1-\epsilon_{\text{low}}, 1+\epsilon_{\text{high}})\bar{A}_{i,t}) \Bigg]$
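As a numerical illustration of the three expressions above, the following sketch evaluates the objective for a single group of rollouts. The clipping values `eps_low=0.2` and `eps_high=0.28` are placeholder assumptions (the source gives no values and says the range is annealed), the population standard deviation is an arbitrary choice, and the zero-variance guard is added here for safety rather than taken from the source.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """(R_i - mean_j R_j) / std_j R_j, one value per rollout in the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All rollouts scored the same: no learning signal (assumed handling).
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

def clipped_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """min(r * A, clip(r, 1 - eps_low, 1 + eps_high) * A) for one token."""
    clipped = min(max(ratio, 1 - eps_low), 1 + eps_high)
    return min(ratio * advantage, clipped * advantage)

def dapo_objective(ratios, advantages, eps_low=0.2, eps_high=0.28):
    """Token-level average over one group; ratios[i][t] is r_{i,t} and
    advantages[i] is shared by every token of rollout i."""
    total_tokens = sum(len(row) for row in ratios)
    total = sum(
        clipped_term(r, a, eps_low, eps_high)
        for row, a in zip(ratios, advantages)
        for r in row
    )
    return total / total_tokens
```

Normalizing by the total token count $\sum_i |o_i|$ rather than per rollout means longer responses contribute proportionally more terms. The outer expectation over prompts would be approximated by averaging this quantity over a batch.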

The RL set comprises 58,000 instances reserved for fine-tuning where teacher models fail to produce correct outputs (Yin et al., 12 Jan 2026).

4. Stochastic Dynamical Models and Free Energy Principle

CINEMA's metacognitive architecture is formalized via stochastic differential equations:

$\dot{x}(t) = f(x(t)) + w(t)$

with partitioned flows for external ($\eta$), sensory ($s$), active ($a$), and internal ($\mu$) states. State and action updates follow generalized Onsager–Machlup dynamics:

$\dot{\mu} = D\mu - \nabla_\mu \mathcal{F}(s, b, \mu) \approx -\nabla_\mu \mathcal{F}$

$\dot{a} = -\nabla_a \mathcal{F}(s, b, \mu)$

Free energy at each level is

$\mathcal{F}(s, b, \mu) = \mathbb{E}_{Q_\mu}[\ln Q_\mu(\cdot) - \ln P(\cdot, b)]$

In nested metacognitive systems, higher-level active states ($a^{(2)}$) modulate first-order beliefs ($\mu^{(1)}$) via direct gradient descent on free energy (Sandved-Smith et al., 2024).
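To make the gradient flow $\dot{\mu} \approx -\nabla_\mu \mathcal{F}$ concrete, the sketch below integrates it with Euler steps for a one-dimensional toy case. The quadratic free energy (a Gaussian likelihood around the sensory state $s$ plus a Gaussian prior, constants dropped) is an assumption chosen for illustration and is not the generative model used in the source.

```python
def free_energy_gradient(mu, s, prior_mean, var_s, var_p):
    """Gradient of the assumed toy free energy
    F(mu) = (s - mu)^2 / (2 var_s) + (mu - prior_mean)^2 / (2 var_p)."""
    return -(s - mu) / var_s + (mu - prior_mean) / var_p

def descend(mu0, s, prior_mean=0.0, var_s=1.0, var_p=1.0, dt=0.1, steps=200):
    """Euler integration of mu_dot = -dF/dmu starting from mu0."""
    mu = mu0
    for _ in range(steps):
        mu -= dt * free_energy_gradient(mu, s, prior_mean, var_s, var_p)
    return mu

# With equal variances the flow settles midway between observation and prior.
belief = descend(mu0=5.0, s=2.0)
```

In this toy setting the fixed point is the precision-weighted average of the observation and the prior mean, which is the exact Gaussian posterior mean. Richer generative models would change the gradient but not the structure of the update.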

5. Agency, Separation, and Measurement Principles

The sense of agency in CINEMA arises only in active metacognitive particles. These systems construct a joint meta-belief over their first-order internal states and actions:

$Q_{\mu^{(2)}}(\mu^{(1)}, a^{(1)}) = P(\mu^{(1)} \mid s^{(2)}, a^{(2)}) \times P(a^{(1)} \mid s^{(1)}) \approx P(\mu^{(1)}, a^{(1)} \mid s^{(2)}, a^{(2)})$

Agency strength is quantified by the KL-divergence between the joint distribution and the product of marginals:

$D_{\text{KL}} [ Q_{\mu^{(2)}}(\mu^{(1)}, a^{(1)}) \| Q_{\mu^{(2)}}(\mu^{(1)}) Q_{\mu^{(2)}}(a^{(1)}) ]$

If this divergence vanishes, internal states and actions are believed independent—a "no sense of agency" regime.
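For discrete states, this divergence is the mutual information between $\mu^{(1)}$ and $a^{(1)}$ under the meta-belief. A minimal sketch, assuming the joint is available as a table of probabilities (the function name and the two example tables are illustrative):

```python
import math

def agency_kl(joint):
    """KL[Q(mu, a) || Q(mu) Q(a)] in nats, with joint[m][a] = Q(mu=m, a=a)."""
    q_mu = [sum(row) for row in joint]        # marginal over actions
    q_a = [sum(col) for col in zip(*joint)]   # marginal over internal states
    kl = 0.0
    for m, row in enumerate(joint):
        for a, p in enumerate(row):
            if p > 0:  # zero-probability cells contribute nothing
                kl += p * math.log(p / (q_mu[m] * q_a[a]))
    return kl

# Hypothetical meta-beliefs over two internal states and two actions.
independent = [[0.25, 0.25], [0.25, 0.25]]
coupled = [[0.5, 0.0], [0.0, 0.5]]
```

`agency_kl(independent)` is zero, matching the regime with no sense of agency, while `agency_kl(coupled)` equals $\ln 2$, the maximum for two binary variables.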

Separation (duality of 'I' and environment) is preserved by restricting belief updating to information passing through the Markov blanket $b$, ensuring conditional independence: the factorization $P(\eta, \mu \mid b) = P(\eta \mid b) P(\mu \mid b)$ enforces $\eta \perp \mu \mid b$ (Sandved-Smith et al., 2024).
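Whether a given discrete joint distribution respects the blanket condition $\eta \perp \mu \mid b$ can be tested numerically. The sketch below assumes the joint is stored as a dictionary keyed by `(eta, b, mu)`; the helper names and the example distributions are hypothetical.

```python
from itertools import product

def is_conditionally_independent(joint, tol=1e-9):
    """Check P(eta, mu | b) = P(eta | b) P(mu | b) for every state triple,
    using the equivalent form P(eta, b, mu) P(b) = P(eta, b) P(mu, b)."""
    p_b, p_eta_b, p_mu_b = {}, {}, {}
    for (eta, b, mu), p in joint.items():
        p_b[b] = p_b.get(b, 0.0) + p
        p_eta_b[(eta, b)] = p_eta_b.get((eta, b), 0.0) + p
        p_mu_b[(mu, b)] = p_mu_b.get((mu, b), 0.0) + p
    etas = {k[0] for k in joint}
    mus = {k[2] for k in joint}
    for eta, b, mu in product(etas, p_b, mus):
        lhs = joint.get((eta, b, mu), 0.0) * p_b[b]
        rhs = p_eta_b.get((eta, b), 0.0) * p_mu_b.get((mu, b), 0.0)
        if abs(lhs - rhs) > tol:
            return False
    return True

# External and internal states each depend only on the blanket state b.
p_eta1 = {0: 0.2, 1: 0.9}  # P(eta = 1 | b)
p_mu1 = {0: 0.1, 1: 0.7}   # P(mu = 1 | b)
separated = {
    (eta, b, mu): 0.5
    * (p_eta1[b] if eta else 1 - p_eta1[b])
    * (p_mu1[b] if mu else 1 - p_mu1[b])
    for eta, b, mu in product((0, 1), repeat=3)
}
# Internal state copies the external state directly, bypassing the blanket.
leaky = {(0, 0, 0): 0.25, (1, 0, 1): 0.25, (0, 1, 0): 0.25, (1, 1, 1): 0.25}
```

The `separated` distribution passes the check, whereas `leaky` fails it because $\mu$ carries information about $\eta$ that does not pass through $b$.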

6. Empirical Benchmarks and Generalizability

CINEMA has been empirically validated on multiple visual-language benchmarks, outperforming closed-source baselines such as GPT-4o and specialized video models. Example results:

Model	MUIR (%)	MVMath (%)	VideoR1 (%)
Qwen2.5VL	57.9	26.7	62.6
GPT-4o	68.0	32.1	–
CINEMA	71.6	36.9	66.5

On single-image math and reasoning datasets, CINEMA achieves results comparable to or better than those of both multimodal and specialized models.

Pass@K evaluation demonstrates improved robustness when selecting the best of multiple samples, attributable to the two-stage RL protocol ("DPS + annealing") (Yin et al., 12 Jan 2026).
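The source does not state which Pass@K estimator was used. A common choice is the unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$, computed from $n$ samples of which $c$ are correct; the sketch below implements that estimator as an assumption, not as the paper's evaluation code.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without replacement
    from n samples containing c correct ones, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with four samples of which one is correct, `pass_at_k(4, 1, 2)` gives 0.5. Per-problem values are typically averaged over the benchmark.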

The five-action schema generalizes across multi-image, video, and single-image tasks without structural changes, though finer meta-action granularity and tool integration are plausible extensions. Limitations include dependence on powerful teacher models, large trajectory storage requirements, sensitivity to RL hyperparameters, reliance on multiple-choice benchmarks, and limited coverage of reasoning primitives (Yin et al., 12 Jan 2026).

7. Implementation Workflow

CINEMA can be instantiated by specifying a system’s SDE ($\dot{x} = f(x) + w$), partitioning its states under Markov blanket constraints, and then constructing generative models for the belief layers. Gradient-descent flows are implemented for each cognitive and metacognitive level. The distinction between passive and active metacognitive particles is enforced by blanket nesting and direct action modulation of first-order beliefs. For empirical evaluation, sense of agency and separation can be measured via mutual information or KL-divergence metrics. Model complexity is tuned against predictive accuracy by the free energy principle's complexity–accuracy trade-off (Sandved-Smith et al., 2024).
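The first step of this workflow, simulating $\dot{x} = f(x) + w$, can be sketched with the Euler–Maruyama scheme. The example below assumes a scalar state and Gaussian white noise of fixed amplitude; the drift function, step size, and seed are placeholders, and the Markov blanket partition is not modeled here.

```python
import math
import random

def euler_maruyama(f, x0, dt, steps, noise_std=1.0, seed=0):
    """Simulate x_dot = f(x) + w for a scalar state, returning the full path.
    Each step adds drift f(x) * dt and noise scaled by sqrt(dt)."""
    rng = random.Random(seed)  # seeded for reproducibility
    x = x0
    path = [x]
    for _ in range(steps):
        x = x + f(x) * dt + noise_std * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path

# Hypothetical mean-reverting drift, f(x) = -x.
noisy_path = euler_maruyama(lambda x: -x, x0=1.0, dt=0.1, steps=50, noise_std=0.2)
```

A full instantiation would replace the scalar `x` with the partitioned state $(\eta, s, a, \mu)$ and restrict each component's drift to the variables allowed by the blanket structure.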

In practice, CINEMA integrates high-quality meta-action trajectories, structured RL, and formal statistical mechanics to furnish a unified framework for cognitive and metacognitive reasoning in both biological and artificial systems.

References

Sandved-Smith et al. (2024). Metacognitive particles, mental action and the sense of agency.

Yin et al. (2026). Mimic Human Cognition, Master Multi-Image Reasoning: A Meta-Action Framework for Enhanced Visual Understanding.
