Context Formation Mechanisms

Updated 4 July 2026

Context formation is the process of converting heterogeneous raw signals into structured, task-relevant conditions that drive decision-making.
It employs diverse methodologies such as probabilistic renormalization, hierarchical belief updates, and encoder-based extraction to optimize performance across domains.
This mechanism enhances system alignment, efficiency, and interpretability in fields ranging from robotics and cognition to compression, retrieval, and agentic LLM systems.

Searching arXiv for recent and relevant papers on “context formation” and adjacent usages across robotics, cognition, compression, retrieval, and LLM systems. Context formation is the process by which a system constructs, organizes, and uses context so that behavior is conditioned on relevant structure rather than on raw inputs alone. Across contemporary research, the term denotes different but formally comparable operations: in robotics, it is the mapping from image and instruction to object-conditioned robot coordinates; in cognitive science, it is the action of contexts on concept states; in cognitive hierarchies, it is top-down prediction from higher to lower belief states; in compression, it is the construction of probabilistic models from neighboring patterns; in retrieval-augmented generation, it is the constrained assembly of coherent span sets under a token budget; and in agentic LLM systems, it is the explicit specification or deterministic assembly of evolving prompt state (Venkatesh et al., 2024). This suggests that “context formation” is best understood not as a single algorithm, but as a family of mechanisms for turning heterogeneous signals into structured conditions for inference, control, or communication (Khurshid et al., 15 Jan 2026).

1. Context formation as a general systems concept

In several lines of work, context is defined as information that changes how states are interpreted or how actions are selected. In the SCOP theory of concepts, a concept is represented by a tuple $(\Sigma, {\cal M}, {\cal L}, \mu, \nu)$ , where ${\cal M}$ is the set of contexts and $\mu$ is the state–context–state transition probability; context is therefore an active operator that changes which concept state is evoked (Veloz et al., 2013). In cognitive hierarchies, context is “higher level information that helps to predict belief states at lower levels,” operationalized by contextual elements $C$ , context enrichment functions $\varrho_{j,i}$ , and a prediction update operator $\gamma$ (Hengst et al., 2018). In “Context Engineering 2.0,” context is formalized as

$C = \bigcup_{e \in \mathcal{E}_{\text{rel}}} \text{Char}(e),$

that is, the union of characterization information for all entities relevant to a user–application interaction (Hua et al., 30 Oct 2025).

These formulations differ in ontology but share a common structure. Each introduces a representational layer that is neither raw observation nor final action: SCOP uses renormalized state distributions, cognitive hierarchies use belief-state modifiers, and context engineering uses an explicit transformation from raw context and task to a context-processing function $\text{CE}:(C,\mathcal{T})\to f_{\text{context}}$ (Hua et al., 30 Oct 2025). A plausible implication is that context formation is the intermediate operation that determines what variables count as salient, available, and actionable within a task-specific regime.

Research on social simulation makes this distinction explicit by separating “context formation” from “context navigation.” There, context formation specifies the experimental design so that the model’s internal task representation matches the intended human context, while navigation guides reasoning within that representation (Kong et al., 4 Jan 2026). The same separation appears, in different vocabulary, in agentic LLM systems, where ACDL describes the policy that maps evolving internal state, environment state, and previous responses into role-typed prompt messages over time (Pelc et al., 3 May 2026).

2. Formal mechanisms of context formation

Several papers provide explicit mathematical machinery for how context is formed. In SCOP, empirical typicality ratings $L(p,e)$ are normalized into transition probabilities

$\mu(p,e,\hat p)=\frac{L(p,e)}{T(e)}, \qquad T(e)=\sum_{p\in\Sigma}L(p,e),$

and then filtered by a threshold ${\cal M}$ 0 to produce a renormalized distribution ${\cal M}$ 1 over surviving states (Veloz et al., 2013). Context relevance is then defined as

${\cal M}$ 2

with robustness ${\cal M}$ 3 (Veloz et al., 2013). In that framework, context formation is the selection and renormalization of concept states under a context-dependent probability kernel.

In cognitive hierarchies, belief-state update is split into observation update and prediction update. If ${\cal M}$ 4 is a node, then bottom-up sensing applies

${\cal M}$ 5

while top-down context applies

${\cal M}$ 6

where ${\cal M}$ 7 is aggregated from parent beliefs through context enrichment functions ${\cal M}$ 8 (Hengst et al., 2018). The formal locus of context formation is therefore the mapping from higher-level belief states ${\cal M}$ 9 to child-relevant contextual elements via $\mu$ 0, and then to revised child beliefs through $\mu$ 1.

In representation learning, the contexture theory casts context formation as learning from the association between an input $\mu$ 2 and a context variable $\mu$ 3. The central operator is

$\mu$ 4

with singular functions that define the “contexture” of the $\mu$ 5– $\mu$ 6 association (Zhai, 28 Apr 2025). An encoder learns the contexture when its feature span equals the span of the top singular functions of this operator. The paper’s central claim is that if an encoder captures the maximum information of this association, then it is optimal on the class of tasks compatible with the context (Zhai, 28 Apr 2025). This suggests a broad abstraction: context formation can be viewed as constructing operators or summaries that preserve the task-relevant structure of an association.

3. Context formation in embodied and cognitive systems

In multi-robot robotics, ZeroCAP formulates context-aware pattern formation as a mapping

$\mu$ 7

where $\mu$ 8 is an image of the environment, $\mu$ 9 is a natural-language instruction, and $C$ 0 is the set of target robot coordinates (Venkatesh et al., 2024). The environment contains an object of interest $C$ 1, which may be implied rather than explicitly named, and the object’s shape is represented after segmentation as a connected graph $C$ 2 (Venkatesh et al., 2024). The pipeline identifies the object and generates a Pattern Formation Instruction $C$ 3 with a VLM, segments the object into a mask $C$ 4, converts the mask into a shape graph, and then asks an LLM to generate coordinates from $C$ 5, where $C$ 6 is the robot team (Venkatesh et al., 2024).

The notion of context is explicitly multi-layered in ZeroCAP: spatial context is the graph $C$ 7, semantic context comes from interpreting the instruction and image jointly, and task context is encoded in $C$ 8, such as “around,” “inside,” or “caging” (Venkatesh et al., 2024). The system defines three purposeful pattern categories—general pattern formation, infill pattern formation, and caging pattern formation—and the LLM reasons over the shape graph rather than raw pixels (Venkatesh et al., 2024). The ablation that “Edges only” yields the best success rate with minimal token usage suggests that symbolic edge-level structure can be a particularly effective context representation for language-conditioned geometric reasoning (Venkatesh et al., 2024).

In cognitive modeling, context formation appears as top-down disambiguation. In the letter/word hierarchy example, the word-level node forms a belief over “THE” versus “CAT,” then sends context to the ambiguous middle-letter node via a context enrichment function, so that the lower-level belief can be updated toward the contextually appropriate interpretation (Hengst et al., 2018). In the visual servoing example, a higher-level physics model predicts future object position and sends that prediction as context to a lower visual node, reducing average tracking error from approximately $C$ 9 to $\varrho_{j,i}$ 0 (Hengst et al., 2018). In the rigid-object pose-tracking example, a higher-level spatial node uses a simulated depth camera to generate object-wise predicted edge sets $\varrho_{j,i}$ 1, which become perceptual context for lower-level edge matching and PnP estimation (Hengst et al., 2018). Across these examples, context formation is not passive annotation; it is the generation of predictive structure that changes the lower-level inference problem.

4. Context formation in coding, compression, and representation learning

In lossless screen-content compression, context formation is the construction of a probabilistic model for the next symbol from already coded data. The original soft context formation coder stores many previously seen local patterns, finds a set of similar patterns at encoding time, and merges their histograms to form a context “softly” based on similarity (Och et al., 26 Aug 2025). In the RGB 4:4:4 version, the context pattern for the current pixel $\varrho_{j,i}$ 2 is $\varrho_{j,i}$ 3, and the probability of a color triplet $\varrho_{j,i}$ 4 is approximated as

$\varrho_{j,i}$ 5

where $\varrho_{j,i}$ 6 is the merged count and $\varrho_{j,i}$ 7 is the total count (Och et al., 26 Aug 2025). The key contrast with hard context modeling is that several similar patterns contribute to the merged histogram, rather than a single discrete context identifier.

The 4:2:0 extension of soft context formation adds two mechanisms that are explicitly context-forming. First, it codes $\varrho_{j,i}$ 8 first and then codes chroma pairs $\varrho_{j,i}$ 9 jointly, justified by normalized mutual information showing that $\gamma$ 0 and $\gamma$ 1 remain highly correlated for screen content (Och et al., 26 Aug 2025). Second, it introduces luma-guided modified MAP prediction and Chroma Range Coding, where feasible chroma ranges are conditioned on quantized luma and spatial block, thereby restricting the effective alphabet for Stage 1 histograms, Stage 2 palettes, and Stage 3 residuals (Och et al., 26 Aug 2025). The result is a multi-dimensional context consisting of spatial pattern similarity, luma-guided reinforcement, and luma-based feasibility constraints. Averaged over a large screen-content image dataset, the proposed method outperforms HEVC-SCC, with HEVC-SCC needing 5.66% more bitrate (Och et al., 26 Aug 2025).

A related VVC hybrid shows context formation operating as a region-selective coding tool. There, synthetic image parts are coded losslessly using soft context formation, while the rest are coded with VVC, and the decoded VVC layer is used to initialize edge pixels and palette statistics for SCF (Och et al., 2023). The proposed system achieves Bjontegaard-Delta-rate gains of 4.98% compared to VVC on the evaluated datasets (Och et al., 2023). In both screen-coding papers, context formation denotes the adaptive construction of conditional symbol distributions from local structure and cross-layer information.

In representation learning, contexture theory reframes pretraining objectives as context formation around a context variable $\gamma$ 2. Supervised learning uses labels as $\gamma$ 3, self-supervised learning uses augmentations or masked views as $\gamma$ 4, and generative modeling uses multi-step contexts $\gamma$ 5 (Zhai, 28 Apr 2025). The theory shows that many pretraining objectives can learn the contexture, and introduces two general objectives, SVME and KISE, for learning it (Zhai, 28 Apr 2025). It also shows how to mix multiple contexts together through convolution, convex combination, or concatenation, described as “an effortless way to create better contexts from existing ones” (Zhai, 28 Apr 2025). This suggests a unifying view in which context formation is the design of the auxiliary variable or operator that governs which downstream tasks are linearly accessible from a representation.

5. Context formation in retrieval, prompt design, and agent systems

In enterprise retrieval-augmented generation, context formation is treated as a constrained selection problem rather than ranking followed by top- $\gamma$ 6 concatenation. The “context bubble” framework represents each chunk as $\gamma$ 7, where $\gamma$ 8 is text, $\gamma$ 9 is structural label, and $C = \bigcup_{e \in \mathcal{E}_{\text{rel}}} \text{Char}(e),$ 0 is token count, and constructs a bubble $C = \bigcup_{e \in \mathcal{E}_{\text{rel}}} \text{Char}(e),$ 1 under a global budget $C = \bigcup_{e \in \mathcal{E}_{\text{rel}}} \text{Char}(e),$ 2, per-section budget fractions $C = \bigcup_{e \in \mathcal{E}_{\text{rel}}} \text{Char}(e),$ 3, and a redundancy gate based on lexical overlap (Khurshid et al., 15 Jan 2026). Scoring is structure-informed: $C = \bigcup_{e \in \mathcal{E}_{\text{rel}}} \text{Char}(e),$ 4 where the prior adds section-specific boosts and keyword boosts (Khurshid et al., 15 Jan 2026). Selection then enforces token budget, section budget, and overlap threshold constraints (Khurshid et al., 15 Jan 2026). Under a fixed 800-token budget in the representative “scope of work” query, the full context bubble uses about 218 tokens on average, covers three sections, and achieves the lowest average overlap, whereas flat top- $C = \bigcup_{e \in \mathcal{E}_{\text{rel}}} \text{Char}(e),$ 5 uses about 780 tokens and covers only one section (Khurshid et al., 15 Jan 2026). In this setting, context formation is the explicit shaping of the evidence set to preserve structure, facet coverage, and auditability.

“Context Engineering 2.0” generalizes this into a lifecycle view. Context engineering is defined as

$C = \bigcup_{e \in \mathcal{E}_{\text{rel}}} \text{Char}(e),$ 6

where $C = \bigcup_{e \in \mathcal{E}_{\text{rel}}} \text{Char}(e),$ 7 is a composition of operations for collection, storage, representation, multimodal handling, self-baking, selection, sharing, and dynamic adaptation (Hua et al., 30 Oct 2025). The paper organizes these into collection and storage, management, and usage, making context formation the composite of acquisition, organization, abstraction, maintenance, and selection (Hua et al., 30 Oct 2025). Two design principles are emphasized: the Minimal Sufficiency Principle, according to which “the context value lies in sufficiency, not volume,” and the Semantic Continuity Principle, according to which “the purpose of context is to maintain continuity of meaning, rather than merely continuity of data” (Hua et al., 30 Oct 2025).

A more formal prompt-level treatment appears in ACDL, where context formation is the design logic that maps internal state, environment state, and prior responses into an ordered sequence of role messages $C = \bigcup_{e \in \mathcal{E}_{\text{rel}}} \text{Char}(e),$ 8 for each LLM call (Pelc et al., 3 May 2026). ACDL supports time-indexed references, loops, conditionals, fragments, and nested substeps, so that the evolution of prompt state can be described precisely rather than via prose (Pelc et al., 3 May 2026). The MINT study described there shows that modest structural variations in context—such as whether tool responses appear as tool versus user messages, or whether reasoning traces are replayed—produce measurable performance differences of up to about five percentage points (Pelc et al., 3 May 2026). This directly supports the view that context formation is a first-class design variable in agentic systems.

An even stronger systems-level version appears in the “Context” architecture of the Magarshak stack, where interaction context for a goal stream and session is assembled as

$C = \bigcup_{e \in \mathcal{E}_{\text{rel}}} \text{Char}(e),$ 9

with permanent, session, cold, and dynamic blocks (Magarshak, 21 Apr 2026). Because these blocks are deterministic pure functions of graph state and because of Groker byte-identity, the stable prefix is byte-identical across turns between semantic changes, enabling near-100% KV-cache reuse (Magarshak, 21 Apr 2026). The Context Stability Theorem bounds expected per-turn LM input cost as

$\text{CE}:(C,\mathcal{T})\to f_{\text{context}}$ 0

where stable blocks cost 10% of full price by prompt caching (Magarshak, 21 Apr 2026). Here context formation becomes a deterministic systems primitive with formal cost guarantees.

Context formation also governs how communication becomes more efficient over repeated interaction. In repeated reference games, conversational context is the fixed image set together with the history $\text{CE}:(C,\mathcal{T})\to f_{\text{context}}$ 1 of prior trials, each containing target image, speaker utterance, and listener guess (Vaduguru et al., 28 Oct 2025). The paper trains large multimodal models by simulating reference games, sampling candidate utterances, and constructing pairwise preferences that jointly reward success and penalize message length (Vaduguru et al., 28 Oct 2025). Over the course of interaction, the resulting speakers reduce message length by up to 41% while increasing success by 15%, and human listeners respond faster to the convention-forming model (Vaduguru et al., 28 Oct 2025). The ablation shows that training on success alone yields stable but verbose conventions, while cost alone yields short but uninterpretable utterances; both are necessary for convention formation (Vaduguru et al., 28 Oct 2025). In this sense, context formation is the accumulation of shared conversational history into an implicit codebook.

In LLM social simulations, the two-stage framework of context formation and navigation is proposed precisely because baseline prompts often leave the model with the wrong task representation in complex experimental settings (Kong et al., 4 Jan 2026). For the sequential purchasing game, context formation alone changes the sign of the key Q50 queue effect but is insufficient to recover the attenuated wait-time response; only context formation plus navigation reproduces both hypotheses (Kong et al., 4 Jan 2026). For the demand-estimation task, context formation alone turns an inverted-U demand curve into a downward-sloping one, because the model now understands that price is systematically manipulated rather than suspicious at zero (Kong et al., 4 Jan 2026). Across GPT-4o, GPT-5, Claude-4.0-Sonnet-Thinking, and DeepSeek-R1, the paper finds that complex decision environments require both stages to achieve behavioral alignment, whereas the simpler demand-estimation task requires only context formation (Kong et al., 4 Jan 2026).

These results support a general distinction. When the primary failure mode is misrepresentation of the task, context formation can be sufficient; when the task requires structured reasoning over beliefs, additional navigation is needed. A plausible implication is that many failures attributed to “reasoning” may instead be failures of task-context specification.

7. Limits, controversies, and recurring design trade-offs

Across domains, context formation faces recurring constraints. ZeroCAP explicitly notes static patterns only, centralized control, limited feedback, and no explicit safety or collision avoidance (Venkatesh et al., 2024). Cognitive hierarchies rely on hand-specified models and DAG structures, without learning of context functions or recurrent lateral dynamics (Hengst et al., 2018). Contexture theory emphasizes that increasing model size alone yields diminishing returns and that further progress requires better contexts, but it also notes that when the association between $\text{CE}:(C,\mathcal{T})\to f_{\text{context}}$ 2 and $\text{CE}:(C,\mathcal{T})\to f_{\text{context}}$ 3 becomes too strong, context complexity rises and sample complexity worsens (Zhai, 28 Apr 2025). Enterprise context bubbles depend on hand-designed structural priors and lexical overlap thresholds, even though the framework is presented as deterministic and auditable (Khurshid et al., 15 Jan 2026).

A recurrent controversy concerns explicit versus implicit context. SCOP and cognitive hierarchies model context explicitly as operators or contextual elements (Veloz et al., 2013). Agentic prompt languages and enterprise RAG frameworks also externalize context into inspectable structures (Pelc et al., 3 May 2026). By contrast, task-vector work in in-context learning studies internal context as hidden-state task vectors extracted from activation space; these task vectors can emerge naturally, but may be weak or non-local, and TVP-loss is introduced to force strong localized task representations (Yang et al., 16 Jan 2025). That work shows that task vectors are formed only in certain prompt formats and often at middle or later layers, with TVP performance approaching ICL performance when trained appropriately (Yang et al., 16 Jan 2025). This suggests an important methodological divide: some research externalizes context into symbolic or architectural structures, while other research studies context as latent state inside a transformer.

Another recurring trade-off is between richness and efficiency. The context bubble paper argues that top- $\text{CE}:(C,\mathcal{T})\to f_{\text{context}}$ 4 retrieval causes redundancy and fragmentation, so context should be compact, diverse, and structure-aware (Khurshid et al., 15 Jan 2026). “Context Engineering 2.0” formulates this as sufficiency rather than volume (Hua et al., 30 Oct 2025). Soft context formation in compression similarly prefers merged histograms over exhaustive hard partitions, and the 4:2:0 extension uses luma-guided range restriction to shrink the effective alphabet (Och et al., 26 Aug 2025). These convergences suggest that effective context formation is often a matter of preserving the right invariants while aggressively discarding irrelevant support.

Taken together, the literature portrays context formation as a foundational operation spanning cognition, robotics, compression, retrieval, and LLM systems. It creates the intermediate structures—belief modifiers, symbolic graphs, probabilistic neighborhoods, compatible context variables, coherent evidence bundles, or deterministic prompt blocks—through which raw data becomes decision-relevant. What varies across domains is not whether context must be formed, but what representation is formed, how strongly it constrains downstream behavior, and which guarantees are sought: alignment, efficiency, robustness, interpretability, or control.