Perceptio: Formalizing Perception in Diverse Fields

Updated 4 July 2026

Perceptio is a multidisciplinary concept that formalizes structured perception as an explicit intermediate layer between raw inputs and decision-making, with applications in AI, economics, and semiotics.
In its most prominent usage, Perceptio is a vision–language model that integrates explicit segmentation and depth tokens into its autoregressive sequence, achieving significant accuracy gains on spatial benchmarks.
Explicit perceptual tokens enable targeted supervision, bridging abstract model inference with concrete visual and cognitive processes to improve overall model robustness.

Perceptio denotes several distinct research constructs across contemporary arXiv literature, unified by an interest in how perceptual structure is represented, externalized, or made operational. In multimodal machine learning, “Perceptio” names a perception-enhanced large vision–language model (LVLM) that introduces explicit semantic segmentation and depth tokens into the autoregressive sequence in order to strengthen 2D and 3D spatial grounding (Li et al., 19 Mar 2026). In economic theory, “Perceptio” designates a framework for endogenous perception in single-agent screening, where attention and misperception are shaped by incentives (Balzer et al., 2024). In computational semiotics, the term is used to describe perceptual inference through observed–seen cycles and bipartite communication loops in semiotic networks (Kupeev et al., 2023). Taken together, these usages do not define a single unified doctrine. Rather, they mark a family of technically precise efforts to formalize perception as an intermediate structure between raw input and downstream judgment, action, or communication.

1. Perceptio as explicit spatial reasoning in vision–language models

In the most direct contemporary usage, Perceptio is a large vision–language model that makes 2D segmentation and 3D depth explicit inside the autoregressive generation loop (Li et al., 19 Mar 2026). Its central claim is that standard large vision–language models excel at semantic understanding but remain weak at fine-grained spatial grounding because geometry is only implicit in pooled visual features. Perceptio addresses this by requiring the model to emit semantic segmentation and discretized depth tokens before producing a textual answer.

The sequence structure is fixed. The model generates a segmentation control token, then a depth block bracketed by start and end markers, and only afterward emits ordinary text tokens. This constitutes what the paper calls an explicit spatial chain-of-thought: the model first instantiates a concrete spatial interpretation of the scene and then conditions its final answer on that interpretation (Li et al., 19 Mar 2026). The architecture builds on InternVL2.5, a frozen SAM2 encoder and fine-tuned SAM2 decoder for segmentation, and a frozen depth VQ-VAE codebook distilled from Depth Anything V2.
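The fixed ordering described above can be illustrated with a small validator. This is a minimal sketch, not the paper's implementation: the string forms of the control tokens, the function name `split_perceptio_sequence`, and the depth-token spelling used in the usage note are assumptions made for illustration; only the ordering (segmentation token, bracketed depth block of 100 tokens, then text) comes from the description.

```python
from typing import List, Tuple

# Control-token spellings are assumed for illustration.
SEG = "[seg]"
D_START = "[d_start]"
D_END = "[d_end]"
N_DEPTH = 100  # depth tokens per image, as reported


def split_perceptio_sequence(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Split a generated sequence into (depth_tokens, text_tokens).

    Expected layout: [seg] [d_start] <N_DEPTH depth tokens> [d_end] <text...>
    Raises ValueError when the layout is violated.
    """
    if len(tokens) < N_DEPTH + 3:
        raise ValueError("sequence too short for the perceptual prefix")
    if tokens[0] != SEG:
        raise ValueError("expected the segmentation token first")
    if tokens[1] != D_START:
        raise ValueError("expected the depth start marker second")
    end = 2 + N_DEPTH
    if tokens[end] != D_END:
        raise ValueError("expected the depth end marker after N_DEPTH tokens")
    depth = tokens[2:end]
    text = tokens[end + 1:]
    # Control tokens may not reappear inside the depth block or the answer.
    if any(t in (SEG, D_START, D_END) for t in depth + text):
        raise ValueError("control token in an unexpected position")
    return depth, text
```

For example, a sequence built as `[SEG, D_START] + depth_ids + [D_END, "the", "cup"]` with 100 hypothetical depth ids splits into those ids and the two answer tokens, while a sequence that omits the leading segmentation token is rejected.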

Depth is tokenized through a VQ-VAE. Each depth map is encoded into a grid of latent vectors, quantized to codebook indices, and linearized into a fixed-length sequence. The implementation uses codebook size $K=128$ and $n=100$ depth tokens per image, with special markers $[\mathrm{d\_start}]$ and $[\mathrm{d\_end}]$ surrounding the sequence (Li et al., 19 Mar 2026). To stabilize this generation process, the model introduces a composite depth-token objective with marker, token, and count losses, alongside a soft-merging relaxation that reconstructs continuous depth from soft token assignments and permits differentiable supervision.
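The two operations named above, hard quantization to codebook indices and soft merging, can be sketched in a few lines. The sketch makes strong simplifying assumptions that are not from the paper: latents and codebook entries are scalars rather than vectors, the codebook is a uniform grid, and soft merging is taken to be a softmax-weighted average of codebook entries. Only the codebook size and token count are the reported values.

```python
import math
from typing import List

K = 128         # codebook size, as reported
N_TOKENS = 100  # depth tokens per image, as reported


def quantize(latents: List[float], codebook: List[float]) -> List[int]:
    """Map each latent to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: (z - codebook[k]) ** 2)
        for z in latents
    ]


def soft_merge(logits: List[float], codebook: List[float]) -> float:
    """Softmax-weighted average of codebook entries.

    Unlike the argmax in `quantize`, this is smooth in the logits, which is
    what allows a reconstruction loss to pass gradients to the token scores.
    """
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return sum(w * c for w, c in zip(weights, codebook)) / total


# Toy data: a uniform scalar codebook and a linearized ramp of 100 latents.
codebook = [k / (K - 1) for k in range(K)]
latents = [i / (N_TOKENS - 1) for i in range(N_TOKENS)]
tokens = quantize(latents, codebook)
```

When one logit dominates, `soft_merge` approaches the corresponding codebook entry, so the relaxation agrees with hard quantization in the confident limit.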

This design is paired with multi-task co-training across general image QA, pixel grounding conversations, perception-augmented corpora, and referring expression segmentation. The quantitative outcome is a consistent improvement on spatially demanding benchmarks. Perceptio-8B reaches 82.7, 77.9, and 80.0 cIoU on RefCOCO, RefCOCO+, and RefCOCOg respectively, and improves HardBLINK average accuracy to 71.0, reported as a 10.3-point gain over the compared baseline (Li et al., 19 Mar 2026). The same model also reports 83.4 accuracy on MMBench. This suggests that explicit spatial token emission does not merely visualize internal states post hoc; it materially alters downstream reasoning.

2. Tokenization, objectives, and sequence-level supervision

The technical distinctiveness of Perceptio lies in the fact that perception is supervised in the output space rather than inferred only from hidden activations. Segmentation is triggered by a learnable special token $[\mathrm{seg}]$, which conditions the SAM2 decoder to predict a query-grounded mask. Depth is represented as an autoregressively generated token sequence, and the model is trained jointly with text next-token prediction, segmentation reconstruction, the composite depth-token objective, and depth reconstruction via soft-merging (Li et al., 19 Mar 2026).

The depth loss is defined as

$L_{\mathrm{depth}} = \lambda_m L_{\mathrm{marker}} + \lambda_t L_{\mathrm{token}} + \lambda_c L_{\mathrm{count}},$

with reported weights $\lambda_m=0.3$, $\lambda_t=0.5$, and $\lambda_c=0.2$ (Li et al., 19 Mar 2026). The total multi-task objective combines text generation, segmentation reconstruction, depth-token supervision, and depth reconstruction:

$L_{\mathrm{total}} = L_{\mathrm{LLM}} + L_{\mathrm{SegRecon}} + \lambda_d L_{\mathrm{depth}} + \lambda_r L_{\mathrm{DepthRecon}}.$

The reported training uses fixed values for the weights $\lambda_d$ and $\lambda_r$ (Li et al., 19 Mar 2026).
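The two objectives above are plain weighted sums, which the following sketch makes concrete. The function names are hypothetical. The defaults for $\lambda_m$, $\lambda_t$, and $\lambda_c$ are the reported weights; `lam_d` and `lam_r` are deliberately left as required arguments because their values are not given in this text.

```python
def depth_token_loss(marker: float, token: float, count: float,
                     lam_m: float = 0.3, lam_t: float = 0.5,
                     lam_c: float = 0.2) -> float:
    """L_depth = lam_m * L_marker + lam_t * L_token + lam_c * L_count."""
    return lam_m * marker + lam_t * token + lam_c * count


def total_loss(llm: float, seg_recon: float, depth: float, depth_recon: float,
               lam_d: float, lam_r: float) -> float:
    """L_total = L_LLM + L_SegRecon + lam_d * L_depth + lam_r * L_DepthRecon.

    lam_d and lam_r have no defaults: callers must supply them.
    """
    return llm + seg_recon + lam_d * depth + lam_r * depth_recon
```

Because the three depth weights sum to 1, `depth_token_loss` is a convex combination of its component losses, so its value always lies between the smallest and largest component.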

Ablation results indicate that both segmentation and depth tokens are functionally necessary. Removing depth tokens causes HardBLINK average accuracy to drop from 71.0 to 45.2, a decrease of 25.8 points, while removing segmentation tokens degrades general VQA-style metrics, including MMBench from 83.4 to 81.8 and SEED from 75.7 to 73.4 (Li et al., 19 Mar 2026). This indicates that 2D semantic grouping and 3D geometric ordering play complementary roles. A plausible implication is that explicit perceptual intermediates can regularize different failure modes: segmentation constrains object identity and support, while depth constrains relative distance and occlusion.

3. Broader machine-learning context: perceptual constancy and perception-centric training

Perceptio’s emphasis on explicit perceptual intermediates can be situated alongside other recent work that studies perception in large multimodal models, although these works do not use the term identically. “Probing Perceptual Constancy in Large Vision Language Models” evaluates 33 vision–language models on 253 experiments spanning color, size, and shape constancy and reports that humans significantly outperform models overall and in each domain (Sun et al., 14 Feb 2025). The same study finds a dissociation: shape constancy is comparatively strong and relatively scale-insensitive, whereas color and size constancy lag behind and improve with model size. An ANOVA across domains, followed by Tukey HSD comparisons, shows significant differences between shape and both color and size but not between color and size (Sun et al., 14 Feb 2025).

That result is relevant because Perceptio, in the spatial-token sense, addresses a narrower but related bottleneck: the lack of explicit geometric grounding in autoregressive multimodal models. The constancy study suggests that not all perceptual invariances are equally available from scale alone, and that shape-related performance may reflect only “minimal” rather than “robust” constancy (Sun et al., 14 Feb 2025). This suggests that explicit intermediate representations, such as segmentation and depth tokens, may be one route toward stronger model-based spatial reasoning, though the cited paper does not test Perceptio directly.

A second nearby line is reconstruction-based emergence of perceptual metrics. “From Images to Perception: Emergence of Perceptual Properties by Reconstructing Images” shows that a biologically inspired encoder–decoder, PerceptNet, develops feature representations whose distances correlate with human Mean Opinion Scores on TID2013 even without explicit perceptual supervision (Hernández-Cámara et al., 14 Aug 2025). The highest Spearman correlation occurs at the encoder’s V1-like stage, with alignment peaking for moderate noise, small blur, and moderate sparsity (Hernández-Cámara et al., 14 Aug 2025). This suggests a different conception of perceptio: not explicit tokenized geometry, but perceptual structure emerging from reconstructive pressure in biologically constrained architectures.
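The alignment measure mentioned above, Spearman correlation between feature-space distances and Mean Opinion Scores, can be sketched without external libraries. This is a generic rank-correlation routine, not the cited paper's evaluation code, and the numbers in the usage note are invented toy values rather than TID2013 data.

```python
import math
from typing import List


def ranks(values: List[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

With toy distances `[0.1, 0.4, 0.9, 1.5]` and toy opinion scores `[6.2, 5.0, 3.1, 1.7]`, larger distances pair with lower scores throughout, so the rank correlation is perfectly negative. In practice the magnitude of the correlation is what indicates alignment with human judgments.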

4. Perceptio as endogenous perception in mechanism design

A separate and conceptually unrelated usage appears in economic theory. “Mechanism Design with Endogenous Perception” uses “Perceptio” to denote a framework in which an agent’s perception of their private information is itself shaped by incentives (Balzer et al., 2024). The model distinguishes attentive and inattentive cognitive states. When attentive, the agent perceives their type correctly; when inattentive, misperception is governed by a general perception-generating process.

The paper’s central formal contribution is a representation of the ex-ante value of attention in which that value depends only on the allocation rule and not on the transfer rule (Balzer et al., 2024). This yields an analogue of revenue equivalence for attention incentives. The work also defines a welfare-based notion of perceptual accuracy and shows that accuracy is characterized by a sufficient statistic.

Here “Perceptio” does not refer to sensory perception or machine vision. It refers to endogenous cognitive access to private information. Nevertheless, the conceptual parallel is notable: in both the LVLM and mechanism-design usages, perception is not treated as a transparent input channel. It is modeled as a structured intermediate state that can be shaped by architecture or incentives. This suggests only an analogy, not a common theory.

5. Perceptio in semiotic networks and perceptual inference

A third usage appears in “Semiotics Networks Representing Perceptual Inference,” where perceptio is formalized through iterative observed–seen transformations (Kupeev et al., 2023). In this framework, “observed” denotes the raw input at the current step, while “seen” denotes the internal percept derived from it, expressed in the same raw modality. The observed-to-seen mapping is implemented by an encoder–decoder cycle, written here with $\mathrm{Enc}$ and $\mathrm{Dec}$ for the encoder and decoder:

$\mathrm{seen} = \mathrm{Dec}(\mathrm{Enc}(\mathrm{observed})).$

Perception is defined by iterated application of this operation until convergence to a percept image:

$\mathrm{percept}(x) = \lim_{k \to \infty} (\mathrm{Dec} \circ \mathrm{Enc})^{k}(x).$

The corresponding awareness property is the fixed-point identity

$\mathrm{Dec}(\mathrm{Enc}(\mathrm{percept}(x))) = \mathrm{percept}(x).$
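The observed–seen iteration described above can be sketched as a fixed-point loop. This is a toy illustration under stated assumptions, not the paper's networks: `toy_cycle` is a hypothetical stand-in for a trained encoder–decoder pass that pulls each pixel halfway toward 0 or 1, and the function names, tolerance, and step limit are chosen for illustration.

```python
from typing import Callable, List

Image = List[float]


def toy_cycle(x: Image) -> Image:
    """Stand-in for decode(encode(x)): move each pixel halfway to 0 or 1."""
    return [v + 0.5 * ((1.0 if v >= 0.5 else 0.0) - v) for v in x]


def perceive(observed: Image,
             cycle: Callable[[Image], Image] = toy_cycle,
             tol: float = 1e-9,
             max_steps: int = 1000) -> Image:
    """Iterate observed -> seen until successive images differ by less than tol."""
    seen = observed
    for _ in range(max_steps):
        nxt = cycle(seen)
        if max(abs(a - b) for a, b in zip(nxt, seen)) < tol:
            return nxt
        seen = nxt
    raise RuntimeError("observed-seen iteration did not converge")
```

Starting from `[0.2, 0.9, 0.6]`, the loop settles near the binary image `[0, 1, 1]`, and applying `perceive` to that result changes it by less than the tolerance, which is the numerical counterpart of the awareness property.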

The paper extends this intra-agent loop into inter-agent communication by defining semiotic networks in which one agent’s “seen” image becomes another agent’s “observed” image. Under specified conditions, the resulting sequence converges to bipartite orbits with alternating awareness operators (Kupeev et al., 2023). The framework is demonstrated by turning an image classifier into a “perceptualized image classifier”: a baseline classifier is preceded by a perceptual layer that maps inputs to attractor images before classification. On restricted MNIST regimes, the stochastic version, which averages randomized attractors, outperforms both the baseline and the vanilla perceptualized classifier (Kupeev et al., 2023).
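The perceptualized classifier and its stochastic variant can be sketched on toy data. Everything specific here is an assumption made for illustration: the attractor map is simple binarization, the baseline classifier is nearest-template matching, the templates are invented, and randomization is implemented as bounded uniform noise on the input. The source states only that a perceptual layer maps inputs to attractor images before classification and that the stochastic version averages randomized attractors.

```python
import random
from typing import Dict, List

Image = List[float]

# Invented 4-pixel templates for a toy two-class problem.
TEMPLATES: Dict[str, Image] = {
    "bar": [1.0, 1.0, 0.0, 0.0],
    "dot": [0.0, 0.0, 0.0, 1.0],
}


def to_attractor(x: Image) -> Image:
    """Toy perceptual layer: binarize each pixel (assumed attractor map)."""
    return [1.0 if v >= 0.5 else 0.0 for v in x]


def baseline_classify(x: Image) -> str:
    """Toy baseline classifier: label of the nearest template."""
    def sq_dist(t: Image) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, t))
    return min(TEMPLATES, key=lambda label: sq_dist(TEMPLATES[label]))


def perceptualized_classify(x: Image) -> str:
    """Vanilla variant: classify the attractor instead of the raw input."""
    return baseline_classify(to_attractor(x))


def stochastic_classify(x: Image, samples: int = 20, noise: float = 0.2,
                        seed: int = 0) -> str:
    """Stochastic variant: average attractors of perturbed inputs, then classify."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(samples):
        perturbed = [v + rng.uniform(-noise, noise) for v in x]
        for i, a in enumerate(to_attractor(perturbed)):
            total[i] += a
    averaged = [t / samples for t in total]
    return baseline_classify(averaged)
```

On the toy input `[0.9, 0.8, 0.1, 0.2]`, noise of at most 0.2 cannot push any pixel across the 0.5 threshold, so every sampled attractor equals the "bar" template and all three classifiers agree. The variants only differ on inputs close to a basin boundary, where averaging yields a graded image rather than a single hard attractor.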

This conception of perceptio is again distinct from the spatial-token LVLM. Yet it shares an important formal commitment: perception is treated as an explicit intermediate object that can be externalized and inspected. In the semiotic network case, the intermediate object is a fixed-point percept image; in the LVLM case, it is a sequence of segmentation and depth tokens.

6. Conceptual themes, divergences, and open questions

Across these distinct literatures, several recurring themes emerge. First, perception is repeatedly modeled as an intermediate representation rather than as a direct readout of the world. In the LVLM Perceptio, geometry is externalized into segmentation and depth tokens (Li et al., 19 Mar 2026). In endogenous-perception mechanism design, private information is filtered through attentive or inattentive cognitive states (Balzer et al., 2024). In semiotic networks, perception is the fixed point of observed–seen iteration (Kupeev et al., 2023).

Second, explicitness is treated as a methodological advantage. The spatial-token model supervises perception directly in the generated sequence (Li et al., 19 Mar 2026). The semiotic model visualizes what the network “sees” as a stable image (Kupeev et al., 2023). The mechanism-design model derives closed-form representations of attention incentives and perceptual accuracy (Balzer et al., 2024). This suggests a shared methodological preference for making intermediate perceptual structure observable, whether through tokens, attractors, or incentive functionals.

Third, these uses diverge sharply in ontology. The LVLM Perceptio concerns spatial grounding in multimodal autoregressive systems (Li et al., 19 Mar 2026). The mechanism-design Perceptio concerns cognitive states in screening problems (Balzer et al., 2024). The semiotic-network Perceptio concerns perceptual inference and communication loops (Kupeev et al., 2023). Any attempt to collapse them into a single doctrine would go beyond the available evidence. A more defensible interpretation is that “Perceptio” functions as a label for formalized perception-as-structure across otherwise unrelated domains.

Several open questions follow from this dispersion. In multimodal learning, one unresolved issue is whether explicit perceptual tokens improve only benchmark-specific grounding or whether they induce broader, more robust perceptual invariances. The constancy results suggest that the case for broader invariance remains unsettled (Sun et al., 14 Feb 2025). In semiotic modeling, it remains open how far attractor-based perceptualization scales beyond restricted training regimes (Kupeev et al., 2023). In mechanism design, richer cognitive hierarchies beyond the binary attentive–inattentive split remain future work (Balzer et al., 2024). These questions suggest that the contemporary significance of perceptio lies less in terminological unity than in a shared technical ambition: to treat perception as a manipulable and analyzable intermediate layer between input and judgment.

7. Significance in current research

Within current arXiv discourse, the most prominent technical instantiation of Perceptio is the perception-enhanced vision–language model with explicit spatial token generation (Li et al., 19 Mar 2026). Its reported gains on RefCOCO, RefCOCO+, RefCOCOg, HardBLINK, and MMBench indicate that explicit 2D and 3D perceptual representations can strengthen spatial grounding in large multimodal systems (Li et al., 19 Mar 2026). This marks a shift from implicit geometric inference toward supervised perceptual externalization.

At the same time, the broader set of “Perceptio” usages shows that the term has become a locus for formal reflection on what it means to model perception rigorously. In one case, perception is a sequence-level geometric scaffold for answering visual questions (Li et al., 19 Mar 2026). In another, it is an incentive-sensitive interpretation of one’s own type (Balzer et al., 2024). In another, it is a fixed-point transformation between observed and seen states (Kupeev et al., 2023). The common denominator is not a shared domain but a shared formal attitude: perception is something to be represented, not merely assumed.

This suggests that “Perceptio,” as it appears in recent literature, names a broader epistemic program. A plausible implication is that future work using the term will continue to privilege explicit intermediate representations—tokenized, dynamical, or decision-theoretic—over opaque end-to-end mappings. Whether those representations ultimately converge across fields remains unresolved. What is already clear is that the term has become associated with efforts to move perception from latent assumption to first-class computational object.