PerceptiNet: Semantic & Perceptual Systems

Updated 3 July 2026

PerceptiNet is a term for diverse perception-centric systems, including modules for multimodal semantic extraction in 6G embodied networks and perceptual metric learning in vision and haptics.
In the 6G setting, it extracts unified high-level semantic representations from heterogeneous sensors to support adaptive semantic communication and coordinated task planning in complex scenarios.
In visual and haptic applications, PerceptNet leverages deep fusion and bio-inspired architectures to achieve efficient perceptual quality assessment through metric learning.

PerceptiNet is not a single universally fixed architecture across the arXiv literature. In its most exact recent usage, it denotes the Perception Semantic Network within the Collaborative Conversational Embodied Intelligence Network (CC-EIN) for 6G multi-agent embodied systems, where it converts heterogeneous local sensing into a unified semantic representation for downstream semantic communication and coordination (Chen et al., 25 Nov 2025). In parallel, the spelling also overlaps with, or is used informally for, the PerceptNet family of perceptual models in haptics and vision, including metric learning for haptic textures, human-vision-inspired image-quality models, and later self-supervised and parametric variants (Hepburn et al., 2019).

1. Nomenclature and referential scope

The term PerceptiNet is lexically unstable in the literature. In the CC-EIN paper, PerceptiNet is an explicit module name: the paper states, “PerceptiNet extracts high-level semantic information from multimodal data,” and places it as the first stage in the sequence PerceptiNet → DRAOSC → CohesiveMind → InDec (Chen et al., 25 Nov 2025). By contrast, several other papers use the exact spelling PerceptNet, while noting that “PerceptiNet” may arise as a search variation or informal spelling in discussions of the same model family (Kumari et al., 2019).

This split matters because the associated technical objects are different. The CC-EIN PerceptiNet is a multimodal semantic front-end for embodied collaboration, whereas the PerceptNet literature in haptics and vision concerns perceptual metrics, human-vision-inspired embeddings, and image quality assessment. A plausible implication is that “PerceptiNet” functions less as a single canonical architecture than as a label attached to multiple perception-centric systems whose common theme is the construction of task-relevant internal representations.

2. PerceptiNet inside CC-EIN

Within CC-EIN, PerceptiNet is the perception front-end rather than the communication or planning module. The paper explicitly assigns four distinct roles: PerceptiNet extracts semantic information from multimodal sensing, DRAOSC performs adaptive semantic transmission, CohesiveMind handles task decomposition and allocation using global semantics, and InDec provides Grad-CAM-based interpretability (Chen et al., 25 Nov 2025). PerceptiNet therefore creates the semantic substrate on which the rest of the system operates.

Its stated target is the paper’s “perception misalignment” problem. The motivating scenario is a post-earthquake rescue environment populated by heterogeneous embodied intelligent devices—drones, autonomous vehicles, and robot dogs—that observe the world with different sensing modalities and resolutions. The paper motivates multimodality by noting that visual sensing offers broad semantic cues but degrades under occlusion or poor lighting, whereas geometric sensing such as LiDAR or point-cloud sensing provides structure but weaker semantics. PerceptiNet is introduced to extract “high-level semantic representations that are consistent, compact, and relevant for the task.”

Functionally, the output is a unified semantic representation produced from local multimodal sensing on each device. That representation is then consumed in two ways. First, it is passed to DRAOSC as the semantic content to be transmitted under bandwidth and channel constraints. Second, it is used by CohesiveMind as the environmental semantic model for task parsing, decomposition, and allocation. The design rationale is explicit: without a perception module that normalizes heterogeneous sensing into a shared semantic space, later semantic communication and multi-agent planning would lack a stable common language.

3. Architectural composition and interface design

The CC-EIN paper presents PerceptiNet at a high level rather than as a fully specified standalone network. The input side consists of multimodal environment data from the embodied devices, alongside communication-environment evaluation. In Section III-A, the devices are described as being equipped with cameras and LiDAR, while the abstract and conclusion also refer to image + radar or to “visual images, radar signals, and environmental parameters” (Chen et al., 25 Nov 2025). The paper does not formally reconcile this terminology. The safest reading is that PerceptiNet is intended as a multimodal perception-and-semantic-encoding layer combining visual sensing, range or 3D sensing, and communication-state context.

Per-device processing is distributed. In the local perception stage, the drone uses a visual agent with YOLOv11 and HRNet; the autonomous vehicle uses a radar point cloud agent with LIO-SAM and PointPillars to handle LiDAR point cloud data; the robot dog combines close-range camera and point cloud sensing; and a dedicated communication agent analyzes communication-environment data for adaptive transmission optimization (Chen et al., 25 Nov 2025). This suggests a heterogeneous semantic extraction stack rather than a single monolithic encoder.
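The per-device assignment described above can be summarised as a small lookup structure. The following is only an illustrative sketch: the class name `PerceptionAgent`, its fields, the dictionary keys, and the modality labels are hypothetical and do not come from the paper. Only the pairing of devices with agents and named tools is taken from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerceptionAgent:
    """Hypothetical record for one local agent (all field names are illustrative)."""
    role: str
    modalities: tuple
    tools: tuple  # tools named in the text for this agent; empty if none are named


# Device-to-agent pairing as described in the text; keys and labels are assumptions.
LOCAL_AGENTS = {
    "drone": PerceptionAgent(
        "visual agent", ("camera",), ("YOLOv11", "HRNet")),
    "autonomous_vehicle": PerceptionAgent(
        "radar point cloud agent", ("lidar_point_cloud",), ("LIO-SAM", "PointPillars")),
    "robot_dog": PerceptionAgent(
        "close-range agent", ("camera", "point_cloud"), ()),
    "communication": PerceptionAgent(
        "communication agent", ("communication_environment",), ()),
}


def modalities_in_use(agents):
    """Return the sorted set of input modalities covered by all agents."""
    return sorted({m for agent in agents.values() for m in agent.modalities})


print(modalities_in_use(LOCAL_AGENTS))
```

Laid out this way, the stack is visibly heterogeneous: each device contributes a different subset of modalities, and nothing in the structure says how their outputs are merged, which matches the paper's silence on the fusion step.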

The fusion mechanism is described conceptually as cross-modal deep fusion. The abstract states that “a cross-modal fusion maps image and radar data into a unified semantic representation, ensuring consistent task understanding across MEIDs,” and the main text repeatedly refers to “deep fusion of multimodal data” and “unified semantic representations.” However, the paper provides no explicit equations, fusion layers, transformer blocks, cross-attention definitions, concatenation/projection formulas, alignment loss, or representation format. It also does not define whether the output is a token sequence, semantic graph, fixed-length vector, object list, or scene map. The architecture is therefore specified functionally, not analytically.

4. Evaluation, semantic consistency, and underspecification

PerceptiNet’s importance in CC-EIN is most visible at the system interface with semantic communication. The overall framework adopts an “understand first, transmit later” paradigm: task-relevant semantic features are extracted, compressed, transmitted, and reconstructed for task execution. PerceptiNet performs the “understand first” step, while DRAOSC adjusts coding schemes, compression ratios, channel selection, and transmission power according to task urgency and channel conditions such as SNR and bandwidth utilization (Chen et al., 25 Nov 2025). The paper does not attribute quantization or rate adaptation to PerceptiNet itself.

The strongest PerceptiNet-relevant evidence is indirect and comes from the framework-level semantic consistency (SC) metric. SC is defined verbally as the agreement of EIDs’ semantic understanding for task-relevant information, measured against standard semantic descriptions from a knowledge base. The complete CC-EIN achieves SC = 0.89 at 30 dB, compared with 0.84 for GA-PPO, 0.78 for CC-EIN without DRAOSC, and 0.81 for CF; at -10 dB, CC-EIN still achieves 0.30, compared with 0.27, 0.07, and 0.14 respectively (Chen et al., 25 Nov 2025). At the system level, the same paper reports 95.4% task completion rate and 95% transmission efficiency in post-earthquake rescue simulation.
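To make the comparison concrete, the quoted SC values can be tabulated and the gap to the full system computed. This is a plain arithmetic sketch over the numbers cited above; the dictionary layout and the function name `gaps_to_full_system` are illustrative, and the mapping of the three -10 dB baselines follows the "respectively" ordering in the text.

```python
# Semantic consistency (SC) values quoted in the text, keyed by SNR in dB.
SC = {
    30: {"CC-EIN": 0.89, "GA-PPO": 0.84, "CC-EIN w/o DRAOSC": 0.78, "CF": 0.81},
    -10: {"CC-EIN": 0.30, "GA-PPO": 0.27, "CC-EIN w/o DRAOSC": 0.07, "CF": 0.14},
}


def gaps_to_full_system(snr_db):
    """Absolute SC gap between the full CC-EIN and each comparison method."""
    row = SC[snr_db]
    full = row["CC-EIN"]
    return {name: round(full - value, 2)
            for name, value in row.items() if name != "CC-EIN"}


for snr in (30, -10):
    print(snr, gaps_to_full_system(snr))
```

The gap attributable to removing DRAOSC grows from 0.11 at 30 dB to 0.23 at -10 dB. Note that this ablation isolates the transmission module, not the perception module, which is why the evidence for PerceptiNet itself remains indirect.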

At the same time, the paper leaves major aspects of PerceptiNet unspecified. There are no PerceptiNet-specific equations, no explicit fusion or alignment objective, no dataset or optimizer description for the perception stack, no ablation isolating multimodal fusion versus unimodal sensing, and no direct perception metrics such as mAP, IoU, segmentation accuracy, or calibration error. There are likewise no model sizes, FLOPs, memory footprint, latency numbers, or embedded deployment details. The technically faithful interpretation is that PerceptiNet is a named architectural subsystem with a central systems role, but its internal learning mechanics remain high-level (Chen et al., 25 Nov 2025).

5. PerceptNet in haptics: metric learning from ambiguous triplets

A distinct usage of the name family appears in “PerceptNet: Learning Perceptual Similarity of Haptic Textures in Presence of Unorderable Triplets” (Kumari et al., 2019). Here the model is a deep metric learner for haptic textures/signals from human judgments, not an embodied semantic front-end. The goal is to learn an embedding $\phi : \mathbb{R}^n \to \mathbb{R}^m$ in which Euclidean distance corresponds to human perceptual dissimilarity.

The input representation is spectral rather than raw acceleration. The paper transforms 3-axis acceleration data using DFT321, aggregates the magnitude spectrum into 32 geometrically spaced frequency bins whose sizes increase by a factor of 1.8, and applies Gaussian smoothing with $\sigma = 20$ to produce the constant Q-factor filter bank (CQFB) feature vector. On top of these features, PerceptNet is a 1D convolutional neural network with six 1D convolutional layers, three pooling layers, and a final linear fully connected layer producing a 128-dimensional embedding vector: $\phi(x) \in \mathbb{R}^{128}$.

Its central methodological contribution is the treatment of both high-margin triplets and low-margin triplets. High-margin triplets encode orderable judgments; low-margin triplets encode unorderable or ambiguous human comparisons, treated as approximate equality constraints. The optimized loss is

$E = E_H + E_L,$

with

$E_H = \sum_{c \in H} \exp \left( -\rho(c) \right), \qquad E_L = \sum_{c \in L} \left( 1 - \exp \left( -| \rho(c) | \right) \right),$

where

$\rho\left( (x_i, x_j, x_k) \right) = d^2_\phi(x_i, x_k) - d^2_\phi(x_i, x_j).$
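The loss above can be evaluated directly once embeddings are available. The sketch below is a minimal NumPy illustration, not the authors' implementation: the function names, the toy 2-D embeddings, and the index-triplet format are assumptions, and the rows of `emb` stand in for already-computed values of $\phi(x)$.

```python
import numpy as np


def rho(phi_i, phi_j, phi_k):
    """rho = d^2(x_i, x_k) - d^2(x_i, x_j), computed on embedded points."""
    d_ik = np.sum((phi_i - phi_k) ** 2)
    d_ij = np.sum((phi_i - phi_j) ** 2)
    return d_ik - d_ij


def triplet_loss(emb, high, low):
    """E = E_H + E_L for lists of index triplets (i, j, k) into emb."""
    e_h = sum(np.exp(-rho(emb[i], emb[j], emb[k])) for i, j, k in high)
    e_l = sum(1.0 - np.exp(-abs(rho(emb[i], emb[j], emb[k]))) for i, j, k in low)
    return float(e_h + e_l)


# Toy embeddings (assumed, 2-D for readability rather than 128-D).
emb = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, -1.0]])
high = [(0, 1, 2)]  # point 1 is closer to point 0 than point 2: rho = 4 - 1 = 3
low = [(0, 1, 3)]   # points 1 and 3 are equally far from point 0: rho = 1 - 1 = 0
loss = triplet_loss(emb, high, low)
print(loss)
```

In this toy case the low-margin term vanishes because the two distances are equal, and the high-margin term is $\exp(-3)$. Swapping the roles of the two comparison points in the high-margin triplet flips the sign of $\rho$ and raises the loss to $\exp(3)$, which is the behaviour the ordering constraint is meant to penalise.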

The empirical results are reported on synthetic data and the TUM haptic texture dataset. On held-out triplets, the model reaches about 84% TGA; on held-out samples, 73%; and on held-out classes, 67%. On a derived pairwise distinguishable-versus-indistinguishable task, it reports AUC = 0.97. In this line of work, PerceptNet is a parametric perceptual metric that generalizes to new haptic signals without retraining from scratch (Kumari et al., 2019).

6. PerceptNet in vision: human-vision priors, self-supervision, and parametric constraints

In visual perception research, PerceptNet denotes a compact, human-vision-inspired network for estimating perceptual distance between a reference image and a distorted image (Hepburn et al., 2019). The model maps an image $x$ to a perceptual representation $f(x)$, and the distance is

$\|f(x)-f(d(x))\|_2 .$

The training objective shown in the paper is

$\max_f \rho(\|f(x) - f(d(x))\|_2, y),$

where $y$ is the mean opinion score (MOS) and $\rho$ is the Pearson correlation. Architecturally, the model is explicitly aligned with early vision: gamma correction $\to$ opponent colour space $\to$ Von Kries transform $\to$ center-surround filters $\to$ LGN normalisation $\to$ orientation sensitive and multiscale in V1 $\to$ divisive normalisation in V1. The paper reports 36.3k parameters for PerceptNet, compared with 24.7M parameters for LPIPS AlexNet, and gives strong traditional IQA results, including 0.93 Pearson on TID2008 Test and 0.95 on LIVE (Hepburn et al., 2019).
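The objective, a Pearson correlation between representation-space distances and subjective scores, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's training code: the stand-in `f` simply flattens the image instead of applying the early-vision stages, the distortion is additive noise at assumed strengths, and the scores are made up and taken to increase with distortion.

```python
import numpy as np


def perceptual_distance(f, x, x_distorted):
    """L2 distance between the representations of a reference and its distorted copy."""
    return float(np.linalg.norm(f(x) - f(x_distorted)))


def pearson(a, b):
    """Pearson correlation coefficient between two 1-D sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))


def f(image):
    """Placeholder representation (assumption): flatten the image unchanged."""
    return image.ravel()


rng = np.random.default_rng(0)
reference = rng.random((8, 8))
noise = rng.standard_normal((8, 8))
strengths = [0.1, 0.2, 0.4, 0.8]           # assumed distortion levels
scores = [1.0, 2.0, 4.0, 8.0]              # made-up scores, proportional to strength

distances = [perceptual_distance(f, reference, reference + s * noise)
             for s in strengths]
r = pearson(distances, scores)
print(r)
```

Because the placeholder `f` is linear, the distances scale exactly with the noise strength and the correlation is 1 up to rounding. Training a real model amounts to adjusting the parameters of `f` so that this correlation is as large as possible on human-rated data.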

A later study asks whether similar perceptual structure can emerge without perceptual supervision. In “From Images to Perception: Emergence of Perceptual Properties by Reconstructing Images”, a bio-inspired PerceptNet is used as the encoder of an autoencoder trained for autoencoding, denoising, deblurring, and sparsity regularization on about 200,000 natural images from ImageNet (Hernández-Cámara et al., 14 Aug 2025). The reported finding is that the encoder’s final V1-like layer consistently exhibits the highest correlation with human perceptual judgments on TID2013, measured with Spearman correlation against MOS. The dependence on training corruption is non-monotonic: alignment peaks for denoising at moderate noise levels, for deblurring at small blur levels, and for moderate sparsity, while larger penalties reduce it. This suggests that perceptual structure may emerge from an efficient trade-off between fidelity, invariance, and coding efficiency.
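Spearman correlation, the measure used in this study, depends only on rank order, so it rewards any monotonic relation between model distance and MOS. The sketch below contrasts it with Pearson correlation on made-up numbers. The data, the function names, and the exponential relation are assumptions chosen for illustration, and the rank helper assumes no tied values.

```python
import numpy as np


def ranks(values):
    """Ranks 0..n-1 of the entries of values (assumes no ties)."""
    return np.argsort(np.argsort(values))


def spearman(a, b):
    """Spearman correlation as the Pearson correlation of the ranks."""
    ra = ranks(a).astype(float)
    rb = ranks(b).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])


# Made-up model distances and scores that fall off nonlinearly with distance.
distances = np.array([0.1, 0.5, 0.9, 2.0, 4.0])
mos = np.exp(-distances)

rho_spearman = spearman(distances, mos)
rho_pearson = float(np.corrcoef(distances, mos)[0, 1])
print(rho_spearman, rho_pearson)
```

Here the ordering is perfectly reversed, so the Spearman value is -1, while the Pearson value is weaker in magnitude because the relation is not linear. When quality scores decrease with distortion, it is the magnitude of the correlation that indicates alignment.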

The same family was pushed further in “Parametric PerceptNet: A bio-inspired deep-net trained for Image Quality Assessment” (Vila-Tomás et al., 2024). There the early-vision stages are explicitly parameterized as Gaussian, Difference of Gaussians, Gabor, and divisive normalization operators with interpretable parameters. The headline reduction is from 7,598,852 parameters in the non-parametric model to 1062 parameters in the fully parametric version. The paper reports results for the final selected models on KADID10K, TID2008, and TID2013, comparing the non-parametric model with the Parametric Fully Trained model. It also argues that biologically plausible initialization does not guarantee biologically plausible solutions after end-to-end optimization, and that IQA regression performance alone is insufficient to establish human-like internal representations.
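The source of the parameter reduction can be illustrated with a single Gabor filter: instead of learning every tap of a convolution kernel, the kernel is generated from a handful of interpretable parameters. The sketch below uses a textbook Gabor form with an isotropic Gaussian envelope. The function name, the parameter values, and the kernel size are assumptions and are not taken from the paper.

```python
import numpy as np


def gabor_kernel(size, sigma, wavelength, theta, phase=0.0):
    """Build a size x size Gabor kernel (size odd) from four scalar parameters."""
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    # Coordinate along the direction in which the carrier oscillates.
    x_rot = xs * np.cos(theta) + ys * np.sin(theta)
    envelope = np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * x_rot / wavelength + phase)
    return envelope * carrier


kernel = gabor_kernel(size=11, sigma=2.0, wavelength=4.0, theta=0.0)

free_weights = kernel.size  # taps to learn if the kernel were unconstrained
gabor_params = 4            # sigma, wavelength, theta, phase
print(kernel.shape, free_weights, gabor_params)
```

For this 11 x 11 example, four parameters replace 121 free weights. Applying the same idea to every stage of the network is what makes a drop of several orders of magnitude in parameter count possible, and it keeps each learned value interpretable as a scale, frequency, or orientation.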

In this visual lineage, PerceptNet is best understood as a compact perceptual embedding model whose later developments emphasize two complementary claims: first, that early-vision inductive bias can yield strong perceptual metrics with very small parameter counts; second, that perceptual alignment may emerge from the right architecture and reconstruction objective even in the absence of direct human supervision (Vila-Tomás et al., 2024).