Size Perception and Embedding (SPE) Module

Updated 27 December 2025
  • The Size Perception and Embedding (SPE) module is an algorithmic framework that extracts and encodes explicit size cues to enhance scale-aware representation across various learning systems.
  • It integrates high-frequency prior extraction and adaptive fusion to boost fine-grained visual recognition, optimize VR perceptual thresholds, and dynamically allocate embedding capacity in recommender systems.
  • Empirical evaluations demonstrate consistent performance gains, including up to 90.2% accuracy in fine-grained visual classification and backbone accuracy improvements of up to 3.5%, validating its utility across multiple applications.

A Size Perception and Embedding (SPE) module refers to a class of neural or algorithmic submodules designed to explicitly extract, encode, and utilize information about the physical or semantic "size" of objects, users, or features within machine learning and cognitive modeling systems. SPE modules address settings where scale—physical, perceptual, or abstract—is a crucial but underrepresented discriminant in standard representation learning. Recent works have formalized SPE designs across vision encoders, recommender systems, and psychophysics-based VR interfaces, with implementations grounded in explicit architectural, loss, and interaction mechanisms.

1. Motivations and Core Principles

SPE modules are motivated by the observation that fine-grained tasks often require networks to distinguish objects or entities not just by appearance or content, but by absolute or relative scale. Standard encoders—such as CNNs, transformers, or matrix factorization systems—tend to be agnostic to scale priors unless provided via dataset normalization, hardcoded features, or dedicated architectural pathways. SPEs aim to:

  • Provide explicit size-related priors or side channels that propagate through representation learning pipelines.
  • Guide discriminative learning by ensuring that size cues are available and attended to at each hierarchical stage.
  • Improve downstream interpretability by structuring latent representations so that size information is linearly or nonlinearly extractable.

Distinct SPE variants address vision (object/seed classification, chart/visual perception), recommender systems (adaptive embedding capacity), and VR perceptual modeling (illusion threshold estimation), each with context-specific requirements and optimal designs (Xing et al., 20 Dec 2025, Lee et al., 30 Jul 2024, Qu et al., 22 Jul 2024, Zhang et al., 8 Apr 2025).

2. SPE for Fine-Grained Visual Recognition

In ancient seed image classification, the SPE module was proposed to encode the absolute size of seeds, enabling the network to distinguish visually similar categories by small, class-informative scale cues (Xing et al., 20 Dec 2025). The design consists of two tightly integrated sub-blocks:

  • High-Frequency Prior Extraction: The RGB input image $f_I \in \mathbb{R}^{B \times 3 \times H \times W}$ undergoes a 2D FFT; the central low-frequency band is suppressed via masking, isolating high-frequency components. After the inverse FFT and dimensionality reduction with a $1 \times 1$ convolution, batch normalization, and ReLU ("CBR" block), a normalized single-channel map $f_H^{(0)}$ is produced and propagated to subsequent backbone stages.
  • Size Embedding and Fusion: At each backbone stage $i$, $f_H^{(i)}$ is aligned with the backbone features $f_B^{(i)}$. Both are normalized through CBR, concatenated along the channel axis, and fused with a channel-attention block (SE or simplified CBAM). The fused tensor is mapped back to channel size $C_i$ and passed to the next stage.

The pipeline ensures continuous injection and refinement of the size prior while maintaining compatibility with hierarchical representation learning. Empirically, SPE modules based on high-frequency size priors outperform edge-based, mask-based, and RGB-based ablations (90.2% accuracy for the high-frequency prior versus at most 88.8% for the alternatives), and integrating SPE into various backbones systematically improves classification performance (e.g., a +3.5% accuracy gain for DenseNet201) (Xing et al., 20 Dec 2025). A sketch of the two sub-blocks follows the table below.

| Prior Type | Accuracy (%) |
|---|---|
| Edge-only | 87.0 |
| Mask | 88.2 |
| Sobel edges | 88.4 |
| Raw RGB | 88.8 |
| Low-frequency prior | 89.2 |
| High-frequency (SPE) | 90.2 |
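
A minimal PyTorch-style sketch of the two sub-blocks is given below. The mask radius, SE reduction ratio, channel widths, and the module names `HighFreqPrior` and `SizeFusion` are illustrative assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def cbr(in_ch, out_ch):
    # "CBR" block: 1x1 convolution + BatchNorm + ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class HighFreqPrior(nn.Module):
    """Extracts a single-channel high-frequency prior map f_H^(0) from an RGB image."""

    def __init__(self, mask_radius=8):  # mask radius is an assumed hyperparameter
        super().__init__()
        self.mask_radius = mask_radius
        self.reduce = cbr(3, 1)

    def forward(self, x):  # x: (B, 3, H, W)
        B, C, H, W = x.shape
        spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
        yy, xx = torch.meshgrid(
            torch.arange(H, device=x.device, dtype=torch.float32),
            torch.arange(W, device=x.device, dtype=torch.float32),
            indexing="ij",
        )
        dist = ((yy - H / 2) ** 2 + (xx - W / 2) ** 2).sqrt()
        mask = (dist > self.mask_radius).float()          # suppress central low frequencies
        high = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real
        return self.reduce(high)                          # (B, 1, H, W) prior map


class SizeFusion(nn.Module):
    """Fuses the (resized) size prior with the backbone features at one stage."""

    def __init__(self, backbone_ch, reduction=16):
        super().__init__()
        hidden = max(2 * backbone_ch // reduction, 1)
        self.norm_prior = cbr(1, backbone_ch)
        self.norm_feat = cbr(backbone_ch, backbone_ch)
        # SE-style channel attention over the concatenated tensor.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * backbone_ch, hidden, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 2 * backbone_ch, 1),
            nn.Sigmoid(),
        )
        self.project = cbr(2 * backbone_ch, backbone_ch)  # map back to C_i channels

    def forward(self, prior, feat):
        prior = F.interpolate(prior, size=feat.shape[-2:], mode="bilinear",
                              align_corners=False)
        fused = torch.cat([self.norm_prior(prior), self.norm_feat(feat)], dim=1)
        fused = fused * self.attn(fused)
        return self.project(fused)
```

In use, `HighFreqPrior` would run once on the input image and the resulting map would be re-fused with the features at every backbone stage via a per-stage `SizeFusion` module.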

3. SPE in Graphical Perception and Image Embedding Models

For general visual models and chart understanding tasks, SPE modules have been formalized to quantify and optimize "channel effectiveness" with respect to the size/magnitude visual channel (Lee et al., 30 Jul 2024). Metrics central to SPE design in this context include:

  • Channel Accuracy ($R^2$): Quantifies the linearity of embedding responses to the true magnitude. For a stimulus series $x_1, \ldots, x_n$ with embeddings $z_1, \ldots, z_n$, PCA and linear regression extract $p_i = v_1^T z_i$, and $R^2$ is computed between $p$ and $x$.
  • Channel Discriminability: Measures the local distance $d_i = \|z_{i+1} - z_i\|_2$ between adjacent magnitude steps, with the number and character of peaks corresponding to just-noticeable differences (JNDs). A minimal computation sketch of both metrics follows this list.
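
The sketch below computes both metrics with numpy and scikit-learn, assuming embeddings have already been extracted for a monotonically increasing stimulus series; the function names and the simple peak criterion are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score


def channel_accuracy(x, Z):
    """R^2 between true magnitudes x (n,) and the first principal
    component of the embeddings Z (n, d)."""
    p = PCA(n_components=1).fit_transform(Z).ravel()      # p_i = v_1^T z_i (centered)
    reg = LinearRegression().fit(p.reshape(-1, 1), x)
    return r2_score(x, reg.predict(p.reshape(-1, 1)))


def channel_discriminability(Z):
    """Local distances d_i = ||z_{i+1} - z_i||_2 between adjacent magnitude
    steps; local maxima indicate JND-like boundaries."""
    d = np.linalg.norm(np.diff(Z, axis=0), axis=1)
    peaks = np.flatnonzero((d[1:-1] > d[:-2]) & (d[1:-1] > d[2:])) + 1
    return d, peaks
```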

Empirical evaluation on CLIP variants finds the best size accuracy with a ViT-L/14 backbone ($R^2 \approx 0.75$), but reveals significant divergence from human channel ranking and JND grouping (CLIP exhibits roughly four JND bins for size, whereas humans are more sensitive in the length channel). These quantitative assessments motivate SPE module loss functions:

  • Accuracy Loss: $L_{\mathrm{acc}} = \frac{1}{n} \sum_i \bigl(f(z_i) - x_i\bigr)^2$
  • Discriminability Loss: $L_{\mathrm{disc}} = \frac{1}{n-1} \sum_i \max\bigl(0,\, m - \|z_{i+1} - z_i\|_2\bigr)^2$
  • Weber Law Regularization: $L_{\mathrm{weber}} = \sum_i \bigl(\|z_{i+1} - z_i\|_2 / x_i - k\bigr)^2$

The recommended architecture inserts a dedicated "size-head" at the encoder output, enabling SPE-optimized training for human-consistent size linearity and discriminability (Lee et al., 30 Jul 2024).
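The three losses admit a direct PyTorch expression. The sketch below assumes a batch of embeddings `z` for a sorted stimulus series `x`, a linear size-head `size_head`, and hyperparameters `m` (margin) and `k` (target Weber fraction); none of these values are specified in the source, so they are placeholders:

```python
import torch


def spe_losses(z, x, size_head, m=0.1, k=0.05):
    """z: (n, d) embeddings for a sorted stimulus series; x: (n,) true magnitudes."""
    # Accuracy loss: mean squared error of the size-head prediction f(z_i).
    l_acc = ((size_head(z).squeeze(-1) - x) ** 2).mean()

    # Distances between embeddings of adjacent magnitude steps.
    d = (z[1:] - z[:-1]).norm(dim=-1)

    # Discriminability loss: hinge on a minimum adjacent-step distance m.
    l_disc = torch.clamp(m - d, min=0.0).pow(2).mean()

    # Weber-law regularization: adjacent distance relative to magnitude stays near k.
    l_weber = ((d / x[:-1] - k) ** 2).sum()

    return l_acc, l_disc, l_weber
```

Here `size_head` could be as simple as `torch.nn.Linear(d, 1)` attached to the encoder output, matching the "size-head" placement described above.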

4. Perceptual Illusion-Space SPE for VR and Haptics

In VR grasping and haptic feedback scenarios, SPE modules formalize the interplay between physically rendered and virtually presented object size and taper angle. Controlled psychophysics studies fit logistic curves to two-alternative forced-choice (2AFC) responses, extracting the downscaling threshold (DT), point of subjective equality (PSE), and upscaling threshold (UT) (Zhang et al., 8 Apr 2025). Key features:

  • Parameterization: Physical width ($S_p$), physical taper angle ($A_p$), virtual width ($S_v$), and virtual taper angle ($A_v$). Virtual-physical incongruence ratios $r_S = S_v / S_p$ and $r_A = A_v / A_p$ serve as input dimensions.
  • Threshold Models: Closed-form rational functions predict DT and UT for both size and angle as analytic functions of $(S_p, A_p, r_A)$ and $(S_p, A_p, r_S)$.
  • Forward and Inverse Prediction APIs: Given $(S_p, A_p, S_v, A_v)$, SPE modules output [DT, UT] intervals and PSEs; inverse routines solve for the virtual-physical configurations required to achieve target perceptual effects.
  • Practical Guidance: Remaining well inside the [DT, UT] range (e.g., >10% margin) maximizes perceptual "proxy equivalence," supporting dynamic retargeting and proxy selection in VR.

This formulation allows SPE modules to serve as computational interfaces for perceptual indistinguishability and to mediate adaptation of proxy affordances based on quantitative illusion-space modeling (Zhang et al., 8 Apr 2025).
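
The paper's closed-form rational threshold functions are not reproduced here; the sketch below instead shows the generic pipeline of fitting a logistic psychometric curve to 2AFC response rates, reading off DT, PSE, and UT, and running a simple forward check. The 25%/75% threshold criterion, the margin handling, and all function names are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit


def logistic(r, pse, slope):
    # Probability of responding "virtual feels larger" at incongruence ratio r.
    return 1.0 / (1.0 + np.exp(-slope * (r - pse)))


def fit_thresholds(ratios, p_larger, lo=0.25, hi=0.75):
    """Fit a logistic curve to 2AFC response rates and return (DT, PSE, UT):
    the ratios at which the fitted probability equals lo, 0.5, and hi."""
    (pse, slope), _ = curve_fit(logistic, ratios, p_larger, p0=[1.0, 10.0])
    dt = pse + np.log(lo / (1 - lo)) / slope   # downscaling threshold
    ut = pse + np.log(hi / (1 - hi)) / slope   # upscaling threshold
    return dt, pse, ut


def within_illusion_space(s_p, s_v, dt, ut, margin=0.10):
    """Forward check: is the virtual/physical width ratio safely inside [DT, UT],
    with a relative safety margin as recommended above?"""
    r_s = s_v / s_p
    span = ut - dt
    return (dt + margin * span) <= r_s <= (ut - margin * span)
```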

5. SPE for Dynamic Embedding Capacity in Recommender Systems

The SCALL method formalizes an SPE module to adaptively allocate embedding dimensionality for users and items under strict memory budgets in streaming recommendation systems (Qu et al., 22 Jul 2024). The approach consists of:

  • Memory-Control Architecture: An embedding table $\mathbf{E}$ of shape $(|\mathcal{U}| + |\mathcal{V}|) \times d_{\max}$ sits atop a binary mask $\mathbf{M}$; each entity's embedding size $d_n$ is the sum of its mask row's activations.
  • Probabilistic Sampling: At each timestep, user/item frequency scores are sampled from power-law distributions parameterized by $\alpha^U, \alpha^V$, normalized to allocation shares $p^U, p^V$, and scaled by a global user-item partition $w$. The budget $B$ is enforced strictly via allocation quantization.
  • Reinforcement Learning Search: The state includes frequency summaries, recent performance, size dispersions, and mean-pooled embeddings. A SAC-style RL agent outputs $(w, \alpha^U, \alpha^V)$, which reconfigures the allocations. The RL reward is based on the ratio of Recall and NDCG@20 of the current embedding configuration relative to the previous segment.
  • Adaptivity for Unseen Entities: New users and items are automatically incorporated via the frequency histograms and allocation share mechanism; no special handling is required.
  • Reservoir Sampling: Buffering of past segments encourages memory retention and regularization.

This design ensures stable adherence to global memory budgets while dynamically adapting representational capacity according to observed interaction frequencies and task needs, all mediated via the SPE module (Qu et al., 22 Jul 2024).
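
A minimal numpy sketch of the budget-constrained allocation step appears below. It assumes per-user and per-item interaction frequencies are available, and the power-law scoring, floor-based quantization, and function names are simplified stand-ins rather than SCALL's exact procedure:

```python
import numpy as np


def allocate_sizes(freq_u, freq_v, w, alpha_u, alpha_v, budget, d_max):
    """Assign per-user and per-item embedding sizes under a global memory budget.

    freq_u, freq_v : interaction frequency counts for users and items
    w              : global user/item budget split in (0, 1)
    alpha_u/v      : power-law exponents concentrating capacity on frequent entities
    budget         : total number of embedding parameters allowed
    d_max          : maximum embedding dimension per entity
    """
    def shares(freq, alpha):
        scores = np.power(np.maximum(freq, 1), alpha)   # power-law frequency scores
        return scores / scores.sum()                    # normalized allocation shares

    p_u, p_v = shares(freq_u, alpha_u), shares(freq_v, alpha_v)
    d_u = np.clip(np.floor(w * budget * p_u), 1, d_max).astype(int)
    d_v = np.clip(np.floor((1 - w) * budget * p_v), 1, d_max).astype(int)
    return d_u, d_v


def build_mask(sizes, d_max):
    # Binary mask M: row n enables the first d_n columns of the embedding table E,
    # so d_n is recovered as the row sum of M.
    return (np.arange(d_max)[None, :] < sizes[:, None]).astype(np.float32)
```

New users or items simply extend `freq_u`/`freq_v` with small counts and are picked up at the next allocation step, mirroring the paper's handling of unseen entities.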

6. Empirical Evidence and Comparative Analysis

Empirical ablations and integration studies across domains confirm SPE module effectiveness:

  • In ancient seed classification, SPE with high-frequency priors enables backbone-agnostic gains, with up to 3.5% higher accuracy compared to non-size-based or edge-only variants (Xing et al., 20 Dec 2025).
  • In graphical model perception, SPE-based losses drive embedding manifolds closer to human-consistent psychophysical criteria; discriminability analysis reveals the critical role of JND structure (Lee et al., 30 Jul 2024).
  • In VR, illusion-space SPE modules predict threshold regions that extend beyond simple ratio models, supporting nuanced controller retargeting (Zhang et al., 8 Apr 2025).
  • In streaming recommendation, SPE (SCALL) modules allow for efficient adaptation to both scale and diversity growth without memory overrun, outperforming traditional static approaches (Qu et al., 22 Jul 2024).

These results confirm the central claim that explicit, context-specific size perception modules provide measurable benefits in domains where scale is an essential—yet otherwise inadequately modeled—signal.

7. Limitations and Prospects

SPE modules are calibrated based on empirical distributions or design priors specific to their domain—such as object scale in seed classification, graphical JNDs in vision, illusion-space boundaries in VR, or embedding frequency spectra in recommender systems. Limitations include:

  • Calibration may not generalize across tasks, especially when data distribution shifts (e.g., off-range physical sizes, novel user/item types, alternative object properties).
  • Current SPE modules typically focus on additive fusion and do not yet exploit more complex interactions between size and other feature channels (e.g., multi-modal cues, temporal aspects, or higher-order correlations).
  • Psychophysical SPE models in VR have limited validation to specific grasp types, shapes, and stimulus ranges (Zhang et al., 8 Apr 2025).
  • Optimization of SPE hyperparameters for best alignment with human or task-specific priors remains an open challenge in several fields.

A plausible implication is that SPE modules will constitute a core architectural pattern in settings where size, magnitude, or capacity must be explicitly controlled, monitored, or decoded with high fidelity, but further research is necessary to generalize across broader domains and to integrate more flexible joint modeling with other salient feature dimensions.
