Granularity Control Embedding Overview
- Granularity control embedding is a mechanism that explicitly modulates the resolution at which data is processed via adjustable control signals.
- It employs techniques like learnable tokens, gating matrices, and hierarchical fusion to adjust detail across diverse domains.
- Empirical studies demonstrate performance gains in segmentation, image synthesis, and time-series analysis through controlled granularity.
Granularity control embedding denotes any architectural or algorithmic mechanism that allows a model to represent, inject, or modulate the “resolution” or “level of detail” at which data, instructions, or features are processed. In contemporary research, granularity control embeddings are instantiated as explicit continuous or discrete signals, learned embeddings, structured representations, gating matrices, or hierarchical fusion layers that allow end-users, data generators, or the models themselves to select the semantic, spatial, temporal, behavioral, or operational level at which information is aggregated, manipulated, or output. This paradigm appears across language modeling, vision, sequential time-series analysis, recommendation, compression, and interactive systems.
1. Architectural Instantiations of Granularity Control Embedding
Granularity control embedding is implemented in models as explicit vectors, matrices, tokens, or routing mechanisms, which encode the desired scale or resolution of processing. Notable architectural forms include:
- Learnable granularity tokens: Used in segmentation models (GraCo (Zhao et al., 1 May 2024), UnSAMv2 (Yu et al., 17 Nov 2025)), diffusion pipelines (Li et al., 6 Oct 2024), and transformers for medical time-series (Wang et al., 24 May 2024, Wang et al., 17 Aug 2024), where a granularity vector is added or concatenated to other embedding streams (image, point prompt, click maps) and injected into every transformer block.
- Granularity gating matrices: VL-PET (Hu et al., 2023) introduces a learnable gate into parameter-efficient tuning, with granularity levels (large, middle-X, middle-Y, small) determining how finely the gate modulates hidden states in encoder/decoder modules.
- Self-attentive fusion layers: MGAM (Ji et al., 2023) and Medformer (Wang et al., 24 May 2024) employ multi-stage attention or fusion schemes that dynamically aggregate subset/group/superset embeddings or multi-scale patch/token embeddings through explicit attention, controlling the relative weight of each granularity.
- Recursive hierarchical code blocks: In ReCode (Yu et al., 27 Oct 2025), planning and action are both represented as code (placeholder and primitive calls); recursion depth mirrors “plan versus act” granularity, with the code-embedding tree serving as a dynamic control mechanism.
- Continuous or discretized scalar inputs: Both GraCo and UnSAMv2 accept a scalar controlling granularity and encode it via lookup tables, token embeddings, or high-frequency Fourier features projected through MLPs into the model’s internal space.
In each case, the injection or routing of granularity information is tightly coupled to the model’s operational semantics, enabling explicit, task-dependent modulation of output detail.
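A minimal PyTorch sketch of the first pattern, a learnable granularity token re-injected into every transformer block, is shown below. Module names, dimensions, and the re-injection scheme are illustrative assumptions rather than the exact GraCo or UnSAMv2 design.

```python
# Sketch: a learnable granularity token conditioning a transformer encoder.
# All names and hyperparameters are placeholders, not the papers' settings.
import torch
import torch.nn as nn

class GranularityConditionedEncoder(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=4, n_granularity_bins=10):
        super().__init__()
        # One learnable embedding per discrete granularity bin.
        self.gran_embed = nn.Embedding(n_granularity_bins, d_model)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        ])

    def forward(self, tokens, gran_bin):
        # tokens: (B, N, d_model) image/prompt tokens; gran_bin: (B,) bin indices
        g = self.gran_embed(gran_bin).unsqueeze(1)   # (B, 1, d_model) control token
        x = torch.cat([g, tokens], dim=1)            # prepend the granularity token
        for block in self.blocks:
            x = block(x)
            # Re-inject the granularity token before each block so the scale
            # cue is not washed out in deeper layers.
            x = torch.cat([g, x[:, 1:]], dim=1)
        return x[:, 1:]                              # drop the control token

# usage
enc = GranularityConditionedEncoder()
feats = enc(torch.randn(2, 64, 256), torch.tensor([0, 7]))   # -> (2, 64, 256)
```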
2. Mathematical Formalization and Embedding Construction
A common principle is the introduction of a formal variable or embedding representing granularity. Representative approaches:
- Mask prompt injection (GraCo, UnSAMv2, NVG):
- Discrete bins $g \in \{1,\dots,K\}$ or continuous values $g \in [0,1]$ are mapped to embeddings via lookup or projection: $e_g = \mathrm{Emb}[g]$ from a learned table (GraCo), $e_g = \mathrm{MLP}(\gamma(g))$ with high-frequency Fourier features $\gamma$ (UnSAMv2).
- For hierarchical visual generation (NVG (Wang et al., 18 Aug 2025)), structure embeddings encode each spatial cluster index across coarse-to-fine stages into a $d$-dimensional vector suitable for positional encoding and transformer routing.
- Gated feature correction (VL-PET):
- A granularity-controlled gate $G$ is applied by Hadamard multiplication to the hidden-state correction term, e.g. $h' = h + G \odot \Delta h$; the resolution of $G$ ranges from element-wise to a single global scalar.
- Multi-head and multi-vector projections (M3-Embedding):
- Dense, sparse, and late-interaction heads project token sequences to representations that naturally reflect sentence, paragraph, or document scale, allowing uniform architecture for heterogeneous input granularities (Chen et al., 5 Feb 2024).
- Multi-stage clustering and attention (MFN, MGAM):
- Hierarchical attention over clustering assignments and subset/group/superset embeddings, with each stage reflecting a different granularity of user interests or group preferences.
Discrete-to-continuous transitions (e.g. UnSAMv2’s Fourier embedding, GraCo’s discrete-bin embedding table) ensure a unified representation space compatible with backbone transformer or convolutional architectures.
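The two scalar-to-embedding mappings above can be sketched as follows; the bin count, frequency bands, and layer sizes are placeholder assumptions rather than the papers' exact settings.

```python
# Sketch of the two granularity-to-embedding constructions: a discrete
# lookup table (GraCo-style) and a continuous scalar passed through
# Fourier features + MLP (UnSAMv2-style).
import math
import torch
import torch.nn as nn

class DiscreteGranularityEmbedding(nn.Module):
    def __init__(self, n_bins=6, dim=256):
        super().__init__()
        self.table = nn.Embedding(n_bins, dim)    # one learned vector per bin

    def forward(self, bin_idx):                   # bin_idx: (B,) long
        return self.table(bin_idx)                # (B, dim)

class ContinuousGranularityEmbedding(nn.Module):
    def __init__(self, n_freqs=16, dim=256):
        super().__init__()
        # Fixed frequency bands: gamma(g) = [sin(2^k * pi * g), cos(2^k * pi * g)]_k
        self.register_buffer("freqs", 2.0 ** torch.arange(n_freqs) * math.pi)
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_freqs, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, g):                         # g: (B,) float in [0, 1]
        angles = g[:, None] * self.freqs[None, :]              # (B, n_freqs)
        fourier = torch.cat([angles.sin(), angles.cos()], -1)  # (B, 2*n_freqs)
        return self.mlp(fourier)                               # (B, dim)

# Both map a user-chosen granularity into the same d-dimensional space
# consumed by the backbone.
e_disc = DiscreteGranularityEmbedding()(torch.tensor([0, 5]))
e_cont = ContinuousGranularityEmbedding()(torch.tensor([0.1, 0.9]))
```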
3. Data Generation, Training Objectives, and Supervision Strategies
Effective granularity control requires either explicit labeled data at multiple scales, or automated proxy labels and unsupervised hierarchy mining:
- Self-supervised pseudo-mask generation: UnSAMv2 and GraCo exploit a divide-and-conquer pipeline to automatically generate mask–granularity pairs from unlabeled images, sidestepping costly multi-level annotation (Yu et al., 17 Nov 2025, Zhao et al., 1 May 2024).
- Multi-granular code trace extraction: ReCode builds hierarchical execution trajectories; each tree node is a training sample for supervised fine-tuning at that granularity level (Yu et al., 27 Oct 2025).
- Contrastive and self-distillation losses: M3-Embedding (Chen et al., 5 Feb 2024) binds heads via contrastive learning and ensembling, while controlling input granularity via variable sequence length and chunking.
- Granularity-aware attribute similarity: SGML’s soft-binomial deviance loss leverages semantic attribute-space cosine similarity as a modulating signal on standard metric learning (Manandhar et al., 2019).
- Auxiliary clustering losses: MFN’s entropy-based pretraining encourages diversity and utilization of interest slots at different granularities (Xie et al., 2021).
These mechanisms provide supervision for granularity control in the absence of human-curated part/whole or fine/coarse labels.
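As a rough illustration of such proxy supervision, the sketch below scores automatically mined candidate masks by relative area to form (mask, granularity) training pairs; the specific scoring rule is an expository assumption, not the pipeline used by UnSAMv2 or GraCo.

```python
# Hedged sketch: turn unlabeled candidate masks into (mask, granularity)
# training pairs by mapping relative mask area to a score in [0, 1].
# The log-area rule below is an illustrative assumption.
import numpy as np

def granularity_score(mask: np.ndarray, image_area: int) -> float:
    """Map a binary mask to a granularity value in [0, 1]:
    ~0 for tiny parts, ~1 for whole-object / whole-scene masks."""
    frac = mask.sum() / image_area
    # Log scaling spreads out small parts, which dominate mined hierarchies.
    return float(np.clip(1.0 + np.log10(max(frac, 1e-4)) / 4.0, 0.0, 1.0))

def build_training_pairs(candidate_masks, image_shape):
    h, w = image_shape
    return [(m, granularity_score(m, h * w)) for m in candidate_masks]

# usage: two toy masks, one small part and one large object
part = np.zeros((64, 64), dtype=bool); part[10:14, 10:14] = True
whole = np.zeros((64, 64), dtype=bool); whole[4:60, 4:60] = True
pairs = build_training_pairs([part, whole], (64, 64))
# -> the small mask gets a low granularity score, the large mask a high one
```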
4. Applications Across Domains
Granularity control embeddings have enabled explicit scale-selection, improved performance, and greater flexibility in:
- Interactive and foundation vision models: GraCo produces flexible object/part segmentation across the IoU–granularity curve (Zhao et al., 1 May 2024); UnSAMv2 achieves continuous granularity control with only a marginal increase in parameters (Yu et al., 17 Nov 2025).
- Image synthesis and compression: NVG generates images stage-wise from coarse to fine (Wang et al., 18 Aug 2025); Control-GIC assigns each spatial patch a granularity-dependent VQ code length, yielding bit-rate modulation (Li et al., 2 Jun 2024).
- Text and user modeling: M3-Embedding enables sentence-, paragraph-, and document-level retrieval within a shared model backbone (Chen et al., 5 Feb 2024); MFN extracts coarse- and fine-grained multiple interests for CTR prediction (Xie et al., 2021), and MGAM aggregates subset/group/superset preferences for group recommendation (Ji et al., 2023).
- Medical/EEG time-series: Medformer and ADformer encode patches and channels at variable lengths and counts, with two-stage self-attention, handling inter/intra-granularity dependencies (Wang et al., 24 May 2024, Wang et al., 17 Aug 2024).
- Symbolic decision/planning: ReCode recursively expands code blocks for plan/action unification at varying semantic hierarchy (Yu et al., 27 Oct 2025).
These deployments show that control over granularity enhances adaptability, enables user-specified detail, and supports better generalization.
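At inference time, such systems typically expose granularity as a user-chosen scalar. The hypothetical wrapper below sweeps that scalar over a single prompt to collect part-level through scene-level outputs; the API shown is an assumption, not any cited model's actual interface.

```python
# Hypothetical inference-time sweep over a granularity scalar.
import torch

def sweep_granularity(model, image, click, levels=(0.1, 0.5, 0.9)):
    """Run the same prompt at several granularity levels and collect outputs."""
    return {g: model(image, click, granularity=torch.tensor([g])) for g in levels}

# usage with a dummy stand-in model that ignores its inputs
dummy_model = lambda img, clk, granularity: torch.zeros(1, 64, 64)
masks = sweep_granularity(dummy_model, image=None, click=None)   # keys: 0.1, 0.5, 0.9
```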
5. Empirical Results and Ablations on Granularity-Level Control
Granularity-control mechanisms yield consistent quantitative and qualitative gains. Key results:
| Model & Domain | Mechanism | Metric Impact |
|---|---|---|
| GraCo (Obj/Part Seg) | Embedding + AGG | NoC@90: 1.46, NoC@85: 1.34 |
| UnSAMv2 (SAM-2) | Mask token + Fourier+MLP | NoC: 5.69 → 4.75, 1-IoU: 58.0 → 73.1 |
| VL-PET (VL Tasks) | Granularity gate | CIDEr: 120.19 → 122.03 (large G) |
| NVG (ImageNet) | Structure embedding seq. | FID: 3.03 → 2.06 (coarse-to-fine) |
| MGAM (Group Rec.) | Subset/Group/Superset attn | HR@5: +2–5% over baselines |
| Control-GIC (Compr.) | Patch-wise entropy mask | Fine-grained rate control; quality tracked via LPIPS, FID, PSNR |
Ablations confirm that masking or removing the granularity signal, fusion layers, or router tokens, or switching to simple additive gates, degrades performance (e.g., VL-PET's granularity-controlled multiplicative gating outperforms an additive variant; sampling different granularity values in GraCo yields IoU curves congruent with user intent). There is no universal "more granularity is better": the optimal level depends on the data domain (Wang et al., 17 Aug 2024).
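A compact sketch of the gating-granularity ablation axis (element-wise vs. per-row, per-column, or global-scalar gates) is given below; shapes and level names follow the description in Sections 1–2 but are otherwise illustrative assumptions, not the VL-PET implementation.

```python
# Sketch of gate resolutions for a gated correction term:
# h' = h + G * delta_h, where the shape of G sets the gating granularity.
import torch
import torch.nn as nn

class GatedCorrection(nn.Module):
    def __init__(self, seq_len=32, d_model=256, level="large"):
        super().__init__()
        shapes = {
            "large":    (seq_len, d_model),  # element-wise gate
            "middle_x": (seq_len, 1),        # one value per token position
            "middle_y": (1, d_model),        # one value per feature channel
            "small":    (1, 1),              # single global scalar
        }
        self.gate = nn.Parameter(torch.ones(shapes[level]))

    def forward(self, hidden, correction):
        # hidden, correction: (B, seq_len, d_model); the gate broadcasts
        # over the batch dimension (Hadamard-style modulation).
        return hidden + self.gate * correction

# An ablation swaps `level` (or replaces the product with an additive bias)
# and compares downstream metrics, as in the studies cited above.
h, dh = torch.randn(2, 32, 256), torch.randn(2, 32, 256)
out = GatedCorrection(level="small")(h, dh)
```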
6. Limitations and Prospects for Future Granularity-Control Embedding
Challenges remain in automated hierarchical data generation, optimizing the number and distribution of granularity bins/levels, and mitigating brittleness in user input scaling or code recursion. The surveyed approaches remain sensitive to failure modes of their respective pipelines: syntactic errors in generated code (ReCode (Yu et al., 27 Oct 2025)), noisy hierarchy mining (UnSAMv2), and over- or under-decomposition of granularity levels (ADformer ablations). In practice, the optimal fusion or gating configuration must be tuned empirically.
Granularity control embedding, through scalar injection, gating, hierarchical recursion, or cross-modal fusion, has become a crucial module in bringing multi-resolution flexibility to state-of-the-art models in vision, language, sequential, and symbolic domains. Ongoing work is expected to focus on dynamic adaptation, online granularity estimation, and further unification of granularity control into both training and inference workflows.