
Granularity-Aware Pretraining Strategy

Updated 9 December 2025
  • Granularity-aware pretraining is defined by processing data at multiple semantic resolutions—such as tokens, segments, and hierarchical labels—across language, vision, and multimodal domains.
  • It employs techniques like hybrid vocabularies, lattice structures, and multi-grained contrastive losses to align model representations with desired downstream feature spaces.
  • Empirical results demonstrate that matching pretraining data granularity with target tasks significantly boosts efficiency, accuracy, and domain adaptation.

Granularity-aware pretraining strategy refers to methods that explicitly incorporate or manipulate data and label granularity—in terms of tokenization level, instance grouping, segment structure, object-prior, or hierarchical label space—during the unsupervised or supervised model pretraining process. These methods aim to boost sample efficiency, cross-task generality, domain adaptation, and downstream performance by aligning model representations to multiple semantic levels or by selecting pretraining data that matches the desired downstream feature space.

1. Conceptual Foundations: Granularity in Pretraining

Granularity denotes the resolution at which informative elements (tokens, segments, objects, groups, or classes) are encoded and processed during pretraining. It applies across language, vision, video, and multi-modal domains.

Granularity-aware schemes differ from naïve single-level approaches (e.g., pure subword tokenization or global clip-sentence alignment) by constructing representations and supervision signals explicitly optimized for multiple semantic levels and by maintaining flexible mappings across them.

2. Data Sampling and Tokenization Strategies

Granularity-aware pretraining can begin with data selection protocols that filter, align, or synthesize training samples to match desired feature distributions and downstream tasks.

Multi-Granular Data Sampling via Importance Weights

Target-Aware Language Modeling leverages hybrid token vocabularies (subword, word, multi-word n-grams) for each document, featurizing texts into sparse count-sketch representations. Importance weights are computed as the ratio of empirical densities of feature vectors under target and background distributions:

$$w_i = \frac{\hat{p}_{\mathrm{feat}}(z_i)}{\hat{q}_{\mathrm{feat}}(z_i)}$$

This quantifies each document’s relevance to the target domain in multi-granular token space. Sampling is then performed proportional to $w_i$, yielding high sample efficiency: training on only $\sim$1% of RefinedWeb matched or exceeded full-data pretraining and consistently outperformed random or coarse-only sampling (Chang et al., 2024).
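
A minimal sketch of this sampling protocol, assuming hashed multi-granular token counts and a simple product-of-marginals density estimate in place of the paper's count-sketch machinery (function names and the 1% budget are illustrative):

```python
import numpy as np

def featurize(doc, dim=1024):
    """Hash a document's multi-granular tokens (subwords, words, n-grams) into a
    normalized count vector; a stand-in for the paper's count-sketch featurization."""
    vec = np.zeros(dim)
    for token in doc:                      # doc: iterable of tokens at mixed granularities
        vec[hash(token) % dim] += 1.0
    return vec / max(vec.sum(), 1.0)

def importance_weights(background_docs, target_docs, dim=1024, eps=1e-8):
    """w_i = p_feat(z_i) / q_feat(z_i), estimated here with marginal densities over
    hashed feature bins (an assumption, not the paper's exact estimator)."""
    Zb = np.stack([featurize(d, dim) for d in background_docs])
    Zt = np.stack([featurize(d, dim) for d in target_docs])
    q = Zb.mean(axis=0) + eps              # background marginal mass per feature bin
    p = Zt.mean(axis=0) + eps              # target marginal mass per feature bin
    w = np.exp(Zb @ (np.log(p) - np.log(q)))   # score each background doc by target-likeness
    return w / w.sum()

def sample_subset(background_docs, target_docs, frac=0.01, seed=0):
    """Draw roughly `frac` of the background corpus proportional to importance weights."""
    rng = np.random.default_rng(seed)
    w = importance_weights(background_docs, target_docs)
    k = max(1, int(frac * len(background_docs)))
    idx = rng.choice(len(background_docs), size=k, replace=False, p=w)
    return [background_docs[i] for i in idx]
```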

Vocabulary Adaptation and Utility Optimization

Hybrid vocabularies are constructed by merging base and task-specific tokens and then reduced by minimizing the length-normalized vocabulary entropy:

$$\mathcal{H}_v = -\frac{1}{\ell_v}\sum_{j\in v} P(j)\,\log P(j)$$

Tokens with low utility in the target context are pruned. Empirical proportions of roughly 60% subword, 30% word, and 10% multi-word tokens best preserved cross-domain generality.
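
A hedged sketch of hybrid-vocabulary construction and utility-based pruning, using corpus frequency as a stand-in utility score and the length-normalized entropy above as a diagnostic; the paper's exact pruning criterion may differ:

```python
import math
from collections import Counter

def vocabulary_entropy(token_counts):
    """Length-normalized entropy H_v = -(1 / l_v) * sum_j P(j) log P(j)."""
    total = sum(token_counts.values())
    return -sum((c / total) * math.log(c / total)
                for c in token_counts.values()) / len(token_counts)

def prune_hybrid_vocabulary(base_vocab, task_tokens, corpus, keep_ratio=0.9):
    """Merge base and task-specific tokens, then drop the lowest-utility tokens,
    with corpus frequency used here as a stand-in utility score."""
    hybrid = set(base_vocab) | set(task_tokens)
    counts = Counter(tok for doc in corpus for tok in doc if tok in hybrid)
    ranked = sorted(hybrid, key=lambda t: counts.get(t, 0), reverse=True)
    kept = ranked[: max(1, int(keep_ratio * len(ranked)))]
    # add-one smoothing so the entropy is defined for tokens unseen in the corpus
    return kept, vocabulary_entropy(Counter({t: counts.get(t, 0) + 1 for t in kept}))
```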

Lattice Structures in LLMs

Lattice-BERT builds token graphs from single-character and multi-character word spans, feeding all units into transformer encoders. Lattice position attention (absolute, distance, and overlap biases) encodes fine and coarse positional relations among all token spans, supporting robust multi-granular context modeling (Lai et al., 2021).
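
The sketch below illustrates the general idea of lattice construction and span-geometry attention biases; the lexicon lookup, bias parameterization, and scalar weights are simplified assumptions rather than Lattice-BERT's exact formulation:

```python
import torch

def build_lattice(text, lexicon):
    """Enumerate single-character units plus multi-character words found in `lexicon`,
    each as a (start, end) span over the input string."""
    spans = [(i, i + 1) for i in range(len(text))]
    spans += [(i, j) for i in range(len(text))
              for j in range(i + 2, len(text) + 1) if text[i:j] in lexicon]
    return spans

def lattice_position_bias(spans, w_dist=-0.1, w_overlap=1.0):
    """Pairwise additive attention biases from span geometry: start-distance and
    overlap indicators scaled by placeholder learnable weights. Absolute positions
    would additionally be injected as embeddings on the span inputs themselves."""
    starts = torch.tensor([s for s, _ in spans], dtype=torch.float)
    ends = torch.tensor([e for _, e in spans], dtype=torch.float)
    dist = (starts[:, None] - starts[None, :]).abs()
    overlap = (starts[:, None] < ends[None, :]) & (starts[None, :] < ends[:, None])
    return w_dist * dist + w_overlap * overlap.float()      # [n_spans, n_spans]

# Example: characters plus the words "北京", "大学", and "北京大学" over one input.
spans = build_lattice("北京大学", {"北京", "大学", "北京大学"})
bias = lattice_position_bias(spans)                          # added to attention logits
```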

3. Multi-Grained Supervision and Contrastive Objectives

A second paradigm designs multi-level supervision signals directly in the pretraining objective, via scenario-specific grouping and semantic clustering.

Multi-Grained Contrastive Losses

Fine-grained Multi-Modal Self-Supervised Learning computes contrastive losses at three explicit levels: global, frame/region-level, and phrase-level. Learned attention masks (via small MLPs) automatically reweight unit-to-unit pairs, suppressing noisy or irrelevant alignments, especially in uncurated data. The total loss is a convex combination over granularities (Wang et al., 2021).
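
A minimal sketch of the convex combination over granularities, assuming each level has already been pooled into paired embeddings; the learned attention-mask reweighting is omitted and the weights are illustrative:

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over paired embeddings a[i] <-> b[i] of shape [batch, dim]."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def multi_grained_loss(level_pairs, weights=(0.4, 0.3, 0.3)):
    """Convex combination of contrastive losses over granularities.
    `level_pairs`: list of (visual_emb, text_emb) pairs at the global, frame/region,
    and phrase levels, each already pooled to [batch, dim]."""
    assert abs(sum(weights) - 1.0) < 1e-6, "weights must form a convex combination"
    return sum(w * info_nce(v, t) for w, (v, t) in zip(weights, level_pairs))
```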

Grouping Supervision in Vision

Mugs introduces three concurrent discriminations:

  • Instance discrimination (IDS): augments and aligns representations at individual image level.
  • Local-group discrimination (LGDS): aligns and separates local neighbor groups.
  • Group discrimination (GDS): clusters local groups, aligns soft assignments to trainable prototypes.

Losses from each level contribute equally, yielding multi-granular features that boost generality in classification, segmentation, and detection. Ablation confirms that removing any granular level significantly reduces performance (Zhou et al., 2022).
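
A simplified sketch of the three granular signals, assuming a memory bank of past embeddings for local-group aggregation and cosine-similarity prototype assignments; the actual Mugs heads are more involved than this:

```python
import torch
import torch.nn.functional as F

def local_group_embedding(z, memory_bank, k=8):
    """Aggregate each instance with its top-k nearest neighbours from a memory bank
    of past embeddings (mean pooling), a simplified stand-in for local-group aggregation."""
    sims = F.normalize(z, dim=-1) @ F.normalize(memory_bank, dim=-1).t()    # [B, N]
    neighbours = memory_bank[sims.topk(k, dim=-1).indices].mean(dim=1)      # [B, dim]
    return F.normalize(z + neighbours, dim=-1)

def prototype_assignment(z, prototypes, temperature=0.1):
    """Soft assignment of embeddings to trainable prototypes (group discrimination)."""
    logits = F.normalize(z, dim=-1) @ F.normalize(prototypes, dim=-1).t()
    return F.softmax(logits / temperature, dim=-1)

def mugs_style_total_loss(loss_instance, loss_local_group, loss_group):
    """Equal-weight combination over the three granular supervisions described above."""
    return (loss_instance + loss_local_group + loss_group) / 3.0
```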

Object-Aware Masking and Weighted Reconstruction

SOAR exploits object-centric granularity in video pretraining by constructing patch-level objectness maps, enforcing visibility for object-rich regions across spatiotemporal patches. A weighted loss amplifies gradients for object-centered regions, preserving critical spatial granularity and accelerating convergence with minimal resource requirements (Xian et al., 2024).
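
A sketch of objectness-biased masking and objectness-weighted reconstruction under a masked-autoencoder setup; the mask-sampling rule, weighting form, and amplification factor are assumptions, not SOAR's exact design:

```python
import torch

def objectness_biased_mask(objectness, mask_ratio=0.75):
    """Sample a patch mask that preferentially keeps object-rich patches visible.
    objectness: [B, N] in [0, 1]; returns mask [B, N] with 1 = masked, 0 = visible."""
    noise = torch.rand_like(objectness) + objectness     # higher score => more likely visible
    n_keep = int(objectness.size(1) * (1 - mask_ratio))
    keep_idx = noise.argsort(dim=1, descending=True)[:, :n_keep]
    mask = torch.ones_like(objectness)
    mask.scatter_(1, keep_idx, 0.0)
    return mask

def objectness_weighted_loss(pred, target, mask, objectness, alpha=2.0):
    """Masked reconstruction loss with per-patch errors reweighted by objectness,
    amplifying gradients for object-centered regions. pred/target: [B, N, D]."""
    per_patch = ((pred - target) ** 2).mean(dim=-1)      # [B, N]
    weights = 1.0 + alpha * objectness                   # alpha is a hypothetical factor
    return (per_patch * weights * mask).sum() / mask.sum().clamp(min=1)
```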

4. Label Granularity and Hierarchical Supervision

Granularity is equally pertinent in label-space: pretraining on fine-grained labels (class leaves, fine subclasses) versus coarse classes (super-classes or binary splits).

Theoretical and Empirical Insights

Pretraining on fine label granularity enables networks to learn both common and rare subclass features, supporting generalization to hard test samples. Coarse-label pretraining leads to shortcut learning on common features only, impairing fine-feature transfer. Experiments on ImageNet21k and iNaturalist demonstrate monotonically increasing transfer accuracy with finer pretraining labels, provided a meaningful label hierarchy and label-function alignment is preserved (Hong et al., 2023).

Guidelines recommend pretraining on label sets 5–20× larger than the target task, with at least $10^3$ samples per class, and warn against using random, misaligned, or overly fine-grained labels that can stunt feature enrichment or induce overfitting.
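
These guidelines can be operationalized as a simple selection check over a label hierarchy; the function and thresholds below are illustrative assumptions, not prescriptions from the paper:

```python
def choose_pretraining_label_level(hierarchy_levels, samples_per_class,
                                   n_target_classes, min_samples=1000, size_range=(5, 20)):
    """Return the finest label level whose classes each have >= min_samples examples
    and whose size is 5-20x the target label set; returns None if no level qualifies.
    `hierarchy_levels`: {level_name: [class_names]}; `samples_per_class`: {class: count}."""
    lo, hi = size_range
    # iterate from finest (most classes) to coarsest (fewest classes)
    for level, classes in sorted(hierarchy_levels.items(), key=lambda kv: -len(kv[1])):
        enough_data = all(samples_per_class.get(c, 0) >= min_samples for c in classes)
        right_size = lo * n_target_classes <= len(classes) <= hi * n_target_classes
        if enough_data and right_size:
            return level
    return None
```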

5. Granularity Expansion and Scalable Multimodal Alignment

Recent innovations address the challenge of limited multi-grained datasets by programmatically expanding and compressing data granularity.

Granularity Expansion via Integration and Compression

GEXIA synthesizes novel granularities by concatenating multiple video/text samples (integration) or summarizing longer examples (compression), forming arbitrarily rich temporal and semantic levels without new annotation.

  • Integration: concatenates along the time or token axis.
  • Compression: uses LLMs for text summarization or key-frame extraction (optional for video).
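
A minimal sketch of both operations on (video, caption) pairs; the `summarize` callable stands in for an LLM summarization step and is an assumption of this sketch:

```python
import random

def integrate(samples, k=2, seed=0):
    """Integration: build a coarser-granularity sample by concatenating k (clip, caption)
    pairs along the time axis (frames) and the token axis (caption text)."""
    picked = random.Random(seed).sample(samples, k)
    video = [frame for clip, _ in picked for frame in clip]
    text = " ".join(caption for _, caption in picked)
    return video, text

def compress(video, caption, summarize, keyframe_idx=None):
    """Compression: shorten a long caption via a user-supplied `summarize` callable
    (standing in for an LLM call) and optionally keep only selected key frames."""
    short_text = summarize(caption)
    short_video = [video[i] for i in keyframe_idx] if keyframe_idx else video
    return short_video, short_text
```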

Iterative Approximation Modules

IAMs adaptively refine dense features into low-dimensional seeds for contrastive alignment, with iteration counts scaled to input length/granularity. Ablation studies confirm that granularity-aware iteration-count assignments significantly enhance cross-modal retrieval versus rigid uniform settings (Wang et al., 2024).
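
A simplified sketch of an IAM-style module in which a few learnable seeds cross-attend to dense features, with the iteration count scaled to sequence length; the module structure and the 256-token scaling rule are assumptions for illustration:

```python
import torch
import torch.nn as nn

class IterativeApproximation(nn.Module):
    """Learnable seed vectors repeatedly cross-attend to dense input features,
    with the number of refinement iterations scaled to input length as a granularity proxy."""

    def __init__(self, dim=256, n_seeds=4, n_heads=4):
        super().__init__()
        self.seeds = nn.Parameter(torch.randn(n_seeds, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, features, iters_per_256_tokens=1):
        """features: [batch, length, dim] dense features from one modality."""
        b, length, _ = features.shape
        n_iters = max(1, round(length / 256)) * iters_per_256_tokens  # granularity-aware count
        seeds = self.seeds.unsqueeze(0).expand(b, -1, -1)
        for _ in range(n_iters):
            update, _ = self.attn(seeds, features, features)          # cross-attention refinement
            seeds = self.norm(seeds + update)
        return seeds.mean(dim=1)          # pooled seed used for contrastive alignment
```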

6. Empirical Results and Domain-Specific Applications

Granularity-aware approaches consistently yield improvements in downstream tasks—often with gains in sample-efficiency and generalization, and operational savings in compute and memory.

  • Target-Aware sampling (1% data) matches or outperforms full-data LLM pretraining on eight tasks (Chang et al., 2024).
  • Mugs sets new state-of-the-art linear probe accuracy (ViT-L/16: 82.1% on ImageNet-1K), transfer learning, and segmentation benchmarks (Zhou et al., 2022).
  • SOAR accelerates UAV action recognition with up to 87.5% less pretraining and 2–5% higher accuracy (Xian et al., 2024).
  • Fine-grained label pretraining delivers a 4.6% absolute top-1 accuracy gain on ImageNet-1K vs. coarse pretraining (Hong et al., 2023).
  • GEXIA matches or surpasses SOTA in long-form video retrieval/classification using only short-clip data (Wang et al., 2024).
  • UrbanVLP outperforms prior urban indicator models by an average R² +3.55% across four cities and six socioeconomic tasks. Ablation confirms the necessity of preserving both macro (satellite) and micro (street-view) granularity (Hao et al., 2024).

7. Practical Guidelines and Implications

Practical deployment of granularity-aware pretraining strategies involves the following, summarized in a configuration sketch after the list:

  • Constructing or adapting tokenization and segment vocabularies to match downstream needs and domain specificity.
  • Optimizing feature and supervision mixtures empirically: multi-granular token proportions, loss weights across granular levels, and object-centric masking parameters.
  • Adjusting embedding and attention modules to accommodate hierarchical or multi-level data, leveraging iterative or attention-based mechanisms.
  • Ensuring pretraining label granularity preserves feature alignment and supports target problem requirements.
  • Applying programmatic data granularity expansion (integration and compression) for scalable multi-modal alignment without additional curation.
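
As a compact illustration, these knobs might be gathered into a single configuration; every name and value below is hypothetical and simply maps the bullets above onto concrete hyperparameters:

```python
# Hypothetical configuration; keys and values are illustrative defaults, not recommendations.
pretraining_config = {
    "vocabulary": {"subword_frac": 0.6, "word_frac": 0.3, "multiword_frac": 0.1},
    "contrastive_loss_weights": {"global": 0.4, "frame_region": 0.3, "phrase": 0.3},
    "object_masking": {"mask_ratio": 0.75, "objectness_weight": 2.0},
    "label_granularity": {"size_multiplier": (5, 20), "min_samples_per_class": 1000},
    "granularity_expansion": {"integration_k": 2, "compression": "llm_summarization"},
}
```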

A plausible implication is that granularity-aware schemes will become increasingly critical as foundation models scale in size and application diversity, particularly for scenarios with heterogeneous data, long-tail distributions, or transfer gaps.


Table: Granularity-Aware Pretraining Schemes & Domains

| Approach | Domain | Granular Levels |
|---|---|---|
| Multi-granular token sampling (Chang et al., 2024) | Language | Subword, word, multi-word |
| Lattice construction + MSP (Lai et al., 2021) | Language | Character, word, segment |
| Multi-level contrastive alignment (Wang et al., 2021) | Multimodal | Global, frame/region, phrase |
| Instance/local-group/prototype supervision (Zhou et al., 2022) | Vision | Instance, local group, semantic group |
| Object-aware masking (Xian et al., 2024) | Video | Patch/object, background |
| Granularity expansion + IAM (Wang et al., 2024) | Video | Temporal length, summary |
| Hierarchical label pretraining (Hong et al., 2023) | Vision | Leaf, subtree, coarse class |
| Macro/micro VL fusion (Hao et al., 2024) | Vision-Language | Satellite, street view, location |

Granularity-aware pretraining strategies constitute a rapidly maturing paradigm that yields superior generalization, efficient resource utilization, and robust downstream adaptability by leveraging flexible multi-level representations in language, vision, and multimodal domains.
