OSNet: Omni-Scale Network for re-ID
- OSNet is a deep convolutional network that learns omni-scale features via multi-stream residual blocks specifically designed for person re-identification.
- It employs a unified aggregation gate that adaptively fuses features from varied spatial scales, enhancing discrimination amid intra-class variations.
- OSNet’s efficient factorized convolutions and lightweight design yield competitive accuracy and speed on benchmark re-ID datasets.
The Omni-Scale Network (OSNet) is a deep convolutional neural network architecture designed to facilitate omni-scale feature learning for person re-identification (re-ID). OSNet introduces a multi-stream residual block with dynamic, unified aggregation gating, enabling the extraction and adaptive fusion of features across a continuum of spatial scales within every network layer. This approach explicitly addresses the need for discriminative, fine-grained features and robust generalisation capabilities required for re-ID, especially under conditions of large intra-class variance and small inter-class difference. The architecture is both computationally lightweight and effective, outperforming larger backbones on standard benchmarks without resorting to fixed or static multi-scale aggregation (Zhou et al., 2019a, 2019b).
1. Motivation: Omni-Scale Feature Learning for Person Re-Identification
Person re-identification presents unique challenges stemming from substantial intra-class variation (e.g., changes in pose, viewpoint, occlusion, background) and limited inter-class variation (e.g., similar clothing across individuals). Effective distinction between individuals depends critically on capturing visual cues at multiple spatial scales—ranging from global body shape to minute local details such as logos or accessories.
Traditional architectures often rely on fusing a small number of fixed-scale features, whether globally or locally, using concatenation or summation. However, identity discrimination may require arbitrary and input-specific mixtures of scales, motivating the notion of omni-scale features: the flexible combination of both homogeneous (single-scale) and heterogeneous (multi-scale composite) feature representations within each network block (Zhou et al., 2019a, 2019b).
By dynamically learning which scales to emphasize for each input, OSNet provides a mechanism for both fine detail recall and contextual sensitivity. This enables effective discrimination even among visually similar impostors or in the presence of viewpoint-induced scale shifts.
2. OSNet Block Design: Parallel Multi-Scale Streams and Unified Aggregation
The central building block of OSNet is a multi-stream residual bottleneck module, where each stream captures features at a different spatial extent. Let $\mathbf{x}$ denote the input to the block.
Parallel Streams: Each of $T$ streams ($T = 4$ in the paper) stacks repeated "Lite 3×3" convolution layers (composed of a pointwise $1\times1$ convolution followed by a depthwise $3\times3$ convolution), where the stack depth $t$ determines the stream's receptive field: $3\times3$, $5\times5$, $7\times7$, and $9\times9$ effective kernels for $t = 1, \dots, 4$.

$F^t(\mathbf{x}) = \underbrace{\mathrm{Lite3{\times}3}\left( \cdots\, \mathrm{Lite3{\times}3}(\mathbf{x}) \cdots \right)}_{t \text{ layers}}, \quad t = 1, \dots, T$
Unified Aggregation Gate (AG): To fuse multi-scale stream responses in an adaptive, input-dependent fashion, OSNet introduces a single channel-wise aggregation gate shared among all streams:

$G(\mathbf{x}^t) = \sigma\left(\mathrm{MLP}\left(\mathrm{GAP}(\mathbf{x}^t)\right)\right), \quad \mathbf{x}^t = F^t(\mathbf{x}),$

where GAP is global average pooling, MLP is a two-layer perceptron with a reduced hidden dimension (reduction ratio 16), and $\sigma$ denotes the sigmoid nonlinearity.

The unified AG fuses the streams via channel-wise Hadamard product:

$\tilde{\mathbf{x}} = \sum_{t=1}^{T} G(\mathbf{x}^t) \odot \mathbf{x}^t.$

A residual connection is added: $\mathbf{y} = \mathbf{x} + \tilde{\mathbf{x}}$. This design allows each block to adaptively select between predominantly global, predominantly local, or a mixture of feature scales at every forward pass, guided by the input content (Zhou et al., 2019a, 2019b).
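To make the mechanism concrete, the following is a minimal PyTorch sketch of the omni-scale block. The module names (`LiteConv3x3`, `AggregationGate`, `OSBlock`) are illustrative, and the $1\times1$ reduce/expand bottleneck projections of the paper's actual block are omitted for brevity; this is a sketch of the technique, not the reference implementation.

```python
import torch
import torch.nn as nn

class LiteConv3x3(nn.Module):
    """'Lite 3x3' unit: pointwise 1x1 conv for channel mixing, then a
    depthwise 3x3 conv applying one spatial filter per channel."""
    def __init__(self, channels):
        super().__init__()
        self.pw = nn.Conv2d(channels, channels, 1, bias=False)
        self.dw = nn.Conv2d(channels, channels, 3, padding=1,
                            groups=channels, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.dw(self.pw(x))))

class AggregationGate(nn.Module):
    """Channel-wise gate G(x) = sigmoid(MLP(GAP(x))), shared across streams."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # GAP -> (N, C, 1, 1)
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.mlp(x)                             # Hadamard product

class OSBlock(nn.Module):
    """Omni-scale residual block: T parallel streams of depth t = 1..T,
    fused by one shared aggregation gate, plus a residual connection."""
    def __init__(self, channels, T=4):
        super().__init__()
        self.streams = nn.ModuleList([
            nn.Sequential(*[LiteConv3x3(channels) for _ in range(t)])
            for t in range(1, T + 1)
        ])
        self.gate = AggregationGate(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        fused = sum(self.gate(s(x)) for s in self.streams)
        return self.relu(x + fused)
```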
3. Efficient Factorised Convolutions: Pointwise and Depthwise
To ensure computational efficiency and parameter compactness, OSNet replaces conventional convolutions with sequential pointwise and depthwise operations (“Lite 3×3”):
- Pointwise $1\times1$ convolution for channel mixing: $c_{\mathrm{in}} \cdot c_{\mathrm{out}}$ parameters.
- Depthwise $3\times3$ convolution applied per channel: $3^2 \cdot c_{\mathrm{out}}$ parameters.
This factorisation reduces parameter and compute cost from $3^2 \cdot c_{\mathrm{in}} \cdot c_{\mathrm{out}}$ (standard convolution) to $c_{\mathrm{in}} \cdot c_{\mathrm{out}} + 3^2 \cdot c_{\mathrm{out}}$, enabling architectures as small as 0.2–2.2M parameters with minimal sacrifice of representational power (Zhou et al., 2019a, 2019b).
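For concreteness, a quick back-of-envelope count with illustrative channel sizes (256 input and output channels, chosen hypothetically) shows roughly an 8.7× reduction:

```python
# Parameters of one 3x3 conv layer, bias terms omitted.
c_in, c_out, k = 256, 256, 3                 # illustrative sizes

standard   = k * k * c_in * c_out            # 589,824
factorized = c_in * c_out + k * k * c_out    # 65,536 + 2,304 = 67,840

print(f"standard:   {standard:,}")
print(f"factorized: {factorized:,} ({standard / factorized:.1f}x fewer)")
```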
4. Network Architecture, Scaling, and Training Protocols
A canonical OSNet instantiation for a $256 \times 128$ input consists of:
- conv1: $7\times7$ conv, 64 channels, stride 2
- max-pool: $3\times3$, stride 2
- conv2–conv4: two omni-scale bottleneck blocks each (256, 384, and 512 output channels, respectively), interleaved with $1\times1$ transition conv + $2\times2$ avg-pool
- conv5: $1\times1$ conv, 512 channels
- Global average pooling → 512-D feature
Width and resolution multipliers (one scaling channel counts, the other input resolution) enable precise control over network size and inference cost (Zhou et al., 2019a, 2019b).
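The stage layout can be summarised in a short sketch reusing the `OSBlock` from the earlier sketch; channel counts follow the canonical configuration above, while the exact placement of normalisation layers and transitions is simplified relative to the reference implementation:

```python
import torch
import torch.nn as nn

def build_osnet(width=1.0, feature_dim=512):
    """Skeleton of the canonical OSNet stage layout. `width` scales the
    per-stage channel counts (64, 256, 384, 512)."""
    ch = [int(c * width) for c in (64, 256, 384, 512)]
    layers = [
        nn.Conv2d(3, ch[0], 7, stride=2, padding=3, bias=False),  # conv1
        nn.BatchNorm2d(ch[0]), nn.ReLU(inplace=True),
        nn.MaxPool2d(3, stride=2, padding=1),                     # max-pool
    ]
    in_ch = ch[0]
    for i, out_ch in enumerate(ch[1:]):                           # conv2-conv4
        layers += [nn.Conv2d(in_ch, out_ch, 1, bias=False),       # widen channels
                   OSBlock(out_ch), OSBlock(out_ch)]              # two OS blocks
        if i < 2:                                                 # transitions after conv2, conv3
            layers += [nn.Conv2d(out_ch, out_ch, 1, bias=False),
                       nn.AvgPool2d(2, stride=2)]
        in_ch = out_ch
    layers += [nn.Conv2d(in_ch, ch[3], 1, bias=False),            # conv5
               nn.AdaptiveAvgPool2d(1), nn.Flatten(),             # GAP
               nn.Linear(ch[3], feature_dim)]                     # 512-D feature
    return nn.Sequential(*layers)

net = build_osnet()
feats = net(torch.randn(2, 3, 256, 128))                          # shape (2, 512)
```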
Training follows standard identity-classification objectives: cross-entropy with label smoothing, SGD or AMSGrad optimizers, and extensive data augmentation (random flip, crop, random erasing or RandomPatch). Triplet and center loss terms are sometimes included for metric learning. Both training from scratch and ImageNet fine-tuning are supported (Zhou et al., 2019a, Xie et al., 2020).
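The label-smoothed cross-entropy used for the ID loss spreads a small probability mass `eps` uniformly over all classes; a minimal sketch (recent PyTorch versions expose the same behaviour via `F.cross_entropy(..., label_smoothing=eps)`):

```python
import torch
import torch.nn.functional as F

def smoothed_ce(logits, targets, eps=0.1):
    """Cross-entropy against (1 - eps) * one_hot + (eps / K) * uniform."""
    log_probs = F.log_softmax(logits, dim=1)
    nll = -log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # -log p_true
    uniform = -log_probs.mean(dim=1)                              # -(1/K) sum_i log p_i
    return ((1 - eps) * nll + eps * uniform).mean()
```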
At test time, distances between normalized 512-D features are used for matching.
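A minimal retrieval sketch under these assumptions (cosine distance between L2-normalised embeddings is a common convention here, not a mandate of the papers):

```python
import torch
import torch.nn.functional as F

def rank_gallery(query_feats, gallery_feats):
    """Return, for each query, gallery indices sorted by cosine distance."""
    q = F.normalize(query_feats, dim=1)          # L2-normalise rows
    g = F.normalize(gallery_feats, dim=1)
    dist = 1.0 - q @ g.t()                       # (n_query, n_gallery)
    return dist.argsort(dim=1)                   # best match first
```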
5. Extensions: Branch-Cooperative and Part-Level OSNet
Branch-Cooperative OSNet (BC-OSNet) (Zhang et al., 2020):
BC-OSNet extends OSNet with four cooperative branches beyond the backbone:
- Global branch: Generalized mean (GeM) pooling over all spatial regions (see the GeM sketch after this list).
- Local (PCB-style) branch: Uniform horizontal stripes, each pooled by GeM, concatenated and supervised with a single ID loss.
- Global Contrastive Pooling (GCP) branch: Aggregates global max pooling, stripe-based average pooling, and their contrast for enhanced robustness.
- One-vs-Rest Relation branch: Models relationships among spatial parts for contextual interaction.
Branches are supervised with a combination of ID, batch-hard triplet, and center losses. The final descriptor is the concatenation or weighted sum of all branch outputs. This design improves performance, achieving, for instance, 84.0% mAP and 87.1% rank-1 on CUHK03_labeled (Zhang et al., 2020).
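GeM pooling, used by the global and local branches above, interpolates between average pooling ($p = 1$) and max pooling ($p \to \infty$); a minimal sketch with the commonly used learnable exponent initialised to 3:

```python
import torch
import torch.nn as nn

class GeM(nn.Module):
    """Generalized mean pooling over the spatial dimensions."""
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))   # learnable exponent
        self.eps = eps                           # clamp keeps pow well-defined

    def forward(self, x):                        # x: (N, C, H, W)
        x = x.clamp(min=self.eps).pow(self.p)
        return x.mean(dim=(-2, -1)).pow(1.0 / self.p)   # (N, C)
```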
PLR-OSNet (Part-Level Resolution OSNet) (Xie et al., 2020):
PLR-OSNet introduces a twin-branch head atop the OSNet backbone:
- Global branch: Standard conv4–conv5 + global max pool.
- Local branch: The same blocks as the global branch; the feature map is split into horizontal stripes, each pooled, concatenated, and supervised with a unified ID loss (one softmax across the full concatenated vector).
The use of a single identity prediction (rather than independent losses per stripe) yields an additional +3% mAP, with a total model size of 3.4M parameters. Spatial- and channel-attention modules may be inserted for modest gains (Xie et al., 2020).
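The following sketch illustrates the unified-loss design: one classifier over the concatenated stripe descriptors rather than one per stripe. The module name and the use of average pooling per stripe are illustrative assumptions, not details confirmed by the paper:

```python
import torch
import torch.nn as nn

class PartLevelHead(nn.Module):
    """Split a feature map into horizontal stripes, pool each, concatenate,
    and apply a SINGLE classifier (one softmax / ID loss) over the result."""
    def __init__(self, channels, num_stripes, num_ids):
        super().__init__()
        self.num_stripes = num_stripes
        self.classifier = nn.Linear(channels * num_stripes, num_ids)

    def forward(self, feat_map):                 # (N, C, H, W), H divisible by num_stripes
        n, c, h, w = feat_map.shape
        stripes = feat_map.reshape(n, c, self.num_stripes, h // self.num_stripes, w)
        pooled = stripes.mean(dim=(-2, -1))      # average-pool each stripe (assumption)
        desc = pooled.flatten(1)                 # (N, C * num_stripes)
        return self.classifier(desc), desc       # logits for the unified ID loss
```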
6. Generalisation: Instance Normalisation and NAS-Augmented Variants
Cross-dataset domain gaps are addressed with the OSNet-AIN extension (Zhou et al., 2019b), which integrates instance normalisation (IN) into selected blocks to remove style-specific information (e.g., illumination, color) while preserving discriminative content. The placement of IN layers is discovered via Gumbel-Softmax-based differentiable architecture search over a per-block space of candidate IN configurations.
Continuous relaxation enables joint optimisation of block weights and selection logits. After the search, each block retains its optimal configuration, boosting generalisation. OSNet-AIN achieves, for example, 61.0% rank-1 for Duke→Market and 52.4% rank-1 for Market→Duke in unsupervised cross-dataset re-ID, surpassing many methods that require access to target-domain data (Zhou et al., 2019b).
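A minimal sketch of the Gumbel-Softmax selection mechanism, with a two-way BN-vs-IN choice standing in for OSNet-AIN's actual per-block search space (an illustrative simplification):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SearchableNorm(nn.Module):
    """Differentiable choice between candidate normalisation ops for one block.
    Gradients flow to both the op weights and the selection logits."""
    def __init__(self, channels, tau=1.0):
        super().__init__()
        self.candidates = nn.ModuleList([
            nn.BatchNorm2d(channels),
            nn.InstanceNorm2d(channels, affine=True),
        ])
        self.logits = nn.Parameter(torch.zeros(len(self.candidates)))
        self.tau = tau                           # Gumbel-Softmax temperature

    def forward(self, x):
        # Soft one-hot weights over candidates; hard=True would give a
        # discrete pick with a straight-through gradient estimator.
        w = F.gumbel_softmax(self.logits, tau=self.tau, hard=False)
        return sum(wi * cand(x) for wi, cand in zip(w, self.candidates))
```

After the search converges, each block keeps only the candidate with the highest selection logit.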
7. Empirical Performance and Comparative Analysis
OSNet consistently outperforms comparable compact and large backbones across standard re-ID datasets, achieving state-of-the-art or near state-of-the-art results:
| Dataset | OSNet-scratch | OSNet-ImageNet | BC-OSNet | PLR-OSNet (approx) |
|---|---|---|---|---|
| Market1501 | 93.6% R1, 81.0% mAP | 94.8% R1, 84.9% mAP | 95.6% R1, 89.5% mAP | 94–95% R1 |
| DukeMTMC-reID | 84.7% R1, 68.6% mAP | 88.6% R1, 73.5% mAP | 91.4% R1, 81.2% mAP | 87–88% R1 |
| CUHK03 detected | 57.1% R1, 54.2% mAP | 72.3% R1, 67.8% mAP | 84.3% R1, 80.5% mAP | 77–78% R1 |
| CUHK03 labeled | — | — | 87.1% R1, 84.0% mAP | — |
The unified aggregation gate, dynamic scale fusion, and efficient factorised convolutions are repeatedly validated as critical components via ablation studies: single-scale or statically fused architectures underperform by 2–15% in R1/mAP; decoupling or removing channel-wise dynamic gating yields 2–6% degradation.
Compact OSNet models (≈2.2M parameters or fewer) offer deployment efficiency while maintaining high accuracy, making them appropriate for real-time and resource-constrained applications (Zhou et al., 2019a, Zhang et al., 2020, Zhou et al., 2019b).
References
- "Omni-Scale Feature Learning for Person Re-Identification" (Zhou et al., 2019)
- "Learning Generalisable Omni-Scale Representations for Person Re-Identification" (Zhou et al., 2019)
- "Branch-Cooperative OSNet for Person Re-Identification" (Zhang et al., 2020)
- "Learning Diverse Features with Part-Level Resolution for Person Re-Identification" (Xie et al., 2020)