Universal Inverted Bottleneck (UIB)

Updated 28 December 2025

Universal Inverted Bottleneck (UIB) is a neural network block that generalizes inverted bottlenecks by parameterizing depthwise convolutions and nonlinearities for hardware efficiency.
It integrates elements from IB, ConvNeXt-style blocks, FFNs, and ExtraDW to balance computational cost, memory footprint, and accuracy.
UIB’s design supports neural architecture search (NAS) in MobileNetV4, enabling optimal performance on CPUs, DSPs, GPUs, and mobile accelerators.

The Universal Inverted Bottleneck (UIB) is a neural network architectural block designed as a flexible and hardware-efficient generalization of classic inverted bottlenecks, serving as the core feature extractor in MobileNetV4. UIB unifies concepts from prior micro-blocks—Inverted Bottleneck (IB), ConvNeXt-style block, feed-forward networks (FFN), and a novel Extra Depthwise (ExtraDW) variant—by parameterizing the presence and position of depthwise convolutions and the structure of intermediate nonlinearities. The resulting design delivers superior trade-offs between computational cost, memory footprint, and accuracy, enabling neural architecture search (NAS) to yield universally efficient models across CPUs, DSPs, GPUs, and mobile accelerators (Qin et al., 16 Apr 2024, Nguyen et al., 13 Mar 2025).

1. Formal Definition and Internal Structure

UIB operates on an input tensor $X \in \mathbb{R}^{H \times W \times C}$ and outputs $Y \in \mathbb{R}^{H' \times W' \times C'}$ . The canonical UIB block proceeds through six key stages: initial expansion (1×1 pointwise convolution), depthwise spatial mixing, optional ExtraDW (a second depthwise convolution branch), a small feed-forward network (two 1×1 convolutions forming an FFN), projection (1×1 convolution), and residual connection.

Mathematical Formulation (stride $s=1$ for clarity):

$\begin{align*} Z_1 & = \sigma(\mathrm{BN}_1(\mathrm{Conv}_{1\times1}(X;\, C \rightarrow rC))) \ Z_2 & = \sigma(\mathrm{BN}_2(\mathrm{DWConv}_{3\times3}(Z_1;\, rC \rightarrow rC))) \ Z_3 & = \mathrm{BN}_3(\mathrm{DWConv}_{3\times3}(Z_2;\, rC \rightarrow rC)) \ F_1 & = \phi (\mathrm{Conv}_{1\times1}(Z_3;\, rC \rightarrow eC)) \ F_2 & = \mathrm{BN}_4(\mathrm{Conv}_{1\times1}(F_1;\, eC \rightarrow rC)) \ P & = \mathrm{BN}_5(\mathrm{Conv}_{1\times1}(F_2;\, rC \rightarrow C)) \ Y & = X + P \end{align*}$

Here $\sigma$ denotes HSwish activation, $\phi$ denotes GELU, and $r$ (expansion ratio) and $e$ (FFN expansion) are hyperparameters (typical values: $r \approx 4\dots6$ , $e \approx 2$ ) (Nguyen et al., 13 Mar 2025).

Parameter and FLOP Counts:

Given kernel size $k=3$ and assuming $H' = H$ , $W' = W$ , the block uses:

$\texttt{Params(UIB)} = (1\times1\times C \times rC) + (k^2 \times rC) + (k^2 \times rC) + (1\times1\times rC \times eC) + (1\times1\times eC \times rC) + (1\times1\times rC \times C)$
$\texttt{FLOPs(UIB)} \approx H \cdot W \cdot [C \cdot (rC) + rC \cdot k^2 + rC \cdot k^2 + rC \cdot (eC) + eC \cdot (rC) + rC \cdot C]$

2. Switchable Structure: Universal Parameterization

UIB is parameterized by two Booleans $b_1$ , $b_2$ , determining the position of (up to two) depthwise convolutions, and two kernel sizes $K_1, K_2 \in \{3, 5\}$ . This configurable structure enables UIB to realize four canonical blocks:

Inverted Bottleneck (IB): $b_1=0$ , $b_2=1$ , $K_2=3$
ConvNeXt-style: $b_1=1$ , $b_2=0$ , $K_1 \in \{3,5\}$
Feed-Forward Network (FFN): $b_1=0$ , $b_2=0$
Extra Depthwise (ExtraDW): $b_1=1$ , $b_2=1$

Pseudocode for one forward pass (simplified):

def UIB(X, C_out, r, b1, b2, K1, K2, stride):
    if b1:
        Z0 = DWConv(X, kernel=K1, stride=stride)
    else:
        Z0 = DWConv(X, 3, stride) if stride > 1 else X
    Z0 = BatchNorm(Z0)
    Z0 = ReLU(Z0)
    C_e = r * C_in
    Z1 = Conv1x1(Z0, out_channels=C_e)
    Z1 = BatchNorm(Z1)
    Z1 = ReLU(Z1)
    Z2 = DWConv(Z1, kernel=K2, stride=1) if b2 else Z1
    Z2 = BatchNorm(Z2) if b2 else Z2
    Z2 = ReLU(Z2) if b2 else Z2
    Z3 = Conv1x1(Z2, out_channels=C_out)
    Z3 = BatchNorm(Z3)
    return X + Z3 if stride == 1 and C_in == C_out else Z3

This unification allows nearly all weights (especially 1×1 convolutions) to be shared across instantiations in supernets employed during architecture search (Qin et al., 16 Apr 2024).

3. Comparison to Prior Blocks

The canonical MobileNetV2 IB block includes a 1×1 expand, a 3×3 depthwise, and a 1×1 project, with no ExtraDW or explicit FFN. UIB adds:

An ExtraDW (second depthwise conv), enhancing spatial mixing and receptive field without significantly increasing memory transfer cost.
An embedded small MLP-style FFN (two 1×1 convs with GELU activation) for channel mixing, inspired by ConvNeXt and Transformer architectures.
Flexible pre/post-expansion depthwise convolution options and expanded kernel sizes.

Empirically, UIB incurs ∼20%–30% more raw FLOPs per block compared to standard IB, but achieves up to ∼10%–15% lower measured latency on mobile CPUs/DSPs due to higher arithmetic intensity and better hardware fusion for FFN and ExtraDW. On ImageNet-1K, UIB improves top-1 accuracy by ~0.4% at iso-latency, and, in ABAW Action-Unit detection, increases F1 by +0.2–0.5% with marginal parameter overhead (Nguyen et al., 13 Mar 2025).

4. Integration in MobileNetV4 and Neural Architecture Search

MobileNetV4 stages are assembled from stacks of UIB blocks, where each block’s type (i.e., [IB, ConvNeXt, FFN, ExtraDW]) and kernel sizes are determined using NAS. The NAS procedure operates in two stages:

Coarse search: Varies the counts and widths of UIB blocks per stage, fixing block type (usually ExtraDW).
Fine search: Freezes the above, then searches over $b_1$ , $b_2$ , $K_1$ , $K_2$ per block (yielding four block forms).

Objective functions balance accuracy and hardware latency, e.g., maximizing $\mathrm{Acc}(\mathcal{A}) - \lambda \max[0, \mathrm{Lat}(\mathcal{A}; \text{target}) - \text{Lat}_{\text{budget}}]$ , where latency is profiled on real hardware (EdgeTPU, ARM, DSP, etc.) (Qin et al., 16 Apr 2024).

A plausible implication is that this NAS protocol, with its compact and weight-sharing supernet (as 1×1 weights are reused), enables efficient and architecture-agnostic tuning of spatial and channel mixing, receptive-field expansion, and compute/memory ratio per device and per resolution.

5. Empirical Performance and Ablation Studies

UIB-based MobileNetV4 models have been demonstrated to be predominantly Pareto-optimal across all tested hardware platforms—Pixel, Samsung, iPhone (CPUs, DSPs, GPUs, and custom accelerators). Ablations show that:

Replacing UIB with standard IB reduces CPU latency by 12% but increases measured FLOPs by 18%, and, more importantly, yields lower F1 on ABAW AU detection (e.g., 0.5441 vs. 0.5369).
Incrementally applying ExtraDW (+0.0039 F1) and full FFN (+0.0033 F1) further improves ABAW pipeline F1 (Nguyen et al., 13 Mar 2025).
Removing either depthwise option (setting $b_2=0$ or $b_1=0$ ) impairs the Pareto-front, confirming the necessity of ExtraDW in certain stages.
Occasional insertion of “FFN” blocks (no depthwise) in late stages boosts high-ridge (GPU) efficiency.

The following table summarizes the empirical trade-offs specifically reported:

Block Variant	ABAW AU F1	Real CPU Latency	FLOPs (% change)
IB (baseline)	0.5369	+12%	–18%
ExtraDW only	0.5408	n/a	n/a
Full UIB (ExtraDW+FFN)	0.5441	baseline	+18%

All values from (Nguyen et al., 13 Mar 2025). “n/a" denotes not reported.

6. Theoretical Efficiency: The Roofline Perspective

The universality of UIB stems from its adaptability under the Roofline model, which represents the trade-off between computational (MAC) intensity and memory bandwidth on different hardware. UIB enables per-block optimization of spatial mixing (depthwise convs with variable kernels) and channel mixing (1×1 pointwise convs, FFN), allowing for:

Large depthwise kernels at low resolutions (efficient on high-ridge devices like mobile GPUs or TPUs).
Predominant use of 1×1 convolutions deeper in the network to minimize memory transfers (efficient on memory-bandwidth-bound CPUs).
Flexible insertion and positioning of spatial and channel mixing so that neither compute nor memory becomes a bottleneck in any network stage.

This adaptability results in models that are almost always on the Pareto-front over the entire MAC/byte ridge-point sweep, according to real-device benchmarks (Qin et al., 16 Apr 2024).

7. Applications and Downstream Integration

In practice, UIB’s rich spatial/channel mixing at multiple resolutions produces semantically strong features, as leveraged in the multi-scale 3D MLP-Mixer-based temporal aggregation module for affective behavior analysis tasks. In evaluated pipelines (e.g., ABAW competition benchmarks), UIB’s combination of compute efficiency, low latency, and predictive accuracy is well-suited for embedded and real-time video analytics on mobile and edge devices (Nguyen et al., 13 Mar 2025). The practical relevance is further highlighted by the ability of models built with UIB to achieve 87% ImageNet-1K accuracy with a Pixel 8 EdgeTPU runtime of 3.8 ms (Qin et al., 16 Apr 2024).