LowFormer: Hardware-Efficient Vision Backbone
- LowFormer is a hardware-efficient convolutional–transformer backbone that optimizes runtime by concentrating computation in lower-resolution, high-channel stages.
- It integrates a lightweight 'Lowtention' operator to replace standard multi-head self-attention, significantly reducing memory and compute costs.
- Extensive evaluations on devices like NVIDIA A40, Titan RTX, and edge GPUs demonstrate LowFormer’s superior throughput, lower latency, and competitive accuracy on ImageNet and downstream tasks.
LowFormer is a family of hardware-efficient convolutional–transformer vision backbones designed around measured throughput and latency rather than MACs or FLOPs as the primary notion of efficiency. Its defining idea is the joint optimization of macro design and micro design: computation is concentrated in lower-resolution, higher-channel stages where dense kernels are efficient, while self-attention is executed on lower-resolution, lower-channel feature maps and then restored to the original stage resolution. The 2024 formulation introduces this backbone family as a hardware-aware alternative to conventional efficient backbones, and the 2026 follow-up extends the design with “Lowtention,” an explicitly named lightweight alternative to Multi-Head Self-Attention, additional edge-GPU variants, and broader downstream evaluation (Nottebaum et al., 2024, Nottebaum et al., 27 Mar 2026).
1. Design objective and problem framing
LowFormer is motivated by the observation that MACs are a weak proxy for wall-clock time. The underlying argument is empirical rather than purely asymptotic: runtime depends not only on arithmetic count but also on memory access cost, degree of parallelism, kernel maturity, and fusion opportunities. In the LowFormer papers, depthwise convolutions, grouped convolutions, and high-resolution operators are treated as potentially memory-bound even when their MAC counts are small, whereas large dense convolutions at lower resolution can remain compute-bound and fast (Nottebaum et al., 2024).
This framing leads to a specific benchmarking methodology. In the 2024 study, throughput is measured on an NVIDIA A40 as median images processed per second over 100 iterations at batch size 200, while latency is measured on an NVIDIA Titan RTX over 400 iterations at batch size 16 with TorchScript-compiled models optimized for inference. The same paper also reports latency on an ARM Mali-G76 MP12 mobile GPU and an ARM Cortex-A53 CPU at batch size 1 and a fixed input resolution. The 2026 study retains the A40 throughput measurements and reports latency on a Titan RTX (TorchScript optimize_for_inference), a Jetson TX2 (ONNX Runtime + TensorRT), a Raspberry Pi 5 (ONNX Runtime), and an iPhone 13 (CoreML performance tool) (Nottebaum et al., 2024, Nottebaum et al., 27 Mar 2026).
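The median-throughput protocol described above can be sketched in a few lines. This is a minimal illustration rather than the papers' benchmarking code: `median_throughput`, `dummy_batch`, and the `warmup` parameter are hypothetical names and choices introduced here, and the stand-in workload runs on the CPU. A real GPU harness would additionally need to synchronize the device before reading the timer, because GPU kernels are launched asynchronously.

```python
import statistics
import time


def median_throughput(run_batch, batch_size, iterations=100, warmup=5):
    """Return the median images/second over `iterations` timed calls.

    `run_batch` is any zero-argument callable that processes one batch.
    The warmup calls are an assumption of this sketch, not a detail
    taken from the papers.
    """
    for _ in range(warmup):
        run_batch()
    rates = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_batch()
        elapsed = time.perf_counter() - start
        rates.append(batch_size / elapsed)
    return statistics.median(rates)


def dummy_batch():
    # Stand-in workload; a real harness would run model inference here.
    sum(i * i for i in range(10_000))


throughput = median_throughput(dummy_batch, batch_size=200, iterations=100)
print(f"median throughput: {throughput:.0f} images/s")
```

Reporting the median rather than the mean keeps a few slow iterations (for example, from background activity) from skewing the figure.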
The conceptual consequence is that LowFormer treats efficient architecture design as a hardware-specific systems problem. A plausible implication is that its contributions are best understood less as a claim about minimal arithmetic cost and more as a claim about operator placement, tensor shape selection, and kernel friendliness on real devices.
2. Macro-architecture
LowFormer uses a five-stage hierarchical backbone. For a given input, the spatial resolution is reduced from stage to stage through stride-2 convolutional reductions. The first three stages are intentionally shallow and predominantly convolutional, while the later stages carry most of the depth and include the attention-bearing blocks. This reflects the explicit design rule to minimize layers at high resolutions and push compute deeper into low-resolution stages (Nottebaum et al., 27 Mar 2026).
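As a rough illustration of how stride-2 reductions shape the pyramid, the sketch below lists the feature-map side length after each stage. It assumes a 224-pixel input and exactly one stride-2 reduction at the start of every stage; both are assumptions made for the example, not values taken from the papers, and `stage_resolutions` is a hypothetical helper.

```python
def stage_resolutions(input_size, num_stages=5, stride=2):
    """Side length of the feature map after each stage.

    Assumes one stride-`stride` reduction per stage (an assumption of
    this sketch, not a statement about the published configurations).
    """
    sizes = []
    size = input_size
    for _ in range(num_stages):
        size //= stride
        sizes.append(size)
    return sizes


print(stage_resolutions(224))  # [112, 56, 28, 14, 7]
```

Under these assumptions the last two stages operate on 14x14 and 7x7 maps, which is where the design places most of its depth.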
In the 2026 formulation, a separate stage configuration is reported for each of the variants B0, B1, B1.5, B2, and B3 (Nottebaum et al., 27 Mar 2026).
The early stages rely on stride-2 conv/MBConv downsampling, and the mid-stage emphasizes fused MBConv blocks. In the 2024 paper, the middle stage is described as stacked fused MBConvs, with fusion tied to the channel count, while the late stage replaces some or all blocks with LowFormer Attention blocks. In the 2026 paper, the corresponding description is that the first three stages are pure convolutional blocks and stages 3 and 4 include Lowtention blocks (Nottebaum et al., 2024, Nottebaum et al., 27 Mar 2026).
Normalization and activation are split by path. BatchNorm is used inside MBConv-style convolutional blocks; LayerNorm is used in or around the attention path; HardSwish is used across most convolutional blocks; and GELU is used inside the MLP. There is no explicit positional encoding, whether learned or sinusoidal; instead, depthwise convolutions around attention act as conditional positional encodings (Nottebaum et al., 27 Mar 2026).
3. Micro-architecture: LowFormer Attention and Lowtention
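Standard MHSA is taken in the usual form
$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V,$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d}}\right) V,$$
with output projection by $W_O$, where $d$ is the per-head dimension. Its dominant complexity is $O(N^2 C)$, where $N$ is the token count and $C$ the channel dimension. The memory footprint is dominated by the $N \times N$ attention map and the $Q$, $K$, $V$ tensors (Nottebaum et al., 2024).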
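To make the quadratic cost of standard attention concrete, here is a minimal single-head scaled dot-product attention in pure Python. It is a textbook sketch, not code from the LowFormer papers; the function names are hypothetical, and the learned projections and multi-head split are omitted for brevity.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v):
    """q and k are N x d nested lists, v is N x d_v; returns N x d_v."""
    scale = 1.0 / math.sqrt(len(q[0]))
    width = len(v[0])
    out = []
    for q_row in q:
        # One row of the N x N attention map: N dot products of length d.
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        weights = softmax(scores)
        out.append([sum(w * v_row[c] for w, v_row in zip(weights, v))
                    for c in range(width)])
    return out


# With all-zero queries the weights are uniform, so each output row is
# the mean of the value rows.
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(scaled_dot_product_attention(zeros, zeros, [[1.0, 0.0], [3.0, 2.0]]))
```

The nested loop over queries and keys is where the quadratic dependence on the token count comes from: every query row materializes one full row of the attention map.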
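LowFormer replaces this with a slimmed-down attention block executed “low.” For a stage input $X \in \mathbb{R}^{C \times H \times W}$, spatial resolution is first reduced with a stride-$r$ depthwise convolution, then $Q$, $K$, and $V$ are generated with pointwise $1 \times 1$ convolutions while halving channels. Scaled dot-product attention (SDA) is performed at the reduced resolution, the result is projected back to width $C$, and a transposed depthwise convolution restores the original spatial resolution:
$$X_{\downarrow} = \mathrm{DWConv}_{\text{stride } r}(X) \in \mathbb{R}^{C \times \frac{H}{r} \times \frac{W}{r}},$$
$$Q,\ K,\ V = \mathrm{PWConv}_Q(X_{\downarrow}),\ \mathrm{PWConv}_K(X_{\downarrow}),\ \mathrm{PWConv}_V(X_{\downarrow}) \in \mathbb{R}^{\frac{C}{2} \times \frac{H}{r} \times \frac{W}{r}},$$
$$A = \mathrm{SDA}(Q, K, V),$$
$$Y = \mathrm{DWConvT}_{\text{stride } r}\big(\mathrm{PWConv}_O(A)\big) \in \mathbb{R}^{C \times H \times W}.$$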
The 2026 paper renames this operator family “Lowtention” and presents it as a lightweight alternative to Multi-Head Self-Attention (Nottebaum et al., 2024, Nottebaum et al., 27 Mar 2026).
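The sequence of operations in the block can be followed as a shape trace. The sketch below only tracks tensor shapes and performs no computation; `lowtention_shapes` is a hypothetical helper, the default downsampling factor of 2 is an assumption of the example, and the multi-head split is ignored.

```python
def lowtention_shapes(channels, height, width, r=2):
    """Trace (C, H, W) shapes through a low-resolution attention block.

    The default r=2 is an assumed downsampling factor; halving the
    channels for Q, K, V follows the description in the text.
    """
    if height % r or width % r:
        raise ValueError("height and width must be divisible by r")
    if channels % 2:
        raise ValueError("channels must be even to be halved")
    h_low, w_low = height // r, width // r
    c_half = channels // 2
    tokens = h_low * w_low
    return [
        ("input", (channels, height, width)),
        ("strided depthwise conv", (channels, h_low, w_low)),
        ("Q, K, V (each)", (c_half, h_low, w_low)),
        ("attention map", (tokens, tokens)),
        ("attention output", (c_half, h_low, w_low)),
        ("pointwise projection", (channels, h_low, w_low)),
        ("transposed depthwise conv", (channels, height, width)),
    ]


for name, shape in lowtention_shapes(channels=128, height=14, width=14):
    print(f"{name:28s} {shape}")
```

For a 128-channel 14x14 stage input, attention runs over 49 tokens instead of 196, and the block's output shape matches its input, so it can be dropped into a stage without changing the surrounding layers.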
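The complexity reduction is explicit. Let $N$ be the token count at full stage resolution, $C$ the channel width, and $r$ the downsampling stride. With $N' = N / r^2$ and $C' = C / 2$, the dominant attention term falls from $O(N^2 C)$ to $O(N'^2 C') = O\!\left(N^2 C / (2 r^4)\right)$. For $r = 2$ with halved channels, the theoretical reduction in the dominant $Q K^\top$ term is $32\times$. Memory shrinks in parallel: the attention map area drops by $16\times$, while $Q$, $K$, $V$ shrink by $4\times$ in tokens and by $2\times$ in channels (Nottebaum et al., 2024).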
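The reduction can be checked with a few lines of arithmetic. The sketch below counts only the dominant tokens-squared-times-channels term and assumes a downsampling factor of 2 together with halved channels; the helper names and the 28x28, 128-channel example are hypothetical choices for illustration.

```python
def attention_term(tokens, channels):
    """Dominant attention cost term: tokens^2 * channels."""
    return tokens * tokens * channels


def lowtention_reduction(height, width, channels, r=2):
    """Ratio of full-resolution to low-resolution dominant attention cost."""
    n_full = height * width
    n_low = (height // r) * (width // r)
    full = attention_term(n_full, channels)
    low = attention_term(n_low, channels // 2)
    return full / low


def attention_map_reduction(height, width, r=2):
    """Ratio of attention-map areas (tokens^2) before and after downsampling."""
    n_full = height * width
    n_low = (height // r) * (width // r)
    return (n_full * n_full) / (n_low * n_low)


print(lowtention_reduction(28, 28, 128))   # 32.0
print(attention_map_reduction(28, 28))     # 16.0
```

The factor of 32 decomposes as 16 from the squared 4x token reduction and 2 from the halved channel width.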
The block is followed by LayerNorm and an MLP composed of two pointwise convolutions with a nonlinearity. The authors emphasize that the MLP is more hardware-friendly than MHSA and that the post-attention path can be fused for better runtime. In the 2026 description, the depthwise convolution after attention is fused with the pointwise projection because the input channels there are halved and thus remain in a regime where fused dense kernels are favorable (Nottebaum et al., 27 Mar 2026).
4. Hardware-efficiency thesis and operator-level analysis
A central claim of LowFormer is that memory movement and kernel utilization frequently dominate arithmetic count. The papers therefore compare common modules directly in runtime. Depthwise convolutions have far fewer MACs than standard convolutions but often show little speed advantage, and standard convolutions with substantially more MACs can be similarly fast or faster. The cited explanation is high memory access cost and poor parallelism of grouped/depthwise kernels on common hardware (Nottebaum et al., 2024).
This operator-level analysis extends to fused MBConv. The papers report that fusing the expansion pointwise convolution and the depthwise convolution into a single standard convolution increases MACs yet is faster over broad ranges of resolution and channel counts, especially below roughly 256 channels, because memory traffic is reduced and highly optimized dense convolution kernels can be used. The 2024 ablation quantifies the effect of replacing fused MBConv with unfused MBConv on throughput and top-1 accuracy (Nottebaum et al., 2024). The 2026 ablation for LowFormer-B1 measures the same replacement in terms of throughput, Jetson TX2 latency, ARM latency, and top-1 accuracy, illustrating that the trade-off is device dependent (Nottebaum et al., 27 Mar 2026).
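The MAC increase caused by fusion is easy to quantify. The sketch below counts MACs for an unfused MBConv (pointwise expansion, depthwise convolution, pointwise projection) and for a fused variant (one dense convolution followed by a pointwise projection). The expansion ratio of 4, the 3x3 kernel, stride 1, and the 28x28, 64-channel example are assumptions chosen for illustration, and the helper names are hypothetical.

```python
def conv_macs(h, w, c_in, c_out, k, groups=1):
    """MACs of a stride-1, same-padded convolution on an h x w map."""
    return h * w * (c_in // groups) * c_out * k * k


def mbconv_macs(h, w, c, expand=4, k=3):
    """Unfused MBConv: 1x1 expand, k x k depthwise, 1x1 project."""
    c_mid = c * expand
    return (conv_macs(h, w, c, c_mid, 1)
            + conv_macs(h, w, c_mid, c_mid, k, groups=c_mid)
            + conv_macs(h, w, c_mid, c, 1))


def fused_mbconv_macs(h, w, c, expand=4, k=3):
    """Fused MBConv: one dense k x k expanding conv, then 1x1 project."""
    c_mid = c * expand
    return conv_macs(h, w, c, c_mid, k) + conv_macs(h, w, c_mid, c, 1)


unfused = mbconv_macs(28, 28, 64)
fused = fused_mbconv_macs(28, 28, 64)
print(f"unfused: {unfused:,} MACs, fused: {fused:,} MACs, "
      f"ratio: {fused / unfused:.2f}x")
```

Under these assumptions the fused block needs several times more MACs, which is exactly why its reported runtime advantage cannot be read off the MAC count and has to be measured on the target device.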
High-resolution operators are treated as disproportionately expensive even when MAC-matched to lower-resolution alternatives. The papers describe matched-MAC layer pairs in which the higher-resolution variant is often much slower, while higher-channel computation at lower resolution preserves kernel efficiency. This design rule directly motivates the shallow early stages and the concentration of depth at low resolution (Nottebaum et al., 2024, Nottebaum et al., 27 Mar 2026).
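A matched-MAC pair of the kind described here can be constructed by halving the resolution and doubling the channel count of a dense convolution. The sketch below is an illustrative calculation, not a reproduction of the papers' layer pairs; the helper names and the specific sizes are hypothetical. It shows that the two layers have identical MAC counts while the lower-resolution layer produces half as many output activations, which is one plausible contributor to the runtime gap.

```python
def dense_conv_macs(h, w, channels, k=3):
    """MACs of a stride-1 dense k x k conv with equal input/output channels."""
    return h * w * channels * channels * k * k


def output_activations(h, w, channels):
    """Number of output elements the layer has to write."""
    return h * w * channels


high_res = (56, 56, 64)    # higher resolution, fewer channels
low_res = (28, 28, 128)    # half the resolution, twice the channels

print(dense_conv_macs(*high_res) == dense_conv_macs(*low_res))       # True
print(output_activations(*high_res) / output_activations(*low_res))  # 2.0
```

Halving each spatial side divides the position count by four, and doubling the channels multiplies the per-position work by four, so the MACs cancel exactly while the activation volume does not.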
The attention ablations reinforce the same point. In the 2024 paper, removing low-resolution execution and keeping attention at full resolution reduces throughput, and the latency penalty grows with input resolution, without an accuracy gain. In the 2026 TX2 measurements, removing Lowtention downsampling from the B1 variant causes latency to jump dramatically at high resolutions (Nottebaum et al., 2024, Nottebaum et al., 27 Mar 2026).
5. Quantitative performance
On ImageNet-1K, the 2024 paper reports the following selected results at a single evaluation resolution: LowFormer-B0 has 14.1M parameters, 944M MACs, 5988 images/s, 0.30 ms latency, and 78.4% top-1; LowFormer-B1 has 17.9M parameters, 1410M MACs, 4237 images/s, 0.43 ms, and 79.9%; LowFormer-B1.5 has 33.9M parameters, 2573M MACs, 2739 images/s, 0.66 ms, and 81.2%; LowFormer-B2 has 45.0M parameters, 3689M MACs, 2227 images/s, 0.88 ms, and 81.6%; and LowFormer-B3 has 57.1M parameters, 6098M MACs, 1162 images/s, 1.55 ms, and 83.6% (Nottebaum et al., 2024).
The same paper positions these results against contemporary efficient backbones. LowFormer-B0 is compared with MobileOne-S2 and reported to have higher throughput and approximately 15% lower latency, alongside an accuracy comparison. LowFormer-B1 is reported as faster than EfficientViT-B1 in throughput, alongside a top-1 comparison. LowFormer-B3 matches FAT-B3 at 83.6% top-1 while offering higher throughput and about 45% of the latency (Nottebaum et al., 2024).
The 2026 paper republishes the ImageNet line with additional device measurements. LowFormer-B0 is reported at 5988 images/s, 8.5 ms on Jetson TX2, 39.1 ms on ARM CPU, and 78.4% top-1; B1 at 4237 images/s, 11.7 ms, 59.1 ms, and 79.9%; B1.5 at 2739 images/s, 18.1 ms, 111.6 ms, and 81.2%; B2 at 2227 images/s, 21.6 ms, 144.2 ms, and 81.6%; and B3 at 1162 images/s, 32.5 ms, 273.8 ms, and 83.6%. Comparative highlights include B2 versus BiFormer-T and B3 versus FastViT-SA36, each compared on throughput, TX2 latency, ARM latency, and top-1; B3 and FastViT-SA36 share the same 83.6% top-1 (Nottebaum et al., 27 Mar 2026).
LowFormer is also evaluated as a backbone for downstream tasks. In the 2024 study, RetinaNet on COCO 2017 reports 38.6 AP for LowFormer-B0, 41.4 AP for B2, and 43.1 AP for B3; Semantic FPN on ADE20K reports 39.7 mIoU for B1, 42.8 mIoU for B2, and 44.6 mIoU for B3 (Nottebaum et al., 2024). The 2026 study reproduces these downstream settings and adds latency details at a fixed input resolution: LowFormer-B0 reaches 38.6 AP with backbone throughput 1190 images/s and TX2 latency 22.4 ms; B2 reaches 41.4 AP with 450 images/s and 63.3 ms; B3 reaches 43.1 AP with 245 images/s and 109.0 ms. For ADE20K segmentation, B1 reaches 39.7 mIoU with 840 images/s and 31.6 ms on TX2; B2 reaches 42.8 mIoU with 450 images/s and 63.3 ms; and B3 reaches 44.6 mIoU with 245 images/s and 109.0 ms (Nottebaum et al., 27 Mar 2026).
6. Variants, limitations, and relation to adjacent work
The 2026 extension introduces edge-GPU-oriented variants E1, E2, and E3. LowFormer-E1 is reported at 78.8% top-1, 1350M MACs, 6337 images/s, 1.0 ms GPU latency, 6.2 ms on TX2, and 1.7 ms on iPhone 13; E2 at 81.6%, 3800M MACs, 2070 images/s, 1.5 ms, 14.7 ms, and 2.5 ms; and E3 at 83.0%, 5350M MACs, 1566 images/s, 3.6 ms, 25.0 ms, and 3.6 ms. These variants are presented as evidence that the same hardware-aware recipe can be tuned for additional latency targets (Nottebaum et al., 27 Mar 2026).
The limitations are also explicit. Depthwise convolution remains MAC-efficient but may not realize runtime gains on many GPUs and CPUs because it is memory-bound; specialized accelerators may behave differently. The fused MBConv advantage is strongest at lower channel counts and diminishes at very high channels. Attention downsampling improves runtime dramatically and shows no accuracy loss in the reported ablations, but tasks that critically rely on high-frequency global interactions at full resolution may require tuning the downsampling factor. The papers further note that gains depend on kernel implementations, JIT compilation, vendor libraries, and batch size, so benchmarking on target hardware remains necessary (Nottebaum et al., 2024, Nottebaum et al., 27 Mar 2026).
A useful point of comparison is LRFormer’s Low-Resolution Self-Attention for semantic segmentation, which computes self-attention in a fixed low-resolution space regardless of the input image’s resolution and restores fine detail through $3 \times 3$ depth-wise convolutions (Wu et al., 2023). LowFormer differs in scope and presentation: it is introduced as a hardware-efficient backbone family and later as a broader architecture-design framework with Lowtention, edge variants, and results across classification, detection, segmentation, image retrieval, and visual object tracking (Nottebaum et al., 2024, Nottebaum et al., 27 Mar 2026). This suggests a broader design pattern in modern dense and hierarchical vision models: global context can often be modeled on compressed feature maps, while lightweight convolutional operators preserve locality and hardware efficiency.