PillarAttn: Sparse Attention for 3D Detection & LLM Inference
- PillarAttn is a family of sparse and channel-wise attention mechanisms that selectively aggregates features from 'pillars' in both LiDAR-based 3D detection and speculative LLM decoding.
- It employs learnable gating via compact MLPs and dynamic token selection to reduce computational overhead and improve throughput, achieving up to 2.13× faster inference in LLMs and modest accuracy gains in 3D detection.
- The design integrates seamlessly with architectures like PillarNet and PointPillars, adding minimal runtime overhead (5–6 ms per nuScenes frame) while enhancing overall model performance.
PillarAttn refers to a family of sparse and channel-wise attention mechanisms designed specifically for efficient aggregation and selection in pillar-based neural architectures. Its primary applications are in two domains: (a) LiDAR-based 3D object detection, where it fuses features from virtual pillar grids, and (b) LLM inference, where it enables highly efficient sparse self-attention during speculative decoding.
1. Conceptual Overview
PillarAttn aggregates or selectively attends to structured groupings—“pillars”—within data. In 3D perception, it is instantiated as Attentive Pillar Aggregation (APA), merging features extracted from virtual pillar grids. In autoregressive language modeling, PillarAttn sparsifies KV-Cache access by restricting attention to dynamically selected critical (“pillar”) tokens. All instantiations employ learnable, channel-wise (or token-wise) gating—often implemented as compact MLPs with ReLU-sigmoid bottlenecks—adding minimal computational overhead while significantly enhancing downstream accuracy or throughput (Park et al., 11 Mar 2024; Zhao et al., 1 Dec 2025).
2. PillarAttn in LiDAR 3D Object Detection
Fine-Grained Pillar Feature Encoding (FG-PFE) leverages PillarAttn to fuse three orthogonal feature streams extracted per pillar of a LiDAR scan: vertical (V-PFE), temporal (T-PFE), and horizontally shifted (H-PFE) features (Park et al., 11 Mar 2024).
Inputs and Formulation
Given $P$ non-empty pillars and $C$ channels per stream:
- $\mathbf{F}_V \in \mathbb{R}^{P \times C}$: vertical features
- $\mathbf{F}_T \in \mathbb{R}^{P \times C}$: temporal-sweep features
- $\mathbf{F}_H \in \mathbb{R}^{P \times C}$: horizontal-shift features
Concatenate along the channel dimension: $\mathbf{F} = [\mathbf{F}_V \,\Vert\, \mathbf{F}_T \,\Vert\, \mathbf{F}_H] \in \mathbb{R}^{P \times 3C}$.
Channel-wise attention is computed as $\mathbf{A} = \sigma(\mathbf{W}_2\,\mathrm{ReLU}(\mathbf{W}_1 \mathbf{F}))$ and applied as $\hat{\mathbf{F}} = \mathbf{A} \odot \mathbf{F}$, where $\sigma$ is the sigmoid function and $\odot$ denotes element-wise multiplication.
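A minimal PyTorch sketch of this gating, assuming a bottleneck `reduction` ratio of 4 and per-pillar `nn.Linear` layers standing in for 1×1 convolutions (module and argument names are illustrative, not the reference implementation):

```python
import torch
import torch.nn as nn

class AttentivePillarGate(nn.Module):
    """Channel-wise gating over concatenated pillar features (sketch)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Compact MLP with a ReLU-sigmoid bottleneck; per-pillar Linear
        # layers are equivalent to 1x1 convolutions over the pillar axis.
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, f_v, f_t, f_h):
        # Each stream: (P, C) features for P non-empty pillars.
        f = torch.cat([f_v, f_t, f_h], dim=-1)  # (P, 3C)
        a = self.gate(f)                         # (P, 3C) channel attention
        return a * f                             # gated features F_hat
```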
Integration
The gated features $\hat{\mathbf{F}}$ are scattered into the 2D BEV grid, enabling convolutional backbones for box prediction. All operations are parallelized via 1×1 convolutions, and the APA block introduces only 5–6 ms runtime overhead per nuScenes frame.
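The scatter itself is a simple indexed write; a minimal sketch assuming each pillar carries integer BEV coordinates (function and argument names are hypothetical):

```python
import torch

def scatter_to_bev(pillar_feats: torch.Tensor, coords: torch.Tensor,
                   H: int, W: int) -> torch.Tensor:
    """Place per-pillar features into a dense (C, H, W) BEV canvas.

    pillar_feats: (P, C) gated pillar features
    coords:       (P, 2) integer (row, col) BEV cell of each pillar
    """
    C = pillar_feats.shape[1]
    bev = pillar_feats.new_zeros(C, H, W)              # empty cells stay zero
    bev[:, coords[:, 0], coords[:, 1]] = pillar_feats.T
    return bev
```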
Empirical Performance
| Method | mAP (%) | NDS (%) | Latency (ms) |
|---|---|---|---|
| PillarNet-18 baseline | 65.0 | 70.8 | 63 |
| FG-PFE+PillarAttn (full model) | 65.7 | 71.8 | 69 |
FG-PFE+PillarAttn yields a +0.7 mAP and +1.0 NDS improvement over the PillarNet-18 baseline at a minor computational cost of +6 ms per frame (Park et al., 11 Mar 2024).
3. PillarAttn in Self-Speculative Decoding for LLMs
PillarAttn also denotes the sparse attention routine within the SparseSpec self-speculative decoding pipeline for LLM inference, addressing memory-bandwidth bottlenecks in chain-of-thought (CoT) generation (Zhao et al., 1 Dec 2025).
Algorithmic Procedure
- KV-Cache Length: Let $L$ be the current cache size.
- Sparse Draft: Attend only to tokens in the pillar set $\mathcal{P} \subset \{1, \dots, L\}$ of cardinality $k \ll L$.
- Verification: Every $s$ tokens, run full attention, extract the per-head attention scores $a_{h,q,i}$ over cache positions $i$, and average across the $H$ heads and $Q$ queries: $\bar{a}_i = \frac{1}{HQ} \sum_{h=1}^{H} \sum_{q=1}^{Q} a_{h,q,i}$. Select $\mathcal{P}$ as the top-$k$ positions by $\bar{a}_i$ (see the sketch after this list).
- Stride: Repeat, shifting pillar sets dynamically per verification.
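A minimal PyTorch sketch of the verification-time selection and the restricted draft-step scoring (tensor shapes, function names, and the single-query case are assumptions for illustration, not the SparseSpec kernel):

```python
import torch

def select_pillars(attn: torch.Tensor, k: int) -> torch.Tensor:
    """Pick pillar indices from full-attention scores at verification.

    attn: (H, Q, L) attention weights over L cached tokens.
    """
    mean_scores = attn.mean(dim=(0, 1))  # average over heads and queries -> (L,)
    return mean_scores.topk(k).indices   # top-k cache positions

def sparse_draft_scores(q: torch.Tensor, k_cache: torch.Tensor,
                        pillars: torch.Tensor) -> torch.Tensor:
    """Draft-step attention logits restricted to pillar tokens only.

    q: (d,) query; k_cache: (L, d) cached keys; pillars: (k,) indices.
    """
    k_sel = k_cache[pillars]             # gather only k of L key rows -> (k, d)
    return (k_sel @ q) / k_cache.shape[-1] ** 0.5
```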
This leverages the observation that only a small, dynamically selected subset of cache tokens (“pillars”) receives significant attention mass in CoT regimes.
Performance Summary
- Reduces KV-Cache traffic by 90% (up to 6.78× less under the reported pillar-set size, stride, and accepted-token settings).
- Yields up to 2.13× higher throughput than baselines including vLLM and MagicDec on Qwen3-8B (generation length 12k) (Zhao et al., 1 Dec 2025).
4. Computational Considerations
The design intent behind PillarAttn is to maximize parallelizability and minimize overhead.
- In vision, computation per frame is at most $0.5$ GFLOP (Park et al., 11 Mar 2024).
- In LLMs, memory-bandwidth overhead is reduced in rough proportion to the sparsity ratio $k/L$, with the verification stride $s$ governing the draft-verify tradeoff; see the back-of-envelope sketch after this list (Zhao et al., 1 Dec 2025).
- 1×1 convolutions and FC layers allow hardware-efficient batched execution.
- Optimizations such as BatchNorm folding and ONNX export further reduce practical inference times in 3D detection.
- Delayed verification and dynamic KV-Cache management enable high GPU utilization in language modeling.
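As a rough back-of-envelope model, assuming $s$ sparse draft steps at sparsity $k/L$ followed by one full-attention verification (an illustrative approximation, not a formula from the cited sources), the average fraction of the cache read per step is about $(s \cdot k/L + 1)/(s + 1)$:

```python
def avg_cache_fraction(sparsity: float, stride: int) -> float:
    """Approximate fraction of KV-cache read per decoding step, assuming
    `stride` sparse drafts at `sparsity` = k/L plus one full verification.
    Illustrative model only; not taken from the cited papers."""
    return (stride * sparsity + 1.0) / (stride + 1)

# e.g. sparsity 0.10, stride 16 -> ~0.15, i.e. ~85% less cache traffic
print(f"{avg_cache_fraction(0.10, 16):.3f}")
```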
5. Variants and Architectural Placement
In LiDAR 3D detection, PillarAttn (APA) follows feature extraction streams (V-/T-/H-PFE) and precedes the BEV convolutional backbone, serving as the primary aggregation point before grid scattering.
In LLM inference, PillarAttn sits as the attention kernel for all draft steps in speculative decoding; it does not require new weights or training, instead leveraging verification-phase scores to update its sparse pattern each stride.
PillarAttn modules are lightweight, modular, and compatible with existing backbones such as PillarNet, CenterPoint-Pillar, and PointPillars for perception, or as a plug-in to self-speculative decoding frameworks for generative modeling.
6. Empirical Gains, Limitations, and Future Directions
PillarAttn consistently yields measurable gains in accuracy or throughput.
- In 3D detection, NDS gains of +1–4% across backbones are observed with only 5–6 ms runtime cost (Park et al., 11 Mar 2024).
- In LLM speculative decoding, throughput gains of up to 2.13× are observed with no loss in model accuracy for long-form reasoning (Zhao et al., 1 Dec 2025).
Limitations include:
- Diminished benefit for short-sequence or low-sparsity regimes where memory bandwidth is not a bottleneck.
- Potential mismatch with rapidly shifting attention patterns, suggesting that more adaptive pillar selection or hierarchical speculation could further close the performance gap.
- Current implementations do not address encoder-decoder cross-attention, an open direction for LLMs.
A plausible implication is that PillarAttn-like mechanisms will find broader use in other sequence and structure aggregation tasks that benefit from dynamic, selective, and hardware-friendly attention.
7. References to Core Implementations and Datasets
PillarAttn as Attentive Pillar Aggregation is described in the FG-PFE architecture for LiDAR-based 3D detection, evaluated on the nuScenes dataset and tested with PointPillars, CenterPoint-Pillar, and PillarNet-18 as backbones (Park et al., 11 Mar 2024).
PillarAttn as a speculative-sparse attention kernel is evaluated within the SparseSpec framework for reasoning LLM inference with Qwen3-8B as the primary model (Zhao et al., 1 Dec 2025).
Empirical reporting is based on NDS, mAP, latency (ms), throughput (tokens/s for LLM inference; frames per second for detection), and memory-bandwidth utilization. All performance and methodological details strictly reflect those presented in the cited arXiv sources.