TensorPool: Tensor-Centric Aggregation & Processing

Updated 4 July 2026

TensorPool is a tensor-centric term that denotes distinct constructs: a many-core AI-native RAN processor, a high-order descriptor for action recognition, and a CP-based pooling layer for graph neural networks.
The hardware variant delivers 8.4 TFLOPS at 4.3W using a 3D-stacked processor with 256 cores and 16 tensor engines, optimized for sub-millisecond real-time AI-native PHY tasks.
In video recognition and GNN applications, TensorPool leverages higher-order statistical aggregation and symmetric CP decomposition to mitigate burstiness and achieve permutation invariance.

In recent arXiv literature, TensorPool denotes distinct tensor-centric research constructs rather than a single unified method: a 3D-stacked many-core domain-specific processor for AI-native radio access networks (Bertuletti et al., 2 Apr 2026), a high-order tensor pooling pipeline with attention for action recognition (Wang et al., 2021), and, in graph representation learning, a CP-based permutation-invariant pooling layer used in tensorized graph neural networks (Hua et al., 2022). Across these usages, the shared motif is the exploitation of tensor structure—either as a computational primitive in hardware or as a statistical/multilinear aggregation mechanism in learning systems.

1. Terminological scope

A common source of confusion is that the same label is used for materially different objects. In the cited literature, TensorPool refers to a processor architecture in one line of work and to pooling operators in two others.

Usage	Domain	Core object
TensorPool	AI-native RAN	256-core, 16-TE processor cluster
TensorPool	Action recognition	High-order tensor descriptor with attention and EPN
TensorPool (the CP layer)	GNNs	Symmetric CP-based permutation-invariant pooling

The hardware TensorPool is introduced in "TensorPool: A 3D-Stacked 8.4TFLOPS/4.3W Many-Core Domain-Specific Processor for AI-Native Radio Access Networks" (Bertuletti et al., 2 Apr 2026). The vision paper on video recognition is "High-order Tensor Pooling with Attention for Action Recognition" (Wang et al., 2021). The GNN usage appears as a self-contained description of TensorPool (the “CP layer”) extracted from "High-Order Pooling for Graph Neural Networks with Tensor Decomposition" (Hua et al., 2022).

This terminological overlap suggests that the name has been used to denote tensor-oriented aggregation or execution mechanisms rather than a single research lineage.

2. TensorPool as an AI-native RAN processor

In the AI-native PHY setting for 6G RAN, the motivating constraints are explicit: deeply optimized AI-native PHY models impose higher computational complexity than conventional baseband processing, deployment is subject to sub-millisecond real-time constraints, and densified 6G cell sites leave only a limited power budget for compute ( $\leq 100$ W ) (Bertuletti et al., 2 Apr 2026). The hardware TensorPool addresses this by specializing many-core programmable baseband processors for this domain.

The cluster is organized as 64 Tiles, each Tile containing 4 RISC-V (RV32IMAF) cores (Processing Elements, PEs), each PE with a 32-bit FPU capable of two FP16-MACs per cycle, plus a shared FP32 div-sqrt unit. Each Tile also contains 32 local single-cycle SRAM banks of 2 KiB each for data, and a 4 KiB instruction cache. Four Tiles form a SubGroup; four SubGroups form a Group; four Groups compose the full Pool. Connectivity is hierarchical: crossbars connect Tiles → SubGroups → Groups, and PE-to-L1 latency is 3 cycles (same SubGroup), 5 cycles (same Group), 9 cycles (remote Group) (Bertuletti et al., 2 Apr 2026).

Acceleration is provided by 16 FP16 “RedMulE” Tensor Engines (TEs), one per SubGroup, each with 256 FMAs (32 rows × 8 cols). Because each TE can sustain one FP16-MAC per FMA per cycle, the TE array reaches a peak of 4096 MACs/cycle. Internal buffering is specialized for tensor traffic: each TE has three 512-bit data buffers for X, W, Y / Z tiles, and a custom streamer with 16-entry Reorder Buffers (ROBs) per stream and a 32-entry Z FIFO to hide L1 latency and tolerate out-of-order responses. The design also performs burst-grouping at request time and burst-distribution at response time to regenerate 512-bit bursts across the Tile arbiter and avoid serialization at its 7 narrow-request/cycle limit (Bertuletti et al., 2 Apr 2026).

The memory hierarchy centers on 4 MiB of multi-banked L1 scratchpad (2048 banks × 2 KiB). Data reuse is increased via double-buffering: while one half of a tile sits in the TE buffer being computed, the other half is refilled by DMA or by PEs. A reported roofline analysis on an $n \times n \times n$ GEMM shows that for $n \ge 512$ the sustained compute is limited by TE compute rate, not by L2 or L1 bandwidth. The same analysis states that across Tiles, using $K=4$ outstanding bursts, the effective per-TE bandwidth still guarantees that the TE is not memory-bound even for random 512-bit accesses across the 2048 banks (Bertuletti et al., 2 Apr 2026).
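The hierarchy figures quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that only multiplies out the counts stated in the text; the variable names are illustrative and not taken from the paper.

```python
# Hierarchy parameters exactly as quoted in the text above.
PES_PER_TILE = 4
TILES_PER_SUBGROUP = 4
SUBGROUPS_PER_GROUP = 4
GROUPS_PER_POOL = 4
BANKS_PER_TILE = 32
BANK_KIB = 2
TE_ROWS, TE_COLS = 32, 8  # FMA array inside one Tensor Engine

subgroups = SUBGROUPS_PER_GROUP * GROUPS_PER_POOL
tiles = TILES_PER_SUBGROUP * subgroups
pes = PES_PER_TILE * tiles
banks = BANKS_PER_TILE * tiles
l1_mib = banks * BANK_KIB / 1024
tensor_engines = subgroups  # one TE per SubGroup
fmas_per_te = TE_ROWS * TE_COLS
te_macs_per_cycle = tensor_engines * fmas_per_te

print(f"{tiles} Tiles, {pes} PEs, {tensor_engines} TEs")
print(f"{banks} banks -> {l1_mib:g} MiB of L1")
print(f"TE peak: {te_macs_per_cycle} MACs/cycle")
```

The derived values (64 Tiles, 256 PEs, 16 TEs, 2048 banks, 4 MiB, 4096 MACs/cycle) match the figures reported in the text.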

3. Throughput, energy efficiency, and 3D integration

The reported peak compute is partitioned between tensor engines and programmable cores: $\pi_{\mathrm{TEs}} = 16 \times 256 = 4096\ \mathrm{MACs/cycle},$

$\pi_{\mathrm{PEs}} = 256 \times 2 = 512\ \mathrm{MACs/cycle},$

for a total of 4608 MACs/cycle. At $f=0.9$ GHz, this corresponds to 4.15 trillion MAC/s or 8.4 TFLOPS (FP16 FMA). On large GEMMs, the measured sustained throughput is 3643 MACs/cycle, i.e. $0.89 \times$ the 4096 MACs/cycle TE peak ( $\approx89\%$ TE utilization) (Bertuletti et al., 2 Apr 2026).
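As a worked check of the peak-compute arithmetic, the sketch below recomputes the totals from the quoted per-unit rates, counting one MAC as two FLOPs. Treating the 0.9 GHz clock as a rounded value is an assumption made here, not a statement from the paper: at exactly 0.9 GHz the product comes out near 8.3 TFLOPS, and the headline 8.4 TFLOPS would correspond to a clock of roughly 0.91 GHz.

```python
TE_MACS = 16 * 256      # tensor engines: 16 TEs x 256 FMAs
PE_MACS = 256 * 2       # programmable cores: 256 PEs x 2 FP16 MACs
FREQ_HZ = 0.9e9         # clock as quoted (possibly rounded)
SUSTAINED_MACS = 3643   # measured on large GEMMs

total_macs = TE_MACS + PE_MACS
peak_tmacs = total_macs * FREQ_HZ / 1e12
peak_tflops = 2 * peak_tmacs          # 1 MAC = 2 FLOPs
util_vs_te = SUSTAINED_MACS / TE_MACS
util_vs_total = SUSTAINED_MACS / total_macs
# Clock that would reproduce the headline 8.4 TFLOPS exactly.
implied_clock_ghz = 8.4e12 / (2 * total_macs) / 1e9

print(f"total peak: {total_macs} MACs/cycle")
print(f"peak: {peak_tmacs:.2f} TMAC/s, {peak_tflops:.2f} TFLOPS")
print(f"utilization vs TE peak: {util_vs_te:.2f}")
print(f"utilization vs TE+PE peak: {util_vs_total:.2f}")
print(f"clock implied by 8.4 TFLOPS: {implied_clock_ghz:.3f} GHz")
```

The 0.89 figure is reproduced only when the sustained rate is divided by the TE peak of 4096 MACs/cycle; against the combined 4608 MACs/cycle the ratio is about 0.79.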

The implementation is reported from PnR in TSMC-N7 at 0.75 V/TT/25 °C, with total Pool power 4.3 W. The power breakdown is 63.7 % for TE FMAs, 11 % for streamers plus buffers, 7 % for SRAM macros, and 18.3 % for interconnect plus overheads. The stated efficiency is

$\mathrm{TFLOPS/W} = \frac{8.4}{4.3}\approx1.95, \qquad \mathrm{GFLOPS/W/mm^2} = 57.53.$

The 2D area is 26.6 mm², corresponding to 0.25 TFLOPS/mm² (FP16) (Bertuletti et al., 2 Apr 2026).

Relative to a PE-only cluster called TeraPool, the reported gains are substantial: GEMM throughput 3643 vs. 609 MACs/cycle → 6×, TFLOPS/W 1.95 vs. 0.17 → 11.5×, TFLOPS/mm² 0.25 vs. 0.07 → 3.6×, and GFLOPS/W/mm² 57.53 vs. 6.24 → 9.1×. For AI-native PHY kernels, the PEs execute CFFT, LS-CHE, MIMO-MMSE in <0.15 ms at 0.9 GHz, despite 8×8 MIMO and 8192 REs, hence well within the 1 ms TTI. With TEs + PEs + DMA overlap on three key blocks—512×512 FC+softmax, 3×3 depthwise-sep conv+LN+ReLU, and 4-head MHA—the runtime reduction from sequential to concurrent execution is 16 % (FC), 25 % (Conv), and 1.3 % (MHA), with average TE utilization 67 %, 37 %, and 64 %, respectively (Bertuletti et al., 2 Apr 2026).
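The efficiency figures and the TeraPool comparison can be recomputed from the quoted numbers. This is a hedged sketch: the observation that the 0.25 TFLOPS/mm² and 57.53 GFLOPS/W/mm² values are close to what the sustained GEMM rate gives (rather than the 8.4 TFLOPS peak) is an inference from the arithmetic, not something the source states. Ratios recomputed from rounded inputs can also differ from the quoted gains in the last digit.

```python
PEAK_TFLOPS = 8.4
POWER_W = 4.3
AREA_2D_MM2 = 26.6
SUSTAINED_MACS = 3643
FREQ_GHZ = 0.9

tflops_per_w = PEAK_TFLOPS / POWER_W
# Sustained GEMM rate in TFLOPS (1 MAC = 2 FLOPs).
sustained_tflops = 2 * SUSTAINED_MACS * FREQ_GHZ / 1e3
sustained_density = sustained_tflops / AREA_2D_MM2
sustained_gflops_w_mm2 = sustained_tflops * 1e3 / POWER_W / AREA_2D_MM2

# TensorPool vs. TeraPool, using the rounded values quoted in the text.
ratios = {
    "GEMM MACs/cycle": 3643 / 609,
    "TFLOPS/W": 1.95 / 0.17,
    "TFLOPS/mm2": 0.25 / 0.07,
    "GFLOPS/W/mm2": 57.53 / 6.24,
}

print(f"peak TFLOPS/W: {tflops_per_w:.2f}")
print(f"sustained TFLOPS/mm2: {sustained_density:.3f}")
print(f"sustained GFLOPS/W/mm2: {sustained_gflops_w_mm2:.1f}")
for name, value in ratios.items():
    print(f"{name}: {value:.2f}x")
```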

A central architectural claim is that physical integration constrains scaling. In 2D, routing channels between the four Groups occupy 21 % of the full-Pool area (31 % of a Group). The proposed response is wafer-to-wafer hybrid bonding (4.5 µm pitch, ~1 Ω, 1 fF per bond), placing two Groups on each die. As a result, vertical inter-Group wires replace large planar channels, reducing channel area by up to 67 %, and the full Pool footprint shrinks superlinearly by 2.32× (from 26.6 mm² to 11.47 mm²), with no frequency loss because the worst-case Group-to-Group path stays below 120 ps (Bertuletti et al., 2 Apr 2026).

Metric	2D TensorPool	3D TensorPool
Pool Footprint [mm²]	26.6	11.47
Operating freq. [GHz]	0.9	0.9
TFLOPS/mm²	0.32	0.73
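The table entries follow from the peak throughput and the two footprints. The short sketch below assumes the density rows use the 8.4 TFLOPS peak figure, which is what reproduces the tabulated values.

```python
PEAK_TFLOPS = 8.4
AREA_2D_MM2 = 26.6
AREA_3D_MM2 = 11.47

footprint_shrink = AREA_2D_MM2 / AREA_3D_MM2
density_2d = PEAK_TFLOPS / AREA_2D_MM2
density_3d = PEAK_TFLOPS / AREA_3D_MM2

print(f"footprint shrink: {footprint_shrink:.2f}x")
print(f"2D density: {density_2d:.2f} TFLOPS/mm2")
print(f"3D density: {density_3d:.2f} TFLOPS/mm2")
```

A shrink above 2× with only two stacked dies is what the text calls superlinear: the gain comes from removing planar routing channels, not only from splitting the logic across two layers.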

The paper concludes that this TensorPool delivers 8.4 TFLOPS@FP16 at 4.3 W, meets sub-ms PHY real-time constraints, and stays within a 100 W RAN edge budget. It also states that 3D stacking demonstrates a path to scaling more TEs and L1 memory, and identifies future work: integrate larger L1 (via logic-on-logic 3D), investigate on-chip L2, and support mixed-precision or sparsity for further efficiency gains (Bertuletti et al., 2 Apr 2026).

4. TensorPool as high-order pooling for action recognition

In video understanding, TensorPool denotes a descriptor construction that replaces or extends linear pooling with second- and higher-order statistics. Given feature vectors $\phi_1, \dots, \phi_N \in \mathbb{R}^d$, second-order pooling forms the non-centered covariance or auto-correlation matrix

$M = \frac{1}{N}\sum_{n=1}^{N} \phi_n \phi_n^{\top}$

or, with centering,

$\Sigma = \frac{1}{N}\sum_{n=1}^{N} (\phi_n - \mu)(\phi_n - \mu)^{\top}, \qquad \mu = \frac{1}{N}\sum_{n=1}^{N} \phi_n.$

More generally, $r$th-order tensor pooling aggregates outer products: $\mathcal{X} = \frac{1}{N}\sum_{n=1}^{N} \phi_n^{\otimes r}$ with entries

$\mathcal{X}_{i_1 \dots i_r} = \frac{1}{N}\sum_{n=1}^{N} \phi_{n,i_1} \cdots \phi_{n,i_r}.$

By construction, $\mathcal{X}$ is super-symmetric (Wang et al., 2021).
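The pooling definitions can be illustrated with a small NumPy sketch. It assumes the usual $1/N$-normalized averages of outer products over $N$ feature vectors of dimension $d$; the function names and the random toy data are hypothetical and not taken from the paper.

```python
import numpy as np

def second_order_pool(phi, center=False):
    """Average of outer products over the rows of phi, shape (N, d) -> (d, d)."""
    if center:
        phi = phi - phi.mean(axis=0, keepdims=True)
    return phi.T @ phi / phi.shape[0]

def third_order_pool(phi):
    """Average of third-order outer products, shape (N, d) -> (d, d, d)."""
    return np.einsum("ni,nj,nk->ijk", phi, phi, phi) / phi.shape[0]

rng = np.random.default_rng(0)
phi = rng.standard_normal((5, 4))  # toy data: N=5 vectors, d=4

M = second_order_pool(phi)
Sigma = second_order_pool(phi, center=True)
X3 = third_order_pool(phi)

print(M.shape, X3.shape)
# Super-symmetry: the descriptor is unchanged by any permutation of its modes.
print(np.allclose(X3, X3.transpose(1, 0, 2)), np.allclose(X3, X3.transpose(2, 1, 0)))
```

The descriptor size grows as $d^r$, which is why the dimensionality of the pooled features matters in practice.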

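The paper frames burstiness as a central issue for tensor descriptors built from low numbers of aggregated vectors. It relates the Heat Diffusion Process (HDP) on a graph Laplacian to Eigenvalue Power Normalization (EPN) of the covariance or autocorrelation matrix. With affinity $M$, the loopy graph Laplacian is, in the standard construction,

$L = D - M, \qquad D_{ii} = \sum_{j} M_{ij},$

and the heat diffusion process satisfies

$\frac{\partial v(t)}{\partial t} = -L\, v(t), \qquad v(t) = \exp(-tL)\, v(0).$

A discrete power approximation is

$\exp(-tL) \approx \left(I - \tfrac{t}{q} L\right)^{q}.$

For higher-order tensors, the method applies HOSVD,

$\mathcal{X} = \mathcal{S} \times_1 U_1 \times_2 U_2 \cdots \times_r U_r,$

followed by spectrum-wise power normalization. Two stated nonlinearities are MaxExp

$g_{\mathrm{MaxExp}}(\lambda; \eta) = 1 - (1 - \lambda)^{\eta}$

and Gamma

$g_{\mathrm{Gamma}}(\lambda; \gamma) = \lambda^{\gamma},$

leading to

$\hat{\mathcal{X}} = g(\mathcal{S}) \times_1 U_1 \times_2 U_2 \cdots \times_r U_r.$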

The paper states that HDP and EPN play the same role, namely to boost or dampen the magnitude of the eigenspectrum thus preventing the burstiness, and that EPN acts as a spectral detector of higher-order occurrences (Wang et al., 2021).
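For the second-order (matrix) case, eigenvalue power normalization can be sketched directly with an eigendecomposition. The sketch below makes several assumptions that should be read as such: it uses the commonly cited forms MaxExp $1-(1-\lambda)^{\eta}$ and Gamma $\lambda^{\gamma}$, it trace-normalizes the input so that eigenvalues lie in $[0, 1]$, and the function name and parameter defaults are hypothetical. It does not cover the HOSVD-based higher-order case.

```python
import numpy as np

def epn_matrix(M, kind="maxexp", eta=5.0, gamma=0.5):
    """Apply a power-normalizing function to the eigenvalues of a symmetric PSD matrix."""
    M = M / np.trace(M)               # assumption: trace-normalize so eigenvalues are in [0, 1]
    lam, U = np.linalg.eigh(M)        # eigenvalues in ascending order
    lam = np.clip(lam, 0.0, 1.0)      # guard against tiny negative values from round-off
    if kind == "maxexp":
        lam_hat = 1.0 - (1.0 - lam) ** eta
    elif kind == "gamma":
        lam_hat = lam ** gamma
    else:
        raise ValueError(f"unknown kind: {kind}")
    return (U * lam_hat) @ U.T        # U diag(lam_hat) U^T

# A "bursty" spectrum: one direction dominates the descriptor.
M_bursty = np.diag([0.9, 0.09, 0.01])
out_maxexp = epn_matrix(M_bursty, "maxexp", eta=5.0)
out_gamma = epn_matrix(M_bursty, "gamma", gamma=0.5)

ev_in = np.linalg.eigvalsh(M_bursty)
ev_out = np.linalg.eigvalsh(out_maxexp)
print("top-2 eigenvalue ratio before:", ev_in[-1] / ev_in[-2])
print("top-2 eigenvalue ratio after :", ev_out[-1] / ev_out[-2])
```

Both functions are concave on $[0, 1]$ and pass through the origin, so they lift small eigenvalues relative to large ones; this flattening of the spectrum is the sense in which EPN counteracts burstiness.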

The end-to-end pipeline is attention-augmented. For each video subsequence, two pretrained I3D streams (RGB/Flow) produce 400-d features and two end-to-end trainable I3D streams produce two further feature vectors. These four vectors are Count-Sketch projected to a common length $d'$, concatenated into a single vector $\psi$, and reweighted by an attention MLP.

The design note states that the sketching keeps the descriptor dimension small, and the HOSVD/EPN stage rescales the spectrum of the mode-unfoldings to remove burstiness (Wang et al., 2021).
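The sketch-and-concatenate step can be illustrated with a textbook Count-Sketch: each input coordinate is hashed to one output bucket and multiplied by a random sign. This is a generic version of the technique, not the paper's implementation; the trainable-stream dimension (256), the sketch length (64), and the seeds are hypothetical, and the attention reweighting is omitted.

```python
import numpy as np

def count_sketch(x, out_dim, seed):
    """Project x to out_dim via random bucket assignment and random signs."""
    rng = np.random.default_rng(seed)
    buckets = rng.integers(0, out_dim, size=x.shape[0])         # hash h(i)
    signs = rng.choice(np.array([-1.0, 1.0]), size=x.shape[0])  # sign s(i)
    y = np.zeros(out_dim)
    np.add.at(y, buckets, signs * x)  # accumulate collisions in the same bucket
    return y

rng = np.random.default_rng(0)
# Two 400-d pretrained streams (as in the text) and two hypothetical 256-d streams.
streams = [rng.standard_normal(n) for n in (400, 400, 256, 256)]
d_sketch = 64  # hypothetical sketch length

psi = np.concatenate(
    [count_sketch(x, d_sketch, seed=i) for i, x in enumerate(streams)]
)
print(psi.shape)
```

Keeping the sketch length small matters because an order-$r$ descriptor built on the concatenated vector has a number of entries that grows as its length raised to the power $r$.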

Empirically, the paper reports state-of-the-art results on HMDB-51, YUP++, and MPII Cooking Activities. On HMDB-51, second-order (SO) pooling with MaxExp reaches 80.3%, third-order (TO) pooling with MaxExp reaches 81.1%, and with +IDT fusion the numbers are 85.7% and 87.2%, with TensorPool (third-order) +IDT reported as a new SOTA. On YUP++, TO+MaxExp+IDT reaches 93.1%. On MPII Cooking Activities, TO+MaxExp+IDT reaches 80.4%, compared with 77.3% for SO+MaxExp+IDT (Wang et al., 2021).

5. TensorPool as a CP layer in graph neural networks

In graph learning, TensorPool is described as the CP layer: a permutation-invariant multilinear map parameterized through symmetric CANDECOMP/PARAFAC decomposition. Given a set of $k$ feature vectors $x_1, \dots, x_k \in \mathbb{R}^{F}$, the goal is a symmetric map

$f : (\mathbb{R}^{F})^{k} \to \mathbb{R}^{d}$

that captures multiplicative interactions up to order $k$. The underlying partially symmetric tensor

$\mathcal{T} \in \mathbb{R}^{(F+1) \times \cdots \times (F+1) \times d}$

is written as

$\mathcal{T} = \sum_{r=1}^{R} \underbrace{w_r \circ \cdots \circ w_r}_{k\ \text{times}} \circ\, m_r,$

where $W = [w_1, \dots, w_R] \in \mathbb{R}^{(F+1) \times R}$, $M = [m_1, \dots, m_R] \in \mathbb{R}^{d \times R}$, and $R$ is the CP rank. Defining

$\tilde{x}_i = [x_i^{\top}, 1]^{\top} \in \mathbb{R}^{F+1},$

the contraction becomes

$f(x_1, \dots, x_k) = \mathcal{T} \times_1 \tilde{x}_1 \cdots \times_k \tilde{x}_k = M\left( (W^{\top}\tilde{x}_1) \odot \cdots \odot (W^{\top}\tilde{x}_k) \right).$

Because the tensor is symmetric in the first $k$ modes, the map is permutation-invariant, and because $R$ is chosen independently of $k$, the number of parameters does not grow with neighborhood size (Hua et al., 2022).
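A CP-style pooling map of this kind can be sketched in a few lines of NumPy. The sketch assumes a shared projection matrix `W` applied to each input (with a constant 1 appended, an assumption made here so that lower-order terms are included), an elementwise product across the inputs, and an output matrix `M`; all names, shapes, and the random data are hypothetical. It also checks numerically that the factored form equals contracting the dense partially symmetric tensor.

```python
import numpy as np

def cp_pool(X, W, M):
    """X: (k, F) inputs, W: (F+1, R) shared projection, M: (d, R) output map."""
    X_h = np.hstack([X, np.ones((X.shape[0], 1))])  # append 1 to each input (assumption)
    Z = X_h @ W                                     # (k, R): one projection per input
    return M @ Z.prod(axis=0)                       # elementwise product over inputs, then M

rng = np.random.default_rng(0)
k, F, R, d = 3, 4, 6, 5  # toy sizes
X = rng.standard_normal((k, F))
W = rng.standard_normal((F + 1, R))
M = rng.standard_normal((d, R))

out = cp_pool(X, W, M)
out_perm = cp_pool(X[[2, 0, 1]], W, M)  # same inputs, different order

# Dense check for k = 3: build the tensor explicitly and contract it.
T = np.einsum("ar,br,cr,or->abco", W, W, W, M)
X_h = np.hstack([X, np.ones((k, 1))])
dense = np.einsum("abco,a,b,c->o", T, X_h[0], X_h[1], X_h[2])

print(np.allclose(out, out_perm), np.allclose(out, dense))
```

The factored form never materializes the tensor, whose size grows as $(F+1)^k d$, while the parameters stay at $(F+1)R + dR$ regardless of how many inputs are pooled.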

The CP layer replaces both aggregation and update in message passing. For node $v$, one collects the neighborhood $\mathcal{N}(v)$, projects each $\tilde{h}_u = [h_u^{\top}, 1]^{\top}$ by $W^{\top}$, multiplies the resulting $R$-dimensional vectors elementwise, and applies $M$. The reported update is

$h_v' = \sigma\left( M \bigodot_{u \in \mathcal{N}(v)} W^{\top}\tilde{h}_u \;+\; W_s \sum_{u \in \mathcal{N}(v)} h_u \right),$

where the optional $W_s \sum_{u \in \mathcal{N}(v)} h_u$ term reintroduces a low-order sum channel to stabilize training (Hua et al., 2022).

The paper states three expressiveness results. Theorem 1 gives universality for multilinear polynomials: any permutation-invariant multilinear polynomial of degree $k$ can be implemented as a linear-activation CP layer of some rank $R$. Theorem 2 states that a CP layer of rank $F \cdot k$ suffices to recover ordinary sum or mean of $k$ vectors in $\mathbb{R}^{F}$. Theorem 3 states that, if the CP parameters are chosen at random with respect to a continuous distribution, then with probability 1 the resulting CP layer cannot be represented as sum-pooling plus pointwise activations. The summary conclusion is that CP pooling strictly subsumes sum/mean/max (Hua et al., 2022).

The complexity statement is explicit. For neighborhood size $k$, input dimension $F$, output dimension $d$, and rank $R$, a standard sum/mean/max plus linear update has parameter count $O(Fd)$ and per-node time $O(kF + Fd)$. CP pooling has parameter count $O(R(F + d))$ and per-node time $O(R(kF + d))$. The paper notes that for the ranks used in practice the overhead is controlled (Hua et al., 2022).
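To make the cost comparison concrete, the sketch below counts parameters and multiply-accumulate style operations for a sum-plus-linear update and for a CP-style layer with a shared projection and an appended constant input. The counting conventions and the example sizes are assumptions chosen for illustration, not figures from the paper.

```python
def sum_linear_cost(k, F, d):
    """Sum k vectors of size F, then apply one (d x F) linear map."""
    params = F * d
    ops = k * F + F * d
    return params, ops

def cp_cost(k, F, d, R):
    """Project k inputs with a shared (F+1) x R matrix, multiply elementwise, map to d."""
    params = (F + 1) * R + d * R
    ops = k * (F + 1) * R + k * R + d * R
    return params, ops

# Hypothetical sizes: 10 neighbors, 64-d features, rank 64.
k, F, d, R = 10, 64, 64, 64
sum_params, sum_ops = sum_linear_cost(k, F, d)
cp_params, cp_ops = cp_cost(k, F, d, R)

print("sum + linear:", sum_params, "params,", sum_ops, "ops")
print("CP layer    :", cp_params, "params,", cp_ops, "ops")
# Parameter count is independent of the neighborhood size k.
print(cp_cost(100, F, d, R)[0] == cp_params)
```

Under these conventions the CP layer roughly doubles the parameter count when $R$ is comparable to $F$ and $d$, while its per-node time scales with $R$ times the cost of reading the neighborhood.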

Reported experiments cover OGB node and graph classification. On PRODUCTS, the tensorized GNN (tGNN) reaches 81.79 ± 0.54 %; on ARXIV, 75.38 ± 0.15 %; on PROTEINS, 82.55 ± 0.49 %. For graph tasks, the reported numbers include MolHIV AUC = 0.799 ± 0.02, ZINC MAE = 0.301 ± 0.008, MNIST Acc = 0.965 ± 0.002, and CIFAR10 Acc = 0.684 ± 0.006. The ablation summary states that CP pooling alone beats linear sum pooling by ~2–4 % in node accuracy and ~3 % in ZINC MAE, while combining CP + a residual sum-channel gives the best results (Hua et al., 2022).

6. Conceptual relations and distinctions

The three usages of TensorPool share a tensor emphasis but operate at different levels of abstraction. The AI-native RAN TensorPool is a domain-specialized many-core processor with 256 PEs + 16 TEs, intended for on-premises AI-RAN inference (channel estimation, beamforming, MIMO detection, etc.) and designed around high-throughput tensor computations dominating AI-Native PHYs (Bertuletti et al., 2 Apr 2026). The action-recognition TensorPool is a descriptor construction that aggregates second- and higher-order statistics and applies attention and EPN to mitigate burstiness (Wang et al., 2021). The GNN TensorPool is a rank-controlled multilinear pooling layer that uses symmetric CP decomposition to model high-order non-linear node interactions while remaining permutation-invariant (Hua et al., 2022).

A common misconception is that these are variants of the same algorithm. The cited literature does not support that interpretation. Instead, the shared name spans three separate technical objects: a processor, a video descriptor pipeline, and a GNN layer. Their commonality lies in using tensor structure as the central design axis—compute specialization in one case, high-order statistical pooling in another, and multilinear invariant aggregation in the third.

Taken together, these works show that “TensorPool” has been attached to tensor-centric solutions for three different bottlenecks: sub-ms, power-constrained AI-native PHY execution; burstiness-aware high-order video representation; and expressive permutation-invariant graph aggregation. This suggests that the name functions more as a descriptor of tensor-oriented pooling or execution than as a single canonical method.