
Efficient GPU Feature Interaction

Updated 2 December 2025
  • The paper introduces efficient GPU-based feature interaction by leveraging both implicit (IPNN) and explicit (HitMatch) methods to enhance large-scale retrieval.
  • It integrates compressed inverted-list representations and dual-tower architectures to effectively manage high-throughput and memory constraints.
  • Empirical results demonstrate substantial improvements in query speed and ranking quality, validated in real-world ad retrieval systems.

Efficient GPU-based feature interaction refers to algorithmic and systems-level strategies that leverage high-throughput graphics processing units to compute both implicit and explicit cross-feature contributions in large-scale retrieval or registration problems. This challenge is especially salient in applications where high model expressivity must be balanced against extreme throughput and vast candidate spaces, such as embedding-based ad retrieval and real-time 3D mapping. The field integrates compressed indexing, data-parallel computation, and memory-efficient representations to enable explicit feature interactions at scale and under stringent latency constraints.

1. Motivation and Architectural Background

In large-scale recommendation and retrieval systems, particularly those handling advertising or sponsorship selection, embedding-based retrieval (EBR) techniques have become prevalent due to their compatibility with maximum inner product search (MIPS) and approximate nearest neighbor (ANN) solvers. The foundational dual-tower architecture encodes user features $u$ and candidate item (e.g., ad) features $a$ independently via deep models, resulting in $\mathbf{h}_u = f_{\rm user}(u)\in\mathbb{R}^d$ and $\mathbf{h}_a = f_{\rm ad}(a)\in\mathbb{R}^d$, with all candidate scoring at query time reduced to $s(u,a) = \mathbf{h}_u^\top\mathbf{h}_a$ (Lei et al., 27 Nov 2025).

This approach is highly efficient but limited in its capacity for feature interaction: only a single (late) inner product aggregates user and item information. The absence of early or explicit feature conjunctions constrains recall and ranking quality, especially compared with wide-and-deep models, which are standard in the ranking phase but too computationally expensive for retrieval.

Efficient GPU-based feature interaction methods therefore seek to bridge the gap between expressivity (as in ranking models) and tractability (as in retrieval tower architectures).

2. Explicit and Implicit Feature Interaction: Mathematical Formulations

State-of-the-art frameworks address the interaction bottleneck by including both implicit and explicit mechanisms within the GPU execution graph:

  • Dual-Tower Baseline:

$s_{\rm DT}(u,a)=\langle \mathbf{h}_u,\mathbf{h}_a\rangle$

Represents pure embedding-space dot-product scoring (late interaction).
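As a minimal sketch of this late-interaction scoring (all shapes, sizes, and values here are illustrative, not taken from the paper):

```python
import numpy as np

d = 64             # embedding dimension (illustrative)
num_ads = 10_000   # candidate pool size (illustrative)

rng = np.random.default_rng(0)
h_u = rng.standard_normal(d)             # user tower output f_user(u)
H_a = rng.standard_normal((num_ads, d))  # precomputed ad tower outputs f_ad(a)

# Late interaction: one inner product per candidate, O(d) each;
# in production this scan is delegated to a MIPS/ANN solver.
scores = H_a @ h_u                       # shape (num_ads,)
top_10 = np.argsort(-scores)[:10]
```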

  • Implicit Interaction (IPNN module):

Raw user and ad features are projected into a joint space, so interactions are captured as:

$\tilde{\bm u} = W^{(u)}\bm u,\quad \tilde{\bm v} = W^{(v)}\bm v,\quad s_{\rm IPNN} = \bigl[\mathbf{h}_u, \tilde{\bm u}\bigr]^\top \bigl[\mathbf{h}_a, \tilde{\bm v}\bigr]$

This augments the expressive capacity without enumerating full cross-feature sets.
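A small sketch of the IPNN term under assumed dimensions (the projection shapes and variable names are illustrative): concatenating the projected raw features onto the tower embeddings keeps scoring a single inner product, so the term remains MIPS-compatible.

```python
import numpy as np

rng = np.random.default_rng(1)
d, du, dv, dp = 64, 32, 32, 16   # illustrative dimensions

u_raw = rng.standard_normal(du)  # raw user features u
v_raw = rng.standard_normal(dv)  # raw ad features v
h_u = rng.standard_normal(d)     # user tower embedding
h_a = rng.standard_normal(d)     # ad tower embedding

W_u = rng.standard_normal((dp, du))  # learned projection W^(u)
W_v = rng.standard_normal((dp, dv))  # learned projection W^(v)

u_tilde = W_u @ u_raw
v_tilde = W_v @ v_raw

# One inner product over the concatenated vectors.
s_ipnn = np.concatenate([h_u, u_tilde]) @ np.concatenate([h_a, v_tilde])

# Equivalent decomposition: tower dot product plus projected dot product.
assert np.isclose(s_ipnn, h_u @ h_a + u_tilde @ v_tilde)
```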

  • Explicit Feature Interaction (HitMatch operator):

Models “wide” terms directly via precomputed sparse cross-features:

$s_{\rm HitMatch}(u,a) = \sum_{i=1}^M w_i\,x_i\,L_{a,i}$

where $L_{a,i}$ is a binary indicator of ad $a$'s participation in cross-feature $i$, $w_i$ are learned weights, and $x_i$ are user-specific activations.
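Because each column of $L$ is sparse, the sum can be computed by scattering posting lists rather than materializing $L$. A toy sketch (sizes and random data are illustrative; `np.add.at` stands in for a GPU atomic add):

```python
import numpy as np

M, num_ads = 5, 8   # cross-features and ads (toy sizes)
rng = np.random.default_rng(2)

# Inverted lists: for each cross-feature i, the ads a with L[a, i] = 1.
posting_lists = [rng.choice(num_ads, size=3, replace=False) for _ in range(M)]
w = rng.standard_normal(M)   # learned weights w_i
x = rng.random(M)            # user-specific activations x_i

# Accumulate s_HitMatch(u, a) = sum_i w_i * x_i * L[a, i]
# by scattering each posting list into a per-ad score buffer.
scores = np.zeros(num_ads)
for i, ads in enumerate(posting_lists):
    np.add.at(scores, ads, w[i] * x[i])   # stand-in for a GPU AtomicAdd

# Dense reference: materialize L and check the same result.
L = np.zeros((num_ads, M))
for i, ads in enumerate(posting_lists):
    L[ads, i] = 1.0
assert np.allclose(scores, L @ (w * x))
```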

  • Unified Scoring:

$s(u,a) = s_{\rm DT}(u,a) + s_{\rm IPNN}(u,a) + s_{\rm HitMatch}(u,a)$

The mathematical unification allows combining the strong modeling power of wide × deep models within the constraints of retrieval-stage compute (Lei et al., 27 Nov 2025).
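The unified score is a simple additive combination, which keeps retrieval-stage top-K selection straightforward. A sketch with placeholder component scores (in practice $s_{\rm DT}$ and $s_{\rm IPNN}$ come from a MIPS/ANN scan and $s_{\rm HitMatch}$ from the inverted-list kernel; the random values here are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
num_ads, k = 1000, 10

# Placeholder per-candidate component scores.
s_dt = rng.standard_normal(num_ads)
s_ipnn = rng.standard_normal(num_ads)
s_hitmatch = np.zeros(num_ads)           # sparse: most ads hit no cross-feature
hit = rng.choice(num_ads, 50, replace=False)
s_hitmatch[hit] = rng.random(50)

s = s_dt + s_ipnn + s_hitmatch           # unified additive score
top_k = np.argpartition(-s, k)[:k]       # retrieval-stage top-K selection
top_k = top_k[np.argsort(-s[top_k])]     # sort the selected K by score
```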

3. Compressed Inverted-List Representation and GPU Scheduling

To render explicit feature interaction feasible at retrieval speeds, explicit cross-feature matrices ($L$) are encoded using a GPU-optimized, compressed inverted-list structure:

  • Block grouping: Each posting list (ads sharing a cross-feature) is split by high-order bits of the ad index, yielding uniform computational blocks.
  • Logarithmic categorization: Posting lists are grouped and padded to powers-of-two for alignment and load balancing.
  • Block compression: Each block’s header (24 bits holding the shared high bits of the ad index) is stored separately; the lowest 8 bits of each ad index provide compact within-block location information.
  • Struct-of-arrays (SoA) memory layout: Feature IDs, headers, and ad indices are stored in separate, contiguous arrays, enabling highly coalesced GPU accesses.

At query time, a merge-based scheduler assigns blocks to GPU threads so that all candidate ads matching an explicit cross-feature receive a linear update ($\texttt{AtomicAdd}$ on a pre-allocated result buffer). This design eliminates the dynamic allocation, irregular access, and synchronization bottlenecks evident in baseline sparse GEMV approaches (cuSPARSE, etc.).
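The block grouping and struct-of-arrays layout can be sketched in Python as follows. This is a CPU simulation under assumed details (padding and the merge scheduler are omitted; function names and sizes are illustrative, and `np.add.at` again stands in for the GPU atomic add):

```python
import numpy as np

def build_blocks(posting_lists, block_bits=8):
    """Split each posting list by the high-order bits of the ad index.

    Returns a struct-of-arrays layout: feature IDs, block headers (shared
    high bits; 24-bit in the paper), block offsets, and a flat array of
    the low 8 bits of each ad index (8 bits per posting).
    """
    feat_ids, headers, offsets, low_bytes = [], [], [0], []
    for feat, ads in enumerate(posting_lists):
        ads = np.sort(np.asarray(ads))
        for high in np.unique(ads >> block_bits):
            block = ads[(ads >> block_bits) == high]
            feat_ids.append(feat)
            headers.append(int(high))
            low_bytes.extend(block & 0xFF)
            offsets.append(len(low_bytes))
    return (np.array(feat_ids), np.array(headers),
            np.array(offsets), np.array(low_bytes, dtype=np.uint8))

def score_query(blocks, w, x, num_ads, block_bits=8):
    """Each block applies one linear update to the result buffer."""
    feat_ids, headers, offsets, low_bytes = blocks
    scores = np.zeros(num_ads)
    for b in range(len(feat_ids)):
        lo = low_bytes[offsets[b]:offsets[b + 1]].astype(int)
        ads = (int(headers[b]) << block_bits) | lo   # reconstruct ad indices
        np.add.at(scores, ads, w[feat_ids[b]] * x[feat_ids[b]])  # AtomicAdd analogue
    return scores
```

On a GPU, each block maps to a thread (or warp) with coalesced reads from the contiguous low-byte array, which is what makes the layout bandwidth-friendly.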

4. Performance, Complexity, and Empirical Benchmarks

The proposed GPU-based feature interaction system achieves both maximal resource utilization and drastic latency reduction:

  • Computational Complexity:
    • Dual-tower: $O(d)$ per inner product.
    • Naive explicit interaction (sparse GEMV): $O(\sum_i |\{a : L_{a,i} = 1\}|)$ per query, hampered by GPU I/O.
    • Compressed inverted list: still $O(\sum_i n_i)$, where $n_i$ is the length of posting list $i$, with a constant-factor ~8× improvement via compression, coalescing, and block scheduling.
  • Empirical evaluations (Lei et al., 27 Nov 2025):
    • On NVIDIA T4, HitMatch operator executes in ≤500 μs/query, achieving 1,904 queries/sec (QPS), a 7× improvement over cuSPARSE GEMV (276 QPS).
    • Model-level retrieval: DT+IPNN+HitMatch yields GAUC 0.861, Recall@5_1 = 0.768, Recall@10_1 = 0.939, versus baseline dual-tower (GAUC 0.839, Recall@5_1 = 0.730, Recall@10_1 = 0.908).
    • Large-scale deployment at Tencent Advertising: online A/B test results include cost increases of 0.37%–1.25% but gross merchandise value (GMV) increases of 1.49%–1.58% and Recall@100_1 improvements of 1.8%–2.5%.

These results confirm that the combination of model and engineering enhancements enables ranking-stage expressivity at retrieval-stage speed in industry-scale production.

5. Integration and Deployment in Large-Scale Systems

The compressed inverted-list method introduces negligible preprocessing overhead (~200 ms), amortized over hundreds of thousands of retrievals, and can be efficiently rebuilt (every few minutes) and swapped into GPU memory. The full index fits readily within a mid-range GPU’s 16 GB memory due to aggressive compression (8 bits per posting, small headers).
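As a back-of-the-envelope check that such an index fits in device memory: the 8-bit-per-posting and 24-bit-header figures come from the text, but the posting and block counts below are illustrative assumptions, not from the paper.

```python
# Rough index sizing under the compressed layout described above.
postings = 5_000_000_000          # assumed total postings across all lists
blocks = postings // 32           # assumed average of 32 postings per block
posting_bytes = postings * 1      # 8 bits per posting
header_bytes = blocks * 3         # 24-bit block headers
total_gb = (posting_bytes + header_bytes) / 2**30
print(f"{total_gb:.2f} GiB")      # prints "5.09 GiB" -- well under 16 GB
```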

Compatibility with ANN libraries for the dual-tower and IPNN stages allows separation of concerns: candidate ANN scan (embedding-based), GPU HitMatch for explicit features, and top-K selection compose the query pipeline. Independence of query processing supports horizontal scalability, with each query operating against the same read-only index and output buffer.

Latency constraints are met with sub-millisecond explicit interaction updates, leaving substantial budget for subsequent (CPU or GPU) ranking or re-ranking stages. A plausible implication is that this architecture can be generalized to other high-QPS, high-cardinality retrieval domains requiring explicit cross-feature matching.

6. Relationships to Broader Research Domains

While primarily developed for ad retrieval, similar engineering motifs are evident in other domains requiring high-throughput, explicit feature-matched scoring. For example, GPU-accelerated feature-based registration systems in SLAM (e.g., FeatSense (Gaal et al., 2023)) employ analogous techniques of data partitioning, block scheduling, and atomic memory operations to maintain real-time volumetric fusion (TSDF) or point cloud alignment. Both paradigms exploit data sparsity, memory alignment, and kernel fusion to surmount memory bandwidth and thread divergence.

A key distinction is that the retrieval context achieves high model expressivity in candidate scoring via explicit cross-feature accumulation, whereas registration pipelines deploy GPU computation for TSDF fusion and mapping, with feature-matching primarily CPU-bound. Nevertheless, efficient GPU-based feature interaction remains critical for high-frequency, high-cardinality inference across multi-modal data.

7. Limitations and Practical Considerations

Explicit feature interaction via GPU-accelerated inverted lists achieves low-latency, wide × deep capacity under both memory and QPS constraints, but imposes certain requirements:

  • Posting lists must be highly sparse to ensure predictable memory usage and enable aggressive index compression.
  • For index updates (e.g., ad addition/removal), periodic full rebuilds and atomic swaps into device memory are necessary.
  • Explicit cross-feature indexing is amenable to horizontal scaling and easy context isolation, but modeling efficacy is bounded by the representational granularity of included cross-features.
  • The technique is most effective when the candidate universe is static or only slowly evolving; highly dynamic catalogs may increase index rebuild overhead.

Within these parameters, co-designed model and system architectures for GPU-based feature interaction provide a practical and robust solution for matching the retrieval-stage latency envelope with near-optimal model expressivity (Lei et al., 27 Nov 2025).
