Heimdall++: High-Throughput Radio & LLM Systems
- Heimdall++ is a dual-purpose system that offers a high-throughput GPU-accelerated pipeline for real-time single-pulse detection in radio astronomy and a test-time scaling framework for LLM verification.
- It employs fine-grained parallelism and custom memory management to reduce cudaMemcpy calls by 6.7× and cudaMalloc invocations by 41.2×, optimizing computational throughput.
- In LLM verification, Heimdall++ uses PPO fine-tuning and multi-sample aggregation, achieving up to 97.5% accuracy and robust candidate evaluation.
Heimdall++ is a designation for two advanced systems in high-throughput scientific computing: (1) an optimized GPU-accelerated pipeline for real-time single-pulse detection in time-domain radio astronomy, and (2) a test-time scaling framework for generative solution verification with LLMs. Both represent substantial technical advancements over their respective predecessors, sharing architectural principles but addressing distinct application domains. Below, each constituent system, its technical innovations, and empirical results are elaborated in detail (Xia et al., 29 Nov 2025, Shi et al., 14 Apr 2025).
1. Architectural Innovations in Radio Astronomy Pipeline
Heimdall++ is an end-to-end redesign of the original Heimdall single-pulse search pipeline, addressing suboptimal GPU throughput due to resource contention and sequential execution. The pipeline encompasses two principal stages:
- CPU Pipeline Creation: This involves file-level I/O, extraction of filterbank headers and metadata (e.g., center frequency, bandwidth, sampling interval), and allocation of Unified Memory buffers. Lightweight `PipelineTask` objects are constructed and enqueued in a thread-safe creation queue.
- GPU Execution: Data chunks are streamed to the device using double buffering. The subsequent processing comprises RFI mitigation, incoherent dedispersion across user-specified dispersion measure (DM) trials, and candidate search (baseline removal, normalization, matched filtering, peak detection) in multiple asynchronous CUDA streams. Candidate merging and clustering tasks execute exclusively on the GPU, leveraging shared-memory acceleration, with final results transferred back to host memory.
Fine-grained parallelism is achieved by decomposing the DM trials across $T$ OpenMP host threads, each with a dedicated CUDA stream, so that every thread processes its own subset of the DM space. Asynchronous kernel launches and overlapping data transfers remove stalls between transfer and compute.
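The decomposition of DM trials across host threads can be illustrated with a minimal Python sketch. The exact partitioning rule is not recoverable from the text, so the contiguous, near-equal block split below is an assumption, and `partition_dm_trials` is a hypothetical helper, not part of Heimdall++.

```python
def partition_dm_trials(n_trials: int, n_threads: int) -> list[range]:
    """Split DM-trial indices 0..n_trials-1 into n_threads contiguous blocks.

    Assumption: contiguous blocks of near-equal size; the first
    (n_trials % n_threads) threads receive one extra trial each.
    """
    if n_threads < 1:
        raise ValueError("n_threads must be >= 1")
    base, extra = divmod(n_trials, n_threads)
    blocks = []
    start = 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


# Each block would be handled by one host thread with its own CUDA stream.
blocks = partition_dm_trials(n_trials=10, n_threads=4)
for thread_id, block in enumerate(blocks):
    print(thread_id, list(block))
```

A contiguous split keeps neighbouring DM trials on the same thread; an interleaved split would be an equally valid choice under these assumptions.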
2. GPU Utilization, Parallelization, and Memory Management
Significant improvements stem from kernel launch strategies and custom memory management apparatus:
- Thread-block configuration: For 256K-sample chunks, the thread count is typically $256$–$512$ per block, maximizing streaming multiprocessor (SM) occupancy. GPU occupancy increases from 50% in Heimdall to 92% in Heimdall++ on an RTX 3080 Ti.
- Transfer versus compute time: Heimdall++ eliminates redundant `cudaMemcpy` calls, reducing PCIe traffic from $7.85$ GB to $1.17$ GB per batch, a $6.7\times$ decrease. More wall-clock time is therefore spent on computation rather than on transfer.
- Unified Memory and custom allocator: Placing input and output buffers in Unified Memory allows on-demand paging, supporting working sets that exceed physical device memory. A multi-threaded allocator manages a global queue of reusable blocks, reducing `cudaMalloc` invocations by $41.2\times$. Allocation is protected by reader-writer locks to permit concurrency.
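The block-reuse idea behind the custom allocator can be sketched in Python. This is an illustration under stated assumptions, not the Heimdall++ allocator: `BlockPool` is a hypothetical class, host `bytearray` objects stand in for device buffers, free blocks are keyed by size, and a plain mutex replaces the reader-writer lock because the Python standard library does not provide one.

```python
import threading
from collections import defaultdict


class BlockPool:
    """Hypothetical pool that reuses released buffers instead of reallocating."""

    def __init__(self) -> None:
        self._free = defaultdict(list)   # size -> idle buffers of that size
        self._lock = threading.Lock()    # assumption: mutex instead of RW lock
        self.raw_allocations = 0         # counts "expensive" allocations

    def acquire(self, size: int) -> bytearray:
        with self._lock:
            if self._free[size]:
                return self._free[size].pop()   # reuse an idle block
            self.raw_allocations += 1
        return bytearray(size)                  # stand-in for a device allocation

    def release(self, buf: bytearray) -> None:
        with self._lock:
            self._free[len(buf)].append(buf)


pool = BlockPool()
for _ in range(100):
    buf = pool.acquire(4096)
    pool.release(buf)
print(pool.raw_allocations)
```

In this toy loop every request after the first is served from the pool, which is the mechanism by which a reuse queue cuts the number of raw allocation calls.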
The clustering stage, reimplemented with shared memory, replaces the previous pointwise comparisons with coalesced scans, improving both runtime and global memory bandwidth utilization.
3. Multi-Threaded Pipeline Parallelism
Heimdall++ instantiates a two-stage pipeline (see Fig. 4 in Xia et al., 29 Nov 2025):
- Stage 1, CPU pipeline creation threads, perform I/O and buffer setup.
- Stage 2, GPU execution threads, asynchronously read data chunks, launch RFI mitigation and DM-trial kernels, and merge candidates.
Task boundaries are delineated with lock-free or mutex-protected queues. Double buffering is managed via CUDA events to overlap data transfers and compute, avoiding the inefficiency of per-kernel synchronizations.
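The two-stage structure can be sketched as a producer-consumer pair joined by a bounded, thread-safe queue. The sketch is a Python analogy only: the function names, the task dictionary, and the sentinel-based shutdown are assumptions for illustration, and the GPU work is replaced by a trivial placeholder.

```python
import queue
import threading

SENTINEL = None  # assumption: a None task signals "no more work"


def creation_stage(paths, tasks):
    """Stage 1: stand-in for file I/O, header parsing, and buffer setup."""
    for path in paths:
        tasks.put({"path": path, "n_chunks": 2})
    tasks.put(SENTINEL)


def execution_stage(tasks, results):
    """Stage 2: stand-in for chunk streaming, kernels, and candidate merging."""
    while True:
        task = tasks.get()
        if task is SENTINEL:
            break
        results.append((task["path"], task["n_chunks"]))


def run_pipeline(paths):
    tasks = queue.Queue(maxsize=4)  # bounded queue applies back-pressure
    results = []
    producer = threading.Thread(target=creation_stage, args=(paths, tasks))
    consumer = threading.Thread(target=execution_stage, args=(tasks, results))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results


processed = run_pipeline(["a.fil", "b.fil", "c.fil"])
print(processed)
```

With one producer and one consumer the queue preserves file order; with several execution threads, results would need to be reordered or tagged.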
4. Empirical Performance and Validation
Quantitative benchmarks on NVIDIA RTX 3080 Ti + Core i9-12900K systems demonstrate substantive acceleration:
| Scenario | Speedup Heimdall++ vs. Heimdall | GPU Utilization (%) |
|---|---|---|
| Single-file, T=8 (1GB) | 2.66× | ≈92 |
| Multi-file batch, T=4 | 2.05× | - |
| Stage-level: RFI Mitigation | 3.25× | - |
| Stage-level: Clustering | up to 4.5× | - |
All results are scientifically consistent: candidate lists (time, DM, width, S/N) are identical between Heimdall++ and Heimdall, with a validation metric indicating no missed or spurious candidates.
5. Generative Verification with Chain-of-Thought LLMs
Heimdall++ also refers to a test-time verification system for long Chain-of-Thought (CoT) reasoning in LLMs. Here, a base transformer (e.g., Qwen-32B-Distill) is fine-tuned with PPO to output a CoT trace followed by a binary answer token.
Reinforcement Learning Framework
Verification is formulated as a one-step RL task:
- State: Verification prompt containing both the problem and a candidate solution
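- Action: Generation of the CoT trace and the binary answer token
- Reward: Determined by whether the binary judgment agrees with the ground-truth correctness of the candidate solution
- Objective: Maximize the expected reward over verification prompts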
PPO training employs the AdamW optimizer with batch size 16 and a maximum CoT length of 200–400, for 1,000 steps.
Scalable Verification and Aggregation
Test-time scaling is achieved via repeated sampling ($M$ independent verifier calls per problem), aggregated using either majority voting or the average score. Recorded metrics include false-positive and false-negative rates and the area under the ROC curve.
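The two aggregation rules can be shown with a short Python sketch. It assumes each verifier call returns a 0/1 judgment (1 meaning "solution judged correct") and that ties in the vote resolve to 0; both are illustrative assumptions, and the function names are hypothetical.

```python
from statistics import mean


def majority_vote(judgments: list[int]) -> int:
    """Return 1 if strictly more than half of the 0/1 judgments are 1."""
    return int(sum(judgments) * 2 > len(judgments))


def average_score(judgments: list[int]) -> float:
    """Return the fraction of verifier calls that judged the solution correct."""
    return mean(judgments)


# Eight hypothetical verifier judgments for one candidate solution.
samples = [1, 1, 0, 1, 0, 1, 1, 1]
print(majority_vote(samples), average_score(samples))
```

Majority voting yields a hard decision, while the average score keeps a graded confidence that can be thresholded or ranked across candidates.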
Pessimistic Verification Algorithm
For solution selection, pessimistic verification computes a lower-confidence bound on candidate answer scores:
- For each unique answer, the score is the mean verification score minus a penalty that decreases with the number of samples backing that answer and is scaled by a tunable parameter.
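A lower-confidence-bound selection of this kind can be sketched in Python. The exact penalty term is not recoverable from the text, so the form `alpha / sqrt(n)` used below is an assumption chosen only to illustrate the idea; `pessimistic_select` and `alpha` are hypothetical names.

```python
import math
from collections import defaultdict


def pessimistic_select(scores_by_solution, alpha=1.0):
    """Pick the answer with the highest lower-confidence bound.

    scores_by_solution: list of (answer, [0/1 verification scores]) pairs,
    one pair per sampled solution. Scores of solutions sharing an answer
    are pooled. Assumed bound: mean - alpha / sqrt(n).
    """
    pooled = defaultdict(list)
    for answer, scores in scores_by_solution:
        pooled[answer].extend(scores)

    def lower_bound(answer):
        scores = pooled[answer]
        return sum(scores) / len(scores) - alpha / math.sqrt(len(scores))

    return max(pooled, key=lower_bound)


candidates = [
    ("42", [1, 1, 0, 1]),
    ("42", [1, 1, 1, 0]),
    ("17", [1]),
]
print(pessimistic_select(candidates, alpha=1.0))
print(pessimistic_select(candidates, alpha=0.0))
```

With `alpha=0` the rule reduces to picking the highest mean score, so a single lucky verification can win; a positive `alpha` favors answers supported by more evidence.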
6. Experimental Results in LLM Verification
Empirical evaluations on AIME2024/AIME2025 utilize DeepSeek-R1-Distill-Qwen-32B and Gemini 2.5 Pro as solver models:
| Configuration | Accuracy |
|---|---|
| Solver only (N=64, no verify) | 54.2% |
| Heimdall (init) | 62.5% |
| Heimdall after PPO (L→400) | 94.5% |
| Heimdall + M=64 (MV) | 97.5% |
| Pessimistic Verification (N=16, M=16) | 70.0% |
| Pessimistic Verification (N=64, M=64) | 83.3% |
| Gemini 2.5 Pro + PV (N=16, M=16) | 93.0% |
Generalization is evidenced by proof verification (9/10 correct judgments) and automatic dataset filtering (≈50% of NuminaMath samples flagged as flawed).
7. Scientific Consistency and Generalization
Heimdall++ in both domains maintains strict equivalence to reference systems: bit-for-bit candidate reproducibility in radio astronomy, and high-fidelity verification in LLM reasoning. Validation metrics and multi-sample score distributions confirm robustness and consistency.
A plausible implication is that fine-grained parallelism, custom memory management, and scalable verification—independent of application domain—drive substantive efficiency and trust in modern scientific compute systems.