Parallel Muon Reconstruction
- Parallel muon reconstruction is a technique for high-energy physics detectors that uses seeded Hough transforms and hardware-level parallelism to rapidly process muon tracks.
- The method employs a specialized Hough transform with vectorized least squares fitting to achieve precise momentum determination within microsecond latencies.
- Integration of ARM Cortex-A9 Neon SIMD and custom FPGA architectures ensures over 98% segment-finding efficiency and meets stringent first-level trigger timing requirements.
Parallel muon reconstruction, in the context of high-energy collider detectors, refers to the processing and reconstruction of muon tracks in drift-tube chambers with substantial parallelism to meet stringent real-time trigger requirements. The implementation described for the ATLAS experiment at the High-Luminosity Large Hadron Collider (HL-LHC) is a paradigmatic example, leveraging both algorithmic methods and hardware-level parallel floating-point execution to rapidly identify muon trajectories and suppress background, achieving performance compatible with first-level trigger constraints (Abovyan et al., 2018).
1. Seeded Hough Transform for Track Reconstruction
The parallel muon reconstruction pipeline is built upon a Hough transform tailored for fast one-dimensional scans, exploiting prior information from fast trigger chambers. In the ATLAS monitored drift tube (MDT) system, the general Hough transform for lines in the plane is defined as:

$$ y = m\,z + b $$

Each detector hit casts a “vote” for all parameter pairs $(m, b)$ that satisfy this relation, such that straight tracks yield peaks in the $(m, b)$ plane ("Hough space"). However, ATLAS leverages the fast trigger chambers (RPC/TGC) to provide a coarse estimate $\bar m$ of the track slope (with accuracy $\Delta m$). The Hough transform is "seeded" at the approximate slope $\bar m$, requiring only the intercept $b$ of the linear trajectory to be scanned:

$$ b = y - \bar m\,z $$

For each MDT hit $i$, the measured input includes the tube center $(z_i, y_i)$, drift radius $r_i$, and the seeded slope $\bar m$. Requiring the perpendicular distance from the wire to the candidate line to equal the drift radius, the geometry yields two possible intercept solutions (for either side of the wire):

$$ b_{\pm} = y_i - \bar m\,z_i \pm r_i \sqrt{1 + \bar m^2} $$

Quantizing $b$ to 1 mm and sorting all $2N$ values ($N$ = number of hits), a histogram identifies clusters representing true track segments. The final segment fitting is performed by linearizing in $m$ and $b$, yielding normal equations for least squares optimization over the selected hits.
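The logic of the seeded scan can be summarized in a short sketch. The following C++ fragment is illustrative only: the `MdtHit` structure, the `houghVotes` and `bestCluster` helpers, the 1 mm window, and the sliding-window clustering are assumptions standing in for the quantized-histogram search described above, not the actual ATLAS firmware or software interface.

```cpp
// Minimal sketch of the seeded, one-dimensional Hough scan described above.
// Hit layout, window width, and cluster criterion are illustrative assumptions.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct MdtHit { float z, y, r; };          // tube centre (z, y) and drift radius r [mm]

struct InterceptVote { float b; std::size_t hit; int sign; };

// Collect both intercept candidates b± for every hit at the seeded slope m_bar,
// then return the votes sorted in b so that clusters can be found by a sliding window.
std::vector<InterceptVote> houghVotes(const std::vector<MdtHit>& hits, float mBar) {
    const float s = std::sqrt(1.0f + mBar * mBar);   // S = sqrt(1 + m_bar^2)
    std::vector<InterceptVote> votes;
    votes.reserve(2 * hits.size());
    for (std::size_t i = 0; i < hits.size(); ++i) {
        const float b0 = hits[i].y - mBar * hits[i].z;
        votes.push_back({b0 + hits[i].r * s, i, +1});
        votes.push_back({b0 - hits[i].r * s, i, -1});
    }
    std::sort(votes.begin(), votes.end(),
              [](const InterceptVote& a, const InterceptVote& b) { return a.b < b.b; });
    return votes;
}

// Sliding-window cluster search: pick the 1 mm-wide interval containing the most votes.
// The returned half-open range [begin, end) into the sorted vote list identifies the hits
// (and drift-circle sides) passed on to the least-squares segment fit.
std::pair<std::size_t, std::size_t> bestCluster(const std::vector<InterceptVote>& v,
                                                float window = 1.0f /* mm */) {
    std::size_t bestBegin = 0, bestEnd = 0, lo = 0;
    for (std::size_t hi = 0; hi < v.size(); ++hi) {
        while (v[hi].b - v[lo].b > window) ++lo;     // shrink window from the left
        if (hi + 1 - lo > bestEnd - bestBegin) { bestBegin = lo; bestEnd = hi + 1; }
    }
    return {bestBegin, bestEnd};
}
```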
2. SIMD Parallelization on ARM Cortex-A9 Neon
To meet latency constraints, parallelization is realized on the ARM Cortex-A9’s Neon Single Instruction Multiple Data (SIMD) engine, capable of 4-wide single-precision floating-point vector operations. Data are organized into 16-byte-aligned arrays:
`float r[16], z[16], y[16], sign[16];`

Here `sign = ±1` selects between the two intercept candidates $b_{\pm}$ (the drift-circle side chosen by the Hough clustering), and the constants $S = \sqrt{1+\bar m^2}$ and $M = \bar m$ are precomputed once per segment. Processing four hits per Neon instruction yields roughly a four-fold speedup over scalar code, reducing the segment-fit time from $\sim$2 μs to $\sim$0.5 μs per segment.

3. Integrated Detector Hardware Architecture
The processing pipeline is tightly coupled with the detector’s hardware:
- Xilinx Zynq XC7Z045 SoC integrates dual 800 MHz Cortex-A9 CPUs and FPGA fabric.
- On-chamber electronics forward hits through optical GBT links to an off-detector "hit-matcher" FPGA, with custom logic including an 8k-deep input FIFO and multiple data-shuffling FIFOs.
- The hit matcher associates MDT hits with L0 pretriggers, streaming matched hits (up to 16 per chamber) to segment-reconstruction FPGA IP over AXI4-Stream at 320 MHz.
- The segment-reconstruction IP (pattern recognition and bubble-sort clustering) interrupts the ARM CPU via IRQ, delivering input segments over a 32-bit AXI FIFO.
- The ARM CPU reads segment candidates, executes the vectorized least squares fit, and performs momentum determination, typically in <500 ns (see the sketch below).
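The vectorized fit step can be illustrated with ARM Neon intrinsics. The fragment below is a simplified sketch under stated assumptions: it accumulates the sums entering the 2×2 normal equations of a straight-line fit in which each hit contributes an effective measurement $y_i - \mathrm{sign}_i\, r_i S$ at the seeded slope. The names `fitSegmentNeon` and `LineFit` are hypothetical; the arrays `r`, `z`, `y`, `sign` and the constant `S` follow the data layout above, but the exact residual definition and weighting used in the ATLAS code are not reproduced here.

```cpp
// Illustrative Neon (ARMv7) sketch of the vectorized accumulation for the
// linearized least-squares segment fit; not the actual ATLAS implementation.
#include <arm_neon.h>

struct LineFit { float m, b; };   // fitted slope and intercept

// r, z, y, sign: 16-byte-aligned arrays of nHits values (assumed padded to a multiple of 4).
// S = sqrt(1 + mBar*mBar), precomputed once per segment from the seeded slope mBar.
LineFit fitSegmentNeon(const float* r, const float* z, const float* y,
                       const float* sign, int nHits, float S) {
    float32x4_t sumZ  = vdupq_n_f32(0.0f), sumY  = vdupq_n_f32(0.0f);
    float32x4_t sumZZ = vdupq_n_f32(0.0f), sumZY = vdupq_n_f32(0.0f);
    const float32x4_t vS = vdupq_n_f32(S);

    for (int i = 0; i < nHits; i += 4) {            // four hits per iteration
        float32x4_t vz = vld1q_f32(z + i);
        float32x4_t vy = vld1q_f32(y + i);
        float32x4_t vr = vld1q_f32(r + i);
        float32x4_t vs = vld1q_f32(sign + i);
        // Effective measurement: tube-centre y shifted to the tangent point,
        // y_eff = y - sign * r * S  (sign = ±1 chosen by the Hough clustering).
        float32x4_t yEff = vmlsq_f32(vy, vmulq_f32(vs, vr), vS);
        sumZ  = vaddq_f32(sumZ, vz);
        sumY  = vaddq_f32(sumY, yEff);
        sumZZ = vmlaq_f32(sumZZ, vz, vz);
        sumZY = vmlaq_f32(sumZY, vz, yEff);
    }

    // Horizontal reduction of the four partial sums held in each vector register.
    auto hsum = [](float32x4_t v) {
        float32x2_t p = vadd_f32(vget_low_f32(v), vget_high_f32(v));
        p = vpadd_f32(p, p);
        return vget_lane_f32(p, 0);
    };
    const float n  = static_cast<float>(nHits);
    const float Sz = hsum(sumZ), Sy = hsum(sumY), Szz = hsum(sumZZ), Szy = hsum(sumZY);

    // Solve the 2x2 normal equations of the straight-line fit y_eff = m*z + b.
    const float det = n * Szz - Sz * Sz;
    LineFit fit;
    fit.m = (n * Szy - Sz * Sy) / det;
    fit.b = (Szz * Sy - Sz * Szy) / det;
    return fit;
}
```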
4. Detailed Latency Breakdown
The real-time constraints are governed by the following latency budget (in nanoseconds):
| Step | Latency (ns) |
|---|---|
| Time of flight ($\mu \to$ MDT) | 65 |
| Maximum drift time | 750 |
| Digitization & on-chamber multiplexing | 561 |
| Optical link (max 100 m fibre) | 516 |
| Hit matching (PL IP) | 440 |
| Transfer to segment-recognition IP | 250 |
| Pattern recognition (Hough clustering) | 204 |
| Transfer cluster to ARM (AXI) | 60 |
| ARM segment fit (Neon SIMD) | 500 |
| Transfer back segment params | 250 |
| Momentum determination | 80 |
| **Total** | ≈3,630 |

5. Performance and Throughput

The total cumulative latency for the MDT-based parallel muon reconstruction is thus approximately $3.63\,\mu$s. The achieved transverse-momentum resolution is $\lesssim 4\%$ at $p_T \approx 20\,\mathrm{GeV}$, compared with $>10\%$ from the fast trigger chambers alone, and a segment fit takes $\approx 1/3\,\mu$s, corresponding to $\sim$3 million segments/s per ARM core. The Neon-optimized implementation runs four times faster than scalar code, and six Zynq SoCs per spectrometer sector (two per chamber layer) can sustain up to 3 kHz of pretriggers with margin.

6. Implications for HL-LHC Muon Trigger and Future Directions
The parallel muon reconstruction method enables sharp momentum thresholding in the ATLAS L0 trigger: the $\sim 3.6\,\mu$s latency fits within the first-level trigger budget, and the precise MDT-based $p_T$ measurement sharpens the trigger turn-on at the $p_T$ threshold. Future optimization could target reducing the latency below $3\,\mu$s or further increasing throughput.
A plausible implication is that the adopted parallel muon processing architecture, combining algorithmic seeding with SIMD acceleration, sets a scalable template for future detector upgrades at even higher luminosities or for integration with more compute-intensive reconstruction algorithms.
References

1. Abovyan et al. (2018).