
RXTX Algorithm for Efficient XXᵀ Computation

Updated 2 July 2025
  • RXTX Algorithm is an AI-derived recursive block algorithm that efficiently computes matrix-times-transpose products while reducing multiplications and additions.
  • It employs a 4×4 block partitioning and integrates MILP with reinforcement learning to optimize bilinear product selection and minimize redundant arithmetic.
  • Empirical results show up to 9% faster runtimes and consistent efficiency gains across a range of matrix sizes, demonstrating a practical advantage.

The RXTX algorithm is an AI-discovered recursive block algorithm for the efficient computation of the matrix-times-transpose product $XX^{t}$ for a real matrix $X \in \mathbb{R}^{n \times m}$. RXTX reduces both multiplications and total arithmetic operations (additions plus multiplications) by approximately 5% compared to previous state-of-the-art (SotA) approaches, with improvements holding across all matrix sizes, including small matrices ($n = 4$), and compounding at larger scales. RXTX was developed using a combination of machine-learning-guided search techniques and combinatorial optimization.

1. Algorithmic Structure and Recursion

RXTX proceeds by recursively partitioning the input matrix $X$ into a $4 \times 4$ grid of blocks:
$$X = \begin{pmatrix} X_1 & X_2 & X_3 & X_4 \\ X_5 & X_6 & X_7 & X_8 \\ X_9 & X_{10} & X_{11} & X_{12} \\ X_{13} & X_{14} & X_{15} & X_{16} \end{pmatrix}.$$
At each recursive step, RXTX computes $C = XX^{t}$ using 8 recursive calls on subblocks and 26 general multiplications on block submatrices, followed by a recombination through optimized additions.

The recurrence for RXTX is
$$R(n) = 8R(n/4) + 26M(n/4),$$
where $R(n)$ is the number of multiplications required by RXTX and $M(n) = n^{\log_2 7}$ is the number required by Strassen-Winograd general matrix multiplication.

The previous SotA recursive Strassen algorithm for $XX^{t}$ follows
$$S(n) = 4S(n/2) + 2M(n/2).$$
This block partitioning and customized recursion allow RXTX to exploit structure unique to products of the form $XX^{t}$.
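The two recurrences can be tabulated directly. A minimal sketch (function names are illustrative; $R$ is evaluated at powers of 4, $S$ and $M$ at powers of 2, with base case 1 multiplication for a scalar product):

```python
from functools import lru_cache

# M(n): multiplications used by Strassen-Winograd for an n x n product
@lru_cache(maxsize=None)
def M(n):
    return 1 if n == 1 else 7 * M(n // 2)

# S(n): previous SotA recursion for XX^T, S(n) = 4 S(n/2) + 2 M(n/2)
@lru_cache(maxsize=None)
def S(n):
    return 1 if n == 1 else 4 * S(n // 2) + 2 * M(n // 2)

# R(n): RXTX recursion, R(n) = 8 R(n/4) + 26 M(n/4)
@lru_cache(maxsize=None)
def R(n):
    return 1 if n == 1 else 8 * R(n // 4) + 26 * M(n // 4)

for n in [4, 16, 64, 256]:          # powers of 4 only for R
    print(n, R(n), S(n), round(R(n) / S(n), 4))
```

At $n = 4$ this reproduces the 34-versus-38 rank comparison, and the ratio drifts toward the asymptotic $\tfrac{26/41}{2/3} \approx 0.95$ as $n$ grows.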

2. Explicit Operation Counts and Base Case

The recurrences resolve to explicit formulas for both RXTX and the prior SotA:
$$R(n) = \frac{26}{41} M(n) + \frac{15}{41} n^{3/2} = \frac{26}{41} n^{\log_2 7} + \frac{15}{41} n^{3/2}$$

$$S(n) = \frac{2}{3} M(n) + \frac{1}{3} n^2 = \frac{2}{3} n^{\log_2 7} + \frac{1}{3} n^2$$

The leading coefficient for RXTX, $26/41 \approx 0.6341$, is approximately 5% lower than $2/3 \approx 0.6667$ for the previous SotA, reducing asymptotic operation counts.

At the $4 \times 4$ base case, RXTX computes 26 specific bilinear products $m_i$, such as
$$\begin{aligned} m_1 &= (-X_2 + X_3 - X_4 + X_8)\,(X_8 + X_{11})^T \\ m_2 &= (X_1 - X_5 - X_6 + X_7)\,(X_{15} + X_5)^T \\ &\;\;\vdots \\ m_{26} &= (X_6 + X_{10} + X_{12})\,X_{10}^T \end{aligned}$$
together with 8 symmetric (diagonal) block products $s_j = X_j\,X_j^T$ for $j = 1$ to $8$.

Recombination into the final output blocks, e.g.,
$$C_{11} = s_1 + s_2 + s_3 + s_4$$
$$C_{12} = m_2 - m_5 - m_7 + m_{11} + m_{12} + m_{13} + m_{19}$$
proceeds according to optimized addition schemes.
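The diagonal recombination $C_{11} = s_1 + s_2 + s_3 + s_4$ is an instance of the block identity $(XX^T)_{11} = \sum_j X_{1j} X_{1j}^T$, where $X_1, \dots, X_4$ form the first block row of the partition above. A quick numerical sanity check of that identity (a sketch, not the full 26-product scheme):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8                                # each block is (n/4) x (n/4), here 2 x 2
X = rng.standard_normal((n, n))
b = n // 4

# Partition X into a 4 x 4 grid of blocks; row 0 holds X_1..X_4.
blocks = [[X[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(4)] for i in range(4)]

# s_j = X_j X_j^T over the first block row
s = [blocks[0][j] @ blocks[0][j].T for j in range(4)]

# The (1,1) block of C = X X^T equals s_1 + s_2 + s_3 + s_4
C11 = (X @ X.T)[:b, :b]
assert np.allclose(C11, sum(s))
print("C11 identity verified")
```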

3. Arithmetic and Computational Efficiency

RXTX requires only 34 block products at the $4 \times 4$ base case (26 general multiplications plus 8 symmetric products, versus 38 for recursive Strassen), and this advantage compounds through recursion, yielding for large $n$:
$$R(n) = \frac{26}{41} n^{\log_2 7} + \frac{15}{41} n^{3/2}$$
The addition scheme is also optimized; the number of additions required at each recursive step is reduced from 139 to 100 via common subexpression elimination.

The total operation count (additions plus multiplications) is
$$R_+(n) = \frac{156}{41} n^{\log_2 7} - \frac{615}{164} n^2 + \frac{155}{164} n^{3/2}.$$
By comparison, recursive Strassen for $XX^t$ requires
$$S_+(n) = 4 n^{\log_2 7} - \frac{7}{4} n^2 \log_2 n - 3 n^2.$$
RXTX thus reduces the multiplication count and the total operation count simultaneously.

4. Discovery via Machine Learning and Combinatorial Optimization

RXTX was discovered through an AI-driven approach integrating reinforcement learning (RL) and combinatorial optimization. The process consists of two main components:

  • RL-guided Large Neighborhood Search: An RL agent generates candidate sets of bilinear (rank-1) products.
  • Mixed-Integer Linear Programming (MILP): Two MILP stages:
    • MILP-A: For each target expression in $XX^{T}$, enumerate linear combinations of the candidate products that realize the target.
    • MILP-B: Find the minimal subset of candidates whose spans cover all targets.

Optimization proceeds by alternately sampling new candidate products and using MILP solvers to find compact representations. Restricting the search to bilinear forms shrinks the combinatorial space and makes optimization feasible on mid-sized matrices, in contrast with tensor-based searches (e.g., AlphaTensor), which operate over significantly larger spaces.
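MILP-B is, in essence, a minimum set-cover problem: pick the fewest candidate products whose spans cover every target. A toy brute-force illustration of that objective (the candidates and coverage sets here are invented for demonstration, not the actual RXTX candidates):

```python
from itertools import combinations

# Toy instance: each candidate product "covers" the targets it can help express.
targets = {"C11", "C12", "C13", "C22"}
candidates = {
    "m1": {"C11", "C12"},
    "m2": {"C12", "C13"},
    "m3": {"C13", "C22"},
    "m4": {"C11", "C22"},
    "m5": {"C22"},
}

def min_cover(targets, candidates):
    """Smallest subset of candidates whose union covers all targets."""
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            if set().union(*(candidates[c] for c in subset)) >= targets:
                return set(subset)
    return None

best = min_cover(targets, candidates)
print(best, len(best))
```

A real MILP formulation would replace the brute-force loop with binary selection variables and coverage constraints, which is what makes the search scale to the 4×4 RXTX instance.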

5. Empirical Performance and Practical Thresholds

RXTX yields theoretical and practical advantages:

  • For large nn, RXTX uses roughly 95% of the multiplications of the previous SotA.
  • On $6144 \times 6144$ real matrices, with a one-level RXTX application and subsequent BLAS block multiplications, RXTX achieved an average runtime of 2.524 s, approximately 9% faster than the baseline BLAS routine (2.778 s), outperforming it in 99% of runs.
  • Performance thresholds indicate:
    • RXTX outperforms recursive Strassen for $n \geq 256$.
    • RXTX overtakes the naive implementation for $n \geq 1024$.
    • With optimal recursive cutoffs, RXTX can outperform other methods at sizes as small as $n \approx 32$, though this is hardware- and implementation-dependent.
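The Strassen crossover is visible directly in the closed-form total-operation counts from Section 3; a small sketch evaluating both formulas (valid for $n$ a power of 4, constants taken verbatim from the formulas above):

```python
from math import log2

LOG27 = log2(7)

def R_plus(n):
    """Total operations (mults + adds) for RXTX, closed form."""
    return 156/41 * n**LOG27 - 615/164 * n**2 + 155/164 * n**1.5

def S_plus(n):
    """Total operations for recursive Strassen applied to XX^T, closed form."""
    return 4 * n**LOG27 - 7/4 * n**2 * log2(n) - 3 * n**2

for n in [64, 256, 1024, 4096]:
    print(n, round(R_plus(n) / S_plus(n), 4))
```

The ratio crosses below 1 at $n = 256$, matching the stated threshold against recursive Strassen.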
Algorithm        Base Recursion                 Asymptotic Constant     4×4 Rank
Previous SotA    S(n) = 4S(n/2) + 2M(n/2)       2/3 ≈ 0.6666            38
RXTX             R(n) = 8R(n/4) + 26M(n/4)      26/41 ≈ 0.6341          34

6. Enablers and Structural Innovations

Several key factors contribute to the efficiency of RXTX:

  • Structure Exploitation: RXTX is tailored to the symmetry of $XX^{T}$, distinguishing it from generic matrix multiplication methods.
  • Block Recursion: Employing $4 \times 4$ block splitting (versus $2 \times 2$ in previous methods) permits more flexible and efficient recombination of products.
  • Optimized Additions: Automated search for common subexpressions minimizes redundant arithmetic, reducing total additions required.
  • Hybrid AI/Optimization Discovery: The integration of RL sampling and MILP-based combinatorial optimization enables the discovery of schemes that surpass those achieved by exhaustive search or human design.
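As a minimal illustration of the symmetry being exploited: since $XX^T$ is symmetric, only the upper triangle need be computed explicitly and the lower triangle mirrored. A naive sketch of that saving (not RXTX itself, which exploits the structure far more deeply):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 6))
n = X.shape[0]

# Compute only the upper triangle of C = X X^T: n(n+1)/2 row inner
# products instead of n^2, then mirror across the diagonal.
C = np.zeros((n, n))
for i in range(n):
    for j in range(i, n):
        C[i, j] = X[i] @ X[j]
C = C + np.triu(C, k=1).T   # reflect the strict upper triangle

assert np.allclose(C, X @ X.T)
print("symmetric computation verified")
```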

RXTX is thus an AI-discovered, recursively defined algorithm that systematically reduces the arithmetic complexity of $XX^{t}$ computation, combining theoretical reductions with practical performance gains across a wide range of matrix sizes. The results demonstrate the productive intersection of machine-learning search and combinatorial optimization in algorithmic discovery.