- The paper demonstrates that flip graph search can reduce the number of scalar multiplications needed for small to medium matrix formats.
- It leverages breakthroughs by Moosbauer and Poole to lower ranks significantly, exemplified by improvements in 5×5 and 6×6 schemes.
- The study addresses practical challenges, including Hensel lifting from Z₂ schemes to integer coefficients for real-world matrix operations.
This paper explores finding efficient matrix multiplication schemes for various small to medium-sized rectangular matrices. The efficiency is measured by the "rank" of the scheme, which is the minimum number of scalar multiplications required in the coefficient ring (e.g., integers, real numbers) to compute the product. The standard algorithm for multiplying an n×m matrix by an m×p matrix requires nmp multiplications. Fast matrix multiplication research aims to find schemes with rank significantly less than nmp, even for small fixed sizes.
The authors leverage recent breakthroughs by Moosbauer and Poole, who found schemes with remarkably low rank for 5×5 matrices (rank 93, compared to standard 125) and 6×6 matrices (rank 153, compared to standard 216). These schemes serve as starting points for a search method called the flip graph search.
The flip graph search is a technique that takes an existing, correct matrix multiplication scheme and attempts to eliminate multiplications by applying specific transformations or "flips." While the paper doesn't detail the mechanics of the flip graph itself, its practical outcome is the potential discovery of lower-rank schemes starting from a known one.
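While the paper itself doesn't spell out the flip mechanics, the basic flip operation from Kauers and Moosbauer's flip-graph construction can be illustrated: over Z₂, two rank-one terms of a scheme that share a factor can be rewritten without changing their sum, which walks the search to a different (possibly reducible) scheme. A minimal sketch of that identity (the helper functions are illustrative, not from the paper):

```python
# A flip rewrites two rank-one terms that share a factor without
# changing the tensor they sum to. Over Z2, if two terms share the
# first factor a:
#     a (x) b (x) c  +  a (x) b' (x) c'
#   = a (x) b (x) (c + c')  +  a (x) (b + b') (x) c'
# (the cross terms a (x) b (x) c' cancel mod 2).
import itertools
import random

def outer3(a, b, c):
    """Rank-one tensor a (x) b (x) c over Z2, as a dict of coordinates."""
    return {(i, j, k): (x * y * z) % 2
            for (i, x), (j, y), (k, z) in itertools.product(
                enumerate(a), enumerate(b), enumerate(c))}

def add_mod2(t1, t2):
    """Entrywise sum of two tensors over Z2."""
    return {key: (t1[key] + t2[key]) % 2 for key in t1}

random.seed(0)
a  = [random.randint(0, 1) for _ in range(4)]
b  = [random.randint(0, 1) for _ in range(4)]
bp = [random.randint(0, 1) for _ in range(4)]
c  = [random.randint(0, 1) for _ in range(4)]
cp = [random.randint(0, 1) for _ in range(4)]

before = add_mod2(outer3(a, b, c), outer3(a, bp, cp))
after  = add_mod2(outer3(a, b, [(x + y) % 2 for x, y in zip(c, cp)]),
                  outer3(a, [(x + y) % 2 for x, y in zip(b, bp)], cp))
assert before == after  # same tensor, so the scheme stays correct
```

Because the flipped scheme computes the same tensor, a sequence of flips can wander the space of rank-R schemes until a reduction (dropping a product) becomes possible.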
A key strategy employed in this work, following previous research by Arai et al. (Li et al., 1 Jan 2024), is an incremental approach. This involves:
- Starting from known good schemes: Using the Moosbauer-Poole schemes for (5,5,5) and (6,6,6) as foundational points.
- Extending schemes: Constructing schemes for larger formats by combining known schemes for smaller formats. For instance, a scheme for multiplying a 5×6 matrix by a 6×7 matrix to get a 5×7 matrix (format (5,6,7)) can be built from a scheme for (5,6,6) and a scheme for (5,6,1). This typically involves treating matrices as blocks and applying the smaller schemes to these blocks. If a scheme for (n,m,p) has rank R_p and a scheme for (n,m,q) has rank R_q, a scheme for (n,m,p+q) can be obtained by applying the (n,m,p) scheme to the first p columns of the result and the (n,m,q) scheme to the last q columns, giving a combined rank of R_p + R_q.
- Restricting schemes: Conversely, obtaining starting points for smaller formats by effectively "embedding" them within a larger scheme. This is done by setting some variables (matrix entries) to zero. For example, a scheme for (3,3,3) can yield a scheme for (2,3,3) by setting the entries of the third row of the first matrix and the third row of the result matrix to zero. This restriction allows the flip graph search to begin from a potentially good starting point for the smaller format, rather than the standard, high-rank scheme.
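Both operations can be sketched on schemes stored as lists of coefficient triples, one (u, v, w) triple per scalar multiplication. The helper names below are illustrative, not from the paper, and the naive scheme stands in for any correct scheme:

```python
# Schemes as lists of (u, v, w) coefficient triples, one per scalar
# multiplication; each dict maps a matrix entry to its coefficient.
def naive_scheme(n, m, p):
    """Naive (n,m,p) scheme: one product a_ik * b_kj per term of C."""
    return [({(i, k): 1}, {(k, j): 1}, {(i, j): 1})
            for i in range(n) for k in range(m) for j in range(p)]

def extend_columns(terms_p, terms_q, p):
    """(n,m,p) scheme + (n,m,q) scheme -> (n,m,p+q) scheme: apply the
    second scheme to the last q columns of B and C, shifted by p."""
    shifted = [(u,
                {(k, j + p): c for (k, j), c in v.items()},
                {(i, j + p): c for (i, j), c in w.items()})
               for u, v, w in terms_q]
    return terms_p + shifted      # ranks add: R_p + R_q

def restrict_rows(terms, n_new):
    """Zero rows >= n_new of A and C and drop dead products; gives a
    starting point for the smaller format (n_new, m, p)."""
    out = []
    for u, v, w in terms:
        u2 = {ik: c for ik, c in u.items() if ik[0] < n_new}
        w2 = {ij: c for ij, c in w.items() if ij[0] < n_new}
        if u2 and w2:             # product still contributes
            out.append((u2, v, w2))
    return out

# Extension: (5,6,6) + (5,6,1) -> (5,6,7); naive ranks 180 + 30 add up.
combined = extend_columns(naive_scheme(5, 6, 6), naive_scheme(5, 6, 1), 6)
assert len(combined) == 5 * 6 * 6 + 5 * 6 * 1

# Restriction: (3,3,3) -> (2,3,3) by zeroing the third rows of A and C.
assert len(restrict_rows(naive_scheme(3, 3, 3), 2)) == 2 * 3 * 3
```

The same transformations apply unchanged to low-rank schemes; the point of restriction is that the surviving products usually number far fewer than the naive count for the smaller format.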
The authors systematically explored various matrix formats (n,m,p) where n≤m≤p, focusing on formats near (5,5,5) and (6,6,6). They followed paths of extensions and restrictions (illustrated in Fig. 1), using the best known schemes from one format as starting points for flip graph searches on neighboring formats.
The results, summarized in the table below, show new record-low ranks for many formats compared to the previous best known results (according to Sedoglavic's table [fastmm] as of April 2025).
| format | naive rank | previous record | our rank |
|---|---|---|---|
| (4,5,5) | 100 | 76 | 76 |
| (4,5,6) | 120 | 93 | 90 |
| (4,5,7) | 140 | 109 | 104 |
| (4,6,6) | 144 | 105 | 106 |
| (5,5,6) | 150 | 116 | 110 |
| (4,6,7) | 168 | 125 | 123 |
| (5,5,7) | 175 | 133 | 127 |
| (5,6,6) | 180 | 137 | 130 |
| (4,7,7) | 196 | 147 | 144 |
| (5,6,7) | 210 | 159 | 150 |
| (5,7,7) | 245 | 185 | 176 |
| (6,6,7) | 252 | 185 | 183 |
| (6,7,7) | 294 | 215 | 221 (over Z₂) |
New records were set for all listed formats except (4,5,5), where the previous record of 76 was matched, and (4,6,6) and (6,7,7), where the new Z₂ rank is higher than the previous integer record.
Practical Implementation and Application:
Implementing these schemes involves representing the matrix multiplication C=A⋅B (where A is n×m, B is m×p, and C is n×p) as a sequence of R scalar multiplications and numerous additions/subtractions. A scheme of rank R is typically defined by three sets of coefficients (often represented as tensors U,V,W). For the r-th multiplication (1≤r≤R), you compute:
- L_r = Σ_{i=1}^{n} Σ_{j=1}^{m} u_{r,ij} a_{ij} (a linear combination of entries from matrix A)
- R_r = Σ_{k=1}^{m} Σ_{l=1}^{p} v_{r,kl} b_{kl} (a linear combination of entries from matrix B)
- P_r = L_r · R_r (the scalar product)
Finally, the entries of the result matrix C are computed as linear combinations of these products:
c_{ab} = Σ_{r=1}^{R} w_{r,ab} P_r
Implementing such a scheme requires:
- Storing the coefficients u_{r,ij}, v_{r,kl}, and w_{r,ab}. These can be numerous, especially for larger n, m, p, R.
- Efficiently computing the R linear combinations L_r and R_r. This involves many additions and subtractions.
- Performing the R scalar multiplications P_r.
- Efficiently computing the n·p linear combinations for the entries c_{ab}.
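The evaluation steps above can be sketched directly. The snippet below applies a generic rank-R scheme given as flat coefficient matrices, checked against Strassen's classical rank-7 scheme for (2,2,2) (the coefficients are Strassen's, not a scheme from this paper):

```python
# Evaluate a rank-R scheme: R linear combinations of A's entries, R of
# B's entries, R scalar products, then linear combinations giving C.
def apply_scheme(U, V, W, A, B, n, m, p):
    a = [A[i][j] for i in range(n) for j in range(m)]   # flatten A row-major
    b = [B[k][l] for k in range(m) for l in range(p)]   # flatten B row-major
    R = len(U)
    P = []
    for r in range(R):
        L_r = sum(U[r][t] * a[t] for t in range(n * m))
        R_r = sum(V[r][t] * b[t] for t in range(m * p))
        P.append(L_r * R_r)                             # one scalar product
    return [[sum(W[r][i * p + j] * P[r] for r in range(R))
             for j in range(p)] for i in range(n)]

# Strassen's rank-7 scheme: rows index the 7 products, columns the
# flattened entries (a11, a12, a21, a22), (b11, ...), (c11, ...).
U = [[1,0,0,1],[0,0,1,1],[1,0,0,0],[0,0,0,1],[1,1,0,0],[-1,0,1,0],[0,1,0,-1]]
V = [[1,0,0,1],[1,0,0,0],[0,1,0,-1],[-1,0,1,0],[0,0,0,1],[1,1,0,0],[0,0,1,1]]
W = [[1,0,0,1],[0,0,1,-1],[0,1,0,1],[1,0,1,0],[-1,1,0,0],[0,0,0,1],[1,0,0,0]]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert apply_scheme(U, V, W, A, B, 2, 2, 2) == [[19, 22], [43, 50]]
```

Published schemes for the formats in the table above plug into the same interface: only the coefficient matrices and the dimensions change.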
The schemes found in this paper were initially discovered over the field Z₂ (coefficients are 0 or 1). For practical use with integer or floating-point matrices, these schemes need to be valid over the integers (or rationals/reals). The authors applied Hensel lifting to translate the Z₂ schemes to schemes with integer coefficients. They found that many, but not all, of the minimal Z₂ schemes could be lifted. For formats like (4,5,5), the schemes with the lowest Z₂ rank (74) could not be lifted, and they had to use schemes with a slightly higher rank (76) that could be lifted to integers. This highlights a practical challenge: a low rank over a specific field doesn't guarantee the same rank, or even validity, over other fields.
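Why lifting is needed at all can be illustrated by checking a scheme's defining equations over Z versus mod 2: a flipped sign is invisible mod 2 but breaks the scheme over the integers. A sketch under that framing (helper names are illustrative; this only tests validity, it does not perform Hensel lifting):

```python
# A scheme is valid iff the sum of its rank-one tensors equals the
# matrix multiplication tensor. Coefficients differing by a sign
# coincide mod 2, so Z2-validity does not imply Z-validity.
import itertools

def scheme_tensor(terms, n, m, p):
    """Sum of rank-one tensors u (x) v (x) w, as a coordinate dict."""
    T = {}
    for u, v, w in terms:
        for (i, k), (kk, j), (ii, jj) in itertools.product(u, v, w):
            key = (i, k, kk, j, ii, jj)
            T[key] = T.get(key, 0) + u[(i, k)] * v[(kk, j)] * w[(ii, jj)]
    return T

def matmul_tensor(n, m, p):
    """Target tensor of c_ij = sum_k a_ik * b_kj."""
    return {(i, k, k, j, i, j): 1
            for i in range(n) for k in range(m) for j in range(p)}

def is_valid(terms, n, m, p, mod=None):
    T, target = scheme_tensor(terms, n, m, p), matmul_tensor(n, m, p)
    for key in set(T) | set(target):
        diff = T.get(key, 0) - target.get(key, 0)
        if (diff % mod if mod else diff) != 0:
            return False
    return True

# Naive (2,2,2) scheme: valid over Z and over Z2.
naive = [({(i, k): 1}, {(k, j): 1}, {(i, j): 1})
         for i in range(2) for k in range(2) for j in range(2)]
assert is_valid(naive, 2, 2, 2) and is_valid(naive, 2, 2, 2, mod=2)

# Flip one sign: still valid mod 2, broken over Z.
broken = list(naive)
u, v, w = broken[0]
broken[0] = ({key: -c for key, c in u.items()}, v, w)
assert is_valid(broken, 2, 2, 2, mod=2)
assert not is_valid(broken, 2, 2, 2)
```

Hensel lifting searches for integer (more precisely, 2-adic) coefficients that reduce to the given Z₂ scheme and satisfy these equations exactly; as the paper notes, such a lift does not always exist.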
The computational cost of finding these schemes is substantial (days on high-core machines per search path). However, the computational cost of using a found scheme is determined by the number of scalar multiplications (R) and the total number of scalar additions/subtractions involved in computing the linear combinations. While the number of multiplications is reduced compared to the standard algorithm, the number of additions/subtractions typically increases.
These low-rank schemes are most practically applied within highly optimized linear algebra libraries (like BLAS implementations). For larger matrix multiplications, these small-sized fast schemes are often used recursively in a block-wise manner (e.g., using a (2,2,2) Strassen scheme recursively for large matrices). The schemes found in this paper for formats like (5,5,5), (5,6,7), etc., provide more base-case options for such recursive algorithms or can be used directly for operations involving matrices of these specific small dimensions.
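As an example of the recursive block-wise use mentioned above, here is a minimal sketch of Strassen's (2,2,2) scheme applied recursively to square matrices of power-of-two size (the cutoff and helper names are illustrative; production libraries fall back to tuned kernels below the cutoff):

```python
# Recursive block use of a small scheme: Strassen's (2,2,2) scheme on
# half-size blocks, recursing until a cutoff, for n x n with n = 2^k.
def mat_add(X, Y, sign=1):
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def naive_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def strassen(A, B, cutoff=2):
    n = len(A)
    if n <= cutoff:
        return naive_mul(A, B)
    h = n // 2
    q = lambda M, r, c: [row[c*h:(c+1)*h] for row in M[r*h:(r+1)*h]]
    A11, A12, A21, A22 = q(A,0,0), q(A,0,1), q(A,1,0), q(A,1,1)
    B11, B12, B21, B22 = q(B,0,0), q(B,0,1), q(B,1,0), q(B,1,1)
    M1 = strassen(mat_add(A11, A22), mat_add(B11, B22), cutoff)
    M2 = strassen(mat_add(A21, A22), B11, cutoff)
    M3 = strassen(A11, mat_add(B12, B22, -1), cutoff)
    M4 = strassen(A22, mat_add(B21, B11, -1), cutoff)
    M5 = strassen(mat_add(A11, A12), B22, cutoff)
    M6 = strassen(mat_add(A21, A11, -1), mat_add(B11, B12), cutoff)
    M7 = strassen(mat_add(A12, A22, -1), mat_add(B21, B22), cutoff)
    C11 = mat_add(mat_add(M1, M4), mat_add(M7, M5, -1))
    C12 = mat_add(M3, M5)
    C21 = mat_add(M2, M4)
    C22 = mat_add(mat_add(M1, M3), mat_add(M6, M2, -1))
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

The new schemes play the same role as the (2,2,2) base case here: a (5,6,7) scheme of rank 150, say, multiplies 5×6 by 6×7 blocks with 150 block multiplications instead of 210, and those block products can themselves recurse.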
The main benefits for real-world applications are potential performance improvements for matrix multiplications of these exact dimensions or as base cases for larger recursive multiplications, particularly in domains like scientific computing, numerical simulations, and potentially parts of machine learning inference where fixed small matrix operations occur. The trade-off is increased code complexity, potentially higher register pressure, and less straightforward memory access patterns compared to the standard algorithm.
The resulting schemes are made publicly available, allowing practitioners to incorporate them into specialized libraries or applications where these specific matrix sizes are critical performance bottlenecks.