- The paper demonstrates that flip graph search can reduce the number of scalar multiplications needed for small to medium matrix formats.
- It leverages breakthroughs by Moosbauer and Poole to lower ranks significantly, exemplified by improvements in 5×5 and 6×6 schemes.
- The study addresses practical challenges, including Hensel lifting from Z₂ schemes to integer coefficients for real-world matrix operations.
This paper explores finding efficient matrix multiplication schemes for various small to medium-sized rectangular matrices. The efficiency is measured by the "rank" of the scheme, which is the minimum number of scalar multiplications required in the coefficient ring (e.g., integers, real numbers) to compute the product. The standard algorithm for multiplying an n×m matrix by an m×p matrix requires nmp multiplications. Fast matrix multiplication research aims to find schemes with rank significantly less than nmp, even for small fixed sizes.
The authors leverage recent breakthroughs by Moosbauer and Poole, who found schemes with remarkably low rank for 5×5 matrices (rank 93, compared to standard 125) and 6×6 matrices (rank 153, compared to standard 216). These schemes serve as starting points for a search method called the flip graph search.
The flip graph search is a technique that takes an existing, correct matrix multiplication scheme and attempts to eliminate multiplications by applying specific transformations or "flips." While the paper doesn't detail the mechanics of the flip graph itself, its practical outcome is the potential discovery of lower-rank schemes starting from a known one.
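While the paper itself doesn't spell out the flip mechanics, the basic flip operation from Kauers and Moosbauer's flip-graph construction can be illustrated: over Z₂, two rank-one terms of a scheme that share a factor can be rewritten without changing their sum, which walks the search to a different (possibly reducible) scheme. A minimal sketch of that identity (the helper functions are illustrative, not from the paper):

```python
# A flip rewrites two rank-one terms that share a factor without
# changing the tensor they sum to. Over Z2, if two terms share the
# first factor a:
#     a (x) b (x) c  +  a (x) b' (x) c'
#   = a (x) b (x) (c + c')  +  a (x) (b + b') (x) c'
# (the cross terms a (x) b (x) c' cancel mod 2).
import itertools
import random

def outer3(a, b, c):
    """Rank-one tensor a (x) b (x) c over Z2, as a dict of coordinates."""
    return {(i, j, k): (x * y * z) % 2
            for (i, x), (j, y), (k, z) in itertools.product(
                enumerate(a), enumerate(b), enumerate(c))}

def add_mod2(t1, t2):
    """Entrywise sum of two tensors over Z2."""
    return {key: (t1[key] + t2[key]) % 2 for key in t1}

random.seed(0)
a  = [random.randint(0, 1) for _ in range(4)]
b  = [random.randint(0, 1) for _ in range(4)]
bp = [random.randint(0, 1) for _ in range(4)]
c  = [random.randint(0, 1) for _ in range(4)]
cp = [random.randint(0, 1) for _ in range(4)]

before = add_mod2(outer3(a, b, c), outer3(a, bp, cp))
after  = add_mod2(outer3(a, b, [(x + y) % 2 for x, y in zip(c, cp)]),
                  outer3(a, [(x + y) % 2 for x, y in zip(b, bp)], cp))
assert before == after  # same tensor, so the scheme stays correct
```

Because the flipped scheme computes the same tensor, a sequence of flips can wander the space of rank-R schemes until a reduction (dropping a product) becomes possible.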
A key strategy employed in this work, following previous research by Arai et al. (Li et al., 1 Jan 2024), is an incremental approach. This involves:
- Starting from known good schemes: Using the Moosbauer-Poole schemes for (5,5,5) and (6,6,6) as foundational points.
- Extending schemes: Constructing schemes for larger formats by combining known schemes for smaller formats. For instance, a scheme for multiplying a 5×6 matrix by a 6×7 matrix to get a 5×7 matrix (format (5,6,7)) can be built from a scheme for (5,6,6) and a scheme for (5,6,1). This typically involves treating matrices as blocks and applying the smaller schemes to these blocks. If a scheme for (n,m,p) has rank R_p and a scheme for (n,m,q) has rank R_q, a scheme for (n,m,p+q) can be obtained by applying the (n,m,p) scheme to the first p columns of the result and the (n,m,q) scheme to the last q columns, giving a combined rank of R_p + R_q.
- Restricting schemes: Conversely, obtaining starting points for smaller formats by effectively "embedding" them within a larger scheme. This is done by setting some variables (matrix entries) to zero. For example, a scheme for (3,3,3) can yield a scheme for (2,3,3) by setting the entries of the third row of the first matrix and the third row of the result matrix to zero. This restriction allows the flip graph search to begin from a potentially good starting point for the smaller format, rather than the standard, high-rank scheme.
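Both operations can be sketched on schemes stored as lists of coefficient triples, one (u, v, w) triple per scalar multiplication. The helper names below are illustrative, not from the paper, and the naive scheme stands in for any correct scheme:

```python
# Schemes as lists of (u, v, w) coefficient triples, one per scalar
# multiplication; each dict maps a matrix entry to its coefficient.
def naive_scheme(n, m, p):
    """Naive (n,m,p) scheme: one product a_ik * b_kj per term of C."""
    return [({(i, k): 1}, {(k, j): 1}, {(i, j): 1})
            for i in range(n) for k in range(m) for j in range(p)]

def extend_columns(terms_p, terms_q, p):
    """(n,m,p) scheme + (n,m,q) scheme -> (n,m,p+q) scheme: apply the
    second scheme to the last q columns of B and C, shifted by p."""
    shifted = [(u,
                {(k, j + p): c for (k, j), c in v.items()},
                {(i, j + p): c for (i, j), c in w.items()})
               for u, v, w in terms_q]
    return terms_p + shifted      # ranks add: R_p + R_q

def restrict_rows(terms, n_new):
    """Zero rows >= n_new of A and C and drop dead products; gives a
    starting point for the smaller format (n_new, m, p)."""
    out = []
    for u, v, w in terms:
        u2 = {ik: c for ik, c in u.items() if ik[0] < n_new}
        w2 = {ij: c for ij, c in w.items() if ij[0] < n_new}
        if u2 and w2:             # product still contributes
            out.append((u2, v, w2))
    return out

# Extension: (5,6,6) + (5,6,1) -> (5,6,7); naive ranks 180 + 30 add up.
combined = extend_columns(naive_scheme(5, 6, 6), naive_scheme(5, 6, 1), 6)
assert len(combined) == 5 * 6 * 6 + 5 * 6 * 1

# Restriction: (3,3,3) -> (2,3,3) by zeroing the third rows of A and C.
assert len(restrict_rows(naive_scheme(3, 3, 3), 2)) == 2 * 3 * 3
```

The same transformations apply unchanged to low-rank schemes; the point of restriction is that the surviving products usually number far fewer than the naive count for the smaller format.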
The authors systematically explored various matrix formats (n,m,p) where n≤m≤p, focusing on formats near (5,5,5) and (6,6,6). They followed paths of extensions and restrictions (illustrated in Fig. 1), using the best known schemes from one format as starting points for flip graph searches on neighboring formats.
The results, summarized in the table below, show new record-low ranks for many formats compared to the previous best known results (according to Sedoglavic's table [fastmm] as of April 2025).
| format | naive rank | previous record | our rank |
|---|---|---|---|
| (4,5,5) | 100 | 76 | 76 |
| (4,5,6) | 120 | 93 | 90 |
| (4,5,7) | 140 | 109 | 104 |
| (4,6,6) | 144 | 105 | 106 |
| (5,5,6) | 150 | 116 | 110 |
| (4,6,7) | 168 | 125 | 123 |
| (5,5,7) | 175 | 133 | 127 |
| (5,6,6) | 180 | 137 | 130 |
| (4,7,7) | 196 | 147 | 144 |
| (5,6,7) | 210 | 159 | 150 |
| (5,7,7) | 245 | 185 | 176 |
| (6,6,7) | 252 | 185 | 183 |
| (6,7,7) | 294 | 215 | 221 (over Z₂) |
New records were set for all listed formats except (4,5,5), where the previous record of 76 was matched, and (4,6,6) and (6,7,7), where the new Z₂ rank is higher than the previous integer record.
Practical Implementation and Application:
Implementing these schemes involves representing the matrix multiplication C=A⋅B (where A is n×m, B is m×p, and C is n×p) as a sequence of R scalar multiplications and numerous additions/subtractions. A scheme of rank R is typically defined by three sets of coefficients (often represented as tensors U,V,W). For the r-th multiplication (1≤r≤R), you compute:
- L_r = Σ_{i=1}^{n} Σ_{j=1}^{m} u_{r,ij} a_{ij} (a linear combination of entries from matrix A)
- R_r = Σ_{k=1}^{m} Σ_{l=1}^{p} v_{r,kl} b_{kl} (a linear combination of entries from matrix B)
- P_r = L_r · R_r (the scalar product)
Finally, the entries of the result matrix C are computed as linear combinations of these products:
c_{ab} = Σ_{r=1}^{R} w_{r,ab} P_r
Implementing such a scheme requires:
- Storing the coefficients u_{r,ij}, v_{r,kl}, and w_{r,ab}. These can be numerous, especially for larger n, m, p, R.
- Efficiently computing the R linear combinations L_r and R_r. This involves many additions and subtractions.
- Performing the R scalar multiplications P_r.
- Efficiently computing the n·p linear combinations for the entries c_{ab}.
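The evaluation steps above can be sketched directly. The snippet below applies a generic rank-R scheme given as flat coefficient matrices, checked against Strassen's classical rank-7 scheme for (2,2,2) (the coefficients are Strassen's, not a scheme from this paper):

```python
# Evaluate a rank-R scheme: R linear combinations of A's entries, R of
# B's entries, R scalar products, then linear combinations giving C.
def apply_scheme(U, V, W, A, B, n, m, p):
    a = [A[i][j] for i in range(n) for j in range(m)]   # flatten A row-major
    b = [B[k][l] for k in range(m) for l in range(p)]   # flatten B row-major
    R = len(U)
    P = []
    for r in range(R):
        L_r = sum(U[r][t] * a[t] for t in range(n * m))
        R_r = sum(V[r][t] * b[t] for t in range(m * p))
        P.append(L_r * R_r)                             # one scalar product
    return [[sum(W[r][i * p + j] * P[r] for r in range(R))
             for j in range(p)] for i in range(n)]

# Strassen's rank-7 scheme: rows index the 7 products, columns the
# flattened entries (a11, a12, a21, a22), (b11, ...), (c11, ...).
U = [[1,0,0,1],[0,0,1,1],[1,0,0,0],[0,0,0,1],[1,1,0,0],[-1,0,1,0],[0,1,0,-1]]
V = [[1,0,0,1],[1,0,0,0],[0,1,0,-1],[-1,0,1,0],[0,0,0,1],[1,1,0,0],[0,0,1,1]]
W = [[1,0,0,1],[0,0,1,-1],[0,1,0,1],[1,0,1,0],[-1,1,0,0],[0,0,0,1],[1,0,0,0]]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert apply_scheme(U, V, W, A, B, 2, 2, 2) == [[19, 22], [43, 50]]
```

Published schemes for the formats in the table above plug into the same interface: only the coefficient matrices and the dimensions change.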
The schemes found in this paper were initially discovered over the field Z₂ (coefficients are 0 or 1). For practical use with integer or floating-point matrices, these schemes need to be valid over the integers (or rationals/reals). The authors applied Hensel lifting to translate the Z₂ schemes to schemes with integer coefficients. They found that many, but not all, of the minimal Z₂ schemes could be lifted. For formats like (4,5,5), the schemes with the lowest Z₂ rank (74) could not be lifted, and they had to use schemes with a slightly higher rank (76) that could be lifted to integers. This highlights a practical challenge: a low rank over a specific field doesn't guarantee the same rank, or even validity, over other fields.
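Why lifting is needed at all can be illustrated by checking a scheme's defining equations over Z versus mod 2: a flipped sign is invisible mod 2 but breaks the scheme over the integers. A sketch under that framing (helper names are illustrative; this only tests validity, it does not perform Hensel lifting):

```python
# A scheme is valid iff the sum of its rank-one tensors equals the
# matrix multiplication tensor. Coefficients differing by a sign
# coincide mod 2, so Z2-validity does not imply Z-validity.
import itertools

def scheme_tensor(terms, n, m, p):
    """Sum of rank-one tensors u (x) v (x) w, as a coordinate dict."""
    T = {}
    for u, v, w in terms:
        for (i, k), (kk, j), (ii, jj) in itertools.product(u, v, w):
            key = (i, k, kk, j, ii, jj)
            T[key] = T.get(key, 0) + u[(i, k)] * v[(kk, j)] * w[(ii, jj)]
    return T

def matmul_tensor(n, m, p):
    """Target tensor of c_ij = sum_k a_ik * b_kj."""
    return {(i, k, k, j, i, j): 1
            for i in range(n) for k in range(m) for j in range(p)}

def is_valid(terms, n, m, p, mod=None):
    T, target = scheme_tensor(terms, n, m, p), matmul_tensor(n, m, p)
    for key in set(T) | set(target):
        diff = T.get(key, 0) - target.get(key, 0)
        if (diff % mod if mod else diff) != 0:
            return False
    return True

# Naive (2,2,2) scheme: valid over Z and over Z2.
naive = [({(i, k): 1}, {(k, j): 1}, {(i, j): 1})
         for i in range(2) for k in range(2) for j in range(2)]
assert is_valid(naive, 2, 2, 2) and is_valid(naive, 2, 2, 2, mod=2)

# Flip one sign: still valid mod 2, broken over Z.
broken = list(naive)
u, v, w = broken[0]
broken[0] = ({key: -c for key, c in u.items()}, v, w)
assert is_valid(broken, 2, 2, 2, mod=2)
assert not is_valid(broken, 2, 2, 2)
```

Hensel lifting searches for integer (more precisely, 2-adic) coefficients that reduce to the given Z₂ scheme and satisfy these equations exactly; as the paper notes, such a lift does not always exist.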
The computational cost of finding these schemes is substantial (days on high-core machines per search path). However, the computational cost of using a found scheme is determined by the number of scalar multiplications (R) and the total number of scalar additions/subtractions involved in computing the linear combinations. While the number of multiplications is reduced compared to the standard algorithm, the number of additions/subtractions typically increases.
These low-rank schemes are most practically applied within highly optimized linear algebra libraries (like BLAS implementations). For larger matrix multiplications, these small-sized fast schemes are often used recursively in a block-wise manner (e.g., using a (2,2,2) Strassen scheme recursively for large matrices). The schemes found in this paper for formats like (5,5,5), (5,6,7), etc., provide more base-case options for such recursive algorithms or can be used directly for operations involving matrices of these specific small dimensions.
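As an example of the recursive block-wise use mentioned above, here is a minimal sketch of Strassen's (2,2,2) scheme applied recursively to square matrices of power-of-two size (the cutoff and helper names are illustrative; production libraries fall back to tuned kernels below the cutoff):

```python
# Recursive block use of a small scheme: Strassen's (2,2,2) scheme on
# half-size blocks, recursing until a cutoff, for n x n with n = 2^k.
def mat_add(X, Y, sign=1):
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def naive_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def strassen(A, B, cutoff=2):
    n = len(A)
    if n <= cutoff:
        return naive_mul(A, B)
    h = n // 2
    q = lambda M, r, c: [row[c*h:(c+1)*h] for row in M[r*h:(r+1)*h]]
    A11, A12, A21, A22 = q(A,0,0), q(A,0,1), q(A,1,0), q(A,1,1)
    B11, B12, B21, B22 = q(B,0,0), q(B,0,1), q(B,1,0), q(B,1,1)
    M1 = strassen(mat_add(A11, A22), mat_add(B11, B22), cutoff)
    M2 = strassen(mat_add(A21, A22), B11, cutoff)
    M3 = strassen(A11, mat_add(B12, B22, -1), cutoff)
    M4 = strassen(A22, mat_add(B21, B11, -1), cutoff)
    M5 = strassen(mat_add(A11, A12), B22, cutoff)
    M6 = strassen(mat_add(A21, A11, -1), mat_add(B11, B12), cutoff)
    M7 = strassen(mat_add(A12, A22, -1), mat_add(B21, B22), cutoff)
    C11 = mat_add(mat_add(M1, M4), mat_add(M7, M5, -1))
    C12 = mat_add(M3, M5)
    C21 = mat_add(M2, M4)
    C22 = mat_add(mat_add(M1, M3), mat_add(M6, M2, -1))
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

The new schemes play the same role as the (2,2,2) base case here: a (5,6,7) scheme of rank 150, say, multiplies 5×6 by 6×7 blocks with 150 block multiplications instead of 210, and those block products can themselves recurse.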
The main benefits for real-world applications are potential performance improvements for matrix multiplications of these exact dimensions or as base cases for larger recursive multiplications, particularly in domains like scientific computing, numerical simulations, and potentially parts of machine learning inference where fixed small matrix operations occur. The trade-off is increased code complexity, potentially higher register pressure, and less straightforward memory access patterns compared to the standard algorithm.
The resulting schemes are made publicly available, allowing practitioners to incorporate them into specialized libraries or applications where these specific matrix sizes are critical performance bottlenecks.