Matrix multiplication is a fundamental kernel in linear algebra, and MKL is one of the leading BLAS libraries for Intel architectures. Here, we present performance comparisons of the GEMM routines on a Dell system with a 3.2 GHz Pentium 4 processor.

MKL SGEMM achieves up to 9 GFLOPS, and Goto's SGEMM up to 11 GFLOPS. If we apply our adaptive Winograd algorithm on top of MKL (W-MKL) and normalize performance using the formula 2N^3/nanoseconds, we achieve up to 10 GFLOPS; on top of Goto's (W-Goto), up to 13 GFLOPS. Notice that the single-precision peak of this processor is 12.8 GFLOPS: since Winograd performs fewer than 2N^3 operations, the normalized rate can exceed the hardware peak. In practice, we achieve an execution time faster than any algorithm based on the classic N^3 algorithm.
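The normalization used throughout can be made concrete: every algorithm is credited with the classic 2N^3 operation count, regardless of how many operations it actually performs. A minimal sketch (the function name is ours, for illustration only):

```python
def normalized_gflops(n, seconds):
    """Rate normalized by the classic matrix-multiply operation count.

    Winograd/Strassen variants perform fewer than 2*n**3 flops, so their
    normalized rate can legitimately exceed the hardware peak, as the
    13 GFLOPS figure above shows on a 12.8 GFLOPS machine.
    """
    return (2.0 * n**3) / seconds / 1e9


# Example: a 4000x4000 GEMM finishing in 12.8 s corresponds to 10 GFLOPS.
print(normalized_gflops(4000, 12.8))
```

This convention makes algorithms with different operation counts directly comparable by execution time: higher normalized GFLOPS always means a faster run on the same problem size.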

MKL DGEMM achieves up to 5.5 GFLOPS. Goto's DGEMM is slightly better for large problems and worse for small ones. If we apply our adaptive Winograd algorithm on top of MKL and Goto's and normalize performance using the formula 2N^3/nanoseconds, we achieve up to 6.5 GFLOPS. Notice that the double-precision peak of this processor is 6.4 GFLOPS. Once again, we achieve an execution time faster than any algorithm based on the classic N^3 algorithm.
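For illustration, one recursion level of Winograd's variant of Strassen's algorithm trades 8 submatrix multiplications for 7 (plus 15 additions); the adaptive codes apply this recursively on top of the tuned GEMM. A minimal NumPy sketch for even-sized square matrices (the names S*, T*, P* are conventional, not the paper's code):

```python
import numpy as np

def winograd_step(A, B):
    """One level of Winograd's variant of Strassen: 7 multiplies, 15 adds.

    Illustrative only: the adaptive implementation recurses and calls a
    tuned GEMM (MKL or Goto's) on the leaf subproblems.
    """
    n = A.shape[0] // 2                      # assumes even, square inputs
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

    # 8 pre-additions on the operands
    S1 = A21 + A22; S2 = S1 - A11; S3 = A11 - A21; S4 = A12 - S2
    T1 = B12 - B11; T2 = B22 - T1; T3 = B22 - B12; T4 = T2 - B21

    # 7 submatrix products instead of 8
    P1 = A11 @ B11; P2 = A12 @ B21; P3 = S4 @ B22; P4 = A22 @ T4
    P5 = S1 @ T1;   P6 = S2 @ T2;   P7 = S3 @ T3

    # 7 post-additions assemble the result
    U2 = P1 + P6; U3 = U2 + P7; U4 = U2 + P5
    return np.block([[P1 + P2, U4 + P3],
                     [U3 - P4, U3 + P5]])
```

The result can be checked against `A @ B`; the extra additions are O(N^2), so the multiplication saving dominates once the submatrices are large enough, which is what the adaptive recursion-point selection exploits.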

With complex elements, we have more options, and most of them outperform the implementations offered by MKL and Goto. Here we experiment with the 3M algorithm: instead of working directly on complex matrices, it splits a complex matrix into its real and imaginary parts and performs 3 SGEMMs (or DGEMMs, see below) and 4 matrix additions, for a total saving of 25% of the operations. With the term 3M-Goto, we identify the 3M algorithm using Goto's SGEMM; with the term 3M-W-Goto, the 3M algorithm applying our Winograd algorithm on top of Goto's SGEMM.
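The 3M trick for C = A·B with complex A and B can be sketched as follows; this is a minimal NumPy illustration of the classic 3M formulas, where the production code would replace each real product with Goto's SGEMM, optionally with the Winograd algorithm on top:

```python
import numpy as np

def gemm_3m(A, B):
    """Complex matrix product via 3 real products instead of 4.

    With A = Ar + i*Ai and B = Br + i*Bi:
      T1 = Ar @ Br
      T2 = Ai @ Bi
      T3 = (Ar + Ai) @ (Br + Bi)
      real(C) = T1 - T2
      imag(C) = T3 - T1 - T2
    """
    Ar, Ai = A.real, A.imag
    Br, Bi = B.real, B.imag
    T1 = Ar @ Br
    T2 = Ai @ Bi
    T3 = (Ar + Ai) @ (Br + Bi)
    return (T1 - T2) + 1j * (T3 - T1 - T2)
```

Expanding T3 shows imag(C) = Ar·Bi + Ai·Br, so the fourth real product of the direct method is traded for cheap O(N^2) matrix additions.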

MKL CGEMM achieves a peak performance of 2.5 GFLOPS (normalized as 2N^3/ns). Goto's implementation achieves consistently better performance, up to 2.8 GFLOPS. We can improve on both by applying our Winograd algorithm, the 3M algorithm, or both, as shown in the performance plot above. Using both 3M and Winograd applied to the three Goto SGEMMs (3M-W-Goto achieves up to 4.2 GFLOPS), we can reduce execution time by up to 40% with respect to MKL CGEMM.

MKL and Goto's ZGEMM achieve a peak performance of 1.4 GFLOPS (normalized as 2N^3/ns). We can improve on both by applying our Winograd algorithm, the 3M algorithm, or both, as shown in the performance plot above. Here, the 3M algorithm splits a complex matrix into two double-precision real matrices and performs 3 DGEMMs and 4 matrix additions (saving 25% of the operations). Using both 3M and Winograd applied to the three DGEMMs, we can reduce execution time by up to 27% with respect to MKL ZGEMM.

- P. D'Alberto and A. Nicolau

**Adaptive Winograd's Matrix Multiplications**

ACM Transactions on Mathematical Software, 2008. PDF - P. D'Alberto and A. Nicolau

**Adaptive Strassen's Matrix Multiplication** (2007)

The 21st International Conference on Supercomputing. PDF - P. D'Alberto and A. Nicolau

**Adaptive Strassen and ATLAS's DGEMM: A Fast Square-Matrix Multiply for Modern High-Performance Systems** (2005)

The 8th International Conference on High Performance Computing in Asia Pacific Region (HPC Asia). PDF | Presentation - P. D'Alberto and A. Nicolau

**Using Recursion to Boost ATLAS's Performance** (2005)

The Sixth International Symposium on High Performance Computing (ISHPC-VI). PDF - P. D'Alberto, G. Bilardi, and A. Nicolau

**Fractal LU-decomposition with partial pivoting**

Technical report. PDF | PS - G. Bilardi, P. D'Alberto, and A. Nicolau

**Fractal Matrix Multiplication: a Case Study on Portability of Cache Performance** (2001)

WAE 2001 PDF | PS | presentation

Copyright (c) 2007, P. D'Alberto, A. Nicolau, and A. Kumar.