## MKL (s/d/c/z)GEMM: Benchmarking

Matrix multiplication is a basic kernel in linear algebra, and MKL is one of the leading libraries for Intel architectures. Here, we present performance comparisons of the GEMM routines on a Dell system based on a 3.2 GHz Pentium 4 processor.

### SGEMM: Single-Precision General Matrix Multiplication

MKL SGEMM achieves up to 9 GFLOPS; Goto's SGEMM achieves up to 11 GFLOPS. If we apply our adaptive Winograd algorithm on top of MKL (W-MKL) and normalize the performance using the formula 2N^3/nanoseconds, we achieve up to 10 GFLOPS; on top of Goto's (W-Goto), we achieve up to 13 GFLOPS. Notice that the peak performance for this application is 12.8 GFLOPS: in practice, we achieve an execution time faster than any algorithm based on the classic N^3 algorithm.
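
The normalization above is worth making explicit: a fast algorithm is charged the classic 2N^3 flop count regardless of how many operations it actually performs, so its normalized rate can exceed machine peak while its execution time is genuinely shorter. A minimal sketch of this (assumed) interpretation:

```python
# Sketch of the "2N^3/nanoseconds" normalization used in the text.
# The function name and example numbers are illustrative, not from the paper.

def normalized_gflops(n, seconds):
    """Effective GFLOPS: classic 2*N^3 flop count divided by wall time."""
    return (2.0 * n**3) / (seconds * 1e9)

# Example: an N=4000 multiply finishing in 12.8 s is rated at about
# 10 normalized GFLOPS, whatever the algorithm actually computed.
rate = normalized_gflops(4000, 12.8)  # ≈ 10.0
```

This is why W-Goto can be reported at 13 GFLOPS on a machine whose single-precision peak is 12.8 GFLOPS: Winograd performs fewer than 2N^3 operations, but the rating still charges it the full count.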

### DGEMM: Double-Precision General Matrix Multiplication

MKL DGEMM achieves up to 5.5 GFLOPS. Goto's DGEMM is slightly better for large problems and worse for small ones. If we apply our adaptive Winograd algorithm on top of either MKL or Goto's and normalize the performance using the formula 2N^3/nanoseconds, we achieve up to 6.5 GFLOPS. Notice that the peak performance for this application is 6.4 GFLOPS. Once again, we achieve an execution time faster than any algorithm based on the N^3 algorithm.
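
For concreteness, one recursion level of the Winograd variant of Strassen's algorithm (7 sub-multiplications and 15 additions instead of 8 sub-multiplications) can be sketched as follows. This is a hypothetical minimal sketch assuming square, even-sized matrices, with NumPy's `dot` standing in for the tuned SGEMM/DGEMM; the adaptive implementation discussed here also handles odd sizes and recursion cut-off.

```python
import numpy as np

def winograd_step(A, B, base=np.dot):
    """One level of the Winograd variant of Strassen:
    7 sub-multiplications (via `base`) and 15 additions/subtractions.
    Assumes square matrices with even dimension."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

    # Pre-additions on A and B
    S1 = A21 + A22; S2 = S1 - A11; S3 = A11 - A21; S4 = A12 - S2
    T1 = B12 - B11; T2 = B22 - T1; T3 = B22 - B12; T4 = T2 - B21

    # 7 recursive products (here delegated to the base GEMM)
    M1 = base(S2, T2); M2 = base(A11, B11); M3 = base(A12, B21)
    M4 = base(S3, T3); M5 = base(S1, T1); M6 = base(S4, B22)
    M7 = base(A22, T4)

    # Post-additions assembling the quadrants of C
    U1 = M1 + M2
    U2 = U1 + M4
    C11 = M2 + M3
    C12 = U1 + M5 + M6
    C21 = U2 - M7
    C22 = U2 + M5
    return np.block([[C11, C12], [C21, C22]])
```

Because one level replaces 8 half-size multiplications with 7, the operation count drops below 2N^3, which is how the normalized rate can exceed the 6.4 GFLOPS double-precision peak.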

### CGEMM: Single-Precision Complex General Matrix Multiplication

With complex elements we have more options, and most of them achieve better performance than the routines offered by MKL and Goto's. Here we experiment with the 3M algorithm: instead of working on complex matrices directly, it splits each complex matrix into its real and imaginary single-precision parts, then performs 3 SGEMM (or DGEMM, see below) and 4 matrix additions, for a total saving of 25% of the operations. With the term 3M-Goto, we identify the 3M algorithm using Goto's SGEMM; with the term 3M-W-Goto, we identify the 3M algorithm applying our Winograd algorithm on top of Goto's SGEMM.
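
The 3M idea is a Karatsuba-style rearrangement: the four real products of a complex multiply are replaced by three. A minimal sketch, with NumPy's `dot` standing in for the real-valued SGEMM (function name is illustrative):

```python
import numpy as np

def gemm_3m(A, B, base=np.dot):
    """3M complex matrix multiply: 3 real GEMMs instead of 4.
    A and B are complex; `base` is the underlying real GEMM."""
    Ar, Ai = A.real, A.imag
    Br, Bi = B.real, B.imag
    P1 = base(Ar, Br)            # real x real
    P2 = base(Ai, Bi)            # imag x imag
    P3 = base(Ar + Ai, Br + Bi)  # (real+imag) x (real+imag)
    # C = (P1 - P2) + i*(P3 - P1 - P2)
    return (P1 - P2) + 1j * (P3 - P1 - P2)
```

The same `base` slot is where Goto's SGEMM (3M-Goto) or a Winograd-accelerated SGEMM (3M-W-Goto) plugs in.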

MKL CGEMM achieves a peak performance of 2.5 GFLOPS (normalized as 2N^3/ns). Goto's implementation achieves consistently better performance, up to 2.8 GFLOPS. We can improve the performance of both by applying our Winograd algorithm, the 3M algorithm, or both, as shown in the performance plot above. Using both 3M and Winograd applied to the three Goto's SGEMM calls (3M-W-Goto achieves up to 4.2 GFLOPS), we can achieve up to a 40% execution-time reduction w.r.t. MKL CGEMM.
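
The 40% figure follows directly from the normalized rates: for a fixed nominal 2N^3 flop count, execution time is inversely proportional to normalized GFLOPS, so the reduction is 1 − 2.5/4.2 ≈ 0.40. As a one-line check (function name is illustrative):

```python
def time_reduction(base_gflops, fast_gflops):
    """Relative time saved when both rates charge the same 2*N^3 count:
    time ∝ 1/GFLOPS, so reduction = 1 - base/fast."""
    return 1.0 - base_gflops / fast_gflops

# MKL CGEMM at 2.5 normalized GFLOPS vs 3M-W-Goto at 4.2:
saving = round(100 * time_reduction(2.5, 4.2))  # ≈ 40 percent
```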

### ZGEMM: Double-Precision Complex General Matrix Multiplication

MKL's and Goto's ZGEMM achieve a peak performance of 1.4 GFLOPS (normalized as 2N^3/ns). We can improve the performance of both by applying our Winograd algorithm, the 3M algorithm, or both, as shown in the performance plot above. The 3M algorithm, instead of working on complex matrices directly, splits each complex matrix into its real and imaginary double-precision parts, then performs 3 DGEMM and 4 matrix additions (saving 25% of the operations). Using both 3M and Winograd applied to the three DGEMM calls, we can achieve up to a 27% execution-time reduction w.r.t. MKL ZGEMM.

## References

1. P. D'Alberto and A. Nicolau. ACM Transactions on Mathematical Software, 2008.

2. P. D'Alberto and A. Nicolau. The 21st International Conference on Supercomputing.

3. P. D'Alberto and A. Nicolau. Adaptive Strassen and ATLAS's DGEMM: A Fast Square-Matrix Multiply for Modern High-Performance Systems. The 8th International Conference on High Performance Computing in Asia Pacific Region (HPC Asia), 2005.

4. P. D'Alberto and A. Nicolau. Using Recursion to Boost ATLAS's Performance. The Sixth International Symposium on High Performance Computing (ISHPC-VI), 2005.

5. P. D'Alberto, G. Bilardi, and A. Nicolau. Fractal LU-Decomposition with Partial Pivoting. Technical report.

6. G. Bilardi, P. D'Alberto, and A. Nicolau. Fractal Matrix Multiplication: A Case Study on Portability of Cache Performance. WAE 2001.


Copyright (c) 2007, P. D'Alberto, A. Nicolau, and A. Kumar.