A while ago I inquired about benchmarks for dgemm. In the meantime my own machine arrived, and it shows some impressive numbers: over 2.3 TFLOPS.

As expected, faster memory helps… This run was with 2×48 GB of memory at 6000 MT/s.

With a not-completely-stable 4×48 GB (192 GB) configuration at 4800 MT/s, it peaks at 2 TFLOPS.

Unfortunately, I can't seem to get an uploaded picture to display… any help?

I see the picture, although it looks a little truncated on all sides.

The x-axis of the graph is n, correct?

How does the scaling look with bigger matrices? In the past I remember this is where MKL took the lead over other BLAS.

Sorry, I did not label the axes: y is TFLOPS, x is the matrix size.

I have now checked bigger matrices as well.
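For anyone who wants to reproduce a TFLOPS-vs-n curve like this, here is a minimal sketch of how it could be measured from Python, assuming NumPy is linked against the BLAS under test (MKL, OpenBLAS, or BLIS); the matrix sizes and repeat counts below are illustrative, not the ones used above.

```python
import time
import numpy as np

def dgemm_tflops(n, repeats=3):
    """Time an n x n double-precision matmul and report TFLOPS."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n))  # float64, so this exercises dgemm
    b = rng.standard_normal((n, n))
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        a @ b  # dispatches to the linked BLAS's dgemm
        best = min(best, time.perf_counter() - t0)
    # dgemm performs 2*n^3 floating-point operations
    # (n^3 multiplies plus n^3 additions)
    return 2 * n**3 / best / 1e12

for n in (1024, 2048, 4096):
    print(f"n={n}: {dgemm_tflops(n):.2f} TFLOPS")
```

Taking the best of several repeats avoids counting first-touch page faults and thread-pool warm-up against the BLAS.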

You are very right. This is where Intel shines and BLIS and OpenBLAS lose! My matrix sizes are small(er), though.

Thanks for sharing this!

AMD has made progress with their homegrown BLAS, AOCL, over the past ~year with the 4.x releases, which add AVX-512 support and target the Zen 4/5 architectures specifically… but I did notice AMD often used very small problems in their comparison benchmarks, which seemed to be the only area where they were competitive with MKL.

My thought is that BLAS performance doesn’t matter much in small problem sizes since they can be solved in seconds; I’m more worried about the large matrices that take days or weeks to solve.

I do realize there is a subset of problems that can be handled by solving many small matrices, where these other BLAS libraries are competitive with MKL, but none of the problems I ever find myself up against are like this.

I was really wondering about the BLIS performance drop for larger matrices. Indeed, when I checked the influence of thread affinity I saw better results: still below the MKL numbers, but definitely better.
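In case it helps others: one way to pin BLAS threads is to set the standard OpenMP / OpenBLAS / BLIS affinity variables before the library loads. A sketch (the thread count of 16 is just an example; adjust to your CCD/core layout):

```python
import os

# These must be set BEFORE numpy (and its BLAS) is imported,
# since the runtimes read them at load / first-parallel-region time.
os.environ.setdefault("OMP_PROC_BIND", "close")      # keep threads near each other
os.environ.setdefault("OMP_PLACES", "cores")         # one thread per physical core
os.environ.setdefault("OPENBLAS_NUM_THREADS", "16")  # example count, not a recommendation
os.environ.setdefault("BLIS_NUM_THREADS", "16")

import numpy as np  # the linked BLAS picks up the settings here
```

Alternatively, launching the benchmark under `numactl --cpunodebind=…` achieves a similar pinning from outside the process.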


The new AGESA 1.2.0.2 update that has recently come out for some motherboards is supposed to cut CCD-to-CCD latency in half; I suspect this update would improve BLAS performance.