Poor MATLAB Performance on Threadripper 5995WX System (Win10)

Hey everyone,

I recently got my hands on a Threadripper 5995WX-based system (Dell Precision 7865) for running large optimization runs in MATLAB using genetic optimization techniques. I was previously using a 6-core Xeon-based system and got the Threadripper system to process multiple optimization runs a lot faster. However, for some unknown reason, the speedup has not been anything close to what I was expecting. On prior computers, I was easily able to achieve nearly 100% CPU utilization (as per Task Manager), whereas on the Threadripper machine, CPU utilization hovers between 15-40% at best. I also timed a whole optimization run, and the time difference isn't much. I am using the Parallel Computing Toolbox in MATLAB and have verified that it is set up to use 64 cores on the Threadripper machine.

Specifications of the computer are below:
CPU - Threadripper 5995WX 64 core
RAM - 256GB (8x32GB)
SSD - 2TB PCIe Gen 4 (OEM Dell one)
GPU - RTX A4500 20GB (Not really used by my code)

I am not sure if this issue is related to MATLAB or Windows 10. I have followed the instructions given on the MATLAB Central page to set up the BLAS and LAPACK libraries on my MATLAB install. For reference, I am using MATLAB R2023a.

I have also installed CorePrio and checked the NUMA Dissociater in an effort to fix the problem, however I haven’t really seen much of an improvement through that. I am also not particularly sure of what exact settings I should be using on it either.

It would be great if someone could help me diagnose and debug this issue. Happy to provide more details & test runs as necessary.

Thanks!

Welcome!

Do you know what BLAS MATLAB is actually using? Even though AMD wrote AOCL, Intel's MKL can often outperform it, even on AMD CPUs, because of the amount of optimization Intel did to the BLAS.
Also, if you are using MKL, you may need to pass an argument to MATLAB on startup to have it use some of the more advanced SIMD instructions to get the most performance.

Also, how much of the optimization run is bottlenecked by a single thread? You could check this with a utility like AMD μProf.

Thanks for the advice!

My MATLAB install is currently running “AOCL-BLIS 3.2.1 Build 20220912” and “AOCL-libFLAME 3.2.1 Build 20221019, supports LAPACK 3.10.0”, as reported by the MATLAB command window. Do you think reverting to Intel MKL will help as compared to these versions of AOCL?

Also, what are the arguments that I need to pass to MATLAB on startup for the SIMD instructions? I can try running those and report back with some test results. Same with AMD’s uProf tool.

It’s probably worth a try to revert to MKL to see if performance increases, but I wouldn’t expect anything too major even if it did increase performance.

The argument to pass to MATLAB to enable more SIMD instructions is really only applicable to AMD CPUs running on the MKL BLAS, since MKL doesn't know what instructions AMD CPUs support unless it is specifically spelled out.

Most users will make a .bat file next to the actual MATLAB executable and populate it with:

@echo off
set MKL_DEBUG_CPU_TYPE=5
matlab.exe

I suspect the optimization run might be falling onto the wrong side of Amdahl's law rather than a BLAS problem, which μProf should help identify.

Thanks! I did try reverting to MKL and tested some more optimization runs - seems like the performance difference is minimal.

However, I did notice something interesting in Task Manager. The MATLAB parallel pool is configured to run with 64 workers with 1 thread each, which corresponds to 64 processes showing up in Task Manager. Out of these, 32 show up as background processes with minimal CPU usage (0.1-0.2%) but proper RAM consumption (~1GB). The remaining 32 are grouped into one section and appear in the foreground Apps section. And out of these 32, only 1 has any significant CPU usage (fluctuating between 10-40%).

I am not sure what exactly to make of this, but decided to put this out here anyway in case someone with more knowledge and experience than me can chime in.

I also performed a system profile with AMD uProf. It reports that the System Idle process [PID:0] occupied over 81% of the total CPU time during the profile, with the remaining MATLAB processes collectively occupying the remaining ~19%.

Any news and updates regarding this problem or its solution? I am considering getting a new Threadripper workstation myself to do some MATLAB work with the Parallel Computing Toolbox, and I’m concerned to hear of this problem!

Some questions (if you don’t mind) :

  1. What exactly is the Parallel Computing Toolbox code that you are running on your Threadripper? Is it the parallel computing option on the MATLAB ga function, or parfor, or batch, or something else? And do you encounter the same problem in all of these different ways that MATLAB does parallel computing? (I use lots of parfor. Do you get this problem with parfor?) Do you also encounter the same problem in cases where MATLAB does not use the Parallel Computing Toolbox but should still use multiple cores of the CPU properly? (e.g., on my 8-core CPU, if you calculate the inverse of a very large matrix using the "inv" function without parfor or the Parallel Computing Toolbox, you should see nearly 80-100% CPU usage. Do you also encounter this problem using "inv" similarly on your Threadripper?)

  2. Is your Dell Threadripper system your own personal computer or is it company/university owned? If it is the latter, just wondering if it is possible the IT admin did something to the computer to limit its performance?

  3. Maybe you should post this on the MATLAB official forum and see what replies you get from experienced MATLAB forum contributors or official MATLAB staff?

  4. Which version of Windows 10 are you using? I think for the 5995WX you should use Win10 Enterprise or Win10 Pro for Workstations (instead of the "normal" Win10 Pro). Did you try disabling simultaneous multithreading?
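For the non-Parallel-Computing-Toolbox test in question 1, something like the following can be run straight from the command window (a sketch; the matrix size is arbitrary):

```matlab
% How many computational threads does this MATLAB session see?
disp(maxNumCompThreads)

% Time a large dense inverse -- inv() calls multithreaded LAPACK
% internally, with no Parallel Computing Toolbox involved. Watch
% Task Manager while it runs; a healthy install should push most
% cores close to 100% on a matrix this size.
A = rand(12000);        % ~1.1 GB of doubles
tic; B = inv(A); toc
```

If this serial test also fails to load the cores, the problem is below the Parallel Computing Toolbox, i.e., in the BLAS/LAPACK or OS layer.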

1 Like

I had just automatically assumed the nature of the problem wasn’t parallelizable when I heard:

but you’re right, it could possibly have something to do with implementation.

I don’t think this is a Windows scheduler problem because of how severely underutilized the cores are.
It’s possible that the Threadripper architecture is the problem, with poor cross-chiplet cache coherency (I could have sworn μProf would warn about that, though). Another way to check this would be to run the code on an Intel system, which tend to have more favorable cache latencies between disparate cores.

Four things:

  1. Remind me, what is the core/CCD/CCX configuration of the 5995WX again? (i.e., does the data have to traverse a CCD/CCX for it to be able to execute?)

  2. In the Parallel Computing Toolbox, make sure that you have set the maximum number of workers/threads to be AT LEAST 64. (I think that the default is 12.) Same thing for your cluster profile manager – make sure that you take the number of workers up from its default as well, if it isn’t already set for 64 PHYSICAL cores.

  3. Disable SMT in the BIOS.

  4. If all else fails, run the cluster Admin Center: matlabroot\toolbox\parallel\bin\admincenter.bat (on Microsoft Windows operating systems).

Set up your cluster resources so that you have 64 individual workers running 1 thread per worker (each).

By default, the Parallel Computing Toolbox uses OpenMP for its parallelisation rather than MPI whereas when you spawn the workers via admincenter, it will use MPI instead. And then set your cluster profile in your cluster profile manager to be the profile that was created via admincenter.

Give that a shot and let me know if that works out for you.
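For point 2, the worker cap can also be checked and raised programmatically rather than through the GUI (a sketch; the default profile is named 'Processes' on recent releases, 'local' on older ones):

```matlab
% Inspect the default process-based cluster profile
c = parcluster('Processes');     % use 'local' on older releases
disp(c.NumWorkers)               % default cap is often well below 64

% Raise the cap, persist it, and start a 64-worker pool
c.NumWorkers = 64;
saveProfile(c);
pool = parpool(c, 64);
```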

2 Likes

Hi, sorry for leaving this thread hanging for a bit. I have been running a bunch of tests that often got muddied by random issues, since it’s a shared system and is in use by other people as well.

To answer @alch’s questions:

  1. I have an optimization routine that uses Purdue’s GOSET toolbox (see: Genetic Optimization System Engineering Toolbox 2.6 - Elmore Family School of Electrical and Computer Engineering - Purdue University) that internally uses MATLAB’s Parallel computing toolbox to parallelize optimizations. I generally don’t use Parallel computing toolbox elsewhere, so cannot really attest to its behaviour in other applications.

Prior to acquiring this Threadripper workstation, I tested my optimization code across 2 different Intel-based machines - a regular desktop with 6 cores (I forget the exact specs, but it was a 10th-gen i7 with 16GB RAM) and a workstation with dual 8-core Xeons and 256GB RAM. I noted nearly perfect scaling of the computation time with the number of cores, at least for these 2 machines - and I assumed the scaling would continue up to 64 cores. Genetic algorithms operate on very large population sizes (in my case, 10,000 to 15,000 individuals in one generation), so it should be fairly easy to achieve good parallelization.

  2. It is a university-owned computer. As far as I know, IT admin hasn’t changed any configurations on the computer from what it shipped with. I have admin access to the machine and its BIOS for changing any settings if necessary.

  3. I have seen previous threads on the official MATLAB forums discussing similar issues. One common reply that I have seen is to disregard the use of the GA function and to use a different optimization method (which is completely irrelevant to the processor issue at hand). Also, GOSET isn’t officially developed by Mathworks, so I doubt any support will come from any Mathworks staff on this.

  4. Currently using Windows 10 Pro (doesn’t specify Workstation anywhere). Does the Windows edition have a big impact on performance? With regards to SMT, I did notice some performance improvement by disabling SMT; still in the process of running tests to fully validate that.

@twin_savage I did run this code on a couple of Intel-based machines (one 6-core and the other 16-core) and I did see near-perfect CPU utilization and really good scaling when moving from the 6- to the 16-core machine. This led me to believe that at least it wasn’t the parallelizable nature of the code that was the issue.

@alpha754293 I believe the 5995WX is an 8-CCD architecture? I have edited the maximum number of workers to be 64 (and I can see on the bottom left that the parallel pool generated has 64 workers). SMT is disabled; I did see a small performance bump with that, and am still running a few tests to fully quantify the performance.

I had no idea about the Admin Center and OpenMP vs MPI situation with MATLAB, will try your suggestions and report back.
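In the meantime, the pool’s actual worker/thread configuration can be double-checked from the command window (a sketch; spmdIndex needs a recent release, labindex on older ones):

```matlab
% Get a handle to the running pool (start one if none exists)
pool = gcp('nocreate');
if isempty(pool)
    pool = parpool(64);
end
fprintf('workers in pool: %d\n', pool.NumWorkers);

% Ask each worker how many computational threads it was given --
% for a 64-worker x 1-thread setup this should print 1 on each
spmd
    fprintf('worker %d: %d thread(s)\n', spmdIndex, maxNumCompThreads);
end
```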

1 Like

If you need more specific deployment instructions on how to get and use Admin Center (to spawn individual workers which will work together in the parallel computing pool, rather than spawning, say, 64 workers in one large pool) – please let me know and I can give you my deployment notes for that.

I have found that when I am just doing a simple left-divide solve of Ax=b (i.e., A\b), using Admin Center and spawning n individual workers can be up to 40% faster than just clicking the bottom-left button of your MATLAB environment to start a parallel pool (which some codes, when they start a parallel task, will effectively do the same thing, based on the settings and configuration of your parallel computing environment).

(The speed up varies depending on the workload.)

At the very least, if you deploy the workers via admin center and it doesn’t yield much of a benefit – so long as it’s not hurting your performance as well, then you’ve got nothing to lose by trying.
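For anyone wanting to reproduce that comparison, a minimal harness for the A\b workload described above might look like this (a sketch; the matrix size and iteration count are arbitrary):

```matlab
% Serial baseline: one left-divide solve (this is what "A\b" means)
n = 4000;                          % size is arbitrary
A = rand(n); b = rand(n, 1);
tic; x = A \ b; t_serial = toc;

% 64 independent solves farmed out to the pool workers -- the shape
% of workload that should scale across all cores of the 5995WX
tic;
parfor k = 1:64
    y = rand(n) \ rand(n, 1);      %#ok<NASGU> each worker does one solve
end
t_pool = toc;
fprintf('one solve: %.2fs, 64 solves via parfor: %.2fs\n', t_serial, t_pool);
```

Run it once against the default pool and once against an Admin Center-spawned cluster profile and compare the two parfor timings.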

Thanks.

1 Like

@alpha754293 that would be fantastic, please share your deployment notes and I can try that out and report back with my results.

Two dumb questions:

  1. For the work that you are doing, do you HAVE to use Windows (10), or are you able to switch to something else like CentOS 7.7.1908? (Though I do recognise that with your CPU, that OS might be too old for you, so if you’re stuck with Windows – that’s not a problem.)

  2. Do you have admin rights to your system? (based on your work’s IT policy)

Thanks.

(My original deployment notes were for CentOS, but I am spinning up a Windows 10 VM, so I can take the screenshots, etc.)

So this may be unrelated to the parallelization issues, but I can’t help but ask: is it the MKL BLAS that is limiting your parallel pool? MATLAB’s best performance is effectively vendor-locked, period, because Intel’s MKL and other programs generated by the Intel C++ Compiler and the Intel DPC++ Compiler improve performance with a technique called function multi-versioning: a function is compiled or written for many of the x86 instruction set extensions, and at run time a “master function” uses the CPUID instruction to select the version most appropriate for the current CPU. On a non-Intel CPU, I could only ever get MATLAB to use up to 16 threads. As long as the master function detects a non-Intel CPU, it almost always chooses the most basic (and slowest) version, regardless of what instruction sets the CPU claims to support, and regardless of the force override, as this dispatch has priority over anything you configure MATLAB with via arguments. Could this be interfering with the parallel pool?

I could be way off base, though. It’s been some time since I used MATLAB. I got fed up with its license cost and switched to Python. If I need performance and multithreaded concurrency, I use Golang and Goecharts. Northwestern has a great article discussing it lightly (Data Science and the Go Programming Language: School of Professional Studies | Northwestern University)

That said, don’t change your workflow if it’s working. I was just defending why I no longer use MATLAB. MATLAB has a lot of strengths, but I always disliked how its performance was locked into CUDA and Intel.

I hope I’m wildly off base and y’all find a solution.

I’ve run it on my AMD Ryzen 9 5950X cluster with 32 processes without any issues.

The 5950X itself is a 16-core/32-thread CPU (but I disable SMT in the BIOS), but via admincenter, I am able to create a cluster, have my cluster headnode be the MJS and I think also run the MCE, and then my compute nodes are the workers. And then in the MATLAB environment/window itself, I make sure that the cluster profile is what’s used (rather than local) and also check to make sure that the number of workers, the number of threads, and the number of threads per worker are set in accordance with that.

(Huh. It looks like that I might not have saved the results from the A\b benchmarking that I ran on it. Whoops.)

Python CAN be good for SOME things, but my current experience with it is that I spend more time, just trying to get the damn thing to compile, than actually doing any useful work with it. With MATLAB, I am less likely to run into that.

(I have found this to be true with a lot of technical computing applications where basically, the licensing cost is the cost to offload/outsource making sure it works to somebody else and that’s what I have found, that the licensing cost pays for.)

I’ve also tried GNU Octave as an alternative to MATLAB, but there are a lot of features and functions that it can’t/doesn’t do natively, so a low-to-moderate effort was spent trying to “convert” some of the MATLAB functions to Octave, and in the end, again, I ended up doing more of that than any, actual, useful work.

(IIRC, there WAS a way to get Octave to use MPI as well, but I never quite got that far in terms of trying to implementing that.)

For me, it’s a love-hate relationship.

Love it because it DOES speed up computations that can be sped up.

Hate it (but I don’t blame MATLAB for this one, but rather AMD) because their competing standard just, well…couldn’t compete.

(As far as I can tell, pretty much no commercial application for technical computing really optimises for OpenCL GPU computations. Last I read about any attempts/experiments for doing so, it still showed that it was woefully lacking behind CUDA.)

1 Like

*Interpret successfully not compile

Python interprets each line. Its code isn’t compiled and executed immediately as a binary. That said, this is entirely okay and fair reasoning to avoid making that switch. If one’s experience isn’t deep enough in the code or particular language, then it makes no sense to use it for their workflow. MATLAB has this as a strength, as you said. It’s definitely a nice suite. There are days I do miss some things.

Huh, you might be on to something. Maybe MATLAB doesn’t handle the CCXs appropriately. I wonder if it has to do with the shared memory allocation and specific NUMA configuration of AMD’s TR? Or rather, getting MATLAB to see that properly, since MATLAB is optimized for Xeons.

This is partially true. There are things that do, but they are highly, highly niche. As in HPC-supercomputer-level niche, so you may as well be correct.

The bug report that we filed with the developers would beg to differ.

(cf. GitHub - modelon-community/Assimulo: Assimulo is a simulation package for solving ordinary differential equations.)

Varies.

Sometimes, open source tools/codes/applications have the benefit that they are open source.

Sometimes, it’s some manager, that’s looking at managing their budget.

Other times, the open source stuff might actually be what’s on the bleeding edge (whereas some commercial applications, because they need to be stable and “not break”, might lag behind by a bit/a while).

It varies.

In my experience, it is neither.

IIRC, the Ryzen 9 5950X has two CCDs and one I/O die. Therefore, I would be better able to abide by this hypothesis if I had run into the same issue with said Ryzen 9 5950X, given the multi-CCD nature of the processor, but alas, I did not.

Conversely though, without seeing (and running) @ProAtOverthinking’s code, I won’t really be able to properly try and help diagnose it.

But in any case, generally, MPI is faster (by up to 40%) running the same code vs. OpenMP (which is how the default parallel pool kicks off, as a shared-memory pool).

(LS-DYNA does the same thing, NASTRAN does the same thing, ABAQUS does the same thing, and Ansys (Workbench) mechanical does the same thing.)

As such, I will only run something using OpenMP/SMP if and only if I have to – but that is also usually indicative of a problem with my case setup than anything else.

So, unless there is an error that @ProAtOverthinking encounters, it doesn’t hurt to try and see if we can make/force MATLAB to kick off the parallel pool using MPI rather than SMP. I haven’t come across a scenario/case where running the MPI pool won’t work, but that doesn’t necessarily mean that there aren’t any. I just haven’t personally run into that yet.

From the last time that I looked at OpenCL acceleration – and it has been a while, but the last time that I looked at it, the impression I got was that if you were developing for OpenCL, you were DELIBERATELY and PURPOSELY developing for OpenCL.

Also IIRC, BEST CASE, OpenCL was ~70% the performance of CUDA for relatively standard BLAS routines.

And then Nvidia would come out with a new card and welp – there goes any potential advantage that would’ve been had with using said AMD card.

It’s not true Python. It’s Cython: a Python interpreter that converts your code line by line to C code and then compiles it. Or, optionally, you can even call C functions and C static typing in the very same script to optimize select parts of the code. So yes, you’re correct that yours was compiled, but not because of the nature of Python.

To not muck up the thread further, I’ll table the conversation, but I look forward to seeing what a potential solution may be.

In addition to the author’s question, does anyone know the actual status of AVX-512 support by MATLAB, especially on Zen 4 processors? I’m going to build a 7975WX system, and I sometimes use MATLAB myself. The information obtained from Google is contradictory.

I suppose that would depend on what version of MATLAB we’re talking about, but I know MKL has supported AVX-512 since at least the 2019 release and MATLAB keeps up with MKL releases.

You should be able to query the version of BLAS and its supported instruction set (on the current system) in MATLAB using the version -blas command.
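For example, from the command window (the output should resemble the AOCL/MKL build strings quoted earlier in the thread):

```matlab
% Report the BLAS build the current MATLAB session is linked against
version -blas

% LAPACK can be queried the same way
version -lapack
```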


As an aside to the aside, I’ve recently been investigating the “new” AOCL vs MKL performance differences. Preliminarily, it looks like AMD might finally, after 20 years of being significantly behind Intel, have a competitive product as of release 4.1. Makes me wonder if swapping over to AOCL from MKL could solve OP’s problems.

Probably later today I’m going to post a compute benchmark people can run to investigate themselves, it’ll have

MKL 2022.2.1.0
AOCL 4.1.1
Apple vecLib
OpenBLAS 0.3.25
Arm-Performance-Libraries_23.10

BLASes in it to test against each other under the same compute workload; it would be super interesting to see how Apple silicon or the Ampere Linux machines compete against Threadripper or Xeon on a real scientific workload and not something synthetic.

2 Likes

Just a quick update for you -

  1. I am still in the process of writing the deployment notes. (I thought that I had it, but in checking my OneNote, it appears that I don’t.)

  2. It also appears that Mathworks either changed the set up a bit or that when my IT pushes the MATLAB installer package to my work laptop, it does so with the “-silent” option which means I don’t really see what it does – I just know that it does it.

So, that being said, I did some googling and found out that for you to use the admincenter, you will need to have a license to the MATLAB Parallel Server for it to work.

You can check with your IT department about what kind of licenses you have.

  3. So long as this pre-requisite is met, then the PowerPoint that I am putting together (working through some issues with it) should work for you.

Thanks.

edit
Okay.

I’ve uploaded the PowerPoint file here now.

Take a look at it and hopefully, these instructions will work for you.

Let me know if you have any questions.

Thanks.

1 Like