Hi, this kind of thing is my bread and butter at work.
As a developer writing parallel multi-threaded code, you write your code as usual for the most part, but you might be able to squeeze more performance out of some multi-threaded / multi-core parallel workloads if you detect what CPU you’re running on and adapt your approach.
NUMA and related topology effects practically come down to three things:
- core<->RAM latency across sockets (servers only)
- core<->core cache latencies
- performance-optimized vs efficiency-optimized cores (laptop/desktop only)
Good news is that you can measure how much each of these effects hurts you, thanks to hardware instrumentation.
And you can either constrain a thread to run only on a particular core, or give the OS hints about what kind of core you’d like. You’re also typically able to change this over the lifetime of your thread.
The first one is a big deal for servers, if your code is running on a multi-socket system, e.g. a dual-socket EPYC server. It’s up to the kernel to ensure that a thread allocating memory (e.g. via mmap) gets memory close to the thread asking for it. There are some additional kernel APIs (mbind, set_mempolicy) that can be used to tune this, but that’s about it.
In general you should use the pthreads API to set an affinity mask on your threads, instructing the kernel to limit them to one socket, before loading your starting dataset, in order to minimize cross-socket communication.
…and if you still get memory far away: file bugs with your in-house sysadmins, your in-house malloc developers or performance team, and your in-house kernel developers… or, if you don’t work for one of the E-Corp-like companies where you have these folks on hand, ask on mailing lists and rummage through existing bug trackers.
If you want to share data across sockets, ask yourself (in order):
- can you split the dataset further so that it fits on one socket (in cases where the data is naturally shardable)?
- can you load 2 copies of it, or is that prohibitively expensive?
- can you schedule your workload across threads to limit cross-socket communication, e.g. run multiple threadpools, each constrained to one socket?
- how much do RAM/cores cost relative to your time? E.g. is it worth spending a month for a 1% performance improvement? How about 10%? About the only solution that takes less effort to deploy is a change in job shape, if this is running in the cloud on VMs: just ask for fewer cores per replica of your job, which gives the cloud scheduler the ability to fit you onto a single socket.
On single-socket systems, RAM latency will sometimes differ across cores on some CPUs, but there’s nothing you can do about it.
On Linux you can use the perf tool to look at various hardware / PMU counters to see how many data cache hits and misses you have in each cache level… There are lots of numbers in there, and some people have made careers out of optimizing code in assembly while staring at those numbers, but the perf tool ships short explanations of what these numbers mean, too.
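For reference, the usual entry points look something like this (`./myapp` is a placeholder for your binary; perf itself comes from the linux-tools package and the exact event names vary by CPU):

```shell
# Whole-run counts of cache references and misses:
perf stat -e cache-references,cache-misses ./myapp

# Attribute cache misses to functions in your code:
perf record -e cache-misses ./myapp
perf report

# List the counters your CPU exposes, with short descriptions:
perf list
```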
As I mentioned before, on single-socket systems there’s generally nothing you can really do about RAM latency: some cores will have higher RAM latency on the AMD -WX CPUs, but otherwise RAM latency will be more or less the same across all cores. As a software developer, though, since those “more detached” cores are there on the -WX parts, it’d be foolish not to take advantage of them.
What’s different with chiplets is core<->core cache latencies.
One thing that helps on AMD, with highly parallelizable callback-like workloads, is to code your threadpools to keep separate queues per CCD and have workers prefer pulling work from their own queue before stealing work from other CCDs’ queues.
Look into cache coherence algorithms, and learn about things like branch caches and prefetchers, to get a feel for which parts of your code are potentially reading from far-away caches.
The OS won’t bounce your threads across cores often; depending on how much work you do per callback, you might be able to refresh the notion of “my queue” relatively infrequently (basically, check whether you’re still on the same core before pulling the next item off the queue).
Performance vs efficiency cores.
This is harder because it’s a new thing on laptop/desktop… how do you decide, between two things a user is waiting on, which one should be the first to be bumped to an efficiency core when the P cores are full?
This isn’t part of my daily concern… and even as a desktop/laptop/phone user my own preferences change a lot. Not sure.
Yeah, you can set a pthread affinity mask on your process, but what then?