Do you need to explicitly code an application for NUMA? Or is it something the OS enables on the system?

I’m looking into parallelisation and utilising all hardware efficiently (if there’s a word for that - maybe high performance computing?) and am puzzled by NUMA. As the title states, I’m wondering if NUMA is a hardware/platform-enabled thing, or if it’s something a developer has to enable or program explicitly.

I’ve heard of NUMA-aware applications and I know that certain workloads benefit while others suffer. I don’t know where to find this kind of information, or which languages support NUMA (if it has to be explicitly programmed).

I was afraid to ask here, but either I really cannot find an answer after searching for days, or the answer is right in front of me and I’m completely missing it… which seems more likely.

1 Like

I don’t really consider myself much of a programmer but here’s my take from an HPC perspective:

The first thing you should identify is whether the problem you are solving is parallelizable in the first place:

  1. Are there dependencies between parts/substeps of the solution, or can the work be done simultaneously without chunks of CPU work relying on the output of other chunks?
    You wouldn’t want to just pipeline a solution across different NUMA nodes; that would typically perform worse than keeping the problem on a single NUMA node.

  2. Memory locality relative to the processes on each NUMA node needs to be taken into account. If work chunks are only loosely reliant on each other in terms of compute, but are hungry for inter-node memory bandwidth or thrash NUMA memory allocation boundaries, then the problem is likely not a good candidate for parallelizing across nodes. If the latter is the case, it may be possible to just duplicate the working memory across NUMA nodes, but this can get prohibitively expensive, because a problem you’re trying to parallelize in the first place is likely large.

The second thing I would look at (though perhaps large code developers wouldn’t, because they want to target as wide a range of computers as possible and approach this differently) is what your target architecture looks like: is it only 2 NUMA nodes, or is it 100? The answer would likely dictate how you manage memory.
Most hardware platforms need to have NUMA exposed to the OS via a BIOS option; otherwise the hardware is presented to the OS/software as “one” big logical NUMA node, and when processes span the hardware NUMA nodes (but don’t know it, because the BIOS is abstracting that information away) a performance penalty can be incurred.
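
If you want to sanity-check what the OS actually sees (as opposed to what the hardware has), numactl --hardware prints the node layout, and from code you can query it via libnuma. A minimal sketch, assuming Linux with libnuma installed (link with -lnuma):

#include <stdio.h>
#include <numa.h>   /* libnuma; link with -lnuma */

int main(void) {
    if (numa_available() < 0) {
        /* Either the kernel lacks NUMA support or the topology isn't exposed. */
        printf("NUMA not available to this process\n");
        return 1;
    }
    /* How many memory nodes the OS actually exposes. */
    printf("configured NUMA nodes: %d\n", numa_num_configured_nodes());
    printf("highest node id:       %d\n", numa_max_node());
    return 0;
}

If this reports a single node on a dual-socket box, that BIOS option is the first thing to check.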

Regarding actual software development: you likely aren’t going to be writing your own solvers, so you’ll be hooking into existing ones (which makes the programming language less relevant); you’ll want to pick optimized solvers and feed them in a way that is conducive to the hardware architecture (NUMA or not).

1 Like

It’s basically down to the kernel and its drivers; however, you might need to interact with the kernel and make it aware of your software’s preferences.

← not a kernel developer

Hi, this kind of thing is my bread and butter at work.

As a developer writing parallel multi-threaded code, you write your code as usual for the most part, but you might be able to squeeze more performance out of some multi-threaded / multi-core parallel workloads if you detect what CPU you’re running on and adapt your approach.

NUMA practically affects three things:

  • core<->RAM latency across sockets (only servers)
  • core<->core cache latencies
  • performance-optimized vs efficiency-optimized cores (only laptop+desktop)

The good news is that you can check how much you’re affected by each of these effects, thanks to some instrumentation.

And you can either constrain a thread to only run on a particular core, or you can give hints to the OS about what kind of core you’d like. You’re also typically able to change this over the lifetime of your thread.


The first one is a big deal for servers, if your code is running on a multi-socket system, e.g. a dual-socket EPYC server. It’s up to the kernel to ensure that memory allocation (mmap) hands out memory close to the thread asking for it. There are some additional kernel APIs that can be used to tune this, but that’s about it.
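
Those “additional kernel APIs” are things like mbind() and set_mempolicy(); libnuma wraps them in something friendlier. A minimal sketch, assuming Linux with libnuma (link with -lnuma) and assuming node 0 is where the threads using the buffer will run - that last part is an assumption you’d verify against your actual topology:

#include <stddef.h>
#include <numa.h>   /* libnuma, a wrapper around mbind()/set_mempolicy(); link with -lnuma */

int main(void) {
    if (numa_available() < 0)
        return 1;

    size_t len = 64UL * 1024 * 1024;    /* 64 MiB working buffer */

    /* Place the pages of this buffer on NUMA node 0.
       ASSUMPTION: node 0 is where the threads touching it are pinned. */
    void *buf = numa_alloc_onnode(len, 0);
    if (!buf)
        return 1;

    /* ... fill and use buf from threads running near node 0 ... */

    numa_free(buf, len);
    return 0;
}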

In general you should use the pthreads API to set an affinity mask on your threads, instructing the kernel to limit them to one socket, before loading your starting dataset, in order to minimize cross-socket communication.
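
A minimal sketch of that, assuming Linux/glibc (pthread_setaffinity_np is non-portable, hence the _np) and assuming logical CPUs 0-7 all sit on socket 0 - check lscpu or numactl --hardware rather than trusting that numbering:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to logical CPUs 0-7.
   ASSUMPTION: those CPUs are all on socket 0 - verify with lscpu. */
static int pin_to_socket0(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 8; cpu++)
        CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

Call something like this at the top of each worker thread before it touches the dataset, so that first-touch page allocation lands on the same socket as the thread.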

…and if you still get memory far away, ask on mailing lists and file bugs with your in-house sysadmins, and/or with your in-house malloc developers or performance team and your in-house kernel developers… or, if you don’t work for one of the E-corp-like companies where you have these folks on hand, ask on mailing lists and rummage through existing bugs.

If you want to share data across sockets, ask yourself (in order):

  • can you split the dataset further, so that it fits on a socket (in cases where data is naturally shardable)?
  • can you load 2 copies of it, or is that prohibitively expensive?
  • can you schedule your workload across threads to limit cross-socket communication, e.g. run multiple threadpools, each constrained to a socket?
  • how much do RAM/cores cost relative to your time? E.g. is it worth spending a month for a 1% performance improvement? How about 10%? About the only solution that takes less effort to deploy is a change in job shape if this is running in the cloud on VMs: just ask for fewer cores per replica of your job to give the cloud scheduler the ability to fit you onto a socket.

On single-socket systems, sometimes the RAM latency will be different across cores on some CPUs, but there’s nothing you can do about it.

On Linux you can use the perf tool to look at various hardware/PMU counters to see how many data cache hits and misses you have in each cache… There are lots of numbers in there, and some people have made careers out of optimizing code in assembly and staring at those numbers, but the perf tool includes a short explanation of what each number means too.
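
For a first pass, something like the following gives a rough hit/miss picture (exact counter names vary by CPU, so check perf list on your machine):

perf stat -e cache-references,cache-misses ./app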


As I mentioned before, on single-socket systems there’s generally nothing you can really do about RAM latency; some cores will have higher RAM latency on AMD WX CPUs, but otherwise RAM latency will be more or less the same across all cores. As a software developer, since these “more detached” cores are there on the WX parts, it’d be foolish not to take advantage of them.

What’s different with chiplets is core<->core cache latencies.
One thing that helps on AMD, with highly parallelizable callback-like workloads, is to code your threadpools to keep a separate queue per CCD and prefer pulling work from their own queue before looking at stealing work from other CCDs’ queues.

Look into and learn about the various cache coherence algorithms, and about things like branch caches and prefetchers, to get a feel for which parts of your code are potentially reading from far-away caches.

The OS won’t bounce your threads across cores often; depending on how much work you do per callback, you might be able to refresh your notion of “my queue” relatively infrequently (basically, check whether you’re still on the same core before pulling the next thing off the queue).
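
A minimal sketch of that check, assuming Linux/glibc and, purely for illustration, 8 cores per CCD (the real core-to-CCD mapping should come from your topology, e.g. lscpu or hwloc):

#define _GNU_SOURCE
#include <sched.h>

/* Which CCD-local queue should this thread prefer right now?
   ASSUMPTION: logical CPUs 0-7 are CCD 0, 8-15 are CCD 1, and so on. */
static inline int my_ccd_queue(void) {
    int cpu = sched_getcpu();   /* logical CPU the thread is currently running on */
    return cpu < 0 ? 0 : cpu / 8;
}

Re-evaluate it once per batch of callbacks rather than per item; the scheduler rarely migrates you, so the answer changes infrequently.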


Performance vs efficiency cores.

This is harder because it’s a new thing on laptop/desktop… how do you decide, between two things that a user is waiting on, which one should be the first to get bumped to an efficiency core when the P cores are full?

This isn’t part of my daily concern… and even as a desktop/laptop/phone user my own preferences change a lot. Not sure.

Yeah you can pthread mask your process, but what then?

2 Likes

Not a developer, so I can only speak to squeezing the max out of a multi-NUMA-node system. If your application can be parallelized, running one instance on each NUMA node gives the max performance in my use cases. So on a dual EPYC 64c system I run 8 instances, each confined to one of the 8 NUMA nodes (both cores and memory); this avoids all the cross-NUMA latencies and can speed up some HPC usage.

I also used it with success on an older dual 22c Intel system running Windows Server, where running 2 instances confined to one socket each gave a 30-50% speedup over 1 instance running across both sockets.

Some apps are more NUMA-aware than others, but nothing in my experience beats hard-pinning the app to one node/socket using numactl, like:
Linux: numactl --cpubind=5 --membind=5 ./app
For Windows there’s start /NODE 0 /AFFINITY "hex of the cores" app.exe, or Process Lasso.

3 Likes