I picked up a 3960X and 128GB of 3200C16 RAM to take part in some Kaggle competitions, which I do in my spare time.
I was incredibly impressed with the performance of the 3960X in XGBoost (which is a perfectly parallelizable workload). I was so impressed that I went for the 3970X upgrade, hoping to see a 33% increase in performance.
What’s weird is that my wall time for exactly the same model training tasks is the same for the 3970X as it is for the 3960X.
E.g. a single model fit took 36 min 30 s on the 3960X and 36 min 15 s on the 3970X.
On one of my grid search tasks that I ran overnight, a stock 3960X took 6 hours 40 mins; the 3970X, overclocked to 4.0 GHz on all cores, did it in 6 hours 15 mins.
I’m perplexed… I’d expected a much bigger uplift out of the box from the additional 8 cores, and it didn’t happen.
Both CPUs were on the same board, with the same cooler and the same RAM at the same speed (3200 MHz) and the same Infinity Fabric ratio (1600).
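To put numbers on the gap, here is a quick sketch that compares the ideal speedup from the core counts with what the quoted wall times show. The helper names (`to_seconds`, `speedup`) are just made up for illustration, and "ideal" assumes perfect scaling with core count, which real workloads rarely reach.

```python
# Compare the speedup the core counts suggest with what the wall times show.
# Timings and core counts are the ones quoted in this thread.

def to_seconds(hours=0, minutes=0, seconds=0):
    return hours * 3600 + minutes * 60 + seconds

def speedup(old_seconds, new_seconds):
    """How many times faster the new run is (1.0 means no change)."""
    return old_seconds / new_seconds

# 3970X cores / 3960X cores, assuming perfect scaling (an idealisation).
ideal = 32 / 24

single_fit = speedup(to_seconds(minutes=36, seconds=30),
                     to_seconds(minutes=36, seconds=15))
grid_search = speedup(to_seconds(hours=6, minutes=40),
                      to_seconds(hours=6, minutes=15))

print(f"ideal:       {ideal:.3f}x")
print(f"single fit:  {single_fit:.3f}x")
print(f"grid search: {grid_search:.3f}x (3970X was also overclocked)")
```

So the single fit improved by under 1% and the grid search by under 7%, against the 33% the core count alone would suggest, and part of the grid search gain may come from the overclock rather than the extra cores.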
I suspect your workload might be memory-bandwidth bound (like many machine learning workloads), so the CPU time on the two chips is being used with different efficiency: more cores, but each one spending more of its time waiting on memory.
You can try running your workload under perf and comparing the various counters.
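A back-of-the-envelope sketch of why that would flatten the scaling. It assumes both chips run quad-channel DDR4-3200 with 64-bit channels, and it computes the theoretical peak only; achievable bandwidth will be lower, and nothing here is measured.

```python
# Theoretical peak DRAM bandwidth, and how it divides across cores.
# Assumption: both CPUs run quad-channel DDR4-3200 with 64-bit (8-byte) channels.
CHANNELS = 4
BYTES_PER_TRANSFER = 8
TRANSFERS_PER_SECOND = 3200e6  # DDR4-3200 means 3200 MT/s

def peak_bandwidth_gb_s(channels=CHANNELS):
    return channels * BYTES_PER_TRANSFER * TRANSFERS_PER_SECOND / 1e9

def per_core_gb_s(cores):
    return peak_bandwidth_gb_s() / cores

for name, cores in [("3960X", 24), ("3970X", 32)]:
    print(f"{name}: {per_core_gb_s(cores):.2f} GB/s per core "
          f"of {peak_bandwidth_gb_s():.1f} GB/s total")
```

The total is identical for both CPUs, so if the 24-core part was already saturating memory, the extra 8 cores just share the same pipe.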
You can also try varying the number of threads you’re allocating.
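Something along these lines would do for the thread sweep. It is only a sketch: `sweep_threads` is a made-up helper, and the commented XGBoost usage assumes the scikit-learn wrapper with its `n_jobs` parameter and your own `X` and `y`.

```python
import time

def sweep_threads(fit, thread_counts, repeats=1):
    """Time fit(n_threads) for each thread count; return {n_threads: best seconds}."""
    results = {}
    for n in thread_counts:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fit(n)
            best = min(best, time.perf_counter() - start)
        results[n] = best
    return results

# Hypothetical usage with XGBoost's scikit-learn wrapper (X and y are your data):
#
#   from xgboost import XGBClassifier
#   def fit(n_threads):
#       XGBClassifier(n_jobs=n_threads).fit(X, y)
#   timings = sweep_threads(fit, [4, 8, 16, 24, 32, 48, 64])

if __name__ == "__main__":
    # Stand-in workload so the sketch runs without XGBoost installed.
    def fake_fit(n_threads):
        time.sleep(0.01 / n_threads)

    for n, secs in sweep_threads(fake_fit, [1, 2, 4]).items():
        print(f"{n:>2} threads: {secs:.4f} s")
```

Use a subsample of the data that still takes a minute or so per fit, otherwise startup overhead will swamp the differences between thread counts.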
This is a good suggestion. Unfortunately I no longer have the 3960X, so I’m working from old wall times.
I can try to plot train time as a function of the number of threads, and the data might show where the slowdown starts.
It’s not the end of the world. I was going to pair the CPU with some GPUs anyway, training the less complex models on the GPU while concurrently training the more complex models on the CPU.
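Before plotting, the timings can be turned into speedup and parallel efficiency, which makes the knee easier to spot. A rough sketch: `scaling_table` and `plateau` are made-up helpers, the 5% threshold is an arbitrary choice, and the example numbers are invented purely to show the output shape.

```python
def scaling_table(timings):
    """From {threads: seconds}, return rows of (threads, speedup, efficiency).

    Speedup is relative to the smallest thread count measured; efficiency is
    speedup divided by the increase in thread count (1.0 = perfect scaling).
    """
    base_threads = min(timings)
    base_time = timings[base_threads]
    rows = []
    for n in sorted(timings):
        speedup = base_time / timings[n]
        efficiency = speedup / (n / base_threads)
        rows.append((n, speedup, efficiency))
    return rows

def plateau(timings, min_gain=0.05):
    """Thread count after which the next step up improves time by less than min_gain."""
    counts = sorted(timings)
    for prev, nxt in zip(counts, counts[1:]):
        if (timings[prev] - timings[nxt]) / timings[prev] < min_gain:
            return prev
    return counts[-1]

# Made-up numbers purely to show the output shape, NOT measurements.
example = {8: 100.0, 16: 55.0, 24: 42.0, 32: 41.5}
for n, s, e in scaling_table(example):
    print(f"{n:>2} threads: speedup {s:.2f}x, efficiency {e:.2f}")
print("plateau at", plateau(example), "threads")
```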
If the plateau is around 24 cores, then that does free up 4 cores per GPU.
Did you use anything special to get XGBoost to use all cores?
What language and what IDE do you use for the analysis work?
I use JupyterLab and could never get anything to use more cores than whatever the system decides automatically.
However, I was able to use NVIDIA CUDA to accelerate my training in certain applications (mostly ones that are CUDA compatible), after a lot of hassle, but had no luck on the CPU side.
I have a 3950X on an X570 board with 128 GB of RAM.
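For the CPU side, one thing worth checking first is how many logical CPUs the notebook kernel is actually allowed to use. A minimal sketch, assuming Python: `usable_cpus` is a made-up helper, and the commented part assumes XGBoost's scikit-learn wrapper, where `n_jobs` sets the thread count.

```python
import os

def usable_cpus():
    """Logical CPUs this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: respects the CPU affinity mask of the current process.
        return len(os.sched_getaffinity(0))
    # Other platforms: fall back to the total number of logical CPUs.
    return os.cpu_count() or 1

n = usable_cpus()
print(f"logical CPUs visible to this process: {n}")

# Assumption: you are using XGBoost's scikit-learn wrapper. X and y below
# stand in for your own data.
#
#   from xgboost import XGBRegressor
#   model = XGBRegressor(n_jobs=n)
#   model.fit(X, y)
```

If the number printed is lower than the thread count of the CPU, the limit is coming from the environment the kernel runs in rather than from the library.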