I ran some R code (training ML models) from an old project that I'd previously run on an i7 4710HQ Windows 7 laptop years ago, and was shocked to find that the Ryzen 7 2700X under Fedora 28 runs it ~20% more slowly, even though it has twice the cores and roughly 1 GHz higher clocks.
I ran a bunch of scenarios on the 2700X, varying the number of threads, Core Performance Boost (CPB), and SMT:
| user (s) | system (s) | elapsed (s) | OS | CPU | Threads | Turbo/CPB | HT/SMT | Load average |
|---|---|---|---|---|---|---|---|---|
| 126.421 | 0.532 | 2275.998 | Fedora 28 | R7 2700X | 8 | ON | OFF | >42 |
| 126.583 | 0.541 | 2324.118 | Fedora 28 | R7 2700X | 7 | ON | OFF | ? |
| 136.636 | 0.574 | 2433.931 | Fedora 28 | R7 2700X | 7 | OFF | OFF | ? |
| 221.49 | 7.48 | 2679.77 | Windows 7 | i7 4710HQ | 7 | ON | ON | ? |
| 155.427 | 0.555 | 3233.690 | Fedora 28 | R7 2700X | 15 | ON | ON | >100 |
| 168.859 | 0.609 | 3247.277 | Fedora 28 | R7 2700X | 15 | ON | ON | ? |
| 173.312 | 0.650 | 3250.313 | Fedora 28 | R7 2700X | 16 | ON | ON | >70 |
| 161.403 | 0.611 | 3270.098 | Fedora 28 | R7 2700X | 14 | ON | ON | ? |
| 162.120 | 0.540 | 3442.758 | Fedora 28 | R7 2700X | 7 | ON | ON | ? |
Note that in some cases the computers were being used for other moderate tasks, so consider these ballpark figures with a generous margin of error.
Uncontrolled variables include:

- Windows 7 vs. Fedora 28.
- R version: R-3.5.0-4.fc28 from the Fedora repo (no custom optimized libraries etc.) now, vs. whatever the then-current version was years ago (I don't remember which).
- Versions of the various packages the code calls: current now vs. years ago.
In summary, disabling SMT got the 2700X to complete more quickly, but still not as quickly as I would expect.
During a few runs I had htop/glances open and happened to notice that the load averages were very high, i.e. far above the 16 (or 8 with SMT off) hardware threads available. Death of performance by context switching?
Considering that the 4710HQ results came from an entirely different, older version of R (plus associated libraries), it's difficult and somewhat unrealistic to compare them directly.
The software may have changed substantially over time, so it's really an apples-to-oranges comparison.
There may be hardware processing differences and software performance regressions, particularly somewhere in the external library code.
Without a more current test on the 4710HQ, you can't and shouldn't draw conclusions or base future decisions on this data.
Turning off SMT shouldn't even be a factor in speeding things up; since it is, I suspect a software optimization problem.
@tkoham R itself is single-threaded, but there are packages like doParallel that register a parallel backend, and packages that implement parallelized code can then take advantage of it.
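For anyone following along, a minimal sketch of what registering such a backend looks like (the worker count of 8 is just an example):

```r
library(doParallel)  # also attaches foreach and parallel

# Register a parallel backend with 8 workers (pick your core count)
registerDoParallel(cores = 8)

# Any foreach-aware package (e.g. caret) will now send work to these
# workers; a plain %dopar% loop uses them directly:
res <- foreach(i = 1:8, .combine = c) %dopar% sqrt(i)

foreach::getDoParWorkers()  # confirm how many workers are registered
```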
If the IPC of both CPUs is around the same, then why would the one with double the cores and 15-50% faster clock speed still be slower?
@catsay Yes, I wish I could run the test again on the 4710HQ. That wasn’t my machine though. I’ll try to get them to run it again.
What is it about SMT that would impact performance so badly?
What does it mean when htop lists processes in a darker color? glances doesn't list those.
When running for n cores, there are n processes listed in both htop and glances, but htop lists an additional n^2 processes in a dark color.
If those n^2 entries are real processes, it might explain the high load averages, the overloaded system, and the slow performance. Pure speculation: maybe some package parallelizes by default, or badly, and ends up "double parallelizing" each already-parallelized worker?
EDIT: The dark entries turn out to be the processes' threads (htop shows threads in a dimmer color by default; the H key toggles their display). n^2 threads running doesn't sound good for performance. Now to figure out why this happens and why SMT makes it worse.
Found what looks like the explanation in the caret changelog:

> train, rfe, sbf, bag and avNNet were given an additional argument in their respective control files called allowParallel that defaults to TRUE. When TRUE, the code will be executed in parallel if a parallel backend (e.g. doMC) is registered. When allowParallel = FALSE, the parallel backend is always ignored. The use case is when rfe or sbf calls train. If a parallel backend with P processors is being used, the combination of these functions will create P^2 processes.
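To make that P^2 failure mode concrete, here's a minimal sketch (not caret's actual internals) of nested forked parallelism on Linux, where each of the P outer workers forks P inner workers:

```r
library(parallel)

P <- 4
pids <- mclapply(seq_len(P), function(i) {
  # Each outer worker forks P more workers: up to P^2 processes alive
  # at once, which is exactly what the load average sees.
  mclapply(seq_len(P), function(j) Sys.getpid(), mc.cores = P)
}, mc.cores = P)

length(unique(unlist(pids)))  # up to P^2 = 16 distinct worker PIDs
```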
My code does use caret, though not the functions mentioned, but this seems to be exactly what is happening.
Since the process count is squared, a higher-core-count CPU gets vastly more overloaded: 15 workers nest into 225 processes, versus at most 49 on the old laptop's 7-thread run. That would also line up with the >100 load averages, and apparently it hits SMT especially hard. At least that's a plausible explanation.
Now I need to find which model/function is causing the nested parallelism. It isn't all of them, and I had to watch htop to catch the squared process count show up.
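In the meantime, a sketch of the usual mitigation, assuming the nesting goes through caret's allowParallel flags (the resampling settings below are placeholders, not my actual setup): parallelize only the outer call and force the inner train() to run serially.

```r
library(caret)
library(doParallel)

cl <- makeCluster(8)   # one worker per physical core
registerDoParallel(cl)

# Outer call (e.g. rfe) keeps the parallel backend...
outer_ctrl <- rfeControl(functions = caretFuncs,
                         method = "cv", number = 10,
                         allowParallel = TRUE)

# ...while the inner train() ignores it, keeping the process count at
# P instead of P^2:
inner_ctrl <- trainControl(method = "cv", number = 10,
                           allowParallel = FALSE)

# ... run the rfe()/train() calls here ...

stopCluster(cl)  # release the workers when fitting is done
```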