Second thoughts about the Ryzen 9 5950X… maybe an Intel Xeon W-1290P instead?

All the CPUs have the required hardware (with, I guess, the exception of the APUs), but something about the motherboards that I don’t fully understand means they may not support ECC.

IMO, ECC is overrated and I never really run it unless the server just happens to have it already.

1 Like

THIS!

ECC is way cheaper than it used to be, but it’s still difficult to find at decent speeds and timings, and you’re losing a lot of optimization for something that’s significantly less likely to occur than one of a thousand other issues.

If it’s there, great. But if seeking it out means paying more money AND losing performance somewhere else at the same time (speed/timings), then no thanks. I’m not running a hospital on an airplane :stuck_out_tongue:

(and for those individuals who are operating hospitals on airplanes… that sounds dangerous. Stop doing that).

2 Likes

It depends on your requirement for data accuracy and stability.

It doesn’t have to be “mission-critical” like a hospital, etc. For example, if you have a rendering machine/farm where a crash in the middle of, say, a 48-hour render would mean starting over, you’d be stupid not to run ECC. Rendering is CPU-heavy, not memory-throughput-heavy, and any crash may lose you days of render time.

If, say, DDR4-2133 is “fast enough” and stability is important, then there’s zero point in buying, say, DDR4-3200 for the cost of DDR4-2133 ECC and running the stability/data-integrity gauntlet.
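
For scale, here’s the raw arithmetic behind that trade-off (peak theoretical numbers only; real workloads rarely hit them):

```python
# Peak theoretical bandwidth of one DDR4 channel: transfer rate (MT/s)
# times the 64-bit (8-byte) bus width. Real-world throughput is lower.
for mts in (2133, 3200):
    print(f"DDR4-{mts}: {mts * 8 / 1000:.1f} GB/s per channel")

# DDR4-2133: 17.1 GB/s per channel
# DDR4-3200: 25.6 GB/s per channel -- ~50% more peak bandwidth, which
# only matters if the workload is actually memory-bandwidth-bound.
```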

Different people have different trade-offs.

Personally, in my own (desktop) machines I don’t run ECC because they double as gaming machines, I’m cheap, and they don’t hold any data I really care about. But if it were a business-only PC I’d be sticking ECC in it if available, because memory bandwidth for the stuff I do is plenty and stability is more important.

The flip side of all of the above is that these days data is so much larger than code, to a far greater extent than in the past.

The chance of a single bit-flip causing a crash today is much lower than the chance of it causing data inaccuracies. If you’re doing artwork (audio/video, for example), no problem; a bit flip in one pixel probably won’t be noticeable, or it will end up being anti-aliased or smoothed anyway. If you’re doing statistical analysis… maybe a big problem.
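
A quick Python sketch of the difference (my own illustration; the exact numbers depend entirely on which bit gets hit):

```python
import struct

def flip_bit_u8(value: int, bit: int) -> int:
    """Flip one bit of an 8-bit pixel channel (worst case: +/-128 of 255)."""
    return value ^ (1 << bit)

def flip_bit_f64(value: float, bit: int) -> float:
    """Flip one bit of a double's IEEE 754 representation."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (out,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return out

print(flip_bit_u8(200, 7))      # 200 -> 72: one slightly odd pixel
print(flip_bit_f64(0.001, 62))  # 0.001 -> ~1.8e+305: a ruined statistic
```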

1 Like

That’s a great summary, thank you :+1:

Very interesting about EPYC & Xeons; the latter doesn’t sound suitable for a workstation that will mostly be running single-threaded work.

Thank you again :clap:

That’s what I thought I’d read about the CPUs being able to use ECC but perhaps not utilising its benefits.

I think for casual use ECC is surplus; however, in my situation my computer earns me money, and if I can spend a few extra bucks for a 1% reliability increase, I’ve got no issue with that. Problem is, with my job, downtime can easily mean thousands of pounds (roughly 1.5× that in dollars) lost… and I like money, it helps with food for the winter :slight_smile:

:laughing: Thank you for that, it gives a good perspective! I’m not going to be operating a hospital within the confines of an aeroplane any time soon :slight_smile:

In my situation, this is sort of the crux of the matter. I create drawings that are used to construct buildings, and I can potentially be held responsible for any errors in dimensions… hence my Professional Liability insurance covers me for £2,000,000. I’d prefer not to use it, so ECC is preferred, for that tiny chance of an error. I have to say, my CAD software is pretty solid, but once in a while I have seen strange things happen: an error that doesn’t corrupt the drawing, but adds artefacts. Of course it could have been any of a multitude of other hardware issues, but if I can protect myself in any way that doesn’t cause a significant performance impact, I’ll take ’em!

I’m really appreciating your and others’ comments; it’s so helpful for someone who doesn’t have an IT department, or the funds for one! :+1: :+1: :+1:

I am wondering if there’s any real-world evidence of single bit-flips happening and causing actual changes to stored data at any meaningful rate of occurrence.

I think the potential being referenced is more of a hypothetical threat than an observed phenomenon.

The scenario where some sort of memory error silently changes one digit of a dimension in a CAD file, or something of the sort – I struggle to believe this is a real risk.

The kind of memory errors which ECC corrects can and do cause application crashes. But stored, altered data… I just can’t find anything documenting these occurrences in modern applications. There are just so many rare conditions that have to align for this to occur, versus the application simply crashing outright. I’d be far more worried about human-generated bugs causing data values to be stored incorrectly.

The Google study from ’09 that everyone references for error rates reports them as errors per Mbit per billion hours of operation. Those are metrics that are somewhat hard to understand and extrapolate to modern situations, and the articles that quote the study have a tendency to horrifyingly mistranslate and misinterpret what the data is saying in terms of practical occurrence.

In reality, that study reports something like 92% of DRAM modules seeing zero errors over the course of one year, with only 8% experiencing any error at all. It goes on to point out that there was, at the time, a concern that increased memory density would increase error rates, but improvements in technology seemed to counteract that. And that’s based on ’09 tech.

So then, assuming you’re in the 8% that sees an error at all… what are the chances that the error hits working memory holding information that’s meaningful data for whatever you’re working on, versus the much larger share of RAM that isn’t being used for a stored value in your application?
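
Back-of-envelope, using that ~8% per-module-per-year figure (the other inputs are round numbers I’ve assumed purely for illustration):

```python
# All inputs except the first are assumptions, not measurements.
p_module_per_year = 0.08   # ~8% of DIMMs saw any error per year ('09 study)
dimms = 4                  # a typical workstation
working_hours = 2000       # hours/year the machine does meaningful work

# Chance that at least one DIMM sees at least one error in a year:
p_any = 1 - (1 - p_module_per_year) ** dimms
print(f"any error, any DIMM, per year: {p_any:.0%}")   # ~28%

# Crudely scaled to working hours only (assumes errors uniform in time):
p_work = 1 - (1 - p_any) ** (working_hours / 8760)
print(f"any error during actual work:  {p_work:.0%}")  # ~7%
```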

A render leveraging thousands of dollars of compute power – that’s a great example where ECC might matter even for a situation that isn’t life-and-death. However, this is not because of pixel information (which, once again, I don’t know is a REAL possibility), but rather because the crash in and of itself is damaging. And if you’re going more than 5 minutes without saving while working in any CAD program – once again, you’ve got bigger problems than ECC lol.

OH! Also, in general, high elevation does increase these risks, but if you’re using £ then chances are that isn’t a problem. The UK is one of the lowest-elevation nations in the world – definitely in the top 25 lowest countries by average elevation haha.

1 Like

ECC vs. data errors is a no-brainer IMHO.

If you need data accuracy and may otherwise be sued or make inaccurate decisions: run ECC. Especially if you aren’t severely memory-bandwidth constrained – and most serious money-making machines are not.

It may be a low error rate, but the percentage increase in BOM cost to implement ECC in any machine that’s used to generate revenue and needs accurate data is so small as to be meaningless.

By skipping ECC in an engineering or science workstation you’re saving basically no money and increasing the risk of being sued for millions of dollars, all to avoid what is likely a virtually meaningless performance impact.

ECC can also help against things like Rowhammer, so it makes sense in machines exposed to external threats as well.

IMHO the only places ECC makes no sense are gaming machines, graphic artists’ machines and dumb terminals. But for the latter two, performance isn’t massively impacted by the small hit to memory bandwidth anyway, and again the percentage increase to the total bill of materials is minimal.

2 Likes

Thank you @thro, you took the words out of my mouth, and then typed them out :slight_smile:

I do massively appreciate your thoughts and concerns @JDev about the justification for ECC, and it’s likely a subject that’s been done to death. I look at this decision through many eyes: the business owner, the guy buying the gear, and the guy who has to check his insurance if a fault appears in a drawing, among many others.

Referring to your own comment, though: when data are stored on the hard drive, that’s fine and to some extent out of my control (though I’ve been trialling TrueNAS for nearly 12 months with great success). But when dozens of GB of drawing are held and worked on in RAM, what you perceive as a minor corruption ‘could’ mean money to me. For instance, and this is just a hypothetical: if I draw a building using vectors that’s 120 feet long and a corruption causes that line to become 118 feet, that’s a problem for the whole construction process – from tenders to Structural Engineering to material costings to positioning on the site, to so many other factors that I don’t have the time to list. Guess who’s responsible for all of that loss? Yup, me.
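
Purely to illustrate the shape of that hypothetical, here’s what flipping single mantissa bits does to a double holding 120.0 (the exact corrupted value depends entirely on which bit flips):

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 encoding flipped."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (out,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return out

length = 120.0  # the hypothetical 120 ft dimension
for bit in range(44, 52):  # the top eight mantissa bits of a double
    print(f"bit {bit}: 120.0 -> {flip_bit(length, bit)}")

# Results range from 120.25 (easy to miss) down to 88.0 (obvious) --
# the subtle corruptions are exactly the worrying ones.
```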

So, in summary: when you consider the factors you mentioned – unstable applications, human error and so on – if I can pay a few extra dollars to increase reliability even by 1%, I darn well will. It’s the same reason I keep 5 copies of all of my drawings, saved in multiple geographic locations; some of these individual drawing files are worth up to $10,000 (converted) in man-hours. When they’re worth that much, why shouldn’t I take an extra precaution while they’re in working memory? Of course, if you google “Is ECC worth it”, you’ll get a dozen results suggesting it’s not. However, if you google “Why ECC is important”, lo and behold you’ll get different results making the completely opposite argument – either way reinforcing your preconceived notions.

So I do massively appreciate your thoughts @JDev; it’s great to hear another perspective, but as it’s my business, I have to be cautious. :+1:

2 Likes

Obviously, if bit flips were causing your CAD drawings to actually have data values change, that would be an absolute no-brainer. It would also make me incredibly concerned that MOST desktops that MOST federal contractors use for day-to-day work don’t have ECC memory.

But my question was whether that is even in the realm of might-ever-happen. MOST of the RAM usage isn’t for values that could cause that. Most of your RAM is being used for calculated values; this is why your RAM usage is always substantially higher than anything being stored for a working file.

And the larger your drawing, the more that memory usage skews towards values that aren’t actually stored values. Even if the bit flip lands on part of an actual stored value, presumably the program expects some function to be called to recalculate everything based on that value if it changes. I’m not sure the scenario you describe could really happen from a memory error, and I struggled to find any research suggesting it does. I’m not saying it’s not possible, but if it were, I’d sure like to understand the circumstances that led to it, because it seems like software mitigation could prevent this on consumer hardware.

When people talk about the impact of ECC, it’s generally in terms of downtime cost or known, apparent data corruption. Or, as was appropriately pointed out, the cost of time lost on an extended job that must be completed in one pass.

Most developers I know who work for government agencies are assigned laptops to work on. Generally it’s not possible for them to use their own hardware even if they wanted to. If we’re saying that bit flips cause the kind of unnoticed value changes you’re describing, and that people are subsequently sued for those errors (something else I couldn’t find a single case of – not to say it doesn’t exist), then the ECC-vs-not conversation is a much larger concern than the way it’s conventionally framed.

It just seems like we’re talking about an 8% chance of a 0.2% chance of a 0.0001% chance (assuming about 8 MB of actual stored values in a GB of working memory, which is probably generous) of a bit flip even happening in a data value that could cause the scenario you’re describing – and then I’m curious whether or not the program would still crash anyway.
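
Multiplying those rough figures out (all three are back-of-envelope guesses, and the labels in the comments are just my reading of them):

```python
# The three rough odds from above, multiplied together:
p_any_error = 0.08      # 8%: a module sees any error in a year
p_stored    = 0.002     # 0.2%: the flip lands in an actual stored value
p_matters   = 0.000001  # 0.0001%: it silently changes a meaningful value

p = p_any_error * p_stored * p_matters
print(f"{p:.1e}")  # 1.6e-10 -- roughly one chance in six billion per year
```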

My point isn’t about whether you should use ECC memory. If the cost is negligible and you don’t need the memory bandwidth anyway, then you’ve answered your own question up front with those conditions.

I’m just questioning whether stored data values changing in a CAD file is even in the realm of things that actually happen. Because if not, that information may be relevant for people who later read this while deciding between two significantly different systems based on whether or not they need ECC, who might otherwise get scared by something that may not even be a possible outcome of a memory error.

I do appreciate your comments, thank you.

Vega 64 is so outdated though. Threadripper + Ampere is an excellent combo.

That is indeed the case, but if the error could potentially put you out of business (and it’s YOU personally on the hook for it), and you can mitigate it for a couple of hundred dollars – wouldn’t you take that deal?

If I can virtually eliminate that risk for a couple of hundred bucks, or say, 5% of my workstation BOM cost, you can be damn well sure I’ll take that deal.

Plenty of people do not run ECC and are fine. As you say, the risk is small.

Personally, though, on a machine responsible for engineering decisions or science in particular, I just see no sense in skipping it – the cost to mitigate the risk entirely is small, too.

1 Like

This topic was automatically closed 273 days after the last reply. New replies are no longer allowed.