Edit: I've got a C2750-based system that serves as my firewall, and I'm not exactly thrilled to discover that this is what I might have to look forward to.
I wonder whether it's a design flaw or a spec issue. If, let's say, a clock source crystal goes out of spec, does that cause this, or has Intel got a design/manufacturing flaw rearing its ugly head?
From what little digging I've done, it sounds like a design flaw in the silicon, which means no microcode update is possible. This has the potential to be a pretty big "oops", and a costly one.
Incoming trainwreck. Any business using these things had better back up their stuff and move their host servers to something else, because this will be expensive if something goes wrong.
Exactly. And think of all the standalone appliances that use these CPUs as well: all those pre-built NAS boxes and low-power Windows home servers. The more I think about how many devices this is going to cover, the higher the price tag gets.
Assuming Intel doesn't sweep it under the rug, that is...
Netgate has become aware of an issue related to a component manufactured by one supplier that affects some of our products. This is a widely-used component that is used by many companies around the world.
There is a lot of confusion and misinformation on the subject, and most systems will never experience the issue. Those that do will not suddenly stop working, but if the component fails, the system will not successfully reboot. We are working with the component supplier and our manufacturing partner to resolve this issue as quickly as possible.
Although most Netgate Security Gateway appliances will not experience this problem, we are committed to replacing or repairing products affected by this issue for a period of at least 3 years from date of sale, for the original purchaser.
A board level workaround has been identified for the existing production stepping of the component which resolves the issue. This workaround is being cut into production as soon as possible after Chinese New Year. Additionally, some of our products are able to be reworked post-production to resolve the issue.
We apologize for the limited information available at this time. Due to confidentiality agreements, we are restricted in what we can discuss. We will communicate additional information as it becomes available.
As always, please be assured we will do the right thing for our customers at Netgate and the pfSense community.
That would be very stupid of them as it is hurting their bottom line a fair bit:
But secondly, and a little bit more significant, we were observing a product quality issue in the fourth quarter with slightly higher expected failure rates under certain use and time constraints, and we established a reserve to deal with that.
To understand what’s going on and why C2000 SoCs can fail early, let’s start with Intel’s updated spec sheet, which contains the new errata for the problem.
AVR54. System May Experience Inability to Boot or May Cease Operation
Problem: The SoC LPC_CLKOUT0 and/or LPC_CLKOUT1 signals (Low Pin Count bus clock outputs) may stop functioning.
Implication: If the LPC clock(s) stop functioning the system will no longer be able to boot.
Workaround: A platform level change has been identified and may be implemented as a workaround for this erratum.
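For anyone unsure whether a given Linux box even has one of these SoCs, here is a minimal sketch that scans /proc/cpuinfo for a C2000-style model name. It rests on assumptions: that the part reports itself with a string like "Intel(R) Atom(TM) CPU  C2750  @ 2.40GHz", and the function name `find_c2000_models` is made up for illustration. It only identifies the CPU family; it cannot tell whether a particular board already has the platform-level workaround applied.

```python
import re
from pathlib import Path

# Assumption: C2000-family parts report a model name containing "Atom"
# followed by a four-digit "C2xxx" number, e.g.
# "Intel(R) Atom(TM) CPU  C2750  @ 2.40GHz".
C2000_PATTERN = re.compile(r"Atom.*\bC2\d{3}\b")


def find_c2000_models(cpuinfo_text):
    """Return the sorted, de-duplicated C2000-style model names found."""
    models = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "model name" and C2000_PATTERN.search(value):
            # Collapse the repeated spaces Intel puts in the model string.
            models.add(" ".join(value.split()))
    return sorted(models)


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    text = cpuinfo.read_text() if cpuinfo.exists() else ""
    found = find_c2000_models(text)
    if found:
        print("Possible C2000-family SoC:", ", ".join(found))
    else:
        print("No C2000-style model name found.")
```

A match only means the system is worth asking the board vendor about; the erratum text above gives no software-visible way to detect the failing clock outputs in advance.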
Semi-critical flaw, huh? The only fix is to rework the board, a.k.a. replace the whole board. Probably all the C2000 stuff produced a year ago or earlier is living on borrowed time, and it will all die sooner or later, it seems. "Semi-critical", right. Intel marketing is going into full damage control.
Edit: Apparently Supermicro is offering free RMAs for their C2000 series stuff:
Like many people running pfSense, I have a C2758 in a Supermicro SuperServer barebones. I reached out and submitted an RMA via:
and they got back to me today, letting me know that they will do a "cross shipment" sending me a replacement before I even send the defective one back.
Well, isn't that peachy. My ASRock C2750DI hasn't died yet, though. I am assuming there won't be any microcode fixes or basic BIOS updates to resolve the issue. Just RMA if you have any kind of warranty?
Only if it happens during the warranty period and gets into the news because they weren't fast enough with issuing blanket NDAs, like what happened this time. Instead of calling it stupid, I'd call it a miscalculation. I've also read somewhere (maybe even on these very forums) that starting with Skylake they put some shitty thermal interface under the CPU heat spreader, and it degrades after a year or so of active 24/7 use.
Nope. The difference is that the Rangeley C2000 SoCs were sold as running 24/7 for years in networking hardware. You know, firewalls, routers, etc. Desktop CPUs are not expected to run 24/7 for years. Had these chips been in consumer laptops, it would have been no problem. But they're sold on the professional market for high availability. So no, no chance in hell they planned this. This even hurts their brand.
Do you really think that these chips were designed to run 24/7 from first sale till EoS? That's not how HA works or what HA means. HA doesn't mean that this specific piece of hardware will not break; it means that you have multiples of them in a redundant configuration, plus spares. If anything happens, the enterprise price margin pays for a replacement during the warranty period, and the service contract pays for it after that. Most, if not all, of these vendors (including Cisco) have liability disclaimers in their EULAs.
It only hurts them because the issue went public. And even when something like this goes public, it doesn't hurt vendors or manufacturers that much. Remember Packet of Death? Did it hurt Intel? Maybe. But does it affect their sales now? Not at all.