[SOLVED] NOT URGENT: CPU Errors, or RAM I wonder?

Following on from this post: [SOLVED] (UK) The quest, for a lower power motherboard using my existing LGA2011 CPU/ECC REG RAM

It seems that the new CPU I installed may be duff :frowning:

I installed it on May 14th. The day after, I noticed:
(TL;DR is Channel 2 memory error)

May 15 03:08:37 Xeon MCA: Bank 11, Status 0x8c000045000800c2
May 15 03:08:37 Xeon MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
May 15 03:08:37 Xeon MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 0
May 15 03:08:37 Xeon MCA: CPU 0 COR (1) MS channel 2 memory error
May 15 03:08:37 Xeon MCA: Address 0x22f8ac380
May 15 03:08:37 Xeon MCA: Misc 0x1221040004000a8c

Then today I noticed:
(TL;DR is Channel 3 memory error)

May 25 02:02:55 Xeon MCA: Bank 12, Status 0x8c000048000800c3
May 25 02:02:55 Xeon MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
May 25 02:02:55 Xeon MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 0
May 25 02:02:55 Xeon MCA: CPU 0 COR (1) MS channel 3 memory error
May 25 02:02:55 Xeon MCA: Address 0x69da41440
May 25 02:02:55 Xeon MCA: Misc 0x90000200020108c
May 25 04:19:31 Xeon MCA: Bank 12, Status 0x8c000048000800c3
May 25 04:19:31 Xeon MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
May 25 04:19:31 Xeon MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 0
May 25 04:19:31 Xeon MCA: CPU 0 COR (1) MS channel 3 memory error
May 25 04:19:31 Xeon MCA: Address 0x69da41440
May 25 04:19:31 Xeon MCA: Misc 0x90000200020108c
May 25 06:36:08 Xeon MCA: Bank 12, Status 0x8c000048000800c3
May 25 06:36:08 Xeon MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
May 25 06:36:08 Xeon MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 0
May 25 06:36:08 Xeon MCA: CPU 0 COR (1) MS channel 3 memory error
May 25 06:36:08 Xeon MCA: Address 0x69da41440
May 25 06:36:08 Xeon MCA: Misc 0x90000200020108c

Now, considering I’ve taken out 3/4 of the originally installed RAM, wouldn’t I be unlucky if one of the remaining RAM modules has an issue?

I’ll be switching out the current RAM modules for some of the others I have. I do hope it’s not the admittedly second-hand CPU that is the problem.

I’m currently considering the advice of others and accepting that this platform (CPU and motherboard) is just due for replacement because of its power usage. I have my eye on a G4560 and a board that will fit it…still mulling that one over.

Thanks for reading!

OK, I think I’ll change out the RAM now.

May 25 08:52:45 Xeon MCA: Bank 12, Status 0x8c000048000800c3
May 25 08:52:45 Xeon MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
May 25 08:52:45 Xeon MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 0
May 25 08:52:45 Xeon MCA: CPU 0 COR (1) MS channel 3 memory error
May 25 08:52:45 Xeon MCA: Address 0x69da41440
May 25 08:52:45 Xeon MCA: Misc 0x90000200020108c

Right, so I’ve swapped out all the RAM for spares. I’m going to run a scrub once my snapshot machine has done a backup.

Did you check the BIOS? Maybe the change made the BIOS default to different clock speeds/timings that are incompatible with your config. Or MCE threshold values, or whatever. I’d check this first. MCE sometimes reports things that are entirely normal, and these could also be correctable errors that were in fact corrected.

I’m not familiar with the specifics of the LGA2011 platform, so I can’t contribute much more in-depth or practical help here.

Small update: I ran a scrub that took about 6 hours and saw no further errors.

@Exard3k Sorry mate, didn’t see a notification about your message until now, will respond fully when I have an actual keyboard and not phone!

Could just be a bad CPU seat; you can try reseating it.

Was just about to suggest this too - I’ve got an older dual LGA2011 SuperMicro board and two times now I’ve pulled the server out to work on it and it wouldn’t boot back up because of a memory channel suddenly being ‘bad’. Both times I pulled the CPU on that channel, cleaned the pads + socket with a special nylon brush meant for dispensing flux for BGA rework, and that took care of it.

Just something to consider.

Thank you again. I had reset the BIOS after the new CPU was acting odd, and thought it was probably sensible as a precaution.

So far, I’ve still not had any errors after replacing the RAM…so far!

Cheers guys. I think I’ll get one of those nylon brushes and give it a clean if it plays up again.

So far I’ve had no boot issues at all with the new CPU. Really appreciate your thoughts :+1:

I have also heard of a technique where you gently massage the socket pins against the grain with a small nylon brush to make them more springy again, which improves contact and pressure with the pads.

Thank you for that mate :+1:

Yep, if it’s running now then I’d just leave it alone and call it good unless it starts having problems again. :joy:

For reference: it’s a US distributor, so not really useful to you, but I keep one of these on hand in case a socket ever needs cleaning out. You can find something similar at most distributors of solder supplies/etc. Excelta makes a lot of ESD-safe brushes in different sizes, but they’re a little expensive; you can probably find something similar for a lot less.

From your log file: interesting that the time between 02:02:55 and 04:19:31 (8196 seconds) is almost identical to the time from 04:19:31 to 06:36:08 (8197 seconds), and that 8196 is (edit: almost) a power of 2 (2^13 = 8192) :ghost:
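For anyone who wants to check that arithmetic, here is a minimal Python sketch. The timestamps are copied from the May 25 log excerpts in this thread (including the later 08:52:45 entry); the helper name `gaps_in_seconds` is just something made up for illustration.

```python
from datetime import datetime

# Timestamps of the channel 3 errors on May 25, copied from the logs in this thread.
stamps = ["02:02:55", "04:19:31", "06:36:08", "08:52:45"]

def gaps_in_seconds(times):
    """Return the gap in seconds between each pair of consecutive same-day timestamps."""
    parsed = [datetime.strptime(t, "%H:%M:%S") for t in times]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(parsed, parsed[1:])]

gaps = gaps_in_seconds(stamps)
print(gaps)                        # gaps between consecutive errors
print([g - 2**13 for g in gaps])   # how far each gap is from 8192 seconds
```

The gaps all land within a few seconds of 8192, which fits the idea of something periodic finding the same address on each pass rather than random reads tripping over it.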

edit 2: I guess that could be the background patrol scrub RAS feature catching a memory error before an explicit read does, which would suggest that writes or memory refresh were faulty. If you reset the BIOS and that disabled the patrol scrub, then you might not have resolved the issue.
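One way to sanity-check the patrol scrub guess is to decode the Status values from the log by hand. The sketch below assumes the IA32_MCi_STATUS layout as I understand it from Intel's SDM (bit 63 valid, bit 61 uncorrected, bits 52:38 corrected error count, low 16 bits the MCA error code, with memory controller errors encoded as `0000 0000 1MMM CCCC`). The function name `decode_status` and the `MEM_OPS` table are hypothetical, so treat this as a rough aid and not a reference decoder.

```python
# Assumed meaning of the MMM field in a memory controller error code (per Intel SDM).
MEM_OPS = {
    0: "generic undefined request",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}

def decode_status(status):
    """Pull a few fields out of a 64-bit MCi_STATUS value (layout assumed, see above)."""
    code = status & 0xFFFF
    info = {
        "valid": bool((status >> 63) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,
    }
    # Memory controller errors have the form 0000 0000 1MMM CCCC.
    if (code & 0xFF80) == 0x0080:
        info["operation"] = MEM_OPS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        info["channel"] = None if channel == 0xF else channel  # 0xF = not specified
    return info

for raw in (0x8C000045000800C2, 0x8C000048000800C3):
    print(hex(raw), decode_status(raw))
```

If that layout is right, both Status values decode as corrected errors with a count of 1, on channels 2 and 3, with the operation field set to memory scrubbing. That would match the "COR (1) MS channel N" lines in the log and point at a scrubber finding the errors, not normal reads.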

You are way more clever than me :blush: I will keep an eye on that machine, though I’m in the middle of the next slooooow process of changing up my general server arrangement, and it might become the 3rd backup machine.

That’s the way I like to do things! :laughing:

I’m definitely getting one of those, thank you for thinking of my non-US location :+1:

Only a minor update, but no further RAM issues have been logged :crossed_fingers:
