Proxmox on Gen 1 Ryzen hardware

I have an interesting Proxmox problem that I haven’t found a solution for. I have 4 home built servers in my home lab, 2 of which are very much based on desktop hardware. One is an X370 motherboard with a Ryzen 1700X processor and the other is a Ryzen 1600 running on a B450 motherboard. The four servers were running VMware until I upgraded to ESXi 7 and ran into hw compatibility issues. When running ESXi 6.7 these “servers” were rock solid and never gave me trouble.

In any case, figured it would be a great time to test out and learn more about Proxmox so I downloaded latest (7.1), did my reading and had fun setting them up on those two. Except I started having severe issues on both servers in that they would simply hang once to twice a day and need a reboot to get back online. The error I would get on them is:

watchdog: BUG: soft lockup - CPU#XX stuck for XXXs!

I have tried finding something on this and I have done a lot of things, BIOS resets/configs etc but nothing works. I can even find very old posts with this error for Proxmox so not even sure that this is tied to Ryzen specifically. To note, the 1700x had one VM running on it which is pretty much idling and the 1600 one has no workload.

I finally decided to take the 1600 instance as this was the one that was giving me more problems, and I decided to try out XCP-ng and try that out as well (again - to learn and get a feel for it). That was 8 days ago and the machine has been rock solid under XCP-ng.

Not sure what I am missing - anybody out there seen this type of issue with Proxmox before? I am just a step away from converting the 1700X over to XCP-ng as well but I was having fun with Proxmox and would like to mess around with it some more, but only if I can get it stable.

I’m running a Proxmox install on a Ryzen 1700X with a B450 motherboard, haven’t had any issues yet.

I’m new to PVE myself, but anything that looks relevant in the syslog when this is happening?

Is your BIOS up to date on both boards?

Thanks for the reply. And yep, BIOS is fresh and nothing comes up in Syslog that can point me in the right direction either. The only thing I see is when I hook up a monitor and I get the watchdog entry littering the screen. At this point I have given up. I am just going to switch over to XCP-ng so I can start running these two servers properly.

When using first generation Ryzen this could be the segementation fault bug. I had an affected CPU and would constantly crash when using virtualization. There is no fix for this, other than to do an RMA. Affected are first generation Ryzen CPUs produced before and including week 25 of 2017. The manufacturing week is printed on the heat spreader.

3 Likes

Have you checked that it isn’t trying (and failing) to get to a lower power state?

2 Likes

Yeah, I feel you, have had the same problem… It took me a damn long time to figure it out and yes as @Trooper_ish said it has something to do with the lower C-States. Youll have to disable them in order for it to work reliably. It’s really unfortunate but then it works very well. Have had 140 days of uptime, so yes it’s stable XD

You can fix it with the following cmdline flag:

# R1600
processor.max_cstate=5 rcu_nocbs=0-11
# R1700x
processor.max_cstate=5 rcu_nocbs=0-15
3 Likes

Not sure about the c-states, I just posit as a possible; I think I disabled mine ages ago, but could have been for other reasons.

Simple thing to try out though :man_shrugging: