A tale of troubleshooting USB issues on ASUS ROG STRIX Gaming X570E

In August of 2021 I ordered a pretty pricey computer from System76, a Thelio Mira (R1) with a Ryzen 9 5950X and 128 GB of ECC memory.

My goal was to use Proxmox to run multiple VMs for a home lab and storage server. I purchased a four-bay OWC Mercury Elite Pro Quad 3.5" SATA enclosure with a USB 3.2 Gen 2 (10 Gbps) interface and built a 4x 14TB striped mirror (RAID 10), giving me about 24TB of usable space. Running a ZFS scrub on the pool would elicit errors and eventually hang the USB bus, cutting power to it; the mouse and keyboard plugged into the motherboard would go dark.
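For reference, the pool layout and the scrub that triggered the failures look roughly like this. This is a sketch only: the pool name and device paths are hypothetical placeholders, and real pools should use stable /dev/disk/by-id paths rather than sdX names.

```shell
# Sketch: create a 4-disk striped mirror (RAID 10) and scrub it.
# "tank" and the usb-DISK* paths are placeholders for illustration.
zpool create tank \
  mirror /dev/disk/by-id/usb-DISK0 /dev/disk/by-id/usb-DISK1 \
  mirror /dev/disk/by-id/usb-DISK2 /dev/disk/by-id/usb-DISK3

zpool scrub tank     # reads all data to verify checksums (mostly reads)
zpool status tank    # watch for READ/CKSUM errors accumulating mid-scrub
```

Because a scrub reads every block on every disk at once, it is about the most sustained USB bandwidth load a pool like this can generate, which is why it reliably tripped the bug.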

Errors would look something like this:
[Sun Jan 9 20:04:31 2022] sd 8:0:0:0: [sdd] tag#21 uas_eh_abort_handler 0 uas-tag 4 inflight: CMD IN
[Sun Jan 9 20:04:31 2022] sd 8:0:0:0: [sdd] tag#21 CDB: Read(16) 88 00 00 00 00 00 36 16 ce 28 00 00 02 00 00 00
[Sun Jan 9 20:04:31 2022] sd 6:0:0:0: [sdb] tag#15 uas_eh_abort_handler 0 uas-tag 4 inflight: CMD IN
[Sun Jan 9 20:04:31 2022] sd 6:0:0:0: [sdb] tag#15 CDB: Read(16) 88 00 00 00 00 00 36 16 a4 28 00 00 04 00 00 00
[Sun Jan 9 20:04:31 2022] sd 6:0:0:0: [sdb] tag#14 uas_eh_abort_handler 0 uas-tag 3 inflight: CMD IN
[Sun Jan 9 20:04:31 2022] sd 6:0:0:0: [sdb] tag#14 CDB: Read(16) 88 00 00 00 00 00 36 16 a0 28 00 00 04 00 00 00
[Sun Jan 9 20:04:31 2022] sd 8:0:0:0: [sdd] tag#20 uas_eh_abort_handler 0 uas-tag 3 inflight: CMD IN
[Sun Jan 9 20:04:31 2022] sd 8:0:0:0: [sdd] tag#20 CDB: Read(16) 88 00 00 00 00 00 36 16 ca 28 00 00 04 00 00 00
[Sun Jan 9 20:04:31 2022] sd 6:0:0:0: [sdb] tag#13 uas_eh_abort_handler 0 uas-tag 2 inflight: CMD IN
[Sun Jan 9 20:04:31 2022] sd 6:0:0:0: [sdb] tag#13 CDB: Read(16) 88 00 00 00 00 00 36 16 9c 28 00 00 04 00 00 00
[Sun Jan 9 20:04:31 2022] sd 6:0:0:0: [sdb] tag#12 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD IN
[Sun Jan 9 20:04:31 2022] sd 6:0:0:0: [sdb] tag#12 CDB: Read(16) 88 00 00 00 00 00 36 16 98 28 00 00 04 00 00 00
[Sun Jan 9 20:04:31 2022] sd 6:0:0:0: [sdb] tag#9 uas_eh_abort_handler 0 uas-tag 6 inflight: CMD IN
[Sun Jan 9 20:04:31 2022] sd 6:0:0:0: [sdb] tag#9 CDB: Read(16) 88 00 00 00 00 00 36 16 94 28 00 00 04 00 00 00
[Sun Jan 9 20:04:31 2022] scsi host8: uas_eh_device_reset_handler start
[Sun Jan 9 20:04:31 2022] scsi host6: uas_eh_device_reset_handler start
[Sun Jan 9 20:04:31 2022] usb 6-2.3: reset SuperSpeed Plus Gen 2x1 USB device number 5 using xhci_hcd
[Sun Jan 9 20:04:31 2022] scsi host8: uas_eh_device_reset_handler success
[Sun Jan 9 20:04:31 2022] usb 6-2.1: reset SuperSpeed Plus Gen 2x1 USB device number 3 using xhci_hcd
[Sun Jan 9 20:04:31 2022] scsi host6: uas_eh_device_reset_handler success
[Sun Jan 9 20:05:47 2022] sd 7:0:0:0: [sdc] tag#25 uas_eh_abort_handler 0 uas-tag 4 inflight: CMD IN
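If you're chasing something similar, a quick way to see which disks are taking the aborts is to tally the `uas_eh_abort_handler` lines per device. A minimal sketch, fed a few sample lines from the log above (in practice you'd pipe in `dmesg` itself):

```shell
# Count uas_eh_abort_handler events per block device from dmesg-style lines.
# The sample input is pasted from the log above; replace with `dmesg | ...`.
log='[Sun Jan 9 20:04:31 2022] sd 8:0:0:0: [sdd] tag#21 uas_eh_abort_handler 0 uas-tag 4 inflight: CMD IN
[Sun Jan 9 20:04:31 2022] sd 6:0:0:0: [sdb] tag#15 uas_eh_abort_handler 0 uas-tag 4 inflight: CMD IN
[Sun Jan 9 20:04:31 2022] sd 6:0:0:0: [sdb] tag#14 uas_eh_abort_handler 0 uas-tag 3 inflight: CMD IN'

printf '%s\n' "$log" \
  | grep uas_eh_abort_handler \
  | grep -Eo '\[sd[a-z]+\]' \
  | sort | uniq -c
# prints counts per device, e.g. "2 [sdb]" and "1 [sdd]"
```

On a real run, the devices with the highest counts map back to whichever enclosure (and therefore whichever USB port/controller) is getting saturated.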

I got an advance RMA from OWC, thinking it was a problem with the storage enclosure, but the replacement had the same problem. With two enclosures on hand, I split the load by putting two drives in each; the errors were much harder to generate with a ZFS scrub, but they would still occasionally occur. I started doing some googling, and there seems to be a litany of USB issues with the X570E chipset. None of them seemed to be exactly my problem, but I figured I would try a PCIe USB card as a test. I ordered an Inateck PCIe 3.0 x4 USB 3.2 card with 20 Gbps bandwidth from Amazon, and two x1 FebSmart 5-port USB 3.0 SuperSpeed (5 Gbps) cards.

The problem went away when I put all four drives on the Inateck card, or two drives each on the two 5 Gbps USB cards.

A ZFS scrub with 4 SATA drives in 1 OWC enclosure would peg the USB bus at around 950 MB/s. This would cause the USB hangs/resets about 1–5 minutes into the scrub.

The issue seems to be related to maxing out an individual USB port on the motherboard, more so than the USB controller as a whole. The motherboard has 3 USB controllers. The X570E chipset on this board is cooled by an active fan, and I have a loose theory that the failures might be related to heat or load on the USB portion of the chipset.

As another test, I hooked up 3 SSDs with USB3-to-SATA cables, plugged them into 3 ports on the same onboard USB controller, and the problem was not immediately reproducible. I then plugged two SSDs into another OWC 2-drive dock and the problem was immediately reproducible (the SSDs were able to peg the USB port at 1 GB/s). A ZFS scrub is mainly reads.

I had contacted System76 about the issue and they advised they weren't able to offer support since I was running Proxmox. It would be some months before I got around to throwing Pop!_OS on another SSD and using that to repeat the tests. With Pop!_OS my initial test showed USB timeouts and resets, but the system didn't crash the USB bus entirely: the filesystem was able to recover every few minutes, restart, and then hang again. Subsequently, when doing the 3x and 2x SSD tests instead of the 4x SATA test, the system would hang and require a reboot.

I had planned on just running my system with the USB 3.2 card in the third x4 PCIe slot, but I recently got a new Radeon 6600 XT video card that consumes 2.5 slots, making one of the x1 slots unusable. My use case was having 3 GPUs in the machine, as well as the USB 3.2 card. I also wanted to pass an entire USB controller to a desktop VM so I could plug in whatever USB devices I needed without having to pass them through individually via Proxmox/QEMU. Since I didn't trust the onboard USB controllers, I used one of the 5 Gbps USB cards to pass to my desktop VM.
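For anyone wanting to do the same, whole-controller passthrough in Proxmox boils down to binding the card to vfio-pci and handing the PCI function to the VM. A sketch, assuming the card shows up at 0000:0b:00.0 with vendor:device ID 1b21:1242 and the guest is VM 100 (all three values are hypothetical; check yours with `lspci -nn`):

```shell
# Find the USB card's PCI address and vendor:device ID.
lspci -nn | grep -i usb

# Bind it to vfio-pci at boot so no host driver claims it:
#   /etc/modprobe.d/vfio.conf
#   options vfio-pci ids=1b21:1242

# Hand the whole controller to VM 100; every port on the
# card now belongs to the guest, hotplug included.
qm set 100 -hostpci0 0000:0b:00.0
```

The upside over per-device USB passthrough is that anything plugged into the card just appears in the guest, with no per-device configuration.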

As my plans for this system grow, I really need the onboard USB ports to function reliably. I don't want to consume 2 of my PCIe slots trying to replace their function; I still want to run 3 GPUs and add in a 10G NIC for SMB use to other machines.

I had hoped that support at System76 would pass me over to one of their hardware engineers early on, so I could explain the issue and have them replicate it in their lab, but that didn't happen. It seems I'll be shipping the system to them at my own cost to troubleshoot further and, hopefully, replace the motherboard. I'm hoping a newer revision of the X570E does not have this problem; I'm dreading the thought of receiving a replacement system with the same issue.

I did update the motherboard BIOS during my initial troubleshooting efforts.


Quick update.

TL;DR: there is definitely a platform issue with the AMD Starship and Matisse USB controllers on Zen 3 X570 and Zen 2 TRX40/WRX80 (Threadripper, Threadripper Pro) platforms: sustained peak bandwidth over the controller will crash it or cause it to time out.

I shipped the box to System76 and they were able to reproduce the issue on both AMD platforms, X570 and TRX40. System76 was nice enough to offer me a refund, since I needed the USB ports for full I/O, and losing PCIe slots to run a working USB 3.2 controller reduced my ability to fill my specific need (2-GPU passthrough plus a 10G NIC).

I decided to get a Supermicro Superworkstation 5014A-TT Threadripper Pro 3955WX system instead of a Lenovo P620.

The Lenovo ecosystem would probably be good for spare parts in the future, but it has a proprietary motherboard and power supply, plus a few lower-bandwidth PCIe slots. The Supermicro has 6 full x16 slots.

I plan on replacing the chip with a Threadripper Pro 5000 sometime in the next few years. I'm excited that it should be a drop-in replacement.

Supermicro sells the system without a GPU, but the BMC has a passthrough-usable GPU for running a VGA monitor. It also comes with an onboard 10G NIC and plenty of slots to run lots of GPUs. I've got a 2.5-height 6600 XT consuming the bottom 2 PCIe slots, two other single-height GPUs, and the USB card, and I've still got room for 1 more.

I'll post a separate review of the Supermicro in a few days. It has both the Matisse and Starship controllers onboard, but it also has a single ASMedia ASM3242 Type-C port, which does not exhibit the crashes under load.