How to ZFS on Dual-Actuator Mach.2 drives from Seagate without Worry

I got my new 2x18 SAS drives in and they're working great. However, my kernel log is getting spammed with the following:

[84302.447898] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[84312.462949] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[84379.029583] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[84471.253048] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[84471.256257] mpt3sas_cm0: log_info(0x31120436): originator(PL), code(0x12), sub_code(0x0436)
[84507.029904] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[84507.187357] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[84563.388526] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[84563.559160] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[84609.426064] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[84711.831236] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)

Anybody else? SAS HBA is a bog standard SAS3008.

@bambinone I know exactly what this is.

The ZED daemon is responsible for handling all zpool events in user space: things like devices being added or removed, or device errors.
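
If you want to see the raw stream ZED is reacting to, the standard zpool events command will show it:

zpool events        # list recent events
zpool events -v     # full detail for each event
zpool events -f     # follow the stream live, like tail -f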

On the 3008-based LSI HBA chipsets (9300, 9400 series HBAs), the mpt3sas driver generates an info message every time a checksum data transfer event happens. With passive copper cables this happens a lot, and under STP transport of ATA over SAS it can happen very frequently (but that's not what is happening here).

To confirm my conjectures above, you can open up the mpt3sas.h file and see the debug flags needed to get a bit more verbosity out of the driver. I have assembled the bit mask for you here:

echo 0x0006020A >/sys/module/mpt3sas/parameters/logging_level
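
That sysfs write only lasts until the module is reloaded or the box reboots. If you want it persistent, or want to turn the verbosity back off, something like the following works (the .conf file name is just an example):

echo "options mpt3sas logging_level=0x0006020A" > /etc/modprobe.d/mpt3sas-debug.conf   # persist across reboots
echo 0 > /sys/module/mpt3sas/parameters/logging_level                                  # back to the quiet default at runtime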

You'll find that the above verbosity level shows the 3008 chipset experiencing occasional checksum errors on CDB Write 16 as a result of an improper link-equalization handshake at initialization. Reset the link and you'll get a different rate of occurrence. It is a known problem with passive copper cabling. At scale and in enterprise deployments, I tended to go with optical whenever possible, as it completely eliminates the HBA link-equalization problem.

While I was still with Seagate, I tried to work with Broadcom's firmware engineers out of their Boulder, Colorado office, but they refused to fix this legacy LSI problem. In their mind this chipset is obsolete and should be replaced with the new TriMode/NVMe-capable HBAs.

The fix…

This is not an error. It is an informational message from the mpt3sas driver indicating that it had to do a partial retransmission of the SRB payload of the SCSI command.

However, the ZED daemon on Linux was never finished. It can't tell the difference between an error, a warning, or an info-level message. Simply using the storcli command line tool to read the firmware revisions on the card will generate a stream of events that the ZFS ZED daemon will interpret as errors and offline a device!

The fix was to fix the ZED daemon. To that end, I sponsored 9 patch sets for ZFS via Allan Jude at iXsystems. Most of them landed in OpenZFS 2.1, while the rest are landing with OpenZFS 2.2.

This resolves most of the issues, and also gives you the ability to control how frequent info/warn messages need to be before ZFS considers evicting a device. This completely resolved the issues I was having with my old shoddy SAS JBODs hosting SATA drives.
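
If I remember the 2.2 work correctly, those thresholds are exposed as vdev properties (checksum_n/checksum_t and io_n/io_t, the N-errors-within-T-seconds knobs the diagnosis engine uses), so you can relax them per device. The pool and device names below are just placeholders:

zpool set checksum_n=20 tank sda     # require 20 checksum errors...
zpool set checksum_t=600 tank sda    # ...within 600 seconds before ZED acts on that vdev
zpool get checksum_n,checksum_t,io_n,io_t tank all-vdevs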

The patch sets I sponsored are attached.

I would recommend upgrading to OpenZFS 2.2 as soon as possible. If you want some pre-built packages, I can help. I have them in production for the latest Arch, Debian, and Ubuntu 22.04.

Seagate.Sponsored.Zfs.Changes.txt (7.7 KB)

6 Likes

wow, paging @aBav.Normie-Pleb this gives me an idea for another/related problem we were experiencing with Broadcom HBAs. I'm going to try a redriver WITH the HBA. I bet that solves the issue I was having. If it does, holy crap. Turns out a related problem is still happening on non-legacy chipsets…

4 Likes

@Wendell

Thanks for still thinking of me :wink:

Yet, this shouldn’t have anything to do with the issues I’ve encountered since they also happen if the HBAs are just seated in a system, EVEN IF NO CABLES AT ALL ARE CONNECTED TO THEM.

Extra fun starts when you seat multiple different models, for example a Tri-Mode HBA like the 9500-16i and the pure PCIe Switch HBA P411W-32P. The drivers seem to “overlap” and fight each other and Windows is f’d.

Broadcom seems to be better at suppressing and masking issues than LMG :stuck_out_tongue:

Would be great to see a properly working PCIe Switch HBA before I die…

3 Likes

Can you DM me or point me to the thread that describes your challenges in more detail?

I've a Web3/DeStor project that leverages vGPU pipelines in conjunction with 12+ NVMe storage endpoints to achieve 10+ million 4K IOPS. I literally spent a week of evenings trying to debug a set of four-way bifurcation cards and have given up completely on that strategy - it's a silly proposition to believe you can lift up and stretch out that many lanes from the PCIe bus and fan them out across that many cards & cables!

Anyway, I've refocused my attention on the Highpoint 1580 class of cards, which leverage the PEX 88048 PCIe switch chipset for both M.2 and U.2/U.3 targets.

I’m learning a lot, but would like to focus on resolving your specific challenges, perhaps in a separate thread.

@wendell, this is ultimately what I am building, with a bit of luck, on an XCP-NG hypervisor with VFIO passthrough of the GPUs, feeding 16+ NVMe IO targets to solve the Merkle tree crypto proofs for Web3 Distributed Storage networks. The warm backing storage is dual actuator drives in a 45 Drives Storinator while everything else lives on a WRX threadripper mainboard.

https://web3mine.io/v0.1/docs/compute_provider/Filecoin%20Sealing/SupraSeal%20implementation/Introduction

Your banging the gong for VFIO support could not have been more timely… :slight_smile: Thank you!

2 Likes

@John-S

Thank you for your offer of assistance!

Here is a short summary of my issues with the Broadcom P411W-32P Active PCIe Switch HBA, which is basically the same kind of device as the Highpoint 1580 (a real PCIe switch, not PCIe bifurcation). Broadcom Tri-Mode HBAs like the 9400-8i8e or 9500-16i cause the same issues:

I just want to use it with Windows on a system that goes to S3 Sleep regularly (officially supported by Broadcom, with eight U.2 NVMe SSDs directly connected to it):

  • When waking a system from S3 Sleep it always crashes with a BSoD.

  • You have to use specific driver/firmware versions, otherwise it causes a different BSoD during boot.

  • Despite selling officially compatible cables for use with directly attached SSDs, Broadcom says it no longer wants to support this, despite being in the middle of this model's life cycle. So they don't want to take on my case with directly attached SSDs anymore.

  • With the latest firmware they killed support for directly attached SSDs; you now have to get a specific backplane to be able to use the card.

  • Other users have confirmed the issues I've discovered, but Broadcom doesn't care to fix them.

  • Broadcom’s entire behavior is basically spitting into the customer’s face

I have a larger thread collecting my PCIe experiments; the experience with the P411W-32P starts around post #66:

1 Like

This is really great insight, thank you.

I am actually already running OpenZFS 2.2 (rc3 I think) so that must be why zed seems to be working fine even though mpt3sas is throwing these logs.

Yes, I’m using the SAS version of the Mach.2, four drives/eight LUNs total. However, I’m using two of these trayless backplanes—which only accept 7-pin SATA-style data—with the appropriate SFF-8643 breakout cables. I’m actually surprised it works at all with the dual-actuator drives, but I bet this is the source of the checksum data transfer events.

Interestingly, when I applied this mask the errors stopped. :thinking: I’ll have to dig into the bits to figure out what’s going on there.

2 Likes

Total noob here!

Is there a script in the wild to help create the partitions for SATA drives on TrueNAS SCALE? I tried using John's script but TrueNAS doesn't have all the binary dependencies.
I have 3 of the 2x18 drives.

Thanks

1 Like

This is what I needed to know and see, thank you

Damn it, this is the price point I wanted but it was SAS…but I don't think they sell the 2x14 in SATA

They’re SAS drives

Yea, I flipped SATA and SAS, thanks for pointing that out

They’re retailing for double the price of regular drives…at least where I’m living. I’d rather get two drives tbh :slight_smile:

Maybe this will change in the future…we’ll see.

Yeah, I was lucky to pick them up below regular HDD prices. Probably due to SAS and relatively unknown new tech.
IMHO, even in the niche use case of a ZFS workhorse they have not quite lived up to my possibly unreasonably high expectations.
What I mean is: good tech if it fits your use case, but I wouldn’t pay (much) over proven tech. There is a right price for every product.

I would consider starting the bash script with:

set -euo pipefail
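
For anyone who hasn't seen that incantation before, broken out:

set -e          # exit immediately if any command exits non-zero
set -u          # treat expansion of unset variables as an error
set -o pipefail # a pipeline fails if any command in it fails, not just the last one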

I was wondering about a just-for-fun experiment. What if you partitioned the drive into 4 equal sizes such that 2 of the partitions would run from different actuators but also be very close to the center of the drive? Is it the case that I/O is faster towards the center? It seems like it would be, because the heads don't have to travel as far. If you then RAID 0'd those together, could you achieve faster throughput? If so, you could then have one "fast" partition and one "slow" one.

Thoughts?

seek times (latency) are the same. Throughput closer to the center is about half, so it gets slower the more data you put on it.

That's because there are fewer bits to read or write per revolution on the inner tracks, unless you could spin the disk faster there (old CD-ROM drives used to do this to compensate).

A RAID 0 across multiple partitions of the same drive will probably severely tank throughput. There isn't much that's sequential about switching between two widely separated offsets all the time, although the firmware probably compensates for this mistake to some degree.

not 100% sure on that.
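
If you want numbers instead of guesses, fio can read at a fixed offset, so you can compare the outer tracks against the inner ones. A rough sketch, assuming /dev/sdX is the drive (or one actuator's LUN) and a recent fio that accepts percentage offsets:

fio --name=outer --filename=/dev/sdX --direct=1 --rw=read --bs=1M --offset=0 --size=10G     # near LBA 0, outer tracks
fio --name=inner --filename=/dev/sdX --direct=1 --rw=read --bs=1M --offset=95% --size=10G   # near the end, inner tracks

On the dual-actuator SATA models, where the two halves of the LBA range map to the two actuators, offsets just below 50% and just below 100% should be the inner tracks of each actuator.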

Thanks for your insight. Would RAID 0 perf tank if it was split between platters/actuators?

Thanks for the new video @wendell! It’s quite timely for me. I’m setting up a new storage/Proxmox server using a bunch of the 2x18 SATA drives. I have 6 of the SATA drives and have been experimenting with different ways to configure them and which OS/GUI to use with them. The plan is to run Proxmox 8 on the bare metal and then pass the drives through (via an Adaptec HBA) to a VM that’s running on Proxmox. I currently have another machine that’s doing just this, albeit with TrueNAS core. It seems to work pretty well.

So these new dual actuator drives bring with them awesome performance potential but also some quirks that make it a bit more tricky to implement in TrueNAS and/or Houston. Neither of these “GUIs” will allow me to split the drives at the halfway point as is required to use them properly. At least I haven’t been able to find a way to make it work via their GUIs. What I ended up doing was using the script from @John-S to name the drive halves and then manually create a pool using the Linux command line zfs/zpool commands. This seems to work well and allows me to import the pool into either TrueNAS Scale or Houston.
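
For anyone who wants to do the split without the script, it is just two GPT partitions at the 50% mark. A rough sketch, not what John's script actually does (the device name and partition labels here are made up):

parted -s /dev/sdX mklabel gpt
parted -s /dev/sdX mkpart drive1-a 0% 50%     # lower half of the LBA range, actuator A
parted -s /dev/sdX mkpart drive1-b 50% 100%   # upper half, actuator B

The halves then show up under /dev/disk/by-partlabel/ and can be fed to zpool create like any other device.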

The issue with this method is that neither TrueNAS nor Houston seems to understand what's going on with the pool.

They seem to want to refer to the Vdev constituents as /dev/sdX, rather than the partitions that actually comprise the Vdevs. Also, the sizes shown are incorrect and each drive is shown twice. This also leads to weird temperature readings: the drive temperatures are read, but half of them seemingly show as zero degrees.

I can live with the weird temperature readings and constituent drive space size errors. I’m just wondering what the best OS/software infrastructure base for these drives would be (here in November 2023).

I looked into the kernels used for TrueNAS Scale and Houston, and both of them are understandably using STABLE releases of the kernel/Linux distro. TrueNAS uses kernel 6.1 and is unlikely to be updated any time soon. Houston is meant to be run on Ubuntu 20.04 LTS, if I'm reading things correctly.

It looks like from Linux kernel 6.3 onward, there are relevant optimizations implemented.

Updating the kernel underneath TrueNAS scale seems to be frowned upon. Not sure how Houston would deal with a newer kernel/OS version.

Then there's the drive topology… with 6 of the 2x18 drives, I'm left with a bunch of options. For now, I have it set up as 4 mirror Vdevs with an offset in A/B pairs to provide more resilience against a whole drive failure (a rough zpool sketch of this layout follows the list):

rust pool

  • Mirror1
    ** Drive 1A
    ** Drive 2B
  • Mirror2
    ** Drive 2A
    ** Drive 3B
  • Mirror3
    ** Drive 3A
    ** Drive 4B
  • Mirror4
    ** Drive 4A
    ** Drive 1B
  • Hot Spares
    ** Drive 5A
    ** Drive 5B
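
Spelled out as zpool commands, that layout would look roughly like this (using the same kind of hypothetical by-partlabel names as the split example above):

zpool create rust \
  mirror /dev/disk/by-partlabel/drive1-a /dev/disk/by-partlabel/drive2-b \
  mirror /dev/disk/by-partlabel/drive2-a /dev/disk/by-partlabel/drive3-b \
  mirror /dev/disk/by-partlabel/drive3-a /dev/disk/by-partlabel/drive4-b \
  mirror /dev/disk/by-partlabel/drive4-a /dev/disk/by-partlabel/drive1-b \
  spare /dev/disk/by-partlabel/drive5-a /dev/disk/by-partlabel/drive5-b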

My concern is how ZFS will deal with a drive failure. If a whole physical drive fails, then my guess is that it will replace both partitions (i.e. Drive XA and Drive XB) with both partitions of the hot spare Drive 5A and Drive 5B. But what happens if one actuator fails? My assumption is Drive XA is replaced by Drive 5A. Does this leave me with my proverbial pants down if another drive fails?

And what about expanding the pool? If I were to add a single dual-actuator drive, both of its halves would end up in the same mirror Vdev, thus not following the offset scheme that preserves some redundancy. If that new drive were to fail, it would take down the whole pool. It seems like it would require adding two more physical drives and making mirrors of opposite drive actuators (i.e. A1/B2 and A2/B1 mirrors). I suppose this isn't really that much different than with conventional drives. To expand, you just add another mirrored Vdev.

Another possible scenario would be a mirror of stripes, as opposed to the stripe of mirrors as described above. Something like this:

rust pool

  • Mirror1
    ** Stripe 1
    *** A1, B1, A2, B2
    ** Stripe 2
    *** A3, B3, A4, B4

Not even sure this is possible but I presume it is. BUT, it would seem to break the golden rule described in posts above of never having A and B actuators of the same drives in the same Vdev. Would a scenario like this be susceptible to the issue with queues? I’m showing my noobness here. I just don’t totally understand how that would work at a kernel level. Regarding hot spares in this scenario, it would seem that it’s more straightforward but again, I’m not sure how ZFS would deal with it.

But what about expansion in this scenario? Could a single physical drive be added? I don’t think so because if that new drive failed, it would take down both sides of the mirror.

  • A1, B1, A2, B2, A5
  • A3, B3, A4, B4, B5

So it would need a new pair of physical drives to be added:

  • A1, B1, A2, B2, A5, B5
  • A3, B3, A4, B4, A6, B6

@wendell Thanks again for the video and inspiration. You mentioned in the video at the end for us viewers to ask what we’d like to see regarding setups with these dual actuator SATA drives. I have a feeling there’s going to be more folks lurking around trying to figure out how to navigate all of this. So my suggestion for video and/or how-to content would be to address some of the intricacies I’ve raised above.

So what's a geek to do? I don't mind being on the bleeding edge, as it's just a homelab and the data will be backed up using the 3-2-1 principle.

–EDIT to deal with adding drives in the second scenario of mirrored stripes.

What does the smartctl report look like for the SAS drives? Do they have the reallocated sector count etc.? thx

If you want to add a vdev/mirror: partition the new drive, then replace an existing half-mirror first, which frees up that half of a drive to combine with the new drive's leftover partition.

So:
M1 1A + 2B
M2 2A + 3B <- replace with 5B
M3 3A + 4B
M4 4A + 1B
New M5 5A + the freed up 3B
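
In zpool terms, that shuffle is roughly (again with placeholder names for the drive halves):

zpool replace rust drive3-b drive5-b      # M2 resilvers onto 5B, freeing 3B
# wait for the resilver to finish, then:
zpool add rust mirror drive5-a drive3-b   # new M5 from 5A plus the freed 3B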

I think I personally would go with two raidz2 vdevs with these drives, in case just half of one drive dies but then another whole drive dies during resilver. But that's just hypothetical, and I don't have any drives larger than 4 TB.

And you would not be able to add to it

1 Like