How to ZFS on Dual-Actuator Mach2 drives from Seagate without Worry


What is this script for?

This script is for users of ZFS and dual actuator hard drives!

Behold – the latest innovation, long overdue, the DUAL ACTUATOR mechanical hard drive.

The Seagate Exos Mach.2 hard drive, in the 18TB variant. Dual actuator, you say?? What’s that?

These drives are basically perfect for ZFS: double the IOPS at a cost delta of maybe +$10 per drive? Sign me up! But since each drive presents as two devices, in order to keep the drive’s complexity (and therefore cost) as low as possible, we must be careful about our ZFS pool geometry. It would be handy if we could quickly determine optimal ZFS pool setups that maximize redundancy, and we need to keep in mind that we can lose up to TWO device entries in our ZFS pool when just ONE physical device dies.

That means spreading the redundancy across multiple raidz1 vdevs, arranged so that the two device entries belonging to one physical drive never share a vdev. That way, if a drive fails catastrophically, we are GUARANTEED that its two failing component entries land in different vdevs.

It is really easy to do that with this handy script!

More Background

Be aware that these drives are available in both SAS and SATA variants. Since SATA doesn’t support logical unit numbers, on the SATA variant it is simply the LBA range that determines which actuator you get (the front half of the drive is one actuator, the back half is the other).

For SAS, the drive presents as two logical units.

Even though the SAS version of the drive is not especially atypical, most hardware RAID controllers will not work properly with dual-actuator drives.

ZFS, on the other hand, is made for these drives. They work together great!

Dual Actuator Drives in the Wild

Understanding The Script

ls /dev/disk/by-id/ 

You should have output similar to the following:
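With the SAS drives attached, the listing contains a pair of wwn entries per physical drive; for instance, using the WWNs from the 12-drive example later in this post (abbreviated):

wwn-0x6000c500d8e7532b0000000000000000
wwn-0x6000c500d8e7532b0001000000000000
wwn-0x6000c500d8fdffdf0000000000000000
wwn-0x6000c500d8fdffdf0001000000000000
...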

Each unique WWN identifies one physical drive, and the “00” or “01” at characters 25-26 of the name is the logical unit number (in the case of SAS drives): “00” is one actuator, “01” is the other.

That two-character field is what the script below keys on to tell the halves apart.

The Script

#!/bin/bash

# Directories
DIR=/dev/disk/by-id

# Array for devices
declare -a DEVICES_00=()
declare -a DEVICES_01=()

# Iterate over each wwn device
# 6000c500d is dual actuator wwn prefix
# last char is 0 to avoid -part devices that 
# may appear here as well 
for FILE in "${DIR}"/wwn-0x6000c500d*0; do
    # Extract characters 25-26 of the name (the LUN: "00" or "01")
    LAST_TWO_CHARS=$(basename "$FILE" | cut -c25-26)
    # Add the device to the matching array based on those two characters
    if [ "$LAST_TWO_CHARS" == "00" ]; then
        DEVICES_00+=("/dev/disk/by-id/$(basename "$FILE")")
    elif [ "$LAST_TWO_CHARS" == "01" ]; then
        DEVICES_01+=("/dev/disk/by-id/$(basename "$FILE")")
    fi
done

# Ideally you have at least *3* drives (6 device entries) to use with this script
# base command
echo "zpool create mach2tank -o ashift=12 \ "

# raidz1 vdev made of the top halves of drives
if [ ${#DEVICES_00[@]} -ge 3 ]; then
    echo " raidz1 ${DEVICES_00[@]}   \ "
fi

echo -e "\n"

# raidz1 vdev made of the bottom halves of drives
if [ ${#DEVICES_01[@]} -ge 3 ]; then
    echo " raidz1 ${DEVICES_01[@]}     \ "
fi

echo -e "\n"
echo "if you would rather have one big raidz2 vdev..."

# base command
echo "zpool create mach2tank -o ashift=12 \ "

# raidz2 or 3 is an option for these drives which does NOT require more than one vdev

echo " raidz2 ${DEVICES_00[@]}   \ "
echo " ${DEVICES_01[@]}     \ "

echo -e "\n\n"

echo " For additional redundancy you could use raidz2 with 2 devs, raidz3 with 1 vdev, and so on. "
echo " You can use the output above to make a manual selection about vdevs and device mapping as well."

For my 12-drive setup, that creates:

# zpool create mach2tank -o ashift=12 \
>  raidz1 /dev/disk/by-id/wwn-0x6000c500d8e7532b0000000000000000 /dev/disk/by-id/wwn-0x6000c500d8fdffdf0000000000000000 /dev/disk/by-id/wwn-0x6000c500d921f10b0000000000000000 /dev/disk/by-id/wwn-0x6000c500d925b4a30000000000000000 /dev/disk/by-id/wwn-0x6000c500d956405f0000000000000000 /dev/disk/by-id/wwn-0x6000c500d9564dbf0000000000000000 /dev/disk/by-id/wwn-0x6000c500d956ac330000000000000000 /dev/disk/by-id/wwn-0x6000c500d99817cb0000000000000000 /dev/disk/by-id/wwn-0x6000c500d9a507cf0000000000000000 /dev/disk/by-id/wwn-0x6000c500d9a512230000000000000000 /dev/disk/by-id/wwn-0x6000c500d9a8d9b70000000000000000 /dev/disk/by-id/wwn-0x6000c500d9ff3c5b0000000000000000   \
>  raidz1 /dev/disk/by-id/wwn-0x6000c500d8e7532b0001000000000000 /dev/disk/by-id/wwn-0x6000c500d8fdffdf0001000000000000 /dev/disk/by-id/wwn-0x6000c500d921f10b0001000000000000 /dev/disk/by-id/wwn-0x6000c500d925b4a30001000000000000 /dev/disk/by-id/wwn-0x6000c500d956405f0001000000000000 /dev/disk/by-id/wwn-0x6000c500d9564dbf0001000000000000 /dev/disk/by-id/wwn-0x6000c500d956ac330001000000000000 /dev/disk/by-id/wwn-0x6000c500d99817cb0001000000000000 /dev/disk/by-id/wwn-0x6000c500d9a507cf0001000000000000 /dev/disk/by-id/wwn-0x6000c500d9a512230001000000000000 /dev/disk/by-id/wwn-0x6000c500d9a8d9b70001000000000000 /dev/disk/by-id/wwn-0x6000c500d9ff3c5b0001000000000000     \
>
[root@storinator ~]# zpool status
  pool: mach2tank 
 state: ONLINE
config:

        NAME                                        STATE     READ WRITE CKSUM
        mach2tank                                   ONLINE       0     0     0
          raidz1-0                                  ONLINE       0     0     0
            wwn-0x6000c500d8e7532b0000000000000000  ONLINE       0     0     0
            wwn-0x6000c500d8fdffdf0000000000000000  ONLINE       0     0     0
            wwn-0x6000c500d921f10b0000000000000000  ONLINE       0     0     0
            wwn-0x6000c500d925b4a30000000000000000  ONLINE       0     0     0
            wwn-0x6000c500d956405f0000000000000000  ONLINE       0     0     0
            wwn-0x6000c500d9564dbf0000000000000000  ONLINE       0     0     0
            wwn-0x6000c500d956ac330000000000000000  ONLINE       0     0     0
            wwn-0x6000c500d99817cb0000000000000000  ONLINE       0     0     0
            wwn-0x6000c500d9a507cf0000000000000000  ONLINE       0     0     0
            wwn-0x6000c500d9a512230000000000000000  ONLINE       0     0     0
            wwn-0x6000c500d9a8d9b70000000000000000  ONLINE       0     0     0
            wwn-0x6000c500d9ff3c5b0000000000000000  ONLINE       0     0     0
          raidz1-1                                  ONLINE       0     0     0
            wwn-0x6000c500d8e7532b0001000000000000  ONLINE       0     0     0
            wwn-0x6000c500d8fdffdf0001000000000000  ONLINE       0     0     0
            wwn-0x6000c500d921f10b0001000000000000  ONLINE       0     0     0
            wwn-0x6000c500d925b4a30001000000000000  ONLINE       0     0     0
            wwn-0x6000c500d956405f0001000000000000  ONLINE       0     0     0
            wwn-0x6000c500d9564dbf0001000000000000  ONLINE       0     0     0
            wwn-0x6000c500d956ac330001000000000000  ONLINE       0     0     0
            wwn-0x6000c500d99817cb0001000000000000  ONLINE       0     0     0
            wwn-0x6000c500d9a507cf0001000000000000  ONLINE       0     0     0
            wwn-0x6000c500d9a512230001000000000000  ONLINE       0     0     0
            wwn-0x6000c500d9a8d9b70001000000000000  ONLINE       0     0     0
            wwn-0x6000c500d9ff3c5b0001000000000000  ONLINE       0     0     0

errors: No known data errors

Perfecto! It’s raidz1, but thanks to the geometry it is still fully redundant against the loss of any one physical drive. You don’t have to use the wwn device names, but there is no harm in it.

Let’s pull a drive!

[root@storinator ~]# zpool status
  pool: mach2tank 
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 00:00:01 with 0 errors on Sun May 14 03:42:13 2023
config:

        NAME                                        STATE     READ WRITE CKSUM
        mach2tank                                   DEGRADED     0     0     0
          raidz1-0                                  DEGRADED     0     0     0
            wwn-0x6000c500d8e7532b0000000000000000  ONLINE       0     0     0
            wwn-0x6000c500d8fdffdf0000000000000000  ONLINE       0     0     0
            wwn-0x6000c500d921f10b0000000000000000  ONLINE       0     0     0
            wwn-0x6000c500d925b4a30000000000000000  ONLINE       0     0     0
            wwn-0x6000c500d956405f0000000000000000  ONLINE       0     0     0
            wwn-0x6000c500d9564dbf0000000000000000  ONLINE       0     0     0
            wwn-0x6000c500d956ac330000000000000000  ONLINE       0     0     0
            wwn-0x6000c500d99817cb0000000000000000  ONLINE       0     0     0
            wwn-0x6000c500d9a507cf0000000000000000  ONLINE       0     0     0
            wwn-0x6000c500d9a512230000000000000000  ONLINE       0     0     0
            wwn-0x6000c500d9a8d9b70000000000000000  UNAVAIL      0     0     0
            wwn-0x6000c500d9ff3c5b0000000000000000  ONLINE       0     0     0
          raidz1-1                                  DEGRADED     0     0     0
            wwn-0x6000c500d8e7532b0001000000000000  ONLINE       0     0     0
            wwn-0x6000c500d8fdffdf0001000000000000  ONLINE       0     0     0
            wwn-0x6000c500d921f10b0001000000000000  ONLINE       0     0     0
            wwn-0x6000c500d925b4a30001000000000000  ONLINE       0     0     0
            wwn-0x6000c500d956405f0001000000000000  ONLINE       0     0     0
            wwn-0x6000c500d9564dbf0001000000000000  ONLINE       0     0     0
            wwn-0x6000c500d956ac330001000000000000  ONLINE       0     0     0
            wwn-0x6000c500d99817cb0001000000000000  ONLINE       0     0     0
            wwn-0x6000c500d9a507cf0001000000000000  ONLINE       0     0     0
            wwn-0x6000c500d9a512230001000000000000  ONLINE       0     0     0
            wwn-0x6000c500d9a8d9b70001000000000000  UNAVAIL      0     0     0
            wwn-0x6000c500d9ff3c5b0001000000000000  ONLINE       0     0     0

errors: No known data errors

With just one drive pulled, the pool is still operational, and exactly one component is missing from each vdev. Even though we’ve technically suffered two device failures, our ZFS pool is still fully operational, since this setup method guarantees the failures are spread across more than one vdev.

What about SATA drives?

SATA drives can’t present as two logical units, so you have to use partitions to get the extra speed. See the post from John-S further down.

Performance

The performance of this array is quite good, even with just two raidz1 vdevs:

2147479552 bytes (2.1 GB, 2.0 GiB) copied, 0.967566 s, 2.2 GB/s

for sequential reads and writes. That would nearly saturate a 25 gigabit ethernet connection. With some performance tuning, and by giving up a little more space for redundancy, it is possible to achieve nearly 5 gigabytes per second from just 12 mechanical drives. Record breaking, I’m sure.
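If you want to sanity-check a pool like this yourself, a crude sequential test looks something like the following (dataset name and sizes are illustrative; turn compression off, or /dev/zero will give you fantasy numbers):

zfs create -o compression=off mach2tank/bench
# sequential write, flushed to media before dd reports a number
dd if=/dev/zero of=/mach2tank/bench/testfile bs=1M count=2048 conv=fdatasync
# sequential read; note the ARC may serve this warm, so use a file much
# larger than RAM (or export/import the pool) for a cold-read number
dd if=/mach2tank/bench/testfile of=/dev/null bs=1M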

17 Likes

Yeah, you don’t want to mess this up at pool creation; it can be fatal. That script is very helpful in that regard.

More logical devices make much more optimized vdev configs possible. And 2x12-wide RAIDZ1 with Mach drives will simply outperform 1x12-wide RAIDZ1 with non-Mach drives, fully utilizing the actuators. Doubling the number of vdevs is probably the way to go.

And with 9TB (vs. 18TB) per logical drive, it will reduce resilver time by a lot. Good news for all RAIDZ resilver fans out there :slight_smile:

I’m not so sure about degraded performance in RAIDZ. It certainly isn’t nice, but is it worse with multiple vdevs in degraded state?

I’ve seen news about R&D on x4 Actuator drives, so we may see HDDs breaking the GB barrier in sequential IO. Probably not with SATA3 interface though :slight_smile:

Ramp up the compression + lower RAIDZ width or mirrors, then that dual 25G NIC totally runs at 100%

It would be complicated to try and setup a pool with an odd number of vdevs wouldn’t it?

Also, all the Adaptec PCIe Gen4 hardware RAID controllers from the past two-ish years are dual-actuator aware:

I have too, along with NVME controller-based mechanical drives.

I was surprised to learn that only one head on “traditional” hdds is actually doing anything at any one time; I had always assumed that all the heads were constantly reading and writing at the same time. With dual actuator drives, two heads can read/write data at one time.

Considering that modern high density HDDs have 20 heads in them, it would seem that if hdd manufacturers would just allow all the heads to read/write at once with one main voice coil actuator, it could greatly speed up sequential performance.
This would seem to be possible because each head already has its own piezo stage for track-following.

It’s possible, but you’re really asking for trouble, because you don’t want two logical drives from the same physical drive in the same vdev. Or make one vdev with standard drives and then two vdevs with dual actuator drives, if odd numbers are totally your thing :wink:

But not having an odd number of vdevs isn’t anything to worry about for any metric other than aesthetics.

Seems more like a technical limitation to me. Intentionally crippling performance while SATA SSDs and NVMe conquer the world and grab your market share? Unlikely.

And sequential performance is all you need for archival, backup or cold storage. Throughput is still abysmal compared to capacity, but seeing something close the gap is a good thing. Can’t wait for Toshiba and WD to catch up and keep the competition rolling.

1 Like

I’m just thinking about future hardware topologies, so this isn’t a detraction of the current solution (because it’s light years better than what we had before): it’s not ideal to have vdev count tied to LUN count. If hdd manufacturers decided to pull the trigger on all-heads-read/write-at-once drives for max sequential performance in a specific niche, assuming they could overcome the control problem and put that many head-driving blocks on the PCB, we’d end up with a 20-vdev-wide pool. It seems like raid0’ing all the heads/platters together, and then doing some fancy LBA translation in the firmware so that sequential LBA access hits all the platters at once, might be a more reasonable future approach.

If the hdd manufacturers could pull this off, we’d have ~5GB/s from a single 10-platter hdd; at that point the NVMe interface really would seem apt for production drives and not just something in the lab.

Yes, it is 100% a technical/cost problem. Firmware that does a little more than strictly sequential reads/writes, by pushing the piezo stages on the heads harder to reach out to many more adjacent tracks, would be very complicated, but very tempting to implement.
I heard a Seagate talk where they mentioned (albeit like 6 years ago) that one of the primary reasons hdds weren’t multichannel in this way is the cost of the ADC and amplifier channels they would have to add; this seems like something that would continue to miniaturize and come down in cost somewhat, although these are analog components so they’ll never scale like digital.
And I’m sure EAMR would add complexity too.

Dual-actuator drives are very interesting, indeed. I almost pulled the trigger, but then I checked whether Backblaze had stats on this model, and they don’t.

But they do have this vendor comparison chart:


They argue, successfully, that choosing a vendor with an order of magnitude higher failure rate than the industry best works for their business, but it does not work for me, especially since the currently available “drives in the wild” are refurbished units.

Sadly, this is most likely what I’m going to do.

But dual- (or rather multi-) actuator technology may extend HDD technology’s significant market share for a couple more years into the future.

@jode

Number 1, I work for Seagate as a storage architect, primarily focused on Web3 (because it is the most fun and is moving faster than anything else in the industry). Full disclosure:
https://www.linkedin.com/in/john-suykerbuyk/

I have to admit, this year’s report really surprised me, particularly when compared to the previous report that declared the Bathtub Curve was dead:

That was the report in which Seagate drives were demonstrated as being among the most reliable.

To focus on this year’s report, there are a couple of interesting takeaways:

First, you’ll see a generational “uptick” in every vendor’s 8 to 12 TB drive failures over time. We did great up till 8TB and then stumbled for about a year, and when I say “we”, I think you can see HGST hit a 4.3% fail rate. WD didn’t participate.

While I love the Backblaze reports and study them every single time they publish it, one of the hazards is that they only report “what was”. Even if you wanted to, you can’t go back in time and buy the best of the best of the generations being reported because technology marches on.

The one thing you can count on is that with our 5-year warranty, I can assure you it catches our undivided attention when a generation or population of drives experiences issues in the field as it costs “US” real money to replace it - a lot more than the profit margin in the sale of a drive.

Regarding Dual Actuator - as others have noted, having two independent devices in one 3.5" drive slot does not eliminate the failure domain of the drive slot. But in actuality, this is not any different than a single actuator drive, is it?

I mean, if a disk has 10 platters, then whether they are divided up across two actuators or serviced by a single actuator, the loss of a single surface or head ultimately results in the loss of ALL of the data in that slot. With dual actuator, however, half of it continues to function as normal until the drive is replaced.

Ergo, if you leverage your erasure coding to maintain the fail domain of a single drive slot, the robustness of the array actually goes up because you can redistribute half the drive slot data PRIOR to replacing the “whole” disk.

In the simple case, creating vdevs of the “A” actuators and another set of vdevs from the “B” actuators, (LUNs 0 & 1 in SAS), enables you to maintain the fault domain of a single slot.

So in summary, the fault domain for a conventional drive is the drive slot. The fault domain of a dual actuator drive is STILL the drive slot, but the two halves almost always continue to function relatively independently of each other. Ergo, when one actuator fails, the other is still online, and your data is still there.

The only thing you need to do differently, is to structure your vdevs such that the erasure coding respects the fault domain of the single drive slot.

3 Likes

hot take:
WDC ruined HGST’s once sterling reputation.


It was my understanding that both luns of a dual actuator drive should be thought of as one failure domain, because most failure modes of the hdd will affect both “halves”. I remember a talk with Muhammad Ahmad saying this, and noting that ZFS has no guard rails to enforce the failure domain concept.

@twin_savage, I’ll offer you a bit of a thought experiment to contemplate.

As a percentage of all failures, how often does a drive “die” from a massive PCB/component failure that renders it incapable of communicating with the outside world, versus how often does a media error result in uncorrectable errors (generally on only one surface)?

Mohammad Ahmad, a very good friend and colleague of mine going all the way back to the days of Maxtor, was 100% correct in asserting that one should treat the drive slot as the fault domain. And even if only “half” the drive has failed from media errors, you lose both LUNs when you replace the drive to rebuild/replace the failing LUN.

In the case of ZFS, it functionally does not make much of a difference, particularly in the case of RAIDZ. However, in the case of a large JBOD population leveraging stripes of draid’d devices, this is not true.

Consider the following:

The pool was created with one dRAID1 vdev across the 12 LUNs of each of the two actuator sets, with the following command line:
zpool create STXSAS draid1:4d:1s $(ls /dev/disk/by-id/wwn-0x6000c500d98* | grep 0000000000000000) draid1:4d:1s $(ls /dev/disk/by-id/wwn-0x6000c500d98* | grep 0001000000000000)

In this case, if I have a failure of a media surface on one of the LUNs, let’s imagine in the vdev labeled ‘draid1:4d:11c:1s-0’, I can offline the corresponding LUN in the other vdev, allowing the spare to take its place PRIOR to replacing the entire disk, mitigating my risk of compound failure by 50%.
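A sketch of that sequence (the wwn is abbreviated/hypothetical, and draid1-1-0 assumes OpenZFS’s distributed-spare naming of draid<parity>-<vdev>-<spare>):

# LUN 0 of a drive is throwing media errors in the first draid vdev;
# proactively retire its sibling LUN 1 from the other vdev:
zpool offline STXSAS wwn-0x6000c500d98xxxxx0001000000000000
# let that vdev's distributed spare take the sibling's place:
zpool replace STXSAS wwn-0x6000c500d98xxxxx0001000000000000 draid1-1-0
# now the physical drive can be pulled and replaced, and both LUNs
# of the new drive zpool-replace'd back in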

This may not seem like a big deal, but I’m currently testing and deploying ExosX2 DA drives in our 5u84 JBODs with integrated AMD Epyc processors. In the simple case, it is a single unit with 84 drive slots and 168 “LUNs”, typically arranged as 20 RAIDZ2 vdevs of 10 LUNs (0 & 1), with four SSDs mirrored as the meta/special vdev.

But I can just as easily create a handful of draid2 vdevs, striped together respecting the fail domain of a single disk slot, such that no vdev contains both the “LUN0” and the “LUN1” of any single drive.

If we move on from ZFS to something like CEPH, you can construct a crush map that leverages the abstract concept of a “shelf” to maintain an erasure coding topology such that the loss of two OSD targets in one disk slot is completely handled with appropriate erasure coding. In fact, if one finds Dual Actuator interesting for ZFS, it absolutely rocks for CEPH and a proper crush map! Remember when we were being told to not use drives bigger than 4TB on CEPH because rebuilding & rebalance would take too long? ExosX2 gives us a way to overcome that while continuing to push storage density boundaries.

Needless to say, dual actuator brings changes to how we guard against eventual failure, but as HDDs are poised to grow to 30 TB per drive slot (and beyond), dual actuator is the only HDD technology that shatters the IOPs/TB barrier.

6 Likes

I see the distinction you are making there as to why it can be beneficial to take the LUNs away more granularly. Appreciate the insight.


1 Like

funfact: this is also why, in our zfs build videos, I always say to keep a slot or two free and, when resilvering, to use that slot to ADD a drive before removing the one that has failed, even if the failed one appears stone cold dead. This is a best practice with all storage types. You “automatically” have “spare” slots if you have a “hot spare”, but that’s not always true, which is part of why we go to a lot of trouble to explain why hot spares can be soo cool. In addition to being a spare that is right there, ready to go, they’re taking up a spare slot automatically. So when you pull the dead drive you can have a new hot spare that lives there AND still not have compound failure chances :slight_smile:
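In zpool terms that’s just the two-argument form of replace, issued while the failing drive is still attached (the pool and failing wwn here are from the example above; the new drive’s wwn is a placeholder):

# new drive goes into the free slot; the old one stays in the pool
# and can still serve reads during the resilver
zpool replace mach2tank \
    wwn-0x6000c500d9a8d9b70000000000000000 \
    /dev/disk/by-id/wwn-0xNEWDRIVE0000000000000000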

3 Likes

Quantity of LUNs (that are actual real disks inside the same drive slot, not some logical construct sharing resources) is a quality all of its own. Recompiling the crush map and finding new sweet spots should be just as manageable as consciously designing and adapting your ZFS vdevs. But you still have to do it. And with this being a SAS-only feature, you can assume people buying these know what they are doing.

ZFS-based NAS systems like TrueNAS and UnRaid, being available to consumer-level admins, may face issues. Those users will probably get the SATA SKU, so ZFS and those frontends may need to implement a care-free option that accounts for the DA layout.

And I think we can agree that Ceph is not deployed by an uneducated customer base. I do see the opportunities in Ceph, and the 9TB vs 18TB argument for an OSD is very compelling (for any resilver, really). I’m a bit worried about inflating the number of OSDs and the impact on system load and overall performance per TB of deployment. But I’m looking forward to seeing how these things manifest in practice.

Always good when new things rock the boat.

Everybody gangsta until they type zpool replace and realize they don’t have a slot/bay left.

2 Likes

@Exard3k, you certainly raise a lot of excellent points. I’d like to point out that another option that can hugely simplify deployment and management of dual actuator SAS drives is via LVM LUN aggregation:

Via this simple algorithm, we can create striped logical volumes that can be treated and managed by the SDS on top of them exactly the same way a conventional single disk would be. The only foresight that needs to be applied is to ensure the LVM stripe size is an even multiple of the SDS logical block size.

In this way, we can ensure an even distribution of IO to both actuators AND ensure that both actuators are kept in relative sync to one another - almost as effectively as the olden days of the never fully realized strategy of RAID2.
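As a sketch of the idea (the wwn pair is one physical 18TB drive from the listing earlier in the thread; the volume group name and stripe size are illustrative):

# the two LUNs of one physical dual-actuator drive
LUN0=/dev/disk/by-id/wwn-0x6000c500d8e7532b0000000000000000
LUN1=/dev/disk/by-id/wwn-0x6000c500d8e7532b0001000000000000

pvcreate "$LUN0" "$LUN1"
vgcreate mach2vg "$LUN0" "$LUN1"
# stripe across both actuators; keep -I (stripe size) an even
# multiple of the SDS logical block size sitting on top
lvcreate --type striped -i 2 -I 128K -l 100%FREE -n mach2lv mach2vg
# /dev/mach2vg/mach2lv now behaves like one conventional 18TB device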

When I was building CEPH clusters around the previous incarnation of the 18TB ExosX2, the 14TB Mach.2 drive, I found the use of LVM striping to be a huge simplification that delivered good performance, so long as the level of threaded client concurrency didn’t (excessively) overlap CEPH Placement Groups.

Lastly, I’ll also point out that the ExosX2 also has a SATA variant. Obviously, SATA does not support proper LUNs (without one of two largely incompatible port multiplier strategies!), so Seagate chose to split the LBA range of the 18TB drive at the mid-LBA point: LBAs from 0 to mid-LBA go to actuator “A”, LBAs from mid-LBA+1 to max LBA go to actuator “B”.

The one and only disadvantage the SATA ExosX2 drive has versus the SAS flavor is that SAS brings a command queue depth of 256 commands, whereas SATA has a max queue depth of only 32 commands, and both actuators share the same common command queue. Ergo, you must ensure that you never fill/saturate the command queue with commands for one actuator whose completions are dependent upon the other actuator.

Seagate sponsored some enhancements to the BFQ scheduler to ensure the above deadlock does not happen:
BFQ Linux IO Scheduler Optimizations

However, I have never been able to create this kind of deadlock in any real-world application. You literally have to saturate one actuator with a single thread and then block that single thread on command completion from the other actuator.
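(For reference, selecting BFQ for a given SATA drive is a sysfs one-liner; sdX is a placeholder for your device:)

echo bfq > /sys/block/sdX/queue/scheduler
cat /sys/block/sdX/queue/scheduler   # active scheduler is shown in [brackets]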

How I solved the puzzle of split-LBA dual actuator is via simple, labeled GPT partitioning that takes into account the miscellaneous overhead of the GPT structures themselves and satisfies all the sector alignment requirements. The end result is a collection of block device targets that are easy to assemble from /dev/disk/by-partlabel/

For example:

This bash script is a full-on TUI that is capable of creating the properly aligned GPT partitions on each actuator, labeling them according to a potential zpool name plus actuator “A” or “B”, and suffixing the label with the tail of the parent device serial number, such that:
STXTEST_A_ZVV01M6P
STXTEST_B_ZVV01M6P

Are the “A” and “B” actuators of the same device:
/dev/disk/by-id/ata-ST18000NM0092-3CX103_ZVV01M6P

All the magic happens here:
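The heart of it is roughly the following (a simplified sketch assuming sgdisk; the device name is the example drive above, and the alignment math is reduced to the basics):

DEV=/dev/disk/by-id/ata-ST18000NM0092-3CX103_ZVV01M6P
TAIL=ZVV01M6P                           # tail of the parent serial
TOTAL=$(blockdev --getsz "$DEV")        # size in 512-byte sectors
MID=$(( (TOTAL / 2 / 2048) * 2048 ))    # mid-LBA, 1MiB-aligned

sgdisk --zap-all "$DEV"
sgdisk -n 1:2048:$((MID - 1)) -c 1:"STXTEST_A_${TAIL}" "$DEV"   # actuator A
sgdisk -n 2:${MID}:0          -c 2:"STXTEST_B_${TAIL}" "$DEV"   # actuator B
# the labeled halves now show up under /dev/disk/by-partlabel/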

Via this strategy, I can create a set of striped RAIDZ1 vdevs (via that script) that turns a humble USB-attached enclosure into a data shuttling monster that can saturate USB3 Gen2 10gb links all day long while still being able to survive the removal and replacement of any single drive. I personally use the 10-drive version with a set of RAIDZ2 vdevs (against the GPT partitions) to render 164TB of ready-to-go ZFS file system capacity that is, more often than not, used as a target for ZFS send/recv of production data sets.
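Assembling those labeled halves then looks much like the SAS case (pool name and raidz widths illustrative; the labels follow the scheme above):

zpool create STXTEST -o ashift=12 \
    raidz2 /dev/disk/by-partlabel/STXTEST_A_* \
    raidz2 /dev/disk/by-partlabel/STXTEST_B_*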

  • John “S”
8 Likes

About two years ago, I wrote an unpublished white paper on my experiments with dual actuator drives. I hadn’t yet figured out the striping of LUNs via ZFS, but I had perfected LVM LUN aggregation. Included are some basic CEPH recipes and performance numbers.

For those who are interested, here is a link:

4 Likes

Is each “half” of the SAS model dual-ported for multipath? Or does it use the second port for the second half, so there’s only one port per half?

I had trauma from piles of failed Seagate drives during my childhood. I would avoid Seagate for the rest of my life. Do we have other options from other brands?

They are single-ported, with no multipath support.

However, in my experience, most larger clusters where multipath would typically play a role, treat the storage enclosure as the primary failure domain. Active/Active failover is not leveraged as often as it was even 5 years ago except in HPC scenarios.

I know that’s a rather broad, sweeping statement, but I have not deployed a single multipath system in over 18 months out of several hundred PiBs of storage across dozens of data centers. For those scenarios where it used to apply, erasure coding across enclosures in the rack(s) is generally a far more cost-effective option.

Your mileage may vary of course and for some scenarios, it is absolutely necessary.

2 Likes

They are single-ported, with no multipath support.

That tracks. Thanks for clarifying.

However, in my experience, most larger clusters where multipath would typically play a role, treat the storage enclosure as the primary failure domain.

Sure, I’m thinking more for SMBs with single racks (or not even that) that still need some highly-available on-prem storage. I maintain a bespoke zfs-ha deployment at my office, for example. Or for folks running e.g. TrueNAS HA in a SuperMicro BigTwin or whatever.

Humbly, it’s not clear to me how treating enclosures as the primary failure domain eliminates the need for dual-ported drives. The two seem to be somewhat orthogonal? My zfs deployments can tolerate an enclosure failure—even with single-ported SATA/NL-SAS drives—simply based on how I’ve allocated drives to enclosures (e.g. the first drive in each raidz1 vdev in enclosure A, the second drive in each raidz1 vdev in enclosure B…). On the other hand, a deployment without dual-ported drives* can’t be highly available, regardless of the number of enclosure failures it can tolerate.

* – ETA: And, I guess, without expensive SAN switches. :slight_smile:

1 Like

For a SAN this is probably correct, but for SDS, with things like Ceph (how we got into this discussion in the first place), these things are no longer required. You can define failure domains in software, and only your creativity is the limit.

Ceph is HA by design and runs on commodity hardware. It’s a totally different beast, but we don’t need special hardware anymore to maintain HA. Everything is cheap and expendable :slight_smile:

1 Like

Why not use multipath to load balance across the dual heads of each drive?