How to ZFS on Dual-Actuator Mach2 drives from Seagate without Worry

Each independent head assembly has its own set of platters, so multipath doesn’t really do you much good. LUNs are the way to go. And almost nothing “new” today is multipath (nor has been for 5 years; when I ask SAN vendors about it, they laugh at me. But at least I can find out who the folks are that are wet behind the ears.)

1 Like

Dual-path connections are becoming a rarity in hardware outside of SAS expanders. It would not be ideal to lose half the capacity of your drive if you’re running without a SAS expander in front of it, so compatibility is likely the primary reason.
At any rate, SAS-3 is already 1.2 GB/s in each direction, which is more than enough bandwidth for the HDD even with two active heads.

SAS is actually faster than people realize: current-generation SAS is faster than PCIe 4.0 per PHY lane, and x4 SAS links (multilink SAS) are a thing. It’s also more reliable over long spans than PCIe, since it uses more than twice the signaling voltage.

1 Like

Ideally you’d want two paths to each half for controller/host redundancy. There wouldn’t be much benefit to what you propose since you’re already reaping the performance benefits of having both actuators going regardless of any multipath/load-balancing.

I still see a lot of brand new dual-expander backplanes and JBODs on the market so I don’t really understand what you’re trying to say here. Anyway, peasants like me are out here slumming it with gear much older than that. Dual-ported drives, cabling rings, SAS-2, SATA interposers, zfs-ha, etc. are still very much real things down here in the muck.

Unfortunately because each half only has one path, I won’t be able to drop these into an existing zfs-ha deployment for a nice speed boost.

Oh… I thought the point of dual heads was either one dedicated to writes and one to reads, or double write/read heads on the same platters so you could double IOPS that way.

2 Likes

That’s what I had hoped for too.

Welp, I guess not…

2 Likes

it still does double the performance though :slight_smile: iops, throughput, you name it.

5 Likes

IOPS are doubled because each of the two arm groups in the HDD can have a head reading, writing, and seeking independently of the other. It doesn’t hurt anything to have the two active heads on different platters.

Having two heads per platter would only add complexity without any benefit… unless all heads were already active on some future hypothetical hard drive and more bandwidth/IOPS were still required. But at that point you’re looking at something like 11 GiB/s of sequential bandwidth off of one 3.5" HDD with 11 platters and 44 heads (two for each side of a platter), each head transferring about 250 MiB/s as is common now.

2 Likes

You know, I’d love a video on just the technology these guys are doing. When people tell me spinning rust has no innovation left, I have to respectfully disagree. There is still a lot of stuff going on in that space.

It’s even cooler that ZFS has a way to handle this without having actually planned for it from the outset.

2 Likes

Unpopular opinion time…

I would worry that if this technology isn’t widely adopted, your options for obtaining replacement drives when one fails down the road could lead to some very interesting issues when it comes time to save your array. If the availability of these drives isn’t there, what are you going to do? You’d probably have to buy two drives of approximately half the size of the combined dual-actuator drive to replace it (or waste money on large drives you can’t fully utilize). This also means needing more bays than you might have.

I’m a little uneasy about some outlier scenarios too. Like, what happens if a single actuator on the drive fails (more moving parts)? Then you have to intentionally degrade a second VDEV to replace it. And the thought of one drive being split and failing across multiple VDEVs scares me personally. I hate to be a naysayer, but I think this technology is coming a little late to the game. When you consider the performance of modern flash-based storage and its ever-falling cost, putting data at more risk than spinning rust already affords, just to double IOPS, is too “worrying” for me to adopt.
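For the sake of argument, here’s a rough sketch of what that replacement dance would look like if each half of the dead drive sits in a different mirror VDEV (pool name, GUIDs, and device paths below are purely hypothetical):

# Both halves of the failed drive have to be replaced, one resilver per VDEV.
# Find the GUIDs of the two missing halves:
zpool status -g tank
# Then point each degraded VDEV at the corresponding half of the new drive:
zpool replace tank <guid-of-failed-half-0> /dev/disk/by-id/<new-drive-half-0>
zpool replace tank <guid-of-failed-half-1> /dev/disk/by-id/<new-drive-half-1>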

I’m open to being proven wrong or even persuaded to try it but it would take a lot to put my mind at ease with this technology.

2 Likes

@Whizdumb you offer some legit apprehensions. I will offer the notion that these are Enterprise drives which means they are fully covered by standard Enterprise-class warranties for up to 5 years.

This has the implication that Seagate has to maintain inventory and stock for at least 5 years, and since these drives are being deployed “elsewhere” in the industry by very high volume customers, I’m pretty sure replacements will be available for a good long while.

With regard to more moving parts = more failures: the only additional part is an extra voice coil actuator (the thing that sweeps the heads across the media surface).

Most HDD failures are the product of failing heads and/or media. In that sense, these really aren’t any different from conventional, single-actuator drives, with the exception of being able to “walk and chew gum” at the same time, i.e. do twice as many seeks per second and move twice as much data per second (YMMV depending on the SDS you put on top of them).

I will add that a conventional 3.5" drive, under full load, draws about 12 watts of power, but the dual-actuator drives draw closer to 14 watts (when being driven hard on both actuators). If we’re talking 50+ drives, that 2-watt delta per disk adds up in both thermals and power demand. At idle, power consumption is essentially the same.

  • John “S”
3 Likes

Doesn’t feel like it… feels more like they put two small drives in the same form factor.

2 Likes

@nx2l, can you offer us a description of your test environment? Maybe we can help!

Does it matter? If I can get double the drives for a given form factor, comparable power and same raw capacity, this isn’t a negative unless price/TB makes this uneconomical. It’s not like there’s a better alternative for smaller form factors atm.

It’s like using namespaces on NAND and ~doubling the performance by doing so. I’d take it.

1 Like

@John-S I certainly appreciate all of the potential benefits of this dual actuator technology. And, I didn’t even consider the performance improvement per watt you pointed out. Certainly in a scenario where performance is preferred over resilience, this tech is fantastic. But most of the arguments that I made specifically regarding disk topology and VDEV integrity are still concerning to me specifically when it comes to adoption in the ZFS space. Time will tell if these 2 play nicely together. I just hope that this isn’t the new ECC argument when it comes to ZFS. :slight_smile:

Either way, I do want to thank you for shedding additional light on the capabilities of the tech. It’s much appreciated.

1 Like

I don’t think you realize that these drives are not available new to anyone on this channel. As a result there is no access to a 5-year manufacturer warranty.
I looked around and could only find refurbished drives with a 2-year seller warranty. Not bad, but not the same. A warranty claim would be fulfilled with another drive from the same stock.

Well, close. One drive, one controller board, the same number of heads and platters, but two actuators that can move independently and present logically as two independent drives. There are many ways to spin this architectural improvement.

I like the idea a lot. I have a lot of drives for the sole purpose of improving throughput.
These drives (at the current refurbished cost, which is competitive with conventional drives) finally make it possible to saturate the bandwidth of the SAS2008 or SAS2308 controllers that many of us are rocking here.

The often-cited ‘performance improvement per watt’, while real, feels to me too much like a marketing argument for an HDD architecture in which half of the heads can move independently of the other half.
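Back-of-the-envelope, assuming roughly 500 MB/s sustained per dual-actuator drive (my assumption) and the usual host links on those cards:

8 drives × ~500 MB/s ≈ 4 GB/s, right around the link ceiling of a SAS2008 (PCIe 2.0 x8)
16 drives × ~500 MB/s ≈ 8 GB/s, right around the link ceiling of a SAS2308 (PCIe 3.0 x8)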

3 Likes

I saw the same listing, but the fact that these are 2X14-generation helium drives that have been sitting around for in excess of a year dissuaded me. That, plus the headache of fitting real 29-pin SAS into my very OEM HP workstation…

In any case, once old-stock 2X18-generation SATA drives come onto the refurbished market I’ll be excited to buy a couple! When you’ve only got two 3.5-inch bays and no room for exotic RAID setups, the jump from 250 to 550 MB/s in this form factor is very appealing :smile:

2 Likes

Some really great points. I agree that it seems a little silly when ~8TB SAS and SATA SSDs are almost within reach of the average homelabber. But these are still much cheaper per TB (refurbished).

I’d like to get some of these for my home fileserver (don’t need dual-ported drives at home at least) and I’ve also thought about potential failure modes. I’m planning to mirror both halves of each pair of drives, like this:

A0 - B0
A1 - B1
C0 - D0
C1 - D1

…which means that if a drive fails and I can’t find a direct replacement, I’ll have to replace one 7+7TB drive with two 8TB drives, or two 7+7TB drives with four larger drives (to avoid an excessive waste of space). That’s not very appealing.

I’ve also thought about putting in an odd number of dual-actuator drives and staggering the mirrors, something like this:

A0 - B1
B0 - C1
C0 - A1

That would leave a free slot in my server for a cold or hot spare.
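As a sketch, that staggered layout would boil down to something like this (device paths are made up; “lun0”/“lun1” stand for the two halves of each physical drive):

zpool create tank \
  mirror /dev/disk/by-id/driveA-lun0 /dev/disk/by-id/driveB-lun1 \
  mirror /dev/disk/by-id/driveB-lun0 /dev/disk/by-id/driveC-lun1 \
  mirror /dev/disk/by-id/driveC-lun0 /dev/disk/by-id/driveA-lun1
# No mirror contains both halves of the same physical drive, so losing one
# whole drive only degrades two of the three mirrors instead of killing one.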

Things are about to enter the retail channel in a big way. These seem to be second-gen drives… the 14TB drives were more “big enterprise,” while the new Exos 2X have all the lessons learned.

I’m pretty impressed with how much polish I’m seeing in their stack tbh, with the Exos 2X drives I have. There is often an Icarus factor where a company tries to overdo it with new tech and ends up shooting themselves in the foot (see also: Intel accelerators, maybe). However, I am not seeing that here.

I love keep it simple, stupid. And they have, which is good for all of us.

2 Likes

Using 14TB 2X14 dual actuator drives, along with the setup described by @John-S of using LVM to create a striped volume per drive, I’ve gotten some great performance in a raidz2 pool.

I’m nearly maxing out the capabilities of the HBA (2.2 GiB/s write throughput on raidz2 and raid0 alike), with individual LVM volumes consistently getting around 430 MiB/s during benchmarks.

They’re also extremely cheap - SEAGATE ST14000NM0001 14TB 2x7TB PARTITION 3.5in 12GBPS SAS HDD | eBay, $120 per (refurbished/used) drive.

Honestly I think I’m just going to keep buying only these, and then start buying the 18TB ones when they hit the used market :slight_smile:

I’ve also made a small script for generating the simple LVM commands for each of the matching pair LUNs - GitHub - johnpyp/mk-mach2-lvm
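For reference, the generated commands boil down to something like this per drive (device names are hypothetical, and the exact options may differ from what the script actually emits):

# The two LUNs of one dual-actuator drive show up as two block devices
pvcreate /dev/sda /dev/sdb
vgcreate mach2_0 /dev/sda /dev/sdb
# Stripe across both halves (-i 2) so one logical volume keeps both actuators busy
lvcreate -i 2 -l 100%FREE -n stripe0 mach2_0
# /dev/mach2_0/stripe0 then joins the raidz2 pool as a single "disk"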

1 Like

Q: How does an Exos® 2X SATA configuration differ from a SAS configuration?
A: For the SAS configuration, each actuator is assigned to a logical unit number (LUN 0 and LUN 1). For example, one 18TB SAS drive will present itself to the operating system as two 9TB devices that the operating system can address independently, as it would with any other HDD.
The Exos® 2X SATA configuration will present itself to the operating system as one logical device, since SATA does not support the concept of LUNs. The user must be aware that the first 50% of the logical block addresses (LBAs) on the device correspond to one actuator and the second 50% of the LBAs correspond to the other actuator. With both configurations, the user must send commands to both actuators concurrently to see the expected performance benefits.

Q: How do you identify which LBAs correspond to each actuator on an Exos® 2X SATA device?
A: Seagate® has worked with the T13 ATA committee to propose and implement a new log page for SATA—the Concurrent Positioning Ranges log page 47h identifies the number of LBA ranges (in this case, actuators) within a device. For each LBA range the log page specifies the lowest LBA and the number of LBAs. As a reminder, since LBA numbering starts at zero, the last LBA of either range will be the lowest LBA + the number of LBAs − 1.

Q: How do you identify these LBA ranges in Linux?
A: In Linux kernel 5.19, the independent ranges can be found in /sys/block/<disk>/queue/independent_access_ranges. There is one sub-directory per actuator, starting with the primary at “0”. The “nr_sectors” field reports how many sectors are managed by this actuator, and “sector” is the offset for the first sector. Sectors then run contiguously to the start of the next actuator’s range.

Interesting. So according to the above, it looks like with a reasonably recent Linux kernel, even the SATA versions of multi-actuator drives shouldn’t be too difficult to deal with, with a script.
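For example, a minimal sketch of such a script, assuming a drive at /dev/sda (hypothetical) and the sysfs layout described in the FAQ above:

#!/bin/sh
# Print the LBA range owned by each actuator of a SATA dual-actuator drive (kernel 5.19+)
disk=sda
for range in /sys/block/"$disk"/queue/independent_access_ranges/*/; do
  start=$(cat "$range/sector")
  count=$(cat "$range/nr_sectors")
  echo "actuator $(basename "$range"): sectors $start .. $((start + count - 1))"
done
# Those boundaries could then feed sgdisk or parted to carve one partition per
# actuator, so each half can be handed to ZFS (or LVM) as its own device.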