Premise
I was working on the level1 storage server and decided to move from Red Hat to Proxmox 8 (mainly because of the newer kernel, and because I wanted to switch back to Docker from Podman).
This system has an LSI disk shelf; the first shelf holds 24 12TB SAS HGST helium-filled drives.
I am also working on adding a second disk shelf of 18TB SAS Seagate Exos Mach.2 drives.
These disk shelves can operate in active/active mode over SAS-6, which is quite fast. Unfortunately, that will still bottleneck my especially zippy Exos Mach.2 drives… so I am working on a solution for that. In the meantime I wanted to set up proper World Wide Name (WWN) multipath, because this is the correct way to handle disks when you have true multipath.
Instead of the ZFS pool being made from /dev/sda, /dev/sdb, etc., it is made from mpatha, mpathb, etc. This is because, through the SAS machinery, there are multiple physical paths to each disk – hence multipath.
Multipath is slowly becoming obsolete, because multiple-path disk storage is less common than it used to be – in modern clustered designs, when a fault occurs the entire chassis is ejected from the cluster, rather than a single machine relying on a high level of internal redundancy as used to be the case.
Getting Started
apt install multipath-tools multipath-tools-boot
It is a good idea to man multipath.conf – multipath doesn’t do much for you out of the box.
defaults {
    polling_interval 2              # seconds between path health checks
    path_selector "round-robin 0"   # rotate I/O across the paths in a group
    path_grouping_policy multibus   # put every path to a disk in a single group
    uid_attribute ID_SERIAL         # udev attribute used as the device WWID
    rr_min_io 100                   # I/Os sent down a path before rotating to the next
    failback immediate              # return to the preferred path group as soon as it recovers
    no_path_retry queue             # queue I/O instead of erroring out when all paths are down
    user_friendly_names yes         # use mpatha, mpathb, ... aliases instead of raw WWIDs
}
This is the content of a sensible /etc/multipath.conf
starting point. Note especially that round-robin is not active/active, and you want active/active if your hardware supports it. I recommend starting with this configuration to ensure stability, then switching to active/active.
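Once the file is saved, it is worth checking what configuration the tools actually end up with. A quick sanity check looks like this (the second command assumes multipathd is already running):
multipath -t | less      # dump the merged config: built-in defaults plus your file
multipathd reconfigure   # if the daemon is running, make it re-read /etc/multipath.conf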
Every disk has a globally unique World Wide Name (WWN). That’s what multipath uses to identify disks.
Sample lsblk output:
sde 8:64 0 10.9T 0 disk
sdf 8:80 0 10.9T 0 disk
sdg 8:96 0 10.9T 0 disk
sdh 8:112 0 10.9T 0 disk
sdi 8:128 0 10.9T 0 disk
sdj 8:144 0 10.9T 0 disk
sdk 8:160 0 10.9T 0 disk
sdl 8:176 0 10.9T 0 disk
sdm 8:192 0 10.9T 0 disk
sdn 8:208 0 10.9T 0 disk
sdo 8:224 0 10.9T 0 disk
sdp 8:240 0 10.9T 0 disk
sdq 65:0 0 10.9T 0 disk
sdr 65:16 0 10.9T 0 disk
sds 65:32 0 10.9T 0 disk
sdt 65:48 0 10.9T 0 disk
sdu 65:64 0 10.9T 0 disk
sdv 65:80 0 10.9T 0 disk
sdw 65:96 0 10.9T 0 disk
sdx 65:112 0 10.9T 0 disk
sdy 65:128 0 10.9T 0 disk
sdz 65:144 0 10.9T 0 disk
sdaa 65:160 0 10.9T 0 disk
sdab 65:176 0 10.9T 0 disk
sdac 65:192 0 10.9T 0 disk
sdad 65:208 0 10.9T 0 disk
sdae 65:224 0 10.9T 0 disk
sdaf 65:240 0 10.9T 0 disk
sdag 66:0 0 10.9T 0 disk
sdah 66:16 0 10.9T 0 disk
sdai 66:32 0 10.9T 0 disk
sdaj 66:48 0 10.9T 0 disk
sdak 66:64 0 10.9T 0 disk
sdal 66:80 0 10.9T 0 disk
sdam 66:96 0 10.9T 0 disk
sdan 66:112 0 10.9T 0 disk
sdao 66:128 0 10.9T 0 disk
sdap 66:144 0 10.9T 0 disk
sdaq 66:160 0 10.9T 0 disk
sdar 66:176 0 10.9T 0 disk
sdas 66:192 0 10.9T 0 disk
sdat 66:208 0 10.9T 0 disk
sdau 66:224 0 10.9T 0 disk
sdav 66:240 0 10.9T 0 disk
sdaw 67:0 0 10.9T 0 disk
sdax 67:16 0 10.9T 0 disk
sday 67:32 0 10.9T 0 disk
sdaz 67:48 0 10.9T 0 disk
This system only has a total of 24 10.9T disks, but there are 48 entries: two paths per disk, so each disk shows up in two places.
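You can confirm the pairing from the WWNs themselves (assuming udev populates the WWN column, which it normally does for SAS drives); each shelf disk’s WWN should appear exactly twice:
lsblk -dno WWN | sort | uniq -c | sort -n    # expect a count of 2 per shelf disk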
From here it is easy enough:
# multipath -lv /dev/sde
mpathb (350000c9000389aa8) dm-1 HGST,HUH721212ALE60SA
size=11T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
|- 6:0:0:0 sde 8:64 active undef running
`- 6:0:25:0 sdac 65:192 active undef running
Run multipath -lv for each disk you want to add to the multipath. This tells us this particular disk shows up at both /dev/sde and /dev/sdac. You could also write a shell script to iterate over the entries and add them.
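Here is a minimal sketch of such a script. The 10.9T size filter is just how I would pick out the shelf disks on this particular system; adjust or replace it so you never touch your boot drives.
#!/bin/bash
# Sketch: register the WWID of every 10.9T shelf disk with multipath.
# Both paths of a disk resolve to the same WWID, so iterating over all
# 48 sdX entries still leaves one entry per physical disk.
for disk in $(lsblk -dno NAME,SIZE | awk '$2 == "10.9T" {print "/dev/"$1}'); do
    multipath -a "$disk"
done
multipath -v2    # build the multipath maps from the registered WWIDs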
Either way, multipath records the World Wide Names explicitly in its config:
cat /etc/multipath/wwids
# Multipath wwids, Version : 1.0
# NOTE: This file is automatically maintained by multipath and multipathd.
# You should not need to edit this file in normal circumstances.
#
# Valid WWIDs:
/350000c9000389aa8/
# ... and so on ...
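A quick count of the registered entries (they are the lines wrapped in slashes) should match the number of physical disks, 24 in this case:
grep -c '^/' /etc/multipath/wwids    # expect 24 on this system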
The contents of /etc/multipath/wwids should now list every disk we’ve added. Running lsblk can also help us confirm:
lsblk
... snip ...
sde 8:64 0 10.9T 0 disk
└─mpathb 253:1 0 10.9T 0 mpath
sdf 8:80 0 10.9T 0 disk
└─mpathc 253:2 0 10.9T 0 mpath
sdg 8:96 0 10.9T 0 disk
└─mpathd 253:3 0 10.9T 0 mpath
sdh 8:112 0 10.9T 0 disk
└─mpathe 253:4 0 10.9T 0 mpath
sdi 8:128 0 10.9T 0 disk
└─mpathf 253:5 0 10.9T 0 mpath
sdj 8:144 0 10.9T 0 disk
└─mpathg 253:6 0 10.9T 0 mpath
sdk 8:160 0 10.9T 0 disk
└─mpathh 253:7 0 10.9T 0 mpath
sdl 8:176 0 10.9T 0 disk
└─mpathi 253:8 0 10.9T 0 mpath
sdm 8:192 0 10.9T 0 disk
└─mpathj 253:9 0 10.9T 0 mpath
sdn 8:208 0 10.9T 0 disk
└─mpathk 253:10 0 10.9T 0 mpath
... snip ...
Now the output shows the block devices sitting under their multipath maps. The same is visible in /dev/mapper/:
ls /dev/mapper/
control mpathb mpathd mpathf mpathh mpathj mpathl mpathn mpathp mpathr mpatht mpathv mpathx mpatha mpathc mpathe mpathg mpathi mpathk mpathm mpatho mpathq mpaths mpathu mpathw mpathw-part9
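Counting the maps (the pattern below deliberately skips partition entries like mpathw-part9) is a quick way to confirm that all 24 made it:
ls /dev/mapper/ | grep -cE '^mpath[a-z]+$'    # expect 24 on this system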
I recommend also doing systemctl enable multipathd and systemctl start multipathd to enable and start the multipath service. From here we can create our ZFS storage pool on top of multipath and carry out initial burn-in and integration tests.
The nicest thing about multipath is that a disk remains accessible even if a cable or controller flakes out on the storage bus, but you can also enjoy a performance benefit. Once you’re satisfied it is stable, it is time to try an active/active configuration.
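Before changing anything, it is worth confirming how traffic is spread today. A simple check (device names taken from the multipath -lv example above, and assuming the sysstat package is installed) is to watch both legs of one multipath device while the pool is under load:
# If both sde and sdac show I/O during a busy workload, dm-multipath is
# already spreading requests across the two paths of mpathb.
iostat -dx sde sdac 5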
Proxmox Quirks
Once drives are part of a multipath, they no longer show up in the Proxmox ZFS/disk/storage GUI. To be honest, this GUI in Proxmox has always been kind of half-baked – all but the simplest use cases require the CLI anyway.
zpool create elbereth -o ashift=12 -f raidz2 mpatha mpathb mpathc mpathd mpathe mpathf mpathg mpathh mpathi mpathj mpathk mpathl raidz2 mpathm mpathn mpatho mpathp mpathq mpathr mpaths mpatht mpathu mpathv mpathx
… and I’m adding some special devices…
zpool add -f elbereth special mirror /dev/nvme2n1 /dev/nvme3n1 mirror /dev/nvme0n1 /dev/nvme1n1
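At this point a quick look at the pool layout should show the raidz2 data vdevs alongside the mirrored special vdevs:
zpool status elbereth    # data vdevs plus the two mirrored special vdevs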
Finally, recordsize. The default is 128k; 512k or 1M makes more sense for my default use case. Remember that you probably want a smaller record size on other datasets used for things like… VM storage.
zfs set recordsize=512k elbereth
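As a hedged illustration (the dataset names and the 64k figure below are just examples, not part of this build), per-dataset overrides might look like this:
# Hypothetical datasets: big records for bulk data, smaller ones for VM disks.
zfs create -o recordsize=1M elbereth/media
zfs create -o recordsize=64k elbereth/vmstore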