Setting up the Enmotus FuzeDrive on Linux, with Caching

I did a video on the Enmotus FuzeDrive, an SSD with a dedicated 128GB SLC tier and the rest as QLC flash. It’s a 1.65TB drive overall.

When I did the video, a lot of people asked about using this on Linux, because on Windows it takes a special driver to move data in and out of the tiers safely.

On Linux, however, you can use either LVM or bcache with this drive. It can be configured for tiering (aka Sir Not-Appearing-in-this-How-To) or as a write-back cache.

What is the difference between tiering and a long-lived write-back cache?

The terminology that is used in the Linux kernel modules and how-to documentation is subtly different from what most industry folks say.

With tiering, you always have all the combined space of all levels of the tier. With caches, in general, the cache represents a subset of data elsewhere.

Write-back caching: the data is written to the cache first. Once the write to the cache completes, the overall write is marked as complete, even though the data still has to be flushed at some point to whatever the cache is caching.

Write-through caching: the data is written directly to the underlying device being cached, which is typically much slower. The operation is marked as complete only when the cached device signals that it has completed the write. Typically the data is retained in the faster cache level as well.

There is also write-around caching and this is left as an exercise to the reader.

The FuzeDrive SSD shows up as one big block device (/dev/nvme0n1 on my system), just like a regular NVMe device. However, the SLC region is fixed at the following LBA ranges on my units*:

1.6TB FuzeDrive SSD (~128GB SLC)
SLC region: LBA 0 to 268697599
QLC region: LBA 268697600 to 3027952367

900GB FuzeDrive (~24GB SLC)
SLC region: LBA 0 to 50380799
QLC region: LBA 50380800 to 1758174767
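
As a quick sanity check on those ranges, here is a small shell sketch that converts an inclusive LBA range to MiB. It assumes 512-byte logical sectors (which is what gdisk reports for this drive), and lba_range_mib is just a helper name made up for illustration.

```shell
# Convert an inclusive LBA range to a size in MiB.
# Assumption: 512-byte logical sectors, so 2048 sectors = 1 MiB.
lba_range_mib() {
    first=$1
    last=$2
    sectors=$(( last - first + 1 ))
    echo $(( sectors / 2048 ))   # integer division, rounds down
}

echo "1.6TB SLC: $(lba_range_mib 0 268697599) MiB"
echo "1.6TB QLC: $(lba_range_mib 268697600 3027952367) MiB"
echo "900GB SLC: $(lba_range_mib 0 50380799) MiB"
```

On the 1.6TB unit the SLC range works out to 131200 MiB, i.e. 128.125 GiB, which lines up with the 128.1G that lsblk reports for the first partition.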

lsblk
NAME                MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0                 7:0    0  29.9M  1 loop /snap/snapd/8542
loop1                 7:1    0    55M  1 loop /snap/core18/1880
loop2                 7:2    0 255.6M  1 loop /snap/gnome-3-34-1804/36
loop3                 7:3    0  62.1M  1 loop /snap/gtk-common-themes/1506
loop4                 7:4    0  49.8M  1 loop /snap/snap-store/467
sda                   8:0    0 465.8G  0 disk 
├─sda1                8:1    0   512M  0 part /boot/efi
└─sda2                8:2    0 465.3G  0 part 
  ├─vgubuntu-root   253:0    0 464.3G  0 lvm  /
  └─vgubuntu-swap_1 253:1    0   980M  0 lvm  [SWAP]
nvme0n1             259:0    0   1.4T  0 disk 
├─nvme0n1p1         259:1    0 128.1G  0 part 
└─nvme0n1p2         259:2    0   1.3T  0 part 

sudo gdisk /dev/nvme0n1

… set up the partitions and, for LVM, change the type to 8E00 …

Command (? for help): n
Partition number (2-128, default 2): 
First sector (34-3027952334, default = 268697600) or {+-}size{KMGTP}:           
Last sector (268697600-3027952334, default = 3027952334) or {+-}size{KMGTP}: 3027953267
Last sector (268697600-3027952334, default = 3027952334) or {+-}size{KMGTP}: 
Current type is 8300 (Linux filesystem)
Hex code or GUID (L to show codes, Enter = 8300): L
Type search string, or <Enter> to show all codes: LVM
8e00 Linux LVM                           
Hex code or GUID (L to show codes, Enter = 8300): 8e00
Changed type of partition to 'Linux LVM'

Command (? for help): p
Disk /dev/nvme0n1: 3027952368 sectors, 1.4 TiB
Model: PCIe SSD                                
Sector size (logical/physical): 512/512 bytes
Disk identifier (GUID): F31CE44D-FFD5-4E2B-BB42-1980E991D220
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 3027952334
Partitions will be aligned on 128-sector boundaries
Total free space is 94 sectors (47.0 KiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1             128       268697599   128.1 GiB   8E00  Linux LVM
   2       268697600      3027952334   1.3 TiB     8E00  Linux LVM

Command (? for help): w

Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
PARTITIONS!!

**Note: I just let gdisk use the defaults for the last sector of the 2nd partition instead of keying that in, because I’ve done terrible things to my drive. Your numbers might not be precisely the same.**

Using this information, you can partition your NVMe into SLC and QLC regions.

So after partitioning, /dev/nvme0n1p1 is the 128GB SLC region and /dev/nvme0n1p2 is the rest of the drive (QLC).
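
If you would rather script this than walk through the interactive prompts, sgdisk (from the same package as gdisk) can do it in one shot. The sketch below only prints the command so you can review it before running it as root. It assumes an empty partition table, the device name from my system, and the 1.6TB boundary; partition_cmd is a made-up helper name. Note that it lets sgdisk pick the default start for partition 1, which will likely differ from the sector 128 start in my table above.

```shell
# Assumptions: device name and SLC boundary of the 1.6TB unit.
# On the 900GB model the last SLC LBA would be 50380799 instead.
DISK=/dev/nvme0n1
SLC_LAST_LBA=268697599

# Print (not run) an sgdisk command that creates both LVM partitions.
# A start of 0 for partition 1 and an end of 0 for partition 2 ask
# sgdisk for its defaults (first aligned sector / end of free space).
partition_cmd() {
    qlc_first=$(( SLC_LAST_LBA + 1 ))
    echo "sgdisk --new=1:0:${SLC_LAST_LBA} --typecode=1:8e00" \
         "--new=2:${qlc_first}:0 --typecode=2:8e00 ${DISK}"
}

partition_cmd
```

After running the printed command, `sgdisk -p /dev/nvme0n1` should show partition 2 starting at sector 268697600.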

*TODO: Make sure retail unit LBA regions match.

Now in terms of how to leverage these two partitions most effectively – the jury is still out for me on that (lvm and bcache are in some kind of grudge match with the ST:TOS battle music playing). It looks to me like dm-writecache, rather than dm-cache, is going to be the winning formula. A lot of the caching stuff on Linux, so far, has really been most tested mixing SSDs and HDDs. But here we’re mixing SSDs only, and the reads from SSDs (whether QLC or SLC) are generally pretty good, though SLC is a bit better.

There is also bcache. LVM uses dm-(write)cache under the hood, but bcache is something else entirely.

It is possible, of course, to use other drives, or to just use the 128GB SLC for something that requires high endurance – like a SLOG on a ZFS pool, for example – but that’s up to you.

Configuring lvmcache

This will be a pretty vanilla LVM setup using dm-cache. This is technically caching, not tiering, meaning the usable capacity of this implementation is just the cached QLC volume, around 1.3TiB – a bit less than you’d get with the true tiering solution on Windows. *TODO: See what happened to LVMTS, as 5 years ago I did this as tiering with something from GitHub…*

# All these commands require either a root shell or sudo prefix...
#
# Mark partitions as usable for LVM
pvcreate /dev/nvme0n1p1 
pvcreate /dev/nvme0n1p2 

# Add to Group
vgcreate VG /dev/nvme0n1p1 /dev/nvme0n1p2

# Make a Logical volume
lvcreate -n notazpool -L 1300G VG /dev/nvme0n1p2

# There will be two cache segments -- cache metadata and
# cache data. Cache metadata should be approx 1000 times smaller than
# the cache data LV, with a minimum size of 8MiB.
# I rounded up.

lvcreate -n CACHE_DATA_LV -L 120G VG /dev/nvme0n1p1
lvcreate -n CACHE_METADATA_LV -L 200M VG /dev/nvme0n1p1

# finally, we can stitch these things together.

lvconvert --type cache-pool  --poolmetadata VG/CACHE_METADATA_LV VG/CACHE_DATA_LV

lvconvert --type cache --cachepool VG/CACHE_DATA_LV VG/notazpool

# and then check your work. 

lvs -a VG
  LV                          VG Attr       LSize   Pool                  Origin            Data%  Meta%  Move Log Cpy%Sync Convert
  [CACHE_DATA_LV_cpool]       VG Cwi---C--- 120.00g                                         0.01   3.89            0.00            
  [CACHE_DATA_LV_cpool_cdata] VG Cwi-ao---- 120.00g                                                                                
  [CACHE_DATA_LV_cpool_cmeta] VG ewi-ao---- 200.00m                                                                                
  [lvol0_pmspare]             VG ewi------- 200.00m                                                                                
  notazpool                   VG Cwi-a-C---  <1.27t [CACHE_DATA_LV_cpool] [notazpool_corig] 0.01   3.89            0.00            
  [notazpool_corig]           VG owi-aoC---  <1.27t                        
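
One thing worth checking at this point: as far as I know, LVM sets up dm-cache in writethrough mode unless told otherwise, so getting the write-back behaviour described earlier means switching the cache mode. The sketch below only prints the lvs/lvchange commands for review instead of running them. cachemode_cmds is a made-up helper name, and the VG/LV names are the ones used above.

```shell
# Print (not run) the commands to inspect and change the dm-cache mode
# of the cached LV created above (VG/notazpool).
cachemode_cmds() {
    mode=$1
    case "$mode" in
        writeback|writethrough) ;;
        *) echo "unsupported cache mode: $mode" >&2; return 1 ;;
    esac
    # Show the current mode as an extra column in the lvs output.
    echo "lvs -o+cache_mode VG/notazpool"
    # Switch the mode on the live cached LV.
    echo "lvchange --cachemode ${mode} VG/notazpool"
}

cachemode_cmds writeback
```

Keep in mind that in writeback mode, losing the SLC region before a flush means losing data. On a single physical drive that is less of a concern than with a separate cache device, but it is still a trade-off.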



Why dm-writecache instead of dm-cache?

Well, here’s a read benchmark:

 hdparm -t /dev/nvme0n1p1 

/dev/nvme0n1p1:
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 Timing buffered disk reads: 4276 MB in  3.00 seconds = 1425.15 MB/sec

 hdparm -t /dev/nvme0n1p2

/dev/nvme0n1p2:
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 Timing buffered disk reads: 3794 MB in  3.00 seconds = 1264.66 MB/sec

fio testing, 200g

test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.16
Starting 1 process
test: Laying out IO file (1 file / 204800MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=124MiB/s,w=41.3MiB/s][r=31.8k,w=10.6k IOPS][eta 00m:01s]
test: (groupid=0, jobs=1): err= 0: pid=11730: Mon Aug 24 18:05:58 2020
  read: IOPS=11.9k, BW=46.6MiB/s (48.8MB/s)(150GiB/3299352msec)
   bw (  KiB/s): min= 1000, max=506304, per=100.00%, avg=47694.92, stdev=36133.40, samples=6593
   iops        : min=  250, max=126576, avg=11923.72, stdev=9033.35, samples=6593
  write: IOPS=3973, BW=15.5MiB/s (16.3MB/s)(50.0GiB/3299352msec); 0 zone resets
   bw (  KiB/s): min=  272, max=168016, per=100.00%, avg=15900.10, stdev=12058.16, samples=6593
   iops        : min=   68, max=42004, avg=3975.01, stdev=3014.54, samples=6593
  cpu          : usr=1.46%, sys=8.10%, ctx=30273155, majf=0, minf=8
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=39320441,13108359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=46.6MiB/s (48.8MB/s), 46.6MiB/s-46.6MiB/s (48.8MB/s-48.8MB/s), io=150GiB (161GB), run=3299352-3299352msec
  WRITE: bw=15.5MiB/s (16.3MB/s), 15.5MiB/s-15.5MiB/s (16.3MB/s-16.3MB/s), io=50.0GiB (53.7GB), run=3299352-3299352msec

Disk stats (read/write):
    dm-2: ios=39316712/13255987, merge=0/0, ticks=82778796/127161724, in_queue=209940520, util=91.61%, aggrios=15990755/13885105, aggrmerge=0/0, aggrticks=63719553/150999489, aggrin_queue=214719042, aggrutil=93.04%
    dm-4: ios=1925/12041955, merge=0/0, ticks=47280/64661392, in_queue=64708672, util=25.00%, aggrios=47920375/39648303, aggrmerge=51892/2007014, aggrticks=190478427/440790550, aggrin_queue=516269164, aggrutil=93.63%
  nvme0n1: ios=47920375/39648303, merge=51892/2007014, ticks=190478427/440790550, in_queue=516269164, util=93.63%
  dm-5: ios=24882286/13257273, merge=0/0, ticks=141016144/124120916, in_queue=265137060, util=92.61%
  dm-3: ios=23088056/16356089, merge=0/0, ticks=50095236/264216160, in_queue=314311396, util=93.04%

cat test-fio.sh
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=200G --readwrite=randrw --rwmixread=75
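
For anyone who wants to try the dm-writecache route on the same layout, here is a rough sketch of what the swap might look like with LVM. I have not benchmarked this variant here, so treat it as a starting point. It prints the commands rather than running them; writecache_cmds and the WRITECACHE_LV name are made up for illustration, and depending on your LVM version the origin LV may need to be unmounted or deactivated before attaching (check lvmcache(7)).

```shell
# Print (not run) the commands to swap the dm-cache setup above for
# dm-writecache. WRITECACHE_LV is a hypothetical LV name.
writecache_cmds() {
    vg=$1        # volume group, e.g. VG
    origin=$2    # cached LV, e.g. notazpool
    fast_pv=$3   # SLC partition, e.g. /dev/nvme0n1p1
    size=$4      # size of the write cache, e.g. 120G

    # 1. Flush and remove the existing dm-cache pool.
    echo "lvconvert --uncache ${vg}/${origin}"
    # 2. Create a plain LV on the SLC partition to act as the cache volume.
    echo "lvcreate -n WRITECACHE_LV -L ${size} ${vg} ${fast_pv}"
    # 3. Attach it to the origin LV as a dm-writecache.
    echo "lvconvert --type writecache --cachevol WRITECACHE_LV ${vg}/${origin}"
}

writecache_cmds VG notazpool /dev/nvme0n1p1 120G
```

Re-running the same fio job afterwards would give a like-for-like comparison against the dm-cache numbers above.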

Bcache

So, you’d rather use bcache?
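
A minimal sketch of what that could look like, assuming the two partitions are blank (not already claimed by the LVM setup above) and that the resulting device shows up as bcache0, which is an assumption about your system. It prints the commands instead of running them, and bcache_cmds is a made-up helper name.

```shell
# Print (not run) the commands for a basic bcache setup with the QLC
# partition as the backing device and the SLC partition as the cache.
bcache_cmds() {
    backing=$1   # e.g. /dev/nvme0n1p2 (QLC)
    cache=$2     # e.g. /dev/nvme0n1p1 (SLC)

    # Formatting both in one invocation attaches the cache automatically.
    echo "make-bcache -B ${backing} -C ${cache}"
    # bcache starts out in writethrough; switch to writeback via sysfs.
    echo "echo writeback > /sys/block/bcache0/bcache/cache_mode"
    # Confirm the active mode (shown in square brackets).
    echo "cat /sys/block/bcache0/bcache/cache_mode"
}

bcache_cmds /dev/nvme0n1p2 /dev/nvme0n1p1
```

Once that is done, /dev/bcache0 can be formatted and mounted like any other block device.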



Thank you very much, I have been looking for tiered storage under Linux ever since I left Windows. This will be a first step.
I can maybe even use a ramdisk for some read cache now that I don’t have a use for my 32GB of RAM.

It’s strange that the SLC partition is fixed. @wendell do you know why it is done like that? Is it for ease of implementation, for performance optimization, or because the SLC cells are somehow different/optimized?
I was thinking that, from a wear-leveling PoV, it would be more beneficial to have the SLC cache dynamically allocated.
Do you have info on what the PE rating of the SLC cells is?

The drive has some amount of SLC nand and some amount of QLC nand on it. @idimitro, what do you mean strange that it’s fixed - how would it be variable?

I think he/she mixed up the concept of physical cells like on this drive with the variable modes on some SSDs where a variable number of multi level cells are going to be used in SLC cache mode to increase performance.

@anon89476829 described my assumption below. I was assuming everything is software magic but it appears there is some real HW behind the claims.
It would be good if they published the specifications of these SLC cells, mainly PE cycles or guaranteed data written, that kind of stuff.

This.

The main answer is that the 128GB of “Static” (actually Pseudo) SLC is rated for 30000 PE cycles (this spec comes from the NAND manufacturer, Micron). It is QLC that has been converted into SLC. Unlike the SLC in most QLC devices, which is constantly being shifted in and out of the SLC state and therefore remains at the low QLC PE cycle rating, our Static SLC does not move. It only ever gets programmed to either a 0 or a 1, uses the entire voltage range, and is thus much easier to read.


Huh, I wonder if there might be a firmware hack to run an entire TLC/QLC device in SLC mode…


Is it possible to use the SLC as a cache vdev in a ZFS pool, and the QLC as part of a storage vdev? (I hope that is a coherent question, as I am only beginning to learn about ZFS.)

Yep, sure can


I’m setting up my file system now. I just decided to drop some cash on a pure 32GB Optane M.2, but I plan to use it for a SLOG? I think I already have an SSD cache on a Samsung DTC 833 for my ZFS. Is this a good idea? I’m hoping it will be more failure-resistant with Optane’s unique cell structure and usage.

@wendell So I just got my 1.6TB drive in the mail. I formatted it using fdisk and the sectors you provided, and the speed of transfers makes me think it’s operating as expected. Is there a way to verify the correct sectors?


I have my Enmotus drive working on my Ubuntu 20.10 system with the instructions provided above. Thank you! First time in a while using LVM commands and it worked fine.

Stupid question: How could I boot / install with this drive? I just played around for a few hours trying to clone my current boot drive to this and then installing a fresh Ubuntu 20.10, but it seems that LVM support isn’t great in the installer. (I swear it used to be better, but maybe my memory is slipping) I’m curious if anyone has any suggestions about the best path to take.

I don’t need to boot off this drive, I’d just like to. I am even willing to try another distro if it has better installer support for an LVM setup like this.


You need a boot partition and an EFI partition, and it should work. Some BIOSes have trouble reading an EFI partition past 8GB on the drive, stupidly, but most don’t care. So you could put the small partitions at the end, rather than the start, of the drive.


@wendell Are the blocks on the drives the same for the retail versions? What would I need to adjust to have two cache tasks going (the FuzeDrive for the FuzeDrive and my other 280GB Optane drive for the HDDs I have)? (Heck, what about in tandem with each other, how does that stack up?)

P.S. For giggles, have you tried running a small Windows VM (probably 2c4t) with the FuzeDrive drivers and just sharing the drive over the ‘network’ to see if that gives better performance? (Secondary to that, how do you get the virtual NIC speed up to something that would support it without bottlenecking, like 10gig?)

Sorry for the paragraph I just wanted to pick your brain a bit… although it looks more like surgery


Bit of an old thread, but I’m curious… is it possible to use BOTH dm-cache and dm-writecache? For instance, I just picked up a 4TB HDD, and right now I’m toying around with using the SLC region as a dm-writecache, and the QLC region as a dm-cache (for reads?).

Not sure if anyone has experience or knowledge in this area. Would love to hear everyone’s thoughts.


Any word on this?
How were the LBA numbers discovered, is there any way to verify them?

They did match on the retail units I got.

Speed test is the easiest way probably