[SOLVED!] TrueNAS Scale in Proxmox write slowdown: Help identifying the bottleneck

Hello dear forum members,

I’ve got a homelab server that’s also my NAS, and I’m moving from a simple OpenMediaVault configuration to a TrueNAS configuration in a second box.

These are the current machines:

Box 1 - Proxmox 7 + OpenMediaVault + 2 x 1TB mirror SSD in a Xeon 2673v3 with 1 Gigabit ethernet port;
Box 2 - Proxmox 8 + TrueNAS Scale + “LSI” SAS2008 + 6x2TB SSD in a Ryzen 5900x, also with 1 Gigabit ethernet port - This machine also has a secondary 4TB spinning rust drive meant to store disposable media, such as temporary movies or recordings;

Both boxes have their own boot drives, run a small set of services such as PiHole, and have another 1TB SSD to store VMs for small projects, but that’s not the relevant part of the question; I’m just adding it here for context.

The focus of the problem is on the new box, with TrueNAS Scale. It’s configured right now with 2 different datasets:

SSD Dataset = 1 Pool of 6 SSDs in RaidZ2;
Spinning Rust Dataset = 1 Pool of a Single 4TB Seagate HD;

When transferring large sequential data to both datasets, I see a network throughput pattern like this on the SSD dataset:

(screenshot: network throughput graph during the transfer)

And the transfers sometimes reach 115MB/s, and sometimes dip to 22MB/s. I understand that the 115MB/s is the Gigabit Ethernet ceiling, but I don’t get the dips.

When doing the same transfer to the Spinning Rust Dataset, I get a sustained 115MB/s write without any dips. We’re talking about large transfers here, like 120GB of data at once.

The HBA used in the new system is a generic LSI SAS2008 card from AliExpress, and it’s handed to the virtualized TrueNAS box via PCIe pass-through in Proxmox, as a Raw Device.
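
(For reference, the passthrough itself boils down to a hostpci entry in the VM config on the Proxmox host; the VM ID and PCI address below are placeholders, not necessarily what my box uses:)

qm set 100 -hostpci0 01:00.0
# equivalent to a "hostpci0: 01:00.0" line in /etc/pve/qemu-server/100.conf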

I’d like some help trying to pinpoint why the SSD Dataset is having these dips.
I can destroy and recreate the pool as many times as needed for any tests. All the SSDs went through 4 days of testing with badblocks and passed with flying colors.

I’ve got a few possibilities, but I don’t know how to test them appropriately to pinpoint the exact location:

1 - Terrible quality SSDs, that cannot sustain write for a long period;
2 - Overheating HBA card (I’ve placed a 40mm fan on top of that heatsink);
3 - Bad pool/vdev configuration (it’s my first one, so I might be doing something wrong);
4 - ??? (any suggestions are well appreciated)

The 4TB HD is also hooked to the same HBA card where the SSDs are, so I’m not sure if the HBA is the problem here.

I’m OK with not having a blazing fast NAS, as long as it’s reliable long-term and I can eventually upgrade to better SSDs.

Acquiring hardware in my case is somewhat difficult due to market conditions (things in Brazil are too expensive or hard to find), so these boxes were built mostly with AliExpress components, which might not be the best quality possible. A single 12TB HDD here costs more than all 6 SSDs combined, and would be a single point of failure.

So, how can I isolate the offending component, or even figure out whether I have more than one offending component in there?

Samba or NFS for the File Shares?

It’d help to test performance in the VM itself, without a network file sharing protocol involved, using fio.

My first guess would be that the dips are caused by overhead from the parity calculations on the raidz2 array, since you’re comparing it against a single-drive vdev.

Samba for the file shares in both cases. I should have specified that, sorry. :wink:

How do I test that VM performance? By running iperf in the TrueNAS shell?

I’ve assigned 4 cores to the TrueNAS machine, and during the transfers I see all the cores at almost zero percent the whole time, but that’s from the TrueNAS dashboard. I could use something more accurate such as htop, but I have no idea which processes to filter for there.
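
I guess I could also watch the pool itself during a transfer from the TrueNAS shell, something like this (ssd_pool being my pool name), to see the per-disk activity:

zpool iostat -v ssd_pool 1
# -v breaks the numbers down per vdev and disk, and the trailing 1 refreshes every second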

To clarify, I mean to test the filesystem / ZFS Dataset inside of the VM, locally.
That way it’s easier to tell whether it’s some underlying issue of your Hardware / ZFS Configuration or because of the filesharing.

fio is built into TrueNAS Scale, so I recommend you use that, but do some searching first on how to avoid getting false results with ZFS (caching can easily skew the numbers).
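
As a rough sketch of what I mean (tank/fio-test is just a placeholder dataset, and primarycache is disabled only on that throwaway dataset so the ARC can’t serve the test from RAM):

zfs create tank/fio-test
zfs set primarycache=none tank/fio-test
fio --directory=/mnt/tank/fio-test --name=seq-write --rw=write --bs=1M --size=8G --numjobs=1 --end_fsync=1
zfs destroy tank/fio-test    # clean up afterwards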

You might check smartctl -a on the SSDs. If they are part of the pipeline here, I wonder if they aren’t getting trimmed and the speed drop is somehow related to that.

Ok guys, I’ll check how to run fio in the box, along with smartctl -a, and post back the results. This will take a while, though, as I’m OOO right now. Thanks a lot!

Dummy question, though… How do I monitor whether they are getting trimmed during the transfer? I’ve checked the output of one call of smartctl -a, and this is what I get:

root@truenas[~]# smartctl -a /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.142+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     XrayDisk 2TB SSD
Serial Number:    AA000000000000000518
Firmware Version: V1021A0
User Capacity:    2,048,408,248,320 bytes [2.04 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available      <----  Only mention of TRIM in the output  ***
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Jul  7 11:42:12 2023 -03
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
                                        entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  10) minutes.
SCT capabilities:              (0x0001) SCT Status supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   100   100   050    Old_age   Always       -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   050    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   050    Old_age   Always       -       526
 12 Power_Cycle_Count       0x0032   100   100   050    Old_age   Always       -       7
160 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       0
161 Unknown_Attribute       0x0033   100   100   050    Pre-fail  Always       -       100
163 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       17
164 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       10532
165 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       37
166 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       5
167 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       12
168 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       3808
169 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       100
175 Program_Fail_Count_Chip 0x0032   100   100   050    Old_age   Always       -       0
176 Erase_Fail_Count_Chip   0x0032   100   100   050    Old_age   Always       -       0
177 Wear_Leveling_Count     0x0032   100   100   050    Old_age   Always       -       0
178 Used_Rsvd_Blk_Cnt_Chip  0x0032   100   100   050    Old_age   Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   050    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   050    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   050    Old_age   Always       -       6
194 Temperature_Celsius     0x0022   100   100   050    Old_age   Always       -       40
195 Hardware_ECC_Recovered  0x0032   100   100   050    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   050    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   050    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0032   100   100   050    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   050    Old_age   Always       -       0
232 Available_Reservd_Space 0x0032   100   100   050    Old_age   Always       -       100
241 Total_LBAs_Written      0x0030   100   100   050    Old_age   Offline      -       191732
242 Total_LBAs_Read         0x0030   100   100   050    Old_age   Offline      -       136299
245 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       280656

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       516         -
# 2  Short offline       Completed without error       00%       490         -
# 3  Short offline       Completed without error       00%       465         -
# 4  Short offline       Completed without error       00%       439         -
# 5  Short offline       Completed without error       00%       414         -

Selective Self-tests/Logging not supported

zpool status output?

check trim status?

sudo zpool status -t yourpool

Thank you sir! Sorry about the newbie questions, but that’s precisely what I needed. Baby steps, baby steps and learning as I go. :wink:

Ok, I think I have some data to share now.

I ran smartctl -a on all the disks, with the results in the following file (593 lines, wall of text): smartctl_output.txt (33.2 KB)

No errors were reported by smartctl, and all of them report the TRIM command as available:

root@truenas[~]# cat /mnt/hdd_4tb/media_rust/smartctl_output.txt| grep TRIM
TRIM Command:     Available
TRIM Command:     Available
TRIM Command:     Available
TRIM Command:     Available
TRIM Command:     Available
TRIM Command:     Available

Then I checked a few websites on how to run fio to test the pools, and after a few complex tests, I decided to go with the simplest test I could:

  1. Create a specific dataset for testing: zfs create ssd_pool/fio
  2. Remove the cache on that dataset: zfs set primarycache=none ssd_pool/fio
  3. Run a simple write-only test with a small load: fio --directory=/mnt/ssd_pool/fio --name=sequential-write --rw=write --bs=1M --size=8G --numjobs=4

And I did that with the following scenarios:

| Scenario | Speed | Result | Video |
|---|---|---|---|
| RaidZ2, 6 disks | 20MB/s | 13 minutes to finish | Write only - RaidZ2 pool with 6 SSDs - YouTube |
| Mirrors [sdb+sde], [sdf+sdg], [sdh+sdi] | 8MB/s | Cancelled. Each mirror was taking too long, and the h+i mirror was the least bad; it started at 400MB/s and dropped to 8MB/s by the end | 3 mirrors tests - YouTube |
| Single HDD 4TB | 80MB/s ~ 100MB/s | Finished the test in 4 minutes | HDD 4TB Baseline test - 4 minutes - YouTube |
| Single SSD 960GB (“brand new”) | 250MB/s ~ 400MB/s | Finished the test in 1m40s | Single 960GB SSD - New drive, not used in the pool before - YouTube |
| Single SSD 2TB (extracted from the pool) | 8MB/s | Cancelled after a few seconds, it was already too slow | Single 2TB SSD - One of the members of the RaidZ2 pool - YouTube |

The mirrors were tested one by one, and only the [h-i] mirror was reasonable for a few seconds, then it dropped to 8MB/s like the other 2 mirrors.

Regarding TRIM: during the tests, zpool status -t showed the (trim unsupported) message for all pools. I found this weird, as smartctl shows that these drives support the TRIM command. But they’re all behind the LSI 2008 controller, so there might be something fishy there.

Running hdparm in all drives results in the same message:

root@truenas[~]# hdparm -I /dev/sd[b,e-i] | grep TRIM
           *    Data Set Management TRIM supported (limit 8 blocks)
           *    Data Set Management TRIM supported (limit 8 blocks)
           *    Data Set Management TRIM supported (limit 8 blocks)
           *    Data Set Management TRIM supported (limit 8 blocks)
           *    Data Set Management TRIM supported (limit 8 blocks)
           *    Data Set Management TRIM supported (limit 8 blocks)

But explicitly calling zpool trim reports that trim is not supported:

root@truenas[~]# zpool trim ssd_sdb
cannot trim: no devices in pool support trim operations

I’ve found some complaints about other SSDs not supporting TRIM on ZFS, like this one on GitHub, and it seems that my SSDs do not have the Deterministic read ZEROs after TRIM capability. This might be a culprit.
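
I can also check what the Linux kernel itself thinks about discard on these devices; if the HBA driver doesn’t pass DSM/TRIM through, I’d expect zeros here:

lsblk --discard /dev/sdb
# DISC-GRAN and DISC-MAX come out as 0 when the kernel won't issue discards to the device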

That being said, I can’t understand why running the test with a brand new 960GB SSD of the same (shady) brand gives 200+MB/s and the 2TB ones don’t. Both are behind the same (also shady) LSI 2008 HBA.

The next test for me is to remove one of the 2TB SSDs from the HBA, move it to a native SATA port on the server and redo the test, passing it to TrueNAS as a virtual IO device.
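
(If I understand the Proxmox side right, that should be roughly a disk passthrough line like the one below; the VM ID, SCSI slot and by-id path are placeholders for my setup:)

qm set 101 -scsi2 /dev/disk/by-id/ata-<drive_model>_<serial>
# using the /dev/disk/by-id path avoids problems if /dev/sdX names change between reboots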

Any other test suggestions are more than welcome, guys. Have a great weekend!

Have a look for information on secure erase and blkdiscard for the SSDs. You might just need to erase them back to their default state.
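
Something along these lines, as a rough sketch; sdX is a placeholder and both approaches wipe the drive completely, so only do it on a disk pulled out of the pool:

blkdiscard /dev/sdX                                     # ask the drive to discard every block (needs working discard on that controller)
hdparm -I /dev/sdX | grep -i frozen                     # for ATA secure erase the drive must report "not frozen"
hdparm --user-master u --security-set-pass p /dev/sdX   # set a temporary password, then erase with it
hdparm --user-master u --security-erase p /dev/sdX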

SSDs have resource limits. Often there is a DRAM cache, then an SLC cache, then TLC or QLC bulk flash. Cheaper drives have less (or none) of some of these layers, and tend to be QLC. Different layers have different speeds, and you don’t notice until you push past a limit, at which point the drive slows down while it catches up and flushes data. Some things for you to look up.
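
If you want to see where that limit sits on one of these drives, a rough way (against a spare drive outside the pool, since it overwrites everything on it) is to write well past the likely cache size and log bandwidth over time; 200G is just a guess at "well past":

fio --filename=/dev/sdX --name=sustained-write --rw=write --bs=1M --size=200G --direct=1 --ioengine=libaio --iodepth=8 --write_bw_log=sdX --log_avg_msec=1000
# the resulting sdX_bw.*.log shows throughput per second, and the cliff is where the cache runs out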

@samarium , if you ever come to Brazil, barbecue + beer is on me.

I followed the step-by-step on secure erase from here (the Arch Linux wiki), rebuilt the RaidZ2 pool, and tested again with fio.

1400MB/s write speed on average, and the test that took 12 minutes before now took about 25 seconds. Video here.

Your suggestion made total sense when I read it: before creating this pool I spent a couple of days running badblocks on all 6 SSDs, which is write intensive, and I never even knew that this “secure erase” or “erase to default state” existed. I had heard about SLC cache exhaustion and TRIM problems before, but never about secure erasing.

These are (very) cheap SSD drives from Aliexpress, without any DRAM cache.

I’ll leave it like this and check how long it takes for the pool to degrade again.

Once again, thank you very much!

Now I have a trim problem to investigate. This lack of trim support may make this pool degrade faster over time.

Yes, basically once you’ve written 2TB to each drive in the pool it will slow down again.

From the drive firmware’s perspective the drive is always full, since blocks are never marked for discard.
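
If the drives ever end up behind a controller that does pass discard through, ZFS can take care of it on its own, assuming the pool keeps the ssd_pool name:

zpool set autotrim=on ssd_pool    # continuous background trim as blocks are freed
zpool trim ssd_pool               # or a one-off manual trim
zpool status -t ssd_pool          # shows the per-disk trim state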

That makes sense. Now the question is: is the HBA the problem with the TRIM command, or is it the drives’ fault for not supporting the Deterministic read ZEROs after TRIM capability?

I can try to find a different HBA with TRIM support if that’s the case, but swapping the drives would be more difficult.

The HBA may have a firmware update that fixes this and lets TRIM pass through.

It looks like a very generic Broadcom SAS2008 PCI-Express Fusion-MPT SAS-2 (as per the Proxmox description).

It’s from a brand called Avago Technologies, which I’d never heard of. I’ve got 2 of them (I bought 2 at once, so that if one died I’d have a backup), so I can use the second one to test any new firmware. This whole HBA + NAS thing is new to me.

I’ll search around for newer IT-mode MPT firmware for this card, and if I find any, I’ll test it on the spare.

I’m not sure that badblocks is relevant to SSDs. I would just use blkdiscard / secure erase, as that should return the SSD to its optimal state.

I have an old OEM card with an LSI SAS 2008 chip; its latest IT firmware is the same as yours, and it’s the last one I know about. If you find anything later, please ping me. The ServeTheHome forums often have relevant stuff.

Will do, sir. In fact, I’ve downloaded the latest firmware available from Broadcom following serverbuilds.net, which in theory is also the latest one, but I’ll check the ServeTheHome forums as well. Thank you so much!
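
(A quick way I found to double-check which firmware the card is currently running, from inside the TrueNAS VM where the mpt driver binds to it, seems to be:)

dmesg | grep -i fwversion
# the mpt2sas/mpt3sas driver prints the controller FWVersion and BiosVersion when it loads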

There are Linux versions of those commands, I think, which is easier than booting into a DOS/UEFI shell; not sure where to find them, though.

I tend to use lsiutil (https://github.com/exactassembly/meta-xa-stm/tree/master/recipes-support/lsiutil/files), and if there are problems then lsirec (GitHub - marcan/lsirec: LSI SAS2008/SAS2108 low-level recovery tool for Linux) is very useful; it allows things like halting the card, loading firmware into the card’s RAM and rebooting the card, all while multiuser in Linux. The only hassle is that it requires disabling the IOMMU/VT-x for some tasks. Both tools compile under Proxmox and Ubuntu.
