Benchmarking disks in Linux (with FIO?)

Hey folks,

I have a set of 4x SN850X drives that I would like to benchmark in various configurations to see the tradeoffs thereof. I would like to format these in a repeatable way, so I’ve booted into a minimal NixOS ISO and, using disko, I can reliably format these disks and arrange them in the configurations I’d like to test. That’s the setup for the problem I’m facing:

I have been messing around with fio to test these configurations, but I can’t seem to get comparable data written out to a file. There are a lot of options for job output, and I’m really not sure which ones to use. Ideally I would end up with a file for each job that has a report of IOPS & bandwidth for each process, plus a total summary of the same for that job. I am generating the fio job file with some dumb code:

from textwrap import dedent

# stolen from somewhere, basing runs on this command line
# fio --name=WriteAndRead --size=16g --bs={4KB,16KB,1MB} --rw={read, write, randwrite, randread} --ioengine=libaio --sync={0,1} --iodepth=32 --numjobs=1 --direct=1 --end_fsync=1 --gtod_reduce=1 --time_based --runtime=60

bs_params = ["4KB", "16KB", "1MB"]
rw_params = ["read", "write", "randwrite", "randread", "randrw"]
sync_params = [0, 1]

global_params = """
[global]
name=tests
size=32g
ioengine=libaio
iodepth=32
numjobs=8
direct=1
end_fsync=1
gtod_reduce=1
time_based
runtime=120
group_reporting
"""


import os

# make sure the bw-log directory exists; fio expects it to be there
# when it writes out the write_bw_log files
os.makedirs("logs", exist_ok=True)

with open("fio.conf", "w") as fio_conf:
  fio_conf.write(global_params)
  for bs in bs_params:
    for rw in rw_params:
      for sync in sync_params:
        job_name = f"{bs}_{rw}_sync_{sync}"
        job_config = f"""
        [{job_name}]
        stonewall=1
        write_bw_log=logs/{job_name}
        bs={bs}
        rw={rw}
        sync={sync}
        """
        fio_conf.write(dedent(job_config))

When I run fio against this job file, I end up with a bunch of empty log files, which isn’t particularly helpful to me.

Does anyone have any advice about how I can go about getting the output I’m looking for from fio for these tests?

Thanks!

Edit: Also if any of these parameters are stupid, or I’m missing some nice-to-have parameters for fio, please let me know!

I’m doing the same on 2x optane p5801x, 4x crucial p700, 2x samsung 990

Here’s my code:

#!/bin/bash

# Base test parameters
RUNTIME=30
SIZE=16g
TEST_DATE=$(date +%Y%m%d_%H%M%S)

# Arrays for test parameters
block_sizes=("4k" "64k" "1m")
io_patterns=("read" "write" "randread" "randwrite")
io_depths=("1" "8" "16" "32")

# Create test directory for this run
TEST_DIR="fio_test_${TEST_DATE}"
mkdir -p "$TEST_DIR"

# Calculate total number of tests
TOTAL_TESTS=$((${#block_sizes[@]} * ${#io_patterns[@]} * ${#io_depths[@]}))
CURRENT_TEST=0

# Run each permutation
for bs in "${block_sizes[@]}"; do
    for pattern in "${io_patterns[@]}"; do
        for depth in "${io_depths[@]}"; do
            # Update progress counter
            ((CURRENT_TEST++))
            
            # Create descriptive names
            TEST_NAME="${pattern}_${bs}_q${depth}"
            RESULT_FILE="${TEST_DIR}/${TEST_NAME}.txt"
            DATA_FILE="${TEST_DIR}/${TEST_NAME}.data"

            # Show progress
            echo "Running benchmark [${CURRENT_TEST}/${TOTAL_TESTS}]: ${TEST_NAME} ..."

            # Construct and save FIO command
            FIO_CMD="fio --name=${TEST_NAME} \
--size=${SIZE} \
--bs=${bs} \
--rw=${pattern} \
--ioengine=libaio \
--iodepth=${depth} \
--numjobs=1 \
--direct=1 \
--time_based \
--runtime=${RUNTIME} \
--filename=${DATA_FILE} \
--randrepeat=0 \
--norandommap \
--refill_buffers"

            # Save test configuration to result file
            {
                echo "=== Test Configuration ==="
                echo "Date: $(date)"
                echo "Block Size: ${bs}"
                echo "I/O Pattern: ${pattern}"
                echo "I/O Depth: ${depth}"
                echo "Direct I/O: enabled"
                echo "Runtime: ${RUNTIME} seconds"
                echo -e "\n=== FIO Command ===\n${FIO_CMD}\n"
                echo -e "\n=== Test Results ===\n"
            } > "${RESULT_FILE}"

            # Run FIO command and append results
            eval "$FIO_CMD" >> "${RESULT_FILE}"

            # Remove the data file after test completes
            rm -f "${DATA_FILE}"

            echo "Completed: ${TEST_NAME}"
            echo "----------------------------------------"
        done
    done
done

echo "All tests completed. Results are in ${TEST_DIR}/"

Save it, chmod +x it, run it, and wait; you’ll have 48 in-depth reports in a subfolder.
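Once a run finishes, a quick way to eyeball all the reports at once is to grep the summary lines out of each result file. A minimal sketch, assuming the default human-readable fio output and the fio_test_* directory layout the script creates:

```shell
# Pull the IOPS/bandwidth summary lines out of every result file.
for f in fio_test_*/*.txt; do
    [ -e "$f" ] || continue        # skip if the glob matched nothing
    echo "== $f"
    grep -E 'IOPS=' "$f" || true   # a file with no summary isn't fatal
done
```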

maybe this becomes a standard we can all contribute results to

The IO engine varies for other OSes and non-regular storage classes. And 16G is too small: you can end up benchmarking cache rather than the disk. I use a test size of twice the RAM of the machine hosting the storage in question, just to make sure.
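That 2x-RAM rule of thumb can be scripted instead of guessed; a minimal sketch, assuming a Linux /proc/meminfo:

```shell
# Derive a fio --size value of roughly twice physical RAM, rounded up
# to whole gibibytes. /proc/meminfo reports MemTotal in kB.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
size_gb=$(( (mem_kb * 2 + 1048575) / 1048576 ))   # kB -> GiB, rounded up
echo "--size=${size_gb}g"
```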

I also miss sync/async/fsync options, which make a rather huge difference in performance; async is easy mode.
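For reference, the fio knobs involved are --sync=1 (open files O_SYNC, so each write must reach stable storage before it completes) and --fsync=N (issue an fsync() every N writes). A sketch of adding that as one more dimension to the test matrix; the loop wiring here is illustrative, not taken from the script above:

```shell
# Hypothetical extra matrix dimension: run every test once async and
# once with O_SYNC writes, which is usually a large performance gap.
for sync_flag in 0 1; do
    # in a real script these args would be appended to the fio command
    extra_args="--sync=${sync_flag}"
    echo "fio ... ${extra_args}"
done
```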

open to improvement suggestions, edit and repost

Cool, I’ll have to try this out. I did also run across fio-plot in my searching but haven’t tried it out yet.

I was running ~1x RAM but I will try 2x as well. That said, in my initial tests I definitely saw slower speeds than I would expect cache to produce.

It’s just a precaution on my side. With --direct=1 you can bypass the page cache, which is usually plenty for local disks on a simple FS. But I use fio for Ceph and ZFS as well, and there are caches everywhere and sometimes caches within caches. And then there are drives with write caches on them.
Raw disk performance is hard to measure because we have so much fancy stuff today.

Best to just brute-force through everything with large test sizes (mostly for reads) and syncing (for writes).

How many CPU cores are you expecting to fill with numjobs=8? I tend to assume numjobs=1 is OK for filling the disk bandwidth / array checks.

Cache-wise, there’s blockdev --flushbufs $device and hdparm -F $device for instructing the devices to forget what they’ve cached. Device makers are sneaky, so I wouldn’t trust this; instead, add a test-preparation phase that fills the cache with data unrelated to your test – if fio is using a file on a mounted filesystem, put a second file next to it filled from /dev/zero. (Stack Overflow citation for disk cache purging.)
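That preparation phase might look something like this; the 64 MiB filler size is an assumption to scale for your setup, and the device commands need root, so they’re skipped here unless DEVICE is set:

```shell
# Sketch of a cache-neutralizing prep step before a timed fio run.
FILLER="cache_filler.bin"

# Ask the kernel and the drive to drop what they have cached
# (both commands need root; only attempted if DEVICE is set).
if [ -n "$DEVICE" ]; then
    blockdev --flushbufs "$DEVICE"   # flush the kernel's buffer cache
    hdparm -F "$DEVICE"              # ask the drive to flush its cache
fi

# Belt and braces: stream unrelated data through the page cache so that
# whatever survives the flush above is not your test data.
dd if=/dev/zero of="$FILLER" bs=1M count=64 conv=fsync 2>/dev/null
```

Delete the filler file afterwards, and size it well past the drive’s own cache before the timed run starts.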

There’s also --io_size on top of --size – in that the amount of I/O can be set independently of the size of the storage under test. If you have fast spinning rust at the rim of a disk and slower bytes at the spindle, you can focus I/O on each portion of the device.
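As an illustration (the device path and sizes here are made up): this lays the test region across the first 100g of a device but only pushes 16g of I/O through it, which is how you’d target the outer tracks of a spinning disk:

```shell
# --size stakes out the region under test; --io_size caps how much I/O
# is actually issued inside that region. The command is only echoed
# here; /dev/sdX is a placeholder.
fio_cmd="fio --name=outer_tracks --filename=/dev/sdX \
  --size=100g --io_size=16g --rw=read --bs=1m --direct=1"
echo "$fio_cmd"
```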

Have you tried --ioengine=xnvme when testing NVMe devices directly?

K3n.

Indeed, ZFS layouts are one of the things I wanted to test with this. I wanted to see the perf difference between striped & raidz1, vs just 1 drive with ext4.

I have a 5800X in this test system and I expect the actual production workload to be ~6-8 main processes (though I honestly do not know how those processes will fork or use threads). I basically just wanted to simulate some simultaneous, separated workloads using the same storage pool, so there was a reason to have multiple jobs here - though it may or may not be actually valid.

I personally am not concerned with spinning disk for this particular set of tests though I am sure it will be handy to be able to do that in the future (I do have a reasonable-sized array of spinning rust passed through to a TrueNAS VM - off topic for this discussion, though).

I have not, though I will make a note to do that for testing today. Is this useful for testing drives with a filesystem on them or is this just for bare drives? I’m mostly interested in filesystem performance, as I know these particular drives are pretty fast by themselves.

Since you’re already playing around, it would be interesting to see what performance you’d get with FreeBSD 14.1 or 14.2 (FreeBSD 14.2 Release Process | The FreeBSD Project), if you have time.

If you run the script as-is on the p5800x, we can compare it to the p5801x; I ran it on that yesterday.

That’s an AMD 5800X CPU, not an Intel P5800X Optane.


The results will be impacted by the pool. If it is an empty pool, it will be fast (by ZFS NVMe performance standards), but it will be very different on an aged pool with deleted/modified data, higher capacity utilization, snapshots, etc. This is especially true for RAIDZ and dRAID.

This topic was automatically closed 273 days after the last reply. New replies are no longer allowed.