ZFS fine-tuning to keep HDDs spun down and delay txg sync operations

Hey everyone!

So my brain has cooked up an idea that I want to share with you, in the hope that we can brainstorm over it and work out whether it is actually feasible.

I am currently obsessed with building the most power-efficient server for my needs. I have a zpool containing two mirrored spinning disks, which are currently running 24/7. Because this is my home server, the zpool is mostly idle, as there are not many services that frequently use it.

Now to my idea.
I want the hard drives to stay spun down most of the time, to conserve power and to minimize noise. The drives should only spin up to “sync” the latest changes, for example once every hour, and then spin down again. I know about the disadvantages of spinning HDDs up and down all the time.

I was thinking of the following steps (a rough sketch of the commands is below):

  • Disable synchronous writes for the entire pool with sync=disabled
  • Raise vfs.zfs.txg.timeout to e.g. 3600
  • Attach a mirrored special vdev on SSDs to the zpool to hold all the metadata, speed up the pool's access times, and avoid spinning up the drives for every metadata-related operation
  • Attach a cache vdev (L2ARC) on an SSD to serve the most-used data from flash, again without the need to spin up the drives
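
Roughly, that would look something like this (just a sketch; pool and device names are placeholders, and I'm assuming the FreeBSD sysctl since that's the tunable named above):

# Placeholder pool name "tank" and placeholder device paths.
zfs set sync=disabled tank                           # treat all writes as asynchronous
sysctl vfs.zfs.txg.timeout=3600                      # allow up to an hour between txg syncs
zpool add tank special mirror /dev/ada2 /dev/ada3    # mirrored SSD special vdev for metadata
zpool add tank cache /dev/ada4                       # L2ARC on an SSD for frequently read data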

I am aware that disabling synchronous writes creates a serious risk of data loss on an unclean shutdown. But I basically never write any data to the zpool that is not replaceable, and I am willing to take that risk.

My understanding is that all the data that is about to be written to the zpool, accumulated over a period of 3600 seconds, is cached in RAM and only written to disk when those 3600 seconds expire or when the size of the RAM cache is exceeded.
Is this true? Would a mirrored log vdev help with catching those writes?

What do you think of my idea? Would this actually work or am I getting it totally wrong?

I would love to hear your input on this and discuss the topic further, because it really fascinates me.
I am looking forward to whatever thoughts any of you can provide.

Cheers!


This is only a quick read, so I may be misunderstanding, but the SLOG is never read from except on pool import after a crash; it does not behave the way people expect a write cache to.

And if you disable sync writes, I don’t expect it to be used at all.

A SLOG wouldn't help, but a metadata special device might, especially if you set special_small_blocks for the OS dataset such that any OS files that get read/written regularly (logs, for example) are stored on the special devices instead of the HDDs.

For example:

mypool
  root
  storage
    anime
    manga
    visualnovels

In this case special_small_blocks for mypool/root would be set equal to its recordsize, and to whatever makes sense for mypool/storage and its children. The HDDs should only be active when the pool is imported/exported and when your storage datasets are being used, so it sounds like they ought to be able to power down.
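
As a sketch, that could look like this (dataset names taken from the example above; the cutoff values are just illustrative):

# Route every block of the OS dataset to the special vdev by making the
# small-block cutoff equal to the recordsize; use a lower cutoff on the
# storage dataset so bulk data stays on the HDDs.
zfs set recordsize=128K mypool/root
zfs set special_small_blocks=128K mypool/root
zfs set special_small_blocks=16K mypool/storage   # inherited by anime/manga/visualnovels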
You should give it a try and report back. If it works I might do the same on my server.

Ok, now I'm getting curious: what if, instead of sync=disabled, we do the opposite?

  • A huge power-safe SLOG
  • sync=always
  • Huge txg timeouts and dirty data limits

Coalesce all writes in the ARC as “normal” (forever), while furiously wearing out a pair of NVMe drives?
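
Something like this, as a sketch (assuming Linux/OpenZFS; pool and device names are placeholders and the numbers are picked arbitrarily):

# Placeholder pool "tank" and NVMe device names.
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1      # big power-loss-protected SLOG mirror
zfs set sync=always tank                                 # push every write through the ZIL/SLOG
echo 3600 > /sys/module/zfs/parameters/zfs_txg_timeout   # let dirty data accumulate for up to an hour
echo 34359738368 > /sys/module/zfs/parameters/zfs_dirty_data_max   # allow 32 GiB of dirty data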


I expect there are mechanics at play here that would throw a wrench into this and that I don't typically pay attention to. I'm pretty sure there are other things that trigger flushing of the dirty data.

I think this will probably hurt performance as well; I'm not sure whether programs making normally asynchronous writes will have to wait for ZFS to send back a now much slower “yup, I got it”.

One thing you can do is set up a VM with ZFS, give it a handful of files as devices to build vdevs from, and test inside the VM, checking whether the backing files get written to in the expected timeframe.
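
A minimal sketch of such a test (file paths and pool name are hypothetical):

# Create a throwaway mirror pool backed by sparse files inside the VM.
truncate -s 4G /tmp/vdev1 /tmp/vdev2
zpool create testpool mirror /tmp/vdev1 /tmp/vdev2
zfs set sync=disabled testpool
echo 3600 > /sys/module/zfs/parameters/zfs_txg_timeout

# Write some data, then watch the backing files' mtimes; they should only
# change when a txg actually syncs.
dd if=/dev/urandom of=/testpool/testfile bs=1M count=100
watch -n 30 stat -c '%y %n' /tmp/vdev1 /tmp/vdev2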

This is a longer project; I started it last week but haven't found the time to really dig deep yet.
Edit: forget the “non-rotating” part, but merging reads and writes to reduce disk activity is possible.

https://openzfs.org/wiki/ZFS_on_high_latency_devices

Post-init script to set the values:

/root/postinit.sh                                                                                                                                                                                
#!/bin/sh

PATH="/bin:/sbin:/usr/bin:/usr/sbin:${PATH}"
export PATH

# Cap the ARC at 65% of total RAM.
ARC_PCT="65"
ARC_BYTES=$(grep '^MemTotal' /proc/meminfo | awk -v pct=${ARC_PCT} '{printf "%d", $2 * 1024 * (pct / 100.0)}')
echo ${ARC_BYTES} > /sys/module/zfs/parameters/zfs_arc_max

# Tell the ARC to keep at least 8 GiB of system memory free.
SYS_FREE_BYTES=$((8*1024*1024*1024))
echo ${SYS_FREE_BYTES} > /sys/module/zfs/parameters/zfs_arc_sys_free

# Allow up to 8 GiB of dirty (not yet synced) data.
ZDDM="8589934592"
echo ${ZDDM} > /sys/module/zfs/parameters/zfs_dirty_data_max

#ZDDMM="8589934592"
#echo ${ZDDMM} > /sys/module/zfs/parameters/zfs_dirty_data_max_max

# Force a txg sync at least every 10 seconds.
ZTT="10"
echo ${ZTT} > /sys/module/zfs/parameters/zfs_txg_timeout

# Only start throttling writers once dirty data hits 70% of the max.
ZDMDP="70"
echo ${ZDMDP} > /sys/module/zfs/parameters/zfs_delay_min_dirty_percent

ZFS Tunables

root@truenas[~]# arc_summary                                                

------------------------------------------------------------------------
ZFS Subsystem Report                            Fri Jul 28 03:35:26 2023
Linux 5.15.107+truenas                                          2.1.11-1
Machine: truenas (x86_64)                                       2.1.11-1

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                    29.6 %   48.3 GiB
        Target size (adaptive):                        30.0 %   49.0 GiB
        Min size (hard limit):                          4.8 %    7.9 GiB
        Max size (high water):                           20:1  163.4 GiB
        Most Frequently Used (MFU) cache size:         33.5 %   16.1 GiB
        Most Recently Used (MRU) cache size:           66.5 %   32.1 GiB
        Metadata cache size (hard limit):              75.0 %  122.6 GiB
        Metadata cache size (current):                  0.3 %  360.4 MiB
        Dnode cache size (hard limit):                 10.0 %   12.3 GiB
        Dnode cache size (current):                     0.4 %   44.0 MiB

ARC hash breakdown:
        Elements max:                                             124.4k
        Elements current:                              95.8 %     119.2k
        Collisions:                                                 2.7k
        Chain max:                                                     2
        Chains:                                                      205

ARC misc:
        Deleted:                                                     185
        Mutex misses:                                                  0
        Eviction skips:                                              182
        Eviction skips due to L2 writes:                               0
        L2 cached evictions:                                     0 Bytes
        L2 eligible evictions:                                   1.4 MiB
        L2 eligible MFU evictions:                     11.6 %  161.0 KiB
        L2 eligible MRU evictions:                     88.4 %    1.2 MiB
        L2 ineligible evictions:                                 9.0 KiB

ARC total accesses (hits + misses):                                75.5M
        Cache hit ratio:                               99.8 %      75.4M
        Cache miss ratio:                               0.2 %     119.4k
        Actual hit ratio (MFU + MRU hits):             99.8 %      75.3M
        Data demand efficiency:                        99.9 %      10.2M
        Data prefetch efficiency:                       0.1 %      95.6k

Cache hits by cache type:
        Most frequently used (MFU):                    97.0 %      73.1M
        Most recently used (MRU):                       3.0 %       2.3M
        Most frequently used (MFU) ghost:               0.0 %          0
        Most recently used (MRU) ghost:                 0.0 %          0
        Anonymously used:                             < 0.1 %      16.9k

Cache hits by data type:
        Demand data:                                   13.6 %      10.2M
        Prefetch data:                                < 0.1 %         87
        Demand metadata:                               86.4 %      65.1M
        Prefetch metadata:                            < 0.1 %      17.0k

Cache misses by data type:
        Demand data:                                    9.2 %      10.9k
        Prefetch data:                                 80.0 %      95.5k
        Demand metadata:                                9.3 %      11.2k
        Prefetch metadata:                              1.5 %       1.8k

DMU prefetch efficiency:                                           15.7M
        Hit ratio:                                      8.4 %       1.3M
        Miss ratio:                                    91.6 %      14.4M

L2ARC not detected, skipping section

Solaris Porting Layer (SPL):
        spl_hostid                                                     0
        spl_hostid_path                                      /etc/hostid
        spl_kmem_alloc_max                                       8388608
        spl_kmem_alloc_warn                                        65536
        spl_kmem_cache_kmem_threads                                    4
        spl_kmem_cache_magazine_size                                   0
        spl_kmem_cache_max_size                                       32
        spl_kmem_cache_obj_per_slab                                    8
        spl_kmem_cache_reclaim                                         0
        spl_kmem_cache_slab_limit                                  16384
        spl_max_show_tasks                                           512
        spl_panic_halt                                                 1
        spl_schedule_hrtimeout_slack_us                                0
        spl_taskq_kick                                                 0
        spl_taskq_thread_bind                                          0
        spl_taskq_thread_dynamic                                       1
        spl_taskq_thread_priority                                      1
        spl_taskq_thread_sequential                                    4

Tunables:
        dbuf_cache_hiwater_pct                                        10
        dbuf_cache_lowater_pct                                        10
        dbuf_cache_max_bytes                        18446744073709551615
        dbuf_cache_shift                                               5
        dbuf_metadata_cache_max_bytes               18446744073709551615
        dbuf_metadata_cache_shift                                      6
        dmu_object_alloc_chunk_shift                                   7
        dmu_prefetch_max                                       134217728
        ignore_hole_birth                                              1
        l2arc_exclude_special                                          0
        l2arc_feed_again                                               1
        l2arc_feed_min_ms                                            200
        l2arc_feed_secs                                                1
        l2arc_headroom                                                 2
        l2arc_headroom_boost                                         200
        l2arc_meta_percent                                            33
        l2arc_mfuonly                                                  0
        l2arc_noprefetch                                               1
        l2arc_norw                                                     0
        l2arc_rebuild_blocks_min_l2size                       1073741824
        l2arc_rebuild_enabled                                          1
        l2arc_trim_ahead                                               0
        l2arc_write_boost                                        8388608
        l2arc_write_max                                          8388608
        metaslab_aliquot                                         1048576
        metaslab_bias_enabled                                          1
        metaslab_debug_load                                            0
        metaslab_debug_unload                                          0
        metaslab_df_max_search                                  16777216
        metaslab_df_use_largest_segment                                0
        metaslab_force_ganging                                  16777217
        metaslab_fragmentation_factor_enabled                          1
        metaslab_lba_weighting_enabled                                 1
        metaslab_preload_enabled                                       1
        metaslab_unload_delay                                         32
        metaslab_unload_delay_ms                                  600000
        send_holes_without_birth_time                                  1
        spa_asize_inflation                                           24
        spa_config_path                             /etc/zfs/zpool.cache
        spa_load_print_vdev_tree                                       0
        spa_load_verify_data                                           1
        spa_load_verify_metadata                                       1
        spa_load_verify_shift                                          4
        spa_slop_shift                                                 5
        vdev_file_logical_ashift                                       9
        vdev_file_physical_ashift                                      9
        vdev_removal_max_span                                      32768
        vdev_validate_skip                                             0
        zap_iterate_prefetch                                           1
        zfetch_array_rd_sz                                       1048576
        zfetch_max_distance                                     67108864
        zfetch_max_idistance                                    67108864
        zfetch_max_sec_reap                                            2
        zfetch_max_streams                                             8
        zfetch_min_distance                                      4194304
        zfetch_min_sec_reap                                            1
        zfs_abd_scatter_enabled                                        1
        zfs_abd_scatter_max_order                                     13
        zfs_abd_scatter_min_size                                    1536
        zfs_admin_snapshot                                             0
        zfs_allow_redacted_dataset_mount                               0
        zfs_arc_average_blocksize                                   8192
        zfs_arc_dnode_limit                                            0
        zfs_arc_dnode_limit_percent                                   10
        zfs_arc_dnode_reduce_percent                                  10
        zfs_arc_evict_batch_limit                                     10
        zfs_arc_eviction_pct                                         200
        zfs_arc_grow_retry                                             0
        zfs_arc_lotsfree_percent                                      10
        zfs_arc_max                                         175488568320
        zfs_arc_meta_adjust_restarts                                4096
        zfs_arc_meta_limit                                             0
        zfs_arc_meta_limit_percent                                    75
        zfs_arc_meta_min                                               0
        zfs_arc_meta_prune                                         10000
        zfs_arc_meta_strategy                                          1
        zfs_arc_min                                                    0
        zfs_arc_min_prefetch_ms                                        0
        zfs_arc_min_prescient_prefetch_ms                              0
        zfs_arc_p_dampener_disable                                     1
        zfs_arc_p_min_shift                                            0
        zfs_arc_pc_percent                                             0
        zfs_arc_prune_task_threads                                     1
        zfs_arc_shrink_shift                                           0
        zfs_arc_shrinker_limit                                     10000
        zfs_arc_sys_free                                      8589934592
        zfs_async_block_max_blocks                  18446744073709551615
        zfs_autoimport_disable                                         1
        zfs_btree_verify_intensity                                     0
        zfs_checksum_events_per_second                                20
        zfs_commit_timeout_pct                                         5
        zfs_compressed_arc_enabled                                     1
        zfs_condense_indirect_commit_entry_delay_ms                    0
        zfs_condense_indirect_obsolete_pct                            25
        zfs_condense_indirect_vdevs_enable                             1
        zfs_condense_max_obsolete_bytes                       1073741824
        zfs_condense_min_mapping_bytes                            131072
        zfs_dbgmsg_enable                                              1
        zfs_dbgmsg_maxsize                                       4194304
        zfs_dbuf_state_index                                           0
        zfs_ddt_data_is_special                                        1
        zfs_deadman_checktime_ms                                   60000
        zfs_deadman_enabled                                            1
        zfs_deadman_failmode                                        wait
        zfs_deadman_synctime_ms                                   600000
        zfs_deadman_ziotime_ms                                    300000
        zfs_dedup_prefetch                                             0
        zfs_default_bs                                                 9
        zfs_default_ibs                                               15
        zfs_delay_min_dirty_percent                                   70
        zfs_delay_scale                                           500000
        zfs_delete_blocks                                          20480
        zfs_dirty_data_max                                    8589934592
        zfs_dirty_data_max_max                                4294967296
        zfs_dirty_data_max_max_percent                                25
        zfs_dirty_data_max_percent                                    10
        zfs_dirty_data_sync_percent                                   20
        zfs_disable_ivset_guid_check                                   0
        zfs_dmu_offset_next_sync                                       1
        zfs_embedded_slog_min_ms                                      64
        zfs_expire_snapshot                                          300
        zfs_fallocate_reserve_percent                                110
        zfs_flags                                                      0
        zfs_free_bpobj_enabled                                         1
        zfs_free_leak_on_eio                                           0
        zfs_free_min_time_ms                                        1000
        zfs_history_output_max                                   1048576
        zfs_immediate_write_sz                                     32768
        zfs_initialize_chunk_size                                1048576
        zfs_initialize_value                        16045690984833335022
        zfs_keep_log_spacemaps_at_export                               0
        zfs_key_max_salt_uses                                  400000000
        zfs_livelist_condense_new_alloc                                0
        zfs_livelist_condense_sync_cancel                              0
        zfs_livelist_condense_sync_pause                               0
        zfs_livelist_condense_zthr_cancel                              0
        zfs_livelist_condense_zthr_pause                               0
        zfs_livelist_max_entries                                  500000
        zfs_livelist_min_percent_shared                               75
        zfs_lua_max_instrlimit                                 100000000
        zfs_lua_max_memlimit                                   104857600
        zfs_max_async_dedup_frees                                 100000
        zfs_max_log_walking                                            5
        zfs_max_logsm_summary_length                                  10
        zfs_max_missing_tvds                                           0
        zfs_max_nvlist_src_size                                        0
        zfs_max_recordsize                                       1048576
        zfs_metaslab_find_max_tries                                  100
        zfs_metaslab_fragmentation_threshold                          70
        zfs_metaslab_max_size_cache_sec                             3600
        zfs_metaslab_mem_limit                                        25
        zfs_metaslab_segment_weight_enabled                            1
        zfs_metaslab_switch_threshold                                  2
        zfs_metaslab_try_hard_before_gang                              0
        zfs_mg_fragmentation_threshold                                95
        zfs_mg_noalloc_threshold                                       0
        zfs_min_metaslabs_to_flush                                     1
        zfs_multihost_fail_intervals                                  10
        zfs_multihost_history                                          0
        zfs_multihost_import_intervals                                20
        zfs_multihost_interval                                      1000
        zfs_multilist_num_sublists                                     0
        zfs_no_scrub_io                                                0
        zfs_no_scrub_prefetch                                          0
        zfs_nocacheflush                                               0
        zfs_nopwrite_enabled                                           1
        zfs_object_mutex_size                                         64
        zfs_obsolete_min_time_ms                                     500
        zfs_override_estimate_recordsize                               0
        zfs_pd_bytes_max                                        52428800
        zfs_per_txg_dirty_frees_percent                               30
        zfs_prefetch_disable                                           0
        zfs_read_history                                               0
        zfs_read_history_hits                                          0
        zfs_rebuild_max_segment                                  1048576
        zfs_rebuild_scrub_enabled                                      1
        zfs_rebuild_vdev_limit                                  33554432
        zfs_reconstruct_indirect_combinations_max                   4096
        zfs_recover                                                    0
        zfs_recv_queue_ff                                             20
        zfs_recv_queue_length                                   16777216
        zfs_recv_write_batch_size                                1048576
        zfs_removal_ignore_errors                                      0
        zfs_removal_suspend_progress                                   0
        zfs_remove_max_segment                                  16777216
        zfs_resilver_disable_defer                                     0
        zfs_resilver_min_time_ms                                    3000
        zfs_scan_blkstats                                              0
        zfs_scan_checkpoint_intval                                  7200
        zfs_scan_fill_weight                                           3
        zfs_scan_ignore_errors                                         0
        zfs_scan_issue_strategy                                        0
        zfs_scan_legacy                                                0
        zfs_scan_max_ext_gap                                     2097152
        zfs_scan_mem_lim_fact                                         20
        zfs_scan_mem_lim_soft_fact                                    20
        zfs_scan_strict_mem_lim                                        0
        zfs_scan_suspend_progress                                      0
        zfs_scan_vdev_limit                                      4194304
        zfs_scrub_min_time_ms                                       1000
        zfs_send_corrupt_data                                          0
        zfs_send_no_prefetch_queue_ff                                 20
        zfs_send_no_prefetch_queue_length                        1048576
        zfs_send_queue_ff                                             20
        zfs_send_queue_length                                   16777216
        zfs_send_unmodified_spill_blocks                               1
        zfs_slow_io_events_per_second                                 20
        zfs_spa_discard_memory_limit                            16777216
        zfs_special_class_metadata_reserve_pct                        25
        zfs_sync_pass_deferred_free                                    2
        zfs_sync_pass_dont_compress                                    8
        zfs_sync_pass_rewrite                                          2
        zfs_sync_taskq_batch_pct                                      75
        zfs_traverse_indirect_prefetch_limit                          32
        zfs_trim_extent_bytes_max                              134217728
        zfs_trim_extent_bytes_min                                  32768
        zfs_trim_metaslab_skip                                         0
        zfs_trim_queue_limit                                          10
        zfs_trim_txg_batch                                            32
        zfs_txg_history                                              100
        zfs_txg_timeout                                               10
        zfs_unflushed_log_block_max                               131072
        zfs_unflushed_log_block_min                                 1000
        zfs_unflushed_log_block_pct                                  400
        zfs_unflushed_log_txg_max                                   1000
        zfs_unflushed_max_mem_amt                             1073741824
        zfs_unflushed_max_mem_ppm                                   1000
        zfs_unlink_suspend_progress                                    0
        zfs_user_indirect_is_special                                   1
        zfs_vdev_aggregate_trim                                        0
        zfs_vdev_aggregation_limit                               1048576
        zfs_vdev_aggregation_limit_non_rotating                   131072
        zfs_vdev_async_read_max_active                                 3
        zfs_vdev_async_read_min_active                                 1
        zfs_vdev_async_write_active_max_dirty_percent                 60
        zfs_vdev_async_write_active_min_dirty_percent                 30
        zfs_vdev_async_write_max_active                               10
        zfs_vdev_async_write_min_active                                2
        zfs_vdev_cache_bshift                                         16
        zfs_vdev_cache_max                                         16384
        zfs_vdev_cache_size                                            0
        zfs_vdev_default_ms_count                                    200
        zfs_vdev_default_ms_shift                                     29
        zfs_vdev_initializing_max_active                               1
        zfs_vdev_initializing_min_active                               1
        zfs_vdev_max_active                                         1000
        zfs_vdev_max_auto_ashift                                      14
        zfs_vdev_min_auto_ashift                                       9
        zfs_vdev_min_ms_count                                         16
        zfs_vdev_mirror_non_rotating_inc                               0
        zfs_vdev_mirror_non_rotating_seek_inc                          1
        zfs_vdev_mirror_rotating_inc                                   0
        zfs_vdev_mirror_rotating_seek_inc                              5
        zfs_vdev_mirror_rotating_seek_offset                     1048576
        zfs_vdev_ms_count_limit                                   131072
        zfs_vdev_nia_credit                                            5
        zfs_vdev_nia_delay                                             5
        zfs_vdev_open_timeout_ms                                    1000
        zfs_vdev_queue_depth_pct                                    1000
        zfs_vdev_raidz_impl cycle [fastest] original scalar sse2 ssse3 avx2
        zfs_vdev_read_gap_limit                                    32768
        zfs_vdev_rebuild_max_active                                    3
        zfs_vdev_rebuild_min_active                                    1
        zfs_vdev_removal_max_active                                    2
        zfs_vdev_removal_min_active                                    1
        zfs_vdev_scheduler                                        unused
        zfs_vdev_scrub_max_active                                      3
        zfs_vdev_scrub_min_active                                      1
        zfs_vdev_sync_read_max_active                                 10
        zfs_vdev_sync_read_min_active                                 10
        zfs_vdev_sync_write_max_active                                10
        zfs_vdev_sync_write_min_active                                10
        zfs_vdev_trim_max_active                                       2
        zfs_vdev_trim_min_active                                       1
        zfs_vdev_write_gap_limit                                    4096
        zfs_vnops_read_chunk_size                                1048576
        zfs_wrlog_data_max                                    8589934592
        zfs_xattr_compat                                               0
        zfs_zevent_len_max                                           512
        zfs_zevent_retain_expire_secs                                900
        zfs_zevent_retain_max                                       2000
        zfs_zil_clean_taskq_maxalloc                             1048576
        zfs_zil_clean_taskq_minalloc                                1024
        zfs_zil_clean_taskq_nthr_pct                                 100
        zil_maxblocksize                                          131072
        zil_min_commit_timeout                                      5000
        zil_nocacheflush                                               0
        zil_replay_disable                                             0
        zil_slog_bulk                                             786432
        zio_deadman_log_all                                            0
        zio_dva_throttle_enabled                                       1
        zio_requeue_io_start_cut_in_line                               1
        zio_slow_io_ms                                             30000
        zio_taskq_batch_pct                                           80
        zio_taskq_batch_tpq                                            0
        zvol_inhibit_dev                                               0
        zvol_major                                                   230
        zvol_max_discard_blocks                                    16384
        zvol_prefetch_bytes                                       131072
        zvol_request_sync                                              0
        zvol_threads                                                  32
        zvol_volmode                                                   2

VDEV cache disabled, skipping section

ZIL committed transactions:                                        40.6k
        Commit requests:                                            8.7k
        Flushes to stable storage:                                  8.7k
        Transactions to SLOG storage pool:            0 Bytes          0
        Transactions to non-SLOG storage pool:       69.6 MiB       6.5k

root@truenas[~]#

I see your point about efficiency, but with only two drives I struggle to see the benefit, as that's only 12-18 W at full tilt.

Even if spinning them down saved that entire 12-18 W around the clock, that's roughly 9-13 kWh per month, which comes to a dollar or two at typical rates.

Is all this worth it: wear and tear on the drive heads and motor, data safety, all for a dollar or two a month?

IMO just skip HDDs entirely and go for large 2.5" SSDs.


I've spent little time on HDDs. Whether directly attached or mounted over the network by one or more desktop computers, it was pretty difficult to control when they would or wouldn't spin. So much so that I simply left them spinning all the time, which I believe is healthier for their lifespan, at the expense of some stupid money on electricity.

I hope you'll figure something out, together with the others in this thread. Obsession is great and fun when it leads to new findings.

Just a casual question: do you have a battery backup? For clean shutdowns. An unclean exit will fuck up any database; I've screwed over Nextcloud a lot doing this. I have since moved to mirrored enterprise SSDs for my SLOG, a 980 Pro for L2ARC, and sync=always for anything a DB sits on.

Just personal experience. Don't let it discourage you.

This too, and I don't mean to be a wet blanket, but NAS drives that are validated for 24/7 use, or enterprise drives, don't like being switched on and off a lot. I believe Backblaze talked about this once? It makes sense mechanically, by the way: starting and stopping is where problems are statistically most likely to occur. It's also where you incur the most power draw; steady state pulls far less power motor-wise.

It doesn't have to be NVMe. He can run dual SATA Intel DC S3700s. They are plenty fast for SLOG duty; not as fast as RAM, but because they use Intel's special high-endurance MLC they last a really long time. There's been a good bit of praise for them on the TrueNAS forums over the years. I bought some spares in case I screw mine over with sync=always.

Hey everyone! Thanks for all the input, I really appreciate it!

That is what I have read as well. But that would've been too easy, wouldn't it?

I will definitely add a special vdev to my pool. The next step for me is setting up a VM and testing this; I will report back with my findings. Thanks for the suggestion!
Does somebody know how I can get an estimate of how much space my current zpool uses for metadata? I have no clue how big a special vdev should be, even before allocating any small data blocks to it.
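The closest thing I have found so far is walking the pool with zdb and looking at the per-type block statistics it prints at the end (a sketch with a placeholder pool name; this traverses every block, so it can take a long time on a big pool):

# "tank" is a placeholder pool name. zdb -bb prints a breakdown by block
# type; the non-data rows (dnodes, indirect blocks, space maps, ...) give a
# rough idea of the current metadata footprint.
zdb -bb tank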

Interesting thought. That would definitely ensure that all data gets written to the ZIL on the SLOG. But I agree that it would hurt performance dramatically. Another thing I can try out in my VM! I will report back.
Can someone clarify how the dirty data size setting comes into play? I fail to see the connection. I would assume that dirty data is everything that is in the process of being written but has not actually hit the disks yet, correct? So, data flowing through the cache and being hashed, compressed, etc.
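The only way I have found to watch this so far is the per-txg kstat on Linux (a sketch, assuming a pool named "tank"; zfs_txg_history has to be non-zero for it to record anything):

# Each recorded txg shows, among other things, how many dirty bytes it
# carried (ndirty) and how long it took, which reveals whether syncs are
# being triggered by the timeout or by dirty data pressure.
cat /proc/spl/kstat/zfs/tank/txgs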

I've skimmed the linked articles. Incredibly interesting, but I will have to take some time to dig into them and understand them fully. There are certainly some helpful ideas in there that I will try out.

I have thought quite a lot about the actual impact this endeavor would yield, and I guess it will be negligible. It is pure entertainment and the never-ending urge to tinker.
As far as drive wear is concerned, I do get your point. But my use case is not mission-critical, and I actually think modern hard drives can deal with being spun up and down quite well. I use 2 Toshiba MG08 Enterprise Capacity HDDs, which are rated for 600,000 load/unload cycles. I guess I will be fine, but I will report back if I am proven otherwise.

No, I actually don't have a battery backup. However, I am thinking about investing in a small UPS. But those are quite inefficient and draw a fair amount of extra power, don't they?
Luckily, all the datasets for my databases (PostgreSQL, MariaDB and Prometheus at the moment) are set to sync=always, because I had issues with unclean shutdowns and system crashes in the past, although they were not caused by a power outage. That is something I learned a while ago.

How long have you been running your SLOG on this SSD model? Does it show any wear yet?
Do you have any recommendations as to which SSDs to use for this job? I personally would have looked for some used TLC or MLC SSDs, but apart from that I would not have looked any further.

Thanks again for your posts. Let’s keep the ideas flowing.
I will set up my VM now and hopefully have the time in the next few days to test a few things.

Cheers!

Don’t do that.

You can do the other things above. I also suggest adding a SLOG device. You don't need to mirror the SLOG device: its contents duplicate the in-memory log by design, so it is mainly there for performance.

There may be some misconception about txg sync. If there is no data to write, a txg sync will not spin up the hard disks. I have this setup at home, so I can confirm this.
The key is to minimize disk access by any means possible. Don't put VMs on that pool. You can also import the pool as read-only.
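
For reference, a read-only import would look like this (the pool name is a placeholder):

# Nothing can dirty the pool while it is imported read-only, so the disks
# only have to spin up for uncached reads.
zpool import -o readonly=on tank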

Optane or enterprise-grade MLC drives.

I understand where you're coming from; recommending a SLOG device is always the more responsible position if you don't know the workload and the situation.
But the truth is, the majority of private NAS systems have a 100% async workload.
Samba, for example, doesn't make sync calls, NFS does, and with iSCSI it depends on the application.
The vast majority of benchmarks are useless and will lead to wrong decisions unless they really mimic the intended use of your system.

The recommendation should be either "set sync=always and use a SLOG", or test all real workloads and see whether anything actually issues sync calls and how important that data is to you (one way to check this is sketched below).
Maybe it's just a test database; does it really need a SLOG?
I'd rather recommend spending the money on a second NAS for replication than slaughtering the performance of your zpool with RAIDZ2 and sync=always.
As I said, I'm talking about private use here, where the probability of a pending TXG holding irretrievable data is very low.
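
One way to check is to watch the global ZIL counters while exercising a workload (a sketch for Linux; I believe this is the same counter set that shows up in arc_summary's ZIL section):

# zil_commit_count and friends only increase when something actually issues
# sync writes (fsync, O_SYNC, sync NFS/iSCSI ops), so counters that stay
# flat during a test mean the workload is effectively async.
watch -n 5 cat /proc/spl/kstat/zfs/zil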


Any success with this? I'm more interested in coalescing writes over a longer interval to reduce fragmentation than in drive spindown.

At work we have an all-flash EMC Unity that appears to have 20-something GB of battery-backed RAM that it uses as a landing zone for writes. The flush interval seems to be about every 15 minutes based on the performance charts. Now I understand how our DBAs can shitkick this thing year after year and yet the SSDs never wear out: write amplification is probably close to zero with the big sequential writes it does.

I just got an Intel D7-P5600 6.4 TB U.2 drive off Amazon for $369. It is rated for 3 drive writes per day for 5 years, which works out to roughly 35 PB of rated writes. This drive line was sold to Solidigm, so that is where you get the management tool. The tool reported zero power-on hours, so it is a new drive.

With normal use I don't think you could wear this drive out.

Mixed load of Docker containers/VMs/data:
2 x spinning rust mirrors
1 x SAS SSD special vdev mirror for metadata/small files
1 x SAS SSD L2ARC

I landed on sync=disabled, zfs_txg_timeout=115.

Keeping the interval under 2 minutes prevents the heads from unloading on these disks. I assume there are some small power savings from the disks spending more time in Idle A and less time doing writes, but mostly this is about coalescing writes, less fragmentation, etc. I can't really imagine any scenario where I would care about a power failure rolling the server back a few minutes, and during the night it would be even less of a concern. I will do some testing on extending the timeout to several hours while not taking any snapshots during that time.
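
For reference, a sketch of how those two settings look on Linux ("tank" is a placeholder pool name):

zfs set sync=disabled tank
echo 115 > /sys/module/zfs/parameters/zfs_txg_timeout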

If I understand correctly, zfs_dirty_data_max is limited to 4 GB unless zfs_dirty_data_max_max is increased before the zfs module is loaded. To set aside 16 GB for this:

/etc/modprobe.d/zfs.conf

options zfs zfs_dirty_data_max_max=17179869184
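
If the automatically derived zfs_dirty_data_max (10% of RAM by default) is still lower than wanted, it can be pinned in the same file; a sketch:

options zfs zfs_dirty_data_max_max=17179869184 zfs_dirty_data_max=17179869184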