HDD cache size - Is it relevant in RAID? And how does it impact I/O?

I'm aware of what purpose the HDD cache serves, but I'm curious whether it's really all that relevant in modern storage setups like ZFS (JBOD) or RAIDz(1/2/3).

Typically homelabbers can find two types of drives: enterprise and consumer. I found these two here, same brand, to use as an example:

Enterprise: 3TB, 7200RPM, SATAIII 6Gb/s, 1 year warranty, 64MB cache - $72
Consumer: 3TB, 7200RPM, SATAIII 6Gb/s, 1 year warranty, 128MB cache - $60

As you can see, the consumer drive has 2x cache size and is $12 cheaper.

In my experience, most homelabs have at most 6-24 HDDs in 1-2 RAIDz1 or RAIDz2 VDEVs.

My question is this: Does the larger cache and cheaper price of consumer drives outweigh the disadvantage of not being purpose-built for use in a NAS/SAN? When used in an array, does the cache affect I/O performance significantly?

Most articles I've come across tend to focus on raw sequential read and write throughput while ignoring random I/O (IOPS) altogether. My situation is that I'm building a NAS to be the backend for a compute node running multiple VMs and databases. IOPS are very important!

...but so is keeping the cost down while retaining data integrity (losing all my storage would be bad, y'know?)

I'm currently looking to build a RAIDz1 pool of 5 of these drives with LZ4 compression enabled. This seems to be a good mix of capacity and throughput for my budget. I'll eventually expand by adding a second identical pool when I can afford it. The part I'm unsure of is how to maximize my I/O, and how disk cache affects that.
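For reference, a quick sketch of the usable capacity of that layout. This assumes 3TB drives and one drive's worth of parity per RAIDz1 vdev, and ignores ZFS metadata/slop overhead and any LZ4 compression gains:

```python
# Rough usable-capacity estimate for the 5-drive RAIDz1 pool described above.
# Simplifying assumption: usable space = (drives - parity) * drive size,
# ignoring ZFS metadata, slop space, and compression.

def raidz_usable_tb(drives: int, drive_tb: float, parity: int = 1) -> float:
    """Usable capacity of a single RAID-Z vdev, in TB."""
    return (drives - parity) * drive_tb

print(raidz_usable_tb(5, 3.0))            # 12.0 TB for 5x3TB RAIDz1
print(raidz_usable_tb(5, 3.0, parity=2))  # 9.0 TB if RAIDz2 instead
```

Adding the second identical pool later would roughly double those figures.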

Enterprise drive: HGST Ultrastar
Consumer drive: HGST DeskStar

While both drives in my example are refurbished, you can also look at the two most common homelab drives used today (January 2017): the WD Red (shucked from a My Book Duo) and the Toshiba X300.

NOTE2: I think this would be a great topic for Wendell to cover in his ZFS storage series.

Huh, no answers yet? I was expecting people praising ZFS and telling you that everything will be fine.

Short answer: no one can tell you, because it depends on the workload. Cache is there mostly to smooth I/O spikes and maybe turn some random I/O into sequential I/O. But if you're expecting a heavy random load 24/7, especially if it's mostly random write I/O, cache won't really help you. Whatever it is (drive cache, controller cache, SSD cache, RAM cache) and whatever its size, you will exhaust it at some point, and after that the destaging process will be bottlenecked by raw drive performance: roughly 300-400 read IOPS, or as low as ~70 write IOPS, in your case.
In the case of a high random load there is no silver-bullet solution except the obvious one: throwing money at the problem. You can either go for a shitload of spindles, or go for SSDs (or maybe even FMDs).
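To put the cache-exhaustion point in rough numbers, here's a back-of-the-envelope sketch; the 70 random-write IOPS and 4KiB I/O size are illustrative assumptions, not measurements:

```python
# Once any cache tier is full, writes drain at the raw drive rate.
# Assumed figures: a completely full 128MB drive cache, 70 random-write
# IOPS, 4KiB per I/O. These are illustrative, not benchmarks.

def drain_seconds(cached_bytes: int, iops: int, io_size: int = 4096) -> float:
    """Time to destage a full write cache at a fixed random-I/O rate."""
    return cached_bytes / (iops * io_size)

cache = 128 * 1024 * 1024                    # a full 128MB drive cache
print(f"{drain_seconds(cache, 70):.0f} s")   # ~468 s at 70 write IOPS
```

In other words, even the bigger consumer-drive cache buys you only seconds to minutes of burst absorption before raw spindle speed takes over.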

I kinda agree; it depends on read/write frequency and the amount being written. Things will go smoothly if you have a cache and you're able to flush that data to disk before filling it up.

But if your workload is bursty, say 2-4GB that you want read/written fast, with a couple of minutes of idle time between workloads most of the time, then cache makes an extreme difference.
(2-4GB is just an example; if you have the memory, and the whole server is dedicated to RAID storage, why not use 60% of that memory as cache?)

Then again, there are different types of caching, and it depends on what you consider 'safe'.

If you don't mind losing recent changes, or don't think you'll ever lose power, I recommend unsafe or writeback caching. It will make a huge difference: data is kept in memory until it's written out to the HDDs.

In a typical production environment, many consider writethrough the best option; it's usually the default cache mode because it's considered safe. Performance is reduced because there is no write caching.

What are the gains?
Reading and writing 1GB to a slow 5400RPM drive with its own 64MB internal cache, unsafe caching gains you almost 10x in performance for a single drive.
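As a rough illustration of where a ~10x figure can come from (the throughput numbers below are assumptions, not benchmarks): a write-back cache acknowledges writes at memory speed and flushes them later, so until the cache fills, the apparent speedup is just the ratio of the two rates.

```python
# Toy model of write-back caching gains. Assumed figures: ~100 MB/s
# sustained to a slow spinner vs. ~1000 MB/s acknowledged into RAM.

def write_seconds(size_gb: float, mb_per_s: float) -> float:
    """Time to move size_gb at a sustained rate of mb_per_s."""
    return size_gb * 1024 / mb_per_s

direct = write_seconds(1, 100)     # straight to the slow drive
cached = write_seconds(1, 1000)    # acknowledged into RAM, flushed later
print(f"{direct / cached:.1f}x")   # ~10x apparent gain, until the cache fills
```

The catch, as noted above, is that the gain only lasts as long as there's idle time to destage before the next burst.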

You will benefit from RAID 0 in terms of performance, but you risk losing data, so RAID 5 or RAID 10 is one of the best options for production. Again, if you don't mind losing data, RAID 0 gives maximal performance.

Having a read-only cache is a good approach: put reusable data that doesn't change a lot on faster storage and/or behind better caching. For read performance you really should have a cache.
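The read-cache benefit boils down to hit rate. A minimal average-latency sketch, where the 0.1ms cache and 10ms disk latencies are assumed figures rather than measurements:

```python
# Effective read latency is the hit-rate-weighted mix of cache and disk
# latency. The 0.1ms / 10ms figures are illustrative assumptions.

def effective_latency_ms(hit_rate: float,
                         cache_ms: float = 0.1,
                         disk_ms: float = 10.0) -> float:
    """Hit-rate-weighted average read latency in milliseconds."""
    return hit_rate * cache_ms + (1 - hit_rate) * disk_ms

print(f"{effective_latency_ms(0.9):.2f} ms")  # 1.09 ms with 90% hits
print(f"{effective_latency_ms(0.0):.2f} ms")  # 10.00 ms with no cache
```

This is why caching mostly-static, frequently re-read data pays off so well: the hit rate stays high, so the average latency stays close to cache speed.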

In any case, if your underlying storage is slow, caching is one of the best ways to alleviate the problem.