What is a "large" file

When reading about filesystem and/or RAID performance, you often read about certain configurations being better for “large” or “small” files. What defines a large or small file? Sorry if this is a dumb question; I googled the heck out of it with no success.

My guess is it has to do with the file size allowed by the filesystem on the hard drive.
A large file is sort of an abstract concept, isn’t it? Is it 100 KB, 100 MB, 100 GB, 100 TB?
But there is a difference in how large files can be on different filesystems, e.g. FAT32 can handle ~4 GB files, while NTFS can handle files of pretty much whatever size.


It’s probably related to random read/write rather than sustained, sequential access. I’d recommend watching some SSD reviews by Linus Tech Tips; he does a pretty good job explaining these concepts.

Not a bad question, just really subjective.
I would look at it in more extreme ways: text documents are tiny, a few KB or whatever, compared to a DVD backup or Steam game of many GB.

Most systems have a mix of the two, so it can be tricky deciding which way to optimise.

For my systems, I generally think of anything 1 MB or less as a “small” file. Even then, different datasets can have predominantly different file sizes, e.g. one dataset favouring large files for Steam games, a large-file set for media, a small-file set for home, and probably small for root.

Does anyone have a suggestion for a tool for assessing the size of files already on a system?


I’m thinking along the lines of du?

I don’t think there is a proper answer. These days I tend to think of files over 1 GB as large, but that’s pretty meaningless. I will, however, format spinning rust with larger block sizes if I know a drive will only hold videos etc., which are typically around that size or larger.


Reading your comments makes me realize why I couldn’t find an obvious answer. It’s just funny when you read things like “XFS is great for large files but not small ones” when nobody can actually tell you what those sizes are.
In my particular case, I am most concerned with the handling of raw photos. These files average about 30 MB each, and my current library is around 2 TB, or about 60,000 files. For the most part the files are dealt with on an individual basis (for editing with Darktable). Would my ideal filesystem be one that handles large files well, such as XFS?
Also, these files are on a two-disk array, currently RAID 10 with the far2 layout. Many resources talk about needing to assign a chunk size that suits your file sizes, again being vague about what is large, what is small, and what the accompanying chunk sizes would be.
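
For reference, the chunk size and layout of an existing Linux software RAID array can be checked with something like the following (/dev/md0 is just a placeholder for whatever the array device is called):

  cat /proc/mdstat
  sudo mdadm --detail /dev/md0 | grep -Ei 'chunk|layout'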

It really depends on the scale.

When you’re talking about large files, it’s typically 100 MB+ in my opinion.

I do a lot of work with media, so I’ve regularly got 50GB+ files.

Small files are typically in the tens of megabytes or smaller.

The logic behind “good for large” vs “good for small” is all about how the filesystem handles metadata and random IO.

If the filesystem needs a lot of time to write metadata, it’s going to be worse at handling lots of small files.
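
A rough way to see the metadata cost for yourself is to write the same amount of data as thousands of tiny files and as one big file, then compare the times. A sketch, assuming a Linux shell with GNU coreutils (the directory and file names are placeholders, and /dev/urandom adds some overhead of its own, but the gap from per-file metadata work still shows up):

  mkdir smallfiles
  # 10,000 x 4 KiB files: one create plus metadata update per file
  time bash -c 'for i in $(seq 1 10000); do head -c 4096 /dev/urandom > smallfiles/f$i; done'
  # the same ~40 MiB as a single file: one create, mostly sequential data
  time head -c $((10000*4096)) /dev/urandom > onebigfile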


You can get a ‘sense’ of small vs large files when doing mixed file transfers between devices.

You can watch the file transfer dialogue scream along at MB/s or even GB/s when transferring large files such as ISOs and various container-type files (disk images, archives, etc.).

But when that transfer hits a directory full of kilobyte and even smaller files, those MB/s fall by the wayside and you’re back at KB/s and an hour or more of waiting for large sets of this type of data.

And this reply from Quora’s “mathew” gives good insight into the technicalities of small-file data transfer.

https://www.quora.com/Why-does-copying-multiple-files-take-longer-time-than-copying-a-single-file-of-same-size

more than 200 MB

as defined by HP pre-sys diagnostics software

I’m pretty sure it refers to the 2 GB file size limit on FAT32 filesystems.
(Edit: looking into it, it looks to be 4 GB. I swear 2 GB was a thing too, but maybe that was plain FAT with 8.3 names only.)

I think it was sort of worked around in the Windows API around ME/2000, which allowed larger files.

But I do distinctly remember it being a no-no to make a file larger than 2 GB on a FAT32 volume in Windows.

Though through some degree of fuckery you could manage to make a 2+ GB file, and typically you had to use a Linux live CD to delete them. I recall torrent clients doing it pretty regularly, and Outlook making absurdly large PST files (4-12ish gigabytes) and eventually breaking pretty spectacularly. I feel no sympathy for the mail hoarders that haven’t deleted spam, nor sent mail, since 1995.

What counts as large or small, and what the maximum file size can be, also depends on the filesystem, though this is far less important nowadays.

In benchmarking terms, it largely depends on the drive.

If a drive has read/write speeds of 500 MB/s, a sequential write of a 100 MB file isn’t really a significant “endurance” test for the drive, since it can finish it in less than a second.

But for random read/write tests, 100 MB of random data could be a decent sample size if it were split up across various regions of the drive.

Block size can also be a factor: SSDs can’t write half a block, so the write either gets held back or the remaining space is padded. Using odd-sized writes with dd can make a drive perform slower than it should.

On modern drives, I would suspect that running
dd bs=13k if=/dev/zero of=testfile
would probably run noticeably slower than
dd bs=4k if=/dev/zero of=testfile
because of the odd block size.
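
If anyone wants to try that comparison themselves, something like this should work on Linux (a sketch only: the file names are placeholders, conv=fdatasync makes dd include the flush to disk in its reported speed, and you’ll want to delete the test files afterwards):

  dd if=/dev/zero of=test_4k bs=4k count=262144 conv=fdatasync    # 1 GiB in aligned 4 KiB blocks
  dd if=/dev/zero of=test_13k bs=13k count=80660 conv=fdatasync   # ~1 GiB in odd-sized 13 KiB blocks
  rm test_4k test_13k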

There are a lot of ways to look at it.

But if you’re looking for a “rule of thumb” large-file size for gauging write performance:

Traditional HDD (~100 MB/s write): 2 GB
SATA SSD (~450 MB/s write): 8 GB
NVMe SSD (1 GB/s+ write): 32 GB
SD card (cheap <10 MB/s, high end ~75 MB/s): 500 MB
USB 2 flash drive (1-10 MB/s): 100 MB
USB 3 flash drive (cheap 10 MB/s to high end 300 MB/s): 1 GB


If you’re thinking of a 2 GiB limit on Windows, it may have been the 2 GiB virtual address space limit on 32-bit applications, since everything above 0x80000000 was reserved for the kernel.

Another older limit was the 2 TiB cap on partitions with MBR, since MBR uses 32-bit LBAs and drives typically had 512-byte logical sectors even if they had 4 KiB physical sectors.


I’d suggest that it depends very much on context.

For me personally, I consider a “large file” to be one that is more than, say, 1 MB (or something around there) in size.

Why? Nothing to do with filesystem size limits, but more to do with sequential vs. random access on spinning rust.

A 1 MB file stored contiguously can be read pretty damn fast from a hard drive (probably one sequential read op). 1 MB worth of 4k files may be scattered all over a platter and take a lot more time to read, being split into, say, 256 separate IOs. If they’re all over the disk it will take a couple of orders of magnitude more disk IO to read.
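
If you have fio installed, you can put rough numbers on that sequential-vs-random gap; a sketch (the file name and sizes are arbitrary, and --direct=1 bypasses the page cache so the drive itself is measured):

  fio --name=seq1m --filename=fio_test --size=256M --bs=1M --rw=read --direct=1
  fio --name=rand4k --filename=fio_test --size=256M --bs=4k --rw=randread --direct=1
  rm fio_test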

But yeah, it’s very arbitrary and depends heavily on surrounding context. The context above is IO speed. If you’re talking filesystem limits, the numbers will be entirely different.

For RAID performance analysis as per the OP, I’d consider treating “large files” as files that are bigger than your RAID stripe size, and “small files” as files that are smaller than a full stripe - because that’s where your performance will be impacted.

A small file that is smaller than a stripe will require a stripe read, recalculation and write just to do a write.

A full stripe write won’t need to read the stripe first (to re-calculate parity on the rest of the stripe to avoid clobbering files also stored in that stripe)…

This sort of thing is why you want to choose particular RAID options or filesystem block sizes carefully for performance.

A large block size on your filesystem will read big files faster but waste storage on files that are smaller than the block size. It will also read small files slower because it reads the entire block, e.g., with a 4k file on a 64k block size, your disk gets told to read 64k even though you only wanted 4k.
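
Filesystem block size is normally chosen at format time. A sketch, where /dev/sdX1 is a placeholder for the target partition and 4096 is simply the common default:

  mkfs.ext4 -b 4096 /dev/sdX1        # ext4 with 4 KiB blocks
  mkfs.xfs -b size=4096 /dev/sdX1    # roughly equivalent for XFS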

Ditto for RAID stripe size: if your stripe size is too big you will be continually re-calculating stripes (which means a read first) when you re-write data for files smaller than your stripe size. But if it is too small, you may not get the full benefit of sequential block reads on large files.

If you’re setting up RAID for a database, for example, you want to pay attention to your database record sizes vs your stripe size for maximum performance. Otherwise you incur read/write overhead on the storage layer (i.e., your RAID/disk drives) that wouldn’t have been necessary for the database workload if the sizes were all lined up properly.

This will always happen to SOME degree because not every single write will be the same as your block/stripe size. But if you can pick sizes that are the best fit most of the time (to result in the smallest number of partial stripe writes) you will see better performance.
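
On Linux software RAID the stripe chunk is likewise fixed at creation time. A minimal sketch for a two-disk RAID 10 far2 array like the OP’s (device names, the md number and the 512 KiB chunk value are placeholders, not recommendations):

  mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=2 --chunk=512 /dev/sda1 /dev/sdb1
  # --chunk is in KiB; it is the "stripe size" being discussed above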

As per @SgtAwesomesauce above there are also other considerations like metadata writes. Again context dependent, but the OP mentioned RAID specifically hence the above attempt to explain…


I’ll also mention: for most home-user scenarios it really doesn’t matter too much, especially on SSD storage. SSDs can do so many IOPS that the trade-off between block size and file size isn’t such a big deal. Just use 4k blocks and be done with it (IMHO).

I just generally stick with 4k block sizes on my filesystems and use ZFS instead of traditional RAID; ZFS has a variable stripe size, so EVERY write in ZFS is a full-stripe write.
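
If you do go the ZFS route, the closest knob to “tune for large vs small files” is the per-dataset recordsize; a sketch with hypothetical pool/dataset names:

  zfs set recordsize=1M tank/photos    # bigger records for datasets full of large files
  zfs get recordsize tank/photos       # check the current value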

But… if you’re a DBA or system architect and trying to make spinning rust go fast for a specific workload (probably a dying scenario), all of the above is stuff to take into consideration.

@Trooper_ish
~$ du -sh /path/to/file/or/directory
is my go-to.

e.g.

  ~$ du -sh /media/me/data/iso/
    50G     /media/me/data/iso/
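
If you want a rough breakdown by size rather than a single total, something like this works with GNU find and awk (the path and the 1 MB / 100 MB buckets are arbitrary):

  ~$ find /media/me/data -type f -printf '%s\n' | awk '
       $1 < 1048576   { small++;  next }
       $1 < 104857600 { medium++; next }
                      { large++ }
       END { print small+0, "files < 1 MB,", medium+0, "files 1-100 MB,", large+0, "files > 100 MB" }'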

Thanks bud, I appreciate it.

A large file can be considered anything that makes use of the sequential read and write performance of a drive, so it’s a sustained load.
Small files are files on the order of KBs, not large enough to trigger a sustained load, and they rely basically on the performance of the controller and the latency of the memory you’re writing to.

A 15 yr old and a 55 yr old will give different answers.
