Faster than fread() and read() in C?

According to CrystalDiskMark 8 (and Windows Task Manager), my Gen4 M.2 SSD has a measured sequential read speed of ~7000 MB/s.

But my Visual Studio C program only achieves ~900 MB/s when using fread() to read a 16GB file in large blocks (e.g., 1MB blocks).

I’ve tried much larger and somewhat smaller blocks with about the same results. I also see similar results with read().

Is there another C function that would do better than fread() or read()?

Thank you.

Don’t know about specific C functions, especially portable ones—but investigate memory-mapped file access, that should be faster.

3 Likes

The increased performance likely comes from asynchronous IO operations.

See Synchronous and Asynchronous I/O - Win32 apps | Microsoft Learn for Windows-specific solutions.

2 Likes

Yes. CrystalDiskMark is a UI wrapper for an older version of DiskSpd that surfaces DiskSpd’s -o option as queue depth, so only the Q1 runs end up being synchronous.

Not sure for C/C++ but, given recent gen hardware and adequate DDR bandwidth available, benching ~4 GB/s per thread through ReadFile() should be doable. In real programs that read data to do something with it, though, it doesn’t take many operations on the IO thread to drop throughput below 1 GB/s.

Generally what I do for 7 GB/s is run enough threads to keep the drive saturated (varies a lot but 8-12 isn’t uncommon). But, even when coding for DDR efficiency, it’s easy to have a workload that saturates dual channel DDR before PCIe 4.0 x4.

My practical experience has tended to be async is less DDR efficient than sync IO. So, to borrow CrystalDiskMark terminology, Q1T8 app implementations can outperform Q8T1.

2 Likes

Thanks, Marandil and Lemma.

Regarding queuing (or queueing?): I think of queuing as performing one operation after another – not performing them concurrently. Is an NVMe drive capable of concurrent operation? If not, how does queuing help? Same for multiple threads?

For my particular requirement (the file size is 200GB), raw sequential read speed is what’s needed, as the computation between reads is minimal.

Thanks again.

1 Like

Both are valid. :grinning: There’s quite a bit of complexity to it but IO stacks in general are capable of concurrent operations. Queuing here refers to giving the stack (kernel, device driver, drive firmware, and hardware) a bunch of requests to work on in parallel. In the meantime, app threads either block on a particular synchronous request until it completes or are released to do something else until an asynchronous IO completion call’s triggered by the requested data having been moved.

If the file format permits, the fastest way through 200 GB may be something like eight threads reading ~2 MB per call with the calls progressing sequentially through the file. The larger the amount of data read per call, the lower the overhead per byte read. But the higher the latency, since retrieving more data takes longer. What you’re looking for is the balance that keeps enough threads putting up requests for data that the drive stays saturated, without making the reads so large the cores get in a fight over L3.

3 Likes

Thanks, Lemma.

At a high level, the program makes repeated calls to fgets() to process .csv records, and the per-record processing time is negligible.

I can replace fgets() with MYfgets(), which in turn would, as needed, call MYfread() to read the next 16MB block using (per your suggestions) 8 threads, each reading 2MB of the block.

First off, is my understanding of your suggestions correct?

Second, to do this, do you know if all the functions needed to launch the threads, issue the reads, and detect completion are covered in the link you provided earlier — or should I be looking elsewhere? Again, preferably in C.

Thank you.

1 Like

Typically you’d use an object-oriented task library, of which there are a bunch. std::execution and boost::thread are among the more popular ones I’m aware of.

If it’s a fairly typical .csv, changing to a more efficient format like Arrow is likely to gain you a couple of orders of magnitude.

Thank you, Lemma.