I’m doing a video on this. The Intel team looked at the NVMe driver in the kernel and noped right out.
What we have now from them is pretty neat looking. A lockless, zero-copy NVMe driver? Let’s merge that into core. Also, can I get that running on Threadripper/Epyc??
My team and I worked on this with some earlier Optane drives. SPDK is tricky to wrap your head around, though; it’s not a traditional driver. Unless they’ve changed a lot about SPDK recently, there is no way to actually use it as a driver for the kernel, so running a filesystem on it directly isn’t a real option. What you can do is link the library directly into a userspace consumer of block IO, like fio or a database engine. It’s a pretty cool project if you’re building a specialized storage appliance or an NVMe-oF solution, but it’s not really relevant for normal desktop or traditional server usage.
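To make that concrete, here’s roughly what a userspace consumer linked against SPDK looks like. This is a compressed sketch along the lines of SPDK’s hello_world example (written from memory, so treat the details as approximate); the PCIe setup is implicit, and before any of this runs the drive has to be unbound from the kernel’s nvme driver (SPDK ships a scripts/setup.sh that rebinds it to vfio-pci/uio):

```c
/* Sketch of a polled-mode read via SPDK, compressed from the shape of
 * SPDK's examples/nvme/hello_world. Error handling omitted. */
#include <stdbool.h>
#include <stdio.h>
#include "spdk/env.h"
#include "spdk/nvme.h"

static struct spdk_nvme_ctrlr *g_ctrlr;
static struct spdk_nvme_ns *g_ns;
static bool g_done;

static bool probe_cb(void *ctx, const struct spdk_nvme_transport_id *trid,
                     struct spdk_nvme_ctrlr_opts *opts) {
    return true; /* attach to every controller we find */
}

static void attach_cb(void *ctx, const struct spdk_nvme_transport_id *trid,
                      struct spdk_nvme_ctrlr *ctrlr,
                      const struct spdk_nvme_ctrlr_opts *opts) {
    g_ctrlr = ctrlr;
    g_ns = spdk_nvme_ctrlr_get_ns(ctrlr, 1); /* namespace 1 */
}

static void read_done(void *arg, const struct spdk_nvme_cpl *cpl) {
    g_done = true;
}

int main(void) {
    struct spdk_env_opts opts;
    spdk_env_opts_init(&opts);
    spdk_env_init(&opts);          /* hugepages, VFIO/UIO, DMA mappings */

    spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, NULL);

    struct spdk_nvme_qpair *qp =
        spdk_nvme_ctrlr_alloc_io_qpair(g_ctrlr, NULL, 0);

    /* DMA-able buffer; the device reads/writes it directly, no copy */
    void *buf = spdk_zmalloc(4096, 4096, NULL,
                             SPDK_ENV_SOCKET_ID_ANY, SPDK_MALLOC_DMA);

    spdk_nvme_ns_cmd_read(g_ns, qp, buf, 0 /* LBA */, 1 /* blocks */,
                          read_done, NULL, 0);
    while (!g_done)
        spdk_nvme_qpair_process_completions(qp, 0); /* busy-poll, no IRQs */

    spdk_free(buf);
    spdk_nvme_ctrlr_free_io_qpair(qp);
    spdk_nvme_detach(g_ctrlr);
    return 0;
}
```

Note there’s no file descriptor and no syscall anywhere in the IO path; your application *is* the driver, which is exactly why a normal filesystem can’t sit on top of it.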
I’m a bit skeptical that, with tuning effort similar to what goes into getting SPDK to perform at that level, the kernel NVMe driver couldn’t get pretty close, especially with io_uring. With io_uring the userspace API is nearly identical to the NVMe spec’s submission/completion queue model, so that should help with any friction in the block layer. BTW, I’m pretty certain the zero-copy argument is bogus for NVMe: the devices DMA directly to/from the buffers the user application allocates, so where would the memory copy happen?
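For comparison, a minimal io_uring read against a raw NVMe namespace looks like this (the device path is just an example). The ring is a submission/completion queue pair much like an NVMe SQ/CQ, and with O_DIRECT the device DMAs straight into the buffer the application allocated:

```c
/* Minimal io_uring read with O_DIRECT. Requires liburing (-luring). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);          /* 8-entry SQ/CQ pair */

    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT); /* example path */

    void *buf;
    posix_memalign(&buf, 4096, 4096);          /* O_DIRECT wants aligned IO */

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, 4096, 0); /* 4 KiB at offset 0 */
    io_uring_submit(&ring);                    /* ring doorbell, NVMe-style */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);            /* reap the completion */
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    free(buf);
    close(fd);
    return 0;
}
```

And with registered buffers and SQPOLL you can shave the per-IO syscall count toward zero, which is most of what the SPDK benchmarks are really measuring.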
I found some work that moves the ACID compliance bits of PostgreSQL (the write-ahead log) onto NVDIMM-type hardware, and that makes a huge difference in IO scaling and in scaling the database in general. Compared to SPDK, it’s pretty drop-in.
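I haven’t read those PostgreSQL patches closely, so this is just the general shape of the trick with PMDK’s libpmem: map the NVDIMM, copy the log record in, and make it durable with cache-line flushes instead of a write()+fsync() round trip. The path and record below are made up:

```c
/* Sketch: appending a log record to persistent memory with libpmem. */
#include <libpmem.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    size_t mapped_len;
    int is_pmem;

    /* Map (creating if needed) a 16 MiB file on a DAX-mounted fs */
    void *base = pmem_map_file("/mnt/pmem/wal.log", 16 << 20,
                               PMEM_FILE_CREATE, 0600, &mapped_len, &is_pmem);
    if (base == NULL) { perror("pmem_map_file"); return 1; }

    const char rec[] = "BEGIN;UPDATE ...;COMMIT";
    if (is_pmem) {
        /* store + cache flush + fence: durable with no syscall */
        pmem_memcpy_persist(base, rec, sizeof(rec));
    } else {
        /* fall back to msync-based persistence on non-pmem mappings */
        memcpy(base, rec, sizeof(rec));
        pmem_msync(base, sizeof(rec));
    }

    pmem_unmap(base, mapped_len);
    return 0;
}
```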
Do anything with Open CAS? Seems like that gets you a good bit of the way there without going into custom block driver land as with SPDK.
It’s been a while since I glanced at Open CAS, so I just reviewed it again. There are some questions as to how it works. It looks like it’s a Linux module at its core, but they go to great lengths to circumvent the GPL, so the integration with Linux is a little tough for me to comprehend at a glance. Ultimately it’s more a competitor to bcache and company, no?
BTW, caching is a tough thing to get right at the block layer. So many attempts, and yet very little success at getting anything mainstreamed. Benchmarking is a huge part of the problem: does the solution even work? I liked your simple approach in the H20 review; the real world is so different from benchmarks.
The NVDIMM stuff has a lot of potential when applied to the specific problems it can really solve, like various locking and logging problems. I have a long and somewhat sad history with this: I headed up a lot of research 15 years ago to make it happen. We had plans on the OS side and had even figured out the hardware, but the powers that be didn’t back us up. And I’m not impressed with how poorly the ecosystem was prepared to deal with NVDIMMs. The battery-backed ones were ignored for years instead of being used to mature the base tech and build a business out of the high-value niches where you could charge extra for the nonvolatile nature. We’re a decade behind where we could be.