I'm thinking about setting up a Kubernetes cluster to test out different databases (Postgres, Neo4j, Elasticsearch), and also to run things like Spark, Trino, MinIO, Airflow/Flyte/Prefect, Kubeflow, MLflow, etc.
The most annoying thing is that Kubernetes typically either uses local volumes or connects to a remote NFS store. Since any node can go down, local volumes are out of the question for distributed storage. NFS is OK at best: it seems like a very manual way of managing storage across multiple devices, and it potentially causes other issues.
Building a true NAS node and using it as my NFS store seems like a bottleneck, and a potentially expensive way to get enough throughput to support all nodes. It also would not exploit any data locality (latency issues galore). What storage system should I look at for a multi-node setup? Should I look at something like GlusterFS or Ceph, or something else?
P.S. Some of the larger datasets I deal with can be tens of terabytes.