Kubernetes Cluster Storage

I was thinking about setting up a kubernetes cluster to test out different dbs (postgres, neo4j, elastic), but also to run things like spark, trino, minio, airflow/flyte/prefect, kubeflow, mlflow, etc.

The most annoying thing is that kubernetes typically uses either local volumes or a connection to a remote nfs store. Since a node can go down at any time, local volumes are out of the question for distributed storage, and nfs is ok at best: it's a very manual way of managing storage across multiple devices and can cause other issues.
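By "manual" I mean something like this: every NFS-backed PersistentVolume has to be statically pointed at a server and export path you manage yourself. A minimal sketch with the official kubernetes Python client (the server address and export path are just placeholders, not a real setup):

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# A statically provisioned PV pointing at a hand-managed NFS export.
pv = client.V1PersistentVolume(
    metadata=client.V1ObjectMeta(name="nfs-shared-data"),
    spec=client.V1PersistentVolumeSpec(
        capacity={"storage": "500Gi"},
        access_modes=["ReadWriteMany"],           # NFS lets many nodes mount it
        persistent_volume_reclaim_policy="Retain",
        nfs=client.V1NFSVolumeSource(
            server="192.168.1.50",                # the single NAS box = single bottleneck
            path="/exports/k8s/shared-data",      # export carved out by hand
        ),
    ),
)
core.create_persistent_volume(body=pv)
```

And you repeat that (plus a matching PVC) for every volume, on every export you carve out by hand.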

Building a dedicated nas node and using it as my nfs store seems like a bottleneck and a potentially expensive way to get enough throughput to support all nodes. It also wouldn't exploit any data locality (latency issues galore). What storage system should I look at for a multi-node setup? Something like GlusterFS or Ceph, or something else?

p.s. some of the larger datasets I deal with can be 10s of terabytes.

I think I’m going to try k8s with OpenEBS for typical dbs, back up snapshots to my nas server, and use minio for s3-compatible blob storage. We shall see how it goes.
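Rough sketch of what that plan looks like from the client side, in case it helps anyone: a PVC against one of the OpenEBS storage classes for a db, and the minio Python client for the blob side. The storage class name, namespace, endpoint, and keys below are placeholders for whatever your install actually exposes.

```python
from kubernetes import client, config
from minio import Minio

config.load_kube_config()
core = client.CoreV1Api()

# PVC for a database pod, bound to an OpenEBS-provisioned local volume.
# "openebs-hostpath" is the default hostpath class in a stock install;
# adjust to whatever class your operator created.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="postgres-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="openebs-hostpath",
        resources=client.V1ResourceRequirements(requests={"storage": "200Gi"}),
    ),
)
core.create_namespaced_persistent_volume_claim(namespace="databases", body=pvc)

# Large datasets go to MinIO over the S3 API instead of a PVC.
# Endpoint and credentials are placeholders.
mc = Minio(
    "minio.cluster.local:9000",
    access_key="CHANGE_ME",
    secret_key="CHANGE_ME",
    secure=False,
)
if not mc.bucket_exists("datasets"):
    mc.make_bucket("datasets")
mc.fput_object("datasets", "raw/part-0001.parquet", "/data/raw/part-0001.parquet")
```

The idea being that the db volumes stay node-local (fast, snapshot to the nas for safety), while anything in the tens-of-terabytes range lives behind the S3 API where spark/trino can read it directly.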

There are lots of Ceph / Rook users out there too; not sure how much total data or how many disks you’re managing.