SSD setup k8s node + ML

I am working on a bare-metal Kubernetes cluster that runs JupyterHub on MicroK8s (each user gets their own containerized Jupyter notebook).

I am planning for around 5 simultaneous users running ML-type workloads on CPU (mostly inference, no training). For this workload the performance difference between NVMe and SATA drives is negligible. Each user gets a PVC of around 100 GB; some need closer to 50 GB, others closer to 200 GB, depending on their requests.

I would ideally like to ensure that the users’ data is safely stored and backed up. As I understand it, RAID is no backup, and hardware RAID might even be dead these days?

So with this in mind, what would y’all suggest for a storage solution?

I was thinking of two 1 TB SSDs in RAID-1, plus a daily backup to a cloud provider, with K8s + Ubuntu running on a separate SSD.
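That plan can be sketched with a software mirror (`mdadm`) plus a daily `restic` backup to an S3-compatible bucket. This is only an outline under assumptions: the device names `/dev/sda`/`/dev/sdb`, the mount point `/srv/jupyter-data`, and the bucket name are all placeholders to adjust for your hardware.

```shell
# Create a RAID-1 mirror from the two data SSDs (device names are assumptions).
sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb
sudo mkfs.ext4 /dev/md0
sudo mkdir -p /srv/jupyter-data
sudo mount /dev/md0 /srv/jupyter-data
echo '/dev/md0 /srv/jupyter-data ext4 defaults 0 2' | sudo tee -a /etc/fstab

# Daily backup, e.g. run from a cron job or systemd timer.
export RESTIC_REPOSITORY="s3:s3.amazonaws.com/my-jupyter-backups"  # hypothetical bucket
export RESTIC_PASSWORD_FILE=/root/.restic-password
restic backup /srv/jupyter-data
restic forget --keep-daily 7 --keep-weekly 4 --prune  # example retention policy
```

The `restic forget --prune` retention numbers are just one reasonable policy; tune them to how much history you want to keep in the cloud.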

Or maybe there is something that I can do better? Or something enterprise-level that I overlooked?

Enlighten me in your ways!

A possible solution could be to use an enterprise-level storage system such as a SAN (Storage Area Network) or a NAS (Network Attached Storage) for the user data. These systems are highly available and provide excellent data redundancy, since they can be configured with multiple RAID levels, and the data can additionally be replicated between multiple SAN/NAS devices for further redundancy.

You can also take advantage of cloud storage such as Amazon S3 or Microsoft Azure Storage to back up your user data, which provides an additional layer of protection in case of a disaster. Backup software such as Bacula or Veeam can automate the backup process and make sure your backups stay up to date.

To set up an SSD on a Kubernetes node for machine learning workloads, you will first need to provision the node with an SSD. This can typically be done by attaching an SSD to the physical server or by using a cloud provider’s block storage service.

Once the SSD is available, you can use Kubernetes to create a Persistent Volume (PV) and a Persistent Volume Claim (PVC) to make the SSD available to your ML workloads.
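A statically provisioned local PV on the node's SSD mount, plus a matching PVC, could look like the following. This is a sketch, not a definitive setup: the path `/srv/jupyter-data/user1`, the storage class name `local-ssd`, the node name `node-1`, and the object names are all assumptions.

```shell
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: user1-pv
spec:
  capacity:
    storage: 100Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain   # keep data if the claim is deleted
  storageClassName: local-ssd
  local:
    path: /srv/jupyter-data/user1         # directory on the SSD mount (assumed)
  nodeAffinity:                           # required for `local` volumes
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: ["node-1"]              # your node's hostname
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: user1-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-ssd
  resources:
    requests:
      storage: 100Gi
EOF
```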

You can then use Kubernetes to deploy your ML workloads and mount the PVC to the appropriate directory on the pod, so the data is stored on the SSD.
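Mounting that claim into a pod is then a standard `volumeMounts`/`volumes` pairing. The image and mount path below are assumptions (the stock Jupyter images use `/home/jovyan/work` as the working directory), and the claim name `user1-pvc` is hypothetical.

```shell
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: ml-notebook
spec:
  containers:
  - name: notebook
    image: jupyter/scipy-notebook        # assumed notebook image
    volumeMounts:
    - name: data
      mountPath: /home/jovyan/work       # notebook working dir lands on the SSD
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: user1-pvc               # hypothetical claim name
EOF
```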

It’s important to keep in mind that you also need to make sure the workloads are properly configured to use the SSD and optimized for performance.

Additionally, it is important to monitor the SSD usage and make sure you have enough storage space to accommodate the data generated by your ML workloads.

In summary, the steps to set up an SSD on a k8s node for machine learning workloads are:

  1. Provision the node with an SSD
  2. Create PV and PVC to make the SSD available to your ML workloads
  3. Deploy your ML workloads and mount the PVC to the appropriate directory on the pod
  4. Properly configure the workloads to use the SSD
  5. Optimize the workloads for performance
  6. Monitor the SSD usage and ensure enough storage space is available.
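For the monitoring step (6), a couple of quick checks cover both the node and the claim level; the mount path here assumes the data SSD is mounted at `/srv/jupyter-data`.

```shell
# Node-level usage of the data mount (path is an assumption).
df -h /srv/jupyter-data

# Requested vs. bound capacity for every claim in the cluster.
kubectl get pvc --all-namespaces
```

For anything ongoing, a metrics stack (e.g. Prometheus node-exporter plus kubelet volume metrics) is the usual next step.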

You could have a look at Rook. It lets you run Ceph inside the k8s cluster and then use Ceph file/block/object storage for other k8s applications.
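With Rook in place, consuming Ceph from a workload is just a normal PVC against a Ceph-backed StorageClass. Assuming you have installed the Rook operator and its example RBD StorageClass (named `rook-ceph-block` in the Rook examples), a claim could look like this:

```shell
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: user1-ceph-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: rook-ceph-block     # Rook's example RBD StorageClass
  resources:
    requests:
      storage: 100Gi
EOF
```

The Ceph cluster and StorageClass themselves are set up from the Rook manifests; see the Rook documentation for that part.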

RAID 1 (mirroring) is a good choice for this use case because it provides data redundancy in the event of a single drive failure, while the daily backups to a cloud provider will provide an additional layer of protection in case of data loss or corruption.

However, using RAID alone should not be considered a backup solution, as it does not protect against data loss due to software bugs, human error, or other types of failures.

Additionally, you may want to consider using a storage class with snapshotting capabilities, such as Ceph RBD, GlusterFS, or CephFS, so you can easily take snapshots of the PVCs and roll back to a previous version in case of data loss.
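Taking a snapshot of a PVC then goes through the Kubernetes `VolumeSnapshot` API. This sketch assumes a CSI driver with snapshot support, a `VolumeSnapshotClass` named `csi-snapclass` (hypothetical), and a claim named `user1-pvc`:

```shell
kubectl apply -f - <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: user1-pvc-snap
spec:
  volumeSnapshotClassName: csi-snapclass   # hypothetical snapshot class
  source:
    persistentVolumeClaimName: user1-pvc   # the claim to snapshot
EOF
```

Restoring works the other way around: create a new PVC whose `dataSource` points at the `VolumeSnapshot`.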

One other thing to consider is automating the creation of PVs and PVCs with dynamic provisioning: define a StorageClass backed by a CSI provisioner (the older FlexVolume approach has been deprecated in favor of CSI). This makes it easy to manage your users’ storage and to hook the backup process into provisioning.

In summary, RAID 1 combined with cloud backups and a snapshotting storage class with PV provisioner is a good solution for your use case, but make sure you also have a disaster recovery plan in place for other types of failures.