Ah, I work with ML as well, albeit indirectly, and in a much larger environment (our playground/experimentation environment costs many thousands per day whether it’s working or idling). Nevertheless, I now understand your use case better.
If I had to guess, you’re probably grabbing hdf5 files, annotated photos, stills captured from videos, or some kind of binary log files, and just streaming them through VMs or containers. Sometimes you’re just transforming the input into some intermediate format and saving it back; sometimes you’d like to stream it through GPUs in batches and schedule that with k8s, or at least that’s the dream.
Here’s the deal:
moving that existing machine into a rack case doesn’t buy you anything - it doesn’t let you or anyone else do anything more than you already can. Keep it vertical or horizontal at the bottom of the rack where you have power and network; it’ll be just as good as it is today, no better, no worse, for what you need.
Your users are writing code anyway, so you can serve the “assets” / “training sets” / examples / models over plain old https or samba or whatever; your users writing tensorflow or apache beam code can figure it out. (If they can’t spin up a background thread to fill a python queue fetching assets over https in 10-15 minutes - see the sketch below - wtf, why did you hire them?)
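A minimal sketch of that, assuming a plain https file server - the URL and file names here are made up:

```python
import queue
import threading
import urllib.request

# Hypothetical asset server and file list - stand-ins for whatever you actually serve.
BASE_URL = "https://storage.lab.internal/assets/"
ASSET_NAMES = ["batch_000.h5", "batch_001.h5", "batch_002.h5"]

asset_queue = queue.Queue(maxsize=8)  # bounded, so the fetcher can't run ahead forever

def fetch_assets():
    """Background thread: pull assets over https and feed them to the consumer."""
    for name in ASSET_NAMES:
        with urllib.request.urlopen(BASE_URL + name) as resp:
            asset_queue.put((name, resp.read()))  # blocks while the queue is full
    asset_queue.put(None)  # sentinel: no more assets

threading.Thread(target=fetch_assets, daemon=True).start()

# Consumer side: the training loop just drains the queue.
while (item := asset_queue.get()) is not None:
    name, payload = item
    print(f"got {name}: {len(payload)} bytes")  # hand off to your input pipeline here
```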
So, don’t move the machine to a new case, and don’t touch the machine if it’s still working. Get a new machine with 24 bays (because you’d cry otherwise), used only for storage and lightweight storage serving (which at 40gbps might not be as lightweight as you want it to be). Populate 8 of those bays with 16T drives (e.g. seagate exos, they’re cheap), run raidz2 in a single vdev, rsync over the stuff, and back up your current assets (cloud restores are expensive) - and you’re done.
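Back-of-the-envelope on what that first vdev buys you (my arithmetic, not a ZFS quote - raidz2 spends 2 of the 8 drives on parity, and you’ll want to keep some headroom free):

```python
# Rough usable capacity of one 8-wide raidz2 vdev of 16 TB drives.
drives, parity, size_tb = 8, 2, 16
data_tb = (drives - parity) * size_tb   # 96 TB of data capacity before overhead
data_tib = data_tb * 1e12 / 2**40       # ~87 TiB, as the OS will report it
practical_tib = data_tib * 0.8          # keep ~20% free so the pool stays fast
print(f"{data_tb} TB data, ~{data_tib:.0f} TiB, ~{practical_tib:.0f} TiB practical")
```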
Then, later, add another raidz2 vdev just by populating the machine with another 8 16T disks (or maybe 20T/30T, whatever the norm is 6mo/1y from now).
Then get another server, and recycle/reuse your current 2990wx/2970wx machine for something less intensive.
This would be step 1 on the ladder, just to have storage around.
Step 2 would be “actual processing”.
I guess you probably want 1 VM per host there to act as your k8s node, sized slightly smaller than the host itself - this is just so you have something that can run a kubernetes node.
This is hard - your storage becomes remote storage, you have k8s volume plugins to deal with, everything becomes slow and complicated, and your lab environment becomes complicated too. This is all very much … ouch! If you have 10 people doing ML in VMs, you need to hire someone to help you figure this out and maintain it together - and that means you’ll be filling the rack quickly.
In the interim you can control how your containers are scheduled and just give them local filesystem access (something like the sketch below); eventually you’ll figure the rest out, and then you’re on step 2.
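A sketch of that interim trick, assuming the official kubernetes python client and made-up node/path/image names - pin the pod to the box that physically has the data and mount the local filesystem straight in:

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

# Hypothetical names throughout: adjust the node label, dataset path, and image.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="train-job"),
    spec=client.V1PodSpec(
        # Pin to the host that has the data, so "remote" storage stays local.
        node_selector={"kubernetes.io/hostname": "storage-box-1"},
        restart_policy="Never",
        volumes=[client.V1Volume(
            name="datasets",
            host_path=client.V1HostPathVolumeSource(path="/tank/datasets"),
        )],
        containers=[client.V1Container(
            name="trainer",
            image="registry.lab.internal/trainer:latest",
            volume_mounts=[client.V1VolumeMount(
                name="datasets", mount_path="/data", read_only=True,
            )],
        )],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```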
The other kind of server you’ll want is a GPU + RAM machine. That’s your step 2.5 … and this is where it becomes even more painful and expensive. Your yaml could just require 2 GPUs and k8s would find one of your GPU-filled machines to schedule onto, with all storage going through CSI. I’d imagine you’d end up with 2-5 racks, 2-5 people to maintain them, and 50 devs + 50 non-technical support staff (baristas, cleaners, office managers, … - not all 50 devs would be doing ML, and not all workloads would be ML).
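The scheduling part really is that small. Same sketch via the kubernetes python client instead of raw yaml (again, the image and PVC names are made up, and nvidia.com/gpu assumes the NVIDIA device plugin is running on your GPU nodes):

```python
from kubernetes import client, config

config.load_kube_config()

# A pod that asks for 2 GPUs and mounts its data through a CSI-backed PVC.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-train-job"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        volumes=[client.V1Volume(
            name="datasets",
            persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                claim_name="datasets-pvc",  # hypothetical claim served by your CSI driver
            ),
        )],
        containers=[client.V1Container(
            name="trainer",
            image="registry.lab.internal/trainer:latest",  # hypothetical image
            resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "2"}),
            volume_mounts=[client.V1VolumeMount(name="datasets", mount_path="/data")],
        )],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```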