VM Lab Build - Advice / thoughts wanted

Hello,

I’ve been asked to design and set up a VM lab for my team at work. We provide / maintain medical IT solutions in the UK.

This includes: Diagnostic image rendering, DICOM (CT scans etc) study storage, and database systems, such as reception / job tracking systems.

Much of this involves a number of our products being deployed together in a hospital, and they all ‘tie in’ to each other in one way or another.

The brief for this project is to provide three people with enough VM provision to create and store test systems for each type of product we work with, and to run a number of these concurrently. This will be used for practice / training on these systems, in addition to testing new ideas etc.

I have worked this out to require:

• Ability to Store and run +/- 50 VMs. At least ⅔ (33) of which are expected to run at any one time.

• Minimum of 10TB, single disk failure tolerant storage for VMs (will be backed up elsewhere)

• This storage solution needs to be quick enough to serve +/- 40 VMs without undue I/O latency. (I plan to use Proxmox with a ZFS file system.)

• High thread count CPU(s): a powerful CPU with as many threads as is reasonably possible. I believe this workload is expected to scale / perform better with additional thread count than with raw computational speed?

• A Minimum of 400GB Memory (RAM).
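
A rough way to sanity-check the requirements above is to divide the stated minimums across the VMs. This is only a minimal Python sketch: the even split is an assumption (real VMs will vary a lot in size), and the names are made up for illustration.

```python
import math

# Figures taken from the requirements list in the post
TOTAL_VMS = 50           # "+/- 50 VMs" stored
RUNNING_FRACTION = 2 / 3  # at least two thirds running at once
MIN_RAM_GB = 400
MIN_STORAGE_TB = 10


def per_vm_share(total, vm_count):
    """Evenly divide a resource across vm_count VMs (an assumption)."""
    return total / vm_count


running_vms = math.floor(TOTAL_VMS * RUNNING_FRACTION)
ram_per_running_vm = per_vm_share(MIN_RAM_GB, running_vms)
# Decimal TB to GB, which is how drive vendors quote capacity
storage_per_vm_gb = per_vm_share(MIN_STORAGE_TB * 1000, TOTAL_VMS)

print(f"{running_vms} running VMs, "
      f"{ram_per_running_vm:.1f} GB RAM each, "
      f"{storage_per_vm_gb:.0f} GB disk per stored VM")
```

If the heavier VMs need much more than the average share, the minimums would need raising accordingly.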

After coming up with a few ideas, I have put together the following proposal for work, using two servers and attempting to allow for extra future overhead where possible.

I would really appreciate any input on this - if I’ve got it all wrong, tell me, please!

The approximate cost of this (ex VAT) will be £32,000, which goes just over the budget I was given. Have I spent this well?

If there are better ways to get this project done, I’m really keen to find out!

The reasoning for the low-power GPUs is that, for some test systems, I think it will be helpful to be able to pass through a real GPU in some scenarios. The real-world rendering systems mostly use 3x Quadro RTX 4000s; however, this is split across a number of users, which we won’t need in the test lab.

I added one RTX 4000 16GB card as a way of playing it safe, should we ever need to run a render VM with the correct ‘required’ GPU for the application. I also wanted to try splitting it up using vGPUs or SR-IOV… (not sure if this’ll work?)

As mentioned above, I plan on using Proxmox for this, joining both servers together into a Proxmox ‘datacentre’.

The points I’m most worried about are:

  • If the CPU in each server will be sufficient to run up to 25 VMs, some of which take a fair bit of ‘oomph’. I estimated that 260 vCPUs were likely to be allocated between all the VMs, for three of us each with our own environment. The two AMD EPYC 7713P CPUs would give a total of 256 threads. I believe this would overallocate without much issue, sharing threads between VMs?

  • If the planned SSD storage (4x Micron 5300 PRO 3840GB SSDs in ZFS RAID-Z1, in two lots of two per server) will be able to keep up with a respectable I/O delay in Proxmox? (Storage isn’t really my thing, so advice is especially appreciated here!)
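
Both worries can be put into numbers with a quick sketch. This is a minimal Python example under stated assumptions: it treats the four SSDs as a single RAID-Z1 vdev (the "two lots of two per server" layout is not fully clear from the post), and it ignores ZFS metadata, padding and free-space headroom, so real usable capacity will be lower. The function names are hypothetical.

```python
def overcommit_ratio(vcpus_allocated, host_threads):
    """vCPUs handed out per hardware thread; above 1.0 means overcommitted."""
    return vcpus_allocated / host_threads


def raidz1_usable_tb(disk_count, disk_tb):
    """Approximate RAID-Z1 capacity: one disk's worth of space goes to parity."""
    if disk_count < 2:
        raise ValueError("need at least one data disk plus parity")
    return (disk_count - 1) * disk_tb


# 260 vCPUs across two EPYC 7713P hosts (128 threads each, per the post)
ratio = overcommit_ratio(260, 2 * 128)

# Assumption: all four 3.84 TB SSDs in one RAID-Z1 vdev
usable_tb = raidz1_usable_tb(4, 3.84)

print(f"vCPU overcommit: {ratio:.2f}:1")
print(f"RAID-Z1 usable (before ZFS overhead): {usable_tb:.2f} TB")
```

Capacity is only half the storage question; whether the pool keeps I/O delay low depends on the workload's IOPS, which this sketch does not model.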

Sorry for the long post - thanks in advance for your advice!

Looks like your institution/company has money to burn so you did well. Always go just a bit over the budget allowed to prove to the C-Tards that they are always wrong and they are to blame because you can’t do your job well.

With that said… yeah, there might be better ways to cover your needs for half the price (at best two-thirds) but I don’t think you’re in the kind of company that would trade jank for 3-5k pounds of savings, so hit it! Btw, your RAM looks awfully low for so many VMs, are you sure you don’t need at least double that amount? I do not understand what exactly the VMs will do, but two gigs of RAM per VM in the worst-case scenario seems very low. Maybe I am missing something?

Edit… yeah, it’s 20 gigs not 2 gigs, I focused on threads not the actual VMs… my bad.

It’s been an odd one, slightly awkward to be fair.

When given the £30,000 budget, I stressed that it was absolutely overkill; however, my manager came back saying he wanted to get something really good and not compromise whilst the company was in a good place cash-wise.

That said, it will be expected to last a very long time I suspect.

Given the fact I had the budget I’ve found it even more stressful, knowing I’ve got to make the absolute best of it I can.

Yeah, I allocated 512GB memory. I changed up the plan a little today, switching to a single 4U chassis with otherwise identical specs, just combined into one. This gave some budget to add in a spinning rust backup array.

I do wonder if I should really dig in about the over spending though…

Appreciate the honest feedback

I’ll play devil’s advocate here - have you considered running this in the cloud instead?

Reasoning:

  • pay for what you use, rather than buying for what you might use
  • monthly OPEX cost rather than up-front CAPEX - project cancelled/no longer required/etc. - just turn it off and stop paying the bills for it.
  • being a LAB, you can shut it down overnight to save rental costs further
  • upgrading storage IO throughput is relatively trivial - just add/mount appropriate blob storage rather than buying a SAN, upgrading network/FC switches, etc. Want 100k IOPS for this VM - for a week? Done!
  • pretty sure most vendors now offer cloud GPU compute
  • VM hosting costs can be easily tagged to a user/project/department and accounted for
  • as you’re getting committed resources, you won’t eventually end up with complaints from user A that they aren’t getting enough resources but no one wants to pay to upgrade the whole cluster.
  • your building burns down - you still have your lab in the cloud.
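
The pay-for-what-you-use and shut-it-down-overnight points can be turned into a rough break-even estimate against the £32,000 hardware quote. This is only a sketch: the hourly rate below is a made-up placeholder, not a real price from any provider, and it ignores cloud storage and egress charges as well as on-prem power, space and admin time.

```python
def monthly_cloud_cost(vm_count, hourly_rate, hours_per_day, days_per_month):
    """Compute-only rental cost for a month, billed per VM-hour."""
    return vm_count * hourly_rate * hours_per_day * days_per_month


def breakeven_months(capex, monthly_cost):
    """Months of cloud rental that add up to the up-front hardware spend."""
    return capex / monthly_cost


CAPEX_GBP = 32_000            # hardware quote from the original post
RUNNING_VMS = 33              # concurrent VMs from the original post
HYPOTHETICAL_RATE_GBP = 0.20  # placeholder GBP per VM-hour, NOT a real quote

# Lab left running around the clock
always_on = monthly_cloud_cost(RUNNING_VMS, HYPOTHETICAL_RATE_GBP, 24, 30)
# Lab only powered on 10 hours a day on 22 working days
office_hours = monthly_cloud_cost(RUNNING_VMS, HYPOTHETICAL_RATE_GBP, 10, 22)

for label, cost in (("always on", always_on), ("office hours", office_hours)):
    months = breakeven_months(CAPEX_GBP, cost)
    print(f"{label}: about GBP {cost:.0f}/month, break-even in {months:.1f} months")
```

Plugging in a real quote for the VM sizes actually needed would show how much the overnight shutdown shifts the comparison.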

If your company policy precludes using cloud services - fair enough. And I understand the costing from Microsoft et al can be a nightmare to navigate.

But the above points are huge drivers that make me want to move my on-prem stuff to the cloud. No more “oh I can’t give you that because things are tight”, it’s just “this will cost you $X/month - you can have whatever you’re willing to pay the running costs for”. IT provisioning becomes far less of an impediment. It’s just “Give me a project cost code”, click, click, here’s your bill.

You can scale up way easier than having to go through the capex process, waiting for hardware to arrive, etc. Getting hardware right now is a nightmare - wait times are huge. Also, you do not need to play storage performance tweaker, VM guest vs. host performance diagnostic expert (on Prem - greedy VM on host may impact other VM, in the cloud you can/do get committed resources with no contention), etc.

Also, do not underestimate the ability to tag/bill consumption. As above, when things get scarce and no one wants to pay for more of it, everybody loses. Add ons for vSphere, etc. to do consumption based billing are expensive.

Shiny new on-prem hardware is fine when it’s new, shiny and under-utilised. It becomes more of a pain 3-5 years down the track when it’s stretched and no one wants to pay to replace it but everyone is entirely willing to complain vociferously about it.

On-prem lab stuff is great for playing with, but for business use it’s a hard sell in 2022 IMHO.

Honestly, every single point you list makes complete sense to me.

We have some more recent customer projects that are cloud hosted. This is the way it’s all going… Otherwise, however, we currently install physical bare-metal boxes on almost every site.

My manager and colleagues are not keen on this cloud hosted solution, which doesn’t help with the timing of this project in that respect.

Frankly, and I know it seems arrogant to talk down people above my station, my manager has good knowledge of the main Windows products we have. Linux / server hardware, which is a large part of what we do now… no interest or intent to learn. There are many occasions where I feel that horrendous decisions are made, simply due to lack of knowledge and interest. The company is reasonably new in the IT sector of our field and it shows, with a limited understanding elsewhere in the organisation of what we really do.

Personally, the one thing that is sure to get me out of bed in the morning is playing with server / enterprise hardware, and having such ‘cool’ hardware to play around with would make my year. That said, there are very few benefits from a business point of view, as you say.
My own ideal vision of the future is that we begin co-locating our own VM hardware and supplying SaaS to customers. However, with the size of other cloud providers, I’m not sure how realistic that is.

Sorry I went off on a bit of a rant there.

Reading your really well put together post has given me a push to go back and put this idea forward as an alternative to the hardware approach.

I suppose this way, if there are issues, I can fall back on ‘I did recommend against this’.

Thanks, really appreciate the thoughtful input

Privacy maybe? Data protection paranoia? One of the companies I did IT for had privacy policies in place that will never allow them to have any cloud solution. Ever! You should read my contract with them, 30% is about top-secret level b.s. And they are basically a wholesaler/service business. But they are also completely right, they do not want any liability as a company if there is any data leak, they will exactly know where the leak came from.

The most important server does not have access to the internet, you’d have to physically go to a terminal computer, get a JSON (because I said so) dump of what you are going to work on onto a flash drive, do stuff with the data, then import a JSON back to the database with server side checks and processing. For some reason it barely slowed down the processes but the software solution was complicated as fuck. Several audits over the years, never a single problem with accounting, invoicing or any other aspect of the business. From what I can tell they are doing just fine even today doing that.

Given the requirements and the budget, I would try and squeeze in shared storage and more RAM as opposed to maxing out cores with the biggest processor you can get …
You will be stuck with two very powerful single-node servers with local storage, so growing out of them both in terms of capacity and flexibility will be a pain.
Given the 32K budget my breakdown would be:

  • 2-3K for the GPUS you listed
  • 9K for each server - try to get at least 512GB of RAM in each one, get whatever proc fits the budget, and you want multiple 10Gbit links, one pair for client traffic, one pair for storage
  • the rest of the budget for the storage, something like a Dell M3280i, an expandable 24x2.5" SSD/HD enclosure with 10Gbit ethernet connectivity that can present iSCSI LUNs to your Proxmox cluster, filled with 12x4TB SSDs to give you 36TB RAID6 and one hot spare, and room to grow to double that without buying another chassis
  • 10Gbit capable switch to handle all of this, luckily 10Gbit capable switches have gone down in price and you can get a 16-port one for 500 pounds nowadays …
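
The breakdown above can be checked with a few lines of arithmetic. A minimal Python sketch using only the figures in the list; the function names are just for illustration, and the RAID6 figure is raw usable space before any filesystem or controller overhead.

```python
TOTAL_BUDGET_GBP = 32_000


def storage_budget(gpu_cost, server_cost_each=9_000, servers=2, switch_cost=500):
    """What is left for shared storage after GPUs, servers and the switch."""
    return TOTAL_BUDGET_GBP - gpu_cost - server_cost_each * servers - switch_cost


def raid6_usable_tb(total_disks, disk_tb, hot_spares=0):
    """RAID6 keeps two disks' worth of parity; hot spares hold no data."""
    members = total_disks - hot_spares
    if members < 4:
        raise ValueError("RAID6 needs at least 4 member disks")
    return (members - 2) * disk_tb


# GPUs were estimated at 2-3K, so the storage budget is a range
print("storage budget:", storage_budget(3_000), "to", storage_budget(2_000), "GBP")

# 12 x 4TB SSDs, one kept as hot spare
print("usable:", raid6_usable_tb(12, 4, hot_spares=1), "TB")  # usable: 36 TB
```

So the 36TB figure corresponds to 11 array members with two disks of parity, with the twelfth drive as the spare.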

Honestly, for 50 test VMs 128 cores looks like a lot to me. In test environments our ratio of virtual to physical is 10:1 and we still have plenty of CPU power to spare, so assuming each of your 60 VMs gets an average of 4 cores, that makes it 240 vCPUs, or 120 for each server. For a test env we would go with a single-socket machine with 20 cores/40 threads and we would still consider that to be less than 50% capacity. If it was production, we would probably double that to bring the ratio to 5:1 …
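
To make the ratio argument concrete, here is a minimal Python sketch. It assumes the virtual-to-physical ratio is counted against hardware threads rather than cores, which the post does not actually specify, and the function names are made up for illustration.

```python
import math


def host_threads_needed(vcpus, vcpu_per_thread):
    """Hardware threads required to host vcpus at a given overcommit ratio."""
    return math.ceil(vcpus / vcpu_per_thread)


def capacity_used(vcpus, host_threads, vcpu_per_thread):
    """Fraction of a host's vCPU capacity consumed at the given ratio."""
    return vcpus / (host_threads * vcpu_per_thread)


TOTAL_VCPUS = 240       # figure from the reply above
VCPUS_PER_SERVER = 120  # split across two servers

print("threads needed at 10:1 (test):", host_threads_needed(TOTAL_VCPUS, 10))
print("threads needed at 5:1 (prod):", host_threads_needed(TOTAL_VCPUS, 5))

# One 20-core/40-thread host carrying 120 vCPUs at a 10:1 ratio
used = capacity_used(VCPUS_PER_SERVER, 40, 10)
print(f"40-thread host at 10:1: {used:.0%} of capacity")
```

Counted against the 20 physical cores instead of the 40 threads, the same host would come out at twice that utilisation, so the assumption matters.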

But it is your setup, so YMMV, there may be the need of a ‘bling’ factor where your boss wants to have the biggest processor out there and/or brag about how powerful your test system is … it just looks unbalanced to me …

Yup, privacy is a valid reason (less so for a lab though, which shouldn’t be using production data) and a large reason why we have been slow to adopt cloud where I am.

But the business has come to see it as inevitable, especially as people want to work from anywhere, which requires exposing your on-prem stuff to the internet anyway. And I’m sure Azure’s DDoS mitigation, for example, is way better than anything a small enterprise can afford.

And besides, if you don’t trust say, Microsoft’s cloud offering, then you probably shouldn’t be running their OS either, but most businesses are.

edit:
one more plus for lab in the cloud - given that’s the way things are going eventually, having a lab up there will enable you to get familiar with it in advance of a future production workload shift.