Building an office rack, server hardware advice needed

I'm moving away from consumer PC cases to rack-mountable hardware.

I currently have a TR 2990WX with an ASUS motherboard and 10 disks. I want to move the storage to its own 1U server plus a disk shelf, and then move the TR 2990WX into another 1U/2U case and use it as a VM server. I would also be interested in adding a second VM server.

I have no clue about any of this. I've been reading and checking eBay a lot, and I feel like I don't understand enough to start buying.

I'm guessing that for storage I could go with a low-end server, CPU/RAM-wise, but the part I'm most lost about is the disk shelf + 1U server. I understand disk shelves use a backplane that connects through a SAS interface, but I don't know which ones I'll end up needing.

For the Threadripper consumer hardware, I guess there are rack cases that could fit it?

Also, what is a good general-purpose VM server I could get? EPYC seems kind of expensive.

I'm looking to build all of this as cheaply as possible. I don't need any bleeding-edge server hardware.

I would also like the machines to have 10Gb networking.

Any guidance would be greatly appreciated :smiley:

When you say office rack: how many people are working in this office, and what is the cost of, say, a day of storage downtime?

I'm asking in order to get my bearings straight - the right solution should be needs-driven (e.g. perhaps you're doing something stupid like hosting 10 people's home dirs on a single machine when a day of downtime loses you customers).

For 10 disks - I wouldn’t bother with a disk shelf. Instead just get a 19" 2u or 4u case that fits the disks and the rest of your system; quick example without too much searching: https://geizhals.eu/?cat=geh19&xf=7647_10

When it comes to having a second VM server: you won't be able to bring down the disk machine hosting the VM images for upgrades without bringing both down - it's probably better to build another disk+RAM machine for VMs. Eventually, you'll either have enough storage machines overall to make it worthwhile to disaggregate storage from compute using something like ceph rbd as the backing store for VMs - or your business/office/work needs won't grow as fast as compute/storage densities and you'll go back to two disk+RAM machines (primary and failover or some such thing).
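If you ever do reach that ceph rbd stage, here's a minimal sketch of carving out a VM image with the Python rbd bindings - the "vms" pool, image name and size are made-up placeholders, and it assumes a working cluster with python3-rados / python3-rbd installed:

```python
# Minimal sketch: carve out an RBD image to back a VM disk.
# Assumes a working Ceph cluster plus the python3-rados / python3-rbd bindings;
# the "vms" pool, image name and 100 GiB size are placeholders.
import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("vms")                     # pre-created pool for VM disks
    rbd.RBD().create(ioctx, "dev-vm-01", 100 * 1024**3)   # 100 GiB image
    ioctx.close()
finally:
    cluster.shutdown()
```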

EDIT: btw, ASRock Rack makes AM4 and X399 boards with IPMI and 10G NICs - once you move machines into a rack and out of sight in some closet somewhere, reboots for upgrades, OS reinstalls, and diagnostics that require keyboard/video are much easier if you have IPMI or similar.
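To give a feel for what that buys you, a rough sketch of poking a box over IPMI from a script - host, user and password are placeholders, and it assumes ipmitool is installed:

```python
# Rough sketch of what IPMI buys you once the box lives in a closet:
# power status / power cycle over the network instead of walking over with a monitor.
# Host, user and password are placeholders; assumes ipmitool is installed.
import subprocess

def ipmi(host: str, *args: str) -> str:
    cmd = ["ipmitool", "-I", "lanplus", "-H", host,
           "-U", "admin", "-P", "changeme", *args]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

print(ipmi("10.0.0.50", "chassis", "power", "status"))    # e.g. "Chassis Power is on"
# ipmi("10.0.0.50", "chassis", "power", "cycle")          # hard reboot a hung box
```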


The number of people will vary, but right now there are 10, and depending on projects that can temporarily grow by 10 more.
The storage is mostly shares for assets and file sharing between workstations; there is nothing mission-critical on them. People can still work off their local machines (no one works directly off the servers).

As for the disk shelf: I currently have 10 disks because I can't fit any more in the current case, but I would like to have plenty of room to grow in the future.

As of now I don't need any real HA for this, so having to bring everything down for a while (2-3 hours) is not a big deal, nor do I think it will be in the long term.

cool.

So the use case is basically a big USB flash drive that happens to be network connected, and you don't care if ZFS or whatever goes nuts, messes up the RAID metadata, and the array becomes unusable - within 2-3 hours you could reinstall everything needed from scratch.

And whatever VMs you run are tiny (e.g. a UniFi controller or Pi-hole or something else) and are backed up elsewhere.

In that case I still don't think you need a disk shelf. Let's say you get a switch with one or two 40Gbit ports and 10Gbit ports for the rest. You can get a 24-bay 3.5" case for $450 - if you use 16T drives, which are kind of the data hoarder norm these days (not the cheapest per byte, but a good tradeoff between cost per byte and density), that gives you around 200T of usable space with three 8-disk vdevs in raidz2. It'll take roughly 15 hours to move that data in or out if everything goes fine - assuming you can actually sustain 4GB/s in and out of the machine, and not counting troubleshooting time, "oh no, what have I done" time, "passing blame onto others" time, and so on. My point is, I don't know many people who'd feel confident about 15-24h restores - not outside of hyperscale datacenters, where we make jokes about sending RPCs that make drivers move trucks full of tapes.
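Rough math behind those numbers, if you want to play with the assumptions (the 0.7 overhead/headroom factor is just my own fudge, not a ZFS constant):

```python
# Back-of-the-envelope numbers behind the paragraph above (decimal TB/GB,
# ignoring compression; the 0.7 factor for ZFS overhead + free-space headroom
# is my own fudge, not a ZFS constant).
drive_tb        = 16     # 16 TB drives
vdevs           = 3      # three raidz2 vdevs
disks_per_vdev  = 8
parity_per_vdev = 2      # raidz2

raw_tb    = vdevs * (disks_per_vdev - parity_per_vdev) * drive_tb   # 288 TB raw
usable_tb = raw_tb * 0.7                                            # ~200 TB in practice

rate_gb_s = 4            # the optimistic 4 GB/s in/out assumption
hours     = usable_tb * 1000 / rate_gb_s / 3600

print(f"~{usable_tb:.0f} TB usable, ~{hours:.0f} h to move it all at {rate_gb_s} GB/s")
```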

Perhaps you should keep your existing 2990wx for now, and just add a second disk/ram box before you get rid of that one.

Here's a relatively cheap (~$500) 24x SFP+ / dual QSFP+ switch: https://mikrotik.com/product/crs326_24s_2q_rm

You can also get cheaper second-hand 10G or 10G/40G switches on eBay. For server use you can usually bond multiple 10G ports to get more bandwidth; in fact, you can typically break out a 40G QSFP+ port into 4x SFP+ and either serve four different hosts or bond the links - it depends on what you want to have plugged in.
If your 10G hosts are on copper, you should probably look at other switch options (or use the same switch as a core and the dual 40G ports as an uplink to an RJ45 access switch).


Thanks, very useful info.

For clarification: I do a lot of machine learning, so the storage is mostly for datasets (most of them are multiple GB). Right now I have about 30TB of storage across those 10 disks, but at this rate I may need a lot more in the next 2 years. These datasets are backed up in the cloud, but it's a lot cheaper to work with them locally.

The VMs are not small; they are complicated systems working with the assets and custom software. Some of them can get resource-intensive, and some of the VMs form a Kubernetes cluster for development and testing.

I don't want to get rid of the 2990, I just want to move it to a rack-mountable case.

Thanks a lot for all of this info; it's pretty clear I still need to do a lot more research.

Ah, I work with ML as well, albeit indirectly, and in a much larger environment (our playground/experimentation environment costs many thousands per day whether working or idling). Nevertheless, I now understand your use case better.

If I can guess: you're probably grabbing HDF5 files, annotated photos, captured stills from videos, or some kind of binary log files, and just streaming them through VMs or containers. Sometimes you're just transforming the input into some intermediate format and saving it back. Sometimes you'd like to stream this through GPUs in batches and schedule it with k8s - or at least that's the dream.
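If that guess is right, the access pattern is basically the one below (file path and dataset name invented for illustration; assumes h5py):

```python
# Hedged sketch of the "stream an hdf5 file through the pipeline in batches"
# pattern; the path and dataset name are invented, assumes h5py is installed.
import h5py

def iter_batches(path: str, dataset: str, batch_size: int = 256):
    with h5py.File(path, "r") as f:
        dset = f[dataset]
        for start in range(0, dset.shape[0], batch_size):
            yield dset[start:start + batch_size]   # only this slice is read from disk

for batch in iter_batches("/mnt/datasets/train.h5", "images"):
    pass  # hand the batch to preprocessing / the GPU here
```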

Here’s the deal:
Moving that existing machine into a rack case doesn't buy you anything - it doesn't enable you or others to do anything more than you can already do. You can keep it vertical or horizontal at the bottom of the rack where you have power and network; it'll be just as good as it is today, no better and no worse for what you need.

Your users are writing code anyway; you can serve the "assets" / "training sets" / examples / models over plain old HTTPS or Samba or whatever, and your users writing TensorFlow or Apache Beam code can figure it out (if they can't spin up a background thread to fill a Python queue fetching assets over HTTPS in 10-15 minutes, wtf, why did you hire them?).
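Something like this is all I mean by that - a hedged sketch, with a made-up internal host:

```python
# The "background thread filling a queue with assets fetched over https" idea,
# sketched out. The URL host is made up; assumes the requests library.
import queue
import threading
import requests

def prefetch(urls, q: queue.Queue):
    for url in urls:
        q.put(requests.get(url, timeout=60).content)   # blocks while the queue is full
    q.put(None)                                        # sentinel: no more assets

urls = [f"https://assets.internal/datasets/{i:06d}.bin" for i in range(1000)]
q = queue.Queue(maxsize=8)                             # bounded so RAM doesn't blow up
threading.Thread(target=prefetch, args=(urls, q), daemon=True).start()

while (asset := q.get()) is not None:
    pass  # feed the asset bytes into the training / preprocessing loop here
```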

So, don't move the machine to a new case; don't touch the machine if it's still working. Get a new machine with 24 bays only (because you'd cry), and use it only for storage and lightweight storage serving (which at 40Gbps might not be as lightweight as you want it to be). Populate 8 of those bays with 16T drives (e.g. Seagate Exos, maybe - they're cheap), run raidz2 in a single vdev, rsync the stuff over, and back up your current assets (cloud restores are expensive) - and you're done.
Then, install another raidz2 vdev later just by populating the machine with another 8x 16T disks (or maybe 20T/30T, whatever the norm is 6 months / 1 year from now).
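That grow-by-whole-vdev flow, sketched as a throwaway script (the pool name and /dev/disk/by-id paths are placeholders - double-check your device ids before running anything like this):

```python
# The "one raidz2 vdev now, a second identical vdev later" flow from above,
# as zpool commands wrapped in a throwaway script. Pool name and the
# /dev/disk/by-id paths are placeholders - verify device ids before running.
import subprocess

disks_now   = [f"/dev/disk/by-id/ata-EXOS_16T_{i:02d}" for i in range(8)]     # first 8 bays
disks_later = [f"/dev/disk/by-id/ata-EXOS_16T_{i:02d}" for i in range(8, 16)]

# Step 1: create the pool with a single 8-disk raidz2 vdev.
subprocess.run(["zpool", "create", "tank", "raidz2", *disks_now], check=True)

# Later: grow the pool by adding a second 8-disk raidz2 vdev
# (uncomment once the new drives are installed).
# subprocess.run(["zpool", "add", "tank", "raidz2", *disks_later], check=True)
```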
Then get another server, and then recycle/reuse your current 2990WX / 2970WX machine for something else less intensive.

This would be step 1 on the ladder, just to have storage around.


Step 2 would be “actual processing”.

I guess you probably want one VM per host there to act as your k8s node - sized slightly smaller than the host, just so you have something that can run a Kubernetes node.

This is hard - your storage becomes remote storage, you have k8s volume plugins to deal with, everything becomes slow and complicated, and your lab environment becomes complicated. This is all very much... ouch. If you have 10 people doing ML and VMs, you need to hire someone to help you figure this out and maintain it together - and that means you'll be filling the rack quickly.

In the interim you could control how your containers are scheduled and just give them local filesystem access (see the sketch below); eventually you'll figure the rest out and then you're on step 2.
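For that interim approach, a sketch using the official kubernetes Python client - the node label, image and mount path are all made up:

```python
# One way to do the interim "pin the pod to the storage box and mount the
# local filesystem" approach, using the official kubernetes Python client.
# Node label, image and mount path are placeholders.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="dataset-job"),
    spec=client.V1PodSpec(
        node_selector={"storage": "local"},            # schedule onto the storage node
        containers=[client.V1Container(
            name="job",
            image="registry.internal/ml-job:latest",   # made-up image
            volume_mounts=[client.V1VolumeMount(name="datasets", mount_path="/data")],
        )],
        volumes=[client.V1Volume(
            name="datasets",
            host_path=client.V1HostPathVolumeSource(path="/tank/datasets"),
        )],
        restart_policy="Never",
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```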


The other kind of server for you will be a GPU + RAM machine. That's your step 2.5... and this is where it becomes even more painful and expensive. Your YAML could just require 2 GPUs and k8s will find one of your GPU-filled machines to schedule on, with all storage through CSI. By that point I'd imagine you'd have 2-5 racks, 2-5 people to maintain them, and 50 devs + 50 non-technical support staff (baristas, cleaners, office managers, ...); not all 50 devs would be doing ML, and not all workloads would be ML.
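The "just require 2 GPUs in the spec" bit looks roughly like this with the same Python client as above, assuming the NVIDIA device plugin is deployed on the GPU nodes (image name made up):

```python
# Requesting GPUs via the extended resource exposed by the NVIDIA device plugin.
# Assumes the device plugin is running on the GPU nodes; the image is made up.
from kubernetes import client

gpu_container = client.V1Container(
    name="train",
    image="registry.internal/ml-train:latest",
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "2"},   # scheduler picks a node with 2 free GPUs
    ),
)
```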

I’m confused by a couple of points in your post. Can you clarify:

  1. is this a commercial business setup? If so I’ll stop here and just say seek professional advice. No amount of amateur enthusiast recommendations will replace having a professional business needs assessment. Tiering, performance and backups have a million combinations and frankly it is a business expense you should consider to sustain your business.

  2. if this is not a commercial setup, then what are your data needs / loads meant to be? You refer to varying user loads, mixed use cases, data sets (structured or unstructured?), and a mixture of availability needs.

I’d suggest breaking the problem down into 3 steps, the last of which should be hardware choice.

Step 1 - what data loads and how many unique use cases (to determine how many instances you need).

Step 2 - performance and availability (i.e. can it run overnight and tolerate faults, or must it run all day with no room for data loss?).

Step 3 - what hardware will deliver 1 and 2 within your budget, and if your budget doesn’t stretch then can you build part of it now and scale later.

  1. This is commercial, though I lack a bit of knowledge on the actual hardware side. I've been a DevOps engineer for the past 5 years and a sysadmin for a few years before that. I'm 100% confident I can set everything up, as I have done for the successful startups I've worked for in the past; even though I don't have a lot of hardware experience, I'm sure I can learn.

  2. see above.

Anything we do that is mission-critical, requires HA deployments, or is going to be production-worthy is not going to run on-prem; we do all of that in the cloud. What I'm trying to set up here is purely for development, to reduce cloud costs.

  1. A handful. I don't expect to have more than 20 users, and I doubt there will be more than 5 working simultaneously.

  2. This needs to be as performant as possible for local development; I'd prefer things to be faster to reduce idle time.

  3. That said, I don't need the fastest thing on earth. I won't be buying Tesla or Titan GPUs to get a few extra percent of speed; if people need to go grab a cup of coffee while they train something, so be it. Most heavy training will happen on the server or overnight. If I need to go with a mix of server and commodity hardware, that is fine; there are no brownie points gained by having a production-grade datacenter at the office.

I don't need to have a copy of the petabytes of datasets we have in the cloud (we only use our own datasets for testing, and right now that's about 20TB and growing), hence why I thought about a disk shelf to easily add hard drives as needed.

I would like to have a couple of VM servers to run a few services: a small k8s cluster where we test a few of our internal tools and deployments (this is small and does not run any heavy production workloads; we do load testing in a CI/CD pipeline with a throwaway cluster).

These VMs will be running a few utilities like Bitwarden, BIND, tooling like Bazel and GitLab, a few of our in-house tools for ML, a VPN, databases, Elasticsearch, Prometheus, Grafana, etc.

None of this stuff is mission-critical; if they crash we burn them and redeploy, and we can live with them being down for a few hours.

We back up all our data to the cloud.


The parts where I'm really unsure are things like: if I get a disk shelf, what should I look for in the backplane, and what kind of add-on card do I need in the server to connect the shelf?

What considerations on case sizes should I keep in mind? Can I buy a 2U rack case that fits an E-ATX X399 motherboard, to host the current tower I'm using, so I can put it in a rack inside a closed room in the new office we are building? Small Size Server room

Those are most of the things I'm really lost about, as it's been a long, long time since I did anything related to server hardware.

OK, as this is commercial I won't give you specific advice, and I suggest you seek professional (paid) technical guidance before making any commercial decisions that may impact your enterprise.

In general terms though, you could buy a 2U case and put an ATX consumer motherboard in it. You can build a consumer-grade system and use it for enterprise workloads; it's just not recommended. Things like dual-PSU redundancy and remote management are features of enterprise kit.

I'd avoid 1U servers unless you need high density. They are tough to cool and won't take expansion cards without adapters.

3U is enough for full-size cards, and possibly a normal desktop cooler and drive cages.

A disk shelf is generally a widely supported component; just make sure your HBA and mini-SAS cables are compatible with the particular brand and it should be OK, but I can't comment on specifics.

For your workloads there are many hardware combinations. Look at your software vendors' recommendations, or build some test rigs and see if they give you the performance you need.

Don't forget additional space for a UPS and switch gear. Networking is generally commodity these days, but plan for expansion.

That's OK if you don't feel confident recommending anything.

I'm not looking for business advice; I can handle that pretty much on my own. I have a need for something that I want to build myself; if I were looking for someone to do it for me, with guarantees and such, I would hire someone.

But this is not the case; this is a business need that does not require enterprise-level anything in any of its parts whatsoever.

Thanks for all the info though, very useful to read.

No problem. Just be aware that soliciting advice on your commercial infrastructure will always make L1ers nervous about providing specific guidance.

I suspect many contributors would gladly design and build your entire server rack as that is what we enjoy doing. The issue is it could be construed as free consultancy, and many professionals frown on that. Moreover in some markets it may be illegal to give professional advice without a license, especially in the good ol’ US of lawyers.

If you have specific questions about "will X work with Y" then we can help. Your questions are more general though, and lean towards architectural design work.

You are on the right track with looking at research areas to focus on. Consider your business needs and design the hardware around that, rather than trying to find hardware solutions and seeing if they will work in your use case.

OP will be using a Threadripper CPU; I believe a 4U case may be more appropriate to provide cooling.

I've seen relatively cheap Rosewill 4U server cases on Amazon. They should fit most of your consumer hardware.


Thanks! Will take a look at those.

I found a good deal on a second-hand version of this: https://fantec.com/es/producto/fantec-src-4240x07/

with 3 HBA cards (ASHATA M5015 LSI 46M0851),

the whole thing for 130 euros.

I will give that a try.
