Advice requested - build multi-GPU server versus cloud

Hi all,

I work for a small ML startup and we’re on the verge of outgrowing our current hardware (one AM4 box with dual 1080 Tis and one Skylake-X box with dual 2080 Tis).

I’m unsure whether it’s worth building a more capable server or whether we should just start moving more of our work to the cloud.

The server I’m currently looking at as a reference is the Gigabyte G292-Z22 (I can’t post links, but the product page is the first Google result). It’s a 2U chassis with a single EPYC 7002 CPU, support for 8 dual-slot GPUs (4 pairs, with each pair sharing a PCIe 3.0 x16 link), and redundant 2200W power supplies.

A few questions regarding this setup, though:

  • Would a server like this be a good solution for ~5 users keeping eight 2080 Ti (or similar) GPUs running close to 24/7?
  • It has redundant 2200W power supplies. Do these power supplies typically share the load evenly? 8 GPUs at 250W each, plus the CPU and everything else, is right at the max of both the power supply and a 20A breaker. Is it standard to run each power supply off a separate breaker?
  • Would this be expected to keep 8 consumer blower-style GPUs cool enough? 2080 Tis and Titan RTX cards all draw slightly more power than the V100s they propose using, but AFAIK the cooler on the V100 isn’t anything special. If power draw is the real constraint, I wouldn’t mind power limiting all the cards to 250W.
  • How low maintenance do servers like this tend to be? The location has occasional internet and power interruptions, and it’s not always easy to get physical access to the machine. With a good UPS and Gigabyte’s remote management software, is it fair to assume this would be at least as reliable and low maintenance as our current desktops? We’re running Ubuntu right now and would probably run Ubuntu or CentOS.
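
For anyone who wants to check the power question, here is a rough budget sketch. Only the 8 GPUs at 250W, the 2200W supply rating and the 20A breaker come from the post. The CPU draw, the "everything else" draw, the supply efficiency, the 120V mains voltage and the 80% continuous-load derating are all assumptions, so swap in real figures before trusting the result.

```python
def power_budget(n_gpus, gpu_watts, cpu_watts, other_watts,
                 psu_rating_watts, psu_efficiency,
                 mains_volts, breaker_amps, continuous_derate=0.8):
    """Compare the estimated load against the PSU and breaker limits."""
    # Load the power supply has to deliver to the components.
    dc_load = n_gpus * gpu_watts + cpu_watts + other_watts
    # Draw at the wall is higher because the supply is not 100% efficient.
    wall_draw = dc_load / psu_efficiency
    # Nominal breaker capacity, and a derated figure for loads that run for hours.
    breaker_peak = mains_volts * breaker_amps
    breaker_continuous = breaker_peak * continuous_derate
    return {
        "dc_load_w": dc_load,
        "wall_draw_w": wall_draw,
        "breaker_peak_w": breaker_peak,
        "breaker_continuous_w": breaker_continuous,
        "fits_single_psu": dc_load <= psu_rating_watts,
        "fits_breaker_continuous": wall_draw <= breaker_continuous,
    }


# From the post: 8 GPUs at 250W, 2200W supplies, 20A breaker.
# Assumed, not from the post: 225W CPU, 150W for fans/drives/RAM,
# 94% supply efficiency, 120V mains, 80% continuous derating.
budget = power_budget(
    n_gpus=8, gpu_watts=250, cpu_watts=225, other_watts=150,
    psu_rating_watts=2200, psu_efficiency=0.94,
    mains_volts=120, breaker_amps=20,
)

for name, value in budget.items():
    if isinstance(value, bool):
        print(f"{name}: {value}")
    else:
        print(f"{name}: {value:.0f}")
```

Under these assumed figures the load exceeds both a single supply and a derated 120V/20A circuit, which would point toward a 200-240V circuit or a lower GPU power limit. It is also worth checking the spec sheet for the input voltage at which the supplies actually reach their full rating, since that can differ between 120V and 200-240V input.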

Building a server like that ourselves would definitely be cheaper in the long term than training on the cloud, but I’m more afraid of spending too much time keeping everything running well.
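
One way to put numbers on that trade-off is a simple break-even estimate that counts admin time as a running cost. The sketch below is only a framing aid: the function name and every figure in the usage example are hypothetical placeholders, not real quotes for hardware, cloud rates or labor.

```python
def breakeven_months(hardware_cost, monthly_power_cost,
                     admin_hours_per_month, admin_hourly_cost,
                     cloud_rate_per_gpu_hour, gpu_hours_per_month):
    """Months until owning the server beats renting the same GPU hours.

    Returns None if the on-prem running costs are at least as high as
    the cloud bill, in which case the hardware never pays for itself.
    """
    monthly_cloud = cloud_rate_per_gpu_hour * gpu_hours_per_month
    # Time spent keeping the machine running is treated as a real cost.
    monthly_onprem = monthly_power_cost + admin_hours_per_month * admin_hourly_cost
    monthly_saving = monthly_cloud - monthly_onprem
    if monthly_saving <= 0:
        return None
    return hardware_cost / monthly_saving


# Hypothetical placeholder inputs: 8 GPUs busy 24/7 for a 30-day month.
gpu_hours = 8 * 24 * 30
months = breakeven_months(
    hardware_cost=20000, monthly_power_cost=300,
    admin_hours_per_month=10, admin_hourly_cost=50,
    cloud_rate_per_gpu_hour=1.0, gpu_hours_per_month=gpu_hours,
)
print(f"GPU hours per month: {gpu_hours}")
print(f"Break-even after about {months:.1f} months" if months else "Never breaks even")
```

Raising `admin_hours_per_month` shows how quickly maintenance time eats into the savings, which is the part that is hardest to estimate up front.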

Any advice would be greatly appreciated!

That server is great but…

The cooling on server cards is different from that on consumer cards, so I would worry about whether this expensive option would actually work correctly.

I.e., modern servers push air horizontally from front to back, whereas most 2080 Tis cool vertically.

It’s hard to tell exactly how the GPUs are mounted in that box, but would even blower-style GPUs get choked out? I thought they were okay stacked in close proximity. I can’t tell whether there’s any space between the stacked GPUs in that case, though…

Tesla K80s are pretty reasonably priced right now. They’re clocked low, but every other spec is good.
I never realized, they’re actually cooled passively… good thing server fans are bonkers.

