Naive Questions WRT Data Science / Machine Learning Hardware

TLDR

I need to buy a workstation or server for a Data Scientist with a ~$10K budget. How do I get my money’s worth? Is it worth spending the extra money on an A6000 or an A40 (which requires a rack-style enclosure)? If so, what’s the best way to go about virtualizing the hardware?


Background

I’m a Software Engineer who started working on a startup that uses some “AI” about a year ago. We recently raised some money and are on track to make our first hire in around two weeks.

This is also the first time I’ve ever worked with CV or ML. I was able to cobble together some prototypes that impressed investors, customers, and some experts in our domain. I also fully realize I know just enough to be dangerous, so we’re putting this into some actually competent hands before we roll it out to production customers.

We’re bringing on a Senior Data Scientist in the next two weeks, and we’re looking at the possibility of bringing on a slightly less senior person to work with them in the next 1-2 months.

I need to give these people the resources they need to do their job.

My Current Development Hardware / Workflow

I managed to make use of my Gaming PC (8700K, 16GB RAM, 2080Ti, 500GB SSD – for Linux) for the work I’ve done so far. I’ve also experimented with some machines on Lambda Labs and AWS EC2.

I use a MBP as my primary machine (work tasks, development) and then use my Linux “workstation” to actually run my experiments / pipelines. I want to re-use this paradigm (i.e. the MBP as the “work” laptop, with a remote server / workstation doing the heavy lifting).

The Hardware Dilemma

When we developed our financial models (before the silicon shortage), I assumed we could just go with high-end consumer hardware: a Threadripper or 5950X and an RTX 3090 for every Data Scientist to use as their workstation. We were leaning towards having somebody like Lambda Labs or System76 build the actual machine. We were also considering the Lambda Labs GPU Cloud service as an alternative.

Then the GPU, CPU, Storage, Everything Shortage hit.

This is problematic from both perspectives: I’ve been having a hard time getting the cheap ($2.50/hr) Lambda Labs GPU Cloud instances, and an RTX 3090 is now $3090 (if you’re lucky!).

Which puts me in a weird position. I would have considered an A6000 at ~$6000 to be expensive before, but now it’s starting to make some sense, seeing how it hasn’t been terribly impacted by the market insanity yet. It has significantly more memory (48GB vs 24GB), which could give us more flexibility to experiment on larger models locally without going to the cloud.

On the flip side, having an A6000 for everybody is going to blow out the budget. So the ideal case would be to build one much beefier host machine on server / workstation grade hardware (ECC seems to be the main benefit here) and virtualize it. We have a few weeks to get this right, but I’m looking for guidance on the best way to do vGPU (i.e. hypervisor choice).

Closing

I’m aware of my vast ignorance here, and I’m not going to make a decision without talking to our hire about his preferences / recommendations. I’m just trying to approach that conversation from a more informed position.

The A6000 is available with a fan, iirc. But depending on what they’re actually doing, a 3090 may do the job more economically? It depends. You just can’t deploy the 3090 in the data center, only on the data scientist’s desktop. The 3080 Ti is probably tomorrow, and maybe also close in perf even with the hash rate limiter, so who knows.

If you’re willing to queue up, vendors like EVGA will work with you. Idk what their current lead time is though.

If you can tell me more about the job or software, maybe I could do some tests and get a video out of it.

Still reading…


@wendell, thanks for the quick response. Not too concerned about 3090 deployment restrictions (this will be an in-office machine that they may occasionally access across a VPN).

I can’t publicly disclose what we’re working on; I’d be happy to show you in a video call, but I don’t think “stealth startup AI thing” makes compelling content for a video.

In terms of the techniques / technologies we’re currently using: mostly OpenCV, TF, and dlib. We’re doing things like BRISQUE (“quality” classification), face detection (dlib), edge detection (Structured Forests), R-CNNs (feature detection), GrabCut (segmentation), various natural scene statistics, etc.
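Just for flavor, here’s a minimal sketch of how two of those pieces can fit together (dlib face detection seeding OpenCV GrabCut). The file names and parameters are placeholders, not our actual pipeline:

```python
# Minimal sketch: dlib face detection seeding OpenCV GrabCut.
# "sample.jpg" and all parameters are placeholders.
import cv2
import dlib
import numpy as np

img = cv2.imread("sample.jpg")               # BGR, as OpenCV loads it
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)   # dlib expects RGB

detector = dlib.get_frontal_face_detector()
faces = detector(rgb, 1)                     # 1 = upsample once

if faces:
    f = faces[0]
    rect = (f.left(), f.top(), f.width(), f.height())  # (x, y, w, h) seed box

    mask = np.zeros(img.shape[:2], np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

    # Keep definite + probable foreground pixels
    fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype("uint8")
    cv2.imwrite("segmented.jpg", img * fg[:, :, None])
```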

I’m pretty certain we’ll be doing some feature classification work imminently, and maybe some light NLP tasks.

Lots of unknowns. I know NLP is very memory heavy, but I’m uncertain whether we can gain a performance benefit on the rest.
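To put rough numbers on the memory concern, here’s a back-of-the-envelope estimate of training memory for a mid-sized NLP model. The assumptions (fp32 weights, gradients, Adam optimizer state, activations excluded) are mine, not measurements:

```python
# Back-of-the-envelope GPU memory for training in fp32 with Adam.
# Activations and framework overhead are excluded, so real usage is higher.
def training_mem_gb(n_params, bytes_per_param=4):
    weights = n_params * bytes_per_param        # model weights
    grads = n_params * bytes_per_param          # gradients
    optimizer = n_params * bytes_per_param * 2  # Adam moments (m and v)
    return (weights + grads + optimizer) / 1e9

# ~340M parameters is roughly BERT-large scale
print(f"~{training_mem_gb(340e6):.1f} GB before activations")  # ~5.4 GB
```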

The really compelling thing about the A6000 is that it supports NVIDIA vGPU, meaning I could assign something like 32GB of memory to Data Science and 12GB to builds / data engineering / testing / etc. I’ve never actually run my own hypervisor before, or touched any IT-related tasks in over a decade, so I’m pretty sure I’d need to hire a contractor to flesh this out properly.


So NVIDIA does allow CUDA in a limited number of containers, for developer workstations. It was possible last I checked, and it might be one of those fun shared-use things that could work as a prelude to something more full-featured like vGPU? Maybe?
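As a rough illustration of what I mean, assuming the NVIDIA Container Toolkit is already set up on the host and using the docker-py SDK (the image tag is just an example, not a recommendation):

```python
# Rough illustration: run a CUDA container with GPU access via docker-py.
# Assumes the NVIDIA Container Toolkit is installed on the host; the image
# tag below is only an example.
import docker

client = docker.from_env()
out = client.containers.run(
    "nvidia/cuda:11.2.2-base",
    "nvidia-smi",
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
    remove=True,
)
print(out.decode())
```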


Here’s my tentative builds:

Gigabyte

1 x GIGABYTE AMD WRX80 Motherboard (GA-WRX80-SU8-IPMI)
8 x Samsung - DDR4 - module - 32 GB - DIMM 288-pin - 3200 MHz / PC4-25600 - reg (M393A4K40DB3-CWE)

– OR –

Asus
1 x ASUS Pro WS WRX80E-SAGE SE (Pro WS WRX80E-SAGE SE WIFI)
8 x Kingston Server Premier - DDR4 - module - 32 GB - DIMM 288-pin - 3200 MHz

– AND –

Common Hardware
1 x Threadripper PRO 3975WX 32-core
1 x PNY NVIDIA Quadro RTX A6000 48GB GDDR6
1 x SAMSUNG 980 PRO 2TB PCIe NVMe Gen4
1 x EVGA SuperNOVA 850 G5, 80 Plus Gold 850W
1 x Noctua NH-U14S TR4-SP3, Premium-grade CPU Cooler for AMD sTRX4/TR4/SP3
1 x Noctua NF-A15 PWM, Premium Quiet Fan, 4-Pin (140mm, Brown)

I’m planning on putting this in a Fractal Design Meshify 2 XL for the time being, will add some additional NF-A14 Industrial PWMs in front to ensure we’re good on cooling if necessary.

I think I’m running right up against the PSU’s limit, but there’s no sense in buying anything short of 1300W if I were to pop in another GPU, so I’m deferring that purchase until then (and hopefully until a price correction).
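Here’s the back-of-the-envelope power budget I’m working from (nominal TDPs, so treat it as an estimate with no allowance for transient spikes):

```python
# Rough sustained power budget for the build above; the TDP figures are
# nominal assumptions, not measurements, and transient spikes can exceed them.
parts_watts = {
    "Threadripper PRO 3975WX (280W TDP)": 280,
    "RTX A6000 (300W board power)": 300,
    "8x DDR4 RDIMM (~4W each)": 32,
    "NVMe SSD, fans, motherboard, misc": 90,
}
total = sum(parts_watts.values())
print(f"~{total} W sustained on an 850 W PSU ({total / 850:.0%} load)")
```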

Memory is on the QVL and everything else seems pretty standard; just looking for a gut check before I commit. :)
