Data Scientist Workstation

Hi everyone!

I am a long-time YouTube channel follower, and this is my first time posting on the forum.
I am setting up a workstation for a new data scientist position that I will begin soon, and I have been asked to suggest a configuration that would be suitable for the job.

Currently, I have Ubuntu installed on my machines (Lenovo T495 and Ryzen 3600), and I mostly work with

  • Matlab,
  • Python (pandas, NumPy, etc.), and
  • R.

With my upcoming job, I will mostly be dealing with environmental datasets, including

  • cleaning,
  • statistics,
  • some machine learning, and
  • data presentation.

I am considering a configuration based on either the Ryzen 9 7900X or the i7-13700K, as they seem to meet my requirements. Solutions with more “pro” CPUs (Threadripper and Xeon) exceed my budget.

Intel Platform

| PC Component | Product |
| --- | --- |
| Motherboard | ASUS ROG STRIX B760-G GAMING WIFI |
| CPU | Intel Core i7-13700K |
| RAM | Corsair Vengeance 2x32GB |
| Video Card | Zotac RTX 4060 Ti 8GB Twin Edge OC |
| Storage HD1 | Kingston FURY Renegade 2TB |
| Storage HD2 | Kingston NV2 2TB |
| Case | Fractal Define 7 Mini |
| Power Supply | be quiet! Straight Power 11 Platinum |
| CPU cooling | Noctua NH-U12A |
| Additional Cooling | Fractal Dynamic X2 GP-14 |

AMD Platform

| PC Component | Product |
| --- | --- |
| Motherboard | ASUS TUF GAMING B650M-E |
| CPU | AMD Ryzen 9 7900X |
| RAM | Corsair Vengeance AMD EXPO 2x32GB |
| Video Card | Zotac RTX 4060 Ti 8GB Twin Edge OC |
| Storage HD1 | Kingston FURY Renegade 2TB |
| Storage HD2 | Kingston NV2 2TB |
| Case | Fractal Define 7 Mini |
| Power Supply | be quiet! Straight Power 11 Platinum |
| CPU cooling | Noctua NH-U12A |
| Additional Cooling | Fractal Dynamic X2 GP-14 |

Budget-wise, the two solutions are close in price, with AMD being slightly more expensive at 1690 EUR compared to Intel at 1610 EUR.

I have a few questions and concerns regarding the CPU and GPU choices.

First, I am wondering how Intel CPUs are handled in Linux with the split between Performance and Efficiency cores. Is there a risk that I will have issues because of this “new” architecture if I pick Intel?
Additionally, I am curious to know how AMD compares to Intel in scientific benchmarks.

For the GPU, I have opted for a 4060 Ti, assuming that CUDA will handle some of the workload (CuPy); a rough sketch of what I have in mind is below the list. However, I have two considerations here:

  1. Professional GPUs from NVIDIA are way above my budget, with an RTX A4000 costing around 1000 EUR.
  2. I am not sure if AMD GPUs have good support in this field. The Radeon Pro W7500 and W7600 have an interesting price, but I am not sure how much I would actually benefit from having one on board.
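
To be concrete about the CuPy idea, something like this is what I have in mind: NumPy-style work that runs on the GPU when CuPy/CUDA is present and falls back to NumPy otherwise (the array sizes and the covariance/eigenvalue step are just made-up examples):

```python
import numpy as np

try:
    import cupy as cp
    xp = cp              # CuPy present: arrays and math live on the GPU
except ImportError:
    cp = None
    xp = np              # no CuPy/CUDA: everything stays on the CPU

# Made-up workload: covariance matrix + eigenvalues of a random dataset.
a = xp.random.random((10_000, 1_000))
cov = xp.cov(a, rowvar=False)
eigvals = xp.linalg.eigvalsh(cov)

# Copy the result back to host memory if it lives on the GPU.
result = cp.asnumpy(eigvals) if xp is not np else eigvals
print(result[:5])
```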

I welcome any thoughts, experience, or suggestions. Thanks!

My first inclination would be 64 GB of memory, just to give yourself some room for processing the datasets / machine learning.

The other idea is a 1 TB primary SSD for your OS/apps and a 4 TB drive for your data storage. This is just a personal preference, as I like to keep my data separate: in case the OS or an app goes south, I can just reimage the primary drive without having to worry about data loss, per se!

Edit: External drive for data backup / retention.

That thing can eat memory, and Matlab can take advantage of Nvidia GPUs (if you build your stuff for it).

Not worth it. AFAIK Matlab really only supports CUDA for GPU acceleration.

That's a good point!

Right! Just importing datasets from CSV files is already quite memory-demanding with Matlab. That forced me to move to 32 GB on my laptop and desktop PCs.
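
On the Python side I can at least work around part of this with chunked reads; a minimal pandas sketch (the file and column names are made up):

```python
import pandas as pd

# Stream the CSV in 500k-row chunks instead of loading it all at once,
# keeping only the small running totals we actually need in memory.
total = 0.0
rows = 0
for chunk in pd.read_csv("sensor_readings.csv", chunksize=500_000):
    total += chunk["pm25"].sum()
    rows += len(chunk)

print("mean PM2.5:", total / rows)
```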

Okay, so 64 GB of RAM and an RTX 4060 Ti should do the job :smiley:

Sooner or later you will encounter situations where 64 GB doesn't cut it. I think you are wise to use 32 GB modules, giving you headroom to go to 128 GB.

That said, I am frequently running into this problem with 128 GB. I am reluctant to 'upgrade' to a TR/Epyc/Xeon system, because a lot of my work relies on single-threaded performance. I don't think there is currently a solution to this situation.

EDIT: I've been rethinking this, and I think I failed to recognize that even if the individual tasks are single-threaded, the additional memory should give me greater headroom for embarrassingly parallel work. Scoping out a 5975WX upgrade now.
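
By embarrassingly parallel I mean the usual pattern of repeating the same independent computation over many inputs, so spare cores and memory turn directly into throughput; a toy Python sketch (the simulate function is just a stand-in):

```python
from concurrent.futures import ProcessPoolExecutor
import math

def simulate(seed: int) -> float:
    # Stand-in for one independent run (bootstrap sample, parameter sweep point, ...).
    return sum(math.sin(seed * i) for i in range(100_000))

if __name__ == "__main__":
    # One worker process per core by default; the runs are fully independent.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(simulate, range(32)))
    print(len(results), "independent runs finished")
```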

64 GiB might even be on the low end. I've pushed 50-ish with toy data sets (hobby projects).

EDIT: Basically this:

48 GiB modules might give you an easier upgrade path to 192 GiB.

Keep in mind memory bandwidth matters more and more as the size goes up.

I've got a pretty heavy FEA workload that is very parallelizable, and a 16-core Xeon W will outperform a 64-core Threadripper Pro by ~60%. The Intel cores are only about 20% faster than the Threadripper cores single-threaded; this just goes to show how much more memory bandwidth is available on the current HEDT Intel platform.

I'm not sure how constrained you are in terms of your software stack, but if you're looking at needing more than 64 GB of memory, there should be efficient ways to parallelize. Apache Arrow and Polars are things to look into, for example. Convert your CSV files to a more efficient format like Parquet, HDF5, etc.; a rough sketch of the Polars/Parquet route is below.
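
Something along these lines (the column and file names are invented, and the exact Polars API has shifted a bit between versions, e.g. group_by vs. the older groupby):

```python
import polars as pl

# One-off conversion: CSV -> Parquet (columnar + compressed, much faster to re-read).
pl.scan_csv("measurements.csv").sink_parquet("measurements.parquet")

# Later analyses scan the Parquet file lazily, so only the needed columns/rows
# are actually read into memory.
summary = (
    pl.scan_parquet("measurements.parquet")
    .filter(pl.col("station") == "A01")
    .group_by("month")
    .agg(pl.col("temperature").mean().alias("mean_temp"))
    .collect()
)
print(summary)
```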

When it comes to hardware, I agree with others to look at the 48 GB modules, as you could upgrade to 192 GB down the line.

Thanks for this. I have since reconsidered the Zen 3 route, given that Zen 4 Epyc and Xeon W are available, as you say.
Can you comment any more on what your setup is?

Quite constrained, unfortunately. I am working with single-cell transcriptomic datasets that are reaching >1 million cells, and there are only a handful of reputable tools out there for analysis.

Most Linux distros will work on a variety of CPU platforms without any problems.
But there are some with specialized versions customized for different architectures.
These can be found at DistroWatch.

Genoa Epyc is likely even faster than the current Intel HEDT, at least in my workload. There's also the 7000 series Threadripper that should be out in ~6 months, but it likely won't be faster than the current Intel HEDT in memory performance.

The situation I described is for a 100M DoF FGMRES problem. The Intel system is a W5-3435X with 512 GB of memory at 5958.4 MHz; the CPU runs at 5 GHz when lightly threaded and 4.8 GHz all-core. The Threadripper system was a non-OC'd 5995WX running 3200 MHz memory; it's a tiny bit apples to oranges, since the Intel system can overclock but the Threadripper 5000 platform isn't really conducive to overclocking like the Xeon W-3400 platform is.

That 100M DoF FGMRES problem is actually just a standalone benchmark I cooked up to represent my workload, in case anyone else wants to run it. I'd be really interested to see someone with a Genoa system run it.
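
For a sense of what that kind of workload looks like in miniature, here is a toy SciPy sketch (not my actual benchmark: plain restarted GMRES rather than FGMRES, on a tiny unpreconditioned 2-D Poisson problem):

```python
import numpy as np
from scipy.sparse import diags, identity, kron
from scipy.sparse.linalg import gmres

n = 100                                  # grid points per side -> n*n = 10,000 DoF (tiny vs. 100M)
T = diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (kron(identity(n), T) + kron(T, identity(n))).tocsr()   # sparse 2-D Laplacian
b = np.ones(A.shape[0])

# Restarted GMRES; info == 0 means it converged within the iteration budget.
x, info = gmres(A, b, restart=50, maxiter=500, atol=0.0)
print("info =", info, "residual =", np.linalg.norm(A @ x - b))
```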

Interesting.

As an aside, I spoke to someone who works at our local university and they are buying a Dell w9-3475X workstation for ~£5k. The university must have a beefy service contract with Dell to be getting that 'discount'.

The Epyc Genoa series has my interest (probably the 9274F), and I can source a motherboard (H13SSL-NT) from the UK. The thing that is putting me off is cooling. I don't have much time to spend tinkering and need something that just works.

Not really, considering the base price: as it was specced, the list price was ~19k, so that is a very deep discount.

The gotcha with cooling the high-end platforms nowadays is often RAM and CPU VRM. The motherboard manufacturers seem to design under the assumption that the boards are mounted in rack cases with massive amounts of airflow going over the components.
