Build for a Data Plumber

Hi,

I’ve put together a basic build outline and would love some feedback to make sure I’m on the right track.

The build is for my freelance business; I’m basically a data plumber for SMBs. Most of my work is pulling, cleaning, and consolidating data for my clients. Generally, I’m working with fewer than 100k records, but a couple of times a year I process datasets with 40+ million records (10GB-15GB total). My current setup, a Lenovo Legion 7 laptop w/ AMD Ryzen 9 5900HX and 32GB RAM, keeps having problems, so I’m switching to a desktop.

Also, I’m starting to migrate from Windows 10 to Linux and want a Linux-friendly system that will allow me to run a Windows VM for a couple of apps.

My budget ranges from $2k-$6k USD, but my outlines are closer to $3k.

I’m looking at:

  • Ryzen 9 5900X
  • 64GB DDR4
  • Radeon RX 6800 XT or Nvidia 3080
    or
  • Intel i7-12700k
  • 64GB DDR5
  • Radeon RX 6800 XT or Nvidia 3080

A couple doubts/questions I have are:

  • How does Alder Lake’s new architecture work with VMs? My research so far shows that it is still pretty weak.
  • Nvidia is tempting, but I hear that Radeon is friendlier with Linux. Is that still the case?
  • A Threadripper 3960X is tempting, but is so expensive and I’m not sure it provides enough value to justify the price. If you disagree, I’m willing to be persuaded otherwise.
  • ECC or not to ECC? (Found some Micron DDR4-3200 32GB ECC memory that is tempting).

I would suspect this depends on which cores are assigned to the VM. Alder Lake has a big.LITTLE-style hybrid architecture, and the Performance (P) cores will still run circles around the Efficiency (E) cores. So a VM assigned to two E cores will perform noticeably worse than a VM assigned to two P cores.

AMD has better integration and currently similar performance, tier-wise. Nvidia’s bigger power figures are hamstrung on the Linux desktop currently; see the most recent benchmarks from Phoronix, where the 6800 XT has the same performance as the 3080 and the 6700 XT beats the Nvidia 3070 Ti.

Currently, Threadripper only makes sense if you really need 24+ cores or the PCIe lanes. 99.99% of everyone doesn’t need either, though it is nice for development and as a media workstation.

In my opinion, ECC is nice, but unless you deal with massive and frequent repeated data transfers (as in, constant file shuffling) it isn’t really worth paying a premium for. I just wish ECC would become mandatory on DDR already; the only reason it isn’t is that Intel decided “only server people will want this”, cut the support from consumer CPUs, and then priced the RAM accordingly for server vs. non-server markets.

Since there are no cheap ECC sticks available, they were correct, but if the tech were better supported on consumer platforms the prices would drop quite a bit. ECC is nice to have but not worth the big premium; if it were more widespread, everyone would go for ECC over non-ECC. Who cares if you need to pay $140 instead of $130, right? But at the same time it’s not really worth $170 over $130, not for something that won’t really matter for most uses.

To be clear, the likelihood that a non-ECC stick screws over your files is something like 1 in 10,000 every time you write that file somewhere. This can be easily mitigated by a sane backup strategy, and I also believe some file systems have built-in resilience to bit flips these days.
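As one possible piece of a "sane backup strategy", here is a minimal sketch of a copy that checks itself with SHA-256. The function names and the demo file name are made up for illustration, and this is not a complete backup tool. Reading a file back right after writing it may be served from the OS cache, so treat it as a sanity check and not a guarantee.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large files never sit fully in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verified_copy(source: Path, destination: Path) -> str:
    """Copy source to destination and raise if the two digests differ."""
    expected = sha256_of(source)
    shutil.copy2(source, destination)
    actual = sha256_of(destination)
    if actual != expected:
        raise IOError(f"checksum mismatch copying {source} to {destination}")
    return actual


# Demo on a throwaway file (hypothetical name) in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "royalties.csv"
    src.write_text("track,amount\nA,1.25\n")
    dst = Path(tmp) / "royalties_backup.csv"
    demo_digest = verified_copy(src, dst)
    demo_ok = demo_digest == sha256_of(dst)
```

Storing the returned digest next to the backup lets you re-check the file later, long after the copy was made.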

Also, do be aware the 12700k does not support ECC unless you get a W680 motherboard. Z690 will not support it. AMD, on the other hand, supports ECC with both B550 and X570(S) motherboards.

As a final word, since it sounds like you will be travelling around a lot, I would take a look at a smaller form factor, like a 10-liter case or so. There are plenty of options to choose from, but my main recommendations are the Louqe Ghost S1, the Sliger SM580, the Phanteks Evolv Shift XT, or the SFFTime P-ATX.

You need a motherboard that supports ECC. IIRC, ASRock has formal ECC support, and most Asus boards support it as well. MSI and Gigabyte only have some supported motherboards (do check the specification on the official page of your motherboard).

You also need a PRO CPU. Those are only available to the general public as Ryzen 4000 series CPUs. They are only on the Zen 2 architecture (vs. the Zen 3 used in the current 5000 series), meaning you really can’t utilize PCIe Gen4 speeds even if you are using X570 (or B550). Take note of that if you intend to use PCIe Gen4 NVMe SSDs.

Could you talk more about what issues you’re having? Does the work take too long to process, are you running out of RAM, or does your processing fall back to on-disk processing?

Also, your laptop isn’t exactly a slouch, and according to the specs it will happily support 64GB of RAM. While this may be “wasted” if you buy a desktop later, if RAM is your bottleneck it would be the cheapest way to solve your issues right now (especially if you have a spare DIMM slot).

If you’re struggling for CPU, it may be worth waiting till later in the year. There are no firm release dates, but both AMD and Intel are launching newer products in “second half 2022”.

It’s also worth asking if there is better tooling for what you’re doing. A small change in how you process data may have a huge impact on processing time.

If you pin your VM to the P cores then it will run just fine. Issues only happen when you let the VM switch between P and E cores.
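For a libvirt/virt-manager guest, pinning is expressed as a `<cputune>` block with one `<vcpupin>` entry per vCPU in the domain XML. Below is a rough sketch that generates that block. The helper name is made up, and the CPU numbering (logical CPUs 0-15 being the P-core threads on a 12700k, with sibling threads numbered adjacently) is an assumption you should verify on the actual machine before using it.

```python
def build_cputune(host_cpus: list[int]) -> str:
    """Return a libvirt <cputune> block pinning vCPU i to host_cpus[i]."""
    lines = ["<cputune>"]
    for vcpu, host_cpu in enumerate(host_cpus):
        lines.append(f"  <vcpupin vcpu='{vcpu}' cpuset='{host_cpu}'/>")
    lines.append("</cputune>")
    return "\n".join(lines)


# ASSUMPTION: logical CPUs 0-15 are the P-core threads and 16-19 the E cores,
# with the two threads of each P core numbered adjacently (0-1, 2-3, ...).
# Confirm on the real machine (for example with `lscpu --extended`) first.
ASSUMED_P_CORE_CPUS = list(range(0, 16))

# Pin a 4-vCPU guest to the first four assumed P-core threads.
snippet = build_cputune(ASSUMED_P_CORE_CPUS[:4])
print(snippet)
```

The printed block would go inside the `<domain>` element, for example via `virsh edit` on the guest.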

Are you using OpenCL or CUDA? Or is this just for gaming? Nvidia binary drivers are just a pain, but most of this is handled by the distro. If you’re doing anything strange (kernel modules, Arch/Gentoo) then AMD is a good choice.

Arguably the main use case for ECC is long-running servers, to ensure data doesn’t get corrupted when it sits in RAM for months (and sometimes years). For a desktop/workstation that gets rebooted once a week/month it’s less of a concern. If you’re afraid of corruption then readjust your backup strategy to suit, as that covers far more use cases.

Could you talk more about what issues you’re having?

It really boils down to three main issues:

  1. The computer will often get sluggish, the fans will ramp up, and when I open Task Manager to see what is happening the computer slows down even more (occasionally freezing altogether). As a result, I’m not able to actually troubleshoot the problem.

  2. Fairly often the computer won’t shut down (or restart). It goes through the Windows shutdown process, and then just stays on. I’ve tried walking away for 30+ minutes and it is still shutting down. At that point I do a hard shutdown.

  3. When I switched over to Linux (in part to see if it would fix the problem, and because I’m not a big fan of Windows) I kept running into problems trying to set up a Windows 10 VM.

    • A VM run with virt-manager wouldn’t allow me to activate Windows (CLI QEMU worked)
    • I could never get the VirtIO drivers to install on Windows 10.

Saw the problems even after a fresh install of Windows 10 (after trying ArcoLinux and PopOS).

I’ve just assumed it was a mix of poor cooling and laptop hardware, hence the reason I’m looking at a desktop.

However, upgrading my RAM is a good next step. If the problems don’t go away, I can still move forward on the PC build, and the laptop still gets more memory.

Are you using OpenCL or CUDA?

Not yet. A lot of my scripts use Python’s pandas library, and I know Nvidia has cuDF, which is very similar and which I want to try out. I haven’t yet because the 3080 in my laptop uses my system’s memory (another argument for upping it to 64GB).

The GPU is not so much for gaming; I need one with a Ryzen 9 5900X, since it has no integrated graphics, and I have been learning ML. If I go Intel, it isn’t necessary, but it really boils down to a “why not”.

ECC Memory

Thank you for all your input on ECC memory. It has actually been one of those unknowns for me. I’ve done research on the topic, but kept feeling unsure on whether the added costs and slower speeds made sense for my desktop.

This is indicative of running out of memory. A memory upgrade would definitely help here.

Yeah, VFIO is still a pretty newfangled thing and the chances of it working properly on $RANDOMLAPTOP with Linux are pretty slim. I’d recommend dual-boot for now, if you still want to enjoy Windows gaming. Not ideal, but workable.

Keep Task Manager open on the memory graph; this way you can monitor it while it’s processing.
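If the graph is hard to watch because the machine bogs down, a script can also report its own peak memory. A minimal sketch using the standard library's `tracemalloc` follows. The helper and the stand-in workload are hypothetical, and `tracemalloc` only sees allocations made through Python's allocator, so read the number as a lower bound on what Task Manager would show.

```python
import tracemalloc


def peak_memory_mib(step, *args, **kwargs):
    """Run step(*args, **kwargs); return (result, peak MiB traced by Python)."""
    tracemalloc.start()
    try:
        result = step(*args, **kwargs)
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak_bytes / (1024 * 1024)


def build_records(count):
    """Hypothetical stand-in for a real load/transform step."""
    return [{"id": i, "amount": i * 0.01} for i in range(count)]


records, peak = peak_memory_mib(build_records, 100_000)
print(f"{len(records)} records, peak about {peak:.1f} MiB")
```

Tracing slows the script down noticeably, so it is better suited to a test run on a sample than to the full 40-million-record job.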

Do you have a clear idea of how threaded the applications are? I read something about Python with pandas; as far as I know that is pretty much single-threaded unless you specifically make it multi-threaded. Take a look at Task Manager while it is running and see if all the cores are pegged or just one core (i.e. most of the cores are at 0-10%).
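For reference, "specifically making it multi-threaded" in Python usually means splitting the data and handing the chunks to separate processes, since threads do not speed up CPU-bound pure-Python code. Here is a rough sketch with the standard library. The function names and the per-key summing workload are made up for illustration and plain tuples stand in for real rows.

```python
from concurrent.futures import ProcessPoolExecutor


def split_rows(rows, parts):
    """Split rows into `parts` contiguous, roughly equal chunks."""
    size, extra = divmod(len(rows), parts)
    chunks, start = [], 0
    for index in range(parts):
        end = start + size + (1 if index < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def total_by_key(rows):
    """CPU-bound stand-in: sum integer cents per key for one chunk."""
    totals = {}
    for key, cents in rows:
        totals[key] = totals.get(key, 0) + cents
    return totals


def merge_totals(partials):
    """Combine per-chunk dictionaries into one overall result."""
    merged = {}
    for partial in partials:
        for key, cents in partial.items():
            merged[key] = merged.get(key, 0) + cents
    return merged


def parallel_totals(rows, workers=4):
    """Fan chunks out to worker processes, then merge their results."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return merge_totals(pool.map(total_by_key, split_rows(rows, workers)))
```

Call `parallel_totals(rows)` from under an `if __name__ == "__main__":` guard, which is required on Windows where worker processes are spawned fresh. Sending chunks to workers has its own cost, so this only pays off when the per-chunk work is heavy.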

Alder Lake is definitely the fastest for single-threaded applications. I have an i5-12400 running that has a higher single-thread score in Cinebench than the 5900X, and it’s only a 200 euro CPU. (I just have it set up as a test bench for my business.)

By the way, a 3080 will never use system memory; it has its own video memory. The integrated GPU is probably using system memory, but not that much. It’s mostly a bug with Task Manager because the GPU is connected to the iGPU.

I’m not entirely sure what you are doing as a data plumber, but given you have decided to build a desktop, the incremental cost of using ECC is sufficiently small that I’d have thought it worth doing, given that people’s data is your business.

ECC isn’t necessarily any slower - unbuffered ECC which is the standard on desktop platforms basically runs at the same speed, there isn’t a latency penalty that I’m aware of (except as compared to fast clocked overclocking sticks).

ECC is not officially supported with non-pro Ryzen processors but in practice has been shown to work, with some caveats, depending on the board used.

Current Intel processors will work and are fully supported with ECC provided you are using the correct motherboard, which basically means a W680 chipset as I understand it. These are starting to reach market now and might be a bit cheaper and more widely available in a month or two.

But having said all that, the first step is probably to bump your laptop up to 64GB RAM if it is practical for you to do so… I expect you’ll still want the portability from time to time, so it won’t be wasted money even if you do build a desktop down the line.

First, thank you everyone for your advice. I’ve rarely if ever written questions on a forum, but this experience has converted me. :smile:

TL;DR
I bought 64GB of RAM yesterday for my laptop. Once that comes in I’ll see if it improves things. If it does, I’ll hold off building a new computer until later this year when new CPUs are released.

If it doesn’t, I’m looking at:

  • Intel 12700k or 12900k (for single core speed and iGPU)
  • 64GB or 128GB DDR5
  • 2TB Samsung 980 Pro
  • Skip a discrete GPU for now. I can always buy one in the future.
  • Haven’t decided on a motherboard (W680 is super tempting because I plan to turn this computer into a server when I upgrade in the future)

With that, below are responses to posts.

Done!

That sucks for the laptop, but is a plus for building a new desktop.

Agreed, but there is that nagging doubt that I might need more cores for VMs. But as addressed already, I can pick the cores when I create the VM.

Yes, pandas is single-threaded, as is most of my code (a good argument for Alder Lake). When I need multi-threading, I just rewrite those parts of my code.

That makes so much sense. I was confused when I saw a GPU with 1GB of memory and assumed it was the 3080 and not the iGPU.

As an aside, an iGPU is another argument for Intel. I want a discrete GPU, but for most of my work it isn’t necessary. When it is, I have the option of using my laptop or renting compute in the cloud.

It really is a mix of different tasks. More recently, I find myself writing Python scripts to extract, transform, and load data (basic ETL). Some of these scripts are one-offs; others are written for Apache Airflow. I also have a couple of clients for whom I provide FileMaker support, but that has really dropped off over the last few years.

Working with other people’s data is the reason I keep thinking about ECC memory. One of my clients is a music publisher, and every quarter I process their music royalties (about 1 million new records mixed with another 40 million). The combination of the dataset’s size and the fact that it deals with money (and Python decimals instead of floats) is the reason I keep looking at ECC.
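For anyone wondering why decimals matter for money, here is a small illustration with Python's standard `decimal` module. The amounts are made up and this is not the actual royalty code, just the general behaviour.

```python
from decimal import Decimal, ROUND_HALF_UP

# Binary floats cannot represent most decimal fractions exactly.
float_matches = (0.1 + 0.2) == 0.3  # False

# Decimals built from strings keep the exact decimal value.
decimal_matches = (Decimal("0.10") + Decimal("0.20")) == Decimal("0.30")  # True


def to_cents(amount: Decimal) -> Decimal:
    """Round to two places, with ties rounding away from zero."""
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# The float 2.675 is stored slightly below 2.675, so round() gives 2.67.
float_rounded = round(2.675, 2)
decimal_rounded = to_cents(Decimal("2.675"))  # Decimal("2.68")

print(float_matches, decimal_matches, float_rounded, decimal_rounded)
```

Note that decimals are only exact when they are constructed from strings or integers; `Decimal(0.1)` inherits the float's error.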

However, this process was originally done in FileMaker and took around 70-80 hours. I rewrote everything in Python and now it takes about 15 minutes (to toot my own horn a bit). Since the data is in memory for such a short time, it might not be as big a deal. Plus, the process went from running royalties once to running them multiple times. As such, we get a good idea of what the numbers look like and can spot any differences.
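Spotting differences between runs can be scripted as well. A minimal sketch, assuming each run ends up as a dict of per-payee totals; that shape, the function name, and the payee names are assumptions for illustration and not the real pipeline.

```python
from decimal import Decimal


def diff_runs(first: dict, second: dict) -> dict:
    """Return {key: (first_value, second_value)} for every key that differs.

    A key missing from one run shows up with None on that side.
    """
    differences = {}
    for key in sorted(first.keys() | second.keys()):
        a, b = first.get(key), second.get(key)
        if a != b:
            differences[key] = (a, b)
    return differences


# Hypothetical per-payee totals from two runs over the same input.
run_1 = {"payee_a": Decimal("120.50"), "payee_b": Decimal("33.10")}
run_2 = {"payee_a": Decimal("120.50"), "payee_b": Decimal("33.11")}

mismatches = diff_runs(run_1, run_2)
print(mismatches)
```

Two runs over identical input should produce an empty result, so anything that shows up is worth a look, whether the cause is a code change, a data change, or (rarely) hardware.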

The K variants offer very little extra over the non-K variants on 12th gen, and to take advantage of the extra stuff you lock yourself into a $200+ Z690 motherboard. It might be more worth it to go with the regular 12700 or 12900 and a $120-$170 B660 motherboard, but that is entirely up to you if you need/want the K features. :slight_smile:

W680 seems to be quite expensive BTW, $350-$500 or so, so the ECC premium is… high on Intel boards still :slight_smile:
