(Demo) Datacenter-in-a-Box: EPYC vs. TR-Pro

greg_at_redhat · July 11, 2021, 9:45pm

Hey, first time poster here.

I am looking to build a datacenter-in-a-box (DIAB! - 64c/1TB mem) essentially a single workstation with enough horsepower to run a private cloud for customer demos I do during the working hours, and a gaming guest with VFIO for gaming on the off hours. Space is a bit of a premium here so I have enough room to fit a single “large” chassis and plan on repurposing my Thermaltake Tower 900 from a previous build.

TLDR! - Advice wanted on:

3995WX vs. a EPYC 7713P they are both similar money (MSRP). The primary workload is business apps (I only have a couple hours a night to game if that but I do like to game when I can so its not completely irrelevant).
Timing of the buy: is it worth chasing a 3995WX given how close we are to rumored announcements in August?
Looking for opinions on the performance difference between the WRX80 chipset vs. the SoC from the EPYC for high storage IO tasks (think 2 or 3 asus hyper m.2 pcie gen4 fully poplulated cards).
Recommended cooling given the case I am going to put in here? I am planning on leaving this machine always on and I am debating air cooling vs. water cooling.
Given this use case motherboard recommendations for TR pro and Epyc Milan compatible?

Background Details:

I am looking to run the following as the primary use case:

RHEL 8.x as the base OS
Red Hat OpenStack Platform 16.x
- I am planning on using one of our internally incubated projects for the shift-on-stack CI testing that does ‘one-touch’ ansible orchestrated deployment dev-install. You essentially point this at a RHEL or CentOS box and in about 40 minutes you have an AIO/standalone install of RHOSP(paid) or RDO(upstream).
Red Hat Ceph Storage (v5)
- Planning on using 2-3 of these M.2 bifurcation boards as the primary cloud storage provider. I realize I can just use cinder backed by lvm2, but I need to be able to show off the ceph integration.
Red Hat OpenShift 4.x
- IPI installed
- Running OCS in external mode for managing pools from the root openstack ceph deployment.
- GPU passthrough from worker nodes being deployed with a GPU passed through

My secondary use case is to run a windows gaming VM:

Windows 10
- 8 cores on a hardware flavor that is isolated from the rest of the host.
- 16GB tuned as 1gb huge page allocations
- nVidia 3090 or equivlanet AMD GPU passed through (4k gaming)
- ~256gb of block storage will be provided via cinder from the 3x4 m.2 drives
- File shares will be provided via Manila as native CephFS mounts via the ceph-dokan project.
  - Cloudbase has a great writeup on Ceph in Windows HERE
- I am debating getting a large wide screen monitor and using gnif’s KVMFR. I tried it before and had some issues with the mouse pointer not getting properly trapped in XWayland (eg. When playing an FPS I could scroll out of the guest window). Maybe this is fixed now but opinions on multi monitor vs. single huge soft partitioned monitor would be appreicated (eg. Samsung LC49G95TSSNXZA).

mutation666 · July 11, 2021, 10:35pm

Would probably go TR PRO personally, but you could go with either. IO options out of the box will be better.

Eh I wouldnt worry TR Pro hasnt even been out that long, and will prob be the last one to hit the next gen. Remember this is a workstation platform so going to be slowest to get update probably.

Why not some Optane P5800X, kind of hard to know what you actually need storage size etc tho without knowing workload.

Icegiant might be a good option as good mix between them.

not sure.

GigaBusterEXE · July 12, 2021, 8:09am

I need to check with Wendell on how epyc Milan turbo behaves, cpu- l says turbos max is 3.67ghz but I think it’s more for lightly threaded loads, Milan generally turbos higher

ThatGuyB · July 12, 2021, 9:25am

Wooh, that’s a lot to unpack here. I’m guessing you know how to run a data center and its network, so I won’t get into how to split the parts accessible by customers from the parts you run personal stuff on (Windows).

I’d go TR Pro, because it’s more of a workstation platform.

If you wait for rumors and future releases, you will never buy hardware. Sometimes what you can buy now is more valuable than what you can buy later, even with less performance, because you get to use it now (just like how money is more valuable now than in the future).

In single systems, the difference is negligible. Unless you want to fill a room with rack cabinets, you don’t really have to worry about the performance between the two, single-node performance is negligible in real-world usage, nobody would notice a a difference in blind test usage of the two.

Air cooling. Period. Too bad the IceGiant doesn’t work on WRX8 / SP3 (or so I recall during Wendell’s interview with someone from IceGiant, but I wasn’t paying much attention, that’s what I was left with).

I’ve used neither and I’m trying to not recommend stuff I never used. Wendell is very knowledgeable in the mobo department, he’s done lots of reviews. After you decide on the CPU, check out some of the motherboard reviews for your CPU choice.

Most manufacturers are having split IOMMU groups and stuff nowadays, it shouldn’t really make a difference, but again, check Wendell’s reviews to make sure.

I never used LookingGlass, but I recall there being a button (maybe Scroll Lock?) that locked your cursor in a window or monitor. But not sure what’s the situation on Wayland (I’m guessing you’re using GNOME Shell).

Regarding Ceph, I’ve got no experience with it and it seems even odd you want to use it inside 1 box (I understand it’s just demo). How are you going to do it? Have some VMs (or containers) run from a normal SSD / HDD array, passthrough some of the m.2 SSDs from the HyperM.2s to said VMs, configure Ceph on these VMs, then add the Ceph Storage to your main host (physical server)? I feel that’s a little crazy, but eh, it’s just testing. Feed my curiosity!

greg_at_redhat · July 12, 2021, 3:51pm

Optane is definitely the way to go for “real” heavily loaded workloads in the datacenter, however between the 3 big ticket items (CPU, Memory, Storage) I am choosing to focus on a higher end CPU / memory and use more conventional NVMe (Albeit a bunch of it!) to help reduce cost where I can.

greg_at_redhat · July 12, 2021, 4:27pm

I will check out that video, it is sad that Ice Giant isn’t an option.

What I have read from this thread so far is having me lean towards a TR-PRO and from what I can tell there are only 3 options:

ASUS Pro-WS-WRX80E-SAGE-SE-WIFI
Gigabyte WRX80-SU8-IPMI-rev-10
Supermicro M12SWA-TF

I know there are some general high level reviews, however given my use case (dense virt + storage + vfio gaming) in Linux, I am curious as to what experience folks have had with this.

ThatGuyB:

greg_at_redhat:

I am debating getting a large wide screen monitor and using gnif’s KVMFR. I tried it before and had some issues with the mouse pointer not getting properly trapped in XWayland (eg. When playing an FPS I could scroll out of the guest window).

I never used LookingGlass, but I recall there being a button (maybe Scroll Lock?) that locked your cursor in a window or monitor. But not sure what’s the situation on Wayland (I’m guessing you’re using GNOME Shell).

Regarding Ceph, I’ve got no experience with it and it seems even odd you want to use it inside 1 box (I understand it’s just demo). How are you going to do it? Have some VMs (or containers) run from a normal SSD / HDD array, passthrough some of the m.2 SSDs from the HyperM.2s to said VMs, configure Ceph on these VMs, then add the Ceph Storage to your main host (physical server)? I feel that’s a little crazy, but eh, it’s just testing. Feed my curiosity!

dev-install deploys ceph using either a loopback device for basic ci/testing, or it can be pointed at real block devices. It deploys containerized ceph on the host which is then provided to the tripleo installer to configure for consumption with cinder/glance. Further due to these containers running on the host no emulation/virt magic is required.

[stack@rhosp-16-1 ~]$ sudo podman ps --filter "name=ceph" --format "{{.Names}}"
ceph-nfs-pacemaker
ceph-mgr-rhosp-16-1
ceph-crash-rhosp-16-1
ceph-mds-rhosp-16-1
ceph-osd-0
ceph-mon-rhosp-16-1

Template deploy options in dev-install

Essentially I would define the blockdevs I wanted to use in my local-overrides.yaml :

---------->8----------
ceph_enabled: true
ceph_devices:
  - /dev/path/to/disk1
  - /dev/path/to/disk2
  - /dev/path/to/disk3
  - /dev/path/to/disk4
  - etc
---------->8----------

And dev-install will deploy ceph bluestore on those devices and make them available to openstack.

GigaBusterEXE · July 12, 2021, 4:36pm

You can use the ice giant on WRX80 if you use it horizontally oriented case
Cooler master makes a case that might work
https://www.newegg.com/black-cooler-master-haf-series-atx-desktop/p/N82E16811119265
I’m half awake in bed, I need to double check compatibility

greg_at_redhat · July 12, 2021, 4:37pm

I know I was initially considering Milan as Zen3 is just newer tech, something the next TR-pro will be based on as well.

Jolly · July 13, 2021, 11:45pm

Following.
I have a 3995wx 256gb of ram that’s lagging like crazy as a desktop build, Win 10 host OS.

I ended up putting it in an openair build - using an openbenchtable case.
I’d go for multi monitors personally - I’m running a few high refresh 4k monitors myself.

ThatGuyB · July 14, 2021, 12:12am

EEWW. Put Linux on it, virtualize Win 10 and eventually pin some specific CPU cores to its VM and see your lag (maybe) fading away. Do you really need all those cores to Windows, or are you running Hyper-V or something? Either way, from Wendell’s older(ish) adventures around the 2990WX, Windows can’t keep up with the hardware

I believe he did come back to the 3995WX and / or Epyc Rome and things improved in Windows, but were still better in Linux.

GigaBusterEXE · July 14, 2021, 12:23am

Just had a talk with Wendell he thinks you might be better off with a 75f3 for higher clocks and more ram, he also suggested that 256GB was probably not enough for what you’re doing

wendell · July 14, 2021, 12:46am

256gb not enough (so don’t wait for TR5k, TR5k pro is probably a ways off).

If you can “live” without the desktop niceties the 75F3 is bananas for “datacenter in a box” demos. I can setup something for you with the hardware if you want me to test something. Could be a good video.

I don’t have 1tb of memory I can put in there right now, I could do 256 or maybe 512 in a 1-socket setup in both epyc and TR pro. (I think you’ll be surprised how fast the 75F3 is on Milan even though it is “only” 32 cores).

I have 8 1tb samsung 980 pros, a P5800x, and maybe I could scrounge up a few more 1-2tb older nvme if you really need 3 AIC of M… Would be willing to get you in via vpn/ipmi to test whatever you want if I can get a video out of it. You’d have to tell me why you think which one is better and get me up to speed on your workflow.

Jolly · July 14, 2021, 2:18am

I really just wanted a computer that would not slow down and start lagging even with 400+ chrome tabs, discord, telegram, slack, etc with, 4-6 4k high refresh monitors.

I started with a massively overclocked i9-10900k, at 5.2ghz all core. Was pretty solid, but sometimes would slow down and start lagging.

Moved to a 5950x which was a decent improvement. Further improved responsiveness by moving some of the higher resource browser tabs (trading charts) to a remote PC - via RDP Remote App session, which somewhat helped, but found was mostly good for windows I wasn’t moving.

What I found was in intense browsing sessions I’d still end up with a bunch of slowdowns and lags, so started trying to throw excessive amounts of hardware at the problem.

Using an Optane 905p for the OS drive also helped. I’ve got a p5800x on back order.

Was looking at 3rd gen epyc, but was having problems with availability, so went with TR Pro, also thought singe core performance might be a bit better with the higher boost speeds.

Built this 64core threadripper workstation which has been disappointing - when it works well, its great, when it doesnt, its taking a few seconds or more to open a new chrome tab, or switch tabs in task manager.
latencymon sometimes returns negative results for latency.

I’ve got some windows performance analyzer logs that I’m trying to understand.

I do need to run a few VM’s for use cases like a Ethereum node and a Home Automation server, I’ve got those running on two 5950x’s, one via proxmox, one esxi as I try and better design my homelab.

ThatGuyB · July 14, 2021, 2:34am

Wow. I’m not bragging, neither trying to insult you, but it feels strange seeing so much hardware thrown at a browser. I’m here using a RPi 8 GB and Firefox has about ~250 open tabs (they are mostly saved, not loaded, from all my previous sessions, about 30 are actually open). Loading times are “ok” (from my standpoint), but definitely not as fast as even an Ivy Bridge i5 laptop. But when money is involved (I’m guessing from the trading charts), I guess it makes sense to invest in something fast to save you time and money (in this case, I’d rather applaud you for trying to grow your wealth, while I’m here doing nothing aside from my day job). Still, I believe Linux would still do a better job at browsing the web.

Jolly · July 14, 2021, 2:48am

I got really fed up at my desktop slowing down. I’m ADHD to begin with, so waiting is pretty painful, and wanted to see just how nice of a desktop setup I could get. I pretty much live on my computer, so its both work and fun.

I multitask heavily- hence all the monitors, and if I’m a research session or theres something “interesting” happening in the markets, I end up with 500+ tabs easy.

Right now it’s a pretty light day, and windows task manager says 464 processes, 7668 threads, and 224042 handles. I split up a bit of my hw usage, so I have three monitors on my main desktop, and two being run from my esxi box. (I learned that windows doesnt like mixed monitor refresh rates, and I have 3 4k monitors that are 144hz, and 2 that are 120hz capable)

I know I overbought on hardware, but also wanted a degree of redundancy and backups, with the idea being if one computer was down for whatever reason I could switch to another .

wendell · July 14, 2021, 2:14pm

If you don’t mind not being able to drag windows between monitors you might enjoy using software that let’s you have one keyboard/mouse control multiple machines. Like Microsoft’s garage mouse etc. Then have stuff side by side and then multiple separate PC’s are doing things for you with independent perf characteristics.

greg_at_redhat · July 15, 2021, 4:27pm

That’s very interesting and would love to work with you on doing a TRPro vs. Milan DIAB workload test comparison.

From a resource POV

2T / 4GB for the Host OS
16T / 60GB for the Cloud Services
40-60T / 256 GB for instances
Storage Cpu (see below) / 5GB per OSD

Storage gets a bit trickier the resource requirements to drive Ceph with NVMe are a bit steeper.

Generally the math looks like on a per OSD basis, when we talk about real deployments across more than 1 host:

sockets * cores * ghz * 1500 = rbd iops @ 4k randrw

Admittedly this is going to be different (in unnaturally performant ways) when running in an AIO as we are not going over the wire to other storage nodes.

Further you need to carve up extremely fast storage like NVMe into 2 or 4 OSDs to saturate the device.

On a TR-Pro 3995WX with 10 core allocation for a single OSD we would have an expected small block performance profile of:

1 * 10 * 4.2 = 63,000 = randrw IO/s @ 4k

This is where inherently the TR-Pro running 64c/128T and higher boost clock might have a significant advantage over a 32 core Milan as Ceph scales monstrously the more cores you throw at it (normally achieved via horizontal node scaling but in this case vertically via core/thread density).

Given the nature of the all-in-one DIAB, I would probably configure 2 OSDs per NVMe at 50% device allocation for use by cinder and use the other 50% of the NVMe with LVM2 as an alternate storage provider. This would provide an interesting contrast between the raw block performance and the simulated storage network of Ceph (which will be understandably slower).

Given that this is also a unique corner case, I was advised this morning that disabling the CRC we do on the messenger data and header in Ceph might also yield a modest (8-14%) improvement in performance (we would lean on the fact that the system has ECC memory and data doesn’t leave the host:

ms_crc_data = false
ms_crc_header = false

From a testing POV

I know internally we use BROWBEAT / code to do synthetic performance testing of the private cloud. The tool can be given instructions such as “go create 100 tenants with between 1-5 networks with 2 to 7 subnets, then load create some Cirros instances and do network tests between them”, and it will queue it up and record api cal performance, network throughput etc.

Practically, using HEAT / ANSIBLE / CLOUDBASE-INIT we can run any kind of customised orchestrated deployment on Linux or Windows (10,2012r2,2016,2019).

For example if we wanted to test fileshare performance of a dozen or more windows guests talking to Manila providing either NFS mount or CephFS native mounts via ceph-dokan we could:

Create a heat stack that
- created a tenant + network + subnet
- created a file share via the Manila service
- spawned a number of windows vms
  - the windows vms would use cloudbase-init instructing them to map the drive and then touch a file on the share indicating their presence
  - when each instance has touched a file, a scripted process that is looking for the presence of a specific amount of those files which would then signal the instances to start doing their IO testing (eg dont start the race until all participants are ready).
  - Write the telemetry of the file operations to the file share and shutdown (eg. Z:/iorace/.txt

Since this whole test scenario is managed via the stack we could delete the stack and re-run it at will with different parameters for node count, t-shirt size (flavor of cpu/mem) or back end. This approach can be extrapolated into any number of other tests (eg take a similar approach only have the instances be Linux and have them run the Phoronix test suite). In flight orchestration can also be triggered by ansible (called directly from heat via OS::Heat::SoftwareConfig or out of band from another node or the host).

A fellow red hatter had an excellent write up on this kind of stack integration HERE

From a layered platform pov (openshift)

We have RIPSAW benchmark operator which has an extensive collection of performance tests that run on OpenShift / Kubernetes and kube-burner.

That was a lot of info, and there is probably more to discuss but let me pause for now and let you digest that.

Bumperdoo · July 15, 2021, 5:29pm

Booo… P5800X is insane… I got one after listening to @wendell in my WRX80 Gigabyte 3975WX TPro build. 400GB version designated as a Level II cache using PrimoCache. Zero waiting… love it! Good luck finding one though.

GigaBusterEXE · July 15, 2021, 5:34pm

From what I’ve heard your iop performance drastically decreases the more simultaneous operations you are performing, I think it’s called que depth, so it’s not as simple as dividing the iops by the number of instances

So it might be worth it to invest in more than one SSD

I’m not a professional

greg_at_redhat · July 15, 2021, 6:41pm

Queuing on NVMe is a different paradigm compared to queuing with older protocols like SATA/SAS.

Western digital has a pretty good writeup on the differences between NVMe and SATA/SAS queues:

To understand IO queues, let us first establish some interface baselines. Both SATA and SAS interfaces support a single queue with 32 and 256 commands respectively, limited for capacity and performance scaling. On the other hand, NVMe offers more with 64K queues and 64K commands per queue. That’s a difference of a staggering magnitude.
– NVMe™ Queues Explained - Western Digital Corporate Blog

That aside, I had mentioned that there would be multiple NVMe devices in my demo rig based on using 2-3 or 4 slot bifurcation boards. From a Ceph POV this would allow for more OSDs/placement from a CRUSH map perspective which for my purposes would make doing deeper dives into how that works easier.