Compute server - installation

Hi,

Alert: quite a “server” newb here… so sorry for the long post and the stupid questions…

At work we received a new compute machine (mostly PDE, CFD…), and I was tasked with deploying it. It will primarily be used as a compute node, but might also host virtual machines or anything else we throw at it, so we won't over-specialize the deployment for optimal performance, but rather aim for ease of use (even for novices)…

The idea is to deploy Ubuntu with an NFS-shared “/home”, since multiple systems are already deployed like this. We really don't want researchers to roll their own dependencies, so “apt install” from every guide on the internet should just work :slight_smile: . We'll stay on 18.04 until 20.04 is certified and works correctly (the current daily build is not OK)…

We’d like to have some redundancy for the main installation, but maximum performance for the scratch disk (a rough sketch of the software RAID commands follows the layout):

“/boot” <- ~2 GB on an internal SD card or USB stick (another partition might also be used for rarely-written configuration and backups…)

The NVMes would each contain two partitions:
“/” <- 500 GB software RAID 1 (500 GB from each NVMe)
“/scratch” <- 1 TB software RAID 0 (500 GB from each NVMe), used for projects where NFS I/O might be the bottleneck, or for any temp data…

“/home” <- NFS mounted
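For concreteness, I assume the software RAID part would be built roughly like this (a sketch only; the /dev/nvme… device names and partition numbers are just how I expect the drives to enumerate):

```
# Assumes both NVMes are partitioned identically beforehand:
#   nvme0n1p1 / nvme1n1p1 -> ~500 GB each (mirror for "/")
#   nvme0n1p2 / nvme1n1p2 -> ~500 GB each (stripe for "/scratch")

# RAID 1 mirror for the root filesystem
sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 \
    /dev/nvme0n1p1 /dev/nvme1n1p1

# RAID 0 stripe for scratch (no redundancy, maximum throughput)
sudo mdadm --create /dev/md1 --level=0 --raid-devices=2 \
    /dev/nvme0n1p2 /dev/nvme1n1p2

# Filesystems, then persist the arrays so they assemble at boot
sudo mkfs.ext4 /dev/md0
sudo mkfs.ext4 /dev/md1
sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
sudo update-initramfs -u
```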

My questions:
Is such a configuration doable and sensible? Am I missing something obvious?

Is it possible to set up the NFS mount of “/home” during the Ubuntu installation, or must we change this afterwards? If it is possible, is there a recommended way?
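If it has to happen post-install, I assume it's just a matter of an /etc/fstab entry, something like this (server name and export path are invented for illustration):

```
# Append the NFS home mount to /etc/fstab (server name, export path
# and options are examples, not our real config)
echo 'nfs-server:/export/home  /home  nfs4  rw,hard,noatime,_netdev  0  0' \
    | sudo tee -a /etc/fstab
sudo mount /home
```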

What is the best way to do a complete system performance test? For a very basic test I ran the “classroom” scene in Blender, and it took 1 min 30 s, which seems a bit slow compared to the ~30 s that 2x 7742 systems got in reviews…
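For reference, the basic test was just a headless render of the benchmark scene, roughly:

```
# Render frame 1 of the classroom benchmark scene in the background
# (the path to the .blend file depends on where the demo files were unpacked)
blender -b classroom/classroom.blend -f 1
```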

Currently we share users between machines by manually (via scripts) copying passwd/shadow/… lines from node to node. This works, but leaves a lot to be desired… What is the most lightweight and best solution for this (NIS, IAR, hesiod, …)? LDAP seems a bit of an overkill for sharing a couple of files? All nodes are on a private subnet, so the “gateway server” should only allow sharing to that specific subnet…

Specs, if this is of any help: HP385 chassis, 2x Epyc 7702, 512 GB RAM, 2x 1 TB NVMe storage…

Also, for a couple of days I might be able to run benchmarks, if anybody is interested and makes a good suggestion… :slight_smile:

Thanks in advance for any comments or help…

Nice setup

Not sure why you want to separate /boot from /. It's not a common thing with a modern OS.

If you are running Ubuntu, just get a pair of data SSDs for / and boot straight from there. You can then keep all the NVMe for scratch space.

I assume the scratch data is not mission critical?


Hi, thanks for the reply.

I thought that separating /boot would allow easier troubleshooting in case one of the NVMes dies: plop the SD card out and into a card reader, reconfigure it to boot from the working NVMe, plop it back in, and restart… I haven't actually tested how hard it is to convince the OS to boot when the software RAID is degraded…
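Presumably, with /boot and the mdadm config on the SD card, the degraded mirror could be force-started from the surviving drive with something like this (untested; device names are examples):

```
# Start the root mirror degraded, with only the surviving member present
sudo mdadm --assemble --run /dev/md0 /dev/nvme0n1p1
```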

My idea is that if something happens to one of the NVMes, the node can be brought back fairly quickly (from the working NVMe), even if scratch space is limited (or nonexistent)… If the second NVMe fails while the first one is out of the system, we probably have bigger issues :slight_smile:

Critical data is on the NFS-mounted “/home” (properly backed up), so the scratch disk is just for computations that might need some temporary data/intermediate results/debugging. We expect 512 GB of RAM to be enough for our work, but in the worst case the scratch space might even be used for swap (very low chance). Final results should always be saved to NFS…

Edit: sorry, I forgot to address adding another drive. Due to company policies, adding hardware is a bit tricky (except USB sticks/SD cards, which are weirdly allowed)… Sadly out of my control…

As this is a production machine and you are doing real work, I'd think you'd be better off creating a backup copy of / (with /boot) once it is set up; then, if you have issues, just restore the backup. Safer than trying to fix it in situ.
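Even something as crude as imaging the root array to the NFS share while booted from live media would do (paths here are examples):

```
# Booted from live media so the filesystem is quiescent: image the
# root array to an (example) backup location, compressed
sudo dd if=/dev/md0 bs=4M status=progress | gzip > /mnt/backup/root-md0.img.gz
```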

@Mohimm Keeping user info consistent on multiple machines is a good task for system configuration tools such as Ansible, Puppet, & Chef. I prefer Ansible, but your mileage may vary. Any of these tools will take some time & effort to learn, but should pay off in the long run.
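For example, once Ansible can reach the nodes, a single ad-hoc command can push a consistent account everywhere (“compute” is a hypothetical inventory group, and the user details are made up):

```
# Ensure the same user exists with the same uid on every node in the
# "compute" inventory group (-b = become root)
ansible compute -b -m user -a "name=alice uid=1001 shell=/bin/bash state=present"
```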

Using NFS for /home deserves some careful thought; it isn’t clear whether your group is already doing that or it would be a new approach. Some aspects are touched on in these links, and Google search returns others:
https://serverfault.com/questions/19323/is-it-feasible-to-have-home-folder-hosted-with-nfs
https://askubuntu.com/questions/292461/placing-home-directory-on-nfs-server

I am not up to date on current best practices in this area, but a few things quickly come to mind:

  • NFS3 has very weak security, while NFS4 setup is complex enough that folks often revert to NFS3; at a minimum, restrict the export to your private subnet (see the sketch after this list).
  • NFS performance - is it still on the slow side? Will it be the bottleneck for your compute server? You may want to emphasize use of the scratch drive for computation, and moving data from & back to NFS-mounted storage.
  • Many programs make frequent reference to configuration files in /home/user. Will it always be convenient to share the same configuration files across all machines, even for different versions of Linux?
  • I have seen, long ago, lock file operations fail over NFS even though they were programmed correctly & worked perfectly on local filesystems. What is the current state of such issues for NFS?
  • Think through, maybe even practice, the plan for when NFS goes down. A lot of the system will suddenly stop working; will a sysadmin be able to log in and resolve the problem? What if “that one expert” is on vacation?
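On the security point above, at the very least the export on the NFS host can be restricted to the private compute subnet (subnet and path below are examples, not a recommendation for your exact setup):

```
# /etc/exports on the NFS host: only the private compute subnet may
# mount home (subnet and export path are examples)
echo '/export/home  10.0.0.0/24(rw,sync,no_subtree_check,root_squash)' \
    | sudo tee -a /etc/exports
sudo exportfs -ra   # apply without restarting the NFS server
```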

I doubt that exhausts the questions. For historical reasons, I tend to prefer small, local /home dirs and a large NFS-mounted shared dir for work files. Of course, that has problems of its own (and doesn’t solve all the above).

Good luck; please let us know of any interesting developments.

We are in a somewhat similar situation (rolling our own compute nodes for scientific computing), but “we” decided that NFS-mounted home dirs are an idea that will bite us in the ass.
The current solution is that all compute nodes mount the NFS4 share under /mnt/, every user has their own dir in the NFS share, and on every node every user has a symlink in their home dir that points to their folder in the NFS mount.
The users have been told that while their home dirs are not the same between nodes, their ~/nfs is. There are some minor inconveniences, like .bash_history not being synced between nodes, but overall it has worked fine so far and has tolerated NFS server outages quite well: all running jobs just got stuck in iowait (D state), as they could not write their output to the NFS share, but otherwise everything remained operable. All jobs resumed execution once the NFS server came back online.
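In concrete terms it looks roughly like this on each node (server name, export path and the user are placeholders):

```
# On every compute node: one shared NFS4 mount under /mnt
# (server name and export path are placeholders)
echo 'nfs-server:/export  /mnt/nfs  nfs4  rw,hard,_netdev  0  0' \
    | sudo tee -a /etc/fstab
sudo mkdir -p /mnt/nfs && sudo mount /mnt/nfs

# Per user: the home dir stays local; ~/nfs points at their shared folder
sudo mkdir -p /mnt/nfs/alice
sudo chown alice:alice /mnt/nfs/alice
sudo -u alice ln -s /mnt/nfs/alice /home/alice/nfs
```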

Thanks, I'll check the tools you suggested.
I know we are using some of these for other projects (not done by me), so I'll ask around about what prior experience we have in the lab.

As for NFS.
We are already doing that across 30+ machines… They are on their own subnet, with dedicated switches just for them:

  • master node (native /home)
  • ~30ish identical compute nodes (booting from the same system image)
  • 2 additional high-powered nodes (one existing + this new one)

There is a weird mishmash of Ubuntu versions, due to upgrade paths, but so far it seems OK. This is also the reason we will stick with Ubuntu, even if we leave some performance on the table… We are aware this is not ideal, but it has worked OK so far and presents a fairly standard interface to the users. At some point a complete overhaul will be needed, but it looks like we can stretch this configuration for a couple more nodes… One of the purposes of asking these questions is to prepare for the next upgrade cycle, gain some experience, and have a plan already in place when that day comes…

So far we have had few issues with the NFS-mounted home. Performance has been OK-ish: it's not the fastest, especially if all 30 nodes start hammering the NFS server, but our work is more compute-bound, so this only shows when computations start.

@Methylzero This sounds like something we might want to try if we overhaul our system; it seems a bit more resilient than what we have now, so I'll keep it in mind.

So just a quick update, for my additional logging :slight_smile:

So far it has worked quite OK, but when Ubuntu 20.04 comes out, we will slightly change our config:

  • UEFI boot; /boot and the md config will be kept on the SD card (see next point)
  • both NVMes will be RAID 0 -> reinstalling the system and configuration is quick enough, and the chance of “critical” downtime low enough, that we will simplify this
  • the NFS-ed /home dir works OK-ish, with some caveats:
    • /home is shared with nodes running Ubuntu 12.04 (don't ask… will move to 20.04), 16.04, 18.04, and this machine (19.10, will upgrade)
    • Matlab was throwing some fits because the compiler on 19.10 was too new (upgrading to Matlab 2020a will hopefully improve this)
    • manually copying users is OK, but I still want to simplify this, because testing whether everything works is not for my paranoid OCD head…
    • with multiple terminals open, I accidentally reset the NFS host machine… (molly-guard is now installed there… :slight_smile: ) After the host came back up, all machines seemed to continue their work. I have no idea what black magic went on, but it seems that case was handled OK-ish… Will not re-test…
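(For anyone who hasn't met it: molly-guard is a small package that makes reboot/shutdown over SSH ask for the hostname first.)

```
# Install the reboot safety net on the NFS host
sudo apt install molly-guard
```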

So far I'm quite satisfied… The next plans are to get faster networking and to move the /home host to a faster & larger filesystem array… Otherwise it seems we will continue adding nodes like this.

  • There is a bit of weird kernel scheduling going on when the BIOS is set to 1 NUMA node per CPU (now set to 4 per CPU, which seems to work a bit smarter). So I hope that 20.04 with HWE will lift performance a bit higher as the months tick by…
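If I'm not mistaken, that BIOS option is AMD's NPS (“NUMA nodes per socket”) setting; after changing it, the kernel's view can be sanity-checked with:

```
# How many NUMA nodes the kernel sees, and which CPUs belong to each
numactl --hardware
lscpu | grep -i numa
```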

As a side note: the 2x Epyc 7702 seems to consolidate our ~35-CPU (older Xeons) node cluster and make it obsolete… :slight_smile: The next step is to try to convince the boss to get a 3990X, to see frequency scaling on the problems where the 7702 seems a bit held back…
