Slurm setup

Hi all,

I’m attempting to set up a simple cluster with one head node / front end and 16 compute nodes. All nodes are identical. I have seemingly successfully set up slurmctld on the head node and slurmd on all compute nodes. “sinfo” from the head node indeed indicates that all 16 compute nodes are visible.

The issue comes when trying to run a job through Slurm. I get errors like the following (offending line highlighted):

[2023-11-21T16:41:41.428] [13.batch] debug:  cgroup/v1: _oom_event_monitor: started.
[2023-11-21T16:41:41.430] [13.batch] debug2: hwloc_topology_load
[2023-11-21T16:41:41.485] [13.batch] debug2: hwloc_topology_export_xml
[2023-11-21T16:41:41.489] [13.batch] debug2: Entering _setup_normal_io
*[2023-11-21T16:41:41.489] [13.batch] error: Could not open stdout file /home/username/jobs/test/output.13: No such file or directory*
[2023-11-21T16:41:41.489] [13.batch] debug2: Leaving  _setup_normal_io
[2023-11-21T16:41:41.489] [13.batch] error: IO setup failed: No such file or directory
[2023-11-21T16:41:41.489] [13.batch] debug2: step_terminate_monitor will run for 60 secs
[2023-11-21T16:41:41.511] [13.batch] debug:  signaling condition
[2023-11-21T16:41:41.511] [13.batch] debug2: step_terminate_monitor is stopping
[2023-11-21T16:41:41.511] [13.batch] debug2: _monitor exit code: 0
[2023-11-21T16:41:41.512] [13.batch] error: called without a previous init. This shouldn't happen!
[2023-11-21T16:41:41.512] [13.batch] debug:  jobacct_gather/cgroup: fini: Job accounting gather cgroup plugin unloaded
[2023-11-21T16:41:41.512] [13.batch] error: called without a previous init. This shouldn't happen!
[2023-11-21T16:41:41.512] [13.batch] debug:  task/cgroup: fini: Tasks containment cgroup plugin unloaded
[2023-11-21T16:41:41.512] [13.batch] debug2: Before call to spank_fini()
[2023-11-21T16:41:41.512] [13.batch] debug2: After call to spank_fini()
[2023-11-21T16:41:41.512] [13.batch] job 13 completed with slurm_rc = 0, job_rc = 256
[2023-11-21T16:41:41.512] [13.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
[2023-11-21T16:41:41.514] [13.batch] debug2:   false, shutdown
[2023-11-21T16:41:41.514] [13.batch] debug:  Message thread exited
[2023-11-21T16:41:41.514] [13.batch] done with job

I believe this is because the home directory on the head node is not network-shared with the compute nodes. I’m not quite sure how to set that up. I’ve found some troubleshooting for users employing a NAS, but I would like to avoid that if possible and simply use the head node for this purpose.

Does anyone have any experience here?

Happy to provide more information regarding configuration and setup as necessary.

How performant does the network pipe need to be?

You could just try creating the /home/username/jobs/test folder on each of the 16 nodes; this would store all logs locally on whichever node runs the job, and it is by far the fastest option.
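If you go that route, something like this from the head node would do it (a rough sketch; the node names are placeholders and it assumes passwordless SSH to every node):

# hypothetical hostnames node01..node16; adjust to match your cluster
for n in node{01..16}; do
    ssh "$n" mkdir -p /home/username/jobs/test
done

Slurm will not create missing directories for the stdout/stderr paths, so the directory has to exist on whichever node ends up running the job.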

As for network shares, I think the easiest way to do it is via SSHFS. On the host, install an SSH server. On each client, apt install sshfs and then simply use these commands:

mkdir ~/jobs
sshfs -o allow_other,default_permissions usr@host:~/jobs ~/jobs 

This is slow, but it works if all you need is some logging traffic. If this does not solve your problem, here is how to mount NFS with a systemd mount unit:
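A rough sketch of what that looks like, assuming a hypothetical export headnode:/home/username/jobs and saving the unit as /etc/systemd/system/home-username-jobs.mount (the file name has to match the mount path, with slashes turned into dashes):

[Unit]
Description=Shared job directory over NFS
After=network-online.target
Wants=network-online.target

[Mount]
# placeholder server and path - substitute your own export
What=headnode:/home/username/jobs
Where=/home/username/jobs
Type=nfs4
Options=rw,_netdev

[Install]
WantedBy=multi-user.target

Then enable it with:

systemctl daemon-reload
systemctl enable --now home-username-jobs.mount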

Node-to-node communication and I/O are a definite bottleneck for the use-case on this server, so the faster the better…

Right, I believe this will mostly be used for job pipelining and logging, so start with SSHFS, and if that turns out to be too slow, go with NFS.

“Slow” in this case means ~30 Mbps vs ~800 Mbps transfer speeds on a 1GbE interface.
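If you want to sanity-check the numbers on your own hardware, a crude test is to push a gigabyte through the mount and let dd report the rate (the path is a placeholder):

dd if=/dev/zero of=/path/to/mount/ddtest bs=1M count=1024 conv=fsync
rm /path/to/mount/ddtest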

Yes, too slow for this application. I must either set up NFS on the head node or buy a NAS and do it that way. Would steps 2-4 from here: MpichCluster - Community Help Wiki be useful in this regard?

Have you tried it? The idea is that you use SSHFS to handle logs and initial configuration, and handle the rest of the signalling with some other protocol. If you are just transmitting five log messages a second, and that is all that is required of SSHFS, there is no need for something more complex. SSHFS does not scale to hundreds of nodes, but that does not mean it is useless for what you want to do.

Otherwise, just set up the head node as an NFS server, along the lines of the instructions above. Probably not the safest option, but definitely the fastest remote-FS option.

Sorry, the error shows only the initial log file because that is the point at which the simulation is killed. I assume this is going to be a problem for all output, which we write frequently and which can be very large. So, if we’re going to use a different protocol for all of the output, we might as well do the same for the logs.

The more I think about it, the more I think I just need to bite the bullet, purchase a NAS, and set it up as an NFS share.

What about setting up a Ceph cluster using spare storage on each of the nodes?

It adds a bit more overhead but provides redundant object storage that scales well with more nodes and with differently sized disks.
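If you want to see what that involves, a minimal cephadm bootstrap looks roughly like this (the IP and hostnames are placeholders, and each node needs a container runtime installed):

# on the first node, which becomes the monitor/manager
cephadm bootstrap --mon-ip 10.0.0.1
# join the remaining nodes
ceph orch host add node02
# create a CephFS filesystem the compute nodes can mount
ceph fs volume create shared

It is definitely more moving parts than NFS, though.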

Well, I found a simple Qnap NAS for a good price, so I just went ahead and ordered that. I think from there I know how to set it up properly…

Overall, the NAS with a shared NFS folder did the trick.

Start by installing autofs via apt.

I added the following to auto.master:
/- /etc/auto.nfs --timeout=0

And created auto.nfs with the following:
/mnt/point/on/nodes -fstype=nfs4,rw nashostname:/nasdirectory

Then simply start the autofs service:
service autofs start

Executing this on all nodes gives a common write point on the NAS, which works well with Slurm. The directory being shared on the NAS has to be NFS-enabled.
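If the mount ever fails to appear, showmount (from the nfs-common package) is a quick way to confirm the NAS is actually exporting the directory:

showmount -e nashostname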

I think the same could be achieved by setting up the head/main node as an NFS server, but this has a bit more overhead. The NAS did the trick.
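For reference, the head-node version would look roughly like this on a Debian/Ubuntu head node (the subnet and path are placeholders):

# on the head node
apt install nfs-kernel-server
echo "/home/username/jobs 10.0.0.0/24(rw,sync,no_subtree_check)" >> /etc/exports
exportfs -ra

The compute nodes would then point their auto.nfs entry (or fstab) at headnode:/home/username/jobs instead of the NAS.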

Systemd mounts have autofs capability built in if you ever need to make the stack simpler :slight_smile:
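For example, a single fstab line (same placeholder paths as above) can replace the auto.master/auto.nfs pair:

nashostname:/nasdirectory  /mnt/point/on/nodes  nfs4  rw,_netdev,x-systemd.automount  0  0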

I’m surprised NFS is enough and you didn’t need GlusterFS or Ceph.

Seems to be working fine, and we don’t have that many writes (compared to some codes), so even if it is suboptimal in terms of performance, read/write is not our bottleneck. The code we use is primarily CPU-limited, so the more cores the better.

One thing that has been bugging me, though, is that the lights on the network switch indicate “1G/100M” into the NAS, even though it has 10G capabilities. I’m using the same network cable throughout, and the nodes all light up as 10G… Again, not super limiting given the frequency at which we write and the sizes involved, but it is definitely slowing us down.
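For what it’s worth, ethtool on the nodes confirms what each interface actually negotiated (the interface name is a placeholder):

ethtool eth0 | grep -i speed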

Could be a bad config, bad cable, bad port…
Are you using InfiniBand? 10GBASE-T? SFP+?

Netgear XS724EM: 10-Gigabit/Multi-Gigabit Plus - XS724EM
QNAP TS-431XeU: TS-431XeU - Features | QNAP (US)

EDIT: …and no, I am not using the SFP+ port, and that’s my problem. It would have helped for me to read the manual.

SFP+ to 10GBASE-T modules are very common, but make sure to pick a compatible one, as I don’t know whether QNAP has any power limitations or a vendor lock on the SFP+ port.

Also, not related to Slurm, but I am interested in your input on this NAS. I’m moving everything “prod” in my home to a network cabinet, and I would like a small thing like that if I can make it run ZFS.

Yeah, this is new territory for me - not sure what SFP+ module to buy… Would something like this be appropriate, and then CAT6 from the SFP+ module to the switch?

EDIT: Hit “go” too early: what would you like to know? So far, it was a breeze setting up and has worked quite nicely for our needs. Happy to describe further.

The 10Gtek ones are generally decent quality. These SFP+ to RJ45 adapters run hot, so if you’re using a bunch of them, make sure you have cooling on the front of the switch. You also need to make sure the switch has enough power to drive all the adapters.

They are generally used for one or two connections, not many, since the cost of buying 16 of them would be more than the cost of buying a 16-port 10G switch.

@infinitevalence: I think the setup would be plugging this into the SFP+ port on the NAS and running a CAT6 cable between the NAS and the switch (RJ45 on both ends). So the only place we would need one is on the NAS. Does that sound appropriate?

should work

what would you like to know?

Well, I checked, and sadly it’s not the “pro” grade that runs ZFS. Still, did you get the 1 GB or the 8 GB RAM model? And how audible is the fan?