New Slurm issues (nodes dropping/down unexpectedly)

Posted this over on Server Fault, but no bites there… maybe someone here can untangle this one. Lots of logs below, sorry in advance for the long post. I’m a relatively naive user of Slurm, so I may need a bit of hand-holding on this one.

I’ve set up a cluster using Slurm, consisting of a head node, 16 compute nodes, and a NAS providing NFSv4 shared storage. I recently installed Slurm on Ubuntu 22.04 via apt (sinfo -V reports slurm-wlm 21.08.5). I’ve tested with some single-node and multi-node jobs, and jobs run to completion as expected. However, for some simulations, certain nodes keep changing state to down midway through. It is the same two nodes that display this behavior, though seemingly at random. It happens more often than not, though I believe a few simulations have finished on these nodes. On the nodes whose status changes to down, the slurmd daemon is still active; that is, whatever failure is happening is not the daemon going down.

Overall: Why are these nodes terminating jobs and setting their state to down?
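
For reference, these are the standard Slurm commands I’ve been using from the head node to inspect node state:

# Why did slurmctld mark nodes down/drained?
sinfo -R

# Full detail for one node, including State= and Reason=
scontrol show node cn4

# Return a node to service once it looks healthy (run as root/SlurmUser)
scontrol update NodeName=cn4 State=RESUME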

Detailed info: I’ve checked the slurmd log on one of the nodes that goes down, and this is what we get (from the approximate time of job submission to failure / node down). Note that this is for a job (job ID 64 here) submitted to 4 nodes with 64 processors per node:

[2023-12-07T16:48:29.487] [64.extern] debug2: setup for a launch_task
[2023-12-07T16:48:29.487] [64.extern] debug2: hwloc_topology_init
[2023-12-07T16:48:29.491] [64.extern] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
[2023-12-07T16:48:29.493] [64.extern] debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-07T16:48:29.494] [64.extern] debug:  cgroup/v1: init: Cgroup v1 plugin loaded
[2023-12-07T16:48:29.498] [64.extern] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2023-12-07T16:48:29.498] [64.extern] debug2: profile signaling type Task
[2023-12-07T16:48:29.499] [64.extern] debug:  Message thread started pid = 4176
[2023-12-07T16:48:29.503] [64.extern] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffff
[2023-12-07T16:48:29.507] [64.extern] debug:  task/cgroup: init: core enforcement enabled
[2023-12-07T16:48:29.507] [64.extern] debug:  task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:257579M allowed:100%(enforced), swap:0%(permissive), max:100%(257579M) max+swap:100%(515158M) min:30M kmem:100%(257579M permissive) min:30M swappiness:0(unset)
[2023-12-07T16:48:29.507] [64.extern] debug:  task/cgroup: init: memory enforcement enabled
[2023-12-07T16:48:29.509] [64.extern] debug:  task/cgroup: task_cgroup_devices_init: unable to open /etc/slurm/cgroup_allowed_devices_file.conf: No such file or directory
[2023-12-07T16:48:29.509] [64.extern] debug:  task/cgroup: init: device enforcement enabled
[2023-12-07T16:48:29.509] [64.extern] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
[2023-12-07T16:48:29.510] [64.extern] cred/munge: init: Munge credential signature plugin loaded
[2023-12-07T16:48:29.510] [64.extern] debug:  job_container/tmpfs: init: job_container tmpfs plugin loaded
[2023-12-07T16:48:29.510] [64.extern] debug:  job_container/tmpfs: _read_slurm_jc_conf: Reading job_container.conf file /etc/slurm/job_container.conf
[2023-12-07T16:48:29.513] [64.extern] debug2: _spawn_job_container: Before call to spank_init()
[2023-12-07T16:48:29.513] [64.extern] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2023-12-07T16:48:29.513] [64.extern] debug:  /etc/slurm/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/*.conf"
[2023-12-07T16:48:29.513] [64.extern] debug2: _spawn_job_container: After call to spank_init()
[2023-12-07T16:48:29.555] [64.extern] debug:  task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-63'
[2023-12-07T16:48:29.555] [64.extern] debug:  task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-63'
[2023-12-07T16:48:29.555] [64.extern] debug:  task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0-63'
[2023-12-07T16:48:29.555] [64.extern] debug:  task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0-63'
[2023-12-07T16:48:29.556] [64.extern] task/cgroup: _memcg_initialize: job: alloc=0MB mem.limit=257579MB memsw.limit=unlimited
[2023-12-07T16:48:29.556] [64.extern] task/cgroup: _memcg_initialize: step: alloc=0MB mem.limit=257579MB memsw.limit=unlimited
[2023-12-07T16:48:29.556] [64.extern] debug:  cgroup/v1: _oom_event_monitor: started.
[2023-12-07T16:48:29.579] [64.extern] debug2: adding task 3 pid 4185 on node 3 to jobacct
[2023-12-07T16:48:29.582] debug2: Finish processing RPC: REQUEST_LAUNCH_PROLOG
[2023-12-07T16:48:29.830] debug2: Start processing RPC: REQUEST_LAUNCH_TASKS
[2023-12-07T16:48:29.830] debug2: Processing RPC: REQUEST_LAUNCH_TASKS
[2023-12-07T16:48:29.830] launch task StepId=64.0 request from UID:1000 GID:1000 HOST:10.115.79.9 PORT:48642
[2023-12-07T16:48:29.830] debug:  Checking credential with 868 bytes of sig data
[2023-12-07T16:48:29.830] task/affinity: lllp_distribution: JobId=64 manual binding: none,one_thread
[2023-12-07T16:48:29.830] debug:  Waiting for job 64's prolog to complete
[2023-12-07T16:48:29.830] debug:  Finished wait for job 64's prolog to complete
[2023-12-07T16:48:29.839] debug2: debug level read from slurmd is 'debug2'.
[2023-12-07T16:48:29.839] debug2: read_slurmd_conf_lite: slurmd sent 8 TRES.
[2023-12-07T16:48:29.839] debug:  acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2023-12-07T16:48:29.839] debug:  acct_gather_Profile/none: init: AcctGatherProfile NONE plugin loaded
[2023-12-07T16:48:29.839] debug:  acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2023-12-07T16:48:29.839] debug:  acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2023-12-07T16:48:29.839] debug2: Received CPU frequency information for 64 CPUs
[2023-12-07T16:48:29.840] debug:  switch/none: init: switch NONE plugin loaded
[2023-12-07T16:48:29.840] debug:  switch Cray/Aries plugin loaded.
[2023-12-07T16:48:29.840] [64.0] debug2: setup for a launch_task
[2023-12-07T16:48:29.840] [64.0] debug2: hwloc_topology_init
[2023-12-07T16:48:29.845] [64.0] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
[2023-12-07T16:48:29.846] [64.0] debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-07T16:48:29.847] [64.0] debug:  cgroup/v1: init: Cgroup v1 plugin loaded
[2023-12-07T16:48:29.851] [64.0] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2023-12-07T16:48:29.852] [64.0] debug2: profile signaling type Task
[2023-12-07T16:48:29.852] [64.0] debug:  Message thread started pid = 4188
[2023-12-07T16:48:29.852] debug2: Finish processing RPC: REQUEST_LAUNCH_TASKS
[2023-12-07T16:48:29.857] [64.0] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffff
[2023-12-07T16:48:29.861] [64.0] debug:  task/cgroup: init: core enforcement enabled
[2023-12-07T16:48:29.861] [64.0] debug:  task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:257579M allowed:100%(enforced), swap:0%(permissive), max:100%(257579M) max+swap:100%(515158M) min:30M kmem:100%(257579M permissive) min:30M swappiness:0(unset)
[2023-12-07T16:48:29.861] [64.0] debug:  task/cgroup: init: memory enforcement enabled
[2023-12-07T16:48:29.863] [64.0] debug:  task/cgroup: task_cgroup_devices_init: unable to open /etc/slurm/cgroup_allowed_devices_file.conf: No such file or directory
[2023-12-07T16:48:29.863] [64.0] debug:  task/cgroup: init: device enforcement enabled
[2023-12-07T16:48:29.863] [64.0] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
[2023-12-07T16:48:29.863] [64.0] cred/munge: init: Munge credential signature plugin loaded
[2023-12-07T16:48:29.863] [64.0] debug:  job_container/tmpfs: init: job_container tmpfs plugin loaded
[2023-12-07T16:48:29.863] [64.0] debug:  mpi type = none
[2023-12-07T16:48:29.863] [64.0] debug2: Before call to spank_init()
[2023-12-07T16:48:29.863] [64.0] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2023-12-07T16:48:29.864] [64.0] debug:  /etc/slurm/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/*.conf"
[2023-12-07T16:48:29.864] [64.0] debug2: After call to spank_init()
[2023-12-07T16:48:29.864] [64.0] debug:  mpi type = (null)
[2023-12-07T16:48:29.864] [64.0] debug:  mpi/none: p_mpi_hook_slurmstepd_prefork: mpi/none: slurmstepd prefork
[2023-12-07T16:48:29.864] [64.0] error: cpu_freq_cpuset_validate: cpu_bind string is null
[2023-12-07T16:48:29.883] [64.0] debug:  task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-63'
[2023-12-07T16:48:29.883] [64.0] debug:  task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-63'
[2023-12-07T16:48:29.883] [64.0] debug:  task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0-63'
[2023-12-07T16:48:29.883] [64.0] debug:  task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0-63'
[2023-12-07T16:48:29.883] [64.0] task/cgroup: _memcg_initialize: job: alloc=0MB mem.limit=257579MB memsw.limit=unlimited
[2023-12-07T16:48:29.884] [64.0] task/cgroup: _memcg_initialize: step: alloc=0MB mem.limit=257579MB memsw.limit=unlimited
[2023-12-07T16:48:29.884] [64.0] debug:  cgroup/v1: _oom_event_monitor: started.
[2023-12-07T16:48:29.886] [64.0] debug2: hwloc_topology_load
[2023-12-07T16:48:29.918] [64.0] debug2: hwloc_topology_export_xml
[2023-12-07T16:48:29.922] [64.0] debug2: Entering _setup_normal_io
[2023-12-07T16:48:29.922] [64.0] debug2: io_init_msg_write_to_fd: entering
[2023-12-07T16:48:29.922] [64.0] debug2: io_init_msg_write_to_fd: msg->nodeid = 2
[2023-12-07T16:48:29.922] [64.0] debug2: io_init_msg_write_to_fd: leaving
[2023-12-07T16:48:29.923] [64.0] debug2: Leaving  _setup_normal_io
[2023-12-07T16:48:29.923] [64.0] debug levels are stderr='error', logfile='debug2', syslog='quiet'
[2023-12-07T16:48:29.923] [64.0] debug:  IO handler started pid=4188
[2023-12-07T16:48:29.925] [64.0] starting 1 tasks
[2023-12-07T16:48:29.925] [64.0] task 2 (4194) started 2023-12-07T16:48:29
[2023-12-07T16:48:29.926] [64.0] debug:  Setting slurmstepd oom_adj to -1000
[2023-12-07T16:48:29.926] [64.0] debug:  job_container/tmpfs: _read_slurm_jc_conf: Reading job_container.conf file /etc/slurm/job_container.conf
[2023-12-07T16:48:29.959] [64.0] debug2: adding task 2 pid 4194 on node 2 to jobacct
[2023-12-07T16:48:29.960] [64.0] debug:  Sending launch resp rc=0
[2023-12-07T16:48:29.961] [64.0] debug:  mpi type = (null)
[2023-12-07T16:48:29.961] [64.0] debug:  mpi/none: p_mpi_hook_slurmstepd_task: Using mpi/none
[2023-12-07T16:48:29.961] [64.0] debug:  task/affinity: task_p_pre_launch: affinity StepId=64.0, task:2 bind:none,one_thread
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_CPU no change in value: 18446744073709551615
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_FSIZE no change in value: 18446744073709551615
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_DATA no change in value: 18446744073709551615
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: RLIMIT_STACK  : max:inf cur:inf req:8388608
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_STACK succeeded
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: RLIMIT_CORE   : max:inf cur:inf req:0
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_CORE succeeded
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_RSS no change in value: 18446744073709551615
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: RLIMIT_NPROC  : max:1030021 cur:1030021 req:1030020
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: RLIMIT_NOFILE : max:131072 cur:4096 req:1024
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: RLIMIT_MEMLOCK: max:inf cur:inf req:33761472512
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_MEMLOCK succeeded
[2023-12-07T16:48:29.961] [64.0] debug2: _set_limit: conf setrlimit RLIMIT_AS no change in value: 18446744073709551615
[2023-12-07T16:48:59.498] [64.extern] debug2: profile signaling type Task
[2023-12-07T16:48:59.852] [64.0] debug2: profile signaling type Task
[2023-12-07T16:51:03.457] debug:  Log file re-opened
[2023-12-07T16:51:03.457] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.457] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.457] debug2: hwloc_topology_init
[2023-12-07T16:51:03.462] debug2: hwloc_topology_load
[2023-12-07T16:51:03.480] debug2: hwloc_topology_export_xml
[2023-12-07T16:51:03.482] debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-07T16:51:03.483] debug2: hwloc_topology_init
[2023-12-07T16:51:03.484] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
[2023-12-07T16:51:03.485] debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-07T16:51:03.485] topology/none: init: topology NONE plugin loaded
[2023-12-07T16:51:03.485] route/default: init: route default plugin loaded
[2023-12-07T16:51:03.485] debug2: Gathering cpu frequency information for 64 cpus
[2023-12-07T16:51:03.487] debug:  Resource spec: No specialized cores configured by default on this node
[2023-12-07T16:51:03.487] debug:  Resource spec: Reserved system memory limit not configured for this node
[2023-12-07T16:51:03.490] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffff
[2023-12-07T16:51:03.490] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
[2023-12-07T16:51:03.490] debug:  auth/munge: init: Munge authentication plugin loaded
[2023-12-07T16:51:03.490] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2023-12-07T16:51:03.490] debug:  /etc/slurm/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/*.conf"
[2023-12-07T16:51:03.491] cred/munge: init: Munge credential signature plugin loaded
[2023-12-07T16:51:03.491] slurmd version 21.08.5 started
[2023-12-07T16:51:03.491] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2023-12-07T16:51:03.491] debug:  job_container/tmpfs: init: job_container tmpfs plugin loaded
[2023-12-07T16:51:03.491] debug:  job_container/tmpfs: _read_slurm_jc_conf: Reading job_container.conf file /etc/slurm/job_container.conf
[2023-12-07T16:51:03.492] debug:  job_container/tmpfs: container_p_restore: job_container.conf read successfully
[2023-12-07T16:51:03.492] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 58 not found, deleting the namespace
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/58/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 56 not found, deleting the namespace
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/56/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.492] error: _restore_ns: failed to connect to stepd for 64.
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/64/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 54 not found, deleting the namespace
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/54/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 59 not found, deleting the namespace
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/59/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] error: Encountered an error while restoring job containers.
[2023-12-07T16:51:03.492] error: Unable to restore job_container state.
[2023-12-07T16:51:03.493] debug:  switch/none: init: switch NONE plugin loaded
[2023-12-07T16:51:03.493] debug:  switch Cray/Aries plugin loaded.
[2023-12-07T16:51:03.493] slurmd started on Thu, 07 Dec 2023 16:51:03 -0600
[2023-12-07T16:51:03.493] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.494] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.494] CPUs=64 Boards=1 Sockets=1 Cores=64 Threads=1 Memory=257579 TmpDisk=937291 Uptime=14 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2023-12-07T16:51:03.494] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.494] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.494] debug:  acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2023-12-07T16:51:03.494] debug:  acct_gather_Profile/none: init: AcctGatherProfile NONE plugin loaded
[2023-12-07T16:51:03.495] debug:  acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2023-12-07T16:51:03.495] debug:  acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2023-12-07T16:51:03.495] debug2: No acct_gather.conf file (/etc/slurm/acct_gather.conf)
[2023-12-07T16:51:03.499] debug:  _handle_node_reg_resp: slurmctld sent back 8 TRES.
[2023-12-07T16:51:03.500] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2023-12-07T16:51:03.500] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2023-12-07T16:51:03.500] debug:  _rpc_terminate_job: uid = 64030 JobId=64
[2023-12-07T16:51:03.500] debug:  credential for job 64 revoked
[2023-12-07T16:51:03.500] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.500] debug:  signal for nonexistent StepId=64.extern stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.500] debug:  signal for nonexistent StepId=64.0 stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug2: No steps in jobid 64 were able to be signaled with 998
[2023-12-07T16:51:03.500] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.500] debug:  signal for nonexistent StepId=64.extern stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.500] debug:  signal for nonexistent StepId=64.0 stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug2: No steps in jobid 64 were able to be signaled with 18
[2023-12-07T16:51:03.500] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.4294967292: Connection refused
[2023-12-07T16:51:03.500] debug:  signal for nonexistent StepId=64.extern stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_64.0: Connection refused
[2023-12-07T16:51:03.500] debug:  signal for nonexistent StepId=64.0 stepd_connect failed: Connection refused
[2023-12-07T16:51:03.500] debug2: No steps in jobid 64 were able to be signaled with 15
[2023-12-07T16:51:03.500] debug2: set revoke expiration for jobid 64 to 1701989583 UTS
[2023-12-07T16:51:03.501] error: _delete_ns: umount2 /var/nvme/storage/cn4/64/.ns failed: Invalid argument
[2023-12-07T16:51:03.501] error: container_g_delete(64): Invalid argument
[2023-12-07T16:51:03.501] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB

Immediately, I see some errors regarding /var/nvme/storage (a local folder on each node, not a network-shared location on the NAS). This path is configured identically across all nodes, yet it only causes problems on a couple of them. Note that this is the base path set in job_container.conf:

AutoBasePath=true
BasePath=/var/nvme/storage
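
As a sanity check, I can look for leftover per-job directories under that base path on each node (a hypothetical loop, assuming passwordless SSH from the head node):

for n in cn{1..16}; do
    echo "== $n =="
    ssh "$n" 'ls /var/nvme/storage/$(hostname) 2>/dev/null'
done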

Additionally, here is cgroup.conf:

CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
ConstrainKmemSpace=no
TaskAffinity=no
CgroupMountpoint=/sys/fs/cgroup
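
Side note: the slurmd log above complains that /etc/slurm/cgroup_allowed_devices_file.conf is missing. Since I have ConstrainDevices=yes, my reading of the cgroup.conf docs is that cgroup/v1 device enforcement wants that file to whitelist device paths, something like the example given in the Slurm documentation:

/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*

That message is only at debug level, though, so I doubt it’s related to the nodes going down.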

…and slurm.conf:

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cauchy
SlurmctldHost=cauchy
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
MaxJobCount=1000000
#MaxStepCount=40000
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
PrologFlags=contain
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity,task/cgroup
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=120
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
MaxArraySize=100000
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerParameters=enable_user_top,bf_job_part_count_reserve=5,bf_continue
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#
####### Priority Begin ##################
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityWeightAge=100
PriorityWeightPartition=10000
PriorityWeightJobSize=0
PriorityMaxAge=14-0
PriorityFavorSmall=YES
#PriorityWeightQOS=10000
#PriorityWeightTRES=cpu=2000,mem=1,gres/gpu=400
#AccountingStorageTRES=gres/gpu
#AccountingStorageEnforce=all
#FairShareDampeningFactor=5
####### Priority End ##################
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
#AccountingStoreFlags=
#JobCompHost=
#JobCompLoc=
#JobCompParams=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
JobContainerType=job_container/tmpfs
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug2
SlurmdLogFile=/var/log/slurmd.log

#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=cn1 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn2 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn3 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn4 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn5 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn6 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn7 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn8 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn9 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn10 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn11 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn12 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn13 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn14 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn15 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000
NodeName=cn16 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=257000

PartitionName=main Nodes=ALL Default=YES MaxTime=INFINITE State=UP PriorityJobFactor=2000
PartitionName=low Nodes=ALL MaxTime=INFINITE State=UP PriorityJobFactor=1000

I cancelled a job (submitted to 4 nodes, cn1–cn4) that showed this error, before it could be reallocated to new nodes and overwrite the Slurm error file. Here are the contents of the error file:

Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
slurmstepd-cn1: error: *** JOB 74 ON cn1 CANCELLED AT 2023-12-10T10:05:57 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***

The Authorization required... error is pervasive across all nodes and all simulations, so I’m not sure it has any particular bearing on the consistent failure of the cn4 node. The segfault, however, doesn’t appear when the same job is run on other nodes, so that is new / anomalous. The slurmctld log is not particularly illuminating:

[2023-12-09T19:54:46.401] _slurm_rpc_submit_batch_job: JobId=65 InitPrio=10000 usec=493
[2023-12-09T19:54:46.826] sched/backfill: _start_job: Started JobId=65 in main on cn[1-4]
[2023-12-09T20:01:33.529] validate_node_specs: Node cn4 unexpectedly rebooted boot_time=1702173678 last response=1702173587
[2023-12-09T20:01:33.529] requeue job JobId=65 due to failure of node cn4
[2023-12-09T20:01:38.334] Requeuing JobId=65
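
If I understand validate_node_specs correctly, slurmctld compares the boot time slurmd reports at registration against the last one it saw, so “unexpectedly rebooted” means cn4 really did go down hard mid-job rather than just dropping off the network. The epoch stamps in that log line can be converted, and the two times compared from the head node (assuming scontrol on this version prints the BootTime/SlurmdStartTime fields):

date -d @1702173678   # convert the epoch stamps slurmctld logs
scontrol show node cn4 | grep -E 'BootTime|SlurmdStartTime'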

Can you pull logs for the host kernel too, at the same time? I wonder if there is anything on the system that is not being captured directly in the Slurm logs.

EDIT: The timestamps are relative to boot. Node cn4 had been rebooted after the head node, so they do not line up (not that they were expected to).
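
(Side note for anyone reading along: to get wall-clock timestamps instead of seconds-since-boot, both of these work on Ubuntu:)

# Kernel ring buffer with human-readable timestamps
sudo dmesg -T

# Or pull kernel messages from the journal, which stamps each line
journalctl -k --since "2023-12-07"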

Here are the contents of the head node’s /var/log/kern.log (after startup; if the entire log would be useful, I can post it):

...
[   22.374961] Bridge firewalling registered
[   22.455639] Initializing XFRM netlink socket
[ 3664.659393] rfkill: input handler enabled
[ 3666.820094] rfkill: input handler disabled
[ 3715.677980] FS-Cache: Loaded
[ 3715.895269] NFS: Registering the id_resolver key type
[ 3715.895280] Key type id_resolver registered
[ 3715.895282] Key type id_legacy registered
[ 4201.910954] nfs: server cauchynas not responding, timed out
[ 4382.318627] nfs: server cauchynas not responding, timed out
[ 4388.462364] nfs: server cauchynas not responding, timed out
[ 4930.479577] usb 1-1: USB disconnect, device number 2
[ 4930.679604] usb 1-2: USB disconnect, device number 3
[13763.477494] usb 1-1: new low-speed USB device number 4 using xhci_hcd
[13763.640988] usb 1-1: New USB device found, idVendor=413c, idProduct=2113, bcdDevice= 1.10
[13763.640997] usb 1-1: New USB device strings: Mfr=0, Product=2, SerialNumber=0
[13763.641000] usb 1-1: Product: Dell KB216 Wired Keyboard
[13763.670524] input: Dell KB216 Wired Keyboard as /devices/pci0000:00/0000:00:08.1/0000:03:00.3/usb1/1-1/1-1:1.0/0003:413C:2113.0006/input/input7
[13763.729933] hid-generic 0003:413C:2113.0006: input,hidraw0: USB HID v1.11 Keyboard [Dell KB216 Wired Keyboard] on usb-0000:03:00.3-1/input0
[13763.735499] input: Dell KB216 Wired Keyboard Consumer Control as /devices/pci0000:00/0000:00:08.1/0000:03:00.3/usb1/1-1/1-1:1.1/0003:413C:2113.0007/input/input8
[13763.793670] input: Dell KB216 Wired Keyboard System Control as /devices/pci0000:00/0000:00:08.1/0000:03:00.3/usb1/1-1/1-1:1.1/0003:413C:2113.0007/input/input9
[13763.793830] hid-generic 0003:413C:2113.0007: input,hidraw1: USB HID v1.11 Device [Dell KB216 Wired Keyboard] on usb-0000:03:00.3-1/input1
[13766.525306] usb 5-1: new low-speed USB device number 2 using xhci_hcd
[13766.798551] usb 5-1: New USB device found, idVendor=0461, idProduct=4d51, bcdDevice= 7.17
[13766.798560] usb 5-1: New USB device strings: Mfr=0, Product=2, SerialNumber=0
[13766.798564] usb 5-1: Product: DELL Laser Mouse
[13766.805642] input: DELL Laser Mouse as /devices/pci0000:40/0000:40:03.5/0000:45:00.0/usb5/5-1/5-1:1.0/0003:0461:4D51.0008/input/input10
[13766.805937] hid-generic 0003:0461:4D51.0008: input,hidraw2: USB HID v1.11 Mouse [DELL Laser Mouse] on usb-0000:45:00.0-1/input0
[13861.484094] rfkill: input handler enabled
[13869.266796] audit: type=1400 audit(1701973279.736:46): apparmor="DENIED" operation="open" class="file" profile="snap.snapd-desktop-integration.snapd-desktop-integration" name="/etc/host.conf" pid=8192 comm="snapd-desktop-i" requested_mask="r" denied_mask="r" fsuid=1000 ouid=0
[13869.266807] audit: type=1400 audit(1701973279.736:47): apparmor="DENIED" operation="open" class="file" profile="snap.snapd-desktop-integration.snapd-desktop-integration" name="/run/systemd/resolve/stub-resolv.conf" pid=8192 comm="snapd-desktop-i" requested_mask="r" denied_mask="r" fsuid=1000 ouid=101
[13869.266814] audit: type=1400 audit(1701973279.736:48): apparmor="DENIED" operation="open" class="file" profile="snap.snapd-desktop-integration.snapd-desktop-integration" name="/etc/hosts" pid=8192 comm="snapd-desktop-i" requested_mask="r" denied_mask="r" fsuid=1000 ouid=0
[13869.268256] audit: type=1400 audit(1701973279.736:49): apparmor="DENIED" operation="create" class="net" profile="snap.snapd-desktop-integration.snapd-desktop-integration" pid=8192 comm="snapd-desktop-i" family="inet" sock_type="dgram" protocol=0 requested_mask="create" denied_mask="create"
[13869.341250] rfkill: input handler disabled
[13872.739725] usb 1-1: USB disconnect, device number 4
[13872.915807] usb 5-1: USB disconnect, device number 2
[14646.106595] audit: type=1400 audit(1701974056.591:50): apparmor="DENIED" operation="open" class="file" profile="/snap/snapd/20290/usr/lib/snapd/snap-confine" name="/mnt/" pid=8461 comm="snap-confine" requested_mask="r" denied_mask="r" fsuid=1001 ouid=0
[14851.077295] audit: type=1400 audit(1701974261.568:51): apparmor="DENIED" operation="open" class="file" profile="/snap/snapd/20290/usr/lib/snapd/snap-confine" name="/mnt/" pid=8729 comm="snap-confine" requested_mask="r" denied_mask="r" fsuid=1000 ouid=0
[14916.967176] audit: type=1400 audit(1701974327.462:52): apparmor="DENIED" operation="open" class="file" profile="/snap/snapd/20290/usr/lib/snapd/snap-confine" name="/mnt/" pid=8986 comm="snap-confine" requested_mask="r" denied_mask="r" fsuid=1000 ouid=0
[14928.867650] audit: type=1400 audit(1701974339.358:53): apparmor="DENIED" operation="open" class="file" profile="/snap/snapd/20290/usr/lib/snapd/snap-confine" name="/mnt/" pid=9162 comm="snap-confine" requested_mask="r" denied_mask="r" fsuid=1000 ouid=0
[14946.630760] audit: type=1400 audit(1701974357.122:54): apparmor="DENIED" operation="open" class="file" profile="/snap/snapd/20290/usr/lib/snapd/snap-confine" name="/mnt/" pid=9344 comm="snap-confine" requested_mask="r" denied_mask="r" fsuid=1000 ouid=0
[15556.017990] audit: type=1400 audit(1701974966.530:55): apparmor="DENIED" operation="open" class="file" profile="/snap/snapd/20290/usr/lib/snapd/snap-confine" name="/mnt/" pid=9708 comm="snap-confine" requested_mask="r" denied_mask="r" fsuid=1002 ouid=0
[20644.191591] audit: type=1400 audit(1701980054.837:56): apparmor="DENIED" operation="open" class="file" profile="/snap/snapd/20290/usr/lib/snapd/snap-confine" name="/mnt/" pid=10849 comm="snap-confine" requested_mask="r" denied_mask="r" fsuid=1006 ouid=0
[21499.392857] audit: type=1400 audit(1701980910.060:57): apparmor="DENIED" operation="open" class="file" profile="/snap/snapd/20290/usr/lib/snapd/snap-confine" name="/mnt/" pid=11942 comm="snap-confine" requested_mask="r" denied_mask="r" fsuid=2001 ouid=0
[21538.404315] audit: type=1400 audit(1701980949.073:58): apparmor="DENIED" operation="open" class="file" profile="/snap/snapd/20290/usr/lib/snapd/snap-confine" name="/mnt/" pid=12159 comm="snap-confine" requested_mask="r" denied_mask="r" fsuid=1003 ouid=0
[23133.086900] audit: type=1400 audit(1701982543.798:59): apparmor="DENIED" operation="open" class="file" profile="/snap/snapd/20290/usr/lib/snapd/snap-confine" name="/mnt/" pid=13526 comm="snap-confine" requested_mask="r" denied_mask="r" fsuid=2001 ouid=1000
[55811.006946] audit: type=1400 audit(1702015222.584:60): apparmor="DENIED" operation="capable" class="cap" profile="/usr/sbin/cupsd" pid=20006 comm="cupsd" capability=12  capname="net_admin"
[73817.649198] loop16: detected capacity change from 0 to 151784
[79515.920066] systemd[1]: systemd 249.11-0ubuntu3.11 running in system mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY -P11KIT -QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
[79515.920210] systemd[1]: Detected architecture x86-64.
[142208.673589] audit: type=1400 audit(1702101622.582:61): apparmor="DENIED" operation="capable" class="cap" profile="/usr/sbin/cupsd" pid=33737 comm="cupsd" capability=12  capname="net_admin"
[228595.086088] audit: type=1400 audit(1702188011.313:62): apparmor="DENIED" operation="capable" class="cap" profile="/usr/sbin/cupsd" pid=46693 comm="cupsd" capability=12  capname="net_admin"
[314993.941330] audit: type=1400 audit(1702274412.418:63): apparmor="DENIED" operation="capable" class="cap" profile="/usr/sbin/cupsd" pid=59478 comm="cupsd" capability=12  capname="net_admin"

…and from one of the offending nodes (cn4):

...
[   14.973994] FS-Cache: Loaded
[   15.191833] NFS: Registering the id_resolver key type
[   15.191848] Key type id_resolver registered
[   15.191850] Key type id_legacy registered
[  710.778005] audit: type=1400 audit(1702225054.508:46): apparmor="DENIED" operation="open" class="file" profile="/snap/snapd/20290/usr/lib/snapd/snap-confine" name="/mnt/" pid=2897 comm="snap-confine" requested_mask="r" denied_mask="r" fsuid=1000 ouid=0
[48309.033918] loop17: detected capacity change from 0 to 130888
[50055.397221] audit: type=1400 audit(1702274400.252:47): apparmor="DENIED" operation="capable" class="cap" profile="/usr/sbin/cupsd" pid=5395 comm="cupsd" capability=12  capname="net_admin"
[85064.345527] audit: type=1400 audit(1702309409.986:48): apparmor="DENIED" operation="open" class="file" profile="/snap/snapd/20290/usr/lib/snapd/snap-confine" name="/mnt/" pid=7188 comm="snap-confine" requested_mask="r" denied_mask="r" fsuid=1000 ouid=

Well, none of that is helpful :stuck_out_tongue:


The segfault made me think there could be an issue with RAM (since I am not getting this issue on any other node with the same simulation), so I started running memtest86+ (v6). It’s currently 14% complete, and it has found 16 errors.

Is this normal? I have no frame of reference here. Assuming it is not normal, could it be the culprit for the above behavior?

EDIT: …and now 1600+ errors at 28%, at which point memtest stopped and the machine shut down.

EDIT 2: …and now looking more closely at the kernel logs for the node in question:

[    1.654373] [Hardware Error]: event severity: fatal
[    1.654374] [Hardware Error]:  Error 0, type: fatal
[    1.654375] [Hardware Error]:  fru_text: DIMM Locate:  P0G0
[    1.654376] [Hardware Error]:   section_type: memory error
[    1.654377] [Hardware Error]:    error_status: Storage error in DRAM memory (0x0000000000040400)
[    1.654380] [Hardware Error]:   node:0 card:5 module:0 
[    1.654381] [Hardware Error]:   error_type: 15, physical memory map-out event
[    1.654382] [Hardware Error]:  Error 1, type: recoverable
[    1.654383] [Hardware Error]:  fru_text: ProcessorError
[    1.654384] [Hardware Error]:   section_type: IA32/X64 processor error
[    1.654385] [Hardware Error]:   Local APIC_ID: 0x0
[    1.654385] [Hardware Error]:   CPUID Info:
[    1.654387] [Hardware Error]:   00000000: 00830f10 00000000 00400800 00000000
[    1.654389] [Hardware Error]:   00000010: 76d8320b 00000000 178bfbff 00000000
[    1.654390] [Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
[    1.654391] [Hardware Error]:   Error Information Structure 0:
[    1.654392] [Hardware Error]:    Error Structure Type: cache error
[    1.654393] [Hardware Error]:    Check Information: 0x000000001c4d0077
[    1.654394] [Hardware Error]:     Transaction Type: 1, Data Access
[    1.654395] [Hardware Error]:     Operation: 3, data read
[    1.654396] [Hardware Error]:     Level: 1
[    1.654397] [Hardware Error]:     Uncorrected: true
[    1.654398] [Hardware Error]:     Precise IP: true
[    1.654399] [Hardware Error]:     Restartable IP: true
[    1.654400] [Hardware Error]:    Instruction Pointer: 0x00000000a91d49a6
[    1.654401] [Hardware Error]:   Context Information Structure 0:
[    1.654402] [Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
[    1.654402] [Hardware Error]:    Register Array Size: 0x0050
[    1.654403] [Hardware Error]:    MSR Address: 0xc0002001
[    1.654410] BERT: Total records found: 1
[    1.654413] mce: [Hardware Error]: Machine check events logged
[    1.654416] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 0: bc002800000c0135
[    1.654422] fbcon: Taking over console
[    1.654423] mce: [Hardware Error]: TSC 0 ADDR 100003d006d9a80 MISC d01c0dff00000000 PPIN 2b598a2a1144052 IPID b000000000 
[    1.654430] mce: [Hardware Error]: PROCESSOR 2:830f10 TIME 1702413238 SOCKET 0 APIC 0 microcode 830107a
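
For anyone else chasing something similar, these are handy for watching machine-check events from a running system rather than from memtest (the last two assume the rasdaemon and edac-utils packages, respectively):

# Scan kernel messages for machine-check / hardware errors
journalctl -k | grep -iE 'mce|hardware error'

# Summary of errors logged by rasdaemon
sudo ras-mc-ctl --errors

# Per-DIMM corrected/uncorrected error counts via EDAC
edac-util -v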

EDIT 3: After removing the first stick of RAM, there are no hardware errors on boot (unsure yet whether that was the culprit stick).

EDIT 4: After reseating the RAM and rebooting, there are no hardware errors on boot. I’ll test the node and see if that was the issue. Otherwise, I think I have to run memtest one stick at a time…

Definitely a hardware issue of some sort. Memtest86+ finds no issues with individual sticks, one at a time. Reduced to half capacity with 4 sticks: still no issues. With all 8 installed: tons of errors, as described above.

Temporary solution: run with 4 sticks / half RAM capacity. Simulations seem to run fine, with none of the above dropped-node issues.
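
One thing I need to watch while running at half capacity: slurm.conf above declares RealMemory=257000 for every node, and my understanding is that a slurmd registering with less memory than that can get the node drained for low RealMemory. So I may need to temporarily lower cn4’s line to whatever free -m reports with 4 sticks, something like (the 128000 is a placeholder):

NodeName=cn4 CPUs=64 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN RealMemory=128000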

I’ll try reseating the CPU tomorrow, per suggestion from @infinitevalence, and see if I can get all 8 sticks to work.


Hope this works!

Okay, it seems like two issues were contributing to this simultaneously:

The first (a hardware issue) was correctly surmised by @infinitevalence. The segfault / RAM issue was likely due to poor seating of the EPYC processor (I likely had the torque wrench set too low). Reseating the CPU led to memtest passing with all sticks installed.

But there were still errors, similar to the excerpt above from the job-64 log:

[2023-12-07T16:51:03.492] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 54 not found, deleting the namespace
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/54/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 59 not found, deleting the namespace
[2023-12-07T16:51:03.492] error: _delete_ns: umount2 /var/nvme/storage/cn4/59/.ns failed: Invalid argument
[2023-12-07T16:51:03.492] error: Encountered an error while restoring job containers.
[2023-12-07T16:51:03.492] error: Unable to restore job_container state.

Similar errors were still arising and causing crashes. My guess is that Slurm left residual directories in /var/nvme/storage/cn4/ associated with failed jobs (in this example, 54 and 59). Removing these folders cleared the errors, and the simulations no longer terminated prematurely. So, all is well.

I wonder why Slurm leaves these directories hanging around for failed jobs, and, since they were causing persistent errors for subsequent jobs, how we can make sure they are removed automatically even when a simulation fails prematurely. But that’s probably a post-holiday question to tackle…
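
In the meantime, a band-aid I’m considering (an untested sketch, and it may well fight with the plugin’s own cleanup) is a node Epilog that removes the leftover directory for each job as it ends. Slurm runs the Epilog as root on each node with SLURM_JOB_ID set, and it would be wired up via the currently commented-out Epilog= line in slurm.conf:

#!/bin/bash
# Hypothetical /etc/slurm/epilog.sh: best-effort cleanup of
# job_container/tmpfs leftovers for the job that just finished.
BASE=/var/nvme/storage/$(hostname)

if [ -n "$SLURM_JOB_ID" ] && [ -d "$BASE/$SLURM_JOB_ID" ]; then
    # Lazily detach the per-job namespace mount if it is still around...
    umount -l "$BASE/$SLURM_JOB_ID/.ns" 2>/dev/null
    # ...then remove the directory so the next slurmd restart
    # has nothing to trip over.
    rm -rf "$BASE/$SLURM_JOB_ID"
fi
# Always exit 0; a non-zero Epilog exit code would drain the node.
exit 0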

EDIT: For context on the syslog below: I am using eno1np0 for networking on this node.

A new error in this regard: new jobs are failing on this node, but this time it looks like a software or networking issue.

From the slurmctld log on the controller node:

[2023-12-16T08:30:48.472] Node cn4 now responding
[2023-12-16T08:30:48.472] validate_node_specs: Node cn4 unexpectedly rebooted boot_time=1702737034 last response=1702736799
[2023-12-16T08:31:31.830] error: Nodes cn4 not responding

From cn4’s slurmd log:

... (job 280 running normally)
[2023-12-16T08:26:40.779] [280.extern] debug2: profile signaling type Task
[2023-12-16T08:26:41.151] [280.0] debug2: profile signaling type Task
[2023-12-16T08:27:10.779] [280.extern] debug2: profile signaling type Task
[2023-12-16T08:27:11.151] [280.0] debug2: profile signaling type Task
[2023-12-16T08:27:40.779] [280.extern] debug2: profile signaling type Task
[2023-12-16T08:27:41.151] [280.0] debug2: profile signaling type Task
[2023-12-16T08:30:47.761] debug:  Log file re-opened
[2023-12-16T08:30:47.761] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_280.0: Connection refused
[2023-12-16T08:30:47.761] debug:  Cleaned up stray socket /var/spool/slurmd/cn4_280.0
[2023-12-16T08:30:47.761] debug:  _step_connect: connect() failed for /var/spool/slurmd/cn4_280.4294967292: Connection refused
[2023-12-16T08:30:47.761] debug:  Cleaned up stray socket /var/spool/slurmd/cn4_280.4294967292
[2023-12-16T08:30:47.761] debug2: hwloc_topology_init
[2023-12-16T08:30:47.766] debug2: hwloc_topology_load
[2023-12-16T08:30:47.784] debug2: hwloc_topology_export_xml
[2023-12-16T08:30:47.786] debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-16T08:30:47.787] debug2: hwloc_topology_init
[2023-12-16T08:30:47.788] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
[2023-12-16T08:30:47.789] debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
[2023-12-16T08:30:47.789] topology/none: init: topology NONE plugin loaded
[2023-12-16T08:30:47.790] route/default: init: route default plugin loaded
[2023-12-16T08:30:47.790] debug2: Gathering cpu frequency information for 64 cpus
[2023-12-16T08:30:47.791] debug:  Resource spec: No specialized cores configured by default on this node
[2023-12-16T08:30:47.791] debug:  Resource spec: Reserved system memory limit not configured for this node
[2023-12-16T08:30:47.794] task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffff
[2023-12-16T08:30:47.794] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
[2023-12-16T08:30:47.794] debug:  auth/munge: init: Munge authentication plugin loaded
[2023-12-16T08:30:47.794] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2023-12-16T08:30:47.795] debug:  /etc/slurm/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/*.conf"
[2023-12-16T08:30:47.795] cred/munge: init: Munge credential signature plugin loaded
[2023-12-16T08:30:47.795] slurmd version 21.08.5 started
[2023-12-16T08:30:47.795] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2023-12-16T08:30:47.796] debug:  job_container/tmpfs: init: job_container tmpfs plugin loaded
[2023-12-16T08:30:47.796] debug:  job_container/tmpfs: _read_slurm_jc_conf: Reading job_container.conf file /etc/slurm/job_container.conf
[2023-12-16T08:30:47.796] debug:  job_container/tmpfs: container_p_restore: job_container.conf read successfully
[2023-12-16T08:30:47.796] debug:  job_container/tmpfs: _restore_ns: _restore_ns: Job 280 not found, deleting the namespace
[2023-12-16T08:30:47.796] error: _delete_ns: umount2 /var/nvme/storage/cn4/280/.ns failed: Invalid argument
[2023-12-16T08:30:47.796] error: Encountered an error while restoring job containers.
[2023-12-16T08:30:47.796] error: Unable to restore job_container state.
[2023-12-16T08:30:47.797] debug:  switch/none: init: switch NONE plugin loaded
[2023-12-16T08:30:47.797] debug:  switch Cray/Aries plugin loaded.
[2023-12-16T08:30:47.797] slurmd started on Sat, 16 Dec 2023 08:30:47 -0600
[2023-12-16T08:30:47.797] CPUs=64 Boards=1 Sockets=1 Cores=64 Threads=1 Memory=257580 TmpDisk=937291 Uptime=14 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2023-12-16T08:30:47.798] debug:  acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2023-12-16T08:30:47.798] debug:  acct_gather_Profile/none: init: AcctGatherProfile NONE plugin loaded
[2023-12-16T08:30:47.798] debug:  acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2023-12-16T08:30:47.799] debug:  acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2023-12-16T08:30:47.799] debug2: No acct_gather.conf file (/etc/slurm/acct_gather.conf)
[2023-12-16T08:30:47.802] debug:  _handle_node_reg_resp: slurmctld sent back 8 TRES.
[2023-12-16T08:30:48.590] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2023-12-16T08:30:48.590] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2023-12-16T08:30:48.590] debug:  _rpc_terminate_job: uid = 64030 JobId=280
[2023-12-16T08:30:48.590] debug:  credential for job 280 revoked
[2023-12-16T08:30:48.590] debug2: No steps in jobid 280 to send signal 18
[2023-12-16T08:30:48.590] debug2: No steps in jobid 280 to send signal 15
[2023-12-16T08:30:48.591] debug2: set revoke expiration for jobid 280 to 1702737168 UTS
[2023-12-16T08:30:48.591] error: _delete_ns: umount2 /var/nvme/storage/cn4/280/.ns failed: Invalid argument
[2023-12-16T08:30:48.591] error: container_g_delete(280): Invalid argument
[2023-12-16T08:30:48.591] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
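
For the networking angle, these are the basic controller-to-node checks I know of (the munge round-trip is the standard one from the Slurm docs; nc is from netcat):

# From the head node: does munge authentication round-trip to cn4?
munge -n | ssh cn4 unmunge

# Is slurmd listening on its port (SlurmdPort=6818)?
nc -zv cn4 6818

# And from cn4 itself: can it see slurmctld?
scontrol ping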

Now there is something in /var/log/syslog (especially near the end; I didn’t know how far back I should go…):

Dec 16 08:30:40 cn4 gnome-shell[1801]: Unset XDG_SESSION_ID, getCurrentSessionProxy() called outside a user session. Asking logind directly.
Dec 16 08:30:40 cn4 gnome-shell[1801]: Will monitor session c1
Dec 16 08:30:40 cn4 dbus-daemon[1276]: [system] Activating via systemd: service name='org.freedesktop.locale1' unit='dbus-org.freedesktop.locale1.service' requested by ':1.38' (uid=128 pid=1801 comm="/usr/bin/gnome-shell " label="unconfined")
Dec 16 08:30:40 cn4 systemd[1]: Starting Locale Service...
Dec 16 08:30:40 cn4 /usr/libexec/gdm-x-session[1651]: dbus-daemon[1651]: [session uid=128 pid=1651] Activating service name='org.freedesktop.impl.portal.PermissionStore' requested by ':1.5' (uid=128 pid=1801 comm="/usr/bin/gnome-shell " label="unconfined")
Dec 16 08:30:40 cn4 /usr/libexec/gdm-x-session[1651]: dbus-daemon[1651]: [session uid=128 pid=1651] Successfully activated service 'org.freedesktop.impl.portal.PermissionStore'
Dec 16 08:30:40 cn4 dbus-daemon[1276]: [system] Activating via systemd: service name='org.freedesktop.UPower' unit='upower.service' requested by ':1.38' (uid=128 pid=1801 comm="/usr/bin/gnome-shell " label="unconfined")
Dec 16 08:30:40 cn4 kernel: [    6.048298] bnxt_en 0000:45:00.0 eno1np0: NIC Link is Up, 10000 Mbps full duplex, Flow control: none
Dec 16 08:30:40 cn4 kernel: [    6.048309] bnxt_en 0000:45:00.0 eno1np0: EEE is not active
Dec 16 08:30:40 cn4 kernel: [    6.048313] bnxt_en 0000:45:00.0 eno1np0: FEC autoneg off encoding: None
Dec 16 08:30:40 cn4 NetworkManager[1277]: <info>  [1702737040.1897] device (eno1np0): carrier: link connected
Dec 16 08:30:40 cn4 NetworkManager[1277]: <info>  [1702737040.1919] manager: (eno2np1): new Ethernet device (/org/freedesktop/NetworkManager/Devices/3)
Dec 16 08:30:40 cn4 dbus-daemon[1276]: [system] Successfully activated service 'org.freedesktop.RealtimeKit1'
Dec 16 08:30:40 cn4 NetworkManager[1277]: <info>  [1702737040.1934] settings: (eno2np1): created default wired connection 'Wired connection 2'
Dec 16 08:30:40 cn4 rtkit-daemon[1526]: Successfully called chroot.
Dec 16 08:30:40 cn4 NetworkManager[1277]: <info>  [1702737040.1935] device (eno2np1): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
Dec 16 08:30:40 cn4 rtkit-daemon[1526]: Successfully dropped privileges.
Dec 16 08:30:40 cn4 systemd[1]: Starting Daemon for power management...
Dec 16 08:30:40 cn4 rtkit-daemon[1526]: Successfully limited resources.
Dec 16 08:30:40 cn4 systemd[1]: Started RealtimeKit Scheduling Policy Service.
Dec 16 08:30:40 cn4 rtkit-daemon[1526]: Running.
Dec 16 08:30:40 cn4 rtkit-daemon[1526]: Canary thread running.
Dec 16 08:30:40 cn4 rtkit-daemon[1526]: Watchdog thread running.
Dec 16 08:30:40 cn4 rtkit-daemon[1526]: Supervising 0 threads of 0 processes of 0 users.
Dec 16 08:30:40 cn4 dbus-daemon[1276]: [system] Activating via systemd: service name='org.freedesktop.GeoClue2' unit='geoclue.service' requested by ':1.38' (uid=128 pid=1801 comm="/usr/bin/gnome-shell " label="unconfined")
Dec 16 08:30:40 cn4 systemd[1]: Starting Location Lookup Service...
Dec 16 08:30:40 cn4 rtkit-daemon[1526]: Successfully made thread 1513 of process 1513 owned by '128' high priority at nice level -11.
Dec 16 08:30:40 cn4 rtkit-daemon[1526]: Supervising 1 threads of 1 processes of 1 users.
Dec 16 08:30:40 cn4 rtkit-daemon[1526]: Successfully made thread 1515 of process 1515 owned by '128' high priority at nice level -11.
Dec 16 08:30:40 cn4 rtkit-daemon[1526]: Supervising 2 threads of 2 processes of 1 users.
Dec 16 08:30:40 cn4 rtkit-daemon[1526]: message repeated 2 times: [ Supervising 2 threads of 2 processes of 1 users.]
Dec 16 08:30:40 cn4 rtkit-daemon[1526]: Successfully made thread 1939 of process 1514 owned by '128' RT at priority 20.
Dec 16 08:30:40 cn4 rtkit-daemon[1526]: Supervising 3 threads of 3 processes of 1 users.
Dec 16 08:30:40 cn4 rtkit-daemon[1526]: Supervising 3 threads of 3 processes of 1 users.
Dec 16 08:30:40 cn4 dbus-daemon[1276]: [system] Activating via systemd: service name='org.freedesktop.PackageKit' unit='packagekit.service' requested by ':1.38' (uid=128 pid=1801 comm="/usr/bin/gnome-shell " label="unconfined")
Dec 16 08:30:40 cn4 rtkit-daemon[1526]: Successfully made thread 1947 of process 1513 owned by '128' RT at priority 20.
Dec 16 08:30:40 cn4 rtkit-daemon[1526]: Supervising 4 threads of 3 processes of 1 users.
Dec 16 08:30:40 cn4 systemd[1]: Starting PackageKit Daemon...
Dec 16 08:30:40 cn4 gnome-shell[1801]: Extension [email protected] already installed in /usr/share/gnome-shell/extensions/[email protected]. /usr/share/gnome-shell/extensions/[email protected] will not be loaded
Dec 16 08:30:40 cn4 gnome-shell[1801]: Extension [email protected] already installed in /usr/share/gnome-shell/extensions/[email protected]. /usr/share/gnome-shell/extensions/[email protected] will not be loaded
Dec 16 08:30:40 cn4 gnome-shell[1801]: Extension [email protected] already installed in /usr/share/gnome-shell/extensions/[email protected]. /usr/share/gnome-shell/extensions/[email protected] will not be loaded
Dec 16 08:30:40 cn4 /usr/libexec/gdm-x-session[1651]: dbus-daemon[1651]: [session uid=128 pid=1651] Activating service name='org.gnome.Shell.Notifications' requested by ':1.5' (uid=128 pid=1801 comm="/usr/bin/gnome-shell " label="unconfined")
Dec 16 08:30:40 cn4 PackageKit: daemon start
Dec 16 08:30:40 cn4 dbus-daemon[1276]: [system] Activating via systemd: service name='org.bluez' unit='dbus-org.bluez.service' requested by ':1.42' (uid=128 pid=1515 comm="/usr/bin/pulseaudio --daemonize=no --log-target=jo" label="unconfined")
Dec 16 08:30:40 cn4 systemd[1]: Condition check resulted in Bluetooth service being skipped.
Dec 16 08:30:40 cn4 /usr/libexec/gdm-x-session[1663]: dbus-daemon[1663]: Activating service name='org.a11y.atspi.Registry' requested by ':1.2' (uid=128 pid=1801 comm="/usr/bin/gnome-shell " label="unconfined")
Dec 16 08:30:40 cn4 rtkit-daemon[1526]: Supervising 4 threads of 3 processes of 1 users.
Dec 16 08:30:40 cn4 rtkit-daemon[1526]: Successfully made thread 1952 of process 1515 owned by '128' RT at priority 5.
Dec 16 08:30:40 cn4 rtkit-daemon[1526]: Supervising 5 threads of 3 processes of 1 users.
Dec 16 08:30:40 cn4 systemd[1500]: Started Sound Service.
Dec 16 08:30:40 cn4 /usr/libexec/gdm-x-session[1663]: dbus-daemon[1663]: Successfully activated service 'org.a11y.atspi.Registry'
Dec 16 08:30:40 cn4 /usr/libexec/gdm-x-session[1953]: SpiRegistry daemon is running with well-known name - org.a11y.atspi.Registry
Dec 16 08:30:40 cn4 dbus-daemon[1276]: [system] Successfully activated service 'org.freedesktop.locale1'
Dec 16 08:30:40 cn4 systemd[1]: Started Locale Service.
Dec 16 08:30:40 cn4 org.gnome.Shell.desktop[1801]: Window manager warning: Failed to parse saved session file: Failed to open file “/var/lib/gdm3/.config/mutter/sessions/10757cb1bd85d34e56170273703955945600000016520000.ms”: No such file or directory
Dec 16 08:30:40 cn4 /usr/libexec/gdm-x-session[1651]: dbus-daemon[1651]: [session uid=128 pid=1651] Successfully activated service 'org.gnome.Shell.Notifications'
Dec 16 08:30:40 cn4 dbus-daemon[1276]: [system] Successfully activated service 'org.freedesktop.UPower'
Dec 16 08:30:40 cn4 systemd[1]: Started Daemon for power management.
Dec 16 08:30:40 cn4 gnome-shell[1801]: JS ERROR: TypeError: this._managerProxy is undefined#012_onGeoclueVanished@resource:///org/gnome/shell/ui/status/location.js:169:9
Dec 16 08:30:40 cn4 snapd[1320]: overlord.go:272: Acquiring state lock file
Dec 16 08:30:40 cn4 snapd[1320]: overlord.go:277: Acquired state lock file
Dec 16 08:30:40 cn4 kernel: [    6.178511] rfkill: input handler disabled
Dec 16 08:30:40 cn4 gnome-shell[1801]: Error looking up permission: GDBus.Error:org.freedesktop.portal.Error.NotFound: No entry for geolocation
Dec 16 08:30:40 cn4 snapd[1320]: daemon.go:247: started snapd/2.60.4 (series 16; classic) ubuntu/22.04 (amd64) linux/6.2.0-39-generic.
Dec 16 08:30:40 cn4 /usr/libexec/gdm-x-session[1651]: dbus-daemon[1651]: [session uid=128 pid=1651] Activating service name='org.gtk.vfs.Daemon' requested by ':1.20' (uid=128 pid=1984 comm="ibus-daemon --panel disable --xim " label="unconfined")
Dec 16 08:30:40 cn4 /usr/libexec/gdm-x-session[1651]: dbus-daemon[1651]: [session uid=128 pid=1651] Successfully activated service 'org.gtk.vfs.Daemon'
Dec 16 08:30:40 cn4 kernel: [    6.196325] loop17: detected capacity change from 0 to 8
Dec 16 08:30:40 cn4 /usr/libexec/gdm-x-session[1651]: dbus-daemon[1651]: [session uid=128 pid=1651] Activating service name='org.freedesktop.portal.IBus' requested by ':1.20' (uid=128 pid=1984 comm="ibus-daemon --panel disable --xim " label="unconfined")
Dec 16 08:30:40 cn4 snapd[1320]: daemon.go:340: adjusting startup timeout by 1m20s (pessimistic estimate of 30s plus 5s per snap)
Dec 16 08:30:40 cn4 /usr/libexec/gdm-x-session[1651]: dbus-daemon[1651]: [session uid=128 pid=1651] Successfully activated service 'org.freedesktop.portal.IBus'
Dec 16 08:30:40 cn4 snapd[1320]: backends.go:58: AppArmor status: apparmor is enabled and all features are available (using snapd provided apparmor_parser)
Dec 16 08:30:40 cn4 spice-vdagent[2098]: vdagent virtio channel /dev/virtio-ports/com.redhat.spice.0 does not exist, exiting
Dec 16 08:30:40 cn4 gnome-session-binary[1652]: Entering running state
Dec 16 08:30:40 cn4 xbrlapi.desktop[2117]: openConnection: connect: No such file or directory
Dec 16 08:30:40 cn4 xbrlapi.desktop[2117]: cannot connect to braille devices daemon brltty at :0
Dec 16 08:30:40 cn4 systemd[1]: Started Snap Daemon.
Dec 16 08:30:40 cn4 dbus-daemon[1276]: [system] Activating via systemd: service name='org.freedesktop.timedate1' unit='dbus-org.freedesktop.timedate1.service' requested by ':1.56' (uid=0 pid=1320 comm="/usr/lib/snapd/snapd " label="unconfined")
Dec 16 08:30:40 cn4 gsd-media-keys[2011]: Failed to grab accelerator for keybinding settings:hibernate
Dec 16 08:30:40 cn4 gsd-media-keys[2011]: Failed to grab accelerator for keybinding settings:rotate-video-lock
Dec 16 08:30:40 cn4 gsd-media-keys[2011]: Failed to grab accelerator for keybinding settings:playback-random
Dec 16 08:30:40 cn4 gsd-media-keys[2011]: Failed to grab accelerator for keybinding settings:playback-repeat
Dec 16 08:30:40 cn4 gsd-color[1985]: failed to get edid: unable to get EDID for output
Dec 16 08:30:40 cn4 systemd[1]: Starting Wait until snapd is fully seeded...
Dec 16 08:30:40 cn4 /usr/libexec/gdm-x-session[1651]: dbus-daemon[1651]: [session uid=128 pid=1651] Activating service name='org.gnome.ScreenSaver' requested by ':1.24' (uid=128 pid=2033 comm="/usr/libexec/gsd-power " label="unconfined")
Dec 16 08:30:40 cn4 systemd[1]: Starting Time & Date Service...
Dec 16 08:30:40 cn4 gsd-color[1985]: unable to get EDID for xrandr-VGA-1: unable to get EDID for output
Dec 16 08:30:40 cn4 /usr/libexec/gdm-x-session[1651]: dbus-daemon[1651]: [session uid=128 pid=1651] Successfully activated service 'org.gnome.ScreenSaver'
Dec 16 08:30:40 cn4 dbus-daemon[1276]: [system] Successfully activated service 'org.freedesktop.timedate1'
Dec 16 08:30:40 cn4 systemd[1]: Started Time & Date Service.
Dec 16 08:30:40 cn4 systemd[1]: Finished Wait until snapd is fully seeded.
Dec 16 08:30:40 cn4 systemd[1]: Condition check resulted in Auto import assertions from block devices being skipped.
Dec 16 08:30:40 cn4 dbus-daemon[1276]: [system] Activating via systemd: service name='net.reactivated.Fprint' unit='fprintd.service' requested by ':1.38' (uid=128 pid=1801 comm="/usr/bin/gnome-shell " label="unconfined")
Dec 16 08:30:40 cn4 systemd[1]: Starting Fingerprint Authentication Daemon...
Dec 16 08:30:40 cn4 gnome-shell[1801]: Registering session with GDM
Dec 16 08:30:40 cn4 systemd[1]: Received SIGRTMIN+21 from PID 1013 (plymouthd).
Dec 16 08:30:40 cn4 systemd[1]: Finished Hold until boot process finishes up.
Dec 16 08:30:40 cn4 systemd[1]: Starting Set console scheme...
Dec 16 08:30:40 cn4 systemd[1]: Finished Set console scheme.
Dec 16 08:30:40 cn4 systemd[1]: Created slice Slice /system/getty.
Dec 16 08:30:40 cn4 dbus-daemon[1276]: [system] Successfully activated service 'net.reactivated.Fprint'
Dec 16 08:30:40 cn4 systemd[1]: Started Fingerprint Authentication Daemon.
Dec 16 08:30:40 cn4 gnome-shell[1801]: JS ERROR: Failed to initialize fprintd service: Gio.IOErrorEnum: GDBus.Error:net.reactivated.Fprint.Error.NoSuchDevice: No devices available#012asyncCallback@resource:///org/gnome/gjs/modules/core/overrides/Gio.js:114:23
Dec 16 08:30:41 cn4 ModemManager[1382]: <info>  [base-manager] couldn't check support for device '/sys/devices/pci0000:40/0000:40:03.6/0000:45:00.0': not supported by any plugin
Dec 16 08:30:41 cn4 ModemManager[1382]: <info>  [base-manager] couldn't check support for device '/sys/devices/pci0000:40/0000:40:03.6/0000:45:00.1': not supported by any plugin
Dec 16 08:30:41 cn4 ModemManager[1382]: <info>  [base-manager] couldn't check support for device '/sys/devices/pci0000:40/0000:40:08.1/0000:47:00.3/usb7/7-1/7-1.2': not supported by any plugin
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.6871] manager: (enxbe3af2b6059f): new Ethernet device (/org/freedesktop/NetworkManager/Devices/4)
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.6887] settings: (enxbe3af2b6059f): created default wired connection 'Wired connection 3'
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.6887] device (enxbe3af2b6059f): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.6895] device (enxbe3af2b6059f): carrier: link connected
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.6910] failed to open /run/network/ifstate
Dec 16 08:30:41 cn4 dbus-daemon[1529]: [session uid=128 pid=1529] Successfully activated service 'org.gnome.OnlineAccounts'
Dec 16 08:30:41 cn4 dbus-daemon[1529]: [session uid=128 pid=1529] Successfully activated service 'org.gtk.vfs.GoaVolumeMonitor'
Dec 16 08:30:41 cn4 systemd[1500]: Started Virtual filesystem service - GNOME Online Accounts monitor.
Dec 16 08:30:41 cn4 dbus-daemon[1529]: [session uid=128 pid=1529] Activating via systemd: service name='org.gtk.vfs.GPhoto2VolumeMonitor' unit='gvfs-gphoto2-volume-monitor.service' requested by ':1.4' (uid=128 pid=1545 comm="/usr/libexec/tracker-miner-fs-3 " label="unconfined")
Dec 16 08:30:41 cn4 dbus-daemon[1276]: [system] Successfully activated service 'org.freedesktop.GeoClue2'
Dec 16 08:30:41 cn4 systemd[1]: Started Location Lookup Service.
Dec 16 08:30:41 cn4 systemd[1500]: Starting Virtual filesystem service - digital camera monitor...
Dec 16 08:30:41 cn4 dbus-daemon[1529]: [session uid=128 pid=1529] Successfully activated service 'org.gtk.vfs.GPhoto2VolumeMonitor'
Dec 16 08:30:41 cn4 systemd[1500]: Started Virtual filesystem service - digital camera monitor.
Dec 16 08:30:41 cn4 dbus-daemon[1529]: [session uid=128 pid=1529] Activating via systemd: service name='org.gtk.vfs.MTPVolumeMonitor' unit='gvfs-mtp-volume-monitor.service' requested by ':1.4' (uid=128 pid=1545 comm="/usr/libexec/tracker-miner-fs-3 " label="unconfined")
Dec 16 08:30:41 cn4 systemd[1500]: Starting Virtual filesystem service - Media Transfer Protocol monitor...
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7601] modem-manager: ModemManager available
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7617] device (eno1np0): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'managed')
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7622] device (enxbe3af2b6059f): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'managed')
Dec 16 08:30:41 cn4 /usr/libexec/gdm-x-session[1651]: dbus-daemon[1651]: [session uid=128 pid=1651] Activating service name='org.freedesktop.systemd1' requested by ':1.9' (uid=128 pid=1977 comm="/usr/libexec/gsd-sharing " label="unconfined")
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7627] policy: auto-activating connection 'Wired connection 1' (8c83dbf8-bc1a-3e53-9ffd-5b00ced83f9b)
Dec 16 08:30:41 cn4 dbus-daemon[1529]: [session uid=128 pid=1529] Successfully activated service 'org.gtk.vfs.MTPVolumeMonitor'
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7631] policy: auto-activating connection 'Wired connection 3' (27de32e4-7f73-336c-be30-f62d01a99b53)
Dec 16 08:30:41 cn4 systemd[1500]: Started Virtual filesystem service - Media Transfer Protocol monitor.
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7635] device (eno1np0): Activation: starting connection 'Wired connection 1' (8c83dbf8-bc1a-3e53-9ffd-5b00ced83f9b)
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7637] device (enxbe3af2b6059f): Activation: starting connection 'Wired connection 3' (27de32e4-7f73-336c-be30-f62d01a99b53)
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7638] device (eno1np0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7641] manager: NetworkManager state is now CONNECTING
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7643] device (eno1np0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7649] device (enxbe3af2b6059f): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7652] device (enxbe3af2b6059f): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Dec 16 08:30:41 cn4 /usr/libexec/gdm-x-session[1651]: dbus-daemon[1651]: [session uid=128 pid=1651] Activated service 'org.freedesktop.systemd1' failed: Process org.freedesktop.systemd1 exited with status 1
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7656] device (eno1np0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Dec 16 08:30:41 cn4 gsd-sharing[1977]: Failed to StopUnit service: GDBus.Error:org.freedesktop.DBus.Error.Spawn.ChildExited: Process org.freedesktop.systemd1 exited with status 1
Dec 16 08:30:41 cn4 gsd-sharing[1977]: Failed to StopUnit service: GDBus.Error:org.freedesktop.DBus.Error.Spawn.ChildExited: Process org.freedesktop.systemd1 exited with status 1
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7661] device (enxbe3af2b6059f): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7666] dhcp4 (enxbe3af2b6059f): activation: beginning transaction (timeout in 45 seconds)
Dec 16 08:30:41 cn4 avahi-daemon[1272]: Joining mDNS multicast group on interface eno1np0.IPv4 with address 10.115.79.12.
Dec 16 08:30:41 cn4 avahi-daemon[1272]: New relevant interface eno1np0.IPv4 for mDNS.
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7676] device (eno1np0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Dec 16 08:30:41 cn4 avahi-daemon[1272]: Registering new address record for 10.115.79.12 on eno1np0.IPv4.
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7689] policy: set 'Wired connection 1' (eno1np0) as default for IPv4 routing and DNS
Dec 16 08:30:41 cn4 avahi-daemon[1272]: Joining mDNS multicast group on interface eno1np0.IPv6 with address fe80::ecde:6f14:877c:8b74.
Dec 16 08:30:41 cn4 avahi-daemon[1272]: New relevant interface eno1np0.IPv6 for mDNS.
Dec 16 08:30:41 cn4 avahi-daemon[1272]: Registering new address record for fe80::ecde:6f14:877c:8b74 on eno1np0.*.
Dec 16 08:30:41 cn4 avahi-daemon[1272]: Joining mDNS multicast group on interface enxbe3af2b6059f.IPv6 with address fe80::28ef:ad71:7c7:c8b5.
Dec 16 08:30:41 cn4 avahi-daemon[1272]: New relevant interface enxbe3af2b6059f.IPv6 for mDNS.
Dec 16 08:30:41 cn4 avahi-daemon[1272]: Registering new address record for fe80::28ef:ad71:7c7:c8b5 on enxbe3af2b6059f.*.
Dec 16 08:30:41 cn4 /usr/libexec/gdm-x-session[1651]: dbus-daemon[1651]: [session uid=128 pid=1651] Activating service name='org.freedesktop.systemd1' requested by ':1.9' (uid=128 pid=1977 comm="/usr/libexec/gsd-sharing " label="unconfined")
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7727] dhcp4 (enxbe3af2b6059f): state changed new lease, address=169.254.3.1
Dec 16 08:30:41 cn4 systemd-resolved[1047]: eno1np0: Bus client set default route setting: yes
Dec 16 08:30:41 cn4 avahi-daemon[1272]: Joining mDNS multicast group on interface enxbe3af2b6059f.IPv4 with address 169.254.3.1.
Dec 16 08:30:41 cn4 avahi-daemon[1272]: New relevant interface enxbe3af2b6059f.IPv4 for mDNS.
Dec 16 08:30:41 cn4 avahi-daemon[1272]: Registering new address record for 169.254.3.1 on enxbe3af2b6059f.IPv4.
Dec 16 08:30:41 cn4 systemd-resolved[1047]: eno1np0: Bus client set DNS server list to: 10.40.73.73, 10.40.8.8
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7738] device (eno1np0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7743] device (enxbe3af2b6059f): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7748] device (eno1np0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
Dec 16 08:30:41 cn4 /usr/libexec/gdm-x-session[1651]: dbus-daemon[1651]: [session uid=128 pid=1651] Activated service 'org.freedesktop.systemd1' failed: Process org.freedesktop.systemd1 exited with status 1
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7754] device (eno1np0): Activation: successful, device activated.
Dec 16 08:30:41 cn4 gsd-sharing[1977]: Failed to StopUnit service: GDBus.Error:org.freedesktop.DBus.Error.Spawn.ChildExited: Process org.freedesktop.systemd1 exited with status 1
Dec 16 08:30:41 cn4 gsd-sharing[1977]: Failed to StopUnit service: GDBus.Error:org.freedesktop.DBus.Error.Spawn.ChildExited: Process org.freedesktop.systemd1 exited with status 1
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7776] device (enxbe3af2b6059f): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7778] device (enxbe3af2b6059f): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7781] manager: NetworkManager state is now CONNECTED_SITE
Dec 16 08:30:41 cn4 systemd-resolved[1047]: /etc/hosts:2: line is missing any hostnames
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7784] device (enxbe3af2b6059f): Activation: successful, device activated.
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.7838] agent-manager: agent[4bf6f49a4e899a87,:1.38/org.gnome.Shell.NetworkAgent/128]: agent registered
Dec 16 08:30:41 cn4 NetworkManager[1277]: <info>  [1702737041.8524] manager: NetworkManager state is now CONNECTED_GLOBAL
Dec 16 08:30:41 cn4 dbus-daemon[1529]: [session uid=128 pid=1529] Successfully activated service 'org.freedesktop.Tracker3.Miner.Files'
Dec 16 08:30:41 cn4 systemd[1500]: Started Tracker file system data miner.
Dec 16 08:30:41 cn4 systemd[1500]: Started Tracker metadata extractor.
Dec 16 08:30:41 cn4 systemd[1500]: Reached target Main User Target.
Dec 16 08:30:41 cn4 systemd[1500]: Startup finished in 3.109s.
Dec 16 08:30:42 cn4 dbus-daemon[1276]: [system] Successfully activated service 'org.freedesktop.PackageKit'
Dec 16 08:30:42 cn4 systemd[1]: Started PackageKit Daemon.
Dec 16 08:30:43 cn4 systemd[1]: dmesg.service: Deactivated successfully.
Dec 16 08:30:47 cn4 NetworkManager[1277]: <info>  [1702737047.6862] manager: startup complete
Dec 16 08:30:47 cn4 systemd[1]: Finished Network Manager Wait Online.
Dec 16 08:30:47 cn4 systemd[1]: Reached target Network is Online.
Dec 16 08:30:47 cn4 systemd[1]: Started Download data for packages that failed at package install time.
Dec 16 08:30:47 cn4 systemd[1]: Started Check to see whether there is a new version of Ubuntu available.
Dec 16 08:30:47 cn4 systemd[1]: Reached target Timer Units.
Dec 16 08:30:47 cn4 systemd[1]: Starting Automounts filesystems on demand...
Dec 16 08:30:47 cn4 systemd[1]: Started Make remote CUPS printers available locally.
Dec 16 08:30:47 cn4 systemd[1]: Starting Docker Application Container Engine...
Dec 16 08:30:47 cn4 systemd[1]: Starting Tool to automatically collect and submit kernel crash signatures...
Dec 16 08:30:47 cn4 systemd[1]: Starting Notify NFS peers of a restart...
Dec 16 08:30:47 cn4 systemd[1]: Started Slurm controller daemon.
Dec 16 08:30:47 cn4 sm-notify[2519]: Version 2.6.1 starting
Dec 16 08:30:47 cn4 systemd[1]: Started Slurm node daemon.
Dec 16 08:30:47 cn4 systemd[1]: Condition check resulted in Ubuntu Pro Background Auto Attach being skipped.
Dec 16 08:30:47 cn4 systemd[1]: Starting Ubuntu FAN network setup...
Dec 16 08:30:47 cn4 systemd[1]: Started crash report submission.
Dec 16 08:30:47 cn4 dbus-daemon[1276]: [system] Activating via systemd: service name='org.freedesktop.network1' unit='dbus-org.freedesktop.network1.service' requested by ':1.66' (uid=0 pid=2536 comm="networkctl " label="unconfined")
Dec 16 08:30:47 cn4 whoopsie[2528]: [08:30:47] Using lock path: /var/lock/whoopsie/lock
Dec 16 08:30:47 cn4 systemd[1]: tmp-autoMOdFIh.mount: Deactivated successfully.
Dec 16 08:30:47 cn4 systemd[1]: Started Automounts filesystems on demand.
Dec 16 08:30:47 cn4 systemd[1]: Started Notify NFS peers of a restart.
Dec 16 08:30:47 cn4 systemd[1]: whoopsie.service: Deactivated successfully.
Dec 16 08:30:47 cn4 dbus-daemon[1276]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.network1.service': Unit dbus-org.freedesktop.network1.service not found.
Dec 16 08:30:47 cn4 systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Dec 16 08:30:47 cn4 systemd[1]: slurmctld.service: Failed with result 'exit-code'.
Dec 16 08:30:47 cn4 systemd[1]: Finished Ubuntu FAN network setup.
Dec 16 08:30:47 cn4 systemd[1]: kerneloops.service: Found left-over process 2556 (kerneloops) in control group while starting unit. Ignoring.
Dec 16 08:30:47 cn4 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Dec 16 08:30:47 cn4 systemd[1]: Started Tool to automatically collect and submit kernel crash signatures.
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: debug:  Log file re-opened
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: debug:  CPUs:64 Boards:1 Sockets:1 CoresPerSocket:64 ThreadsPerCore:1
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: topology/none: init: topology NONE plugin loaded
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: route/default: init: route default plugin loaded
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: debug2: Gathering cpu frequency information for 64 cpus
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: debug:  Resource spec: No specialized cores configured by default on this node
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: debug:  Resource spec: Reserved system memory limit not configured for this node
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: task/affinity: init: task affinity plugin loaded with CPU mask 0xffffffffffffffff
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: debug:  auth/munge: init: Munge authentication plugin loaded
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: debug:  /etc/slurm/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/*.conf"
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: cred/munge: init: Munge credential signature plugin loaded
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: slurmd version 21.08.5 started
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: debug:  job_container/tmpfs: init: job_container tmpfs plugin loaded
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: debug:  job_container/tmpfs: _read_slurm_jc_conf: Reading job_container.conf file /etc/slurm/job_container.conf
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: debug:  job_container/tmpfs: container_p_restore: job_container.conf read successfully
Dec 16 08:30:47 cn4 dockerd[2513]: time="2023-12-16T08:30:47.796552457-06:00" level=info msg="Starting up"
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: debug:  switch/none: init: switch NONE plugin loaded
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: debug:  switch Cray/Aries plugin loaded.
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: slurmd started on Sat, 16 Dec 2023 08:30:47 -0600
Dec 16 08:30:47 cn4 dockerd[2513]: time="2023-12-16T08:30:47.797730418-06:00" level=info msg="detected 127.0.0.53 nameserver, assuming systemd-resolved, so using resolv.conf: /run/systemd/resolve/resolv.conf"
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: CPUs=64 Boards=1 Sockets=1 Cores=64 Threads=1 Memory=257580 TmpDisk=937291 Uptime=14 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: debug:  acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: debug:  acct_gather_Profile/none: init: AcctGatherProfile NONE plugin loaded
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: debug:  acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: debug:  acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
Dec 16 08:30:47 cn4 slurmd[2523]: slurmd: debug:  _handle_node_reg_resp: slurmctld sent back 8 TRES.
Dec 16 08:30:47 cn4 kernel: [   13.687459] kauditd_printk_skb: 34 callbacks suppressed
Dec 16 08:30:47 cn4 kernel: [   13.687464] audit: type=1400 audit(1702737047.819:45): apparmor="STATUS" operation="profile_load" profile="unconfined" name="docker-default" pid=2583 comm="apparmor_parser"
Dec 16 08:30:47 cn4 dockerd[2513]: time="2023-12-16T08:30:47.842292050-06:00" level=info msg="[graphdriver] using prior storage driver: overlay2"
Dec 16 08:30:47 cn4 dockerd[2513]: time="2023-12-16T08:30:47.842892361-06:00" level=info msg="Loading containers: start."
Dec 16 08:30:47 cn4 kernel: [   13.722469] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
Dec 16 08:30:47 cn4 kernel: [   13.723685] Bridge firewalling registered
Dec 16 08:30:47 cn4 kernel: [   13.800099] Initializing XFRM netlink socket
Dec 16 08:30:47 cn4 systemd-udevd[2585]: Using default interface naming scheme 'v249'.
Dec 16 08:30:47 cn4 NetworkManager[1277]: <info>  [1702737047.9553] manager: (docker0): new Bridge device (/org/freedesktop/NetworkManager/Devices/5)
Dec 16 08:30:48 cn4 avahi-daemon[1272]: Joining mDNS multicast group on interface docker0.IPv4 with address 172.17.0.1.
Dec 16 08:30:48 cn4 avahi-daemon[1272]: New relevant interface docker0.IPv4 for mDNS.
Dec 16 08:30:48 cn4 avahi-daemon[1272]: Registering new address record for 172.17.0.1 on docker0.IPv4.
Dec 16 08:30:48 cn4 NetworkManager[1277]: <info>  [1702737048.0448] device (docker0): state change: unmanaged -> unavailable (reason 'connection-assumed', sys-iface-state: 'external')
Dec 16 08:30:48 cn4 NetworkManager[1277]: <info>  [1702737048.0454] device (docker0): state change: unavailable -> disconnected (reason 'connection-assumed', sys-iface-state: 'external')
Dec 16 08:30:48 cn4 NetworkManager[1277]: <info>  [1702737048.0462] device (docker0): Activation: starting connection 'docker0' (41877044-c592-41d7-9fd1-b1aff3979e0e)
Dec 16 08:30:48 cn4 NetworkManager[1277]: <info>  [1702737048.0465] device (docker0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'external')
Dec 16 08:30:48 cn4 NetworkManager[1277]: <info>  [1702737048.0468] device (docker0): state change: prepare -> config (reason 'none', sys-iface-state: 'external')
Dec 16 08:30:48 cn4 NetworkManager[1277]: <info>  [1702737048.0471] device (docker0): state change: config -> ip-config (reason 'none', sys-iface-state: 'external')
Dec 16 08:30:48 cn4 NetworkManager[1277]: <info>  [1702737048.0473] device (docker0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'external')
Dec 16 08:30:48 cn4 NetworkManager[1277]: <info>  [1702737048.0515] device (docker0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'external')
Dec 16 08:30:48 cn4 NetworkManager[1277]: <info>  [1702737048.0517] device (docker0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'external')
Dec 16 08:30:48 cn4 NetworkManager[1277]: <info>  [1702737048.0522] device (docker0): Activation: successful, device activated.
Dec 16 08:30:48 cn4 dockerd[2513]: time="2023-12-16T08:30:48.329973116-06:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Dec 16 08:30:48 cn4 dockerd[2513]: time="2023-12-16T08:30:48.387669581-06:00" level=info msg="Loading containers: done."
Dec 16 08:30:48 cn4 dockerd[2513]: time="2023-12-16T08:30:48.404315666-06:00" level=info msg="Docker daemon" commit="24.0.5-0ubuntu1~22.04.1" graphdriver=overlay2 version=24.0.5
Dec 16 08:30:48 cn4 dockerd[2513]: time="2023-12-16T08:30:48.404430837-06:00" level=info msg="Daemon has completed initialization"
Dec 16 08:30:48 cn4 dockerd[2513]: time="2023-12-16T08:30:48.435420086-06:00" level=info msg="API listen on /run/docker.sock"
Dec 16 08:30:48 cn4 systemd[1]: Started Docker Application Container Engine.
Dec 16 08:30:48 cn4 systemd[1]: Reached target Multi-User System.
Dec 16 08:30:48 cn4 systemd[1]: Reached target Graphical Interface.
Dec 16 08:30:48 cn4 systemd[1]: Starting Record Runlevel Change in UTMP...
Dec 16 08:30:48 cn4 systemd[1]: systemd-update-utmp-runlevel.service: Deactivated successfully.
Dec 16 08:30:48 cn4 systemd[1]: Finished Record Runlevel Change in UTMP.
Dec 16 08:30:48 cn4 systemd[1]: Startup finished in 2min 43.308s (firmware) + 1.521s (loader) + 3.211s (kernel) + 11.127s (userspace) = 2min 59.169s.
Dec 16 08:30:48 cn4 slurmd[2523]: slurmd: debug2: Start processing RPC: REQUEST_TERMINATE_JOB
Dec 16 08:30:48 cn4 slurmd[2523]: slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
Dec 16 08:30:48 cn4 slurmd[2523]: slurmd: debug:  _rpc_terminate_job: uid = 64030 JobId=280
Dec 16 08:30:48 cn4 slurmd[2523]: slurmd: debug:  credential for job 280 revoked
Dec 16 08:30:48 cn4 slurmd[2523]: slurmd: debug2: No steps in jobid 280 to send signal 18
Dec 16 08:30:48 cn4 slurmd[2523]: slurmd: debug2: No steps in jobid 280 to send signal 15
Dec 16 08:30:48 cn4 slurmd[2523]: slurmd: debug2: set revoke expiration for jobid 280 to 1702737168 UTS
Dec 16 08:30:49 cn4 kernel: [   15.220732] FS-Cache: Loaded
Dec 16 08:30:49 cn4 kernel: [   15.461529] NFS: Registering the id_resolver key type
Dec 16 08:30:49 cn4 kernel: [   15.461545] Key type id_resolver registered
Dec 16 08:30:49 cn4 kernel: [   15.461548] Key type id_legacy registered
Dec 16 08:30:56 cn4 dbus-daemon[1529]: [session uid=128 pid=1529] Activating via systemd: service name='org.freedesktop.Tracker3.Miner.Extract' unit='tracker-extract-3.service' requested by ':1.4' (uid=128 pid=1545 comm="/usr/libexec/tracker-miner-fs-3 " label="unconfined")

Not sure how to interpret this, but it seems like a networking issue?

Okay, it happened again last night. I’ve attached (from around the time of failure) the slurmctld log (not very informative), the slurmd log on the problem node cn4 (similar errors to before), and the syslog on cn4 all the way up to the time of failure (I think this will show more than what I posted previously). Any guidance would be appreciated.

cn4_slurmd.txt (52.4 KB)
cn4_syslog.txt (285.9 KB)
head_slurmctld.txt (1.6 KB)
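
In case it’s useful, this is roughly how I grabbed those windows (assuming the nodes keep a persistent systemd journal; adjust the timestamps to the window in question):

sudo journalctl -u slurmd --since "2023-12-18 21:00" --until "2023-12-18 22:00" > cn4_slurmd.txt
sudo journalctl --since "2023-12-18 21:00" --until "2023-12-18 22:00" > cn4_syslog.txt

…and the same with -u slurmctld on the head node.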

I will try to take a look a bit later; trying to wrap up work before the holidays!

Thanks, @infinitevalence! Looking at the syslog, I would almost think that the node rebooted between Dec 18 21:26:41 and Dec 18 21:30:15… and indeed, dmesg begins right around then (slight mismatch in timestamps?):

sudo dmesg -T
[Mon Dec 18 21:30:11 2023]  Linux version 6.2.0-39-generic (buildd@lcy02-amd64-045) (x86_64-linux-gnu-gcc-11 (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #40~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 16 10:53:04 UTC 2 (Ubuntu 6.2.0-39.40~22.04.1-generic 6.2.16)

…but why.
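
To confirm it was a genuine reboot rather than just a network dropout, something like this should show the boot history (assuming wtmp and a persistent journal on the node):

last -x shutdown reboot | head
journalctl --list-boots | tail -n 5

If --list-boots shows only a single entry, the journal isn’t being persisted across reboots, which would also explain the syslog only picking up again after boot.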

EDIT: FWIW (nothing around the time of failure, but I’m also just learning ipmitool commands, so I might not be looking at the right thing):

sudo ipmitool sel elist
   1 | 03/23/2023 | 02:18:31 | OS Critical Stop | OS graceful shutdown () | Asserted
   2 | 01/01/2011 | 05:22:23 | Unknown #0xff |  | Asserted
   3 | 01/01/2011 | 05:23:13 | Battery | Low () | Asserted
   4 | 09/08/2022 | 15:28:15 | Power Supply PS2 Status | Presence detected () | Asserted
   5 | 09/13/2022 | 14:33:44 | Unknown #0xff |  | Asserted
   6 | 09/13/2022 | 14:37:19 | Unknown #0xff |  | Asserted
   7 | 09/13/2022 | 14:38:03 | Power Supply #0xc0 | Presence detected () | Asserted
   8 | 09/24/2022 | 21:31:11 | Unknown #0xff |  | Asserted
   9 | 09/24/2022 | 21:32:24 | Power Supply PS2 Status | Presence detected () | Asserted
   a | 03/23/2023 | 02:11:48 | Unknown #0xff |  | Asserted
   b | 09/28/2022 | 20:02:51 | Power Supply PS2 Status | Presence detected () | Asserted
   c | 09/29/2022 | 17:26:16 | Unknown #0xff |  | Asserted
   d | 09/29/2022 | 17:26:49 | Power Supply PS2 Status | Presence detected () | Asserted
   e | 03/23/2023 | 02:12:08 | Unknown #0xff |  | Asserted
   f | 09/30/2022 | 19:47:57 | Power Supply PS2 Status | Presence detected () | Asserted
  10 | 11/17/2023 | 13:50:58 | Unknown #0xff |  | Asserted
  11 | 11/17/2023 | 13:52:21 | Power Supply PS2 Status | Presence detected () | Asserted
  12 | 11/20/2023 | 14:38:33 | Unknown #0xff |  | Asserted
  13 | 11/20/2023 | 14:39:02 | Power Supply PS2 Status | Presence detected () | Asserted
  14 | 11/21/2023 | 14:40:12 | Unknown #0xff |  | Asserted
  15 | 11/21/2023 | 14:40:44 | Power Supply PS2 Status | Presence detected () | Asserted
  16 | 12/06/2023 | 19:45:12 | Memory | Correctable ECC (@DIMMB1(CPU1)) | Asserted
  17 | 12/07/2023 | 20:20:20 | Memory | Uncorrectable ECC (@DIMMG1(CPU1)) | Asserted
  18 | 12/07/2023 | 21:27:32 | Memory | Uncorrectable ECC (@DIMMG1(CPU1)) | Asserted
  19 | 12/07/2023 | 22:50:12 | Memory | Uncorrectable ECC (@DIMMG1(CPU1)) | Asserted
  1a | 12/10/2023 | 02:00:43 | Memory | Uncorrectable ECC (@DIMMG1(CPU1)) | Asserted
  1b | 12/10/2023 | 16:05:08 | Memory | Uncorrectable ECC (@DIMMG1(CPU1)) | Asserted
  1c | 12/12/2023 | 20:33:16 | Memory | Memory Device Disabled (@DIMMG1(CPU1)) | Asserted
  1d | 03/23/2023 | 02:11:59 | Unknown #0xff |  | Asserted
  1e | 12/12/2023 | 21:00:12 | Power Supply PS2 Status | Presence detected () | Asserted
  1f | 03/23/2023 | 02:11:53 | Unknown #0xff |  | Asserted
  20 | 12/12/2023 | 21:07:08 | Power Supply PS2 Status | Presence detected () | Asserted
  21 | 12/12/2023 | 21:38:45 | Memory | Uncorrectable ECC (@DIMMG1(CPU1)) | Asserted
  22 | 12/13/2023 | 14:39:33 | Unknown #0xff |  | Asserted
  23 | 12/13/2023 | 14:41:07 | Power Supply PS2 Status | Presence detected () | Asserted
  24 | 12/13/2023 | 14:54:10 | Unknown #0xff |  | Asserted
  25 | 12/13/2023 | 14:55:11 | Power Supply PS2 Status | Presence detected () | Asserted
  26 | 12/13/2023 | 15:17:44 | Unknown #0xff |  | Asserted
  27 | 12/13/2023 | 15:18:10 | Power Supply PS2 Status | Presence detected () | Asserted
  28 | 12/13/2023 | 15:31:42 | Unknown #0xff |  | Asserted
  29 | 12/13/2023 | 15:32:16 | Power Supply PS2 Status | Presence detected () | Asserted
  2a | 12/13/2023 | 15:45:41 | Unknown #0xff |  | Asserted
  2b | 12/13/2023 | 15:49:44 | Unknown #0xff |  | Asserted
  2c | 12/13/2023 | 15:50:23 | Power Supply PS2 Status | Presence detected () | Asserted
  2d | 12/13/2023 | 15:58:57 | Unknown #0xff |  | Asserted
  2e | 12/13/2023 | 15:59:35 | Power Supply PS2 Status | Presence detected () | Asserted
  2f | 12/13/2023 | 17:14:31 | Unknown #0xff |  | Asserted
  30 | 12/13/2023 | 17:15:11 | Power Supply PS2 Status | Presence detected () | Asserted
  31 | 12/13/2023 | 17:20:45 | Unknown #0xff |  | Asserted
  32 | 12/13/2023 | 17:21:52 | Power Supply PS2 Status | Presence detected () | Asserted
  33 | 12/13/2023 | 17:30:11 | Unknown #0xff |  | Asserted
  34 | 12/13/2023 | 17:30:40 | Power Supply PS2 Status | Presence detected () | Asserted
  35 | 12/13/2023 | 17:38:17 | Unknown #0xff |  | Asserted
  36 | 12/13/2023 | 17:38:47 | Power Supply PS2 Status | Presence detected () | Asserted
  37 | 12/13/2023 | 17:48:52 | Unknown #0xff |  | Asserted
  38 | 12/13/2023 | 17:49:26 | Power Supply PS2 Status | Presence detected () | Asserted
  39 | 12/13/2023 | 18:11:44 | Unknown #0xff |  | Asserted
  3a | 12/13/2023 | 18:12:30 | Power Supply PS2 Status | Presence detected () | Asserted
  3b | 12/13/2023 | 20:33:23 | Unknown #0xff |  | Asserted
  3c | 12/13/2023 | 20:34:22 | Power Supply PS2 Status | Presence detected () | Asserted
  3d | 03/23/2023 | 02:11:54 | Unknown #0xff |  | Asserted
  3e | 03/23/2023 | 02:12:03 | Unknown #0xff |  | Asserted
  3f | 03/23/2023 | 02:12:05 | Unknown #0xff |  | Asserted
  40 | 03/23/2023 | 02:11:51 | Unknown #0xff |  | Asserted
  41 | 03/23/2023 | 02:11:54 | Unknown #0xff |  | Asserted
  42 | 03/23/2023 | 02:12:20 | Unknown #0xff |  | Asserted
  43 | 01/01/2011 | 00:00:37 | Battery | Low () | Asserted
  44 | 01/01/2011 | 00:01:02 | Unknown #0xff |  | Asserted
  45 | 09/08/2022 | 15:29:18 | Power Supply PS2 Status | Presence detected () | Asserted
  46 | 12/13/2023 | 21:57:32 | Unknown #0xff |  | Asserted
  47 | 12/13/2023 | 21:58:26 | Power Supply PS2 Status | Presence detected () | Asserted
  48 | 03/23/2023 | 02:12:12 | Unknown #0xff |  | Asserted
  49 | 01/01/2011 | 08:14:52 | Battery | Low () | Asserted
  4a | 09/08/2022 | 15:27:25 | Power Supply PS2 Status | Presence detected () | Asserted
  4b | 03/23/2023 | 02:12:00 | Unknown #0xff |  | Asserted
  4c | 12/14/2023 | 16:39:30 | Power Supply PS2 Status | Presence detected () | Asserted
  4d | 12/14/2023 | 17:57:59 | Memory | Uncorrectable ECC (@DIMMH1(CPU1)) | Asserted
  4e | 12/16/2023 | 02:46:03 | Memory | Uncorrectable ECC (@DIMMH1(CPU1)) | Asserted
  4f | 12/16/2023 | 14:30:00 | Memory | Uncorrectable ECC (@DIMMH1(CPU1)) | Asserted
  50 | 12/18/2023 | 15:22:41 | Memory | Uncorrectable ECC (@DIMMH1(CPU1)) | Asserted
  51 | 12/18/2023 | 16:46:59 | Memory | Uncorrectable ECC (@DIMMH1(CPU1)) | Asserted
  52 | 12/18/2023 | 18:42:48 | Memory | Correctable ECC (@DIMMH1(CPU1)) | Asserted
  53 | 12/19/2023 | 03:29:36 | Memory | Uncorrectable ECC (@DIMMH1(CPU1)) | Asserted
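
Most of that is noise; the lines that stand out are the memory events. Filtering makes them easier to see, e.g.:

sudo ipmitool sel elist | grep -Ei 'ecc|memory'

…which leaves one correctable error on DIMMB1 and a string of uncorrectable ECC errors on DIMMG1 and then DIMMH1, all on CPU1.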

Decided to swap out the PSU (which, notably, was rated 800W instead of 900W… though we shouldn’t be drawing that much). I’ll submit jobs overnight and see if the node makes it through with no errors.
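
While those run, I can watch the node state from the head node; a minimal sketch (node name from this thread) that should catch the moment Slurm marks it down, along with the reason string:

watch -n 60 "sinfo -R; scontrol show node cn4 | grep -iE 'state|reason'"

sinfo -R only lists nodes that are down/drained together with the recorded reason, so the output stays empty until something goes wrong.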

EDIT: No dice with swapping out the PSU. It dropped again around midnight last night. All logs (slurmctld, slurmd, and syslog) look basically the same as before.

Looks like we still have some marginal memory stuff going on… Good call on the PSU.

I think we’re back to swapping in known-good RAM and seeing if that changes things.

It is reporting the error on DIMM 1 every time, but I’m not sure I totally believe it. It could be failing across all of them and only reporting the first stick. Still, it may be worth checking into, and rotating that stick to a different slot.

Swapped RAM 1-to-1 between nodes cn3 and cn4. Here is the error list for cn3 upon boot:

   1 | 03/23/2023 | 02:12:47 | OS Critical Stop | OS graceful shutdown () | Asserted
   2 | 01/01/2011 | 14:11:42 | Unknown #0xff |  | Asserted
   3 | 01/01/2011 | 14:13:26 | Battery | Low () | Asserted
   4 | 09/08/2022 | 15:28:53 | Power Supply PS2 Status | Presence detected () | Asserted
   5 | 09/13/2022 | 16:27:44 | Unknown #0xff |  | Asserted
   6 | 09/13/2022 | 16:28:36 | Power Supply PS2 Status | Presence detected () | Asserted
   7 | 09/13/2022 | 16:30:44 | Unknown #0xff |  | Asserted
   8 | 09/13/2022 | 16:31:27 | Power Supply #0xc8 | Presence detected () | Asserted
   9 | 09/24/2022 | 23:15:45 | Unknown #0xff |  | Asserted
   a | 09/24/2022 | 23:16:59 | Power Supply PS2 Status | Presence detected () | Asserted
   b | 03/23/2023 | 02:12:18 | Unknown #0xff |  | Asserted
   c | 09/28/2022 | 21:47:11 | Power Supply PS2 Status | Presence detected () | Asserted
   d | 03/23/2023 | 02:11:57 | Unknown #0xff |  | Asserted
   e | 09/29/2022 | 19:16:17 | Power Supply PS2 Status | Presence detected () | Asserted
   f | 03/23/2023 | 02:12:17 | Unknown #0xff |  | Asserted
  10 | 09/30/2022 | 21:12:46 | Power Supply PS2 Status | Presence detected () | Asserted
  11 | 11/16/2023 | 19:21:54 | Power Supply PS2 Status | Presence detected () | Asserted
  12 | 11/17/2023 | 13:50:59 | Unknown #0xff |  | Asserted
  13 | 11/17/2023 | 13:51:34 | Power Supply PS2 Status | Presence detected () | Asserted
  14 | 11/20/2023 | 14:38:32 | Unknown #0xff |  | Asserted
  15 | 11/20/2023 | 14:39:02 | Power Supply PS2 Status | Presence detected () | Asserted
  16 | 03/23/2023 | 02:12:00 | Unknown #0xff |  | Asserted
  17 | 11/21/2023 | 14:45:37 | Power Supply PS2 Status | Presence detected () | Asserted
  18 | 11/22/2023 | 00:09:25 | Power Supply PS2 Status | Presence detected () | Asserted
  19 | 12/18/2023 | 18:42:08 | Power Supply PS2 Status | Presence detected () | Asserted
  1a | 12/20/2023 | 16:58:09 | Unknown #0xff |  | Asserted
  1b | 12/20/2023 | 16:58:52 | Power Supply PS2 Status | Presence detected () | Asserted

So, no previous RAM issues. We’ll see if they pop up here.

On another note, what is Power Supply PS2 Status | Presence detected () | Asserted indicating? Is this really an error?

EDIT: Also of note: it seems the ipmitool timestamps are exactly 6 hours ahead of the log timestamps, so the failures in the posts above around Dec 18 21:30:15 actually match the memory failures at 12/19/2023 | 03:29:36. I figured that out based on the timestamps below for the swap into cn3.
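
If anyone else trips over this: the BMC keeps its own clock for the SEL, so the 6-hour skew is presumably just the BMC running on UTC while syslog uses local time (UTC-6 here). Something like this should confirm it, and optionally sync the SEL clock (example timestamp, not one of mine):

sudo ipmitool sel time get
sudo ipmitool sel time set "12/20/2023 10:58:00"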

Annnnnnnd, cn3 has now fallen offline, and ipmitool reveals:

   1 | 03/23/2023 | 02:12:47 | OS Critical Stop | OS graceful shutdown () | Asserted
   2 | 01/01/2011 | 14:11:42 | Unknown #0xff |  | Asserted
   3 | 01/01/2011 | 14:13:26 | Battery | Low () | Asserted
   4 | 09/08/2022 | 15:28:53 | Power Supply PS2 Status | Presence detected () | Asserted
   5 | 09/13/2022 | 16:27:44 | Unknown #0xff |  | Asserted
   6 | 09/13/2022 | 16:28:36 | Power Supply PS2 Status | Presence detected () | Asserted
   7 | 09/13/2022 | 16:30:44 | Unknown #0xff |  | Asserted
   8 | 09/13/2022 | 16:31:27 | Power Supply #0xc8 | Presence detected () | Asserted
   9 | 09/24/2022 | 23:15:45 | Unknown #0xff |  | Asserted
   a | 09/24/2022 | 23:16:59 | Power Supply PS2 Status | Presence detected () | Asserted
   b | 03/23/2023 | 02:12:18 | Unknown #0xff |  | Asserted
   c | 09/28/2022 | 21:47:11 | Power Supply PS2 Status | Presence detected () | Asserted
   d | 03/23/2023 | 02:11:57 | Unknown #0xff |  | Asserted
   e | 09/29/2022 | 19:16:17 | Power Supply PS2 Status | Presence detected () | Asserted
   f | 03/23/2023 | 02:12:17 | Unknown #0xff |  | Asserted
  10 | 09/30/2022 | 21:12:46 | Power Supply PS2 Status | Presence detected () | Asserted
  11 | 11/16/2023 | 19:21:54 | Power Supply PS2 Status | Presence detected () | Asserted
  12 | 11/17/2023 | 13:50:59 | Unknown #0xff |  | Asserted
  13 | 11/17/2023 | 13:51:34 | Power Supply PS2 Status | Presence detected () | Asserted
  14 | 11/20/2023 | 14:38:32 | Unknown #0xff |  | Asserted
  15 | 11/20/2023 | 14:39:02 | Power Supply PS2 Status | Presence detected () | Asserted
  16 | 03/23/2023 | 02:12:00 | Unknown #0xff |  | Asserted
  17 | 11/21/2023 | 14:45:37 | Power Supply PS2 Status | Presence detected () | Asserted
  18 | 11/22/2023 | 00:09:25 | Power Supply PS2 Status | Presence detected () | Asserted
  19 | 12/18/2023 | 18:42:08 | Power Supply PS2 Status | Presence detected () | Asserted
  1a | 12/20/2023 | 16:58:09 | Unknown #0xff |  | Asserted
  1b | 12/20/2023 | 16:58:52 | Power Supply PS2 Status | Presence detected () | Asserted
  1c | 12/20/2023 | 17:24:28 | Memory | Uncorrectable ECC (@DIMMH1(CPU1)) | Asserted

I have tested each of these sticks (via memtest86+) separately, and they all pass… so what’s the issue?
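
From what I’ve read, memtest86+ can miss marginal sticks that only fault under sustained load and heat, so a clean pass may not clear them. I might also try watching the ECC counters from inside the OS while a job runs; a sketch with rasdaemon, assuming the kernel’s EDAC driver supports this platform:

sudo apt install rasdaemon
sudo systemctl enable --now rasdaemon
sudo ras-mc-ctl --summary   # per-DIMM corrected/uncorrected counts
sudo ras-mc-ctl --errors    # individual logged memory error events

That would at least tie the errors to a running job rather than waiting for the node to drop.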

I’ll mix up the sticks and see if it is still the H1 slot that gives the error, and I suppose an RMA is in order if it follows the stick.

Probably, if it follows the stick.

Overall, out of 136 total sticks of RAM, 4 sticks across 4 nodes show memory errors: 2 with uncorrectable errors and 2 with correctable errors. The memory errors have affected only one node in the observed way (though that’s not to say we just haven’t tripped the other nodes yet).

Should I just RMA all 4 sticks? Only the 2 that have had uncorrectable errors?

Switched the RAM from slot H to slot A, and merry Christmas to me:

  1f | 12/24/2023 | 01:12:51 | Memory | Uncorrectable ECC (@DIMMA1(CPU1)) | Asserted

…so, it fully follows the stick.

Yay!

Merry Christmas

RMA’d all sticks with uncorrectable or correctable ECC errors, and no nodes have dropped after 2 weeks of continuous use.
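
One last note for anyone who finds this thread: after a hardware fix, nodes that Slurm has marked down stay down until you clear them, e.g. (node names from this thread):

sudo scontrol update NodeName=cn3,cn4 State=RESUME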
