Building a 10-100W distributed ARM-based multimedia transcode cluster : I missed Devember -> Jobless January -> Freezing February -> Magic March

TLDR; I’m a student, so I missed Devember (:sob:), and with the current world situation, I’m no longer in a position to be a student.

All this to say, I have more time on my hands than I expected, so it's time to update my lab and learn something new! So let's build a distributed, load-balanced multimedia processing cluster that teaches me: CUDA, ffmpeg, aarch64, nodeJS/Kubernetes, distributed computing optimization, and how to manage a large project.

(please share any thoughts, feedback, ideas below! I’m trying to challenge myself!)

What’s the problem?

I’m currently running some old Xeons (2x X5670s from the X58 era) as my school computer/workstation, and they aren't exactly the most efficient devices in the world. I'm also a big movie buff when I have the time, so I run a Plex server for myself, friends, and family. Currently my power usage is much higher than I'm comfortable with: the system idles north of 100 W and maxes out at ~400 W when transcoding or when I'm doing development work. It just isn't efficient enough to justify running long term.

What I’m trying to learn:

Learning Goals

I’m pretty comfortable with C programming (and some x86 assembly) and standard networking (VLANs, wireless, etc.), but I have never done much in the way of GPGPU programming (CUDA or OpenCL), distributed compute, or work on embedded/arm32/64 platforms.

What I’m trying to achieve:

End Goal

Build a low-power cluster of ARM SOCs/SOMs that can seamlessly handle Plex transcoding of up to 6 simultaneous HEVC 4K30p 10-bit HDR streams down to SDR 2K30p 8-bit H.264, with tone mapping when needed, at as low a total deployment cost as possible.
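For a sense of scale, here's roughly what one of those jobs looks like in plain software ffmpeg today (the standard libzimg tonemapping recipe; file names are placeholders):

ffmpeg -i input_4k_hdr.mkv \
-vf "scale=-2:1080,zscale=t=linear:npl=100,format=gbrpf32le,zscale=p=bt709,tonemap=hable,zscale=t=bt709:m=bt709:r=tv,format=yuv420p" \
-c:v libx264 -c:a copy output_2k_sdr.mkv

The whole point of this project is to replace essentially every stage of that chain with a hardware block or a CUDA kernel.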

Ideally the footprint would also fit in something like a BitFenix Prodigy, idle at ~10 W, and peak no higher than 100 W under full load, not counting my workstation.

Components:

Hardware and software needed

Hardware:

  • Host system/workstation (I'm not much of a gamer, so I'm looking for more cores and a decent amount of RAM for VMs and so on). Thinking a pair of Ivy Bridge Xeons (can reuse some old ECC DDR3).
  • Networking switch (PoE to minimize cables, if it's stable enough using inexpensive RPi HATs?)
  • ARM/compute nodes: multiple NVIDIA Jetsons (ideally Xavier NX; AGX is the dream), but probably 2 GB Nanos or, depending on launch dates on NVIDIA's roadmap, the Jetson Nano Next or Jetson Orin.
    • They have hardware encoders and decoders, and enough GPU grunt for any filtering I need to do. 2 GB is a little tight, but with how quickly I can shuffle data around and clear RAM, it should be OK (concerned about the cost of memcpy, but that's TBD). Old 16 GB Optane sticks are pretty cheap and could help if I can allocate the Optane as swap, kinda like L2ARC in ZFS (a quick sketch of that setup is below). They cost ~$10 more, still much cheaper than the 4 GB variant.
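The swap setup itself is trivial; something like this, assuming the module shows up as /dev/nvme0n1 and gets priority over the default zram/SD swap:

sudo mkswap /dev/nvme0n1
sudo swapon -p 10 /dev/nvme0n1
echo '/dev/nvme0n1 none swap sw,pri=10 0 0' | sudo tee -a /etc/fstab  # persist across reboots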

Software:

  1. Host
  • a. OS for server- Debian based
  • b. UnicornTranscoder or kube-plex
  • c. NFS share (Data itself is accessed over the network on my nas, ZFS and so on)
  • d. Custom capture script to modify arguments sent from PMS to the transcoder
  • e. Plex Media Server
  • f. Load balancer
  2. Jetson (aka transcoder node)
  • a. JetPack (4.3?)
  • b. client side of UT or KP from 1.b
  • c. custom ffmpeg build
    • i. Custom CUDA filter I wrote for tonemapping (should upstream to the newest branch once I'm done); currently reinhard, clip, and hable, but will change to BT.2390 eventually
    • ii. jocover's ffmpeg patch to enable the use of the transcode blocks
    • iii. NVIDIA build of ffmpeg to enable decoding and vf_scale_cuda
  • d. client side of load balancer from 1.f
  • e. (things I forgot will go here)

The Plan:


First thing is to finish up the CUDA tonemap filter and push that to the next build of ffmpeg! That's nearly done and, pending approval from the powers that be, will be in the next release of ffmpeg (Learn basics of CUDA: check!)

Second is to actually buy the hardware (look for deals on used devices ideally- do what you can to minimize cost)

  • sell old system when I can, recoup whatever is possible

Third is to build the capture script that changes ffmpeg arguments to their HW-accelerated counterparts (nvdec, scale_cuda, tonemap_cuda, nvenc) and others (might make sense to keep stereo audio transcodes on the CPU?). A sketch of the idea is below.
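Conceptually, the capture script is just a shim that sits where the transcoder binary is expected and rewrites arguments before handing off to the real ffmpeg. A naive sketch (the substitution table and the real binary's path are placeholders):

#!/bin/bash
# argument-rewriting shim: swap CPU codecs/filters for their CUDA equivalents
args=()
for a in "$@"; do
  a=${a//libx264/h264_nvenc}   # CPU H.264 encode -> NVENC
  a=${a//scale=/scale_cuda=}   # CPU scaling -> CUDA scaling
  args+=("$a")
done
exec /path/to/real/ffmpeg "${args[@]}"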

Fourth is relatively straightforward: deploy the system. Will need to decide if I want to use kube-plex vs. UnicornTranscoder.

Fifth is to make it pretty! (or actually set up power monitoring to benchmark performance and share lessons learned) and document how I built this thing!

I've been documenting this as I go on my GitHub in this repository: (under MIT license, so feel free to play with it as much as you want!)

6 Likes

Wow, this is sweet!

Definitely following this one.

3 Likes

I just pushed a near-final version of the FFmpeg filter to my personal repo; waiting on comments/feedback etc. from the IRC channel before I finalize it and push the filter up to the master branch!

The code for those two files (C side and CUDA side) is here!

I’d love feedback/questions/comments on anything you notice in them!

You'll need to build from source and have the NVIDIA headers installed on your system. (NB: you don't need a CUDA device on your system to build the filter, but you do need one to test/run it.)
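If you've never built it before, the recipe boils down to installing the nv-codec-headers and configuring ffmpeg with CUDA enabled; roughly (paths may differ depending on where your CUDA toolkit lives):

git clone https://git.videolan.org/git/ffmpeg/nv-codec-headers.git
sudo make -C nv-codec-headers install
git clone https://git.ffmpeg.org/ffmpeg.git && cd ffmpeg
./configure --enable-nonfree --enable-cuda-nvcc --enable-libnpp \
    --extra-cflags=-I/usr/local/cuda/include --extra-ldflags=-L/usr/local/cuda/lib64
make -j"$(nproc)"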

Next part is actually getting the Jetsons! This should probably be in the classifieds (please PM me and I'll change this), but if anyone has an old Jetson of any kind that they aren't using or are willing to sell, I'd love to see if we can make a deal happen!

For the switch, I found an old Netgear 5-port dumb switch in my closet (12 V, 0.75 A), so pretty low power! Not PoE, but that's a problem for another day. (This definitely counts as feature creep, but integrating a battery bank into this thing could be kinda sweet! Something where you turn off/disconnect nodes if they aren't needed? A portable box capable of handling an entire lecture hall's worth of clients would be kinda epic.)

Beyond that, I've pulled the source for the UnicornTranscoder project and am starting to poke around for how it allocates jobs and sends the commands. To implement the feature-creep idea above, Kubernetes might be a better option, but it should also be relatively easy to switch over from UT to KP if and when I get around to it/am looking to teach myself even more things.

1 Like

I'm not familiar with CUDA, and I wouldn't call myself "strong" in C, but I'd be happy to give it a once-over to see if there's anything glaring.

Starting to look into performance bottlenecks preemptively, I’ve noticed an issue with the way the ffmpeg filter graph (AKA order of operations) works that might kill performance.

Normally, on a discrete GPU, the ffmpeg CUDA implementation lets you output decoded frames in a CUDA format and keep them in a GPU context for filtering and encoding.

Right now the Jetson flow doesn't support sending filtered CUDA frames in a GPU context directly to the encoding blocks; instead, frames are copied back to system RAM and back to the GPU every time we context-switch between CPU and GPU:

system RAM                            GPU (over PCIe)
----------                            ---------------
file in memory, request decode  --->  copy 1: frame to GPU, DECODE
decoded frame back in RAM       <---  copy 2: decoded frame to RAM
                                --->  copy 3: frame to GPU, HW-accelerated filtering (ex: scale_cuda)
filtered frame back in RAM      <---  copy 4: filtered frame to RAM
                                --->  copy 5: frame to GPU, ENCODE
output file to destination      <---  copy 6: encoded frame to RAM

This means we currently do 6 memory copies instead of 2, and that's pretty expensive from a compute standpoint. I've opened an issue with the developers of the jetson-ffmpeg community project, linked here, and can hopefully get that fixed/added in the future.

In the meantime, it appears that the NVIDIA build of ffmpeg for the Jetson supports passing CUDA frames from the decoder directly to a CUDA filter. This brings us from 6 down to 4 memory copies, so it's some improvement (still very expensive, unfortunately). The NVIDIA implementation does not support hardware encoding, though.
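For reference, this is the zero-extra-copy flow I'm chasing, as it already works on a discrete card with mainline ffmpeg; decode, filter, and encode all stay in one CUDA context (file names are placeholders):

ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i input_4k_hevc.mkv \
-vf scale_cuda=1920:1080 -c:v h264_nvenc output_1080p.mp4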

Something I'm worried about is that the lack of RAM may mean I need to buy the 4 GB variant of the Nano. It's around a 40% increase in cost to go from 2 GB to 4 GB, which means getting 3 of the 2 GB Nanos might have to take a back seat to 2 of the 4 GB models (the Xavier NX with 8 GB should be fine regardless, but more performance is always a good thing). I'm still wondering whether adding an Optane module to the Nano and using the entire thing as swap would be enough. Both models have 25.6 GB/s of memory bandwidth and an x4 PCIe 2.0 interface, but the Optane modules are only x2, so bandwidth would top out around 1 GB/s. Thankfully this is still more than the 900 MB/s peak speed of the drive, so it won't be bandwidth starved directly.

Question: thoughts on whether that would be fine? I.e., I'd do 3 Nanos, each with a cheap 16 GB Optane module installed, instead of 2 of the 4 GB models?

Copying is not really an issue for compute performance. From what I have seen, copying can be done in parallel without affecting compute (anything is hardware dependent of course, and I mainly work with AMD GPUs for now). The issue it causes is latency, but if you queue up several frames in parallel you can potentially hide that latency, meaning you do the copy for one frame while the previous frame is being processed by the GPU. That would of course increase memory requirements further and likely hit memory bandwidth issues anyway, but changing the filter flow to pass the GPU context in a design that was not made for it might be very difficult.
If the response from the developers is that it is unlikely to be solved in the near future, it might be more realistic to hack together a single stage in the flow that is hardcoded to do decode->filter->encode in one step, not supporting any of the normal ffmpeg filtering steps that run on the CPU. (Is the downscaling happening on the GPU as well?)

Where are your memory requirements coming from? The 6 copies doesn't necessarily mean it needs to hold all 6 copies at once.
I would be careful with speculating too much on performance by only looking at paper specs. Buy one at first to test (or perhaps ask someone who might have one to try it for you), and then invest in the 2 extra ones if it works.
Can the Nano actually do transcoding at the listed specs? I.e., if you are decoding two 4K streams, will it still be able to encode two 2K streams at the same time?

1 Like

The issue with memory is the way FFmpeg handles its frames: memory structures are not released until a frame has "completed" the chain, and each time you move hardware context, the memory is copied. So, for example, multiple filters in the same hardware context (CPU) where you de-interlace, re-scale, and de-saturate will only keep one copy of the frame in memory. Unfortunately, it copies the entire stack when it changes from one processor to the other :confused:

IIRC all the frames are processed in a queue, so I can have multiple frames processing at the same time.

Yeah, I'd be doing hardware scaling with scale_cuda (standard or bicubic interpolation). Bicubic is a little more efficient when you can do a direct-factor conversion (4K->2K) but gets messy when doing fractional conversions (2K->720p). Will probably default to always doing standard and take the slight quality hit.

Currently hunting for a used one! Might just have to bite the bullet so this doesn't die on the vine :frowning:

Yeah! The decode and encode are done on independent hardware blocks. They technically only process one stream at a time, but can do it fast enough to post the numbers NVIDIA claims. Ideally they would quote those numbers as MP/s at a given codec, bitrate, and chroma, but that might be too much for a standard spec sheet.

1 Like

I've begun development on the load-balancing side of the system! One of the issues was that the host server would previously send default CPU transcoding arguments, as a sort of lowest common denominator, to any node. To make this use case work, I need to substitute any CPU transcoding/scaling etc. arguments with their CUDA equivalents.

As it turns out, the UnicornTranscoder project (GitHub repo here) already had a case where, due to a certain audio library being poorly supported, they substituted that portion of the arguments with a well-documented and well-supported version. It's a bit of a hack, but seems very stable!

In speaking with the devs here: Abillity to modify plex client ffmpeg arguments · Issue #11 · UnicornTranscoder/UnicornFFMPEG · GitHub

I've now proposed using the same method to support custom arguments, and by extension custom options, including HW accels.

I have a working build on my system (using the VAAPI and OpenCL equivalents, since I'm on a 5600 XT) and it seems to be performing just fine!
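For the curious, the substituted command on my AMD test rig ends up looking something like this (a sketch; the real arguments come from Plex):

ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi -hwaccel_output_format vaapi \
-i input.mkv -vf scale_vaapi=w=1280:h=720 -c:v h264_vaapi output.mp4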

The project doesn't have permissions set for me to push directly, unfortunately, so I've made a subfolder in my own repository here: Jetson_ffmpeg_trancode_cluster/transcoder.js at master · Camofelix/Jetson_ffmpeg_trancode_cluster · GitHub

This means I’m rapidly reaching the point where I can boot this baby up.

I'm also thinking that, in the meantime while I wait for a deal on a Jetson of some type, why not spin up a Linode instance with a GPU to test how this whole thing works!

Edit: Linode GPU clusters need extra verification, waiting on support to get back to me!

Very quick Monday morning update:

Linode GPU clusters need extra verification, waiting on support to get back to me!

Support got back to me and approved the creation of GPU Linodes! Hoping to find a chance to play with it soonish™!

In the meantime, right after I post this I'll be posting a WTB in the classifieds looking for some used Jetsons.

Hey mate! I noticed that you seem to have some experience with Kubernetes! Mind if I pick your brain a little? I don't have any experience with Kubernetes, so apologies if I misuse some terms :sweat_smile:

To handle the distributed compute side of the project, I was originally looking at the UnicornTranscoder project. Essentially, it's a nodeJS server that relays the command from a dummy Plex transcoder to a transcode node, and then passes the transcoded file back.

I found this Kubernetes project called kube-plex that attempts to do the same thing, but using Kubernetes instead.

If I understand correctly, the pod model of k8s is designed to dynamically allocate jobs to different nodes for the purposes of high availability and to distribute load relatively evenly between nodes.

Each Nano consumes a peak of 10 W and idles at 1.25 W, and the switch I'm planning on using peaks at 5 W and idles at 2 W. With 3 Nanos and the switch, I'd be idling at ~5.75 W and peaking at 35 W, low enough to almost run off of USB power banks (everything is 5 V, so that might actually work).

Since one of my goals is to minimize power consumption, would k8s be able to help with that by turning off all but one node at times of low/no load, then dynamically bringing those nodes back up in times of higher load?

Relevant k8s project: GitHub - munnerz/kube-plex: Scalable Plex Media Server on Kubernetes -- dispatch transcode jobs as pods on your cluster!

Edit: Also, looking further into it, the Nano supports both PXE boot and Wake-on-LAN, so assuming k8s can send both shutdown and WoL to the devices, this might work super well.

Edit 2: Oops! Forgot to tag! @SgtAwesomesauce

Don't want to make it too easy for you, but I've used a Jetson Nano (4 GB) in a cluster to handle video/image processing from other devices (ex: Pi 3/Pi 4) in an AI project. What you need to keep in mind is that the encoder on the Jetson Nano really wants to run off the barrel power connector for very stable power delivery, which the cheap 2 GB Jetson Nano lacks; the 2 GB model is also said to be more sensitive to USB power handling. The Nano's PCIe slot is only supported for a WiFi module, unless things changed on the redesigned 4 GB model, which added two camera connectors. If you're wondering about my Jetson use cases, they're more about automation using camera & LIDAR.

From owning a Jetson Xavier AGX: it can do multiple video encoding streams, and the ability to use an NVMe SSD is a plus.

Hey mate! Thanks for the advice!

In that case, the idea of using a battery bank as a sort of buffer seems very interesting. If I do run low on power with the 2 GB models, adding RPi PoE HATs is another option (and might actually be better than using barrel or USB power altogether). From what I've found online, most of those having issues with board power on the 2 GB model were only seeing issues when connected to a monitor and with multiple peripherals drawing ~0.75 A. Also, thankfully they moved to USB-C for power input on the 2 GB unit, meaning that 3 A+ power supplies are abundant (and a 3 A PSU is included in the box).

The Optane modules themselves peak at ~3 W and idle at ~0.9 W. An average backlit keyboard draws more than that, so it shouldn't constrain the power budget too much. In the case of PoE, there are some reports of the HATs dying when used at 15 W long term, but from cross-checking with the RPi forums, it seems this is due to lack of cooling. Adding a single 92-120 mm fan to the enclosure should deal with that pretty comfortably, I think.

Thankfully the m.2 slot is an E-key, so it can be adapted pin-to-pin to NVMe M-key with a simple adapter! Something like this is what I'm thinking of using. Since the Nano is only PCIe 2.0, the margin of error on signalling isn't too strict. The cable can also route the drive below the SODIMM slot on the left side of the board (looking at it from the port side), with the drive itself attached to the underside with standoffs.

They would still boot off the microSD card (in before feature creep: as another option down the road, network boot off an old RPi I still have hanging around, which could also serve as the Plex Server host, as demonstrated here).

Also, via arrow.com, the only "good" first-party Jetson retailer up in Canada (and the least expensive by far), I can order the version without the WiFi adapter included, saving ~$10 per board, meaning I can get 2 boards for ~$15 more than a single 4 GB. I realize I lose the DisplayPort and go from 4x USB 3.0 to 1x USB 3.0 and 2x USB 2.0, but since these will probably never be plugged in beyond initial config, I'm not overly worried about it.

I've also been keeping an eye on the Jetson roadmap, and it looks as though they may be launching the "Orin" and the "Nano Next" at some point this year, presumably based on either Pascal or Volta (I doubt a Nano would get Turing or Ampere before a higher-end model).

Does that seem to make sense? I'd love any tips and tricks you could share vis-à-vis your experience with these boards :heart:

Screenshot of arrow.com pricing: [attachment: Screenshot from 2021-01-25 15-01-26]

Edit: BTW, those AGX units look freaking amazing; it and the NX look like borderline desktop replacements for people wanting to do full-time aarch64.

Otherwise, have you tried ffmpeg on any of the Jetsons? If so, would you happen to have used any of the CUDA filters?

It seems the CUDA cores are accessed differently on Jetsons compared to the discrete cards, and I don't know if that will need a different type of filtering compared to the existing framework.

From dealing with Jetson-specific stuff since the TX1: you'll need a dedicated unit for development, which is why a 4 GB Nano is a good starting point. If you can tweak the requirements to work on a 2 GB model, it'll help reduce the project cost.

Never used Pi HATs on my Jetson Nano, as it was more of the object-tracking brains (original project was automotive); also, height-wise, the usage concept left the option of a breakout cable. It had two Logitech C930e cameras and pulled in two other cameras from Pi 3s; latency of that kind of setup was fairly low, which was surprising. If I could push 4 cameras via USB on a Jetson Nano with minimal latency, it could be possible to have a queue system for encoding, but thermal-wise I never tried to see how long it would take to throttle. At the time I thought 5 hrs would have been it, yet it stuck at ~50°C with a fan (another one I had without a fan remained between 60-65°C). Noctua is the best option for lowest noise and highest reliability on the fan side (Arctic Cooling is another good option). Hadn't considered any filtering, as object tracking with camera/LIDAR focuses upon "learning/learned"; I was dealing with balancing the allocated frame buffer memory to squeeze the best performance in a low-power environment before the NX launched.

Memory usage of Ubuntu for Jetson is nearly 1.5 GB, depending on whether you install more AI options; as many have said about the 2 GB Jetson Nano, "you'll hit the memory limits faster in some form or another". I'm sensing the Jetson Nano Next is going to be a scaled-down Jetson NX on the thermal side so they don't need to include a fan; however, ASUS's new Tinker Board has a fan and is similarly priced to a Nano. On memory usage: a Pine64 I used as a test platform for how low encoding can go had higher latency than a Pi 4, but I was using a 1 GB model; the Pi 4 was a 4 GB with 256 MB dedicated to the frame buffer.

Gotcha, thanks for the advice! Looks like I'm going to have to do a 4 GB then.

I have a spare Arctic 92 mm on my desk at the moment; I've tried it at 5 V on an RPi in the past and it worked just fine running off of the 5 V GPIO headers.

In the case of not installing anything more than LXDE, using the embedded kernel, and stripping out all of the AI-related things (essentially stripping it down to nothing but CUDA, ffmpeg, and nodeJS [to connect to the master node]), that should bring my usage down a fair bit. Once everything is working, I'd continue to strip it down further: no need for a GUI or anything like that, and then allocate as much RAM as possible to the GPU.
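The first strip-down steps should be straightforward; a sketch on a stock image (exact package names vary by JetPack version, so treat these as placeholders):

sudo systemctl set-default multi-user.target  # boot to console, skip the GUI entirely
sudo apt purge --autoremove ubuntu-desktop gnome-shell  # drop the desktop stack (names vary)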

If you have the time (and access to one of your Jetsons), would you mind running a test for me?

Specifically, this command:
ffmpeg -vsync 0 -c:v hevc_nvmpi -i input.265 -vf "fade,hwupload_cuda,scale_npp=1280:720,hwdownload,format=nv12" -c:v h264_nvmpi output.264
with a test file such as
https://drive.google.com/uc?export=download&id=1omj8vxhzsVAZtsb-bQRWWoy1fNtJGDuY

Using this build off ffmpeg: GitHub - jocover/jetson-ffmpeg: ffmpeg support on jetson nano

(This build enables both hardware encode and decode on the Nano. NVIDIA has not implemented their own version of the hardware encode option.)

In case the above command doesn't work, can you try

ffmpeg -i input.265 -vf "fade,hwupload_cuda,scale_npp=1280:720,hwdownload,format=nv12" -c:v libx264 output.264

The second command should use software decoding and encoding, with only the scaling still on the GPU. If the commands throw any errors, please send them my way and I should be able to correct them.

Edit:

Continuing to look into the power situation: Seeed, one of the other Jetson providers, sells a 4.6 A PoE HAT (23 W) that should work quite well for pretty much any circumstance, so long as I have adequate cooling.

Link to hat: ROCKPI 23W PoE HAT - Radxa Wiki

Kubernetes doesn’t have any node shutdown features to the best of my knowledge. Kubernetes is designed to run in the cloud with always-on nodes. You could manually control this if you’d like.


However, I find kubernetes to have a large amount of overhead for what you’re trying to accomplish. You might find docker swarm to be more to your liking if all you want is a balanced distribution of containers.

Do you need load balancing? If you have at most 6 users and each node can handle at least 2 jobs, you could just have the first one handle them all until you have more than 2 concurrent jobs and then start sending them to the next one.
Shouldn’t you actually try to maximize load on a node without going over, to keep as few nodes active as possible to save power?

I don’t have experience with this kind of stuff, but perhaps you could do it with a small script instead of the raw ffmpeg command? Something like:

  1. Send Wake-on-LAN
  2. Wait until machine gets responsive (perhaps ping will do the job?)
  3. Atomic increase of job count
  4. Do the encode command
  5. Atomic decrease of job count
  6. Go to standby if job count is zero
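Untested sketch of what I mean (wakeonlan, the MAC/IP, and the counter file are all made up, and the encode arguments are elided):

#!/bin/bash
# wake the node, run one encode, track active jobs in a shared counter file
MAC="aa:bb:cc:dd:ee:ff"; IP="192.168.1.50"; COUNT=/var/run/jobcount

bump() {  # atomic +/- on the job counter, serialized with flock
  flock /var/lock/jobcount.lock bash -c \
    "echo \$(( \$(cat $COUNT 2>/dev/null || echo 0) + $1 )) > $COUNT"
}

wakeonlan "$MAC"                                            # 1. send Wake-on-LAN
until ping -c1 -W1 "$IP" >/dev/null 2>&1; do sleep 1; done  # 2. wait until responsive
bump +1                                                     # 3. job count +1
ssh "user@$IP" "ffmpeg <encode args here>"                  # 4. do the encode
bump -1                                                     # 5. job count -1
[ "$(cat "$COUNT")" -eq 0 ] && ssh "user@$IP" "sudo shutdown +2"  # 6. standby if idle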

TLDR;
Would a script running on the master node like:

Docker master watches total usage across all nodes → if "usage of all active nodes" >= "some threshold", send Wake-on-LAN to the next available node in the queue → node boots up, mounts all of the file shares, and executes docker swarm join → if the load would still fit on one fewer node, i.e. (nodes/(nodes-1))*usage <= "some other threshold", send

ssh -t user@worker_node_2 'sudo /home/user/graceful.sh'

to the highest-ID active worker node, where graceful.sh is

#!/bin/bash
# stop all containers, leave the swarm gracefully, then power down in 2 minutes
docker stop $(docker ps -q)
docker swarm leave
shutdown +2

Would that work for bringing nodes up and down as needed? Do you have a better way of handling this? (It's a little jank; I'd love any ideas that don't involve a semi-infinite bash loop.)

##full reply##

Yeah, ideally when only serving one or 2 clients, only one node would be on, and the others would be turned off completely. Then, if a certain threshold of available resources is exceeded, boot up another node in the cluster (or swarm in this case?).

I've found and forked a project that seems designed to be used as a docker swarm, and added in a shim to enable HW acceleration. It can be found here:

There’s also this post which details cutting out anything not needed to optimize container performance

In terms of docker swarm, are you aware of any way to bring nodes up and down based on load? I've begun to go through the wiki, but can't seem to find that functionality natively.

The "optimal" scenario would be to have the master node send a WoL packet to a shut-down node, have that node boot up, initialize and mount everything, start the container, and notify the master that the node can now accept jobs, then shut back down when it's no longer needed (and use network boot for extra style points :sunglasses:).

The booting, mounting, etc. I know how to do, but having the master node orchestrate waking up and shutting down nodes is beyond my very basic knowledge of containerization.

This project seems to enable master nodes to send Wake-on-LAN packets to systems on the cluster to boot them. Then the onboard boot script would handle loading them up to a usable state. I think this is cleaner than the bash-script approach at the top, but this may be adding complication for the sake of complication.

If my thinking is correct, the master node would receive a notification that the worker node is removing itself from the swarm gracefully, and the worker would then shut down after 2 minutes.

This leaves having to get the usage stats from the docker nodes to see how close to the threshold we are. docker stats doesn't seem to work in swarm mode for collecting stats on all nodes, so I'm not sure how to go about that. Any ideas?
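The hackiest workaround I can think of, since each daemon only reports its own containers, is to have the master ssh around and ask every node for its local stats (a sketch, assuming passwordless ssh to the nodes):

for host in $(docker node ls --format '{{.Hostname}}'); do
  echo "== $host =="
  ssh "user@$host" docker stats --no-stream --format '{{.Name}}: {{.CPUPerc}} CPU, {{.MemUsage}}'
done

But that feels gross, so: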

So, questions:
Do you know of a way to cleanly orchestrate shutdown and wakeup natively in Docker?
Do you know of a way to have the master node actively poll the current usage stats of the swarm?

1 Like

I'm not so sure WoL works well on Jetson Nanos; there are a bunch of "bug notes" referencing the Realtek NIC randomly dropping connections under random amounts of load. The same chipset is also known for getting stuck in "green mode 10/100 idle" if your switch is a green switch (on managed switches you can disable the power-saving green mode).

Haven't been able to run that test, as my Jetson Nanos are tied up in a "testing trials" effort. I would note there seems to be some AI-specific software that does enable hardware encode for real-time processing.

As far as trying to allocate as much memory as possible to the GPU: it may make more sense to strike a balance of not too much and not too little so there is headroom. Hardware-wise, I think a cluster of cheaper ARM boards with encoding support could in theory do a similar job with ffmpeg, just like others who've built ARM clusters for Blender.

Yup! Shutting down as many nodes as possible is the goal! Based on SgtAS's reply, I'm thinking docker swarm might be a great fit for this project.

I really like the idea of WoL! Could use it in combination with ssh user@node_ID 'sudo shutdown +2' as part of a larger bash script!

I've laid out how I think this could work in the post above, but essentially: the docker master watches usage across all active nodes, sends Wake-on-LAN to the next available node when usage crosses a threshold, and, when the load would fit on one fewer node, sshes graceful.sh (stop containers, docker swarm leave, shutdown +2) to the highest-ID active worker.