TL;DR: I’m a student, so I missed Devember, and with the current world situation I’m no longer in a position to be a student.
All this to say, I have more time on my hands than I expected, so it’s time to update my lab and learn something new! So let’s build a distributed, load-balanced multimedia processing cluster that teaches me CUDA, ffmpeg, aarch64, Node.js/Kubernetes, distributed computing optimization, and how to manage a large project.
(please share any thoughts, feedback, ideas below! I’m trying to challenge myself!)
What’s the problem?
I’m currently running some old Xeons (2× 5670 from the X58 era) as my school computer/workstation, and they aren’t exactly the most efficient devices in the world. I’m also a big movie buff when I have the time, so I run a Plex server for myself and my friends and family. Currently my power usage is much higher than I’m comfortable with (the system idles north of 100 W and maxes out at ~400 W when transcoding or when I’m doing development work). Essentially, it just isn’t efficient enough to justify running long term.
What I’m trying to learn:
Learning Goals
I’m pretty comfortable with C programming (and some x86 assembly) and standard networking (VLANs, wireless, etc.), but I have never done much in the way of GPGPU programming (CUDA or OpenCL), distributed computing, or work on embedded/ARM32/ARM64 platforms.
What I’m trying to achieve:
End Goal
Build a low-power cluster of ARM SoCs/SoMs that can seamlessly handle Plex transcoding of up to 6 streams from HEVC 4K 30p 10-bit HDR to SDR 2K 30p 8-bit H.264, with tone mapping when needed, at as low a total deployment cost as possible.
Ideally the footprint would also fit in something like a BitFenix Prodigy, idle at ~10 W, and peak no higher than 100 W under full load, not counting my workstation.
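To put the idle numbers in perspective, here is a quick back-of-the-envelope sketch. It only uses the figures already quoted (about 100 W idle today, about 10 W idle as the target) and assumes, purely for illustration, that the box sits idle around the clock for a full year.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def annual_kwh(watts: float, hours: float = HOURS_PER_YEAR) -> float:
    """Energy in kWh for a constant draw of `watts` over `hours`."""
    return watts * hours / 1000.0


current_idle_kwh = annual_kwh(100)  # current system: idles north of 100 W
target_idle_kwh = annual_kwh(10)    # target cluster: idle of about 10 W
savings_kwh = current_idle_kwh - target_idle_kwh

print(f"current idle: {current_idle_kwh:.1f} kWh/year")
print(f"target idle:  {target_idle_kwh:.1f} kWh/year")
print(f"difference:   {savings_kwh:.1f} kWh/year")
```

Multiply the difference by a local electricity rate to get a rough yearly cost; real usage will be higher on both sides, since neither system idles all the time.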
Components:
Hardware and software needed
Hardware:
- Host system/workstation (I’m not much of a gamer, so I’m looking for more cores and decent amounts of RAM for VMs and so on). Thinking a pair of Ivy Bridge Xeons (can reuse some old ECC DDR3)
- Networking switch (PoE to minimize cables, if it’s stable enough using inexpensive RPi HATs?)
- ARM/compute nodes: multiple Nvidia Jetsons (ideally Xavier NX; AGX is the dream), but probably 2GB Nanos or, depending on launch dates on the Nvidia roadmap, Jetson Nano Next or Jetson Orin.
- They have hardware encoders and decoders, and enough GPU grunt for any filtering I need to do. 2GB is a little tight, but with how quickly I can shuffle data around and clear RAM, it should be OK (I’m concerned about the cost of memcpy, but that’s TBD). Old 16GB Optane sticks are pretty cheap and could help if I can allocate Optane as swap, kind of like L2ARC in ZFS. That costs $10 more, which is still much cheaper than the 4GB variant
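To get a feel for how tight 2GB is, here is a rough sketch of the size of one decoded frame. The pixel formats are assumptions, not something measured on a Jetson: 10-bit 4:2:0 stored as 16-bit samples (P010-style) for the 4K source, 8-bit 4:2:0 (NV12-style) for the output, and "2K" taken to mean 1920x1080. The 1 GiB frame budget is also just an illustrative number.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def yuv420_frame_bytes(width: int, height: int, bytes_per_sample: int) -> int:
    """Size of one 4:2:0 frame: a full-resolution luma plane plus two
    quarter-resolution chroma planes, i.e. 1.5 samples per pixel."""
    return width * height * 3 // 2 * bytes_per_sample


# Assumed formats (see lead-in): 16-bit containers for 10-bit video, 8-bit output.
src_frame = yuv420_frame_bytes(3840, 2160, 2)  # 24,883,200 bytes per 4K frame
dst_frame = yuv420_frame_bytes(1920, 1080, 1)  # assumed 1080p output frame

# Hypothetical budget: how many 4K source frames fit in 1 GiB of RAM?
frames_in_budget = GIB // src_frame

print(f"4K 10-bit frame:   {src_frame / MIB:.2f} MiB")
print(f"1080p 8-bit frame: {dst_frame / MIB:.2f} MiB")
print(f"4K frames per GiB: {frames_in_budget}")
```

On a Jetson the CPU and GPU share the same physical memory, so decoder surfaces, filter buffers, and encoder input all come out of that one pool.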
Software:
- Host
- a. OS for server: Debian-based
- b. UnicornTranscoder or KuberPlex
- c. NFS share (the data itself is accessed over the network on my NAS, ZFS and so on)
- d. Custom capture script to modify arguments sent from PMS to the transcoder
- e. Plex Media Server
- f. Load balancer
- Jetson (aka transcoder node)
- a. JetPack (4.3?)
- b. client side of UT or KP from 1.b
- c. custom ffmpeg build
- i. Custom CUDA filter I wrote for tone mapping (should be upstreamed to the newest branch once I’m done); it currently supports Reinhard, clip, and Hable, but will change to BT.2390 eventually
- ii. Jcover90 ffmpeg patch to enable the use of the transcode blocks
- iii. Nvidia build of ffmpeg to enable decoding and vf_scale_cuda
- d. client side of load balancer from 1.f
- e. (things I forgot will go here)
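For anyone unfamiliar with the three tone mapping operators named in the custom ffmpeg build above, here is a minimal CPU-side sketch of the textbook curves. This is not the CUDA filter itself and not necessarily what ffmpeg computes internally. It assumes the input is linear light scaled so that 1.0 is SDR reference white, and the `peak` value of 10.0 is a hypothetical HDR peak chosen only for the demo. The Hable constants are the commonly published "Uncharted 2" filmic values.

```python
def clip(x: float) -> float:
    """Hard clip: anything brighter than SDR white is simply cut off."""
    return min(max(x, 0.0), 1.0)


def reinhard(x: float) -> float:
    """Basic Reinhard curve: compresses [0, inf) smoothly into [0, 1)."""
    return x / (1.0 + x)


def _hable_curve(x: float) -> float:
    # Commonly published filmic constants (shoulder, linear, toe terms).
    a, b, c, d, e, f = 0.15, 0.50, 0.10, 0.20, 0.02, 0.30
    return (x * (a * x + c * b) + d * e) / (x * (a * x + b) + d * f) - e / f


def hable(x: float, peak: float) -> float:
    """Hable curve, normalized so the signal peak maps to 1.0."""
    return _hable_curve(x) / _hable_curve(peak)


if __name__ == "__main__":
    peak = 10.0  # hypothetical: HDR peak at 10x SDR reference white
    for value in (0.0, 0.5, 1.0, 4.0, peak):
        print(f"{value:5.1f} -> clip {clip(value):.3f}  "
              f"reinhard {reinhard(value):.3f}  hable {hable(value, peak):.3f}")
```

In a GPU filter the same per-pixel math would run once per sample in a kernel; the curves themselves are cheap, so most of the cost tends to be in moving frames around rather than in the arithmetic.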
The Plan:
Summary
First thing is to finish up the CUDA tonemap filter and push that to the next build of ffmpeg! That’s nearly done and, pending approval from the powers that be, will be in the next release of ffmpeg (learn the basics of CUDA: check!)
Second is to actually buy the hardware (ideally looking for deals on used devices; do whatever I can to minimize cost)
- sell old system when I can, recoup whatever is possible
Third is to build a capture script that changes ffmpeg arguments to their hardware-accelerated counterparts (nvdec, scale_cuda, cuda_tonemap, nvenc) and others (it might make sense to keep the stereo audio transcode on the CPU?)
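A minimal sketch of what such a capture script could look like, assuming it receives the transcoder's argument list and hands back a rewritten one. The mapping tables are illustrative: `h264_nvenc`, `scale_cuda`, and the `-hwaccel cuda` input options exist in CUDA-enabled ffmpeg builds, `cuda_tonemap` is just the name used in this post for the custom filter, and the arguments Plex actually sends will look different from the example. The right encoder and filter names on a Jetson depend on how ffmpeg was built.

```python
from typing import List

# Illustrative mappings only; adjust to whatever the ffmpeg build exposes.
ENCODER_MAP = {"libx264": "h264_nvenc"}
FILTER_MAP = {"scale": "scale_cuda", "tonemap": "cuda_tonemap"}

VIDEO_CODEC_FLAGS = ("-c:v", "-codec:v", "-vcodec")
VIDEO_FILTER_FLAGS = ("-vf", "-filter:v")


def rewrite_filter_chain(chain: str) -> str:
    """Swap filter names in a simple comma-separated chain.
    Naive on purpose: does not handle escaped commas or filter graphs."""
    rewritten = []
    for item in chain.split(","):
        name, sep, options = item.partition("=")
        rewritten.append(FILTER_MAP.get(name, name) + sep + options)
    return ",".join(rewritten)


def rewrite_args(args: List[str]) -> List[str]:
    """Return a copy of `args` with video decode, filter, and encode moved
    to hardware. Audio arguments pass through untouched (stay on the CPU)."""
    # Input options must come before -i, so put them at the very front.
    out = ["-hwaccel", "cuda", "-hwaccel_output_format", "cuda"]
    i = 0
    while i < len(args):
        arg = args[i]
        has_value = i + 1 < len(args)
        if arg in VIDEO_CODEC_FLAGS and has_value:
            out += [arg, ENCODER_MAP.get(args[i + 1], args[i + 1])]
            i += 2
        elif arg in VIDEO_FILTER_FLAGS and has_value:
            out += [arg, rewrite_filter_chain(args[i + 1])]
            i += 2
        else:
            out.append(arg)
            i += 1
    return out


original = ["-i", "in.mkv", "-c:v", "libx264", "-vf", "scale=1920:1080",
            "-c:a", "aac", "out.mp4"]
print(" ".join(rewrite_args(original)))
```

Keeping the rewrite as a pure function over the argument list makes it easy to test against captured command lines before pointing it at a live server.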
Fourth is relatively straightforward: deploy the system. I’ll need to decide whether to use KuberPlex or UnicornFFmpeg
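Whichever one ends up dispatching sessions, the load-balancing piece from the software list boils down to picking a node for each new transcode. Here is a small sketch of a least-loaded policy. The `Node` class, the node names, and the per-node session limits are all hypothetical, and this is not how either KuberPlex or UnicornTranscoder is known to schedule work.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    max_sessions: int        # hypothetical per-node transcode limit
    active_sessions: int = 0

    @property
    def load(self) -> float:
        """Fraction of this node's capacity currently in use."""
        return self.active_sessions / self.max_sessions


def pick_node(nodes: List[Node]) -> Optional[Node]:
    """Least-loaded policy: choose the node with the lowest load that
    still has a free slot, or None if every node is full."""
    candidates = [n for n in nodes if n.active_sessions < n.max_sessions]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.load)


cluster = [
    Node("jetson-a", max_sessions=2, active_sessions=1),
    Node("jetson-b", max_sessions=2, active_sessions=0),
    Node("jetson-c", max_sessions=2, active_sessions=2),
]

chosen = pick_node(cluster)
if chosen is not None:
    chosen.active_sessions += 1  # reserve the slot for the new session
    print(f"new session goes to {chosen.name}")
```

Using load as a fraction of capacity rather than a raw session count means mixed clusters (say, a Xavier NX next to a few Nanos) would still be balanced sensibly.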
Fifth is to make it pretty! (or actually set up power monitoring to benchmark performance and share lessons learned) and document how I built this thing!
I’ve been documenting this as I go on my GitHub in this repository: (under the MIT license, so feel free to play with it as much as you want!)