Building a 10-100W distributed ARM-based multimedia transcode cluster : I missed Devember -> Jobless January -> Freezing February -> Magic March

Dang, that could be an issue. Absolute worst-case scenario, I could run a pair of wires from the GPIO header on the master node to each of the worker nodes to turn them on and off. It means more wires, but it should be doable in a clean way I think.

Yeah, the main reason for doing the Nanos is CUDA filters. For example, going from 4K HEVC to 1080p H264 on a Pi, the decoder block and encoder block can be used, but the actual scaling filter will be done on the CPU. The same applies to tonemapping HDR content.

So I’m planning on capturing the ffmpeg arguments sent by the master node to the workers and substituting the CPU filters with their CUDA counterparts. In the case of scaling, there’s already a quite good filter in ffmpeg. As for tonemapping, I’ve developed the filter myself and am waiting to upstream it into the next ffmpeg release, assuming it works correctly.
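Rough sketch of what that substitution shim could look like (everything here is hypothetical: the `ffmpeg.real` path, the assumption that the master invokes a plain `ffmpeg` binary on the worker, and the CUDA filter names, which depend on what actually ends up working on the Jetson):

```bash
#!/usr/bin/env bash
# Hypothetical shim installed ahead of the real ffmpeg in PATH on each worker.
# It rewrites CPU filter names to CUDA counterparts, then execs the real binary.
args=()
for a in "$@"; do
  a=${a//scale=/scale_cuda=}        # software scaler -> CUDA scaler
  a=${a//tonemap=/tonemap_cuda=}    # software tonemap -> custom CUDA filter (hypothetical name)
  args+=("$a")
done
exec /usr/local/bin/ffmpeg.real "${args[@]}"
```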

How exactly do you plan on booting the node up? Not sure if they support WOL.

And you don’t need to docker swarm join every time. If the node is offline, the manager will simply not assign containers to it.
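For reference, the join is a one-time thing per node (the IP is a placeholder, and the worker token comes from the output of the init command):

```bash
# on the manager, once:
docker swarm init --advertise-addr 192.168.1.10

# on each worker, once, using the token printed by the init command:
docker swarm join --token <worker-token> 192.168.1.10:2377

# after that the worker rejoins by itself on boot; check from the manager with:
docker node ls
```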


Aside from that, it would probably work fine.

WOL used to be broken on the early revisions of the board (A01 and A02), but thankfully all the recent versions (B0* since Jan 2020) fully support WOL. In the case where that doesn’t work / is somewhat spotty (based on @PaintChips it has had issues in the past), I might run wires from the GPIO on the master node to the power button of all the worker nodes.
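Something like this is what I have in mind for waking a node, with the GPIO pulse as the fallback (a sketch: the MAC address, GPIO chip and line numbers are placeholders, and it assumes the wakeonlan package and libgpiod tools are installed on the master):

```bash
# try Wake-on-LAN first (needs the worker's MAC address)
wakeonlan 00:04:4b:aa:bb:cc

# fallback: pulse a GPIO line on the master that's wired to the worker's
# power-button header; hold it for half a second, then release
gpioset --mode=time --usec=500000 gpiochip0 17=1
```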

Gotcha! So docker swarm join is more of a way to adopt yourself into a swarm. Once adopted the first time, the node will “transparently” rejoin the swarm on boot and be made available for jobs?

Also, any ideas on how to monitor total usage on the cluster / determine when I should boot up another node?

Prometheus/node exporter would do the trick.

Or simple ssh and load monitoring.
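A minimal version of the ssh approach would be something like this (hostnames are made up):

```bash
# print each worker's 1-minute load average
for node in jetson-01 jetson-02; do
  printf '%s load: ' "$node"
  ssh "$node" "cut -d' ' -f1 /proc/loadavg"
done
```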


Gotcha! I think I’ll start with the ssh method!

Easiest thing might be to pipe jtop (same as top, but designed to show all parts of the SoC) into a text file and update it every few seconds.

That would allow me to log CPU load, memory level, and GPU load at the same time to decide if I need to spin up another node or not.
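Since jtop itself is an interactive view, one way to do the logging side is with tegrastats, which ships with the L4T image and prints RAM, per-core CPU load/frequency, and GR3D (GPU) load on one line per sample (a sketch; the log path is arbitrary):

```bash
# append one line of SoC stats every 5 seconds; a script on the master
# can read the tail of this file to decide whether to wake another node
tegrastats --interval 5000 >> ~/nano-stats.log
```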

Issue that may have cropped up:

Because of the Jetson’s weird implementation of CUDA devices, the way I’ve implemented the filter will work on every CUDA device except the Jetsons :upside_down_face: :upside_down_face: :upside_down_face: dies a little on the inside

As such, I might have to do some slightly dodgy bodges to get it hooked in. I think one way of handling it is to not use FFMPEG’s standard CUDA frame implementation and instead tell ffmpeg that my filter is a standard filter that doesn’t need any sort of HW acceleration. From there I’d call a CUDA kernel the same way you would normally link an outside function, or alternatively implement it as a wrapper function.

Something like this should be relatively clean

My Jetson is supposed to arrive in the next day or so, hopefully I can get going on that soonish.

Starting to come together!

Jetsons were supposed to arrive in the mail Thursday, but because of the thing they’ve been delayed. I went and stripped the standoffs from a Core 2 Duo system and have marked out where I’ll be placing them. I have an old RPi I’ll be using to orchestrate everything for now, and because it’s always on, it will power the fan. The PoE switch is installed underneath.

Eventually I will attach PoE HATs to all of the devices, but for now it will be a set of power adapters.

For now, while I wait (can’t do much more software work until I have the actual hardware), I’ll finish building the enclosure.

If I go from a 92mm fan down to an 80mm, I’d be able to fit this entire platform into 2U comfortably.

Also, from my current estimates, I should be able to fit an additional 2 Jetsons (but I would need a few more ports on the switch).

If I go 3U, I could install some on the top panel and mount them upside down; I’d have space for ~8 devices, assuming I can’t put any in the bottom-most U because of a larger switch.

I’d also be concerned about a single uplink port being enough, but that’s a problem for another day


Latest update! 2 pieces of good news, 1 piece of (potentially awful) bad news

Good news:
GN1: My 2 Jetsons arrived (Yay!) and I’ve begun testing on them.

GN2: When using the transcoding blocks, frames are preemptively moved into and worked on in GPU memory! So my concern about having to copy things back and forth to use CUDA looks like a non-factor!

Bad News
Unfortunately I’ve come up against a (potentially huge) roadblock.

BN1:
From RTFM’ing, it seems that Nvidia doesn’t exactly document how their own decode/encode blocks work.

Specifically, the issue is that the documentation provides contradictory information. Per the Developer Guide here, the Nano is listed as only supporting 8-bit formats of HEVC, VP9, and H264.

However, per the Nvidia gstreamer documentation, HW-accelerated decode and encode are supported all the way to 12-bit. It even mentions special flags to increase performance on low-memory devices such as the Jetson Nano series.

My main questions that I need to explore (or if you know anything please share!!) are:

1. Which of these documents should I consider to be correct?
2. In the case where the first document is correct, how is gstreamer supporting higher bit-depth content than the native decode/encode blocks?
3. In the case where the second document is correct, how would I access this capability outside of gstreamer?
4. In the case where the second document is correct, what are the performance penalties for using higher bit-depth content?
5. In the case where both documents are correct, what is the expected return? A file that has the “extra” bit depth truncated?

Ideally the 5th option is correct and handled internally by an API call. This would mean that I can assume that, regardless of source content, I will always receive yuv420 content.

Otherwise, I’ve managed to get 1.3× real-time performance when transcoding 75Mbps 8-bit HEVC to pretty much any form of HEVC/H264!

And since everything is still on gpu, it should mean that the cuda overhead for tonemapping filters shouldn’t be a problem!

I’ve begun performance testing on the nano before I start messing around too much!

Using

ffmpeg -hide_banner -threads 4 -c:v vp9_nvv4l2dec -i Do\ You\ Love\ Me-fn3KWM1kuAw.mkv -c:a copy -c:v hevc_nvmpi -preset ultrafast -b:v 2m v4l2_output.mkv

(hardware decode VP9, hardware encode to HEVC @ 2Mbps with the ultrafast preset) yields 1.3× realtime performance.

The file in question is the Boston Dynamics dancing robots video: Rec.709 @ 16Mbps, 3840p 30fps, 8-bit.

The Jetson is running headless off a dedicated PSU and has a 92mm fan blowing across it.

It seems like the bottleneck is the speed/frequency of the hardware transcode blocks. They can apparently be overclocked, but that will be a last resort. Looks like you can get as high as a 30% increase in throughput before problems appear. (discussion)

Now to re-implement my tone mapping filter!

Outside of that, I might be ready to deploy this project for “standard” file types!

Really interesting performance characteristics:
For a given bitrate (2 or 10 Mbps), outside of the veryslow H264 preset, everything performs at 1.3× speed ± 0.05%. That tells me I can pretty much tell the encoder to always use the slow preset and get a little more out of the source file!

Unfortunately the native scaling filter is single-threaded, and a little ARM CPU just isn’t going to be able to handle it. In this case it means I’ll also have to implement a CUDA scaling filter, but because of the weirdness of the Jetson series, I’ll have to implement it from scratch.

On a positive note, encoding the same source file locally or over a network share (ZFS over NFS) gives performance within margin of error! (± 0.05%)

If changing one part which should affect the total time does not seem to make much of a difference, this could be an indication that your bottleneck is somewhere else. But especially with asynchronous systems, how the runtime of specific tasks affects the total runtime is not intuitive. If you are changing stuff up, such as adding your tone mapping filter or doing two transcodes in parallel, the bottleneck might change and the encode setting could then suddenly end up mattering, so don’t take it for granted that it doesn’t seem to matter in your current tests.

How does the memory situation look?

Yup! Tracking down the true bottleneck is still in the works, but so far, for a single 4K HEVC -> 4K H264 transcode, I’ve ruled out disk speed and CPU.

I think it’s the encode/decode blocks (testing in the summary below)

I hope I can fit this all in 2GB of memory, which would allow the next nodes in the cluster to be the less expensive models.

Implementing the filters (scale and tonemapping) will cost me on RAM, but when scaled down the encoder will also need less RAM in the first place.

A bunch of extra detail from testing is below, but it essentially shows how I narrowed things down to the encoding blocks:

Testing details

Using jtop and changing from 4 CPUs @ 1.5GHz to 2 CPUs @ 0.9GHz, I get around ~0.75× real time, down from ~1.3× RT. Turning on jetson_clocks (essentially disabling dynamic clocking and staying at max frequency at all times) in the low-power mode moves me to 0.85× RT.
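For anyone following along, the knobs being toggled there are the stock L4T tools (mode numbers are the standard Nano power profiles):

```bash
sudo nvpmodel -q          # show the current power mode
sudo nvpmodel -m 1        # 5W profile: 2 CPU cores at reduced clocks
sudo nvpmodel -m 0        # 10W "MAXN" profile: all 4 cores at full clocks
sudo jetson_clocks        # pin clocks at the current profile's maximum
```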

I think it’s the encode/decode blocks peaking at 716MHz, since RAM is running at ~1.8GB usage between CPU and GPU, and the shared bus runs at 25 GB/s. Locking down a certain amount of RAM to only go to the GPU (the encoders use the GPU-allocated memory) might take some pressure off a bunch of reallocation calls, but that’s a problem for another day.

Other optimizations will be removing fluff from the OS. I’m already running everything headless, but I think there’s a good 5% in crap I can remove. Might also unlock higher clocks down the line depending on how the filter performs. I don’t think the CPU will be my limitation for a long while, so I could put the CPU limit down to ~1.3GHz all-core and use the extra power budget on the GPU and the ASICs.

In my experience, I doubt a single switch uplink will be a bottleneck unless there is a large amount of network traffic. There are reasonably priced switches with dual gigabit uplinks, and the Realtek 8111 series has an overhead that caps its max speed at around ~900-930 Mbps on the Jetson Nano. (PCs with the 8111 chipset top out at 940-950 Mbps based on online benchmarks comparing against Intel NICs.)

The performance hit with higher bit depth is likely going to mean more CPU/GPU load, and memory usage is going to increase unless you work with a crop/scaling solution.
If you’ve followed any Jetson AI automation specific stuff, gstreamer’s higher bit-depth support is geared towards imaging such as factory/industrial QA cameras with high-pixel-count/large image sensors… typically in those conditions they use software-based cropping to focus on a smaller grid. For science/AI automation, a crop is usually handy to trim off the lens curvature/distortion edges when software correction has its limits. From my own usage of a 3 or 4 camera setup with a Jetson Nano, the memory headroom is fairly cramped for real-time, but much of the load is done with CUDA, and the CPU idles a fair amount.


Update on unsupported format:

It seems that prior versions of CUDA supported decoding on the SMs / CUDA cores instead of using the dedicated hardware blocks.

Documentation about this capability can be found here: https://docs.nvidia.com/cuda/archive/8.0/pdf/CUDA_Video_Decoder.pdf

Unfortunately I can’t seem to find any documentation in more recent versions of CUDA.

It would probably perform worse than the dedicated hardware blocks, but it would allow for a graceful fallback instead of a complete failure.

Agreed! So far moving multiple large files with multiple devices hasn’t been an issue, so I’m ruling that out unless I grow this out to 6+ nodes; then it could become an issue. At that point I’d probably add something like a Mikrotik CSS610-8G-2S+IN (2 SFP+ and 8 gigabit ports, 11W peak power, for 99USD); that’s a problem for another day.

Yeah, I’m working on the assumption that if you’re needing to transcode this sort of content, you’re likely going to be watching it in 1080p or lower.

So the pipeline would be: 4k 10 bit File Arrives → file is decoded from 10/12 bit to 8 bit → file is scaled down to 1080p → file is tone mapped if needed (detect this via BT.2020 colour space) → file is sent to encoder → file is sent as 1080p 8 bit h264 to host device
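As a CPU-only reference for that chain (a sketch using ffmpeg’s stock filters, which is roughly what the CUDA versions would replace; the filenames are placeholders, the zscale/tonemap parameters are just the commonly documented ones, and it assumes an ffmpeg build with libzimg):

```bash
# scale first, then linearize, tonemap with the hable curve, and convert
# down to 8-bit BT.709 before encoding
ffmpeg -i input_4k_hdr10.mkv -vf "scale=1920:-2,zscale=transfer=linear:npl=100,format=gbrpf32le,zscale=primaries=bt709,tonemap=hable:desat=0,zscale=transfer=bt709:matrix=bt709:range=tv,format=yuv420p" -c:v libx264 -preset slow -b:v 8M -c:a copy output_1080p_sdr.mkv
```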

In the case where scaling is required, that would mean the second half of the filter doesn’t require nearly as much memory footprint. :confetti_ball: I think the way to go would be to always scale first, since that lowers the memory requirements for everything following it in the chain.

I would have to check the bit depth, then either use the NVDEC block normally for 8-bit files, or if 10-bit, hand off to gstreamer somehow to have it deal with decoding and truncating down, then hand it back to ffmpeg, which forwards it to the “standard” pipeline.
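The bit-depth check itself is cheap with ffprobe (a sketch; the filename is a placeholder and the gstreamer branch is just an echo standing in for whatever that fallback ends up being):

```bash
# read the pixel format of the first video stream
pix_fmt=$(ffprobe -v error -select_streams v:0 -show_entries stream=pix_fmt -of csv=p=0 input.mkv)

case "$pix_fmt" in
  *10le|*10be|*12le|*12be)
    echo "$pix_fmt: >8-bit source, route through the gstreamer fallback" ;;
  *)
    echo "$pix_fmt: 8-bit source, use the NVDEC path directly" ;;
esac
```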

Does that seem right to you?

Good to know my testing was on the right track! Mind sharing the resolutions of those cameras? Should be roughly equivalent performance-wise to what I’m attempting! :thinking: :gigathink:

Apologies for the double reply, but if I understand correctly, using gstreamer you have gotten 10/12-bit hardware decoding support to work on the Nano?

Never pushed anything higher on the bit depth; at the time it was going to be a mix of using the camera connector and up to three USB cameras. From the start a main goal was to keep memory usage minimal and see how much the Jetson Nano could handle. Never messed around with gstreamer settings, as I had read comments on the Jetson Developer forums about processor load going up based on resolution/devices connected; in theory, if you stay under 4K by scaling to 720p per stream, it could be possible to avoid too much processor usage.

Originally the test base was two Logitech C920s, however the C930e made more sense for wide angle. Even with scaling to trim down lens curvature issues, it’s kept at 1080p, and these older Logitech cameras have onboard hardware acceleration. Automotive usage in a collision-distance warning system is more about real-time tracking and overall awareness.

I haven’t worked with HDR or BT.2020 content, but if I understand right you are using the tonemap filter to convert it to something like SDR Rec.709? I believe HDR and BT.2020 both require a higher bit depth to sufficiently represent the same content, so doing the tone mapping after reducing bit depth would lose detail and show up as increased banding.
I didn’t look closely at how your tone mapping filter works, but it might be cheaper to do the tone mapping first and convert to 8-bit before downscaling, rather than doing the downscale with 12-bit content.

Ahhh alright! This capability has me a little bit concerned.

I’ve opened a thread on the Nvidia dev forums asking about what happened to the CUDA-based decoder (instead of using the hardware blocks), since it seems that the functionality was deprecated with the release of CUDA 9.

Thread in question is here: Nvidia Developer forums feature request thread


AFAIK all HDR content is in BT.2020. The way it is represented, however, is actually through a linear representation of light, and then the user’s device uses the included metadata to convert the linear representation to a more conventional gamma curve.

Essentially, the HDR metadata cannot be reused on the newly transformed footage. This image shows a rough approximation of what the underlying pixel values are on the left, and what they look like once tone mapped:

It’s pretty much the same idea as “grading” log footage from a digital cinema camera.

Additionally, the tone mapping is a little bit of a salt-to-taste situation. Up until very recently there hadn’t even been a standard announced or adopted. Recently BT.2390 was announced, but its implementation still leaves some work to be desired (not to mention it’s already on its 8th iteration).

I’m not too worried about banding for this sort of application. Testing both truncate-first-tonemap-second and vice versa for a given bitrate going from 4K 10-bit to 1080p 8-bit, I really couldn’t tell the difference. This might change for certain animated content where smooth gradients are much more common, but outside of a single short demo that Netflix did with Production I.G, there isn’t a single piece of HDR anime that exists AFAIK. Testing with that one piece, Sol Levante, there were a few issues, but nothing worse than would be expected from lossy image processing.

I’m starting with a pretty simple tonemapping algorithm called the “Hable curve”. It’s as close to a standard as there is right now and produces pretty good results.
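For anyone curious, the Hable (Uncharted 2 “filmic”) operator is just a rational curve applied per channel and then normalized against the white point; the constants below are the commonly quoted defaults, so treat the exact values as tunable:

$$
h(x) = \frac{x(Ax + CB) + DE}{x(Ax + B) + DF} - \frac{E}{F}, \qquad \mathrm{out}(x) = \frac{h(x)}{h(W)}
$$

with $A = 0.15$, $B = 0.50$, $C = 0.10$, $D = 0.20$, $E = 0.02$, $F = 0.30$ and white point $W = 11.2$.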

Either way you’re going to end up with some banding if you push the bitrate too low, so for equivalent results within margin of error, truncate-first will probably be the way to go because of the lower memory footprint.
