Building a 10-100W distributed ARM-based multimedia transcode cluster: I missed Devember -> Jobless January -> Freezing February -> Magic March

Latest update: the cluster is up and running!!! There are slight issues with argument substitution that I need to clean up; I’ll probably push the changes to the git around Friday!

Outside of that, it seems that FFmpeg has gotten “too smart”. There used to be functionality to blindly decode 10-bit content with the knowledge that the last 2 bits would be truncated.

This was patched out because people would complain about the image issues and blame FFmpeg for them. Looks like I’ll have to reverse this patch in my custom build; presumably this would be done with a set of patch files that make all of the changes I need at build time, things like disabling superfluous options, activating optimization paths and so on.
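For anyone curious, the build-time patching I have in mind is just the usual patch-then-configure flow. A minimal sketch, assuming a local patch directory; the tag, patch names and configure flags here are placeholders rather than the real ones I’ll end up using:

```bash
#!/usr/bin/env bash
set -e

# Grab the FFmpeg source at a known tag
git clone --depth 1 --branch n4.2.2 https://git.ffmpeg.org/ffmpeg.git ffmpeg
cd ffmpeg

# Apply the local patch set (e.g. re-enabling the "blind" 10-bit decode path)
for p in ../patches/*.patch; do
    patch -p1 < "$p"
done

# Configure with only what the cluster needs; these flags are illustrative
./configure --disable-doc --disable-ffplay --enable-gpl
make -j"$(nproc)"
```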

Outside of that, performance is pretty solid so far!! Got 8 1080p HEVC → 720p H.264 streams going per Jetson! This roughly lines up with expected performance, though running under 2GB of RAM will be a problem. May need to go and do the Optane swap option on the M.2 slot.
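For reference, each stream is basically a one-liner like the sketch below (the nvmpi codec names are from the community jetson-ffmpeg wrappers and are an assumption about the hardware path here; exact names depend on the build, and file names/bitrates are placeholders):

```bash
# One 1080p HEVC -> 720p H.264 stream; decode and encode go through the
# Nano's hardware blocks, scaling happens on the CPU via swscale.
ffmpeg -c:v hevc_nvmpi -i input_1080p_hevc.mkv \
       -vf scale=-2:720 \
       -c:v h264_nvmpi -b:v 4M \
       -c:a copy \
       output_720p_h264.mkv
```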

Also, I’ll post a video at some point, but Seeed Studio makes a PoE adapter that supports 23W PoE, is Pi/Jetson compatible, and comes in $12 less expensive than the official Raspberry Pi Foundation unit.

I did have to move the PoE 4-pin connector down a little bit; there are two “standard” positions for the PoE connector on the HAT, and it so happened that this one used the older style. It was a bit gruesome (developer-with-a-soldering-iron meme goes here) but it works great! I have yet to try pushing it beyond the 15W envelope, but I’m hoping to do so at some point. I will eventually push/overclock the GPU and encoder/decoder blocks, since those are by far the largest bottlenecks from my testing.

Link: https://www.seeedstudio.com/PoE-HAT-for-ROCK-Pi-4-p-4143.html

I got mine on Digi-Key, but that shouldn’t matter.

2 Likes

So remember when I said I’d be pushing an update to the git last Friday? Well, yeah, that didn’t happen.

My main issue right now is that I keep banging my head against a wall implementing the CUDA filter. Because for some reason NVIDIA decided that Jetson CUDA couldn’t be the same as normal CUDA, getting things to interact in the same image pipeline has been a bit of a nightmare.

At this point I’m thinking of just nuking what I have and starting from scratch on the filter.

It’s a little bit of brute force, but percussive maintenance never hurt anything right? :upside_down_face: :upside_down_face: :upside_down_face:

Otherwise, for things like 8-bit 1080p HEVC this is working very, very well. Having the Pi work as an orchestration node is also very nice.

To get around the 10-bit problem, I also realized that I may be able to use the Pi’s decoder, since it supports 10-bit properly. I may have to output frames directly and have the Nano then process those, but it is a solution that wouldn’t come with a quality hit.
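Rough idea of what I mean by “outputting frames directly”: decode on the Pi and ship raw frames to a Nano over the LAN. A sketch of the shape of it, with placeholder addresses, geometry and rates, and whatever 10-bit-capable decoder the Pi build actually exposes on the input side:

```bash
# On the Nano: listen for raw frames and encode them. Raw video carries no
# header, so size/rate/pixel format have to be restated, and frames get
# converted down to 8-bit yuv420p before hitting the H.264 encoder.
ffmpeg -f rawvideo -pix_fmt yuv420p10le -s 1920x1080 -r 24 \
       -i "tcp://0.0.0.0:9000?listen" \
       -pix_fmt yuv420p -c:v h264_nvmpi -b:v 6M output_h264.mkv

# On the Pi: decode the 10-bit HEVC source and push raw frames to the Nano.
ffmpeg -i input_1080p_hevc10.mkv \
       -f rawvideo -pix_fmt yuv420p10le \
       tcp://192.168.1.20:9000
```

The catch is bandwidth: raw 10-bit 1080p at 24fps works out to roughly 1.2Gb/s, slightly more than the gigabit link can carry, so in practice I’d probably have to scale or crop on the Pi before shipping frames.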

Power consumption time!

Loading everything up, with each device running its own programs (no orchestration) to max everything out, the most I’ve been able to pull has been 35W from the wall!

Tested this with a Kill A Watt at the wall. Devices used are:
a UPS powering everything,
a PoE+ 5-port switch (1 downstream port, 4 PoE+ ports, 78W total capacity, 30W per port),
2 Jetsons in 10W mode (nodes all powered over PoE),
1 RPi 4, also on PoE.

That’s running 11 1080p HEVC → 1080p H.264 streams!

At idle, the total system draw for the UPS, switch, Pi and 2 Nanos is around 14W.

I think I can get this down lower by optimizing a few things, but I also expect power to go up a little bit when CUDA filtering comes into play.

Other factors that I expect to increase power and/or decrease performance come in when a client also asks for a stream to be scaled. AFAICT I’d be best off doing this in CUDA, but the normal FFmpeg CUDA scaling filters aren’t applicable on the Jetson, so I’d also be writing my own scaling filter.

4 Likes

That is so impressive. Electricity is quite expensive where I live (mostly because of taxes), so by some quick back-of-the-envelope calculation, this would save 400 USD a year if it were idling all year. How long do you think it would take for this to pay itself back?

1 Like

Yeah, it seems superb so far, but functionality is very limited right now.

That’s because I need to use CPU scaling for now, and that CPU scaling doesn’t support multi-threading (a few pull requests for it have been made in the past, but all have been rejected).

Now, the pseudo-upside to this is that a single core can do realtime scaling for a 1080p→720p stream. So without reinventing the wheel, I think I can get around 3 streams of 1080p HEVC 30fps → 720p without risk of crashing.

3 streams pins 3 cores at 100% and leaves the fourth core at ~70%. If I push to 4 streams I end up at 0.9x realtime, which just isn’t a possibility IMO.
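For context, each of those CPU-scaled streams is basically the command below, and pinning each one to its own core with taskset just makes the one-core-per-stream behaviour explicit. A sketch with placeholder file names and bitrates, and nvmpi codec names that depend on the build:

```bash
# Three concurrent 1080p HEVC -> 720p H.264 streams, each pinned to its own
# core so the single-threaded scale (swscale) filter gets a whole core.
for core in 0 1 2; do
    taskset -c "$core" ffmpeg -c:v hevc_nvmpi -i "input_$core.mkv" \
        -filter_threads 1 -vf scale=-2:720 \
        -c:v h264_nvmpi -b:v 4M -an \
        "out_$core.mkv" &
done
wait
```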

That means that I could either overclock the CPU and NVENC/NVDEC blocks, and/or build my scaling filter in CUDA.

From the NVIDIA forums, hitting 2.0GHz all-core is typical for the CPU; as for the hardware blocks, they can go from 716MHz to 894MHz.

It also seems the most power-hungry block on the Nano is the GPU itself, and since that should be the least utilized, I think I have the power envelope to either (a) continue doing the scaling on the CPU and overclock it, or (b) implement scaling on the GPU as the default.

In the case that I do go to GPU scaling, I think I’d be able to shut down 2 CPU cores, as they don’t go above 30% per core when simply shuffling data around.

That should free up another 2-3 watts of power, which I can either take as a victory or reallocate for extra performance elsewhere. I could also lock the CPU clocks to 0.75GHz and take advantage of the efficient part of the voltage/frequency curve, giving me roughly the performance of 2 CPUs at max speed while only drawing the power of ~1.3-1.5, depending on silicon quality (but that’s a topic for another day).
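If anyone wants to poke at the same idea, the stock Linux hotplug/cpufreq knobs are enough. A rough sketch; the target frequency is a placeholder, since the Nano only offers the discrete steps listed in scaling_available_frequencies:

```bash
# Take two of the four cores offline
echo 0 | sudo tee /sys/devices/system/cpu/cpu2/online
echo 0 | sudo tee /sys/devices/system/cpu/cpu3/online

# Cap the remaining cores near 0.75GHz; pick the closest real step from
# /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
TARGET_KHZ=750000   # placeholder value
for c in 0 1; do
    echo "$TARGET_KHZ" | sudo tee "/sys/devices/system/cpu/cpu$c/cpufreq/scaling_max_freq"
done
```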

(broke the power/ROI into a separate reply to make it a little cleaner)

As for power consumption, I haven’t factored in whatever the power draw of the newer system I build will be. The dream would be some sort of used Threadripper v1 system, or some Haswell/Broadwell Xeons, then selling everything I have in my current system outside of the add-in cards (USB, HBA, GPU, etc.).

Assuming that system idles at ~90W all-in, including GPU, drives, CPU, etc., I’d be sitting at ~105W idle draw at all times. But I’m also hoping to have the new (to me) workstation support hibernation, which my current system unfortunately doesn’t.

At that point, I think draw over the course of the day would average out to maybe 50W?

And peak total closer to 250W?

Assuming that I draw peak for 1 hour total per day, total draw per day = 1h × 250W + 23h × 50W ≈ 1.4kWh/day. Power in my area is ~15c CAD per kWh, so the projected daily running cost comes out to ~21 cents per day.

Whereas using those same figures for my current system: 1h × 400W + 23h × 100W = 2.7kWh/day, i.e. a factor of ~2 more.
The old system is then at a cost of ~39c CAD per day.

Assuming I can get into the new system for a reasonably low cost, say ~600 CAD(?).

From current eBay sold prices, I think I can get maybe 300(?) for my current system.

So after that, I’m 650 CAD in and saving ~18c per day on electricity.
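To put a rough number on the payback question from earlier, at those (admittedly hand-wavy) figures it’s a long haul:

```bash
# Back-of-the-envelope payback: ~650 CAD in, ~0.18 CAD/day saved
echo "scale=2; 650 / 0.18 / 365" | bc   # ~9.9 years to break even
```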

For some reason I feel like I’ve made a mistake in my power consumption math :thinking:

1 Like

Soooooo, I’ve been away for a little bit working on a different project (getting an Android translation layer called Anbox running on the Raspberry Pi to do data capture and analytics on the PvP ladder of a mobile game).

Hoping to jump back in next week. To get around some issues, I’m thinking of calling GStreamer API commands from within an FFmpeg filter.

Seems like a much cleaner approach, though I worry about how FFmpeg will like dealing with GStreamer, not to mention the overhead associated with it.

Hoping to come back to this in a month’s time. Life happens :frowning:

1 Like

This topic was automatically closed 273 days after the last reply. New replies are no longer allowed.