What’s the issue with HEVC anyway?

I mean, HEVC is widely used in post-production. FFV1 is only widely used in archival, and NOBODY uses FFV1 to capture, even in open-source cinema cameras (Apertus, for example).

https://www.apertus.org/


At this point I think it’s better to give the FPGA direct access to a piece of storage like a PCIe x16 Intel Optane drive. The data would then be encoded out, with the Optane acting as a buffer.

FPGA + ARM ASIC for encoding is the route I’d go down, but to get more storage, you kinda have to go with Optane. It gives more buffer headroom in case the ASIC has to slow down for complex scenes.


Excuse my ignorance here; I am only slowly grasping what you two are discussing. Lots more reading to do.

The big PCIe FPGAs (U2x0 line) from Xilinx can handle 64 GB of onboard RAM (at 70 GB/s bandwidth). Would that be a sufficient buffer, or is that cutting it too close?
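As a rough back-of-the-envelope for how long 64 GB lasts (the resolution, bit depth, and frame rate below are my assumptions, not figures from this thread):

```python
# How many seconds of uncompressed video fit in a 64 GB FPGA buffer?
# Assumed capture format: 3840x2160, 10-bit 4:2:2, 60 fps.
width, height, fps = 3840, 2160, 60
bits_per_pixel = 10 * 2                       # 10-bit samples; 4:2:2 averages 2 per pixel
frame_bytes = width * height * bits_per_pixel / 8
rate = frame_bytes * fps                      # uncompressed bytes per second
buffer_bytes = 64 * 1024**3

print(f"frame size : {frame_bytes / 1e6:.1f} MB")     # ~20.7 MB
print(f"data rate  : {rate / 1e9:.2f} GB/s")          # ~1.24 GB/s
print(f"64 GB holds: {buffer_bytes / rate:.0f} s")    # ~55 s of uncompressed video
```

So at that kind of format the 70 GB/s bandwidth is nowhere near the bottleneck, and the capacity gives nearly a minute of lookahead.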


Edit:
This is what you intend to make happen, right?

Amazon EC2 F1 instance (which offers Xilinx UltraScale+ VU9P FPGA)

That’s the second FPGA that reads from a large cache like an Intel Optane.

We’re seeing if we can compress to JPEG XS first, using a GPUDirect or DirectDMA connection that captures NVFBC or similar off the desktop. That uncompressed feed goes to a JPEG XS FPGA, which compresses it and caches it to an Intel Optane. Then another FPGA reads from the Optane and does the final compression to AV1, where it can tweak its speed to always keep pace with the cache, using a very big lookahead buffer to optimize bitrate with more available video to sample from.
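A toy sketch of that staging in Python, just to show how the cache absorbs encoder slowdowns; the rates and sizes are made-up stand-ins, not real device behavior:

```python
import queue, random, threading, time

FPS, TOTAL = 60, 600
cache = queue.Queue(maxsize=3600)   # stands in for the Optane cache (~60 s of frames)

def capture_stage():
    """JPEG XS side: pushes frames into the cache at a fixed real-time rate."""
    for n in range(TOTAL):
        cache.put(n)
        time.sleep(1 / FPS)

threading.Thread(target=capture_stage, daemon=True).start()

encoded = 0
while encoded < TOTAL:
    frame = cache.get()                          # AV1 side drains the cache...
    time.sleep(random.uniform(0.5, 2.0) / FPS)   # ...at a scene-dependent speed
    encoded += 1
    if encoded % FPS == 0:
        print(f"encoded {encoded}, cache depth {cache.qsize()} frames")
```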

It can allow multiple passes over the cache if it’s big enough and the encoder is fast enough. JPEG XS can fit within the bandwidth of an Optane, but it’s the out-of-sequence reads that are the important part for multiple passes.

The end result would be a low enough bitrate that a spinning HDD would be fast enough to write to if it’s AV1. If it were the professional profiles, a typical NVMe drive like the SK Hynix Gold P31 would be fast enough.
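Putting rough numbers on that (the compression ratios and device speeds below are ballpark assumptions on my part):

```python
# Data rate at each stage vs. typical sequential write speeds (all ballpark).
raw = 3840 * 2160 * 20 / 8 * 60   # 4K 10-bit 4:2:2 @ 60 fps, bytes/s (~1.24 GB/s)
xs  = raw / 6                     # JPEG XS at an assumed ~6:1 ratio (~207 MB/s)
av1 = 25e6 / 8                    # AV1 delivery at an assumed 25 Mb/s (~3 MB/s)

drives = {"HDD ~150 MB/s": 150e6,
          "NVMe (P31) ~3.2 GB/s": 3.2e9,
          "Optane ~2.5 GB/s": 2.5e9}

for stage, rate in [("raw", raw), ("JPEG XS", xs), ("AV1", av1)]:
    ok = [d for d, bw in drives.items() if rate <= bw]
    print(f"{stage:8s} {rate / 1e6:7.1f} MB/s -> fits: {', '.join(ok)}")
```

Which lines up with the claim: the AV1 output trickles onto a spinning disc, while a JPEG XS intermediate needs NVMe-class writes.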

This is less about real-time streaming than about recording with multiple passes using a cache.


You don’t need to make it sound like it’s uber-elite talk, although it kinda is :stuck_out_tongue: It depends on which codec you use and on what kind of hardware. I don’t know every detail of every codec, and there is nothing new about using a large framebuffer for encoding. Although it is kind of new: the newish part would be doing it on a RAM disc or memory disc, thus having the framebuffer in a zone on the disc, if not even using the entire disc as a buffer as well. Which is probably something similar to what jpeg-turbo does.


It’s what my friend David’s high-speed camera does: Bayer data is stored in RAM, then played out at 60 fps to the ARM H.264 encoder.

Last I heard, he hadn’t yet figured out faster-than-realtime playback from RAM to the encoder.

That’s the point of jpeg-turbo. The loss of frames is way too high, but even then it can provide a steady 30 fps. There are probably codecs like it, although there should be many more, because the technical aspects of using memory or a hard drive allow for so much more than simply pushing frames through CPU cores, which is far too linear a way to do such a task. It’s actually kind of borderline linear, to be a bit more honest.

If anyone looks at the spec sheets of what RAM and hard-disc tech can provide, and then looks at CPU cores afterwards: using CPU cores for encoding would be the worst of the available choices if the codec being used is x264 or similar…

Yes, that is similar to codecs such as jpeg-turbo: encoding in RAM first, then using the CPU on the stored/compressed frames. Something like an ARM CPU would be an amazing choice for that. Although, again, the issue with storing in RAM is the number of frames that are lost when compressing with the jpeg-turbo codec.

So jpeg-turbo, yes, is basically taking JPEGs in fast succession: like super-fast screenshots that are compressed before being processed by a CPU-based codec such as x264 (for example; others are also available).
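A minimal sketch of that idea in Python, buffering fast JPEG “screenshots” in RAM before a CPU codec would touch them (Pillow is frequently built against libjpeg-turbo, though that’s a build detail, and grab_frame here is a placeholder, not a real capture API):

```python
import io
import numpy as np
from PIL import Image  # Pillow; many builds link libjpeg-turbo for fast JPEG

def grab_frame() -> np.ndarray:
    """Placeholder frame source (stands in for a real screen-capture API)."""
    return np.random.randint(0, 255, (1080, 1920, 3), dtype=np.uint8)

ram_buffer = []                        # compressed frames held in RAM
for _ in range(30):                    # e.g. one second at 30 fps
    frame = grab_frame()
    buf = io.BytesIO()
    Image.fromarray(frame).save(buf, format="JPEG", quality=85)
    ram_buffer.append(buf.getvalue())  # small JPEG instead of a raw frame

# Later, a CPU-based codec (x264 etc.) would decode and re-encode these.
total = sum(len(b) for b in ram_buffer)
print(f"{len(ram_buffer)} frames buffered, {total / 1e6:.1f} MB of JPEG data")
```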

Not sure where the loss of frames occurs; perhaps when compressing them? It happens on another level of code, where a CPU isn’t used to encode the frames, which alone could be the reason for the massive loss of frames. Very likely, as using the CPU is a very basic part of codec usage in general.

It has to be admitted, though, that codecs that do not use the CPU to compress frames work on another level of code. It’s pretty amazing stuff, and a much smarter way to do what a CPU-only codec would normally do.

Although it sounds more likely that the losses occur when the frames are being processed after compression. Perhaps even both, if not before processing (at the encoding stage). Not 100% sure about that; it would just seem much more likely that the losses come from the encoding part of the compression, as there isn’t a CPU to save the frames that are being compressed; instead, the CPU uses already-compressed frames that it can then process.

So you’re trying to find which framebuffer is dropping frames. This is the same issue DXGI and NVFBC face when passing through FFmpeg or OBS: they pretty much all have to be in Vsync to avoid dropping frames.

In fact, internal triple-buffered Vsync, then capturing externally with perfect Vsync, is the best way to capture: you let triple buffering happen, then capture the buffered output to a compressed format. It’s why dual-system capture is so popular.

When grabbing an internal framebuffer, there is a likelihood it’s from before any buffering has occurred. NVFBC is likely the closest to capturing a Vsynced single buffer, but even it has a sync offset if your monitor and capture framerates don’t precisely match.
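A toy model of the triple-buffer handoff in Python; real implementations live in the driver, and this only illustrates why neither side ever stalls the other:

```python
import threading

class TripleBuffer:
    """Renderer writes into a spare buffer; the reader always gets the newest
    completed frame, so neither side ever blocks the other."""
    def __init__(self):
        self.buffers = [None, None, None]
        self.write_idx, self.ready_idx, self.read_idx = 0, 1, 2
        self.lock = threading.Lock()

    def publish(self, frame):
        self.buffers[self.write_idx] = frame
        with self.lock:  # swap: the just-written buffer becomes the 'ready' one
            self.write_idx, self.ready_idx = self.ready_idx, self.write_idx

    def latest(self):
        with self.lock:  # swap: take the ready buffer for reading
            self.read_idx, self.ready_idx = self.ready_idx, self.read_idx
        return self.buffers[self.read_idx]

tb = TripleBuffer()
tb.publish("frame 1")
tb.publish("frame 2")   # overwrites the spare, never blocks on the reader
print(tb.latest())      # -> "frame 2", the newest completed frame
```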


I think there is confusion in how you are wildly mixing terminology between codecs (encoders/decoders built in software, FPGA, or ASIC hardware) and the compression standards that those codecs implement.

H.265 == HEVC == MPEG-H Part 2

The compression standard that is being used for 4K+ video, as well as for lower resolution video since it is an improvement on the previous standard compression format, H.264/AVC/MPEG-4 Part 10.

x265

An open-source software encoder implementing the H.265 compression standard/format (x265 only encodes; decoding is handled by other projects). As far as I can tell, x265 runs only on the CPU, but in theory, one might be able to use OpenCL/CUDA/Vulkan to offload some software calculations to the GPU.

NVENC (Nvidia Encoder)

How Nvidia refers to the portion of its GPU that contains one or more encoding ASICs (the decoding ASICs are separately grouped as NVDEC); from Nvidia’s comparison table, it looks like 5th gen of NVENC was the first to include an H.265/HEVC encoder, though it required 4:2:0 chroma subsampling.
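To make the codec-vs-standard distinction concrete, here is a small sketch driving two different H.265 codecs through FFmpeg (this assumes an FFmpeg build with libx265 and NVENC support, and the file names are placeholders):

```python
import subprocess

# Same compression standard (H.265/HEVC), two different codecs:
# x265 runs on the CPU; NVENC is the ASIC block on an Nvidia GPU.
src = "input.mov"  # placeholder file name

# Software encode with the x265 library (CPU).
subprocess.run(["ffmpeg", "-i", src, "-c:v", "libx265",
                "-crf", "22", "-preset", "slow", "x265_out.mp4"], check=True)

# Hardware encode with the NVENC ASIC (requires an NVENC-capable GPU/driver).
subprocess.run(["ffmpeg", "-i", src, "-c:v", "hevc_nvenc",
                "-preset", "p5", "nvenc_out.mp4"], check=True)
```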


Additionally, when you talk about codecs using memory in a special way, it almost sounds as though you are referring to Processing In Memory (PIM), where the RAM DIMMs themselves do some pre-processing (alternatively called C-RAM for Computational RAM); however, PIM is not currently available in any meaningful way, certainly not for end users.

Maybe you are merely referring to using more memory to cache frames in some way before the compression is run on them, but that does not seem remarkable at all; I think on most CPUs in normal operation, one or two 4K frames will not fit entirely in L3, so you will be falling back to RAM whether you want to or not.
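A quick size check (the 32 MB L3 figure is just a typical desktop value I am assuming):

```python
# One uncompressed 4K frame vs. a typical L3 cache.
width, height = 3840, 2160
frame_rgb8 = width * height * 3   # 8-bit RGB, bytes (~24.9 MB)
l3 = 32 * 1024**2                 # assumed 32 MB L3

print(f"4K 8-bit RGB frame: {frame_rgb8 / 1e6:.1f} MB")
print(f"L3 cache          : {l3 / 1e6:.1f} MB")
print(f"whole frames in L3: {l3 // frame_rgb8}")
```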


While I am mentioning somewhat confusing things, @FurryJackman seems to be talking about some kind of double-compression approach where the input device does a round of JPEG-XS encoding before handing the stream off to the HEVC codec.

Lossless input → JPEG-XS —[PCIe]→ HEVC

I guess the intent is to reduce PCIe bandwidth requirements, but I do not see how this would be especially beneficial, since the JPEG-XS data would first need to be decompressed back to raw frames, then re-compressed to HEVC. Is the additional latency and quality loss from this approach really outweighed by the reduced PCIe bandwidth? I would suspect that this requires more memory as well, since you need a JPEG-XS decoder in front of the HEVC encoder, no?
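For scale, here is what the link saving looks like under assumed numbers (the capture format, compression ratio, and usable link speeds are all my guesses):

```python
# Raw vs. JPEG XS stream rate against usable PCIe bandwidth (all assumed).
raw = 3840 * 2160 * 20 / 8 * 60   # 4K 10-bit 4:2:2 @ 60 fps, bytes/s
xs  = raw / 6                     # JPEG XS at an assumed ~6:1 ratio

for link, bw in {"PCIe 3.0 x4": 3.9e9, "PCIe 3.0 x16": 15.8e9}.items():
    print(f"{link}: raw {raw / bw:.1%} of link, JPEG XS {xs / bw:.1%}")
```

At 4K60 even the raw feed fits an x16 link with room to spare, so the saving would seem to matter mainly on narrow links or at higher resolutions and frame rates.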


Well, using a larger memory pool would possibly allow for lessening the loss of frames in a cheap way, as the number of frames already lost from compression is very, very high compared to a CPU doing the compression.

Personally, I don’t get the part where it’s not remarkable that it’s even possible to compress frames without using a CPU for the compression, let alone getting ±30 fps from it.

Even with a massive loss of frames during, after, or before the compression, the amount of memory it needs versus sheer CPU power should be enough to impress anyone. The results are worlds apart, especially at super-high resolutions, where jpeg-turbo’s in-memory compression (even with massive losses) outperforms every CPU there is doing the same task.

It all happens in a secondary GPU, which takes care of the decode and can take its time with a massive read-ahead buffer to do rate control. JPEG XS barely requires any resources since it’s a lightweight compression, and most of the resources would be dedicated to the multiple-pass encode to a lossy format.

It doesn’t have to be HEVC; it could be any number of other codecs. CUDA could use its acceleration to both decode and encode at the same time using CUDA cores. (Not NVENC; that’s an ASIC portion of the GPU.)
