Looking Glass over Cat6 Cable?

Hello,

Send the raw data (framebuffer?) from Looking Glass through a CAT6 cable. Impossible?

Blame Linus for making me want to try the "one server to rule them all" project. That's what sparked the idea of using Looking Glass for it.

1080p60 should fit on 10 Gbit; for higher resolutions, 20 Gbit should do. Now, would the latency be reasonable through Ethernet? It would be amazing if you could achieve low latency through a standard Ethernet switch; I would imagine a port-to-port configuration would have low enough latency for it to be unnoticeable.

Current solutions for this niche use case are proprietary ASICs and protocols, which cost anywhere from a couple of hundred bucks (something like HDBaseT) to literally thousands of dollars for something like a fiber extender.

Thoughts?

If the bandwidth is there, sure, it would be possible; for 1080p @ 60FPS you would need 474.6MB/s for lossless. However, even if your network latency is 0ms (instant), packet splitting, queuing, buffering, etc. will all introduce latency at each end of the link.
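
For reference, that figure falls out of the raw framebuffer size; a minimal sketch, assuming 32-bit colour (4 bytes per pixel):

```c
#include <stdio.h>

int main(void)
{
    /* Raw 1080p60 bandwidth, assuming 4 bytes per pixel (32-bit colour). */
    const double bytes_per_frame = 1920.0 * 1080.0 * 4.0;  /* ~8.3 MB/frame  */
    const double bytes_per_sec   = bytes_per_frame * 60.0;  /* 497,664,000 B/s */
    printf("%.1f MiB/s\n", bytes_per_sec / (1024.0 * 1024.0)); /* 474.6 MiB/s */
    return 0;
}
```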

3 Likes

Thanks for the quick response.

What about a direct PC-to-PC connection? Would that allow for a low-jitter, low-latency signal?

How much processing power would the client end need? Would a low-power system be sufficient?

I know little about Ethernet, but this could be my excuse to start learning about it and give this a shot.

I was looking for the right hardware for this project for so long; it turns out software might be the answer after all. Anyway, I've got to say I love what you guys are doing. Honestly, this sort of stuff is really cool!

Best Regards

This is what I am talking about.

In theory… it would need to be evaluated.

OK, let me explain a little of the basics. Let's say you want to send a buffer that is 1MB in size. This is larger than the maximum payload supported by all current consumer and even most enterprise-level equipment, so the TCP stack splits it up into packets of data limited by the MTU (maximum transmission unit) of your hardware.

Generally the default MTU is 1500 bytes unless you're using jumbo frames, in which case it's usually 9000 bytes. The MTU may be further limited by your PC or any equipment between you and the target PC (routers/switches/gateways, etc.).
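
To put numbers on that split, a quick sketch (per-packet IP/TCP header overhead ignored to keep the arithmetic simple):

```c
#include <stdio.h>

int main(void)
{
    /* How many packets a 1 MB buffer becomes at a given MTU. */
    const unsigned buffer    = 1024 * 1024; /* 1 MB */
    const unsigned mtu_std   = 1500;
    const unsigned mtu_jumbo = 9000;
    printf("standard frames: %u packets\n", (buffer + mtu_std   - 1) / mtu_std);   /* 700 */
    printf("jumbo frames:    %u packets\n", (buffer + mtu_jumbo - 1) / mtu_jumbo); /* 117 */
    return 0;
}
```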

Each packet has IP and TCP headers containing, among other things, the source address, destination address, TTL, CRC, flags, and sequence number. Each of these fields needs to be populated by the TCP stack in the kernel; some things, like the CRC, can be offloaded to the Ethernet hardware if it supports it (most do today).
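
For illustration, a flattened, simplified view of those fields (not the real kernel structures; see struct iphdr and struct tcphdr in the Linux headers for those):

```c
#include <stdint.h>

/* Illustrative only: the per-packet header fields described above. */
struct packet_headers
{
    /* IP header (simplified) */
    uint32_t src_addr;  /* source address */
    uint32_t dst_addr;  /* destination address */
    uint8_t  ttl;       /* time to live */
    uint16_t ip_check;  /* IP header checksum */
    /* TCP header (simplified) */
    uint32_t seq;       /* sequence number, restores original ordering */
    uint16_t flags;     /* SYN/ACK/FIN, etc. */
    uint16_t tcp_check; /* checksum, often offloaded to the NIC */
};
```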

At the end of the day, your 1MB buffer, assuming jumbo frames, had to be split up into 117 packets. Since your kernel is doing lots of things at once, it may re-order the packets and send them out in a jumbled-up mess (worst case). As such, the remote end needs to take each packet and inspect the sequence number to determine the original ordering. It also needs to check the CRC of each packet, and if using TCP, an acknowledgement packet needs to be sent back to the sender to verify receipt. If the CRC doesn't verify and can't be corrected, a resend message is sent back and the remote host re-transmits the packet.

Finally, the kernel needs to reassemble the packet payloads into a contiguous buffer to hand to the userspace application. The userspace application then needs to check whether it has the entire buffer yet (the kernel can't know this; it's protocol specific), and buffer each payload until it has all arrived.
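
As a rough sketch of that userspace side, assuming a simple length-prefixed framing protocol (the prefix and the short-read loop are the protocol-specific parts the kernel can't do for you):

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>

/* recv() may return fewer bytes than asked for, so loop until the
 * whole buffer has arrived (or the connection fails). */
static int recv_all(int fd, void *buf, size_t len)
{
    uint8_t *p = buf;
    while (len > 0)
    {
        ssize_t n = recv(fd, p, len, 0);
        if (n <= 0)
            return -1; /* error or peer closed the connection */
        p   += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Receive one length-prefixed frame; caller must free() the result.
 * For simplicity the length is assumed to be in host byte order. */
static void *recv_frame(int fd, uint32_t *out_len)
{
    uint32_t len;
    if (recv_all(fd, &len, sizeof(len)) < 0)
        return NULL;
    void *buf = malloc(len);
    if (!buf || recv_all(fd, buf, len) < 0)
    {
        free(buf);
        return NULL;
    }
    *out_len = len;
    return buf;
}
```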

As you can imagine, this is a very complex stack to push large amounts of data through, and we have not even covered the fact that all of this goes through a firewall layer and routing layers, which add additional computational overhead.

This is why LG currently has zero support for this: while all of this is handled by the kernel, the overhead it adds goes against the project's goal of being as close to zero latency as possible.

Note: A technology like RDMA would make this feasible… but that is usually not within reach of even enthusiast users with massive budgets for high-end equipment (including me… hint hint, someone send me some gear so I can learn it and add LG support :P)

6 Likes

I believe that Mellanox ConnectX-3 cards are capable of RDMA.

In the US, cards from eBay (whether they are pulled server equipment or Chinese copies is often unclear) can often be found for $40-70, even for a dual-port 40Gb QSFP+. Finisar transceivers (which are probably compatible, but I've not looked into it enough) are going for ~$45. OM4 cable is really cheap off of Amazon. What do the prices look like on Australian eBay?

2 Likes

I had a look; it seems I can get some 20Gbit cards for around $30 AUD each… I will add this to my list of project purchases when I finish with the AMD issues.

2 Likes

Maybe it should be noted that sending the entire frame as one chunk over TCP might not be the cleverest solution. If I were to implement it, I would encode the frame in chunks, then send them over UDP. In most networks, especially with direct connections (without a switch), you shouldn't have packet loss, and the packet processing should be easier. If you miss a packet, a small region of your screen might be behind by up to 1/60 of a second. Even if you want/need frame synchronization, doing that in userspace might be more efficient for this task.
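
A minimal sketch of that chunked approach, assuming a hypothetical per-chunk header carrying a frame number and byte offset (a lost chunk then just leaves that region one frame stale):

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#define CHUNK_PAYLOAD 1400 /* leaves headroom inside a 1500-byte MTU */

/* Hypothetical per-chunk header: which frame, and where in it. */
struct chunk_hdr
{
    uint32_t frame;  /* frame sequence number */
    uint32_t offset; /* byte offset of this chunk within the frame */
    uint32_t len;    /* payload length */
};

/* Split one frame into UDP datagrams; assumes `sock` is a connected
 * UDP socket. The receiver copies each payload to its offset. */
static void send_frame(int sock, uint32_t frame,
                       const uint8_t *buf, uint32_t size)
{
    uint8_t pkt[sizeof(struct chunk_hdr) + CHUNK_PAYLOAD];
    for (uint32_t off = 0; off < size; off += CHUNK_PAYLOAD)
    {
        struct chunk_hdr hdr = {
            .frame  = frame,
            .offset = off,
            .len    = size - off < CHUNK_PAYLOAD ? size - off
                                                 : CHUNK_PAYLOAD
        };
        memcpy(pkt, &hdr, sizeof(hdr));
        memcpy(pkt + sizeof(hdr), buf + off, hdr.len);
        send(sock, pkt, sizeof(hdr) + hdr.len, 0);
    }
}
```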

But even doing this over TCP should be very much possible, as long as you can grab frames etc. somewhat fast. I was just at the 36C3 hacker conference, and I witnessed a Pixelflut server (Pixelflut is an ASCII protocol over TCP) that could handle ~6000 connections and a bandwidth of ~25GBit/s on consumer-level hardware (an i7-6700K with some Radeon GPU for offloading). I can easily generate a multi-GBit/s stream from a scripting language on the i5-2520M in my notebook. (I'm not sure what the server software was, but it might be this: https://github.com/timvisee/pixelpwnr-server)

1 Like

KVM over IP?

I've not used it personally, but Parsec might be close to what you're looking for.

@Log Wow, I had never heard of RDMA before; it sounds like the perfect solution, brilliant. And those Mellanox cards are cheap enough. Also, would the numbers listed here (I can't post links; google: "WP_RoCE_vs_iWARP.pdf") mean that it is possible to get single-digit microsecond latency? Best if someone with the know-how reads this to confirm.

@max1220 Still, a very neat solution is to use RDMA as mentioned by Log; that way you offload all the processing to the NIC itself, which would allow for some really low latency and low CPU usage.

@2FA Parsec is in the same category as a Steam Link or Steam In-Home Streaming. Maybe even the best in its category. But that means 7 ms of latency (I took that from their site) and compressed video capped at 60 FPS. That's OK for many uses, especially streaming casual games to a TV, which is precisely why I like the Steam Link.

@DangoPC Exactly.

Right. This is why fax over IP is not so much of a thing without other devices to translate.

RDMA is just fancy network card firmware that implements a simple but, in practice, gnarly-to-use protocol with read/write verbs, which the driver can expose to the app as shared memory.

I started doing some back-of-the-envelope math and thinking about protocols until I realized that moving, e.g., a 30MB buffer along a 10Gbps network takes about 24ms of wire time alone.
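
A back-of-the-envelope version of that calculation (decimal megabytes assumed):

```c
#include <stdio.h>

int main(void)
{
    /* Serialization delay alone: a 30 MB buffer on a 10 Gbit/s link. */
    const double bits = 30e6 * 8.0;           /* 30 MB as bits */
    const double ms   = bits / 10e9 * 1000.0;
    printf("%.0f ms\n", ms);                  /* 24 ms */
    return 0;
}
```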

None of the networking synchronization or memory copying happening on either side would add as much latency.

But that's probably OK; there's plenty of latency everywhere already, so percentage-wise it's probably not a big deal?

Either TCP or UDP could work fine:

  • Reordering of packets is not an issue unless your stream is split to travel across multiple network paths (everyone in the network industry avoids this on purpose).
  • Socket buffer sizes are tunable; at worst you'd skip some packets and drop old frame data (skipping needs some thought to implement; you could start with a timeout of about 5 frames' worth of time). See the buffer-tuning sketch after this list.
  • TCP has checksumming; it's usually done in hardware, with the network driver coordinating it.
  • The worst thing about TCP and latency is the in-kernel buffer: its contents are usually eventually delivered, even if the receiver no longer cares, so you can't really give up on receiving data if you're running late… you could send a TCP reset in a home setting, I guess.
  • The alternative would be UDP, which can work fine as long as you have big enough buffers in the kernel. (Let's not do forward error correction; there's no need in a home setting until Wi-Fi speeds catch up.)
  • Checksumming in general is cheap; apparently even SHA-1 can be done mostly with special CPU instructions, and other, simpler algorithms can probably be hardware-vectorized even more easily.
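
As a sketch of the buffer tuning mentioned in the list above (on Linux the kernel may clamp the request to net.core.rmem_max, so it's worth reading the value back):

```c
#include <stdio.h>
#include <sys/socket.h>

/* Enlarge the kernel receive buffer so bursts of frame data aren't
 * dropped, then read back what the kernel actually granted. */
static void tune_rcvbuf(int sock, int bytes)
{
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
        perror("setsockopt(SO_RCVBUF)");

    int actual = 0;
    socklen_t len = sizeof(actual);
    if (getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &actual, &len) == 0)
        printf("requested %d bytes, got %d\n", bytes, actual);
}
```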

With TCP/UDP you always end up with that last memory copy from the NIC over PCIe to your userland system RAM, in this case another copy from system RAM to the GPU, and probably another copy to unwrap your data from your own network protocol embedded in TCP/UDP. But it's so much easier to work and build with TCP/IP than RDMA. And as long as these packets/chunks fit inside your caches, the copying can be really fast, since there's not much RAM involved; it all happens at L2/L3 cache speeds.

Reference: we have some RDMA over UDP at work that some folks ended up building after RoCE, as implemented in NICs, ended up melting down our clusters at high utilization. It's basically read/write over UDP: each UDP packet has an extra header with a stream ID (identifying a buffer) and a relative buffer offset, and the whole payload is encrypted. For this use you could make the packet contain a condition-variable ID as well, or a semaphore operation. You could negotiate a connection in a side channel over TCP, or over gRPC, or over SSH, whatever. It's perfectly conceivable that such RDMA protocols just get implemented in userland with OK performance, and later down the road this can be lifted into the kernel or the network driver itself.
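
For illustration, the kind of per-datagram header that implies (names and field widths are hypothetical, not the actual protocol from work):

```c
#include <stdint.h>

/* Hypothetical RDMA-over-UDP header: every datagram says which buffer
 * it targets and where in that buffer its payload belongs. */
struct rdma_udp_hdr
{
    uint32_t stream_id; /* identifies the target buffer */
    uint64_t offset;    /* relative offset within that buffer */
    uint8_t  op;        /* READ, WRITE, or a semaphore operation */
    /* encrypted payload follows */
};
```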

1 Like