UDP vs TCP in safety-critical embedded applications

jaiyam · January 18, 2022, 5:42am

I want to understand the features of UDP and TCP for safety critical use cases.
I want to use Ethernet in a real-time safety critical application, where both latency and data integrity are important. Although I can get low latency with UDP, I am uneasy knowing that data integrity is not guaranteed. If a UDP packet is lost, I have no problem, as I can design around that, but I fear a scenario where a corrupt packet is received and there is no way to tell whether it is what the sender intended.

Is it possible for UDP data to get corrupted in transit?
Is the same possible with TCP and would checksums/other checks help identify the problem?
If the network cable runs through an area with high EM noise, is it a bad idea to use UDP?

risk · January 18, 2022, 8:30am

If you don’t know, use TCP.
If worried about integrity (malicious actors, or the 16-bit TCP/UDP checksums and 32-bit Ethernet CRC you get out of the box are not enough), use TLS.

Both TCP and UDP already have checksums, and Ethernet underneath adds another one (a CRC-32 on every frame). The apps don’t receive corrupted data unless the corruption is such that the checksums still match.
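To make the "checksum still matches" case concrete, here is a minimal sketch of the 16-bit one's-complement checksum used by TCP and UDP, next to `zlib.crc32`, which uses the same CRC-32 polynomial as the Ethernet frame check. The sample bytes are made up for illustration, and the pseudo-header that real TCP/UDP checksums also cover is left out.

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


original = b"\x12\x34\xab\xcd"
swapped = b"\xab\xcd\x12\x34"  # same 16-bit words, different order

# The sum does not depend on word order, so this corruption goes unnoticed...
print(internet_checksum(original) == internet_checksum(swapped))
# ...while the CRC-32 of the two buffers differs.
print(zlib.crc32(original) == zlib.crc32(swapped))
```

The point is only that the transport checksum is a weak check on its own; on a wired link the Ethernet CRC catches most of what it would miss.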

TCP guarantees ordering and handles retransmission for you, and you can tune its parameters in the kernel.

You’d just pick either TCP or UDP depending on the API you need.

High-EM environments usually don’t matter, thanks to the physics of twisted pairs and differential signaling. There are also “shielded” Ethernet cables, commonly used when many Ethernet cables run in a bundle strapped next to each other: basically a piece of grounded aluminum foil wraps the pairs, so you end up with a Faraday cage.
You can also use fiber.

What are your latency requirements? In general, you make up for UDP packet retransmission delay by using an FEC (forward error correction) scheme or NETBLT … this is all complicated.
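As a rough idea of what the simplest FEC scheme looks like, the sketch below sends one XOR parity packet per group of equal-length packets, so the receiver can rebuild a single lost packet without waiting for a retransmit. The function names and the group layout are assumptions for illustration; real schemes (Reed-Solomon, for example) tolerate more losses.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets: List[bytes]) -> bytes:
    """Parity packet: XOR of every packet in the group (all the same length)."""
    return reduce(xor_bytes, packets)


def recover(received: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Rebuild at most one missing packet (marked None) from the parity packet."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) > 1:
        raise ValueError("XOR parity can repair only one lost packet per group")
    result = list(received)
    if missing:
        present = [p for p in received if p is not None]
        # XOR of the parity with all surviving packets leaves the lost one.
        result[missing[0]] = reduce(xor_bytes, present, parity)
    return result


group = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(group)
print(recover([b"AAAA", None, b"CCCC"], parity))
```

The cost is one extra packet per group plus the delay of waiting for the rest of the group, which is the kind of trade-off that makes this complicated.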

Unless you’re talking about doctors running surgery from their crappy home ISP and robots cutting around people’s hearts halfway around the world at a different crappy ISP, you probably don’t care.

TCP latency is the same as UDP latency when on a single cable, or when you own the infrastructure, or when the infrastructure is simple.

thro · January 18, 2022, 8:56am

If you need assured delivery (or rather, to know whether the packet was delivered or not), use TCP. Otherwise you need to invent your own state-tracking mechanism.

UDP is more responsive/efficient as it does not track state. This is useful when a retransmitted packet would be out of date by the time it arrives. The classic example is VoIP: a dropped packet is better than a late retransmit coming out the other end as out-of-order audio. So UDP is used there.
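A minimal sketch of that "late data is worthless" policy on the receive side, assuming each datagram starts with a hypothetical 4-byte sequence number (the header layout is an assumption, and wraparound of the counter is ignored here):

```python
import struct
from typing import Optional

HEADER = struct.Struct("!I")  # hypothetical header: 4-byte big-endian sequence number


class LatestOnlyReceiver:
    """Accept a datagram only if it is newer than everything seen so far."""

    def __init__(self) -> None:
        self.last_seq = -1

    def accept(self, datagram: bytes) -> Optional[bytes]:
        if len(datagram) < HEADER.size:
            return None  # too short to carry the header
        (seq,) = HEADER.unpack_from(datagram)
        if seq <= self.last_seq:
            return None  # duplicate or late arrival: drop it instead of replaying it
        self.last_seq = seq
        return datagram[HEADER.size:]


rx = LatestOnlyReceiver()
for seq in (1, 3, 2):  # packet 2 arrives after packet 3
    print(seq, rx.accept(HEADER.pack(seq) + b"sample"))
```

A gap in the sequence numbers also tells the receiver that something was lost, which is often all a control loop needs to know.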

If you need to go through firewalls I’d suggest using TCP, as any NAT firewall can track its state. With UDP there is no session state in the protocol for the NAT to track (it has to guess, typically with short timeouts), which means NAT may break two-way communication for your app.

xzpfzxds · January 18, 2022, 11:19am

Use UDP when the TCP stack doesn’t give you the required control over TCP features. Assume any packet may arrive corrupted, with the wrong length or scrambled contents: you’ll want a strong HMAC to deal with this.
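One way this could look, as a sketch only: each datagram carries a sequence number, an explicit payload length and an HMAC-SHA256 tag over both header and payload. The key, the header layout and the function names `seal`/`unseal` are assumptions for illustration; key distribution and replay protection are out of scope here.

```python
import hashlib
import hmac
import struct
from typing import Optional, Tuple

KEY = b"example-shared-key"    # hypothetical pre-shared key
HEADER = struct.Struct("!IH")  # hypothetical header: sequence number, payload length
TAG_LEN = hashlib.sha256().digest_size


def seal(seq: int, payload: bytes) -> bytes:
    """Build header + payload + HMAC tag computed over header and payload."""
    body = HEADER.pack(seq, len(payload)) + payload
    return body + hmac.new(KEY, body, hashlib.sha256).digest()


def unseal(datagram: bytes) -> Optional[Tuple[int, bytes]]:
    """Return (seq, payload), or None if the datagram fails any check."""
    if len(datagram) < HEADER.size + TAG_LEN:
        return None  # truncated
    body, tag = datagram[:-TAG_LEN], datagram[-TAG_LEN:]
    expected = hmac.new(KEY, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None  # corrupted or forged
    seq, length = HEADER.unpack_from(body)
    payload = body[HEADER.size:]
    if length != len(payload):
        return None  # declared length does not match what arrived
    return seq, payload


good = seal(7, b"valve=open")
bad = bytes([good[0] ^ 0x01]) + good[1:]  # flip a single bit in transit
print(unseal(good), unseal(bad))
```

If the threat is only noise and not an attacker, a CRC-32 over the same bytes would be cheaper, but the HMAC covers both cases.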

If every TCP packet were randomly corrupted, about one in 2^16 (65,536) would still pass the checksum check. If you send enough data over bad hardware, relying on the TCP checksum alone could let multiple errors through per second.
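The arithmetic behind that estimate, as a back-of-the-envelope sketch. It assumes corruption leaves the checksum uniformly random and that nothing below (such as the Ethernet CRC) has already dropped the frame; the packet rate is a made-up figure.

```python
def undetected_per_second(corrupted_per_second: float, checksum_bits: int = 16) -> float:
    """Expected corrupted packets per second that still pass a checksum.

    Model assumption: a corrupted packet matches an n-bit checksum
    with probability 1 / 2**n.
    """
    return corrupted_per_second / 2 ** checksum_bits


rate = 100_000  # hypothetical: corrupted packets arriving per second
print(undetected_per_second(rate, 16))  # 16-bit TCP/UDP checksum alone
print(undetected_per_second(rate, 32))  # with a 32-bit check instead
```

Under this model it takes more than 65,536 corrupted packets per second before a 16-bit checksum lets through one error per second on average.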

UDP has the advantage (and disadvantage - complexity) that you have complete control over retransmission, congestion avoidance, ordering, multiplexing. For example, if you know that your link drops packets often but you can’t tolerate latency from waiting for a receive acknowledgement, you can proactively send multiple messages at the same time and discard duplicates at the receive end.
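A sketch of the "send several copies, discard duplicates" idea, assuming a hypothetical 4-byte sequence number in front of each payload. `send` is any callable that transmits one datagram (for a real socket, something like `lambda d: sock.sendto(d, addr)`); the window size is an arbitrary choice to bound memory.

```python
import struct
from collections import deque
from typing import Callable, Optional

HEADER = struct.Struct("!I")  # hypothetical header: 4-byte big-endian sequence number


def send_redundant(send: Callable[[bytes], object], seq: int,
                   payload: bytes, copies: int = 3) -> None:
    """Transmit the same datagram several times without waiting for an ACK."""
    datagram = HEADER.pack(seq) + payload
    for _ in range(copies):
        send(datagram)


class Deduplicator:
    """Deliver each sequence number once, remembering the last `window` of them."""

    def __init__(self, window: int = 1024) -> None:
        self.window = window
        self.seen = set()
        self.order = deque()

    def accept(self, datagram: bytes) -> Optional[bytes]:
        if len(datagram) < HEADER.size:
            return None
        (seq,) = HEADER.unpack_from(datagram)
        if seq in self.seen:
            return None  # a copy we have already delivered
        self.seen.add(seq)
        self.order.append(seq)
        if len(self.order) > self.window:
            self.seen.discard(self.order.popleft())  # forget the oldest entry
        return datagram[HEADER.size:]


wire = []  # stands in for the network in this example
send_redundant(wire.append, 5, b"stop")
dedup = Deduplicator()
print([dedup.accept(d) for d in wire])
```

This trades bandwidth for latency: a message gets through as long as any one copy survives, with no round trip spent on acknowledgements.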

I’d probably look at the QUIC design, and the motivation for building it on top of UDP, for some ideas.

Mastic_Warrior · January 21, 2022, 3:11pm

Can you better define these requirements? Are we talking about real-time systems where, if something fails, someone will die or equipment will break?

Are we talking about something that has strict time-to-live requirements, i.e., if the data arrives too late, it can no longer be trusted?

Critical and safety systems are really hard to engineer without clearly defined requirements and risk tolerances.