How does a switch work? Does a switch merge or queue packets?

Hi. I would love for someone to explain or point me to some documentation that shows how a switch works. I’m wondering how a switch could/would handle 2x1Gbit streams to a pc:

  1. When the pc has a 1Gbit line from switch.
  2. When the pc has a 10Gbit line from switch.
    Assuming that the streams have a linerate of 1Gbit.
    (For example: Downloading at linerate from WAN, while streaming a game or video to another pc on LAN.)
    Could (2) help reduce jitter? Can MTU or packet size be different, and packets be merged “last mile”, or could they “go out” quicker? Or are they queued? Is there a difference between packets coming in at the same time versus one after the other? (When are some packets dropped?)
    (I assume no QoS or packet prioritization in the switch, just a “dumb” one)

(lies and omissions for sake of simplicity below)

Switches (L2) handle sending and receiving separately.

Each port on the switch has a really short send queue/buffer, and a somewhat larger receive queue/buffer. Zooming out to a real-time scale of things, if traffic is operating at line rate, these queues only store about a microsecond’s worth of traffic.

Each packet gets wrapped in an Ethernet frame before it goes on the wire; the switch only looks at the outer frame data (MAC addresses) and doesn’t care what’s inside. If there’s no space left in these queues on the switch, the frame is dropped. For example, if you have a multiport gigabit switch and two hosts both trying to send data to a third host at full rate, the transmit queue on the switch towards the third host will saturate and data will be lost.
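To make the “overflow = drop” behaviour concrete, here’s a minimal Python sketch of a tail-drop output queue. The buffer size and the toy transmit loop are made up for illustration; real switch silicon doesn’t work like this.

```python
from collections import deque

PORT_SPEED_BPS = 1_000_000_000      # 1 Gbit/s egress port
QUEUE_BYTES    = 128 * 1024         # hypothetical per-port buffer

class OutputPort:
    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.queue = deque()
        self.dropped = 0

    def enqueue(self, frame_len: int) -> bool:
        """Try to queue a frame for transmission; tail-drop if there's no room."""
        if self.used + frame_len > self.capacity:
            self.dropped += 1           # the frame simply disappears
            return False
        self.queue.append(frame_len)
        self.used += frame_len
        return True

    def transmit_for(self, seconds: float) -> None:
        """Drain whatever the wire can carry in this time slice."""
        budget = int(PORT_SPEED_BPS / 8 * seconds)
        while self.queue and self.queue[0] <= budget:
            frame = self.queue.popleft()
            budget -= frame
            self.used -= frame

# Two 1 Gbit/s senders aimed at one 1 Gbit/s port: over any time slice the
# port can only drain half of what arrives, so the buffer fills and drops start.
port = OutputPort(QUEUE_BYTES)
for _ in range(2000):                   # a burst of 1500-byte frames from both senders
    port.enqueue(1500)
port.transmit_for(0.001)                # 1 ms of wire time drains roughly 125 KB
print(port.dropped, "frames dropped so far")
```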


From the perspective of hosts that are trying to communicate through this switch, they need to be able to detect this dropped data, and figure out if they should slow down sending… or what’s more complicated - maybe speed up sending.

This is where things like TCP (and increasingly QUIC) come in.

TCP gets data from the app as a stream of bytes and keeps it in a buffer of data that is to be sent. The data is removed from this buffer only when the other side acknowledges it has received it (a small packet coming back with just a counter). It’s kept around until then because we want to be able to retransmit it if we detect it’s been lost.

If there’s plenty of bandwidth available there’s no loss, and no delay introduced by this. Just a bit of extra memory used for as long as it takes the packets to traverse the switch (or in some cases the internet) and be read by the receiving host, so that the receiving host kernel can signal back an ACK-nowledgment that it has received the data.
If the packets are missing - e.g. the receiving side gets packets 1,2,3,4,5,7… the receiver will ask for a retry.
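If it helps, here’s a toy sketch (nowhere near a real TCP stack) of those two ideas: the sender keeps a copy of everything not yet acknowledged so it can resend, and the receiver notices the gap at 6 and asks for it again.

```python
# Toy illustration only: hypothetical helper functions, not a real TCP implementation.
sent = {}                       # seq -> payload, kept until acknowledged

def send_segment(seq: int, payload: bytes) -> None:
    sent[seq] = payload         # keep a copy in the send buffer for possible retransmit
    # ... hand the segment to the network here ...

def on_ack(acked_up_to: int) -> None:
    for seq in [s for s in sent if s < acked_up_to]:
        del sent[seq]           # safe to free: the receiver confirmed it arrived

# Receiver side: segments 1..5 and 7 arrive, 6 never does.
received = [1, 2, 3, 4, 5, 7]
expected = 1
for seq in received:
    if seq != expected:
        print(f"gap detected: asking the sender to retransmit segment {expected}")
        break
    expected += 1
```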

The effect of that from a receiving app perspective is, you see some data coming in with some low latency… and this one dropped packet requires data to be resent (app doesn’t see it)… so that piece of data just comes in later, at a higher latency…
This inconsistent latency of data being received is what’s called jitter.

Also, the buffer for this connection on the sending side will not be emptied until the data is confirmed received; eventually the sending app will just get stuck waiting for buffer space (or similar), and thus end up sending more slowly.

The default size of this sending TCP buffer varies by OS, and apps can usually control it within some reasonable bounds.
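For example, an app can read and request a change to the send buffer via the standard SO_SNDBUF socket option; the exact values reported back are OS-dependent and the kernel is free to clamp the request.

```python
import socket

# Peek at, and request a change to, the kernel's send buffer for one TCP socket.
# (Linux reports back roughly double what you asked for, since it also counts
# its own bookkeeping overhead.)
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

default_sndbuf = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
print("default send buffer:", default_sndbuf, "bytes")

s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1 * 1024 * 1024)  # ask for 1 MiB
print("after request:", s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF), "bytes")
s.close()
```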


You can’t really control what your ISP does with your traffic vs other people’s traffic, what the other person’s ISP does with their traffic, or what the network carriers in between do. They may have boxes with big buffers to smooth out spikes in traffic and raise utilization on a link; they might do what’s called RED or GRED to probabilistically drop traffic before links are saturated and avoid “global synchronization”; plus flow control, traffic engineering across multiple links, and all other kinds of weird tricks and stuff.

It’s a miracle that somehow incentives end up being aligned with all these middlepeople entities meddling with traffic such that you’re able to e.g. stream live video… or even harder have relatively low latency 1080p60 10mbps+ video calls with others whenever you randomly feel like it.


Also, there’s UDP, where apps don’t care if the data is lost, often because they can handle it somehow. And there’s a thing called forward error correction, like old Skype used to use, where instead of using TCP the app sends more data all the time and the receiving side can reconstruct it if some gets lost (think RAID for streams). But then it’s up to Skype to handle all the fancy retry and loss-detection logic, and it’s up to Skype to figure out how much bandwidth there is, how much bandwidth to use for redundancy and protection against packet loss… and how to deal with too much loss.
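A toy example of the “RAID for streams” idea, using a single XOR parity packet per group. This is an assumption-heavy sketch for illustration only (real FEC schemes are far more elaborate, and this is not how Skype actually did it).

```python
# For every group of equal-length packets, also send their XOR. Lose any ONE
# packet of the group and the receiver can rebuild it without a retransmit.
def xor_parity(packets: list[bytes]) -> bytes:
    parity = bytearray(len(packets[0]))
    for p in packets:
        for i, b in enumerate(p):
            parity[i] ^= b
    return bytes(parity)

group = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(group)

# Packet 1 ("BBBB") gets lost in transit; rebuild it from the survivors + parity.
survivors = [group[0], group[2]]
rebuilt = xor_parity(survivors + [parity])
assert rebuilt == b"BBBB"
print("recovered:", rebuilt)
```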


Best you can do on your end, is try not to drop packets uncontrollably in your own network, where you have full control over all traffic policies.

Some switches support priorities, some switches support traffic rate limiting and flow control signaling… hosts on a LAN can do a lot too, to limit the traffic rate/bandwidth used.

6 Likes

The short version (well it started short):

  • A switch processes ethernet frames
  • IP encapsulates data into packets, which are then embedded in the above frames, assuming you’re running over ethernet; otherwise the IP packet is embedded in something else, e.g., a PPP frame, a GRE or L2TP or IPSEC tunnel packet, etc.

IP and switches (ethernet) operate on two different layers of the network model. Switching (ethernet) is layer 2, IP packets are layer 3, and the TCP/UDP segment inside your packet is layer 4, with the data your app actually cares about riding on top of that.

If you’re unfamiliar with the layered network model, look it up so you understand what layers 1, 2, 3, 4 are, as it’s common network speak.

All a layer 2 switch does is attempt to dump the frame on the destination port; if traffic comes in too fast (e.g., the destination port’s buffer overflows) the frame is dropped, and it is up to layer 4 (TCP) or the application sending the data to deal with that.

Yes, if you connect to a 1GbE device and smash it with 2x 1GbE clients sending data to it at the same time at full rate, there will be a non-zero number of ethernet frames simply getting thrown away. The higher layers (TCP and the application) are designed to handle that.

— end short version —

Now, a standard ethernet frame carries up to 1500 bytes of payload (1518 bytes including headers). This is different from the MTU, which is an IP thing - you want your IP MTU to be no larger than what the frame can carry.

If you have a 10 gig port that has been manually configured for jumbo frames (larger than 1500 bytes, let’s say 9000 bytes - on a switch that can support it) sending data to a port that is only able to handle or is only configured for 1500 byte frames, what happens?

The frame is dropped. Which means the data inside the frame (e.g., your IP packet) is also dropped.

TCP is designed to deal with this - detect that the packet was dropped and back off (in the case of sending too fast), or send smaller packets if path MTU discovery is working.

But yeah, as you can see, playing with jumbo frames on the ethernet side can be fraught with danger. You need to know that the path between A and B is 100% capable of handling your frame size; so it’s typically only used between, say, a 10+ GbE SAN or host and another same-speed host, traversing only one or a small number of core switches. If the traffic coming in on a port is going anywhere else, you really don’t want to be turning on jumbo frames.

The wins from jumbo frames aren’t huge anyway these days (enterprise switches are usually now powerful enough to process every port running at full line rate so there’s processing headroom to spare) so they’re often not configured.
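If you want to sanity-check a path on paper, the rule is simply that the usable end-to-end MTU is the smallest MTU of any hop. A throwaway sketch with hypothetical hop names:

```python
# Hypothetical path: the largest packet that survives end-to-end is the
# smallest MTU on the path, which is why jumbo frames only pay off when
# EVERY hop is configured for them.
path_mtus = {
    "host A NIC":    9000,
    "core switch":   9000,
    "access switch": 1500,   # one forgotten port config...
    "host B NIC":    9000,
}

usable = min(path_mtus.values())
print("end-to-end usable MTU:", usable)          # 1500 - the jumbo config bought nothing
for hop, mtu in path_mtus.items():
    if mtu < 9000:
        print("9000-byte frames would be dropped at:", hop)
```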

QoS can be done at layer 2 or layer 3 with queuing but by default a switch won’t do anything QoS related out of the box.

If unconfigured, or with a dumb switch (as the OP mentions) - It’s a simple case of “overflow = drop”.

All QoS really does is tell the device what to throw away first, so it’s not a magic bullet; something has to lose, and typically you need to apply it on the sending end, as once the packet hits the receiving buffer (or overflows it) it is too late.

And yeah the above is somewhat simplified with lies and omissions as well.

QoS and traffic control are quite a complex subject - but the basics are that QoS is just nerfing the things you’re willing to throw away or shuffling others into a higher priority queue. (e.g., in QoS software you may have 4 queues: realtime, high, bulk and default - in increasing priority of drop when bandwidth becomes scarce - you put size limits on the higher priority queues to guarantee their priority, but it’s not magic and won’t help a sheer lack of bandwidth if you’re massively oversubscribed.)
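As a rough illustration of the queue idea: a hypothetical classifier, queue names borrowed from the example above, and strict-priority draining as one of many possible policies. Real QoS implementations add weights, policers, and far smarter drop decisions.

```python
from collections import deque

# Several queues per egress port; when the wire has room, drain the most
# important non-empty queue first. The classifier is a stub standing in for
# DSCP bits / VLAN priority / ACL matching.
PRIORITY_ORDER = ("realtime", "high", "bulk", "default")
queues = {name: deque() for name in PRIORITY_ORDER}

def classify(packet: dict) -> str:
    return packet.get("class", "default")

def enqueue(packet: dict, limit_per_queue: int = 100) -> bool:
    q = queues[classify(packet)]
    if len(q) >= limit_per_queue:
        return False              # overflow = drop; QoS only chose WHAT to drop
    q.append(packet)
    return True

def dequeue():
    for name in PRIORITY_ORDER:   # strict priority: realtime wins, default loses
        if queues[name]:
            return queues[name].popleft()
    return None
```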

4 Likes

I got a dumb question…

Presuming a NIC is the small brain dedicated to looking at a packet that comes in on a port, reading the encapsulated headers, then quickly working out where to send it, and pushing the packet onwards… what is the difference between 1 gig and 10/100/200+ gig NICs?
Can the packets only come in one at a time, so the controller just has to identify and move them on quicker for higher speeds? Or are there some parallel streams going on?

I’m presuming packets/frames/whatever stay the same size (1500) and they don’t magically move down the wire any faster (faster than Veristablium’s 1/c), and although fibre has multi-frequency concurrent streams, I don’t think that would be applicable in anything other than huge backhauls.

So, can anyone eli5 how multigig differs? Is it just the controller being better?

Just to clarify, the receiver will ask for a retry of only packet 6, not the whole stream (that’s how TCP works).

The rest has been pretty well explained by both risk and thro. But neither of them got into packet resizing or different speed links, and thro only mentioned frame encapsulation in passing. This will also answer Trooper_ish’s question.

Also a bit of lying and omission for the sake of explanation will happen here.

In any TCP/IP network (even if you use UDP or other protocols), you can have layer 3 (the IP part) be encapsulated into other protocols. This is done by routing. Say that your PC has an ethernet port, your router is connected via a telephone line (DSL) and the server you want to access is connected with an optical cable. Your PC will spit out ethernet frames. The router gets an ethernet frame on one port, decapsulates the ethernet frame and re-encapsulates the traffic in a PPP frame. From there, it reaches the ISP, who is likely using fibers, so the PPP frame gets decapsulated and the payload gets re-encapsulated again into an ethernet frame, and it potentially goes like this all the way to the server. It’s important to note that layer 2 is still a protocol and layer 1 (the physical layer) can be anything, which is why ethernet can work over both classic twisted pairs of copper cables with RJ45 connectors at the end, and through a fiber optic cable.
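Purely to visualize that hop-by-hop re-wrapping (hypothetical helper functions, not a real protocol implementation): the layer 3 packet stays the same the whole way, only the layer 2 envelope around it changes.

```python
# Toy "framing" functions that just wrap a payload in a labelled envelope.
def ethernet_frame(payload: bytes, src_mac: str, dst_mac: str) -> bytes:
    return f"[ETH {src_mac}->{dst_mac}|".encode() + payload + b"]"

def ppp_frame(payload: bytes) -> bytes:
    return b"[PPP|" + payload + b"]"

ip_packet = b"[IP 192.168.1.10->203.0.113.5|[TCP|hello]]"

on_your_lan   = ethernet_frame(ip_packet, "aa:aa", "bb:bb")   # PC -> router
over_the_dsl  = ppp_frame(ip_packet)                          # router -> ISP
at_the_server = ethernet_frame(ip_packet, "cc:cc", "dd:dd")   # ISP -> server

for hop in (on_your_lan, over_the_dsl, at_the_server):
    print(hop.decode())   # same IP packet inside, different layer-2 wrapper
```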

Now, we have a map of our connection. It’s important to keep in mind that this “chain” of routers and switches will work at the speed of the slowest hop (the “weakest link in the chain”). So, let’s say your PC has a 1 Gbps ethernet port and is connected directly to your router, which has a 100 Mbps ethernet port, and on the other end you get 56 Kbps internet via a telephone line. We ignore the ISP side and assume it’s bigger than 100 Gbps. We then arrive at the server, which has a 10 Gbps SFP+ port.

Your PC, being connected to a slower link, will negotiate with the router and the link will get set to 100 Mbps. So your PC will try to upload a file to the server, pushing its maximum throughput of 100 Mbps ethernet, which is made of frames of up to 1518 bytes in size (of which 18 bytes are the ethernet frame headers, so the payload is 1500 bytes, or about 0.0014 MB per frame). I’m too lazy to keep doing the math and it’s boring anyway, but basically your PC will split your payload (your file) into multiple pieces of roughly 1400-1460 bytes (some bytes are reserved for the layer 3 IP header and some for the layer 4 TCP header) and will try to push enough frames per second to fill about 100 Mbps (if my math is correct, roughly 8,200 frames of ~1500 bytes per second).
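For anyone who wants to check the napkin math (this ignores the Ethernet preamble and inter-frame gap, so the real numbers are slightly lower):

```python
LINK_BPS    = 100_000_000     # 100 Mbit/s
FRAME_BYTES = 1518            # max standard Ethernet frame on the wire
IP_TCP_HDRS = 40              # 20 bytes IPv4 + 20 bytes TCP (no options)

frames_per_sec  = LINK_BPS / 8 / FRAME_BYTES
payload_per_sec = frames_per_sec * (FRAME_BYTES - 18 - IP_TCP_HDRS)

print(f"{frames_per_sec:,.0f} frames per second")            # ~8,200
print(f"{payload_per_sec / 1e6:.1f} MB/s of actual file data")  # ~12 MB/s
```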

Your router gladly accepts the traffic, but then, when it tries to push the 1500-byte payload it pulled out of the Ethernet frame over the phone line, it sees a problem. The payload is big and there are lots of them. Let’s assume that the 56K modem can’t encapsulate 1500-byte payloads into PPP frames (it usually can, but let’s assume it doesn’t). IP (layer 3) knows how to deal with this, so one of two things happens. If your original IP packets were marked with the DF (Don’t Fragment) flag, then the router will drop them and return an ICMP message to your PC saying, roughly, “Destination unreachable: packet too big, fragmentation needed but DF set”. The sender PC will see the message. The same message also contains the next-hop MTU, so your PC knows to readjust its packet size in order to successfully send packets through the smallest hop.

The other thing that can happen is that the router sees the big packets, sees that no DF flag is set, and then, when it decapsulates the frame, it fragments the payload into multiple smaller packets, and this is how the server will receive the payload. The server will have to reassemble the packets itself.
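A very rough sketch of what that splitting looks like. Real IP fragmentation works in 8-byte offset units and copies the IP header fields into every fragment; this only shows the general shape of offsets and the “more fragments” flag.

```python
# Split one large IP payload into fragments that fit the next hop's MTU.
def fragment(payload: bytes, next_hop_mtu: int, ip_header_len: int = 20):
    max_data = next_hop_mtu - ip_header_len
    frags = []
    offset = 0
    while offset < len(payload):
        frags.append({
            "offset": offset,
            "more_fragments": offset + max_data < len(payload),
            "data": payload[offset:offset + max_data],
        })
        offset += max_data
    return frags

# A 1480-byte payload squeezed through a hop with a 576-byte MTU:
for f in fragment(b"x" * 1480, next_hop_mtu=576):
    print(f["offset"], f["more_fragments"], len(f["data"]))
```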

The DF flag is mainly there for traffic that cannot handle fragmented packets; in practice, modern TCP stacks set it on most packets anyway, so the ICMP dance above (path MTU discovery) does the readjusting instead of relying on fragmentation.

One nifty thing about TCP is that when there’s congestion, the receiving side can tell the sender “hey, slow down there, boy” (by advertising a smaller window), and the PC will send fewer frames to the server.

Speaking of congestion, or rather, the speed of a destination link: we determined that your PC will send smaller frames, and only as many as the slow 56K modem can process. So the 100 Mbps link to the router is far from saturated. What that means is that if you have another host in your LAN, you can use the remaining throughput to send a file to that host from your PC.

Now the server. The server is receiving a meager 56 Kbps of traffic from your PC, but it has a 10 Gbps link. That means it can receive a multitude of packets per second from a lot of PCs. But let’s assume there are just 3 PCs: yours, another that can talk at 1 Gbps to the server, and another that can also talk at 10 Gbps with the server (assuming no slowdowns on the path). That means the first two can push their traffic pretty well, with the last one not speaking at 10 Gbps, but at roughly 9 Gbps with the server (assuming that only this PC will be throttled and not the other two).

Pretty much it’s just operating at a higher frequency. The data isn’t physically moving any faster but more things are happening per second. The hardware may or may not be working in parallel, older 10gbps fibre transceivers for example used multiple lasers with prisms to get that speed as the electronics couldn’t run fast enough, but the controller is running at 10gbps.

2 Likes

It’s kind of what you’d expect, lots and lots of CPU offloading happens and signaling is at higher frequency and at higher power and power electronics is more precise and the controller is smarter…

… let’s take a 400G NIC as an extreme example … not a switch, but a NIC (it’s a standard, they exist), and let’s try to serve 40GB/s of Netflix movies out of some NVMe storage.

The NIC controller needs to be very smart; this is cutting-edge silicon valley stuff where you have hundreds of people each costing companies millions of dollars per year in salaries and equity, building stuff for internal use by only a small set of companies, never to be sold or used by anyone else.

boring details

So if we had a “stupid slow brain nic”, electrical signals go into some shift registers and a packet appears in a piece of ram. CPU gets an interrupt, ding-dong, hey CPU stop what you’re doing there’s a packet for you.

CPU stops, saves the context of whatever the core was doing, goes into the kernel; the kernel copies some bytes from what looks like memory but is actually over on some PCIe slot somewhere (think really slow ram), and then starts looking… oh, is this for me? Is this IP, is it TCP/UDP, is there a port, ok let’s see the firewall … cool, let’s see if there’s a socket that an app has open and so on… oh nice, does the socket have a buffer with space? Cool. Check the checksum and copy the data, then ask the scheduler nicely to wake up the app at first convenience and send a TCP ACK back, by getting some memory, creating a buffer and copying some memory… to PCIe
… app wakes up … oh it’s TLS, ok, decrypt, oh it’s HTTP 1.1 GET /secret_api/star_trek.mkv HTTP/1.1 … ok luckily we’re…

… [I’ll stop here, this is tiring…]…

So, this HTTP request is about 1K long; you end up with a loop that reacts to events such as “data received from NVMe” → wrap it in TLS, split it into packets, send to the NIC.

Good thing CPUs are fast.


At 400 Gbps… you get about 20ns worth of time to process a 1KB packet. At 4GHz that’s roughly 80 CPU cycles… If you have a really smart CPU (thinking M1 or similar), that’s a few hundred dumb streamlined instructions.
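The napkin math, for anyone who wants to poke at the assumptions (1KB packets, a single 4GHz core):

```python
LINK_BPS     = 400e9          # 400 Gbit/s
PACKET_BYTES = 1024           # "about 1KB"
CPU_HZ       = 4e9            # 4 GHz core

seconds_per_packet = PACKET_BYTES * 8 / LINK_BPS
print(f"{seconds_per_packet * 1e9:.0f} ns per packet")          # ~20 ns
print(f"{seconds_per_packet * CPU_HZ:.0f} cycles per packet")   # ~80 cycles at 4 GHz
```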

So, simple, … parts of the firewall move to the NIC (modern NICs run BPF), you don’t get an interrupt on every packet, you get one every several… (interrupt coalescing), but that’s not enough… you move TCP buffers and segment handling onto the NIC (LRO/TSO offloads, or full TCP offload), so if you have 10000 connections the NIC ends up doing TCP, and maybe TLS, and you just get the request; you never see or need to send ACKs or run any crypto code (fancy stuff, more BPF, modern NICs are more like FPGAs, sometimes they actually have FPGAs on the card)… what else… oh yeah, with CXL you set up a virtual memory address and hook up NVMe reads to go directly to the NIC, bypassing most of the CPU caches…

… so your app on the CPU gets woken up every 1MB of data to update some stats. The NIC handles the rest… because it’s more like a GPU or a machine learning accelerator, you get a few hundred dumb cores doing simple stuff for each of the connections. It doesn’t fit everything into that ~20ns-per-packet budget - each packet takes longer, but massive parallelism keeps up.

These scary 400G NICs are more like a second computer that your software has to figure out how to use; they also cost almost as much as you’d pay for a decent 2-socket motherboard and a tee’o’ram (TiB of RAM).

Gigabit NICs do a bit of partial TCP offloading, but they do it one packet/stream at a time, or a few; you typically only get some kind of BPF-programmable capability on a very few 10Gbps chipsets. 10-40Gbps gets you the ability to steer packets to CPU cores by an identifying tuple, and the chances are greater that you get more programmability to do less work on the main CPU.
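The “steer by tuple” trick boils down to something like this. Real NICs use a Toeplitz hash plus an indirection table; this sketch (with a made-up queue count) only shows the shape of the idea: every packet of a given connection lands on the same receive queue, and therefore the same core.

```python
NUM_RX_QUEUES = 8   # hypothetical number of NIC receive queues / CPU cores

def pick_queue(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
               proto: str = "tcp") -> int:
    # Hash the 5-tuple so one flow always maps to one queue.
    return hash((src_ip, dst_ip, src_port, dst_port, proto)) % NUM_RX_QUEUES

print(pick_queue("10.0.0.5", "10.0.0.9", 49152, 443))   # same inputs -> same queue
print(pick_queue("10.0.0.5", "10.0.0.9", 49153, 443))   # different flow, maybe another queue
```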

3 Likes

perfect, Thanks!

Enough detail for the gist, and not enough to overload with stuff a casual interest can’t handle, I appreciate it!

1 Like

The carrier is clocked faster, so yes, 10, 100 gig ethernet runs at higher frequency over the wire and the ASICs are running faster.

This is why cable specs for 1 gig, 10 gig, etc. get tighter, and why you won’t see 100 gig over cat5e any time soon.

Layer 2 switches are heavily optimised to do what they do. Even a layer 3 ethernet switch cuts out a bunch of stuff a pure router can do, as it is optimised for ethernet only.

1 Like