Punishing an Epyc 9254 with 500k TCP connections

I’m doing a stress test of LANforge emulating 570,000 TCP connections, pushing 11+ Gbps across eight 10GbE ports. That’s 400 processes, each with 1,425 sockets open, sending 20 Kbps per socket. I didn’t expect to need 128 GB of RAM. The next servers will get 256 GB!
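For the curious, the numbers multiply out like this (my own back-of-envelope arithmetic, not LANforge output):

```python
# Back-of-envelope sanity check on the test parameters
processes = 400
sockets_per_process = 1425
kbps_per_socket = 20

total_sockets = processes * sockets_per_process       # 570,000 connections
total_gbps = total_sockets * kbps_per_socket / 1e6    # ~11.4 Gbps aggregate
kib_per_conn = 128 * 1024**3 / total_sockets / 1024   # ~235 KiB/connection if all 128 GB went to connection state
print(total_sockets, round(total_gbps, 1), round(kib_per_conn))
```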

This is with hyperthreading disabled. I don’t know whether it would be any better with hyperthreading enabled; I’ll try tomorrow. I’m hoping it will run overnight without falling over. Swap usage has actually gone down since the test started.


You could probably reach 10+ million TCP connections fairly easily using something that handles TCP with less memory and CPU per connection, like Erlang/Elixir.
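For a rough sense of the idea (not Erlang itself, just the same one-scheduler-many-sockets model sketched in Python’s asyncio; the host/port are placeholders):

```python
import asyncio

async def one_connection(host: str, port: int) -> None:
    # Each "connection" is just a coroutine plus buffers, not an OS thread/process
    reader, writer = await asyncio.open_connection(host, port)
    writer.write(b"ping")
    await writer.drain()
    await reader.read(4)
    writer.close()
    await writer.wait_closed()

async def main() -> None:
    # 10k concurrent connections on a single event loop; Erlang/Elixir push the
    # same idea much further, with per-process heaps of a few hundred bytes
    await asyncio.gather(*(one_connection("127.0.0.1", 9000) for _ in range(10_000)))

asyncio.run(main())
```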

I’ve heard interesting things about Erlang.

Part of the memory footprint is the per-connection statistics tracking. I didn’t write the LANforge Server, but it has been maintained and improved for over 20 years.
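To give a flavor of why that adds up (an illustrative guess, not LANforge’s actual data layout):

```python
from dataclasses import dataclass

@dataclass
class ConnStats:
    # A modest per-connection record; real trackers also keep
    # histories/histograms, which is where the bytes really go
    tx_bytes: int = 0
    rx_bytes: int = 0
    tx_pkts: int = 0
    rx_pkts: int = 0
    dropped: int = 0
    retransmits: int = 0
    latency_us_sum: int = 0
    latency_us_max: int = 0
    jitter_us: float = 0.0

# At 570,000 connections, every extra KiB of per-connection state
# costs roughly 0.5 GB of RAM
```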

Still wouldn’t keep up with my wife. I’m still answering her first text when texts 2-5 arrive. :stuck_out_tongue:


@memnoch_proxy - are you after a load balancer of some kind or…

From where I’m sitting, the internet has never been smaller relative to the size and capacity of a single machine.

A lot depends on the driver and software stacks, and NICs and their drivers can do a lot of heavy lifting (DPDK and XDP can do some amazing things, and there are people writing crazy firewalls in Go plus some C for eBPF).

And then there’s that thing some router vendors call “sticky ECMP”, which is how you get from a few terabits of traffic hitting a handful of VIPs on a pair of expensive routers to a few dozen much cheaper software load balancers / firewalls / whatever smarter network function you want to implement.
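The core trick is just a deterministic hash over the flow 5-tuple, so every packet of a flow lands on the same box; a toy sketch (backend names made up, real routers do this in hardware):

```python
import hashlib

def pick_backend(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                 proto: int, backends: list[str]) -> str:
    # Same 5-tuple -> same digest -> same backend, so TCP state stays local
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]

print(pick_backend("10.0.0.1", "192.0.2.10", 49152, 443, 6,
                   [f"lb{i}" for i in range(24)]))
```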

… so I’m curious, what are you trying to do with the host?

In this case a reseller is looking to test an IDS/DPI appliance. They wanted 600k stateful connections and over 60 Gbps of traffic. We are proposing one system to provide 60 Gbps and 130,000 connections, and a second system to provide 10 Gbps and 570k connections.
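Back-of-envelope, that works out to very different per-connection rates (my math):

```python
fast_bps = 60e9 / 130_000   # ~462 Kbps per connection on the 60 Gbps system
slow_bps = 10e9 / 570_000   # ~17.5 Kbps per connection on the 570k system
print(f"fast: {fast_bps/1e3:.0f} Kbps/conn, slow: {slow_bps/1e3:.1f} Kbps/conn")
```

That’s roughly a 26x spread in per-connection rate, which is part of why mixing the two profiles on one box is awkward.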

Both scenarios fit in 128 GB of RAM, but it appears the kernel scheduler doesn’t want to mix fast connections and slow connections on the same system. With the load spread across 400 processes, it prefers to schedule them evenly when oversubscribed this way.

The traffic handling keeps statistics on packet loss, latency, and jitter for each group of endpoints, so loss at the DUT will be noticed, and so will TCP retransmits.
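For reference, a common way to keep a running per-flow jitter figure is the RFC 3550-style estimator; I’m not claiming LANforge does exactly this, but it shows the kind of per-flow state involved:

```python
def update_jitter(jitter_us: float, prev_transit_us: float, transit_us: float) -> float:
    # "Transit" is receive timestamp minus send timestamp for a packet;
    # the estimator smooths the transit-time deltas with a 1/16 gain
    d = abs(transit_us - prev_transit_us)
    return jitter_us + (d - jitter_us) / 16.0

jitter, prev = 0.0, 1500.0
for transit in (1510.0, 1490.0, 1700.0, 1505.0):  # per-packet transit times (us)
    jitter = update_jitter(jitter, prev, transit)
    prev = transit
print(f"jitter ≈ {jitter:.1f} us")
```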

So you’re hooking up 8 interfaces, and the software expects them to loop back through the DUT while trying to torture the DUT?

What’s the payload? Does it have to be good-looking TLS or QUIC, or plaintext, or random? … or is it configurable from a capture file?

I don’t know why I’d expect a 9254 to be able to do more than 10 Gbps / 60 Gbps.

The payload in this case is not layer 4-7, but layer-3 packets stuffed with numeric patterns. Layer 4-7 is possible with libcurl, but 130,000 copies of curl would not be as efficient for the given amount of memory.
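A guess at what “stuffed with numeric patterns” can look like in practice (purely illustrative, not LANforge’s actual generator): an incrementing counter pattern is cheap to produce and lets the receiver spot loss or corruption without keeping a copy of the original payload.

```python
import struct

def numeric_pattern(length: int, seq: int) -> bytes:
    # 32-bit incrementing counters seeded from a per-packet sequence number
    words = (length + 3) // 4
    return struct.pack(f">{words}I", *((seq + i) & 0xFFFFFFFF for i in range(words)))[:length]

payload = numeric_pattern(1024, seq=42)
assert payload[:8] == struct.pack(">II", 42, 43)
```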