GPU -> GPU expected bandwidth

I’m working on a machine learning project and I’m having trouble scaling my training to multiple GPUs. Even with heroic efforts, I’m seeing less than 40% throughput increase moving to 2 GPUs.

I’ve run lambda’s benchmark and saw a 34% throughput increase with 2 GPUs versus 1. In their blog post, they claim an 80% speed up for 2 gpus.

A few years ago, I was able to get ~linear scaling with multiple GPUs. Since then, I’ve changed GPUs (2080 tis → 3070 tis), and my motherboard (x399m taichix399 Phantom Gaming 6).

I’m concerned that one of the things that made the replacement motherboard so cheap was some part of the PCIe system that limited slot to slot bandwidth.

I ran the cuda BusGrind tool, and I’m seeing card to card bandwidth of 3.32 GB/s:

P2P Bandwidth Matrix (GB/s) - Unidirectional, P2P=Enabled
   D\D      0      1
      0 257.84   3.32
      1   3.32 263.90
P2P Bandwidth Matrix (GB/s) - Bidirectional, P2P=Enabled
   D\D      0      1
      0 270.60   4.88
      1   4.98 270.33

What is the expected slot to slot bandwidth for 2 Gen 3 16x lanes?

I’ve learned about the nvidia-smi topo -m command, which outputs (edited slightly):

	GPU0	GPU1	
GPU0	 X 	PHB	
GPU1	PHB	 X 	
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)

Which tells me that my GPUs can’t talk to each other directly, they need to pass data through the CPU. I found this Serve The Home article which included this chart:

Which shows that much higher GPU to GPU bandwidth is possible.

On a hunch, I reconfigured my CPU (Threadripper 2950x) to run in NUMA mode, and re-ran BusGrind and got this output:

P2P Bandwidth Matrix (GB/s) - Unidirectional, P2P=Enabled
   D\D      0      1
      0 254.46   5.55
      1   5.54 263.66
P2P Bandwidth Matrix (GB/s) - Bidirectional, P2P=Enabled
   D\D      0      1
      0 270.00   7.50
      1   6.82 269.96

Which is good for a 67% increase in performance. I suspect that if I can overclock my RAM a bit, I should get even more bandwidth.