Has anyone liquid cooled an A100 yet?

I just received an A100 for a work project and I’m installing it in a workstation. I was hoping to not have to build/buy an active air-cooling adapter to run it as I already have water cooling.

I can’t find anyone making liquid cooling for the A100 besides the companies selling entire workstations. I was starting to lose hope until I watched the YouTube video on cooling the V100, which mentioned that the V100 has basically the same layout as the Titan.

Has anyone tried or compared the layout of the A100 to other products to see if there is something close enough to work?

Nvidia, I guess

Yeah, my bad.

I assume you mean this one

And not this one

Nvidia do be making some good looking waterblocks. Wonder if you could track down the spare parts.


Also the above one is for the SXM A100 module

Correct, the PCIe version.

I don’t think there are any easily obtainable water blocks for them, at least not for the public in single-unit quantities.

The die on it is unlike any of the mainline stuff as far as I know so there is no common block for it. If they do them at all it will be corporate surplus.

Edit: if this picture is correct, I don’t think there are any water blocks for it unless there are special ones for data centres.

Pretty sure this is not the A100 but the DPU from Mellanox

Why would a Mellanox card need the same die as the DGX A100, with 6 stacks of HBM for 40GB of memory, and NVLink connectors?

I was suspicious too, because it has InfiniBand connectors and none of the renders of the EGX A100 have them. But this is the picture of the bare PCB that is everywhere, and there do not seem to be other shots readily available.

They look similar but not the same

Sorry, my fault

Nah no problems. I was not sure myself.

I had a couple of messages back and forth with EKWB and it doesn’t look like I’ll be seeing a commercially available water block from them any time soon. I’ve tried a couple of things and finally have something working.

The first thing I tried was to seal off the case and turn all my fans into intake fans, thinking that if the only place the air could escape was through the A100, it might be enough. I even purchased an anemometer to measure the CFM through the GPU. This works to some degree, except that I need all my intake fans running at full speed to get the needed airflow through the GPU.
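For anyone wanting to repeat the measurement: a handheld anemometer typically reports air speed rather than CFM, so the reading has to be multiplied by the cross-section of the opening. Here is a minimal sketch of that conversion. The function name and the example opening size and air speed are made up for illustration and are not measurements from this build.

```python
# Rough airflow estimate from an anemometer reading.
# Assumption: the reading is an average air speed across the whole opening.

M3S_TO_CFM = 2118.88  # 1 cubic metre per second in cubic feet per minute


def airflow_cfm(velocity_m_s, width_mm, height_mm):
    """Volumetric flow in CFM through a rectangular opening."""
    area_m2 = (width_mm / 1000.0) * (height_mm / 1000.0)
    return velocity_m_s * area_m2 * M3S_TO_CFM


# Hypothetical numbers: 4 m/s through a 100 mm x 35 mm opening.
print(f"{airflow_cfm(4.0, 100, 35):.1f} CFM")
```

A single-point reading will tend to overstate the flow, since air moves more slowly near the edges of the opening, so averaging a few readings across the exhaust is probably the safer approach.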

I began experimenting with a couple of fan shroud designs using cardboard and gaffer’s tape. Once I was happy with the layout and airflow, I transferred the measurements to some aluminum sheet scraps I had in my shop and welded it up.

It appears to work best if I pull air through the shroud, as it is hardly engineered for optimal airflow. One of the problems I had with running the fan as intake (intake from the GPU’s standpoint, not the case’s) is that because the fan had enough power and was basically blowing at a flat plate, a good portion of the air was actually getting directed back out. It had me scratching my head as to why my intake fan was exhausting air.

But in any case here is my temporary solution until I can find or make a waterblock.


Pretty sure @wendell used external(?) blower fans for his V100, perhaps he could link the exact model on here.

I tried attaching one of the PCI slot exhaust fans as well, but it wasn’t moving enough air. This is the one I tried: Evercool Fox-2

Flow-wise, it would probably be better to have the fan pointing into the end of the shroud and have the shroud reduce down to the size of the card’s opening, rather than having the air try to make a tight 90° turn. That way the air can keep going in a more or less straight line and potentially lose less speed or pressure than when it is forced to make a turn and get going again in a different direction.

That was another cardboard prototype but it required a lot more aluminum and I had a couple of scrap pieces that were not big enough for that design. It also included a lot more small parts and angles. I completely agree that if it was more efficient I could probably run a slower fan. Maybe rev b. when I find some more aluminum. But to be honest this is keeping the GPU pretty cool. I can run some benchmarks and get temp readings. It just sounds like a drone.

If anyone comes across a big block of copper or nickel I’ll gladly machine a water block prototype and test it.

I ran some benchmarks and it turns out it’s working quite well. I ran the SSD V1.2 TensorFlow training example container just to stress test the system.

I ran it on both my “custom” cooled A100 and the new RTX A6000. I’m assuming it’s running well, as the A100 hovers around 77 degrees Celsius once warmed up and running for a couple of hours, whereas the RTX A6000 was at 84 degrees Celsius with its internal fan running at only ~60%. I don’t know the A6000’s fan curve, but I would assume that if the fan is only at 60%, then 84 degrees is fine. I know the A100 thermal throttles at 92 degrees, so I think I can safely call it a success. In fact, I might change the profile of the A100’s fan to run a little slower and lower the dB.
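If anyone wants to log temperatures during a run like this, here is a rough sketch that wraps `nvidia-smi` and parses its CSV output. The helper names (`read_gpus`, `parse_gpu_readings`, `headroom_c`) are my own invention, and the 92 degree limit is just the throttle figure quoted above, not something read from the card. A passively cooled card may report its fan speed as `[N/A]`, which the parser maps to `None`.

```python
import subprocess

THROTTLE_C = 92  # throttle point quoted in the post above (assumption, not queried)


def parse_gpu_readings(csv_text):
    """Parse 'name, temperature, fan' lines from nvidia-smi CSV output."""
    readings = []
    for line in csv_text.strip().splitlines():
        # Split from the right so a comma inside the GPU name would not matter.
        name, temp, fan = [field.strip() for field in line.rsplit(",", 2)]
        readings.append({
            "name": name,
            "temp_c": int(temp),
            # Cards without a controllable fan report something like "[N/A]".
            "fan_pct": int(fan) if fan.isdigit() else None,
        })
    return readings


def headroom_c(temp_c, limit_c=THROTTLE_C):
    """Degrees Celsius remaining before the assumed throttle point."""
    return limit_c - temp_c


def read_gpus():
    """Query every GPU once via nvidia-smi (requires the NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=name,temperature.gpu,fan.speed",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_readings(out)
```

Calling `read_gpus()` every few seconds from a loop with `time.sleep` and printing `headroom_c(r["temp_c"])` for each reading should be enough to see whether a slower fan profile still leaves a comfortable margin.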

The container can be found on the NVIDIA NGC website under nvidia:ssd_for_tensorflow

and run using:

bash examples/SSD320_FP16_1GPU_BENCHMARK.sh

Training times for ResNet-50 on the COCO dataset on a single GPU were:
A100 : real = 276m21.864s
RTX A6000 : real = 408m54.403s
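To put those two `real` times on the same footing, here is a small sketch that converts the `time` output to seconds and works out the ratio. The helper name `real_to_seconds` is hypothetical; the two timing strings are copied from the results above.

```python
import re


def real_to_seconds(real):
    """Convert a shell `time` value such as '276m21.864s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", real.strip())
    if match is None:
        raise ValueError(f"unrecognised time format: {real!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


a100_s = real_to_seconds("276m21.864s")
a6000_s = real_to_seconds("408m54.403s")

speedup = a6000_s / a100_s
print(f"A100: {a100_s / 3600:.2f} h, RTX A6000: {a6000_s / 3600:.2f} h")
print(f"A100 speedup over RTX A6000: {speedup:.2f}x")
```

That works out to the A100 finishing the run roughly 1.48x faster than the RTX A6000 for this particular benchmark script.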

Damn, A100 really on another level.

Definitely on another level. I want to run some more benchmarks if I can get access to some more GPUs. I’m sure there is another thread for benchmarking etc., so I’ll move there for further developments.

Nothing immediately available that I’ve seen [it’s still a bit too new a device to expect LC options]. Maybe try fitting a higher-RPM fan to your existing shroud, or arranging fans so they directly face the intake [clearances permitting].