Cooling 8x Titan RTX in a server chassis

hi,
We have a Supermicro 4028GR-TR server which we would like to fill with 8 Nvidia Titan RTX GPUs. It has to be the Titan RTX because they have a great price/performance ratio in our application.
The problem is: with the cooler on the Titan RTX there is no way to get rid of all the heat in a tightly packed server chassis. There is just no space for the cooler to take in air from the side (~5 mm to the next card) or exhaust on top (~5 mm up to the closed-off top cover of the server). The totally closed PCI slot covers don't help either.
Preliminary testing showed at least a 25% performance drop-off after 30 s because of overheating. Does anyone have any ideas how to get around that problem?
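
For reference, the throttling shows up clearly in a quick NVML poll of clocks and throttle reasons per card; this is only a minimal sketch, assuming the pynvml (nvidia-ml-py) package is installed on the box, nothing specific to this server:

```python
# Minimal throttling logger: poll temperature, SM clock, and throttle
# reasons for every GPU once per second while a benchmark is running.
import time
import pynvml  # assumes the nvidia-ml-py / pynvml package is installed

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

# Bitmask of the two thermal slowdown reasons reported by NVML.
THERMAL = (pynvml.nvmlClocksThrottleReasonSwThermalSlowdown
           | pynvml.nvmlClocksThrottleReasonHwThermalSlowdown)

try:
    while True:
        for i, h in enumerate(handles):
            temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
            sm_clock = pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_SM)
            reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(h)
            flag = " <-- thermal throttle" if reasons & THERMAL else ""
            print(f"GPU{i}: {temp} C, SM {sm_clock} MHz{flag}")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```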
thanks

Custom loop watercooling?

Other than halving your GPU count to leave a 2-slot gap between cards, I cannot see a solution that does not involve (semi-)custom hardware.
Server cards have passive coolers: no fans, just one big fin array along the length of the card. Your server seems to have plenty of fans, so if you could replace the stock coolers with passive ones, you might be able to get better performance.
But you will have trouble sourcing appropriate coolers, unless you can find a bunch of dead server cards with coolers that you can jerry-rig onto your Titans.

The Quadro RTX card coolers might fit, but I don't think those cards have been released yet. Also, I doubt you could buy just the cooler on its own.

That, or get a second server and distribute the GPUs across them if you can.

1 Like

I ran into this once. After a LOT of teeth gnashing, our fix was to retool the top cover of the case for forced-air exhaust.

We used weather stripping around the top of the case so the fans in the chassis forced air out the top of the case.

You can use cardboard and Saran wrap to prototype the solution. The 5 mm gap isn't a big deal for intake if the RPM on the midboard fans is high enough.

Once we were done, we went to a local machinist, and they used their bending brake and a couple of jigs to make us a fit-alike cover that would exhaust the hot air out the top.

2 Likes

Watercooling isn't an option for two reasons:

  • it would require complete retooling of the top cover to make space for inlet/outlet ports
  • too risky for a 24/7 running machine in a company…

I am pretty sure that 2080 Ti coolers would fit the Titans, and there are blower-style coolers for the 2080 which usually work fine in this server, as they allow airflow from front to back. The problem is sourcing them: buying those cards just for their coolers is crazy expensive.
And yeah, I tried sourcing those coolers without cards, either used or new, but without success.

@wendell: thanks for the suggestion, I will try that. Maybe it's enough to cut exhaust slots in the top cover and close up the unused PCIe slots in the back.

Pretend the air is a liquid flow. Design it so the positive pressure is forced between the cards and out of the case. If the air can recirculate, it will, and it will just keep heating up. So run some different tests.

1 Like

Having worked with that exact Supermicro system in the past: there is no way you will be able to keep the GPUs from thermal throttling and running at 90 °C constantly, unless it's in a very isolated server room where you can run those SUPER LOUD AND ANNOYING fans at a constant 100%, and even then, I'm doubtful that would prevent the cards from thermal throttling.

We were running 8x 1080 Tis, which also have a lower TDP than the Titan RTX, and the cards were still always overheating.

Another thing to note: we were often running Redshift 3D on that system with all 8 GPUs, and no matter what we changed in software or hardware (excluding switching the host OS from Windows to Linux, since we had a lot of Windows-only applications), the system would crash. Sometimes the crashes would occur multiple times per day; other times it would happen once a week. My senior and I never figured out what the exact issue was, but we narrowed it down to unstable Nvidia drivers (the BSOD was often driver related, or other times the screen would completely freeze but still accept commands, or not accept commands, or a whole bunch of other weird stuff).

Before you fully commit to that system, I really recommend you make sure you can get it stable with your use case (or make sure it doesn't matter if it crashes multiple times a day); we never did find the exact root cause, so we eventually gave up on troubleshooting it and downgraded the system to 3x 1080 Tis (it hasn't crashed in months).

Edit: The 1080 Tis that we were running all had blower-style coolers, not the open-style cooler that the Titan RTX has, so I'd imagine it will be a lot harder to cool your cards, because they will constantly be sucking in the heat of the other cards… Wendell's suggestion may help with cooling, along with removing the card backplates (if possible on the Titan RTX) and creating makeshift spaces, but I'm doubtful.

1 Like

Yes, I should add I was figuring he was running the fans at 80%+… this chassis does not really work with fans below 70% for any high-end build anyway…

2 Likes

Hm… we have been running that system for over a year now with 3 Titan Xs + 1 1080 Ti and had no issues. I hope this doesn't change with the upgrade…
The application is an in-house developed simulation and also runs on Windows only, sigh.
The cards run hot, but they don't throttle noticeably. I don't even know how fast the fans are running, but luckily the server room is on another floor, and the fans shouldn't bother anyone.
Thanks, and I'll get back to you when I've had time to do the testing.
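
(Side note on the fan speeds: they should be readable from the Supermicro BMC. A rough sketch, assuming ipmitool is installed and in-band access to the BMC works; otherwise the same readings show up in the IPMI web interface:)

```python
# Rough check of the chassis fan RPMs via the Supermicro BMC.
import subprocess  # just wraps the ipmitool CLI

# "sdr type Fan" limits the sensor dump to the fan sensors (RPM readings).
result = subprocess.run(
    ["ipmitool", "sdr", "type", "Fan"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```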

I am considering exactly this, but with 2080 Tis. Unfortunately, the company selling me the system did not test this and cannot tell me if it will work. Should I only look into "blower"-type cards, or might an open-type cooler work, like on this one (which is also the lowest-cost 2080 Ti)? https://www.evga.com/products/product.aspx?pn=11G-P4-2281-KR

I was able to play with 4 Titan RTXs and did some overclocked overnight renderings. The Nvidia drivers are unstable with a core clock offset over +100; Blender Cycles seems to like exactly +100. After overnight rendering my cards were barely breaking 70 degrees Celsius, and I believe it was due to the placement of my cards: a two-level column arrangement.

I used a Dreambox and custom-built a rig to provide ample space. The Dreambox provides stability and customization that server rigs do not offer. The Titan RTXs have a blower-style cooler as well…

1 Like

Any update on this? I was thinking of building something similar to this 8x Titan RTX server.