9 GPUs on Ubuntu

IDJVideos · December 22, 2019, 10:14am

Hi,

I have 9 GPUs on Ubuntu 19, on a Z170 board with an i7-6700K (each GPU on an x1 PCIe link), one enterprise SATA SSD and 32 GB of RAM. All the GPUs work very well together, using only 20% of the CPU. The problems start when I fire up my QEMU VMs (26 of them) on the very same system. When the VMs are running, I can’t initialize some of the GPUs. The VMs are set to use only 5 threads. There is ~12 GB of free RAM and a 50 GB swap file on the machine (only ~2 GB used). The same issue occurs when just using the CPU for random stuff (~90% CPU usage constantly), regardless of the VMs. What could be the issue? I’m clueless at this point.

Thanks.

IDJVideos · December 22, 2019, 10:01pm

Finally got around this. What I found out is that when the VMs are running, Linux for some reason eats up all of the free RAM as buff/cache, so I’m left with only 200 MB of usable RAM (even Windows with Hyper-V never did something so stupid). Essentially, the GPUs can’t get the RAM they need to operate, so a few of them fail to initialize, printing an “out of memory” error. I don’t know why the kernel doesn’t release RAM when the GPUs need it, but the solution is to clear the cached RAM before starting GPU work.
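
For anyone wanting to check whether the same thing is happening on their machine, the numbers behind buff/cache can be read from /proc/meminfo. The following is only a minimal sketch: the helper names `meminfo_kb` and `report_memory` are made up for illustration and are not part of the original post.

```shell
#!/bin/sh
# Print one field from /proc/meminfo in kB.
# $1 = field name (e.g. MemFree), $2 = optional path, defaults to /proc/meminfo.
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# Show how much RAM is truly free versus held as buffers and page cache.
# $1 = optional meminfo path (only useful for testing).
report_memory() {
    for field in MemFree Buffers Cached MemAvailable; do
        printf '%-13s %s kB\n' "$field:" "$(meminfo_kb "$field" "$1")"
    done
}

report_memory
```

A small MemFree next to a large Cached value would match the situation described above. MemAvailable is the kernel's own estimate of how much memory new workloads could get without swapping.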

Tomislav · December 22, 2019, 10:38pm

I can’t help but be curious as to the task this machine has.

AnotherDev · December 23, 2019, 2:01am

Glad you figured it out.

I’m also curious as to what OP is doing with this. Sounds pretty badass.

IDJVideos · December 23, 2019, 9:29am

The GPUs (a mix of 1070s, 1070 Tis and 1080s) are used for deep learning, while the QEMU VMs (all Lubuntu) are doing fully automated stuff, driven by bash scripts I wrote. The VMs are fine with a 500 MB RAM limit, sharing 5 vCPUs from the host. All this runs on an SM863a, because ~1 TB gets written to the SSD every day. I had standard SSDs/NVMe drives die in just 2 months upon reaching their max TBW rating.

If anyone stumbles upon the very same issue, here’s a detailed solution:

Add this line to /etc/crontab (the system crontab, which takes the user field used here) to drop the cache every 15 min:

*/15 * * * * root sync && echo 3 > /proc/sys/vm/drop_caches

or execute this when needed:

sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'

Now there’s only 3 GB of RAM cached at any time.
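
As an alternative to dropping the cache on a fixed schedule, the same command could be run only when free RAM is actually low, right before starting GPU work. This is a rough sketch under assumptions: the 2 GiB threshold and the function names `free_kb`, `below_threshold` and `drop_caches_if_low` are hypothetical, not something from the setup described above.

```shell
#!/bin/sh
# Hypothetical threshold: drop caches when less than 2 GiB is completely free.
THRESHOLD_KB=${THRESHOLD_KB:-2097152}

# Print MemFree in kB. $1 = optional path, defaults to /proc/meminfo.
free_kb() {
    awk '$1 == "MemFree:" { print $2 }' "${1:-/proc/meminfo}"
}

# Succeed when $1 (current kB) is less than $2 (threshold kB).
below_threshold() {
    [ "$1" -lt "$2" ]
}

# Flush dirty pages, then drop page cache, dentries and inodes (needs root).
drop_caches_if_low() {
    current=$(free_kb)
    if below_threshold "$current" "$THRESHOLD_KB"; then
        sync
        echo 3 > /proc/sys/vm/drop_caches
    fi
}

# Usage (as root), before launching the GPU job:
#   drop_caches_if_low
```

The check uses MemFree rather than MemAvailable on purpose, since MemAvailable already counts reclaimable cache as available and would rarely fall below the threshold.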

Token · December 23, 2019, 8:20pm

Yeah, really curious about this use case. OP, hopefully you will make some posts about this here?
