I have 9 GPUs on Ubuntu 19, on a Z170 board with an i7-6700K (each GPU on a PCIe x1 lane), one enterprise SATA SSD, and 32 GB of RAM. All the GPUs work very well together, using only ~20% of the CPU. The problems start when I fire up my QEMU VMs (26 of them) on the very same system: while the VMs are running, I can't initialize some of the GPUs. The VMs are set to use only 5 threads. There is ~12 GB of free RAM and a 50 GB swap file on the machine (only ~2 GB of it used). The same issue occurs whenever the CPU is just busy with other work (~90% CPU usage constantly), regardless of the VMs. What could be the issue? I'm clueless at this point.
Finally got to the bottom of this. What I found out is that when the VMs are running, Linux for some reason eats up all of the free RAM as buff/cache, so I'm left with only ~200 MB of usable RAM (even Windows with Hyper-V never did something so stupid). Essentially the GPUs can't get the RAM they need to operate, so a few of them fail to initialize, printing an "out of memory" error. I don't know why the kernel doesn't release the cached RAM when the GPUs need it, but the solution is to clear the cache before starting GPU work.
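If you want to verify this on your own box before automating anything, the cache can be dropped manually through the standard kernel drop_caches interface (needs root):

    sync                                         # flush dirty pages to disk first
    echo 3 | sudo tee /proc/sys/vm/drop_caches   # 3 = drop page cache + dentries + inodes

Afterwards, free -h should show buff/cache shrinking and available RAM jumping back up.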
The GPUs (a mix of 1070s, 1070 Tis, and 1080s) are doing deep learning, while the QEMU VMs (all Lubuntu) run fully automated jobs driven by bash scripts I wrote. The VMs are fine with a 500 MB RAM limit each, sharing 5 vCPUs from the host. All of this runs on a Samsung SM863a because of the ~1 TB written to the SSD every day; I had standard consumer SSDs/NVMe drives die in just two months upon reaching their max TBW rating.
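For context, a minimal sketch of the kind of QEMU invocation I mean (the disk image name is a placeholder, and my actual launch scripts pass more options than this):

    # one headless VM, capped at 500 MB RAM and 5 vCPUs shared with the rest
    qemu-system-x86_64 \
        -enable-kvm \
        -m 500 \
        -smp 5 \
        -drive file=lubuntu.qcow2,format=qcow2 \
        -display none \
        -daemonize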
If anyone stumbles upon the very same issue, here's a detailed solution:
Add this line to root's crontab (dropping caches every 15 minutes; root is needed for the write to /proc):
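    # every 15 minutes: flush dirty pages, then drop page cache, dentries, and inodes
    */15 * * * * sync; echo 3 > /proc/sys/vm/drop_caches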