KVM, GPU passthrough, Development!

SirTeddy · November 15, 2017, 8:35am

Hey! Well done! Finally Ryzen can be used to run KVM.
I am a Linux engineer, and I am very interested in KVM on gaming PCs.

I have done some work on it, and I have also run into a problem. I will be glad to share my work if anyone needs it.

Many motherboards provide only 1 USB controller, so it cannot be used by 2 VMs (or 2 users) at the same time.
So I made a small change:
I wrote a small program to distribute individual USB devices between VMs EASILY.

Then I made some other changes to make the whole system more suitable for this situation (KVM on gaming PCs):
it can sleep when all VMs are shut down
it can run for a long time without a reboot
an app can control VM start and shutdown
etc.

It is a tiny Linux, maybe 255MB.
It has libvirt, QEMU and others.
It can boot via PXE or HDD.

So the PROBLEM is:
on Intel CPUs like the 6700K/7700K, with 4 cores and 8 threads,
there is a difference between 2 VMs using 4 threads each and 1 VM using 4 threads alone.

VM 1's CPU ID list: 0,1,4,5
VM 2's CPU ID list: 2,3,6,7

With this pinning the benchmark results are very close, and each VM has 2 whole cores (with hyperthreading) rather than four half cores.

But 1 VM running alone gets more performance in games: the FPS is higher than with 2 VMs running, and closer to bare metal.

Why can't it be balanced? Both Ryzen and Intel CPUs have this problem. Why?

TheIdiotYouYellat · November 15, 2017, 1:44pm

This may be a stupid question, but have you configured cpu pinning?

Also it might be helpful to see the config you are using.

SirTeddy · November 16, 2017, 1:48am

The i7 6700K/7700K has 8 threads:

0,1,2,3,4,5,6,7

0,4 is the first core

1,5 is the second core

2,6 is the third core

3,7 is the last core.

There are 2 VMs:

VM1 cpu pinning: 0,4,1,5

VM2 cpu pinning: 2,6,3,7

VM1 has a GTX 1060

VM2 also has a GTX 1060
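
The lists above keep both hardware threads of a core inside the same VM. A minimal Python sketch of that rule follows. `pin_whole_cores` is a hypothetical helper, not part of libvirt or QEMU, and the sibling pairs are the ones stated above for the i7 6700K/7700K.

```python
def pin_whole_cores(cores, n_vms):
    """Split a list of cores (tuples of sibling CPU IDs) evenly between VMs.

    Each VM receives whole cores only, so no physical core is shared.
    """
    if len(cores) % n_vms != 0:
        raise ValueError("cores cannot be divided evenly between the VMs")
    per_vm = len(cores) // n_vms
    return [
        [cpu for core in cores[i * per_vm:(i + 1) * per_vm] for cpu in core]
        for i in range(n_vms)
    ]


# Sibling pairs as listed above for the i7 6700K/7700K.
intel_cores = [(0, 4), (1, 5), (2, 6), (3, 7)]
print(pin_whole_cores(intel_cores, 2))  # [[0, 4, 1, 5], [2, 6, 3, 7]]
```

The result matches the VM1 and VM2 pinning lists given above. On a CPU with a different sibling layout, only the `cores` input would change.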

SirTeddy · November 16, 2017, 1:51am

Then I invited my friends to test it (about 2 months ago; the CPU was an i7 6700):
League of Legends:
only VM1, VM2 shut down: 130~140 fps
only VM2, VM1 shut down: 130~140 fps

VM1 & VM2: both 90~100 fps

TheIdiotYouYellat · November 16, 2017, 3:49am

Odd. I have an i7-4790S (4 cores, 8 threads) split between two VMs with no noticeable slowdown.

Mine are pinned:
First VM - 2,3,6,7
Second VM - 0,1,4,5
Giving each VM 2 cores and their 2 hyperthreads

SirTeddy · November 21, 2017, 2:28am

Thank you for the reply.
Mine are pinned too, statically.
Have you tested it seriously?
If you do not watch the FPS in games, you may not notice the difference.

Maybe you can test League of Legends:
you control VM1
your friend controls VM2
Compare both of you playing at the same time with you playing alone while the other VM is shut down.
Press Ctrl+F to enable the FPS display and observe how it changes.

TheIdiotYouYellat · November 21, 2017, 12:14pm

I’m sorry, I don’t play League of Legends, but my wife and I play Warframe most nights on the machine, with me Steam streaming to a laptop from my VM. With vsync on we both average 60 fps; turning it off does odd things with streaming but gives a 130 fps average on both. I have two 1050 Tis, for reference.

SirTeddy · November 22, 2017, 2:58am

Did you try playing Warframe alone, with the other VM shut down?
How was the FPS then?

My girlfriend and I play games most nights too, but we don't use streaming.

TheIdiotYouYellat · November 22, 2017, 6:23am

With one VM off, I usually am getting the locked 60 with vsync on. I haven’t had a chance to try without vsync and won’t until after Thanksgiving. I’ll report back next week.

I’ll also go through the Heaven Benchmark or the Epic Citadel so we have something that can be compared.

SirTeddy · November 22, 2017, 6:57am

Thank you for that.

SgtAwesomesauce · November 22, 2017, 7:45am

Are you using hugepages? If not, try enabling them and using that memory backend. Huge pages can significantly improve performance.

SirTeddy · November 22, 2017, 8:19am

I tested hugepages (both 2MB and 1GB)
and got the same result
compared to transparent hugepages.

SirTeddy · November 22, 2017, 8:24am

Or maybe my way of using hugepages was not correct; if so, please point me to a tutorial.

BUT I researched hugepages a lot and tried 3 or 4 ways to enable them, so I think I did it correctly.

SgtAwesomesauce · November 22, 2017, 6:06pm

I don’t currently have a hugepages tutorial available. I’m going to get to that eventually.

Regardless of your distro, the archwiki is very helpful for most topics.

https://wiki.archlinux.org/index.php/KVM#Enabling_huge_pages
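
When reserving a static hugepage pool, the number of pages follows from the guest memory size and the page size. Below is a small sketch of that arithmetic. The 8 GiB guest size is an assumed example, not a value from this thread, and `hugepages_needed` is a hypothetical helper.

```python
def hugepages_needed(guest_mem_mib, page_size_mib):
    """Return how many hugepages cover the guest memory (rounded up)."""
    return -(-guest_mem_mib // page_size_mib)  # ceiling division on integers


guest_mem_mib = 8 * 1024  # assumed 8 GiB guest, for illustration only
print(hugepages_needed(guest_mem_mib, 2))     # 2 MiB pages
print(hugepages_needed(guest_mem_mib, 1024))  # 1 GiB pages
```

For the assumed 8 GiB guest this gives 4096 pages of 2 MiB or 8 pages of 1 GiB. With two guests, the pool would have to cover the sum of both.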

SirTeddy · November 23, 2017, 2:30am

Emmm…
Looks like I had set up hugepages correctly.

So my test result stands: hugepages gave no performance boost compared to transparent hugepages.

I tested it again just now.
It was even worse than transparent hugepages: 56 fps (Fraps average) vs. above 60, and bare metal is above 70 (the game was PUBG).

gnif · November 23, 2017, 11:32am

@SirTeddy, why are you asking this again… you have already explained your situation and I spent considerable time explaining to you why you are suffering performance issues.

Your CPU has a memory controller built in, it controls… you guessed it, the RAM. You have ONE of these, not 8. The CPUs all share the controller, and have to wait on it when there are memory operations in progress and it is busy.

Again, as stated in the other thread:

VM 1 issues a memory read which is not in the L1/2/3 cache.
VM 2 issues a memory read also that is not in the L1/2/3 cache.
The memory controller goes to work servicing VM 1, VM 2 has to wait.
VM 1 gets its request serviced, but then asks to read more.
VM 2’s request gets serviced, now VM 1 is waiting.

They are sharing the memory controller and the bus to the RAM, as well as the on-die unified cache (L3), which I am sure gets thrashed under these circumstances. These operations can come in any order at any time and as such are unpredictable.

TL;DR: You have two VMs, sharing RAM, running RAM-intensive operations. They are fighting for bandwidth to the RAM… it is that simple. No matter what you do, you will NOT get a perfect 50/50 split on the system.
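
The alternating wait described above can be reproduced with a toy queueing model. This is only a sketch under strong assumptions: one controller that services a single request at a time in first-come order, and VMs that strictly alternate between a fixed compute time and one memory request. Real memory controllers are far more complex, and `finish_times` and its parameters are hypothetical names.

```python
def finish_times(n_vms, requests, compute, mem):
    """Simulate VMs sharing one memory controller; return each VM's finish time.

    Every VM computes for `compute` time units, then issues a memory request
    that occupies the controller for `mem` units, `requests` times in a row.
    """
    controller_free = 0                # time at which the controller is idle again
    ready = [compute] * n_vms          # time each VM issues its next request
    remaining = [requests] * n_vms
    finish = [0] * n_vms
    while any(remaining):
        # Serve the pending request that was issued first.
        vm = min((i for i in range(n_vms) if remaining[i]), key=lambda i: ready[i])
        start = max(ready[vm], controller_free)   # wait if the controller is busy
        controller_free = start + mem
        remaining[vm] -= 1
        finish[vm] = controller_free
        ready[vm] = controller_free + compute
    return finish


print(finish_times(1, 4, 1, 2))  # one VM alone
print(finish_times(2, 4, 1, 2))  # two VMs contending
```

With these made-up numbers a single VM finishes at time 12, while two VMs finish at 15 and 17: neither gets the "alone" result, and the split is not even either.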

Also it seems you have your pinning wrong for Ryzen (Edit: Just realized you’re talking about Intel in this thread, but will leave this information here for others).

Ryzen’s hyper-threaded cores are not interleaved like Intel’s; they are numbered as follows:

0,1 = Core 0
2,3 = Core 1
4,5 = Core 2
6,7 = Core 3
8,9 = Core 4
10,11 = Core 5
12,13 = Core 6
14,15 = Core 7

So you want

VM1 to use 0,1,2,3
VM2 to use 4,5,6,7

You can verify this by running:

for F in /sys/devices/system/cpu/cpu[0-9]*; do echo -n "$F = "; cat "$F/topology/thread_siblings_list"; done

On my system (Ryzen 1700X) this outputs:

/sys/devices/system/cpu/cpu0 = 0-1
/sys/devices/system/cpu/cpu1 = 0-1
/sys/devices/system/cpu/cpu10 = 10-11
/sys/devices/system/cpu/cpu11 = 10-11
/sys/devices/system/cpu/cpu12 = 12-13
/sys/devices/system/cpu/cpu13 = 12-13
/sys/devices/system/cpu/cpu14 = 14-15
/sys/devices/system/cpu/cpu15 = 14-15
/sys/devices/system/cpu/cpu2 = 2-3
/sys/devices/system/cpu/cpu3 = 2-3
/sys/devices/system/cpu/cpu4 = 4-5
/sys/devices/system/cpu/cpu5 = 4-5
/sys/devices/system/cpu/cpu6 = 6-7
/sys/devices/system/cpu/cpu7 = 6-7
/sys/devices/system/cpu/cpu8 = 8-9
/sys/devices/system/cpu/cpu9 = 8-9
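
The same sysfs files can be turned into a list of physical cores programmatically. The sketch below assumes the sibling strings use the comma and range notation seen in this thread ("0-1" here, "0,4" on the Intel layout described earlier). The helper names are hypothetical.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a CPU list such as "0-1" or "0,4" into [0, 1] or [0, 4]."""
    cpus = []
    for part in text.strip().split(","):
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def physical_cores(siblings_by_cpu):
    """Collapse per-CPU sibling strings into one sorted tuple per physical core."""
    return sorted({tuple(parse_cpu_list(s)) for s in siblings_by_cpu.values()})


def read_siblings(base="/sys/devices/system/cpu"):
    """Read thread_siblings_list for every cpuN directory (Linux only)."""
    result = {}
    for cpu_dir in Path(base).glob("cpu[0-9]*"):
        path = cpu_dir / "topology" / "thread_siblings_list"
        if path.is_file():
            result[cpu_dir.name] = path.read_text()
    return result


# Sample taken from the Ryzen 1700X output above (first four CPUs only).
sample = {"cpu0": "0-1", "cpu1": "0-1", "cpu2": "2-3", "cpu3": "2-3"}
print(physical_cores(sample))  # [(0, 1), (2, 3)]
```

On a live Linux host, `physical_cores(read_siblings())` would give the full core list, which can then be divided between the VMs so that no core is shared.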

SirTeddy · November 24, 2017, 9:24am

I didn't ask again; this post is older than that one.
@SgtAwesomesauce suggested hugepages, so I tested them yesterday and found that they give no performance boost in this situation.

But compare the i3 7100 vs the i7 7700:
L1: 2x32KB vs 4x32KB
L2: 2x256KB vs 4x256KB
L3: 3MB vs 8MB

So I don't think the L1/2/3 cache caused the issue.
But I think RAM bandwidth may have caused it, and I am trying to prove it.
Hey, don't be angry, man, take it easy.
I am very interested in your new project now.

gnif · November 24, 2017, 9:44am

I am sorry, I didn’t see the timeline.

I did not say the cache causes the issue, I said that the cache likely gets thrashed. The entire point of cache is to keep the information close to the CPU, it is much much faster than RAM, but is very small and expensive to produce.

On a normal desktop computer, the CPU is pretty linear in what it is doing, it will task switch, etc… but the kernel does what it can to try to ensure that the task switching is minimal and threads don’t jump between cores.

If a program wants to access some RAM that is not already in the L1 cache, it will then look for it in the L2 cache… if it isn’t there it looks in L3, and if not found still, it will then go to the much slower RAM.

This round trip is slow, but it doesn’t get slower when fetching more data than was actually requested, so since it’s doing a RAM retrieval anyway, instead of fetching say 4 bytes, it might fetch 64 bytes and store them in the L3 cache (note: the method and amount are CPU-architecture dependent).

If the L3 cache is full, it will choose something and evict it (forget it). If the cached memory has been changed and not yet stored in the slower RAM, it has to go and do that first, which is also slow.

Now the kernel tries very hard to ensure this happens as infrequently as possible by tracking everything it can to avoid the performance hit this incurs, BUT, you’re in a VM, which is unaware there are more VMs, and unaware that there is a host OS also. Suddenly you have three kernels trying to manage what is in the CPU cache, each with their own ideas about what is good and bad.

Thus… the cache gets thrashed.
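
The eviction behaviour described above can be illustrated with a toy cache. This sketch assumes a fully associative cache with least-recently-used replacement, which is simpler than any real L3, and the line addresses and sizes are made up. It only shows the capacity effect of two working sets sharing one cache.

```python
from collections import OrderedDict


def count_hits(accesses, capacity):
    """Replay a list of line addresses through an LRU cache and count hits."""
    cache = OrderedDict()
    hits = 0
    for addr in accesses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used line
            cache[addr] = True
    return hits


# Each VM loops over 6 cache lines, 10 times. The cache holds 8 lines.
vm_a = [line for _ in range(10) for line in range(6)]
vm_b = [100 + line for line in vm_a]
both = [addr for pair in zip(vm_a, vm_b) for addr in pair]  # interleaved

print(count_hits(vm_a, 8), "hits out of", len(vm_a))
print(count_hits(both, 8), "hits out of", len(both))
```

Alone, the VM misses only on its first pass (54 hits out of 60). Interleaved, the combined working set of 12 lines no longer fits in 8, and under this policy every access misses.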

You have… when there is no memory contention you get great FPS, when there is you get a marginal performance loss… I say marginal because it literally is. If you can’t see that, you are asking way too much of KVM.

It is simply impossible to get a 50/50 split in performance without every single resource being duplicated, the only way I could see you improving this would be with the following (expensive) setup:

Two CPUs (not more cores)
Two sets of RAM, for each CPU
Two GPUs
Two HDD controllers
Pin each VM to a physical CPU
Configure KVM for NUMA

You might as well just build two computers.

I am not angry, just confused as to why you are having trouble grasping that your one PC is not two PCs and as such will never perform as well as two PCs.

SirTeddy · November 27, 2017, 3:06am

I had a pretty good reason:
I had a PC before I met my girlfriend.
i7 6700K + GTX 1080 + B150M-PLUS + 16GB DRAM. (DRAM was very cheap at that time.)
But I never really reached full usage when I used it alone.
So I just added 1 GTX 1060.
Then we had 2 PCs.
Then we could play together.
It almost meets our needs now.

Maybe I will need to buy a new gaming PC to get 3 PCs after I have a family in the future.

TheIdiotYouYellat · November 29, 2017, 1:42pm

Sorry for the wait but I finally got a chance to run the benchmarks.

Cinebench R15 Scores:
1st VM only - 395
2nd VM only - 400
Both at the same time - 370, 375

Unigine Heaven 4 Scores
DX11, High quality, x2 anti-aliasing, 1920x1080:
1st VM only - 1653 average 65.7FPS
2nd VM only - 1653 average 65.7FPS
Both VMs at the same time - 1653 average 65.3FPS
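
To compare these results with the League of Legends numbers reported earlier in the thread, the relative slowdown can be computed as below. The benchmark values are taken from the posts. Using the midpoints of the 130~140 and 90~100 fps ranges is an assumption made for this sketch.

```python
def slowdown_percent(alone, shared):
    """Relative performance loss, in percent, when both VMs run together."""
    return (alone - shared) / alone * 100


results = {
    "Cinebench R15, 1st VM": (395, 370),
    "Cinebench R15, 2nd VM": (400, 375),
    "Heaven 4 average FPS": (65.7, 65.3),
    "League of Legends FPS (range midpoints)": (135, 95),
}

for name, (alone, shared) in results.items():
    print(f"{name}: {slowdown_percent(alone, shared):.1f}% slower")
```

This works out to roughly 6% for Cinebench, under 1% for Heaven, and close to 30% for the League of Legends report.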

For reference, here are the specs of my PC:
Asus Z97-A
i7-4790S
2 x 8GB G.Skill DDR3 2400
2 x EVGA 1050TI SSC