VMs won't start, Connection reset by peer

wwed26 · June 23, 2024, 3:21pm

Hi all,

I’ve been using QEMU/KVM successfully for months, without issue. Zero, none.
I’ve also successfully passed through an NVMe drive during this time, without issue. Zero, none.

The point being that the usual prerequisites for virtualization have been met.

I’ve recently added a W10 VM (pass-through NVMe), which booted successfully with no problems. Once I had determined it was functional, I attempted to add my GTX 1080. This is when the problems began.

Virt-manager freezes and after 2-3 minutes spits out this error. None of my VMs are able to function.

Error starting domain: Cannot recv data: Connection reset by peer

Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 72, in cb_wrapper
    callback(asyncjob, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 108, in tmpcb
    callback(*args, **kwargs)
  File "/usr/share/virt-manager/virtManager/object/libvirtobject.py", line 57, in newfn
    ret = fn(self, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/share/virt-manager/virtManager/object/domain.py", line 1402, in startup
    self._backend.create()
  File "/usr/lib/python3/dist-packages/libvirt.py", line 1379, in create
    raise libvirtError('virDomainCreate() failed')
libvirt.libvirtError: Cannot recv data: Connection reset by peer

After this occurs, I can no longer connect to QEMU/KVM, even after restarting services. I need to perform a full reboot in order to review what could be going wrong.

Each device has its own IOMMU group, so I think that part is free and clear.

00:00.0 Host bridge: Intel Corporation Device 4c53 (rev 01)
00:01.0 PCI bridge: Intel Corporation Device 4c01 (rev 01)
00:01.1 PCI bridge: Intel Corporation Device 4c05 (rev 01)
00:06.0 PCI bridge: Intel Corporation Device 4c09 (rev 01)
00:14.0 USB controller: Intel Corporation Tiger Lake-H USB 3.2 Gen 2x1 xHCI Host Controller (rev 11)
00:14.2 RAM memory: Intel Corporation Tiger Lake-H Shared SRAM (rev 11)
00:16.0 Communication controller: Intel Corporation Tiger Lake-H Management Engine Interface (rev 11)
00:17.0 RAID bus controller: Intel Corporation Device 43d6 (rev 11)
00:1b.0 PCI bridge: Intel Corporation Tiger Lake-H PCIe Root Port #17 (rev 11)
00:1b.4 PCI bridge: Intel Corporation Device 43c4 (rev 11)
00:1c.0 PCI bridge: Intel Corporation Tiger Lake-H PCIe Root Port #1 (rev 11)
00:1c.4 PCI bridge: Intel Corporation Tiger Lake-H PCI Express Root Port #5 (rev 11)
00:1c.6 PCI bridge: Intel Corporation Device 43be (rev 11)
00:1d.0 PCI bridge: Intel Corporation Tiger Lake-H PCI Express Root Port #9 (rev 11)
00:1d.4 PCI bridge: Intel Corporation Device 43b4 (rev 11)
00:1f.0 ISA bridge: Intel Corporation Z590 LPC/eSPI Controller (rev 11)
00:1f.3 Audio device: Intel Corporation Tiger Lake-H HD Audio Controller (rev 11)
00:1f.4 SMBus: Intel Corporation Tiger Lake-H SMBus Controller (rev 11)
00:1f.5 Serial bus controller: Intel Corporation Tiger Lake-H SPI Controller (rev 11)
01:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3080 Ti] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)
02:00.1 Audio device: NVIDIA Corporation GP104 High Definition Audio Controller (rev a1)
03:00.0 Non-Volatile memory controller: Solidigm P41 Plus NVMe SSD (DRAM-less) [Echo Harbor] (rev 03)
04:00.0 Non-Volatile memory controller: Intel Corporation SSD 665p Series [Neptune Harbor Refresh] (rev 03)
05:00.0 Non-Volatile memory controller: Solidigm P41 Plus NVMe SSD (DRAM-less) [Echo Harbor] (rev 03)
07:00.0 Ethernet controller: Aquantia Corp. AQtion AQC107 NBase-T/IEEE 802.3an Ethernet Controller [Atlantic 10G] (rev 02)
08:00.0 Network controller: Intel Corporation Wi-Fi 6E(802.11ax) AX210/AX1675* 2x2 [Typhoon Peak] (rev 1a)
09:00.0 Non-Volatile memory controller: Solidigm P41 Plus NVMe SSD (DRAM-less) [Echo Harbor] (rev 03)
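
Plain lspci output does not show IOMMU groups, so the grouping claim above can be checked separately. The following is a minimal sketch, assuming the kernel exposes groups under /sys/kernel/iommu_groups (the usual location when the IOMMU is enabled). list_iommu_groups is a hypothetical helper name, and its optional argument exists only so the loop can be tried against a scratch directory:

```shell
# Print one "group <n>: <pci-address>" line per device in each IOMMU group.
# $1 (optional) overrides the sysfs root; it is an assumption added for dry runs.
list_iommu_groups() {
  root="${1:-/sys/kernel/iommu_groups}"
  for dev in "$root"/*/devices/*; do
    # Skip the literal pattern when the glob matches nothing.
    [ -e "$dev" ] || continue
    group="${dev#"$root"/}"   # strip the root prefix
    group="${group%%/*}"      # keep only the group number
    echo "group $group: ${dev##*/}"
  done
}
```

Running list_iommu_groups with no argument on the host prints one line per device. For passthrough, the guest GPU and its audio function would ideally sit in a group that contains nothing the host still needs.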

Can someone please offer their advice? Appreciate it in advance.

WWED

wwed26 · June 23, 2024, 3:49pm

To comment further: when I opt not to pass through the 1080, the VM boots up fine, so it seems to be a configuration error on my side.

Shadowbane · June 23, 2024, 7:10pm

Hi, @wwed26. The error you’re encountering, “Cannot recv data: Connection reset by peer,” typically indicates a communication issue between virt-manager and libvirtd, the daemon managing the virtual machines. This can happen for several reasons, especially when dealing with GPU passthrough. Here are some steps to troubleshoot and resolve this issue:

1. Restart libvirtd Service
First, try restarting the libvirtd service to ensure it’s running correctly: sudo systemctl restart libvirtd
2. Check libvirtd Logs
Examine the libvirtd logs for more detailed error messages: sudo journalctl -u libvirtd
3. Ensure Proper Permissions
Ensure that the user running virt-manager has the necessary permissions to interact with libvirtd. Add your user to the libvirt and kvm groups: sudo usermod -aG libvirt $(whoami) and sudo usermod -aG kvm $(whoami)
4. Check for Conflicting Modules
Ensure that no conflicting modules are loaded that might interfere with VFIO. Check the loaded modules with: lsmod | grep -e vfio -e nvidia -e nouveau
If nvidia or nouveau are loaded, ensure they are properly blacklisted.
5. Verify GPU Binding to VFIO
Ensure your GPU is correctly bound to the vfio-pci driver. Verify this by checking the output of: lspci -nnk | grep -A 3 'VGA compatible controller'
You should see vfio-pci listed as the kernel driver in use for your GPU.
6. Update Kernel and Packages
Ensure your system is current with the latest kernel and virtualization packages. Sometimes, updates can resolve underlying compatibility issues: sudo apt update, then sudo apt upgrade, then sudo apt dist-upgrade

If the above suggestions don’t fix your issue, I have other ideas to suggest. I did not want to overwhelm you.
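
As a rough sketch of the driver-binding check in the list above, the lspci output can be condensed to one line per GPU. gpu_drivers is a hypothetical helper name (not part of any package); it only parses text from stdin, so it can be tried on pasted output as well as on a live system:

```shell
# Summarise which kernel driver each VGA controller is bound to.
# Reads `lspci -nnk` output on stdin; prints "<pci-address> <driver>".
gpu_drivers() {
  awk '
    # A device line starts at column 0 with its PCI address.
    /^[0-9a-f]/ {
      if (addr != "") print addr, drv   # flush the previous VGA device
      addr = ""; drv = "(none)"
      if ($0 ~ /VGA compatible controller/) addr = $1
    }
    # Indented detail line naming the currently bound driver.
    /Kernel driver in use:/ { if (addr != "") drv = $NF }
    END { if (addr != "") print addr, drv }
  '
}
```

Typical use would be lspci -nnk | gpu_drivers. A card intended for the guest should report vfio-pci; nvidia or (none) would suggest the binding has not taken effect yet.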

wwed26 · June 23, 2024, 9:38pm

Much thanks @Shadowbane for offering your assistance.

Item 1: Done

Item 2: Error

Jun 23 17:28:14 wwed-Z590-AORUS-MASTER libvirtd[8599]: libvirt version: 10.0.0, package: 10.0.0-2ubuntu8.2 (Ubuntu)
Jun 23 17:28:14 wwed-Z590-AORUS-MASTER libvirtd[8599]: hostname: wwed-Z590-AORUS-MASTER
Jun 23 17:28:14 wwed-Z590-AORUS-MASTER libvirtd[8599]: Client hit max requests limit 5. This may result in keep-alive timeouts. Consider tuning the max_client_requests server parameter
Jun 23 17:28:56 wwed-Z590-AORUS-MASTER libvirtd[8599]: internal error: connection closed due to keepalive timeout
lines 979-1034/1034 (END)

Item 3: Done

Item 4: ???

Result (however, I am sure I am doing this step incorrectly):

wwed@wwed-Z590-AORUS-MASTER:~$ lsmod | grep vfio—e nvidia—e nouveau
grep: nvidia—e: No such file or directory
grep: nouveau: No such file or directory
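
The grep errors above come from the long dashes in the pasted command: grep takes the first word as the pattern and treats the remaining words as file names. Below is a sketch of the intended check, using plain ASCII hyphens and one -e flag per pattern. vfio_related_modules is a hypothetical wrapper name that reads lsmod output on stdin so it can be tried on sample text:

```shell
# Keep only module lines that mention vfio, nvidia or nouveau.
# Each -e introduces one pattern; a line matching any of them is printed.
vfio_related_modules() {
  grep -e vfio -e nvidia -e nouveau
}
```

On the host this would be run as lsmod | vfio_related_modules, which is equivalent to typing lsmod | grep -e vfio -e nvidia -e nouveau directly.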

Step 5:

result:

wwed@wwed-Z590-AORUS-MASTER:~$ lspci -nnk | grep -A 3 'VGA compatible controller'
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3080 Ti] [10de:2208] (rev a1)
	Subsystem: Gigabyte Technology Co., Ltd GA102 [GeForce RTX 3080 Ti] [1458:4087]
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
--
06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP108 [GeForce GT 1030] [10de:1d01] (rev a1)
	Subsystem: Gigabyte Technology Co., Ltd GP108 [GeForce GT 1030] [1458:3767]
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
06:00.1 Audio device [0403]: NVIDIA Corporation GP108 High Definition Audio Controller [10de:0fb8] (rev a1)

Note 1: Regarding the aforementioned blacklist, does running dual NVIDIA cards pose greater challenges?

Note 2: I ref’ed a 1080 in prior post. Just for maneuvering around my case I threw in a single slot 1030. Results are the same.

Step 6: Always.

Note 3: After running into the “Error starting domain: Cannot recv data: Connection reset by peer” error, trying to start or restart libvirtd just hangs in the terminal.

It also yields mysterious behavior and intermittently locks up my computer.

Shadowbane · June 24, 2024, 4:21am

I read your reply. It is late, so I do not have time right now to really look into this, but I had a few questions: which Linux distro are you using, have you installed the Nvidia drivers for your distro, and what are the specs of your system?

To answer your question: yes, using two Nvidia graphics cards does make it more challenging, especially if you are using two of the same card, for example two Nvidia 1080s.

wwed26 · June 24, 2024, 2:52pm

Sorry mate, no malice or “hey give me attention” intent. I didn’t realize, or didn’t understand, whether I had replied directly to you or to the thread.

Specs per below:

--------------------------- 
OS: Ubuntu MATE 24.04 LTS x86_64 
Host: Z590 AORUS MASTER -CF 
Kernel: 6.8.0-35-generic 
Uptime: 13 hours, 4 mins 
Packages: 3320 (dpkg), 24 (flatpak), 
Shell: bash 5.2.21 
Resolution: 3840x2160, 3840x2160 
DE: MATE 1.26.1 
WM: Metacity (Marco) 
Theme: Yaru-purple-dark [GTK2/3] 
Icons: Yaru-purple-dark [GTK2/3] 
Terminal: mate-terminal 
 Terminal Font: Monospace 15 
CPU: 11th Gen Intel i5-11600K (12)
GPU: NVIDIA GeForce GT 1030 
GPU: NVIDIA GeForce RTX 3080 Ti 
Memory: 9530MiB / 64166MiB

wwed@wwed-Z590-AORUS-MASTER:~$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.183.01  Sun May 12 19:39:15 UTC 2024
GCC version:

Apologies again and I appreciate your time,

WWED

EDIT: signature

Shadowbane · June 24, 2024, 5:18pm

No problem; you weren’t replying directly to me. When I read your previous post and wrote my last one, I was on my way to bed. Sorry for the misunderstanding.

I have another question. I see you have four different graphics cards available. Three of them are Nvidia cards, and the last one is the integrated graphics on the CPU, which is made by Intel. Which card is the host going to use, and which card are you trying to pass to the guest?

wwed26 · June 24, 2024, 5:28pm

3 cards

iGPU - Using for Jellyfin transcode
Host: Nvidia 3080 TI
Guest - W10: 1030 or 1080, intend to use 1080. I just dropped in the 1030 for ease of case maneuverability (gets tight in the Corsair 5000D); we can exclude that if it makes a difference. For all intents and purposes, I can take out the 1030 and just consider it was never part of the convo.

EDIT: clarity; grammar

Shadowbane · June 24, 2024, 6:28pm

I think I see one of your issues: You may be using too many graphic cards at once. I advise finishing all your Jellyfin transcodes before you try setting up the passthrough. Then, I would pick the IGPU as your host GPU and one of your Nvidia cards as the guest GPU. The way you are now trying to set up GPU passthrough is making the task more complicated than it needs to be. Remember, I would only have one Nvidia GPU for the guest and the IGPU for the host. You will only be able to run one virtual machine at a time.

wwed26 · June 24, 2024, 8:26pm

Dang, that’s a hope and dream crusher.

I use the 3080 TI for gaming in Mate, the igpu is there for transcodes only when necessary. I haven’t had to transcode when trying to launch these VMs.

It seems having a 3080 TI for the VM would be a waste of resources, and my iGPU can’t support two 4K displays at high refresh rates if I were to use it for the host.

Shadowbane · June 24, 2024, 9:39pm

Don’t give up hope just yet. I have an artificially intelligent hardware bot. I might be able to program the AI to solve your issues. The AI uses some tools that I have limited access to per day, so it will take a few days before I can find solutions to your issues. Did you use any guide to help set graphic card passthrough? If you did, could you please post a link to the guide? It would help me program the AI.

dawe · June 25, 2024, 1:17pm

Looking at your step 5 result, I see both cards have nvidia kernel modules available.

This one’s a long read, but similar setup (2x nvidia GPUs)
See "Isolation of the guest GPU":

Shadowbane · June 25, 2024, 1:32pm

@dawe, thanks for your post. You just saved me an hour programming my hardware bot to find what I suspected was one of @wwed26’s issues. Unfortunately, the link you posted does not work. I had to Google search for the article’s title before the article would load up.

dawe · June 25, 2024, 1:47pm

No worries.
I’m looking at the sources section of that article now, and it links back here, because why wouldn’t it? LOL

@wwed26 should be able to isolate the guest card with vfio via a kernel arg, and may also need to add the appropriate iommu arg for their hardware. After that it probably just works ™️
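
To illustrate the kernel-argument approach mentioned above, here is a minimal sketch that collects the vendor:device IDs needed for a vfio-pci.ids= value. vfio_ids_for_slot is a hypothetical helper name, and it assumes every function of the guest card (video and audio) lives under one slot prefix such as 02:00:

```shell
# Build a comma-separated ID list for every function in one PCI slot.
# Reads `lspci -nn` output on stdin; $1 is the slot prefix, e.g. 02:00
vfio_ids_for_slot() {
  grep "^$1\." |                                   # device lines for this slot only
    grep -o '\[[0-9a-f]\{4\}:[0-9a-f]\{4\}\]' |    # bracketed vendor:device pairs
    tr -d '[]' |                                   # drop the brackets
    paste -sd, -                                   # join the IDs with commas
}
```

On the host, lspci -nn | vfio_ids_for_slot 02:00 would print the IDs for whatever card sits at 02:00. That value goes after vfio-pci.ids= on the kernel command line in /etc/default/grub, next to intel_iommu=on for Intel hardware, followed by sudo update-grub and a reboot. Slot addresses can change when cards are moved, so it is worth re-checking them first. Also note that vfio-pci.ids matches by ID, so two identical cards would both be claimed.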

wwed26 · June 26, 2024, 2:40pm

Sorry y’all, been afk. Preg wife has needed help for a few days.

I will take a look at the link; I haven’t had a chance yet. Separately, I recall something from my unraid days about binding a certain PCI ID.

Back to work for me!

wwed26 · July 2, 2024, 10:27pm

Something bad happened

wwed26 · July 2, 2024, 10:30pm

It seems to have had the inverse effect. I’m trying to figure out how to get my 3080 TI off “manual driver” and start from step one. Can someone help me get back to the NVIDIA driver?

EDIT: Removing the 1080, the intended pass-through card, the 3080 TI picks up the driver. Having issues now with slower boot times and I no longer have a grub splash screen to allow me to pick between OSs

Shadowbane · July 2, 2024, 11:06pm

According to my hardware bot, you can reset your RTX 3080 Ti to using the NVIDIA driver and undo the VFIO bindings.

Here are the steps my bot suggested.
To get your RTX 3080 Ti back to using the NVIDIA driver and undo the VFIO bindings, you first need to unbind the GPU from VFIO. You can do this by executing the following commands: echo "0000:01:00.0" | sudo tee /sys/bus/pci/devices/0000:01:00.0/driver/unbind and echo "0000:01:00.1" | sudo tee /sys/bus/pci/devices/0000:01:00.1/driver/unbind. After unbinding the GPU, you need to bind it back to the NVIDIA driver using these commands: echo "0000:01:00.0" | sudo tee /sys/bus/pci/drivers/nvidia/bind and echo "0000:01:00.1" | sudo tee /sys/bus/pci/drivers/nvidia/bind.

Next, you need to remove the VFIO configuration to prevent the system from binding the GPU to VFIO on boot. Edit the GRUB configuration file by running sudo nano /etc/default/grub and remove or comment out the vfio-pci.ids and disable_vga options from the GRUB_CMDLINE_LINUX_DEFAULT line, leaving it as GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=on iommu=pt". After saving the changes, update GRUB with sudo update-grub.

If you created a VFIO configuration file, remove it by running sudo rm /etc/modprobe.d/vfio.conf. Update the initramfs to reflect these changes by running sudo update-initramfs -u, and then reboot the system with sudo reboot to apply the changes.

After the system restarts, verify that the GPU is bound to the NVIDIA driver by running lspci -nnk -d 10de:*. You should see the NVIDIA driver listed under the kernel driver in use. If the GPU is not using the NVIDIA driver, you might need to reinstall the NVIDIA drivers with sudo apt update followed by sudo apt install nvidia-driver-XXX, replacing XXX with the appropriate driver version for your system.

Additionally, ensure that any manual entries or scripts you added to bind the GPU to VFIO are removed, including any systemd service files. Following these steps should restore your RTX 3080 Ti to using the NVIDIA driver, allowing you to start fresh with your GPU passthrough setup if needed.
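
The unbind and bind steps above can be sketched as a single helper. This is an illustration under stated assumptions rather than a tested recovery procedure: rebind is a hypothetical function name, SYSFS is an override added here purely for dry runs against a scratch directory, and on a real system it must run as root.

```shell
# Rebind one PCI function to a named driver via sysfs (run as root).
# SYSFS is an assumed override for dry runs; it defaults to the real /sys.
rebind() {  # usage: rebind <pci-address> <driver>
  sys="${SYSFS:-/sys}"
  dev="$sys/bus/pci/devices/$1"
  # Release the function from whatever driver currently holds it, if any.
  if [ -e "$dev/driver" ]; then
    echo "$1" > "$dev/driver/unbind"
  fi
  # Ask the target driver to claim the function.
  echo "$1" > "$sys/bus/pci/drivers/$2/bind"
}
```

For the video function this would be rebind 0000:01:00.0 nvidia. The card's audio function (0000:01:00.1) is normally handled by snd_hda_intel rather than nvidia, so that is the driver name to try for it. A write to unbind may block while the device is still in use, for example by a running display server.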

wwed26 · July 3, 2024, 12:03am

Our messages crossed paths. I’ve completed about 90% of what you shared; with the 1080 removed, all good.

Installing the 1080 rolled the 3080 TI back to the manual driver.

The only item I’ve received resistance on is

echo "0000:01:00.0" | sudo tee /sys/bus/pci/devices/0000:01:00.0/driver/unbind and echo "0000:01:00.1" | sudo tee /sys/bus/pci/devices/0000:01:00.1/driver/unbind. After unbinding the GPU : echo "0000:01:00.0" | sudo tee /sys/bus/pci/drivers/nvidia/bind and echo "0000:01:00.1" | sudo tee /sys/bus/pci/drivers/nvidia/bind

The commands just hang and never seem to complete.

That element seems to be the show stopper

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3080 Ti] [10de:2208] (rev a1)
	Subsystem: Gigabyte Technology Co., Ltd GA102 [GeForce RTX 3080 Ti] [1458:4087]
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)
--
02:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
	Subsystem: Gigabyte Technology Co., Ltd GP104 [GeForce GTX 1080] [1458:3702]
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

Note: this behavior only occurs with both GPUs installed now; it didn’t when we started.

wwed26 · July 3, 2024, 1:04am

This is so weird. I updated grub.cfg to isolate the 1080 (intended guest GPU) instead of using /etc/modprobe.d/vfio.conf, and now the 3080 TI is using the nvidia driver. Am I doing everything backwards?

GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt vfio-pci.ids=10de:1b80,10de:10f0"

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3080 Ti] [10de:2208] (rev a1)
	Subsystem: Gigabyte Technology Co., Ltd GA102 [GeForce RTX 3080 Ti] [1458:4087]
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
--
02:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
	Subsystem: Gigabyte Technology Co., Ltd GP104 [GeForce GTX 1080] [1458:3702]
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

P.S. Please don’t tell me to go back to Windows for being a noob.
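
One way to narrow down the result above is to check whether the vfio-pci.ids argument actually reached the running kernel, since edits to the GRUB defaults only take effect after sudo update-grub and a reboot. The following is a minimal sketch; vfio_ids_on_cmdline is a hypothetical helper name that reads a kernel command line on stdin:

```shell
# Print the value of vfio-pci.ids= from a kernel command line, if present.
# Reads the command line on stdin; prints nothing when the argument is absent.
vfio_ids_on_cmdline() {
  tr ' ' '\n' |                        # one argument per line
    sed -n 's/^vfio-pci\.ids=//p'      # keep only the IDs after the key
}
```

On the host this would be run as vfio_ids_on_cmdline < /proc/cmdline. If it prints nothing, the argument never reached the kernel. If it prints the IDs and lspci still shows nvidia in use on the 1080, one possible cause (an assumption, not confirmed in this thread) is that the nvidia module claims the card before vfio-pci loads. A line such as softdep nvidia pre: vfio-pci in /etc/modprobe.d/vfio.conf, followed by sudo update-initramfs -u and a reboot, is a commonly used remedy for that ordering problem.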