Recently a friend of mine bought hardware for a VFIO setup and I’ll be helping him set it up, so I figured I might as well write about it here.
So, first things first: everyone has different hardware, so I absolutely cannot guarantee that my setup will work perfectly on another machine, or that yours even has the same configuration options at all, but things should be very similar overall. I’m also assuming the host runs a Linux kernel and uses libvirtd with qemu-kvm for virtualization. Also, very important: do not just paste example commands from this blog post directly into your terminal. At least read each one and check that it does the right thing for your particular setup first.
Isolating the devices
Anyways, the first thing you wanna do is stop the firmware and the Linux kernel from initializing the graphics card you want to pass through. This isn’t required, and it isn’t possible in single-card passthrough scenarios, but it makes life a lot easier: the card’s firmware doesn’t get tainted, so you don’t have to sideload a clean firmware image into the card. So, go into your firmware’s configuration interface (usually there’s a key to press at boot, but if you use systemd you can also execute
systemctl reboot --firmware-setup as root to enter it) and look for a section to configure PCI Express devices. If it exists, find the slot your passthrough graphics card is in and select
Disable. This will prevent the firmware from initializing that graphics card on boot. If the option doesn’t exist (which apparently is the case for a lot of consumer boards), it doesn’t matter that much, the thing will still work. Once that’s done, the monitor attached to the graphics card should no longer turn on at boot, but the card will still get initialized by the driver in the Linux kernel, which brings us to the next step.
Next you’ll want to prevent the Linux kernel from initializing it. Now, I don’t know how other initrd implementations work, but the idea is to make the modules
vfio_pci, vfio, vfio_iommu_type1, vfio_virqfd load before the graphics drivers so the graphics drivers don’t claim the card first. The vfio-pci module also needs to know which devices to isolate, which can be done by putting a modprobe config file into the initrd (or just having the file in place on systems with no initrd), but I personally pass the IDs on the kernel cmdline, which isn’t great but it works. If you’re using mkinitcpio to generate your initial ramdisk image, more than likely all you need to do is add
vfio_pci vfio vfio_iommu_type1 vfio_virqfd to the MODULES array (make sure they come before the graphics card drivers, if those are listed), regenerate the image, and that should do the trick. (On kernel 6.2 and newer, vfio_virqfd has been folded into vfio, so leave it out there.) If that doesn’t work, you’ll need to find another way to get them loaded before the graphics card drivers. Then you need to figure out your card’s device IDs. For that, just pipe the output of
lspci -nnk into
less and search for your graphics card. You’ll want to isolate both the graphics card itself and its audio device. For each of them, copy the part that looks like a USB device ID (four hex digits, a colon, and four more hex digits, in square brackets) and note it somewhere. The next step depends on your setup, but basically add
vfio-pci.ids=<device IDs> to your kernel cmdline, separating the device IDs with commas, then reboot, and the graphics card should be bound to the vfio-pci module. Oh, and don’t forget to remove any reference to that graphics card from your Xorg config if you have one.
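If you’d rather not copy the IDs by hand, the lookup above can be sketched as a small shell function. This is only a sketch under a few assumptions: vfio_ids is a name I made up, the slot prefix 01:00 is just an example (check where your own card sits), and it expects lspci -nn or lspci -nnk style output on stdin.

```shell
#!/bin/sh
# vfio_ids (hypothetical helper): read `lspci -nn` output on stdin and print
# a vfio-pci.ids= kernel parameter for every function in the given PCI slot.
# $1 is the slot prefix, e.g. "01:00" (an example value, not necessarily yours).
vfio_ids() {
    slot="$1"
    # Keep only the device lines for that slot (01:00.0, 01:00.1, ...).
    grep "^${slot}\." |
        # The last [xxxx:xxxx] on each line is the vendor:device ID.
        sed -n 's/.*\[\([0-9a-f]\{4\}:[0-9a-f]\{4\}\)\].*/\1/p' |
        # Join the IDs with commas and prepend the parameter name.
        paste -sd, - |
        sed 's/^/vfio-pci.ids=/'
}

# Usage (not run here, it needs real hardware):
#   lspci -nn | vfio_ids 01:00
```

The output is a single line you can append to your kernel cmdline. Still eyeball it against the lspci output before rebooting, since anything listed there gets taken away from its normal driver.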
Configuring the virtual machine
Basic stuff and permissions
So that’s done, now on to the actual virtual machine. For that, just do any setup you want as long as it uses a UEFI firmware, but make sure to include at least a mouse and keyboard (no, tablets don’t count) and do not include a virtual graphics card, audio card or spice/VNC in there (unless you’re gonna use it for input). Depending on your method of input, you might want to relay your keyboard and mouse events to the virtual machine. Libvirtd does a bunch of security stuff that basically prevents the emulator process from accessing anything it does not absolutely require to run, and it also runs it as the
nobody user, so by default it can’t write to (or even see) the input devices. You need to add those devices to the cgroup device ACL, then make qemu run as some user that can write to them. So, create a system user with whatever name you like and add it to the
input group (I know, I know, this compromises security and all, but I don’t run too sketchy stuff on my virtual machines anyways and I update frequently; if you know of a better way that doesn’t break anything, feel free to tell me and I’ll add it here). You can do that by running the following command as root:
useradd -c "Virtual Machine User" -d / -M -r -s /usr/sbin/nologin -u 172 -G input virtual-machine
Great, now open your
/etc/libvirt/qemu.conf in some text editor (as root obviously; I personally recommend vim or a mechanism like
sudoedit, and avoid running graphical applications as root) and find the
user and group options, and change them to your virtual machine user’s username (which in my case is just
virtual-machine), then search for
cgroup_device_acl and add your input devices to it (it will most likely be commented out, so uncomment the whole list first; for the devices, just list everything in
/dev/input/by-id/, I’m sure you’ll spot them). This sounds obvious, but remember to use absolute paths; some people forget that, somehow.
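To save some typing, here is a rough sketch that prints the keyboard and mouse event nodes in the quoted, comma-terminated form the cgroup_device_acl list uses. acl_entries is a made-up helper name, and it assumes your devices show up under the usual *-event-kbd and *-event-mouse names, which may not hold for every device.

```shell
#!/bin/sh
# acl_entries (hypothetical helper): print every keyboard and mouse event node
# in a directory as a quoted absolute path with a trailing comma, ready to be
# pasted into the cgroup_device_acl list in /etc/libvirt/qemu.conf.
# $1 is the directory to scan; it defaults to /dev/input/by-id.
acl_entries() {
    dir="${1:-/dev/input/by-id}"
    for dev in "$dir"/*-event-kbd "$dir"/*-event-mouse; do
        # An unmatched glob stays as a literal pattern, so skip non-existent paths.
        if [ -e "$dev" ]; then
            printf '    "%s",\n' "$dev"
        fi
    done
}

# Usage:
#   acl_entries
```

Paste the lines into the existing list rather than replacing it, since the entries already in there are ones qemu needs.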
OK, permission stuff done. Now, libvirt doesn’t appear to support this natively, so you gotta pass some qemu args directly. The arguments should look something like
-object input-linux,id=<name>,evdev=<path to the device>. Of course, fill in the placeholders with your own values. There’s no real requirement for the ID, but each one has to be unique. Also, you need to append
grab_all=on,repeat=on to your keyboard’s entry. grab_all makes the keyboard grab and release all the input devices together, which lets you switch control between host and guest by pressing both control keys at the same time, and repeat passes key repeats through to the guest. If you don’t know how to pass qemu args, you basically need to add the qemu namespace to your domain tag, which should look like this:
<domain xmlns:qemu="http://libvirt.org/schemas/domain/qemu/1.0" type="kvm">
Then you need to add a block containing the args at the end of the domain element, something like this:
<qemu:commandline>
  <qemu:arg value="-object"/>
  <qemu:arg value="input-linux,id=mouse,evdev=/dev/input/by-id/usb-Wacky_Vendor_Name-event-mouse"/>
  <qemu:arg value="-object"/>
  <qemu:arg value="input-linux,id=kbd,evdev=/dev/input/by-id/usb-Wacky_Vendor_Name-event-kbd,grab_all=on,repeat=on"/>
</qemu:commandline>
Alright, that’s all done, now on to the countermeasures. Since Nvidia hates users, if you’re gonna pass through an Nvidia card you need to hide the virtualization, or the drivers won’t load. First of all you’ll want to hide the KVMKVMKVM signature, so add:
<kvm>
  <hidden state="on"/>
</kvm>
to your features, then add:
<vendor_id state="on" value="RandomChars"/>
to your hyperv inside features. This can be any string of up to 12 characters; RandomChars fits in there just fine, but you can type anything. And finally, you need to add:
to your features if you are using the q35 chipset, not sure why, if anyone knows the reason behind this, please tell me.
And also, if you want to look all fancy and make Windows task manager not show
Virtual Machine: Yes or some other crap, add:
<feature policy="disable" name="hypervisor"/>
to your cpu element. This is not required, though, and at the moment it seems like the Nvidia drivers don’t pick it up anyways.
Finally, add the graphics card and its audio device as hostdev entries, boot the thing, install the drivers, and you’re probably good to go.
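Since there’s no example of a hostdev entry here, this is a minimal sketch that turns a PCI address, as printed by lspci -D, into one. hostdev_xml is a hypothetical helper and 0000:01:00.0 is just an example address. Setting managed="yes" is my assumption; it asks libvirt to handle detaching and reattaching the device for you.

```shell
#!/bin/sh
# hostdev_xml (hypothetical helper): print a libvirt <hostdev> element for a
# PCI address given as DDDD:BB:SS.F (domain:bus:slot.function, all hex).
hostdev_xml() {
    # Split the address into its four fields with plain parameter expansion.
    domain=${1%%:*}
    rest=${1#*:}
    bus=${rest%%:*}
    rest=${rest#*:}
    slot=${rest%%.*}
    func=${rest#*.}
    printf '<hostdev mode="subsystem" type="pci" managed="yes">\n'
    printf '  <source>\n'
    printf '    <address domain="0x%s" bus="0x%s" slot="0x%s" function="0x%s"/>\n' \
        "$domain" "$bus" "$slot" "$func"
    printf '  </source>\n'
    printf '</hostdev>\n'
}

# Usage: one entry for the card and one for its audio function, e.g.
#   hostdev_xml 0000:01:00.0
#   hostdev_xml 0000:01:00.1
```

The resulting elements go inside the devices section of the domain XML.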
Other things worthy of note
Well, there are probably still other important things I did not cover. This is my first blog post, and I did this setup a long time ago, so sorry if I missed anything. I’ll probably write another post about this once I’m better at it and know my way around a bit more, but until then, this is gonna have to do.
Here are some things you’ll probably want after this setup:
CPU pinning and memory hugepages
Will probably do a standalone post on these two in the future.
Unable to fully isolate the graphics card
In that case it’s no big deal, you just need to sideload a clean copy of your graphics card’s firmware on every boot. Unfortunately the method of obtaining it is very card-specific and it’s not practical for me to cover it here, but an alternative is to just find your graphics card on this site and download the firmware. As for how to sideload it, just add:
<rom bar='on' file='path to firmware'/>
to your hostdev.
For mirroring the virtual machine framebuffer to your host via IVSHMEM with very low latency, there’s nothing much to write here; the official documentation is pretty clear, so just go take a look.