[VFIO] add basic implementation#5870
Draft
ShadowCurse wants to merge 22 commits into
Codecov Report❌ Patch coverage is Please upload reports for the commit 50e789e to get more accurate results.
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5870      +/-   ##
==========================================
- Coverage   82.84%   81.18%   -1.66%
==========================================
  Files         277      280       +3
  Lines       29912    30814     +902
==========================================
+ Hits        24781    25017     +236
- Misses       5131     5797     +666
ab237e9 to 8949587
Add the VfioConfig and VfioConfigs types for describing VFIO device configuration. Wire them into VmResources and VmmConfig so that VFIO devices can be specified before boot. Actual device setup will be added in later commits. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
Add PUT /vfio/{id} API endpoint for configuring VFIO passthrough
devices.
Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
Future VFIO commits will use ArrayVec for BAR mappings and MSI-X hole tracking, so make it required. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
VFIO devices can expose both 32-bit and 64-bit BARs. The existing Bars type only handled 64-bit BARs. Add set_bar_32() for 32-bit BARs and get_bar_addr() that works with both widths. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
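To illustrate why both widths need handling, here is a minimal sketch of decoding a memory BAR base address from raw config-space registers, following the standard PCI BAR layout. The `BarAddr` enum and `decode_mem_bar` function are hypothetical names used only for this example; they are not the `Bars` API touched by this commit.

```rust
/// Decoded memory BAR base address (hypothetical type, for illustration only).
#[derive(Debug, PartialEq, Eq)]
pub enum BarAddr {
    Mem32(u32),
    Mem64(u64),
}

/// Decode the memory BAR that starts at register `idx` of the six BAR dwords.
/// Returns `None` for I/O BARs, reserved type encodings, or a 64-bit BAR
/// that has no following register to hold its upper half.
pub fn decode_mem_bar(regs: &[u32; 6], idx: usize) -> Option<BarAddr> {
    let lo = *regs.get(idx)?;
    // Bit 0 set marks an I/O space BAR, which this sketch does not handle.
    if lo & 0x1 != 0 {
        return None;
    }
    // Bits 2:1 encode the memory type: 0b00 is 32-bit, 0b10 is 64-bit.
    match (lo >> 1) & 0x3 {
        0b00 => Some(BarAddr::Mem32(lo & !0xF)),
        0b10 => {
            // A 64-bit BAR consumes the next register as its upper 32 bits.
            let hi = *regs.get(idx + 1)?;
            Some(BarAddr::Mem64((u64::from(hi) << 32) | u64::from(lo & !0xF)))
        }
        _ => None,
    }
}
```

The low four bits are flag bits (space indicator, type, prefetchable), so they are masked off before the address is returned.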
VFIO passthrough needs to emulate the MSI-X table and PBA regions within device BARs. Add accessor methods to MsixCap for extracting table/PBA BIR, offset, size, and the enabled/masked status bits. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
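As a rough sketch of what such accessors compute, the example below extracts the same fields from raw MSI-X capability registers using the bit layout defined by the PCI specification. The `MsixCapRegs` struct, its field names, and the method names are assumptions made for this example and may differ from the actual `MsixCap` type.

```rust
/// Raw MSI-X capability registers (hypothetical struct, for illustration only).
#[derive(Debug, Clone, Copy)]
pub struct MsixCapRegs {
    /// Message Control register.
    pub msg_ctl: u16,
    /// Table Offset / Table BIR register.
    pub table: u32,
    /// PBA Offset / PBA BIR register.
    pub pba: u32,
}

impl MsixCapRegs {
    /// BAR index holding the MSI-X table (bits 2:0).
    pub fn table_bir(&self) -> u8 {
        (self.table & 0x7) as u8
    }
    /// Byte offset of the table inside that BAR (bits 31:3, 8-byte aligned).
    pub fn table_offset(&self) -> u32 {
        self.table & !0x7
    }
    /// BAR index holding the Pending Bit Array (bits 2:0).
    pub fn pba_bir(&self) -> u8 {
        (self.pba & 0x7) as u8
    }
    /// Byte offset of the PBA inside that BAR (bits 31:3).
    pub fn pba_offset(&self) -> u32 {
        self.pba & !0x7
    }
    /// Number of table entries; hardware stores N - 1 in bits 10:0.
    pub fn table_size(&self) -> u16 {
        (self.msg_ctl & 0x7FF) + 1
    }
    /// MSI-X Enable, bit 15 of Message Control.
    pub fn enabled(&self) -> bool {
        self.msg_ctl & (1 << 15) != 0
    }
    /// Function Mask, bit 14 of Message Control.
    pub fn masked(&self) -> bool {
        self.msg_ctl & (1 << 14) != 0
    }
}
```

Each table entry is 16 bytes, so the emulated table region spans `table_size() * 16` bytes starting at `table_offset()` in the BAR selected by `table_bir()`.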
VFIO BAR regions containing MSI-X table/PBA must be split into mmappable and emulated parts. KVM memory slots require host-page alignment, but MSI-X structures can sit at arbitrary offsets within a BAR. Add align_up_host_page, align_down_host_page, and offset_from_lower_host_page helpers to expand emulated regions to page boundaries. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
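The three helpers named above could look roughly like the following sketch. The function names come from the commit message, but the signatures are assumptions: here the page size is passed explicitly to keep the example self-contained, whereas the real helpers presumably query the host page size themselves.

```rust
/// Round `addr` down to the nearest multiple of `page_size`.
/// `page_size` must be a power of two.
pub fn align_down_host_page(addr: u64, page_size: u64) -> u64 {
    debug_assert!(page_size.is_power_of_two());
    addr & !(page_size - 1)
}

/// Round `addr` up to the nearest multiple of `page_size`.
/// Returns `None` if rounding up would overflow `u64`.
pub fn align_up_host_page(addr: u64, page_size: u64) -> Option<u64> {
    debug_assert!(page_size.is_power_of_two());
    addr.checked_add(page_size - 1)
        .map(|a| a & !(page_size - 1))
}

/// Distance in bytes from `addr` to the page boundary at or below it.
pub fn offset_from_lower_host_page(addr: u64, page_size: u64) -> u64 {
    debug_assert!(page_size.is_power_of_two());
    addr & (page_size - 1)
}
```

For example, with 4 KiB pages an MSI-X table at BAR offset `0x2010` lies `0x10` bytes past the page boundary at `0x2000`, and the emulated region around it would be widened to end on the next page boundary.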
Add the rust-vmm vfio-bindings (0.6.2) and vfio-ioctls (0.6.0) crates that provide wrappers around VFIO kernel interfaces. These are needed by the VFIO code. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
Add the core VFIO passthrough implementation. This allows physical PCI devices bound to vfio-pci on the host to be presented to the guest with minimal overhead.

The implementation covers:
- PCI config space: most reads/writes are proxied to the physical device. BARs, MSI-X capability, and select extended capabilities are emulated or masked by Firecracker.
- BAR regions: device MMIO regions are mmap'd from the VFIO device fd and mapped into guest address space as KVM memory slots. BARs containing MSI-X table/PBA are split around the emulated regions using either sparse-mmap caps or manual hole calculation.
- MSI-X interrupts: the table and PBA are emulated in Firecracker. Physical device interrupts are delivered via eventfds wired through KVM irqfd.
- DMA: guest RAM regions are mapped into the VFIO container's IOMMU so the device can DMA directly to guest memory.

Only MSI-X interrupts are supported. IO BARs, ROM BARs, legacy INTx, and MSI (non-X) are not handled. Hot-plug/unplug of VFIO devices is also not supported at this point, so there is no cleanup for created VFIO devices. The only part concerned with cleanup is the device setup code, which ensures that all resources are released if an error occurs during device setup.

Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
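The "manual hole calculation" for BARs that contain an MSI-X structure can be pictured with the sketch below. It is a simplified illustration, not the code from this commit: the `Range` type and `split_bar_around_hole` function are hypothetical, the page size is passed in explicitly, and only a single emulated region is handled (a real BAR may hold both the table and the PBA).

```rust
/// Half-open byte range `[start, end)` inside a BAR (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Range {
    pub start: u64,
    pub end: u64,
}

/// Split a BAR of `bar_size` bytes around the emulated region
/// `[emu_start, emu_end)`. The emulated region is widened to `page_size`
/// boundaries because only whole host pages can be mmap'd. Returns the
/// remaining ranges that can be mapped directly into the guest.
pub fn split_bar_around_hole(
    bar_size: u64,
    emu_start: u64,
    emu_end: u64,
    page_size: u64,
) -> Vec<Range> {
    debug_assert!(page_size.is_power_of_two());
    debug_assert!(emu_start <= emu_end && emu_end <= bar_size);
    let mask = page_size - 1;
    // Expand the hole outward to page boundaries, clamped to the BAR size.
    let hole_start = emu_start & !mask;
    let hole_end = (emu_end.saturating_add(mask) & !mask).min(bar_size);

    let mut mappable = Vec::new();
    if hole_start > 0 {
        mappable.push(Range { start: 0, end: hole_start });
    }
    if hole_end < bar_size {
        mappable.push(Range { start: hole_end, end: bar_size });
    }
    mappable
}
```

Accesses that land inside the widened hole but outside the MSI-X structure itself would still have to be forwarded to the device by the emulation path, since that part of the BAR is no longer directly mapped.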
Add devtool options for preparing a PCI device for VFIO passthrough testing. --vfio-device accepts a block device path (e.g. /dev/nvme1n1) or a PCI SBDF, resolves it to a PCI device, binds it to vfio-pci, and passes the SBDF and sysfs path to the test container via environment variables. --first-vfio-pci-device is a fallback that searches for the first NVMe device already bound to vfio-pci if the primary device is not found. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
Add integration tests that verify VFIO passthrough with a physical NVMe device. Tests are gated behind the `vfio` pytest mark and the FC_VFIO_PCI_SBDF environment variable so they only run when a suitable device is available. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
The current VFIO implementation has some restrictions:
- Does not work without PCI since VFIO devices are PCI devices
- Does not work with the virtio-mem device since we don't update DMA mappings on hot-plug/unplug
- Does not work with virtio-balloon since it can `fadvise` on memory

To prevent VMs from being launched with invalid configurations, implement multiple checks:
- At the API level, prevent adding incompatible combinations (VFIO after balloon/mem or the reverse)
- At VM creation or snapshot restoration, since they get VmResources from other sources.

Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
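A compatibility check of this kind could be sketched as follows. This is an illustration under stated assumptions: `DeviceSummary`, `VfioConfigError`, and `check_vfio_compat` are hypothetical names, and the real checks operate on `VmResources` rather than a flat summary struct.

```rust
/// Flattened view of the configured devices (hypothetical, for illustration).
#[derive(Debug, Default, Clone, Copy)]
pub struct DeviceSummary {
    pub pci_enabled: bool,
    pub vfio_count: usize,
    pub has_balloon: bool,
    pub has_virtio_mem: bool,
}

/// Reasons a configuration with VFIO devices is rejected (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum VfioConfigError {
    PciDisabled,
    VirtioMemPresent,
    BalloonPresent,
}

/// Reject combinations that the restrictions above rule out.
/// Configurations without any VFIO device always pass.
pub fn check_vfio_compat(s: &DeviceSummary) -> Result<(), VfioConfigError> {
    if s.vfio_count == 0 {
        return Ok(());
    }
    if !s.pci_enabled {
        return Err(VfioConfigError::PciDisabled);
    }
    if s.has_virtio_mem {
        return Err(VfioConfigError::VirtioMemPresent);
    }
    if s.has_balloon {
        return Err(VfioConfigError::BalloonPresent);
    }
    Ok(())
}
```

Running the same function from the API handlers and again at VM creation or snapshot restore would cover both entry points, since the check does not depend on the order in which devices were added.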
VFIO device state is opaque to the VMM and cannot be serialized or restored. Add VFIO devices to the list of snapshot-incompatible devices so that snapshot requests are rejected with a clear error instead of producing a corrupt snapshot. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
The vfio-ioctls crate uses syscalls that were not previously in the seccomp allowlists. Add them for both architectures:
- dup: used by vfio-ioctls to duplicate file descriptors during VFIO device setup.
- ioctl(VFIO_GROUP_UNSET_CONTAINER): used during VfioDevice drop to detach the group from the container.
- ioctl(VFIO_IOMMU_UNMAP_DMA): used during cleanup to unmap DMA regions.
- pread64/pwrite64: used for reading/writing device PCI config and BAR regions, both on the API thread during setup and on vCPU threads during MMIO/PIO exits.

Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
The VFIO integration tests use an NVMe device to verify passthrough functionality. Update a kernel config to enable the NVMe core and block device drivers so the guest can detect and use the passthrough device. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
VFIO tests need exclusive access to the passthrough device, so they cannot run in parallel with other tests. Add a separate Buildkite step in the PR pipeline that runs only the vfio-marked tests, similar to the existing performance step. CI instances will have an additional 1GB NVMe device at /dev/nvme1n1 for this purpose. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
11aee55 to 63c3bd5
Add docs/vfio.md covering how VFIO passthrough works in Firecracker, prerequisites (IOMMU, vfio-pci binding), configuration via API and config file, security considerations, snapshot incompatibility, and current limitations. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
Add a changelog entry for the new VFIO PCI device passthrough feature. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
do not merge: point to vfio artifacts Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
63c3bd5 to f6d6fea
Wire up the code to allow hot-plugging VFIO devices after VM boot. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
With hot-plug support, we need to expand the list of syscalls to allow new VFIO specific ones. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
Code that cleans up BAR allocations will need to know both the address and the size of each BAR. Add utility functions to get these values. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
f6d6fea to 50e789e
Changes
Add basic implementation of the VFIO device pass-through.
Current version only allows devices to be added before VM boot.
Other limitations:
Reason
Provide a way to pass physical PCI devices into a VM.
License Acceptance
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.
PR Checklist
tools/devtool checkbuild --all to verify that the PR passes build checks on all supported architectures.
tools/devtool checkstyle to verify that the PR passes the automated style checks.
how they are solving the problem in a clear and encompassing way.
in the PR.
CHANGELOG.md.
Runbook for Firecracker API changes.
integration tests.
TODO.
rust-vmm.