feat(memory): use fallocate(PUNCH_HOLE) for guest_memfd discard #5792
Draft
JackThomson2 wants to merge 326 commits into
Simplify the docker-popular rootfs building using common functions. Define a new file `setup-minimal.sh` that is responsible for the image-specific setups. Also, use squashfs for test-popular-containers tests, as there is no specific reason for them to be ext4. Signed-off-by: James Curtis <jxcurtis@amazon.co.uk>
In 1.14.0 we seem to have changed the convention to vX.Y.Z, and 1.15.0 followed suit. Update these to use the previous convention without the v prefix. Signed-off-by: Jack Thomson <jackabt@amazon.com>
Bumps the firecracker group with 9 updates:

| Package | From | To |
| --- | --- | --- |
| [zerocopy](https://github.com/google/zerocopy) | `0.8.40` | `0.8.42` |
| [quote](https://github.com/dtolnay/quote) | `1.0.44` | `1.0.45` |
| [uuid](https://github.com/uuid-rs/uuid) | `1.21.0` | `1.22.0` |
| [libc](https://github.com/rust-lang/libc) | `0.2.182` | `0.2.183` |
| [glam](https://github.com/bitshifter/glam-rs) | `0.32.0` | `0.32.1` |
| [jiff](https://github.com/BurntSushi/jiff) | `0.2.22` | `0.2.23` |
| [jiff-static](https://github.com/BurntSushi/jiff) | `0.2.22` | `0.2.23` |
| [winnow](https://github.com/winnow-rs/winnow) | `0.7.14` | `0.7.15` |
| [zerocopy-derive](https://github.com/google/zerocopy) | `0.8.40` | `0.8.42` |

Updates `zerocopy` from 0.8.40 to 0.8.42
- [Release notes](https://github.com/google/zerocopy/releases)
- [Changelog](https://github.com/google/zerocopy/blob/main/CHANGELOG.md)
- [Commits](google/zerocopy@v0.8.40...v0.8.42)

Updates `quote` from 1.0.44 to 1.0.45
- [Release notes](https://github.com/dtolnay/quote/releases)
- [Commits](dtolnay/quote@1.0.44...1.0.45)

Updates `uuid` from 1.21.0 to 1.22.0
- [Release notes](https://github.com/uuid-rs/uuid/releases)
- [Commits](uuid-rs/uuid@v1.21.0...v1.22.0)

Updates `libc` from 0.2.182 to 0.2.183
- [Release notes](https://github.com/rust-lang/libc/releases)
- [Changelog](https://github.com/rust-lang/libc/blob/0.2.183/CHANGELOG.md)
- [Commits](rust-lang/libc@0.2.182...0.2.183)

Updates `glam` from 0.32.0 to 0.32.1
- [Changelog](https://github.com/bitshifter/glam-rs/blob/main/CHANGELOG.md)
- [Commits](bitshifter/glam-rs@0.32.0...0.32.1)

Updates `jiff` from 0.2.22 to 0.2.23
- [Release notes](https://github.com/BurntSushi/jiff/releases)
- [Changelog](https://github.com/BurntSushi/jiff/blob/master/CHANGELOG.md)
- [Commits](BurntSushi/jiff@jiff-static-0.2.22...jiff-static-0.2.23)

Updates `jiff-static` from 0.2.22 to 0.2.23
- [Release notes](https://github.com/BurntSushi/jiff/releases)
- [Changelog](https://github.com/BurntSushi/jiff/blob/master/CHANGELOG.md)
- [Commits](BurntSushi/jiff@jiff-static-0.2.22...jiff-static-0.2.23)

Updates `winnow` from 0.7.14 to 0.7.15
- [Changelog](https://github.com/winnow-rs/winnow/blob/main/CHANGELOG.md)
- [Commits](winnow-rs/winnow@v0.7.14...v0.7.15)

Updates `zerocopy-derive` from 0.8.40 to 0.8.42
- [Release notes](https://github.com/google/zerocopy/releases)
- [Changelog](https://github.com/google/zerocopy/blob/main/CHANGELOG.md)
- [Commits](google/zerocopy@v0.8.40...v0.8.42)

---
updated-dependencies:
- dependency-name: zerocopy
  dependency-version: 0.8.42
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: quote
  dependency-version: 1.0.45
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: uuid
  dependency-version: 1.22.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: firecracker
- dependency-name: libc
  dependency-version: 0.2.183
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: glam
  dependency-version: 0.32.1
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: jiff
  dependency-version: 0.2.23
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: jiff-static
  dependency-version: 0.2.23
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: winnow
  dependency-version: 0.7.15
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: zerocopy-derive
  dependency-version: 0.8.42
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
...
Signed-off-by: dependabot[bot] <support@github.com>
This commit introduces the ability to override the vsock's backing Unix Domain Socket (UDS) path when restoring a VM from a snapshot. This is useful in scenarios where the original UDS path is not available on the host where the snapshot is being restored, for example when restoring on a different machine. A new `vsock_override` field has been added to the `/snapshot/load` API endpoint to specify the new UDS path. Authored-by: Sheng-Wei (Way) Chen <waychensw@gmail.com> Co-authored-by: James Curtis <jxcurtis@amazon.co.uk> Signed-off-by: James Curtis <jxcurtis@amazon.co.uk>
Add integration tests for overriding the host UDS path used by the vsock device. Signed-off-by: James Curtis <jxcurtis@amazon.co.uk>
Add a section that highlights the new vsock renaming capabilities. It somewhat mirrors the TAP renaming documentation. Signed-off-by: James Curtis <jxcurtis@amazon.co.uk>
Add a changelog entry for UDS renaming that links to the relevant PR and documentation. Signed-off-by: James Curtis <jxcurtis@amazon.co.uk>
The UDP stack is dead code and unreachable. This commit has no functional change. Signed-off-by: Riccardo Mancini <mancio@amazon.com>
The 10 us absolute difference threshold is too high: we would like the A/B test to detect smaller changes that are still significant in relative terms. This essentially reverts d073c2b ("ci(perf/net_latency): increase absolute delta threshold to 10us"). Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
PortIODeviceManager::register_devices() creates 2 dummy serial devices, in addition to the usable one created in create_legacy_devices(). Remove these 2 dummy devices as they are unreachable and serve no purpose. Their input is hard-coded to None and their output is hard-coded to SerialOut::Sink, meaning no data can ever be received or transmitted from these devices. Removing the address ranges from the I/O bus is okay because the top-level handler in run_arch_emulation() ignores any port I/O accesses when there's no associated handler behind the address (similar to how there's no handler for COM5 to COM8 or other legacy devices). Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
There's no reason for a kbd_evt field in PortIODeviceManager, since the EventFd can be accessed from the i8042 field. Remove it. Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Now that PortIODeviceManager::new() can no longer return an error and it simply creates a new instance of PortIODeviceManager, the constructor can be removed. Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Overlapping descriptors within a single chain can cause buffer.len() to exceed the distinct guest memory backing the request. Without a bound, handle_one() allocates a vec proportional to the inflated length, which can reach ~4 GiB from a 17 MiB guest. Introduce MAX_ENTROPY_BYTES (64 KiB) and clamp the allocation in handle_one() to that limit. Legitimate requests are unaffected since a 256-entry descriptor chain with typical page-sized buffers fits well within the cap. Add tests covering the capped path, the large inflated buffer path, and the pass-through for small requests. Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
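The clamp described in the commit above can be illustrated with a minimal sketch. Only the `MAX_ENTROPY_BYTES` name and its 64 KiB value come from the commit message; the helper names `capped_entropy_len` and `alloc_entropy_buffer` are hypothetical and this is not the actual `handle_one()` code.

```rust
/// Cap taken from the commit message (64 KiB).
const MAX_ENTROPY_BYTES: usize = 64 * 1024;

/// Hypothetical helper: clamp a guest-derived length before it is used
/// as an allocation size.
fn capped_entropy_len(requested: usize) -> usize {
    requested.min(MAX_ENTROPY_BYTES)
}

/// Hypothetical helper: allocate a zeroed buffer that never exceeds the cap.
/// A real device would go on to fill it with random bytes.
fn alloc_entropy_buffer(requested: usize) -> Vec<u8> {
    vec![0u8; capped_entropy_len(requested)]
}
```

With this shape, a request inflated to roughly 4 GiB by overlapping descriptors results in a 64 KiB allocation, while a page-sized request passes through unchanged.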
Bumps the firecracker group with 11 updates:

| Package | From | To |
| --- | --- | --- |
| [clap](https://github.com/clap-rs/clap) | `4.5.60` | `4.6.0` |
| [gdbstub](https://github.com/daniel5151/gdbstub) | `0.7.9` | `0.7.10` |
| [gdbstub_arch](https://github.com/daniel5151/gdbstub) | `0.3.2` | `0.3.3` |
| [anstyle](https://github.com/rust-cli/anstyle) | `1.0.13` | `1.0.14` |
| [cc](https://github.com/rust-lang/cc-rs) | `1.2.56` | `1.2.57` |
| [clap_builder](https://github.com/clap-rs/clap) | `4.5.60` | `4.6.0` |
| [clap_derive](https://github.com/clap-rs/clap) | `4.5.55` | `4.6.0` |
| [clap_lex](https://github.com/clap-rs/clap) | `1.0.0` | `1.1.0` |
| [colorchoice](https://github.com/rust-cli/anstyle) | `1.0.4` | `1.0.5` |
| [once_cell](https://github.com/matklad/once_cell) | `1.21.3` | `1.21.4` |
| [portable-atomic-util](https://github.com/taiki-e/portable-atomic-util) | `0.2.5` | `0.2.6` |

Updates `clap` from 4.5.60 to 4.6.0
- [Release notes](https://github.com/clap-rs/clap/releases)
- [Changelog](https://github.com/clap-rs/clap/blob/master/CHANGELOG.md)
- [Commits](clap-rs/clap@clap_complete-v4.5.60...clap_complete-v4.6.0)

Updates `gdbstub` from 0.7.9 to 0.7.10
- [Release notes](https://github.com/daniel5151/gdbstub/releases)
- [Changelog](https://github.com/daniel5151/gdbstub/blob/master/CHANGELOG.md)
- [Commits](daniel5151/gdbstub@0.7.9...0.7.10)

Updates `gdbstub_arch` from 0.3.2 to 0.3.3
- [Release notes](https://github.com/daniel5151/gdbstub/releases)
- [Changelog](https://github.com/daniel5151/gdbstub/blob/master/CHANGELOG.md)
- [Commits](https://github.com/daniel5151/gdbstub/commits)

Updates `anstyle` from 1.0.13 to 1.0.14
- [Commits](rust-cli/anstyle@v1.0.13...v1.0.14)

Updates `cc` from 1.2.56 to 1.2.57
- [Release notes](https://github.com/rust-lang/cc-rs/releases)
- [Changelog](https://github.com/rust-lang/cc-rs/blob/main/CHANGELOG.md)
- [Commits](rust-lang/cc-rs@cc-v1.2.56...cc-v1.2.57)

Updates `clap_builder` from 4.5.60 to 4.6.0
- [Release notes](https://github.com/clap-rs/clap/releases)
- [Changelog](https://github.com/clap-rs/clap/blob/master/CHANGELOG.md)
- [Commits](clap-rs/clap@v4.5.60...v4.6.0)

Updates `clap_derive` from 4.5.55 to 4.6.0
- [Release notes](https://github.com/clap-rs/clap/releases)
- [Changelog](https://github.com/clap-rs/clap/blob/master/CHANGELOG.md)
- [Commits](clap-rs/clap@v4.5.55...v4.6.0)

Updates `clap_lex` from 1.0.0 to 1.1.0
- [Release notes](https://github.com/clap-rs/clap/releases)
- [Changelog](https://github.com/clap-rs/clap/blob/master/CHANGELOG.md)
- [Commits](clap-rs/clap@clap_lex-v1.0.0...clap_lex-v1.1.0)

Updates `colorchoice` from 1.0.4 to 1.0.5
- [Commits](rust-cli/anstyle@colorchoice-v1.0.4...colorchoice-v1.0.5)

Updates `once_cell` from 1.21.3 to 1.21.4
- [Changelog](https://github.com/matklad/once_cell/blob/master/CHANGELOG.md)
- [Commits](matklad/once_cell@v1.21.3...v1.21.4)

Updates `portable-atomic-util` from 0.2.5 to 0.2.6
- [Release notes](https://github.com/taiki-e/portable-atomic-util/releases)
- [Changelog](https://github.com/taiki-e/portable-atomic-util/blob/main/CHANGELOG.md)
- [Commits](taiki-e/portable-atomic-util@v0.2.5...v0.2.6)

---
updated-dependencies:
- dependency-name: clap
  dependency-version: 4.6.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: firecracker
- dependency-name: gdbstub
  dependency-version: 0.7.10
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: gdbstub_arch
  dependency-version: 0.3.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: anstyle
  dependency-version: 1.0.14
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: cc
  dependency-version: 1.2.57
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: clap_builder
  dependency-version: 4.6.0
  dependency-type: indirect
  update-type: version-update:semver-minor
  dependency-group: firecracker
- dependency-name: clap_derive
  dependency-version: 4.6.0
  dependency-type: indirect
  update-type: version-update:semver-minor
  dependency-group: firecracker
- dependency-name: clap_lex
  dependency-version: 1.1.0
  dependency-type: indirect
  update-type: version-update:semver-minor
  dependency-group: firecracker
- dependency-name: colorchoice
  dependency-version: 1.0.5
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: once_cell
  dependency-version: 1.21.4
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: portable-atomic-util
  dependency-version: 0.2.6
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
...
Signed-off-by: dependabot[bot] <support@github.com>
Update the release policy to reference the newest patch release, 1.14.3. Signed-off-by: Jack Thomson <jackabt@amazon.com>
Currently, Firecracker sets "FCVMGID" as the HID (Hardware ID) and
"VM_Gen_Counter" as the CID (Compatible ID). The Linux kernel [1] specifies
"VMGENCTR" and "VM_GEN_COUNTER" as the ACPI IDs used to bind the driver to the
VMGenID device.
If the VMGenID driver is implemented as a platform driver, the Linux kernel
checks whether either the HID or the CID matches the ACPI IDs. If it is
implemented as an ACPI driver, only the HID is checked for a match.
Linux kernel 6.10 [2] converted it from an ACPI driver to a platform
driver in order to support device tree. As a result, prior to Linux
kernel 6.10, the driver isn't bound correctly.
We didn't see any issue due to HID mismatch, because we backported the
above kernel patches to our 6.1 guest kernel [3]. VMGenID itself is
only supported since upstream Linux kernel 5.18+ [4] and we don't test
VMGenID on our 5.10 guest kernel.
Note that Amazon Linux-provided microVM kernel 5.10 [5][6] actually
implements VMGenID driver but it is a downstream implementation. It
is never used by customers (instead SysGenID is used) and it specifies
yet another set of ACPI IDs ("VMGENID" and "QEMUVMGID"). So, more
precisely, that is why we don't test VMGenID on our 5.10 guest kernel.
[1]: https://elixir.bootlin.com/linux/v6.19.7/source/drivers/virt/vmgenid.c#L162-L163
[2]: torvalds/linux@e076067
[3]: https://github.com/firecracker-microvm/firecracker/blob/81236d82b1640cfa41f825f50be5585c758e165b/docs/snapshotting/random-for-clones.md?plain=1#L132-L139
[4]: torvalds/linux@af6b54e
[5]: https://github.com/amazonlinux/linux/blob/microvm-kernel-5.10.245-268.975.amzn2/drivers/virt/vmgenid.c#L121-L122
[6]: amazonlinux/linux@c9b81dc
Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
Although Firecracker doesn't do anything for SysGenID, it is used for the snapsafety issue by guest userspace in Amazon Linux-provided microVM kernels [1][2]. Let's test them to detect functional regression that might be introduced in microVM kernels. [1]: https://github.com/amazonlinux/linux/blob/microvm-kernel-5.10.245-268.975.amzn2/drivers/misc/sysgenid.c [2]: https://github.com/amazonlinux/linux/blob/microvm-kernel-6.1.164-23.303.amzn2023/drivers/misc/sysgenid.c Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
A vulnerability was found in aws-lc-sys [1]. Although Firecracker isn't affected as it doesn't use AWS-LC to validate CN, update aws-lc-rs (and aws-lc-sys indirectly) to suppress cargo-audit failure. [1]: https://rustsec.org/advisories/RUSTSEC-2026-0044.html Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
Since host kernel 6.3 (commit 7af0c2534f4c), KVM fabricates CLIDR_EL1 instead of passing through the host's real value. On hosts with IDC=1 and DIC=0 (e.g. Neoverse V1), the fabricated CLIDR exposes only L1=Unified when the host actually has separate L1d+L1i, L2, and L3. Guest kernels >= 6.1.156 backported init_of_cache_level() which counts cache leaves from the DT, while populate_cache_leaves() uses CLIDR_EL1. When the DT (built from host sysfs) describes more cache entries than CLIDR_EL1, the mismatch causes cache sysfs entries to not be created, breaking /sys/devices/system/cpu/cpu*/cache/* in the guest. Fix this by reading the current CLIDR_EL1 from vCPU 0, merging in the ctype and LoC fields derived from the host's sysfs cache topology, and writing the result back to each vCPU via KVM_SET_ONE_REG. Fields that cannot be derived from sysfs (LoUU, LoUIS, ICB, Ttype) are preserved from the original CLIDR_EL1. This makes CLIDR_EL1 consistent with the FDT, which already describes the real host caches. On pre-6.3 kernels, KVM passes through the real host CLIDR rather than fabricating one. Since the sysfs cache topology already matches the real CLIDR, the merge produces the same value, the write is skipped, and the override is effectively a no-op. This approach preserves the full host cache information for the guest rather than stripping the FDT to match the fabricated CLIDR. Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
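The merge step described above can be sketched as plain bit manipulation. This is an illustration only, not the Firecracker code: `CacheType`, `merge_clidr`, and the way the per-level types and LoC value are passed in are hypothetical. The field positions follow the architectural CLIDR_EL1 layout (Ctype1..Ctype7 in bits 20:0, three bits per level; LoC in bits 26:24). Reading and writing the register through KVM is left out.

```rust
/// Ctype<n> encodings used by CLIDR_EL1.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum CacheType {
    None = 0b000,
    Instruction = 0b001,
    Data = 0b010,
    Separate = 0b011, // separate instruction and data caches
    Unified = 0b100,
}

const CTYPE_BITS: u32 = 3;
const CTYPE_LEVELS: usize = 7;
const LOC_SHIFT: u32 = 24;
const LOC_MASK: u64 = 0b111 << LOC_SHIFT;

/// Hypothetical helper: replace the Ctype and LoC fields of `current` with
/// values derived from the host cache topology, preserving every other
/// field (LoUIS, LoUU, ICB, Ttype).
fn merge_clidr(current: u64, levels: &[CacheType], loc: u64) -> u64 {
    let ctype_mask: u64 = (1 << (CTYPE_BITS * CTYPE_LEVELS as u32)) - 1;
    let mut merged = current & !(ctype_mask | LOC_MASK);
    for (i, ty) in levels.iter().take(CTYPE_LEVELS).enumerate() {
        merged |= (*ty as u64) << (CTYPE_BITS * i as u32);
    }
    merged | ((loc & 0b111) << LOC_SHIFT)
}
```

If the merged value equals the value that was read, the caller can skip the write, which matches the no-op behaviour the commit describes for pre-6.3 host kernels.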
There's currently no check when creating a BusRange object that the range is valid. Add a new() method that makes sure that the base + len arguments don't overflow and that length is non-zero. To make things simpler and avoid potential overflows in the future, store the end address in BusRange rather than a length, which avoids extra checks for when base+len is exactly (1<<64). Finally, make the base and end fields private so that new BusRange objects can only be created via the constructor (and hence are always valid) and add a test case for verifying valid ranges. Similarly, make BusRange::overlaps() take a BusRange rather than a base and len. BusRange::overlaps() had the potential of overflowing without this change, however that wouldn't happen in practice because the resource allocator generates valid ranges. Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
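A minimal sketch of a validated range type along these lines follows. It is not the actual Firecracker `BusRange`: the `Option` return type and the choice to store `end` as the last address inside the range (inclusive) are assumptions made for illustration.

```rust
/// Sketch of a bus range that is valid by construction.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BusRange {
    base: u64,
    /// Last address inside the range (inclusive); an assumption of this sketch.
    end: u64,
}

impl BusRange {
    /// Returns None for a zero length or when the last address would
    /// not fit in a u64.
    pub fn new(base: u64, len: u64) -> Option<Self> {
        if len == 0 {
            return None;
        }
        let end = base.checked_add(len - 1)?;
        Some(Self { base, end })
    }

    /// Overlap test on two already-validated ranges; no arithmetic is
    /// needed, so it cannot overflow.
    pub fn overlaps(&self, other: &BusRange) -> bool {
        self.base <= other.end && other.base <= self.end
    }
}
```

Storing an inclusive end means a range whose `base + len` is exactly `1 << 64` is still representable, because `end` is then `u64::MAX`.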
With the previous loop, the uVMs were not being killed, so the number of actively running uVMs kept growing and put pressure on the host. The timing of hot(un)plug operations gradually increased with each iteration as the pressure built up. Signed-off-by: Jack Thomson <jackabt@amazon.com>
We generate multiple VMs in a loop; however, the factory holds a reference to each of them, so they are not killed when they go out of scope. Explicitly kill the VM at the end of each iteration. Signed-off-by: Jack Thomson <jackabt@amazon.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## feature/secret-hiding #5792 +/- ##
=========================================================
- Coverage 82.03% 81.06% -0.98%
=========================================================
Files 277 278 +1
Lines 30093 30821 +728
=========================================================
+ Hits 24688 24984 +296
- Misses 5405 5837 +432
fa4fe9d to eab0a13
The previous implementation checked whether either slot endpoint fell inside the requested range. This missed the containment case where a slot fully contains the range (neither endpoint inside it), causing update_kvm_slots to silently skip KVM slot registration/removal for any block not aligned to a slot boundary. Replace the two addr_in_range endpoint checks with a proper half-open interval intersection test: slot_start < range_end && range_start < slot_end. Remove the now-unused addr_in_range helper and add a table-driven unit test covering boundary, interior, cross-slot, full-region, outside, and zero-length ranges. Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
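To make the containment case concrete, here is a small sketch contrasting the two checks. The function names are hypothetical, and `endpoint_check` is only a reconstruction of the old behaviour as the commit describes it, not the removed code. The new predicate is the half-open intersection test quoted in the commit message.

```rust
/// True if `addr` lies in the half-open interval [start, end).
fn addr_in_range(addr: u64, start: u64, end: u64) -> bool {
    start <= addr && addr < end
}

/// Reconstruction of the old approach: does either slot endpoint fall
/// inside the requested range? (Assumes `slot_end` is exclusive.)
fn endpoint_check(slot_start: u64, slot_end: u64, range_start: u64, range_end: u64) -> bool {
    addr_in_range(slot_start, range_start, range_end)
        || addr_in_range(slot_end.saturating_sub(1), range_start, range_end)
}

/// Half-open interval intersection, as given in the commit message.
fn intersects(slot_start: u64, slot_end: u64, range_start: u64, range_end: u64) -> bool {
    slot_start < range_end && range_start < slot_end
}
```

For a slot `[0x0, 0x1000)` and a range `[0x400, 0x800)`, neither slot endpoint is inside the range, so `endpoint_check` reports no overlap while `intersects` correctly reports one. Adjacent intervals such as `[0x0, 0x1000)` and `[0x1000, 0x2000)` do not intersect under the half-open test.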
process_stats_queue() used the guest-provided descriptor len field as the loop bound without validation. A misbehaving guest could set this to u32::MAX, causing excessive iterations that temporarily monopolise the VMM event loop. Add a MAX_STATS_DESC_LEN check before entering the loop. The limit uses a generous upper bound (256 tags) rather than the current spec count, so future kernel additions won't silently break stats collection. Oversized descriptors are logged and held without updating stats, preserving the stats request/response protocol. Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
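A rough sketch of such a bound check is shown below. Only the `MAX_STATS_DESC_LEN` name and the 256-tag figure come from the commit message. Deriving the limit as 256 entries of 10 bytes each (a 16-bit tag plus a 64-bit value, the packed virtio balloon stat layout) is an assumption of this sketch, and `stats_entries_to_process` is a hypothetical helper.

```rust
/// Size of one packed stat entry: u16 tag + u64 value.
const STAT_ENTRY_SIZE: u32 = 10;
/// Generous upper bound on the number of tags, per the commit message.
const MAX_STATS_TAGS: u32 = 256;
/// Assumed derivation of the limit; the real constant may be defined differently.
const MAX_STATS_DESC_LEN: u32 = MAX_STATS_TAGS * STAT_ENTRY_SIZE;

/// Hypothetical helper: returns how many entries to parse, or None when
/// the guest-provided length is oversized and the descriptor should be
/// held without updating stats.
fn stats_entries_to_process(desc_len: u32) -> Option<u32> {
    if desc_len > MAX_STATS_DESC_LEN {
        return None;
    }
    Some(desc_len / STAT_ENTRY_SIZE)
}
```

Checking the length once before the loop keeps the iteration count bounded by a constant, regardless of what the guest writes into the descriptor.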
When a non-compliant driver submits more than one stats buffer, process_stats_queue returns the previous descriptor via add_used but never calls advance_used_ring_idx or signal_used_queue. The write to the used ring is therefore invisible to the guest, which can never reclaim the buffer. Add the missing advance_used_ring_idx and signal_used_queue calls so the guest actually sees the returned descriptor. Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
This updates UFFD patches from "v3 UFFD minor support" to Mike's respin RFC that adds support for both major and minor faults. It also adds a missing NUMA dependency patch that exports MM symbols to KVM and a fixup for UFFD series that does the same.
Current patch set (based on v6.18):
- NUMA support in guest_memfd (from v6.19-rc5), new: missing patch
- v10 direct map removal + fixup
- x86: configurable TLB flushes after direct map removal
- v2 kvmclock
- v1 KVM userfault
- v7 write syscall
- fixup for direct map removal and write to work together
- RFC UFFD support with fixups (new)
- PoC srcu_synchronize optimisation from Sean
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
This is because guest_memfd now supports both modes. We still need minor handling in case prepopulation logic adds a page in the page cache and the VMM accesses it later via user mappings. Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Now that guest_memfd has support for major faults, we can resolve all on-demand user mapping faults with UFFDIO_COPY. Remove the code that uses memcpy memory population. Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
With adding major fault support in guest_memfd, when a major fault is generated, no page is added in the page cache by the kernel. It means that proactive population with write() will succeed for the entire region but possibly one page (kvmclock on x86). We just need to call UFFDIO_CONTINUE in response to such a major fault because we already populated the faulting page via write() earlier. Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
This is because when upgrading the guest kernel from 6.1.155 to 6.1.163, /sys/devices/system/cpu/cpu?/cache/* files disappeared. Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
- Drop the avoidance of silent kvm-clock activation failure as Sean commented
- Fix a compile error when CONFIG_KVM_SW_PROTECTED_VM=y
Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
Add the APF patch series to be built into our SH AMI. Signed-off-by: Jack Thomson <jackabt@amazon.com>
This reverts commit 5417c27. Reverting for now while we investigate the regressions in hotplugging Signed-off-by: Jack Thomson <jackabt@amazon.com>
- Upgrade UFFD patches from RFC to v1 including fixes from Harry, Edward and Mike
- Replace Sean's SRCU fix PoC with his RFC
Current patch set (based on v6.18):
- NUMA support in guest_memfd (from v6.19-rc5)
- v10 direct map removal + fixup
- x86: configurable TLB flushes after direct map removal
- v3 kvmclock
- v1 KVM userfault
- v7 write syscall
- fixup for direct map removal and write to work together
- v1 UFFD support with fixups (new: RFC -> v1)
- RFC srcu_synchronize optimisation from Sean (new: PoC -> RFC)
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Current patch set (based on v6.18):
- NUMA support in guest_memfd (from v6.19-rc5)
- v11 direct map removal (new: v10 -> v11)
- x86: configurable TLB flushes after direct map removal
- v3 kvmclock
- v1 KVM userfault
- v7 write syscall
- fixup for direct map removal and write to work together
- v2 UFFD support with fixups
- RFC srcu_synchronize optimisation from Sean
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Add APF patches for the kernel Signed-off-by: Jack Thomson <jackabt@amazon.com>
Replace the JSON-based UFFD protocol with length-prefixed bitcode encoding for better performance and type safety. Extract UffdMessageBroker from lib.rs into its own module (uffd_broker.rs) with proper error handling — send_fault_request() returns Result instead of panicking, and the iterator logs errors instead of unwinding. Also adds APF socket support to UFFD handler examples and exitless APF ring buffer structures to the handler side. Signed-off-by: Jack Thomson <jackabt@amazon.com>
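The length-prefixed framing mentioned above can be sketched with the standard library alone. This example treats the payload as opaque bytes and does not use the bitcode crate; the 4-byte little-endian prefix, the `MAX_FRAME_LEN` cap, and the function names are assumptions for illustration, not the wire format used by the PR.

```rust
use std::io::{self, Read, Write};

/// Hypothetical cap so a corrupt prefix cannot trigger a huge allocation.
const MAX_FRAME_LEN: usize = 1 << 20;

/// Write one frame: a 4-byte little-endian length followed by the payload.
fn write_frame<W: Write>(w: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "payload too large"));
    }
    w.write_all(&(payload.len() as u32).to_le_bytes())?;
    w.write_all(payload)
}

/// Read one frame, returning an error instead of panicking on short or
/// oversized input.
fn read_frame<R: Read>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    r.read_exact(&mut len_buf)?;
    let len = u32::from_le_bytes(len_buf) as usize;
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame too large"));
    }
    let mut payload = vec![0u8; len];
    r.read_exact(&mut payload)?;
    Ok(payload)
}
```

Returning `io::Result` from both directions mirrors the commit's move from panicking to error propagation: a truncated stream surfaces as an `Err` the caller can log.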
Add the synchronous/fallback path for KVM async page faults. When a guest vCPU hits a userfault page and the exitless ring buffer is full (or not configured), KVM exits with KVM_EXIT_MEMORY_FAULT + KVM_MEMORY_EXIT_FLAG_APF. The VMM sends the fault request to the UFFD handler over a Unix socket, issues KVM_APF_OP_ACCEPT so the vCPU can re-enter in a halted state, and processes the handler's reply with KVM_APF_OP_READY to wake the vCPU. Includes KvmAPFReq ioctl definitions, APF handling in vcpu handle_userfault() (gated to x86_64), SharedApfStream type, updated create_vcpus signature, and seccomp rules for the new ioctls. Signed-off-by: Jack Thomson <jackabt@amazon.com>
Add the fast path for async page faults that avoids KVM_RUN exit/ re-entry overhead. Each vCPU gets an ExitlessApfContext backed by a memfd-mapped shared page containing notify and completion ring buffers (32 entries each). KVM writes fault GPAs to the notify ring and signals an eventfd; the UFFD handler resolves the page and writes to the completion ring; KVM drains completions via a workqueue and wakes the halted vCPU. Three fds per vCPU (notify eventfd, completion eventfd, shared page memfd) are passed to the UFFD handler via SCM_RIGHTS over the existing Unix socket. Key design: exitless APF setup and handler unblocking happens BEFORE vCPU state restore, since MSR restore (kvm-clock) triggers userfaults on UFFD-registered memory that the handler must be ready to service. APF capability is detected at runtime — gracefully falls back to synchronous UFFD on kernels without KVM_CAP_ASYNC_PF_USERFAULT. Fix default_vmm() test helper to create a real pipe for _apf_pipe_reader instead of stealing fd 0 (stdin), which caused an IO Safety violation (double-close SIGABRT) during test runs. Signed-off-by: Jack Thomson <jackabt@amazon.com>
GUEST_MEMFD_FLAG_WRITE was accidentally added to the booted path during a rebase. Remove it to fix the booted path. Signed-off-by: Jack Thomson <jackabt@amazon.com>
Previously we never passed the APF socket to the handler, so we weren't using APF properly. Signed-off-by: Jack Thomson <jackabt@amazon.com>
Add integration and performance tests for APF to compare the perf impact Signed-off-by: Jack Thomson <jackabt@amazon.com>
There were a couple of errors on the non-exitless path for APF. We saw these with memory hotplug, which flooded the APF buffer. Signed-off-by: Jack Thomson <jackabt@amazon.com>
Holding the spinlock while calling the dequeue method caused massive regressions in the hotplugging tests. The root cause turned out to be that the spinlock disables preemption, which had a knock-on effect on the dequeue call. Signed-off-by: Jack Thomson <jackabt@amazon.com>
Current patch set (based on v6.18):
- NUMA support in guest_memfd (from v6.19-rc5)
- v12 direct map removal (new: v11 -> v12)
- x86: configurable TLB flushes after direct map removal
- v3 kvmclock
- v1 KVM userfault
- v7 write syscall
- fixup for direct map removal and write to work together
- v2 UFFD support with fixups
- RFC srcu_synchronize optimisation from Sean
- Tmp RFCv2 Async PF (non-published)
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
MADV_DONTNEED is a no-op for MAP_SHARED mappings, which means discard_range() previously did nothing for guest_memfd-backed memory. This prevented virtio-mem unplug and balloon inflate from actually freeing physical pages back to the host when secret_free is enabled. Add a fallocate(FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE) path for MAP_SHARED file-backed regions, which punches holes in the guest_memfd backing file and releases the pages from the page cache. Signed-off-by: Jack Thomson <jackabt@amazon.com>
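As a rough illustration of the system call involved, the sketch below punches a hole in an ordinary file using only the standard library plus a hand-declared `fallocate` binding. It is not the PR's `discard_range()` code: `punch_hole` is a hypothetical helper, the binding assumes 64-bit Linux (where `off_t` is 64 bits), and real code would more likely go through the `libc` crate and operate on the guest_memfd file descriptor.

```rust
use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

// Flag values from <linux/falloc.h>.
const FALLOC_FL_KEEP_SIZE: i32 = 0x01;
const FALLOC_FL_PUNCH_HOLE: i32 = 0x02;

extern "C" {
    // C prototype: int fallocate(int fd, int mode, off_t offset, off_t len);
    // off_t is assumed to be 64 bits wide here (64-bit Linux).
    fn fallocate(fd: i32, mode: i32, offset: i64, len: i64) -> i32;
}

/// Hypothetical helper: release the backing pages for [offset, offset + len)
/// while keeping the file size unchanged.
fn punch_hole(file: &File, offset: u64, len: u64) -> io::Result<()> {
    if offset > i64::MAX as u64 || len > i64::MAX as u64 {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "range does not fit in off_t"));
    }
    let mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
    // SAFETY: the descriptor stays open for the lifetime of `file`, and
    // fallocate does not read or write any memory owned by this process.
    let ret = unsafe { fallocate(file.as_raw_fd(), mode, offset as i64, len as i64) };
    if ret == 0 {
        Ok(())
    } else {
        Err(io::Error::last_os_error())
    }
}
```

On filesystems that support hole punching, subsequent reads of the punched range return zeros and the file length is unchanged; on others the call fails with an error the caller has to handle.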
Now that discard_range() uses fallocate(PUNCH_HOLE) for MAP_SHARED guest_memfd regions, the balloon can properly reclaim memory when secret_free is enabled. Remove the restriction that prevented combining balloon with secret_free. Signed-off-by: Jack Thomson <jackabt@amazon.com>
When secret_free is enabled, guest memory is backed by guest_memfd. Hotplug memory regions are never mapped into the VMM's userspace, so host VmRSS doesn't reflect those pages. Guest-side (total - available) is too noisy because mem_available tracks mem_total closely when the guest is idle, so unplug doesn't produce a reliable delta. Use the FC process's cgroup memory.current, which accounts for kernel pages allocated by guest_memfd and correctly drops after fallocate(PUNCH_HOLE). This gives a monotonic, block-accurate signal for virtio-mem unplug and balloon inflate under secret_free. Signed-off-by: Jack Thomson <jackabt@amazon.com>
Add the secret_free fixture parameter to all balloon functional tests, so they run with both SF_OFF and SF_ON variants. This exercises the fallocate(PUNCH_HOLE) discard path for guest_memfd-backed memory during balloon inflate/deflate. Signed-off-by: Jack Thomson <jackabt@amazon.com>
Now that discard_range() uses fallocate(PUNCH_HOLE) for guest_memfd, and get_resident_memory() uses guest meminfo for secret_free VMs, the RSS decrease assertion in check_hotunplug works correctly for secret_free. Remove the skip. Signed-off-by: Jack Thomson <jackabt@amazon.com>
eab0a13 to c60c65a
After discard_range() punches a hole in a guest_memfd-backed region via fallocate(PUNCH_HOLE), the folios for those pages are released. The userfault bitmap bits for those pages, however, may already be cleared from a previous UFFDIO_COPY (the handler clears the bit on successful copy). With a cleared bit and no folio, the next guest access takes the UFFD MINOR-fault path and the handler calls UFFDIO_CONTINUE, which returns EFAULT because there is no folio to install. Re-set the userfault bitmap bits covering the punched range so the next access takes the MISSING-fault path and the handler re-populates via UFFDIO_COPY. Signed-off-by: Jack Thomson <jackabt@amazon.com>
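The bitmap update described above amounts to setting one bit per page in the punched range. The sketch below shows that arithmetic on a plain `u64` slice. It is illustrative only: the one-bit-per-page layout, the 4 KiB page size, the page-aligned range, and the name `mark_range_missing` are assumptions, and the real bitmap is presumably shared with the kernel and would need atomic accesses.

```rust
/// Assumed guest page size for this sketch.
const PAGE_SIZE: u64 = 4096;

/// Hypothetical helper: set the bits for every page touched by
/// [offset, offset + len), so the next access to those pages is reported
/// as a missing-page fault. Panics if the range lies outside the bitmap.
fn mark_range_missing(bitmap: &mut [u64], offset: u64, len: u64) {
    if len == 0 {
        return;
    }
    let first = offset / PAGE_SIZE;
    let last = (offset + len - 1) / PAGE_SIZE;
    for page in first..=last {
        let word = (page / 64) as usize;
        let bit = page % 64;
        bitmap[word] |= 1u64 << bit;
    }
}
```

For example, punching four pages starting at page 62 sets the top two bits of the first word and the bottom two bits of the second.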
Add support for fallocate(PUNCH_HOLE). This will expand our tests to also include the balloon and memory hotplugging tests.
...
Reason
...
License Acceptance
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.
PR Checklist
tools/devtool checkbuild --all to verify that the PR passes build checks on all supported architectures.
tools/devtool checkstyle to verify that the PR passes the automated style checks.
how they are solving the problem in a clear and encompassing way.
in the PR.
CHANGELOG.md.
Runbook for Firecracker API changes.
integration tests.
TODO.
rust-vmm.
MADV_DONTNEED is a no-op for MAP_SHARED mappings, which means
discard_range() previously did nothing for guest_memfd-backed memory.
This prevented virtio-mem unplug and balloon inflate from actually
freeing physical pages back to the host when secret_free is enabled.
Add a fallocate(FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE) path for
MAP_SHARED file-backed regions, which punches holes in the guest_memfd
backing file and releases the pages from the page cache.
Signed-off-by: Jack Thomson <jackabt@amazon.com>