feat(memory): use fallocate(PUNCH_HOLE) for guest_memfd discard #5792

Draft
JackThomson2 wants to merge 326 commits into firecracker-microvm:feature/secret-hiding from JackThomson2:sh/support_punch_hole

Conversation

@JackThomson2
Contributor

Add support for fallocate(PUNCH_HOLE). This lets us expand our secret_free test coverage to also include the balloon and memory hotplugging tests.

...

Reason

...

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkbuild --all to verify that the PR passes
    build checks on all supported architectures.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

MADV_DONTNEED is a no-op for MAP_SHARED mappings, which means
discard_range() previously did nothing for guest_memfd-backed memory.
This prevented virtio-mem unplug and balloon inflate from actually
freeing physical pages back to the host when secret_free is enabled.

Add a fallocate(FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE) path for
MAP_SHARED file-backed regions, which punches holes in the guest_memfd
backing file and releases the pages from the page cache.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
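
As a rough illustration of the discard path described above, the sketch below punches a hole in an ordinary file with `fallocate(FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE)`. It is not the code from this PR: `punch_hole` is a hypothetical helper, the target is a regular file rather than a guest_memfd, and the raw `extern` declaration assumes 64-bit Linux (where `off_t` is 64 bits). It is declared by hand only to keep the example dependency-free; the `libc` crate exposes the same call.

```rust
use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

// Flag values from <linux/falloc.h>.
const FALLOC_FL_KEEP_SIZE: i32 = 0x01;
const FALLOC_FL_PUNCH_HOLE: i32 = 0x02;

unsafe extern "C" {
    // C prototype: int fallocate(int fd, int mode, off_t offset, off_t len);
    // Assumption: off_t is 64 bits wide (64-bit Linux).
    fn fallocate(fd: i32, mode: i32, offset: i64, len: i64) -> i32;
}

/// Hypothetical helper: release the backing pages for `[offset, offset + len)`
/// while keeping the file size unchanged.
fn punch_hole(file: &File, offset: u64, len: u64) -> io::Result<()> {
    if offset > i64::MAX as u64 || len > i64::MAX as u64 {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "range too large"));
    }
    // SAFETY: the fd stays valid for the lifetime of `file`, and fallocate
    // does not read or write any user-space memory.
    let ret = unsafe {
        fallocate(
            file.as_raw_fd(),
            FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
            offset as i64,
            len as i64,
        )
    };
    if ret == 0 {
        Ok(())
    } else {
        Err(io::Error::last_os_error())
    }
}
```

On a filesystem that supports hole punching, reads from the punched range return zeroes afterwards and the file length is unchanged; on one that does not, the call fails with `EOPNOTSUPP`.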

JamesC1305 and others added 23 commits March 10, 2026 12:03
Simplify the docker-popular rootfs building using common functions.
Define a new file `setup-minimal.sh` that is responsible for the
image-specific setups.

Also, use squashfs for test-popular-containers tests, as there is no
specific reason for them to be ext4.

Signed-off-by: James Curtis <jxcurtis@amazon.co.uk>
In 1.14.0 we seem to have changed the convention to vX.Y.Z, and 1.15.0
followed suit.

Update these to use the previous convention without the v prefix.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
Bumps the firecracker group with 9 updates:

| Package | From | To |
| --- | --- | --- |
| [zerocopy](https://github.com/google/zerocopy) | `0.8.40` | `0.8.42` |
| [quote](https://github.com/dtolnay/quote) | `1.0.44` | `1.0.45` |
| [uuid](https://github.com/uuid-rs/uuid) | `1.21.0` | `1.22.0` |
| [libc](https://github.com/rust-lang/libc) | `0.2.182` | `0.2.183` |
| [glam](https://github.com/bitshifter/glam-rs) | `0.32.0` | `0.32.1` |
| [jiff](https://github.com/BurntSushi/jiff) | `0.2.22` | `0.2.23` |
| [jiff-static](https://github.com/BurntSushi/jiff) | `0.2.22` | `0.2.23` |
| [winnow](https://github.com/winnow-rs/winnow) | `0.7.14` | `0.7.15` |
| [zerocopy-derive](https://github.com/google/zerocopy) | `0.8.40` | `0.8.42` |


Updates `zerocopy` from 0.8.40 to 0.8.42
- [Release notes](https://github.com/google/zerocopy/releases)
- [Changelog](https://github.com/google/zerocopy/blob/main/CHANGELOG.md)
- [Commits](google/zerocopy@v0.8.40...v0.8.42)

Updates `quote` from 1.0.44 to 1.0.45
- [Release notes](https://github.com/dtolnay/quote/releases)
- [Commits](dtolnay/quote@1.0.44...1.0.45)

Updates `uuid` from 1.21.0 to 1.22.0
- [Release notes](https://github.com/uuid-rs/uuid/releases)
- [Commits](uuid-rs/uuid@v1.21.0...v1.22.0)

Updates `libc` from 0.2.182 to 0.2.183
- [Release notes](https://github.com/rust-lang/libc/releases)
- [Changelog](https://github.com/rust-lang/libc/blob/0.2.183/CHANGELOG.md)
- [Commits](rust-lang/libc@0.2.182...0.2.183)

Updates `glam` from 0.32.0 to 0.32.1
- [Changelog](https://github.com/bitshifter/glam-rs/blob/main/CHANGELOG.md)
- [Commits](bitshifter/glam-rs@0.32.0...0.32.1)

Updates `jiff` from 0.2.22 to 0.2.23
- [Release notes](https://github.com/BurntSushi/jiff/releases)
- [Changelog](https://github.com/BurntSushi/jiff/blob/master/CHANGELOG.md)
- [Commits](BurntSushi/jiff@jiff-static-0.2.22...jiff-static-0.2.23)

Updates `jiff-static` from 0.2.22 to 0.2.23
- [Release notes](https://github.com/BurntSushi/jiff/releases)
- [Changelog](https://github.com/BurntSushi/jiff/blob/master/CHANGELOG.md)
- [Commits](BurntSushi/jiff@jiff-static-0.2.22...jiff-static-0.2.23)

Updates `winnow` from 0.7.14 to 0.7.15
- [Changelog](https://github.com/winnow-rs/winnow/blob/main/CHANGELOG.md)
- [Commits](winnow-rs/winnow@v0.7.14...v0.7.15)

Updates `zerocopy-derive` from 0.8.40 to 0.8.42
- [Release notes](https://github.com/google/zerocopy/releases)
- [Changelog](https://github.com/google/zerocopy/blob/main/CHANGELOG.md)
- [Commits](google/zerocopy@v0.8.40...v0.8.42)

---
updated-dependencies:
- dependency-name: zerocopy
  dependency-version: 0.8.42
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: quote
  dependency-version: 1.0.45
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: uuid
  dependency-version: 1.22.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: firecracker
- dependency-name: libc
  dependency-version: 0.2.183
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: glam
  dependency-version: 0.32.1
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: jiff
  dependency-version: 0.2.23
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: jiff-static
  dependency-version: 0.2.23
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: winnow
  dependency-version: 0.7.15
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: zerocopy-derive
  dependency-version: 0.8.42
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
...

Signed-off-by: dependabot[bot] <support@github.com>
This commit introduces the ability to override the vsock's backing
Unix Domain Socket (UDS) path when restoring a VM from a snapshot.

This is useful in scenarios where the original UDS path is not
available on the host where the snapshot is being restored, for
example when restoring on a different machine.

A new `vsock_override` field has been added to the `/snapshot/load`
API endpoint to specify the new UDS path.

Authored-by: Sheng-Wei (Way) Chen <waychensw@gmail.com>
Co-authored-by: James Curtis <jxcurtis@amazon.co.uk>
Signed-off-by: James Curtis <jxcurtis@amazon.co.uk>
Add integration tests for overriding the host UDS path used by the
vsock device.

Signed-off-by: James Curtis <jxcurtis@amazon.co.uk>
Add a section that highlights the new vsock renaming capabilities. It
somewhat mirrors the TAP renaming documentation.

Signed-off-by: James Curtis <jxcurtis@amazon.co.uk>
Add a changelog entry for UDS renaming that links to the relevant
PR and documentation.

Signed-off-by: James Curtis <jxcurtis@amazon.co.uk>
The UDP stack is dead code and unreachable. This commit has no
functional change.

Signed-off-by: Riccardo Mancini <mancio@amazon.com>
10 us absolute difference threshold is too high, and we would like for
the A/B test to detect smaller changes that are still significant in
relative terms.

This essentially reverts d073c2b ("ci(perf/net_latency): increase
absolute delta threshold to 10us").

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
PortIODeviceManager::register_devices() creates 2 dummy serial devices,
in addition to the usable one created in create_legacy_devices().

Remove these 2 dummy devices as they are unreachable and serve no
purpose. Their input is hard-coded to None and their output is
hard-coded to SerialOut::Sink, meaning no data can ever be received or
transmitted from these devices.

Removing the address ranges from the I/O bus is okay because the
top-level handler in run_arch_emulation() ignores any port I/O accesses
when there's no associated handler behind the address (similar to how
there's no handler for COM5 to COM8 or other legacy devices).

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
There's no reason for a kbd_evt field in PortIODeviceManager, since the
EventFd can be accessed from the i8042 field. Remove it.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Now that PortIODeviceManager::new() can no longer return an error and it
simply creates a new instance of PortIODeviceManager, the constructor
can be removed.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Overlapping descriptors within a single chain can cause buffer.len() to
exceed the distinct guest memory backing the request. Without a bound,
handle_one() allocates a vec proportional to the inflated length, which
can reach ~4 GiB from a 17 MiB guest.

Introduce MAX_ENTROPY_BYTES (64 KiB) and clamp the allocation in
handle_one() to that limit.  Legitimate requests are unaffected since a
256-entry descriptor chain with typical page-sized buffers fits well
within the cap.

Add tests covering the capped path, the large inflated buffer path, and
the pass-through for small requests.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
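
A minimal sketch of the clamp described in the commit above. Only `MAX_ENTROPY_BYTES` and its 64 KiB value come from the commit message; `entropy_alloc_len` and `alloc_entropy_buffer` are hypothetical names, and the real `handle_one()` may be structured differently.

```rust
/// Cap on the bytes served per request (64 KiB, as stated in the commit).
const MAX_ENTROPY_BYTES: usize = 64 * 1024;

/// Hypothetical helper: take the guest-derived buffer length and bound it
/// before it is used to size an allocation.
fn entropy_alloc_len(requested_len: usize) -> usize {
    requested_len.min(MAX_ENTROPY_BYTES)
}

/// Hypothetical stand-in for the allocation in the request handler: the
/// vector can never grow beyond the cap, whatever length the chain reports.
fn alloc_entropy_buffer(requested_len: usize) -> Vec<u8> {
    vec![0u8; entropy_alloc_len(requested_len)]
}
```

Small requests pass through unchanged, while an inflated length such as `u32::MAX` results in a 64 KiB allocation.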
Bumps the firecracker group with 11 updates:

| Package | From | To |
| --- | --- | --- |
| [clap](https://github.com/clap-rs/clap) | `4.5.60` | `4.6.0` |
| [gdbstub](https://github.com/daniel5151/gdbstub) | `0.7.9` | `0.7.10` |
| [gdbstub_arch](https://github.com/daniel5151/gdbstub) | `0.3.2` | `0.3.3` |
| [anstyle](https://github.com/rust-cli/anstyle) | `1.0.13` | `1.0.14` |
| [cc](https://github.com/rust-lang/cc-rs) | `1.2.56` | `1.2.57` |
| [clap_builder](https://github.com/clap-rs/clap) | `4.5.60` | `4.6.0` |
| [clap_derive](https://github.com/clap-rs/clap) | `4.5.55` | `4.6.0` |
| [clap_lex](https://github.com/clap-rs/clap) | `1.0.0` | `1.1.0` |
| [colorchoice](https://github.com/rust-cli/anstyle) | `1.0.4` | `1.0.5` |
| [once_cell](https://github.com/matklad/once_cell) | `1.21.3` | `1.21.4` |
| [portable-atomic-util](https://github.com/taiki-e/portable-atomic-util) | `0.2.5` | `0.2.6` |


Updates `clap` from 4.5.60 to 4.6.0
- [Release notes](https://github.com/clap-rs/clap/releases)
- [Changelog](https://github.com/clap-rs/clap/blob/master/CHANGELOG.md)
- [Commits](clap-rs/clap@clap_complete-v4.5.60...clap_complete-v4.6.0)

Updates `gdbstub` from 0.7.9 to 0.7.10
- [Release notes](https://github.com/daniel5151/gdbstub/releases)
- [Changelog](https://github.com/daniel5151/gdbstub/blob/master/CHANGELOG.md)
- [Commits](daniel5151/gdbstub@0.7.9...0.7.10)

Updates `gdbstub_arch` from 0.3.2 to 0.3.3
- [Release notes](https://github.com/daniel5151/gdbstub/releases)
- [Changelog](https://github.com/daniel5151/gdbstub/blob/master/CHANGELOG.md)
- [Commits](https://github.com/daniel5151/gdbstub/commits)

Updates `anstyle` from 1.0.13 to 1.0.14
- [Commits](rust-cli/anstyle@v1.0.13...v1.0.14)

Updates `cc` from 1.2.56 to 1.2.57
- [Release notes](https://github.com/rust-lang/cc-rs/releases)
- [Changelog](https://github.com/rust-lang/cc-rs/blob/main/CHANGELOG.md)
- [Commits](rust-lang/cc-rs@cc-v1.2.56...cc-v1.2.57)

Updates `clap_builder` from 4.5.60 to 4.6.0
- [Release notes](https://github.com/clap-rs/clap/releases)
- [Changelog](https://github.com/clap-rs/clap/blob/master/CHANGELOG.md)
- [Commits](clap-rs/clap@v4.5.60...v4.6.0)

Updates `clap_derive` from 4.5.55 to 4.6.0
- [Release notes](https://github.com/clap-rs/clap/releases)
- [Changelog](https://github.com/clap-rs/clap/blob/master/CHANGELOG.md)
- [Commits](clap-rs/clap@v4.5.55...v4.6.0)

Updates `clap_lex` from 1.0.0 to 1.1.0
- [Release notes](https://github.com/clap-rs/clap/releases)
- [Changelog](https://github.com/clap-rs/clap/blob/master/CHANGELOG.md)
- [Commits](clap-rs/clap@clap_lex-v1.0.0...clap_lex-v1.1.0)

Updates `colorchoice` from 1.0.4 to 1.0.5
- [Commits](rust-cli/anstyle@colorchoice-v1.0.4...colorchoice-v1.0.5)

Updates `once_cell` from 1.21.3 to 1.21.4
- [Changelog](https://github.com/matklad/once_cell/blob/master/CHANGELOG.md)
- [Commits](matklad/once_cell@v1.21.3...v1.21.4)

Updates `portable-atomic-util` from 0.2.5 to 0.2.6
- [Release notes](https://github.com/taiki-e/portable-atomic-util/releases)
- [Changelog](https://github.com/taiki-e/portable-atomic-util/blob/main/CHANGELOG.md)
- [Commits](taiki-e/portable-atomic-util@v0.2.5...v0.2.6)

---
updated-dependencies:
- dependency-name: clap
  dependency-version: 4.6.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: firecracker
- dependency-name: gdbstub
  dependency-version: 0.7.10
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: gdbstub_arch
  dependency-version: 0.3.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: anstyle
  dependency-version: 1.0.14
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: cc
  dependency-version: 1.2.57
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: clap_builder
  dependency-version: 4.6.0
  dependency-type: indirect
  update-type: version-update:semver-minor
  dependency-group: firecracker
- dependency-name: clap_derive
  dependency-version: 4.6.0
  dependency-type: indirect
  update-type: version-update:semver-minor
  dependency-group: firecracker
- dependency-name: clap_lex
  dependency-version: 1.1.0
  dependency-type: indirect
  update-type: version-update:semver-minor
  dependency-group: firecracker
- dependency-name: colorchoice
  dependency-version: 1.0.5
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: once_cell
  dependency-version: 1.21.4
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
- dependency-name: portable-atomic-util
  dependency-version: 0.2.6
  dependency-type: indirect
  update-type: version-update:semver-patch
  dependency-group: firecracker
...

Signed-off-by: dependabot[bot] <support@github.com>
Update the release policy to reference the newest 1.14.3 patch release

Signed-off-by: Jack Thomson <jackabt@amazon.com>
Currently, Firecracker sets "FCVMGID" as HID (Hardware ID) and
"VM_Gen_Counter" as CID (Compatible ID).  Linux kernel [1] specifies
"VMGENCTR" and "VM_GEN_COUNTER" as ACPI IDs to bind the driver to the
VMGenID device.

If the VMGenID driver is implemented as platform driver, Linux kernel
checks whether either HID or CID matches the ACPI IDs.  On the other
hand, if implemented as ACPI driver, only HID is checked for the match.
Linux kernel 6.10 [2] re-implemented it from ACPI driver to platform
driver in order to support device tree.  As a result, prior to Linux
kernel 6.10, the driver isn't bound correctly.

We didn't see any issue due to HID mismatch, because we backported the
above kernel patches to our 6.1 guest kernel [3].  VMGenID itself is
only supported since upstream Linux kernel 5.18+ [4] and we don't test
VMGenID on our 5.10 guest kernel.

Note that Amazon Linux-provided microVM kernel 5.10 [5][6] actually
implements VMGenID driver but it is a downstream implementation.  It
is never used by customers (instead SysGenID is used) and it specifies
yet another set of ACPI IDs ("VMGENID" and "QEMUVMGID").  So, more
precisely, that is why we don't test VMGenID on our 5.10 guest kernel.

[1]: https://elixir.bootlin.com/linux/v6.19.7/source/drivers/virt/vmgenid.c#L162-L163
[2]: torvalds/linux@e076067
[3]: https://github.com/firecracker-microvm/firecracker/blob/81236d82b1640cfa41f825f50be5585c758e165b/docs/snapshotting/random-for-clones.md?plain=1#L132-L139
[4]: torvalds/linux@af6b54e
[5]: https://github.com/amazonlinux/linux/blob/microvm-kernel-5.10.245-268.975.amzn2/drivers/virt/vmgenid.c#L121-L122
[6]: amazonlinux/linux@c9b81dc

Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
Although Firecracker doesn't do anything for SysGenID, it is used for
the snapsafety issue by guest userspace in Amazon Linux-provided microVM
kernels [1][2].  Let's test them to detect functional regression that
might be introduced in microVM kernels.

[1]: https://github.com/amazonlinux/linux/blob/microvm-kernel-5.10.245-268.975.amzn2/drivers/misc/sysgenid.c
[2]: https://github.com/amazonlinux/linux/blob/microvm-kernel-6.1.164-23.303.amzn2023/drivers/misc/sysgenid.c
Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
A vulnerability was found in aws-lc-sys [1].  Although Firecracker isn't
affected as it doesn't use AWS-LC to validate CN, update aws-lc-rs (and
aws-lc-sys indirectly) to suppress cargo-audit failure.

[1]: https://rustsec.org/advisories/RUSTSEC-2026-0044.html
Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
Since host kernel 6.3 (commit 7af0c2534f4c), KVM fabricates CLIDR_EL1
instead of passing through the host's real value. On hosts with IDC=1
and DIC=0 (e.g. Neoverse V1), the fabricated CLIDR exposes only
L1=Unified when the host actually has separate L1d+L1i, L2, and L3.

Guest kernels >= 6.1.156 backported init_of_cache_level() which counts
cache leaves from the DT, while populate_cache_leaves() uses CLIDR_EL1.
When the DT (built from host sysfs) describes more cache entries than
CLIDR_EL1, the mismatch causes cache sysfs entries to not be created,
breaking /sys/devices/system/cpu/cpu*/cache/* in the guest.

Fix this by reading the current CLIDR_EL1 from vCPU 0, merging in the
ctype and LoC fields derived from the host's sysfs cache topology, and
writing the result back to each vCPU via KVM_SET_ONE_REG. Fields that
cannot be derived from sysfs (LoUU, LoUIS, ICB, Ttype) are preserved
from the original CLIDR_EL1. This makes CLIDR_EL1 consistent with the
FDT, which already describes the real host caches.

On pre-6.3 kernels, KVM passes through the real host CLIDR rather than
fabricating one. Since the sysfs cache topology already matches the real
CLIDR, the merge produces the same value, the write is skipped, and the
override is effectively a no-op.

This approach preserves the full host cache information for the guest
rather than stripping the FDT to match the fabricated CLIDR.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
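
The merge step in the commit above can be sketched as plain bit manipulation. This is an illustration, not the PR's code: `merge_clidr` is a hypothetical name, and the field positions used here (seven 3-bit Ctype fields in bits [20:0], LoC in bits [26:24]) are taken from the Arm architecture's CLIDR_EL1 definition as an assumption to be checked against the reference manual.

```rust
// Assumed CLIDR_EL1 layout: Ctype1..Ctype7 occupy bits [20:0], 3 bits each;
// LoC occupies bits [26:24]. All other fields are left untouched.
const CTYPE_BITS: u32 = 3;
const CTYPE_LEVELS: usize = 7;
const CTYPE_MASK: u64 = (1 << 21) - 1;
const LOC_SHIFT: u32 = 24;
const LOC_MASK: u64 = 0b111 << LOC_SHIFT;

/// Hypothetical helper: replace the Ctype and LoC fields of `current` with
/// values derived from the host cache topology, preserving every other bit
/// (LoUU, LoUIS, ICB, Ttype).
fn merge_clidr(current: u64, host_ctypes: &[u8], host_loc: u8) -> u64 {
    let mut ctype_field = 0u64;
    for (level, ctype) in host_ctypes.iter().take(CTYPE_LEVELS).enumerate() {
        ctype_field |= u64::from(*ctype & 0b111) << (CTYPE_BITS * level as u32);
    }
    let loc_field = (u64::from(host_loc) & 0b111) << LOC_SHIFT;
    (current & !(CTYPE_MASK | LOC_MASK)) | ctype_field | loc_field
}
```

If the merged value equals the value read from the vCPU, the write can be skipped, which corresponds to the no-op case the commit describes for pre-6.3 hosts.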
There's currently no check when creating a BusRange object that the
range is valid. Add a new() method that makes sure that the base + len
arguments don't overflow and that length is non-zero. To make things
simpler and avoid potential overflows in the future store the end
address in BusRange, rather than a length to avoid extra checks for when
base+len is exactly (1<<64). Finally, make the base and end fields
private so that new BusRange objects can only be created via the
constructor (and hence are always valid) and add a test case for
verifying valid ranges.

Similarly, make BusRange::overlaps() take a BusRange rather than a base
and len. BusRange::overlaps() had the potential of overflowing without
this change, however that wouldn't happen in practice because the
resource allocator generates valid ranges.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
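
A sketch of what such a validated range type could look like. The type and method names follow the commit message, but the body is an assumption: in particular, this version stores an inclusive end address, which is one way to represent a range ending exactly at 1<<64 without overflow.

```rust
/// Sketch of a validated bus range. Fields are private so that instances
/// can only be built through `new()`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BusRange {
    base: u64,
    /// Inclusive last address of the range (assumption of this sketch).
    end: u64,
}

impl BusRange {
    /// Returns `None` for a zero-length range or when `base + len - 1`
    /// does not fit in a u64.
    pub fn new(base: u64, len: u64) -> Option<Self> {
        if len == 0 {
            return None;
        }
        let end = base.checked_add(len - 1)?;
        Some(Self { base, end })
    }

    /// Two valid ranges overlap when each one starts at or before the
    /// other's last address. No arithmetic is needed, so nothing can overflow.
    pub fn overlaps(&self, other: &BusRange) -> bool {
        self.base <= other.end && other.base <= self.end
    }
}
```

Because `overlaps()` only compares stored addresses, the overflow concern mentioned in the commit disappears once construction is validated.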
With the previous loop the uVMs were not being killed, so the growing
number of actively running uVMs put pressure on the host.

The timing of hot(un)plug operations gradually increased with each
iteration as more pressure was created.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
We generate multiple VMs in a loop; however, the factory holds a
reference to each of them, so they are not killed when they go out of scope.

Explicitly kill the vm after each loop.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
We generate multiple VMs in a loop; however, the factory holds a
reference to each of them, so they are not killed when they go out of scope.

Explicitly kill the vm after each loop.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
@codecov

codecov Bot commented Mar 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.06%. Comparing base (0d74254) to head (c6c3c63).
⚠️ Report is 14 commits behind head on feature/secret-hiding.

Additional details and impacted files
@@                    Coverage Diff                    @@
##           feature/secret-hiding    #5792      +/-   ##
=========================================================
- Coverage                  82.03%   81.06%   -0.98%     
=========================================================
  Files                        277      278       +1     
  Lines                      30093    30821     +728     
=========================================================
+ Hits                       24688    24984     +296     
- Misses                      5405     5837     +432     

| Flag | Coverage | Δ |
| --- | --- | --- |
| 5.10-m5n.metal | 81.18% <ø> | (-1.04%) ⬇️ |
| 5.10-m6a.metal | 80.44% <ø> | (-1.04%) ⬇️ |
| 5.10-m6g.metal | 77.93% <ø> | (-1.11%) ⬇️ |
| 5.10-m6i.metal | 81.15% <ø> | (-1.03%) ⬇️ |
| 5.10-m7a.metal-48xl | 80.43% <ø> | (-1.04%) ⬇️ |
| 5.10-m7g.metal | 77.93% <ø> | (-1.11%) ⬇️ |
| 5.10-m7i.metal-24xl | 81.13% <ø> | (-1.02%) ⬇️ |
| 5.10-m7i.metal-48xl | 81.13% <ø> | (-1.02%) ⬇️ |
| 5.10-m8g.metal-24xl | 77.93% <ø> | (-1.11%) ⬇️ |
| 5.10-m8g.metal-48xl | 77.93% <ø> | (-1.11%) ⬇️ |
| 5.10-m8i.metal-48xl | 81.16% <ø> | (?) |
| 5.10-m8i.metal-96xl | 81.16% <ø> | (?) |
| 6.1-m5n.metal | 81.17% <ø> | (-1.03%) ⬇️ |
| 6.1-m6a.metal | 80.46% <ø> | (-1.05%) ⬇️ |
| 6.1-m6g.metal | 77.93% <ø> | (-1.11%) ⬇️ |
| 6.1-m6i.metal | 81.18% <ø> | (-1.07%) ⬇️ |
| 6.1-m7a.metal-48xl | 80.45% <ø> | (-1.04%) ⬇️ |
| 6.1-m7g.metal | 77.93% <ø> | (-1.11%) ⬇️ |
| 6.1-m7i.metal-24xl | 81.19% <ø> | (-1.06%) ⬇️ |
| 6.1-m7i.metal-48xl | 81.19% <ø> | (-1.02%) ⬇️ |
| 6.1-m8g.metal-24xl | 77.93% <ø> | (-1.11%) ⬇️ |
| 6.1-m8g.metal-48xl | 77.93% <ø> | (-1.11%) ⬇️ |
| 6.1-m8i.metal-48xl | 81.22% <ø> | (?) |
| 6.1-m8i.metal-96xl | 81.22% <ø> | (?) |
| 6.18-m5n.metal | 81.18% <ø> | (-1.01%) ⬇️ |
| 6.18-m6a.metal | 80.46% <ø> | (-1.05%) ⬇️ |
| 6.18-m6g.metal | 77.93% <ø> | (-1.11%) ⬇️ |
| 6.18-m6i.metal | 81.17% <ø> | (-1.02%) ⬇️ |
| 6.18-m7a.metal-48xl | 80.45% <ø> | (-1.05%) ⬇️ |
| 6.18-m7g.metal | 77.93% <ø> | (-1.11%) ⬇️ |
| 6.18-m7i.metal-24xl | 81.19% <ø> | (-1.03%) ⬇️ |
| 6.18-m7i.metal-48xl | 81.19% <ø> | (-1.03%) ⬇️ |
| 6.18-m8g.metal-24xl | 77.93% <ø> | (-1.11%) ⬇️ |
| 6.18-m8g.metal-48xl | 77.93% <ø> | (-1.11%) ⬇️ |
| 6.18-m8i.metal-48xl | 81.22% <ø> | (?) |
| 6.18-m8i.metal-96xl | 81.22% <ø> | (?) |

@JackThomson2 JackThomson2 force-pushed the sh/support_punch_hole branch 3 times, most recently from fa4fe9d to eab0a13 Compare March 26, 2026 12:18
The previous implementation checked whether either slot endpoint fell
inside the requested range. This missed the containment case where a
slot fully contains the range (neither endpoint inside it), causing
update_kvm_slots to silently skip KVM slot registration/removal for any
block not aligned to a slot boundary.

Replace the two addr_in_range endpoint checks with a proper half-open
interval intersection test: slot_start < range_end && range_start <
slot_end.

Remove the now-unused addr_in_range helper and add a table-driven unit
test covering boundary, interior, cross-slot, full-region, outside, and
zero-length ranges.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
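
The difference between the two checks can be illustrated with a short sketch. `ranges_intersect` uses the exact condition quoted in the commit; `endpoint_only_check` is a hypothetical reconstruction of the old behaviour, written only to show the missed containment case, and may not match the removed `addr_in_range` helper in detail.

```rust
/// Half-open interval intersection: [slot_start, slot_end) and
/// [range_start, range_end) share at least one address.
fn ranges_intersect(slot_start: u64, slot_end: u64, range_start: u64, range_end: u64) -> bool {
    slot_start < range_end && range_start < slot_end
}

/// Hypothetical reconstruction of the previous approach: report a match only
/// if one of the slot's endpoints lies inside the requested range.
fn endpoint_only_check(slot_start: u64, slot_end: u64, range_start: u64, range_end: u64) -> bool {
    let in_range = |addr: u64| range_start <= addr && addr < range_end;
    in_range(slot_start) || in_range(slot_end)
}
```

For a slot covering `[0x0, 0x1000)` and a requested range `[0x400, 0x800)`, the slot fully contains the range: the intersection test reports an overlap, while the endpoint-only check does not.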
process_stats_queue() used the guest-provided descriptor len field as
the loop bound without validation. A misbehaving guest could set this to
u32::MAX, causing excessive iterations that temporarily monopolise the
VMM event loop.

Add a MAX_STATS_DESC_LEN check before entering the loop. The limit uses
a generous upper bound (256 tags) rather than the current spec count, so
future kernel additions won't silently break stats collection. Oversized
descriptors are logged and held without updating stats, preserving the
stats request/response protocol.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
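
A minimal sketch of the bound check described above. `MAX_STATS_DESC_LEN` and the 256-tag limit come from the commit message; the 10-byte entry size assumes the packed virtio balloon stat layout (a 16-bit tag followed by a 64-bit value), and `stats_entries_to_process` is a hypothetical helper rather than the PR's code.

```rust
/// Assumed size of one stats entry: u16 tag + u64 value, packed.
const STAT_ENTRY_SIZE: u32 = 10;
/// Generous upper bound on the number of tags, per the commit message.
const MAX_STATS_TAGS: u32 = 256;
const MAX_STATS_DESC_LEN: u32 = MAX_STATS_TAGS * STAT_ENTRY_SIZE;

/// Hypothetical helper: validate the guest-provided descriptor length before
/// it is used as a loop bound. `None` means the descriptor is oversized and
/// should be logged and held without updating stats.
fn stats_entries_to_process(desc_len: u32) -> Option<u32> {
    if desc_len > MAX_STATS_DESC_LEN {
        return None;
    }
    Some(desc_len / STAT_ENTRY_SIZE)
}
```

With this check in place, a descriptor length of `u32::MAX` is rejected up front instead of driving hundreds of millions of loop iterations.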
When a non-compliant driver submits more than one stats buffer,
process_stats_queue returns the previous descriptor via add_used but
never calls advance_used_ring_idx or signal_used_queue. The write to the
used ring is therefore invisible to the guest, which can never reclaim
the buffer.

Add the missing advance_used_ring_idx and signal_used_queue calls so the
guest actually sees the returned descriptor.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
kalyazin and others added 25 commits April 28, 2026 16:16
This updates UFFD patches from "v3 UFFD minor support" to Mike's
respin RFC that adds support for both major and minor faults.
It also adds a missing NUMA dependency patch that exports MM symbols to
KVM and a fixup for UFFD series that does the same.

Current patch set (based on v6.18):
 - NUMA support in guest_memfd (from v6.19-rc5), new: missing patch
 - v10 direct map removal + fixup
 - x86: configurable TLB flushes after direct map removal
 - v2 kvmclock
 - v1 KVM userfault
 - v7 write syscall
 - fixup for direct map removal and write to work together
 - RFC UFFD support with fixups (new)
 - PoC srcu_synchronize optimisation from Sean

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
This is because guest_memfd now supports both modes.
We still need minor handling in case prepopulation logic adds a page in
the page cache and the VMM accesses it later via user mappings.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Now that guest_memfd has support for major faults, we can resolve all
on-demand user mapping faults with UFFDIO_COPY.  Remove the code that
uses memcpy memory population.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
With major fault support added to guest_memfd, the kernel no longer adds
a page to the page cache when a major fault is generated.  It means
that proactive population with write() will succeed for the entire
region except possibly one page (kvmclock on x86).  We just need to call
UFFDIO_CONTINUE in response to such a major fault because we already
populated the faulting page via write() earlier.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
This is because when upgrading the guest kernel from 6.1.155 to 6.1.163,
/sys/devices/system/cpu/cpu?/cache/* files disappeared.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
- Drop the avoidance of silent kvm-clock activation failure as Sean
  commented
- Fix a compile error when CONFIG_KVM_SW_PROTECTED_VM=y

Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
Add the APF patch series to be built into our SH AMI.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
This reverts commit 5417c27.

Reverting for now while we investigate the regressions in hotplugging

Signed-off-by: Jack Thomson <jackabt@amazon.com>
 - Upgrade UFFD patches from RFC to v1 including fixes from Harry,
   Edward and Mike
 - Replace Sean's SRCU fix PoC with his RFC

Current patch set (based on v6.18):
 - NUMA support in guest_memfd (from v6.19-rc5)
 - v10 direct map removal + fixup
 - x86: configurable TLB flushes after direct map removal
 - v3 kvmclock
 - v1 KVM userfault
 - v7 write syscall
 - fixup for direct map removal and write to work together
 - v1 UFFD support with fixups (new: RFC -> v1)
 - RFC srcu_synchronize optimisation from Sean (new: PoC -> RFC)

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Current patch set (based on v6.18):
 - NUMA support in guest_memfd (from v6.19-rc5)
 - v11 direct map removal (new: v10 -> v11)
 - x86: configurable TLB flushes after direct map removal
 - v3 kvmclock
 - v1 KVM userfault
 - v7 write syscall
 - fixup for direct map removal and write to work together
 - v2 UFFD support with fixups
 - RFC srcu_synchronize optimisation from Sean

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Add APF patches for the kernel

Signed-off-by: Jack Thomson <jackabt@amazon.com>
Replace the JSON-based UFFD protocol with length-prefixed bitcode
encoding for better performance and type safety. Extract
UffdMessageBroker from lib.rs into its own module (uffd_broker.rs)
with proper error handling — send_fault_request() returns Result
instead of panicking, and the iterator logs errors instead of
unwinding.

Also adds APF socket support to UFFD handler examples and exitless
APF ring buffer structures to the handler side.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
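
To illustrate the framing idea only, here is a sketch of length-prefixed messages over any `Read`/`Write` stream. The payload is treated as opaque bytes (the PR encodes it with bitcode, which is not shown), and the 4-byte little-endian prefix, the `MAX_FRAME_LEN` cap, and the function names are all assumptions of this sketch.

```rust
use std::io::{self, Read, Write};

/// Hypothetical cap so a corrupt length prefix cannot trigger a huge allocation.
const MAX_FRAME_LEN: u32 = 1 << 20;

/// Write one frame: a 4-byte little-endian length followed by the payload.
fn write_frame<W: Write>(writer: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > MAX_FRAME_LEN as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "payload too large"));
    }
    let len = payload.len() as u32;
    writer.write_all(&len.to_le_bytes())?;
    writer.write_all(payload)
}

/// Read one frame, returning an error instead of panicking on truncated or
/// oversized input.
fn read_frame<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    reader.read_exact(&mut len_buf)?;
    let len = u32::from_le_bytes(len_buf);
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame too large"));
    }
    let mut payload = vec![0u8; len as usize];
    reader.read_exact(&mut payload)?;
    Ok(payload)
}
```

Returning `io::Result` from both directions mirrors the commit's move away from panicking: a caller can log a malformed frame and carry on.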
Add the synchronous/fallback path for KVM async page faults. When a
guest vCPU hits a userfault page and the exitless ring buffer is full
(or not configured), KVM exits with KVM_EXIT_MEMORY_FAULT +
KVM_MEMORY_EXIT_FLAG_APF. The VMM sends the fault request to the UFFD
handler over a Unix socket, issues KVM_APF_OP_ACCEPT so the vCPU can
re-enter in a halted state, and processes the handler's reply with
KVM_APF_OP_READY to wake the vCPU.

Includes KvmAPFReq ioctl definitions, APF handling in vcpu
handle_userfault() (gated to x86_64), SharedApfStream type, updated
create_vcpus signature, and seccomp rules for the new ioctls.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
Add the fast path for async page faults that avoids KVM_RUN exit/
re-entry overhead. Each vCPU gets an ExitlessApfContext backed by a
memfd-mapped shared page containing notify and completion ring
buffers (32 entries each). KVM writes fault GPAs to the notify ring
and signals an eventfd; the UFFD handler resolves the page and
writes to the completion ring; KVM drains completions via a
workqueue and wakes the halted vCPU.

Three fds per vCPU (notify eventfd, completion eventfd, shared page
memfd) are passed to the UFFD handler via SCM_RIGHTS over the
existing Unix socket.

Key design: exitless APF setup and handler unblocking happens BEFORE
vCPU state restore, since MSR restore (kvm-clock) triggers userfaults
on UFFD-registered memory that the handler must be ready to service.

APF capability is detected at runtime — gracefully falls back to
synchronous UFFD on kernels without KVM_CAP_ASYNC_PF_USERFAULT.

Fix default_vmm() test helper to create a real pipe for
_apf_pipe_reader instead of stealing fd 0 (stdin), which caused an
IO Safety violation (double-close SIGABRT) during test runs.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
GUEST_MEMFD_FLAG_WRITE was accidentally added to the booted path during
a rebase. Remove it to fix up the booted path.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
Previously we never passed the APF socket to the handler, so we weren't
using APF properly.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
Add integration and performance tests for APF to compare the perf impact

Signed-off-by: Jack Thomson <jackabt@amazon.com>
There were a couple of errors on the non-exitless path for APF. We saw
these with memory hotplug, which flooded the APF buffer.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
Holding the spinlock while calling the dequeue method caused massive
regressions in the hotplugging tests.

The root cause was the spinlock disabling preemption, which had a
knock-on effect when calling dequeue.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
Current patch set (based on v6.18):
 - NUMA support in guest_memfd (from v6.19-rc5)
 - v12 direct map removal (new: v11 -> v12)
 - x86: configurable TLB flushes after direct map removal
 - v3 kvmclock
 - v1 KVM userfault
 - v7 write syscall
 - fixup for direct map removal and write to work together
 - v2 UFFD support with fixups
 - RFC srcu_synchronize optimisation from Sean
 - Tmp RFCv2 Async PF (non-published)

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
MADV_DONTNEED is a no-op for MAP_SHARED mappings, which means
discard_range() previously did nothing for guest_memfd-backed memory.
This prevented virtio-mem unplug and balloon inflate from actually
freeing physical pages back to the host when secret_free is enabled.

Add a fallocate(FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE) path for
MAP_SHARED file-backed regions, which punches holes in the guest_memfd
backing file and releases the pages from the page cache.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
Now that discard_range() uses fallocate(PUNCH_HOLE) for MAP_SHARED
guest_memfd regions, the balloon can properly reclaim memory when
secret_free is enabled. Remove the restriction that prevented
combining balloon with secret_free.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
When secret_free is enabled, guest memory is backed by guest_memfd.
Hotplug memory regions are never mapped into the VMM's userspace, so
host VmRSS doesn't reflect those pages. Guest-side (total - available)
is too noisy because mem_available tracks mem_total closely when the
guest is idle, so unplug doesn't produce a reliable delta.

Use the FC process's cgroup memory.current, which accounts for kernel
pages allocated by guest_memfd and correctly drops after
fallocate(PUNCH_HOLE). This gives a monotonic, block-accurate signal
for virtio-mem unplug and balloon inflate under secret_free.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
Add the secret_free fixture parameter to all balloon functional tests,
so they run with both SF_OFF and SF_ON variants. This exercises the
fallocate(PUNCH_HOLE) discard path for guest_memfd-backed memory during
balloon inflate/deflate.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
Now that discard_range() uses fallocate(PUNCH_HOLE) for guest_memfd,
and get_resident_memory() uses guest meminfo for secret_free VMs,
the RSS decrease assertion in check_hotunplug works correctly for
secret_free. Remove the skip.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
After discard_range() punches a hole in a guest_memfd-backed region via
fallocate(PUNCH_HOLE), the folios for those pages are released. The
userfault bitmap bits for those pages, however, may already be cleared
from a previous UFFDIO_COPY (the handler clears the bit on successful
copy). With a cleared bit and no folio, the next guest access takes the
UFFD MINOR-fault path and the handler calls UFFDIO_CONTINUE, which
returns EFAULT because there is no folio to install.

Re-set the userfault bitmap bits covering the punched range so the next
access takes the MISSING-fault path and the handler re-populates via
UFFDIO_COPY.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
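
The bitmap update described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: the bitmap is modelled as a slice of `u64` words with one bit per 4 KiB page, a set bit meaning "fault to userspace", and `set_userfault_bits` is a hypothetical name.

```rust
/// Assumed page size for the bitmap granularity.
const PAGE_SIZE: u64 = 4096;

/// Hypothetical helper: set the bit for every page touched by
/// `[offset, offset + len)` so the next access is reported as missing.
/// Panics if the range falls outside the bitmap.
fn set_userfault_bits(bitmap: &mut [u64], offset: u64, len: u64) {
    if len == 0 {
        return;
    }
    let first_page = offset / PAGE_SIZE;
    let last_page = (offset + len - 1) / PAGE_SIZE;
    for page in first_page..=last_page {
        let word = (page / 64) as usize;
        bitmap[word] |= 1u64 << (page % 64);
    }
}
```

Calling this with the same offset and length that were passed to the hole punch keeps the bitmap and the page cache in agreement: no folio and a set bit, rather than no folio and a cleared bit.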
@JackThomson2 JackThomson2 marked this pull request as draft May 14, 2026 15:09