Skip to content

Fix: Prevent infinite idle holdoff loop blocking D3cold re-entry after AC replug#1074

Open
DhineshPonnarasan wants to merge 2 commits intoNVIDIA:mainfrom
DhineshPonnarasan:fix/rtd3-d3cold-reentry-issue-1071
Open

Fix: Prevent infinite idle holdoff loop blocking D3cold re-entry after AC replug#1074
DhineshPonnarasan wants to merge 2 commits intoNVIDIA:mainfrom
DhineshPonnarasan:fix/rtd3-d3cold-reentry-issue-1071

Conversation

@DhineshPonnarasan
Copy link
Copy Markdown

@DhineshPonnarasan DhineshPonnarasan commented Mar 23, 2026

Problem

On Turing GPUs with:

  • NVreg_DynamicPowerManagement=0x02 (FINE mode)
  • NVreg_EnableGpuFirmware=0 (GSP disabled)

the GPU fails to return to D3cold after AC power is reinserted.

Observed behavior

  • Boot (AC connected) → GPU correctly enters D3cold
  • Unplug AC → GPU behavior remains correct
  • Reinsert AC → GPU transitions to D0
  • GPU never returns to D3cold afterward

This results in persistent power usage (~7–10W) until reboot.


Root Cause Analysis

The issue originates in the idle holdoff removal logic inside: RmRemoveIdleHoldoff()

Failure scenario

After AC replug:

  • GPU transitions to D0 and enters an active state
  • Idle detection relies on:
    • RmCheckForGcxSupportOnCurrentState()
    • idle_precondition_check_callback_scheduled

In AC-powered mode:

  • GC6 (deep idle) may be unavailable
  • RmCheckForGcxSupportOnCurrentState() repeatedly returns false

As a result, the following loop occurs:
RmRemoveIdleHoldoff()
→ GC6 not available
→ idle precondition callback not scheduled
→ reschedule RmRemoveIdleHoldoff()
→ repeat indefinitely

Consequence

  • nv_indicate_idle() is never called
  • pm_runtime_put_noidle() is never triggered
  • Runtime suspend is never reached
  • GPU remains stuck in D0

Solution

Introduce a bounded retry mechanism to break the infinite rescheduling loop.

Key idea

Allow a limited number of retries for GC6 eligibility, then force idle indication.

Implementation details

  • Add a counter: idle_holdoff_reschedule_count

  • Define a retry limit: MAX_IDLE_HOLDOFF_RESCHEDULES (e.g., 4)

  • Modify RmRemoveIdleHoldoff()

Behavior

  • If GC6 becomes available OR idle preconditions are met:

  • Proceed normally

  • Call nv_indicate_idle()

  • Reset counter

  • If GC6 is still unavailable:

  • Retry up to N times (~20 seconds total)

  • After threshold is reached:

  • Force nv_indicate_idle()

  • Reset counter

  • Allow autosuspend fallback


Why this works

The fix ensures:

  • Infinite rescheduling is eliminated
  • Idle indication is eventually triggered
  • Runtime PM flow resumes correctly

Resulting flow

nv_indicate_idle()
→ pm_runtime_put_noidle()
→ runtime suspend scheduled
→ nv_pmops_runtime_suspend()
→ GPU transitions to D3cold


Safety & Impact Analysis

No functional regression

  • Battery mode behavior unchanged
  • GC6-enabled systems unaffected
  • Default and disabled modes unaffected

Minimal scope

  • Change localized to RmRemoveIdleHoldoff()
  • No modification to core RM or PM logic

Safe fallback behavior

  • Only triggers when GC6 is persistently unavailable
  • Uses existing autosuspend path
  • Avoids introducing new power states or transitions

Verification

This fix has been:

  • Verified via detailed static analysis of execution flow
  • Validated for:
    • loop termination
    • correct counter handling
    • safe state transitions
    • absence of race conditions

Note:
This change has not been tested on real hardware due to environment limitations (WSL).


Expected Outcome

After applying this fix:

  • GPU may wake to D0 on AC replug
  • After ~20 seconds of inactivity:
    • GPU correctly returns to D3cold
  • Eliminates persistent power drain

Request for Validation

Testing on affected systems (Turing + RTD3 FINE mode) would be greatly appreciated to confirm:

  • D3cold re-entry after AC replug
  • No regressions in suspend/resume or idle behavior

References

Related components:

  • dynamic-power.c
  • runtime PM (RTD3)
  • GC6 idle state handling

Summary

This change resolves a timing-dependent infinite rescheduling condition by introducing a bounded retry mechanism, ensuring the GPU can always re-enter a low-power state even when GC6 is unavailable.


Hi @dagbdagb ,
Please review these changes and let me know if any further modifications are needed. If you notice any issues, please leave a comment below and I’ll address them. Thank you!

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Mar 23, 2026

CLA assistant check
All committers have signed the CLA.

@DhineshPonnarasan DhineshPonnarasan marked this pull request as ready for review March 23, 2026 06:12
@dagbdagb
Copy link
Copy Markdown

dagbdagb commented Mar 23, 2026

The suggested patch applies cleanly on top of main, but it does not build for me.

 [ nvidia            ]  CC           arch/nvalloc/unix/src/osapi.c
 [ nvidia            ]  CC           arch/nvalloc/unix/src/osinit.c
arch/nvalloc/unix/src/dynamic-power.c: In function ‘RmCanEnterGcxUnderGpuLock’:
arch/nvalloc/unix/src/dynamic-power.c:326:48: error: ‘nv_dynamic_power_t’ {aka ‘struct nv_dynamic_power_s’} has no member named ‘gcoff_max_fb_size’
  326 |               (usedFbSize <= nvp->dynamic_power.gcoff_max_fb_size) &&
      |                                                ^
arch/nvalloc/unix/src/dynamic-power.c:327:34: error: ‘nv_dynamic_power_t’ {aka ‘struct nv_dynamic_power_s’} has no member named ‘clients_gcoff_disallow_refcount’
  327 |               (nvp->dynamic_power.clients_gcoff_disallow_refcount == 0)))
      |                                  ^
arch/nvalloc/unix/src/dynamic-power.c: In function ‘osClientGcoffDisallowRefcount’:
arch/nvalloc/unix/src/dynamic-power.c:690:27: error: ‘nv_dynamic_power_t’ {aka ‘struct nv_dynamic_power_s’} has no member named ‘clients_gcoff_disallow_refcount’
  690 |         nvp->dynamic_power.clients_gcoff_disallow_refcount++;
      |                           ^
arch/nvalloc/unix/src/dynamic-power.c:694:27: error: ‘nv_dynamic_power_t’ {aka ‘struct nv_dynamic_power_s’} has no member named ‘clients_gcoff_disallow_refcount’
  694 |         nvp->dynamic_power.clients_gcoff_disallow_refcount--;
      |                           ^
arch/nvalloc/unix/src/dynamic-power.c: In function ‘rm_init_dynamic_power_management’:
arch/nvalloc/unix/src/dynamic-power.c:935:23: error: ‘nv_dynamic_power_t’ {aka ‘struct nv_dynamic_power_s’} has no member named ‘gcoff_max_fb_size’
  935 |     nvp->dynamic_power.gcoff_max_fb_size =
      |                       ^
arch/nvalloc/unix/src/dynamic-power.c: In function ‘RmInitDeferredDynamicPowerManagement’:
arch/nvalloc/unix/src/dynamic-power.c:2202:31: error: ‘nv_dynamic_power_t’ {aka ‘struct nv_dynamic_power_s’} has no member named ‘clients_gcoff_disallow_refcount’
 2202 |             nvp->dynamic_power.clients_gcoff_disallow_refcount = 0;
      |                               ^
arch/nvalloc/unix/src/dynamic-power.c: In function ‘RmCheckForGcOffPM’:
 [ nvidia-modeset    ]  CC           _out/Linux_x86_64/g_nvid_string.c
arch/nvalloc/unix/src/dynamic-power.c:2244:31: error: ‘nv_dynamic_power_t’ {aka ‘struct nv_dynamic_power_s’} has no member named ‘clients_gcoff_disallow_refcount’
 2244 |         if (nvp->dynamic_power.clients_gcoff_disallow_refcount != 0)
      |                               ^
arch/nvalloc/unix/src/dynamic-power.c:2247:47: error: ‘nv_dynamic_power_t’ {aka ‘struct nv_dynamic_power_s’} has no member named ‘gcoff_max_fb_size’
 2247 |         gcoff_max_fb_size = nvp->dynamic_power.gcoff_max_fb_size;
      |                                               ^
 [ nvidia-modeset    ]  LD           _out/Linux_x86_64/nv-modeset-kernel.o
make[1]: *** [Makefile:203: _out/Linux_x86_64/dynamic-power.o] Error 1
make[1]: *** Waiting for unfinished jobs....
make[1]: Leaving directory '/home/dagb/gits/open-gpu-kernel-modules/src/nvidia-modeset'
cd kernel-open/nvidia-modeset/ && ln -sf ../../src/nvidia-modeset/_out/Linux_x86_64/nv-modeset-kernel.o nv-modeset-kernel.o_binary
make[1]: Leaving directory '/home/dagb/gits/open-gpu-kernel-modules/src/nvidia'
make: *** [Makefile:34: src/nvidia/_out/Linux_x86_64/nv-kernel.o] Error 2

If this is sufficient to get my hardware to behave nicely with 590/main and GSP firmware, I would of course use it.
But out of the box, I have had more success with 580.

My observations and workarounds in #1071 are made in the context of 580 and with firmware loading disabled.

@DhineshPonnarasan
Copy link
Copy Markdown
Author

Hi @dagbdagb ,
Thanks for catching the build break.
You were right: the issue was that two existing members in the dynamic power struct were accidentally dropped while adding the new retry counter.

I have restored both fields in nv-priv.h:

  • clients_gcoff_disallow_refcount
  • gcoff_max_fb_size

So the references in dynamic-power.c now resolve again.

Could you please pull the latest commit from PR #1074 and rebuild?
The previous missing-member errors in dynamic-power.c should be gone.

Also, thanks for the note about 580 and firmware-disabled behavior in #1071. That context is very useful and I’ll keep it in mind while validating this on 590/main.

@dagbdagb
Copy link
Copy Markdown

dagbdagb commented Mar 23, 2026

First things first:
I can now pull/reinsert power and have the card come back in d3cold.

Second:
I have not actually verified if this patch was what fixed it, but the RTD3 kernel messages are massively helpful.

I did as follows:

  1. removed all installed nvidia-drivers
  2. pulled open-gpu-kernel-modules from GH and merged this PR on top
  3. built the open-gpu-kernel-modules drivers package make modules -j$(nproc)
  4. installed the nvidia-drivers package (595.45.04)
  5. deleted the 5 nvidia*.ko drivers in /lib/modules/linux......
  6. installed the newly built kernel drivers in this repo: make modules_install -j$(nproc)
  7. reboot

With this, I saw the following:

Booting ok:
[    1.588070] nvidia: loading out-of-tree module taints kernel.
[    1.603218] nvidia-nvlink: Nvlink Core is being initialized, major device number 243
[    1.606502] nvidia 0000:01:00.0: enabling device (0000 -> 0003)
[    1.608927] nvidia 0000:01:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[    1.659522] nvidia 0000:01:00.0: NVRM: [RTD3] nv_indicate_idle: pm_runtime_put_noidle, usage_count=1
[    1.710253] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  595.45.04  Release Build  (dagb@gillette)  ma. 23. mars 16:15:51 +0100 2026
[    1.715624] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[    1.716201] [drm] Initialized nvidia-drm 0.0.0 for 0000:01:00.0 on minor 2
[    2.714169] nvidia 0000:01:00.0: NVRM: [RTD3] pmops_runtime_suspend: entry, usage_count=0
[    2.714766] nvidia 0000:01:00.0: NVRM: [RTD3] transition_dynamic_power: enter(suspend) skipped (not initialized)
[    2.715416] nvidia 0000:01:00.0: NVRM: [RTD3] pmops_runtime_suspend: exit ok, err=0

(card in d3cold at this time)

pulling out power
reinserting power
[  130.777115] nvidia 0000:01:00.0: NVRM: [RTD3] pmops_runtime_resume: entry, usage_count=0
[  130.777129] nvidia 0000:01:00.0: NVRM: [RTD3] transition_dynamic_power: exit(resume) skipped (not initialized)
[  130.777145] nvidia 0000:01:00.0: NVRM: [RTD3] pmops_runtime_suspend: entry, usage_count=0
[  130.777150] nvidia 0000:01:00.0: NVRM: [RTD3] transition_dynamic_power: enter(suspend) skipped (not initialized)
[  130.777156] nvidia 0000:01:00.0: NVRM: [RTD3] pmops_runtime_suspend: exit ok, err=0
[  130.903665] nvidia 0000:01:00.0: NVRM: [RTD3] pmops_runtime_resume: entry, usage_count=1
[  130.903680] nvidia 0000:01:00.0: NVRM: [RTD3] transition_dynamic_power: exit(resume) skipped (not initialized)

(card in d0 at this time)


starting llama.cpp

[  214.460616] nvidia 0000:01:00.0: NVRM: [RTD3] nv_indicate_not_idle: pm_runtime_get_noresume, usage_count=2
[  214.461140] Loading firmware: nvidia/595.45.04/gsp_tu10x.bin
[  214.513509] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20250807/nsarguments-61)
[  214.513776] ACPI Warning: \_SB.NPCF._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20250807/nsarguments-61)
[  215.447652] NVRM: GPU0 nvAssertOkFailedNoLog: Assertion failed: Invalid data passed [NV_ERR_INVALID_DATA] (0x00000025) returned from PlatformRequestHandler failed to get target temp from SBIOS @ platform_request_handler_ctrl.c:2171
[  215.447662] NVRM: GPU0 nvAssertOkFailedNoLog: Assertion failed: Invalid data passed [NV_ERR_INVALID_DATA] (0x00000025) returned from PlatformRequestHandler failed to get platform power mode from SBIOS @ platform_request_handler_ctrl.c:2114

exiting llama.cpp

[  324.059517] llama-server (1520) used greatest stack depth: 7320 bytes left
[  324.431218] nvidia 0000:01:00.0: NVRM: [RTD3] nv_indicate_idle: pm_runtime_put_noidle, usage_count=1

(card remains in d0)

A couple of observations at this point in time:

  • firmware loading is enforced, but delayed. NVreg_EnableGpuFirmware=0 is silently ignored.
    (the README in this repo appears to state that the firmware now is mandatory)
  • My UEFI firmware is slightly buggy(?)
  • something appears to start talking to the dGPU on power insert

I spent all of 14 seconds scanning my process list, before I realized TLP was running.
I deinstalled TLP and rebooted.

After reboot, when I unplug/replug power, dmesg is quiet.

And the dGPU now remains in d3cold.

Additonal testing:

  • suspend when dGPU is in use:
    works only after setting NVreg_PreserveVideoMemoryAllocations=0
    (will not suspend at all if set to 1, I think this may be documented somewhere)
  • if dGPU isn't in use, dGPU comes back in d3cold after having been suspended
  • /proc/driver/nvidia/gpus/0000\:01\:00.0/power is/becomes confused:
cat /proc/driver/nvidia/gpus/0000\:01\:00.0/power
Runtime D3 status:          ?
Tegra iGPU Rail-Gating:     Disabled
Video Memory:               ?

GPU Hardware Support:
 Video Memory Self Refresh: ?
 Video Memory Off:          ?

S0ix Power Management:
 Platform Support:          Not Supported
 Status:                    ?

Notebook Dynamic Boost:     ?

@dagbdagb
Copy link
Copy Markdown

Please let me know what you want me to test, and in what sequence.

@DhineshPonnarasan
Copy link
Copy Markdown
Author

DhineshPonnarasan commented Mar 24, 2026

Hi @dagbdagb ,
Thanks a lot for the detailed testing and logs, this is extremely helpful.

Your results already show a strong improvement (power reinsert can return to d3cold), but because tlp and workload activity were confounding factors, I would really appreciate one final controlled run to isolate this PR’s behavior.

Please run this exact sequence on the latest PR head and share results:

  1. Environment

    • Confirm kernel version, GPU model, firmware mode (EnableGpuFirmware=0 or default), and whether tlp is removed/disabled.
  2. Clean rebuild sanity

    • Build from latest PR tip with clean tree.
    • Expected: no missing struct field build errors.
  3. Idle baseline on AC

    • Boot on AC and leave system idle.
    • Expected: GPU reaches low-power state.
  4. AC unplug then AC replug while fully idle

    • No GPU workload running.
    • Wait at least 60 seconds after replug.
    • Expected: GPU may wake briefly but should return to low-power state (not stay in D0 indefinitely).
  5. Repeatability

    • Repeat unplug/replug 3 times under same idle conditions.
    • Expected: no permanent stuck-in-D0 case.
  6. Evidence to share

    • dmesg excerpt containing [RTD3] lines around runtime resume/suspend and idle transitions.
    • Final GPU power state after each cycle.
    • If failure occurs, include last 100 RTD3 lines and exact final state.

Also, your notes about GSP behavior, ACPI warnings, PreserveVideoMemoryAllocations interaction, and /proc power output are valuable. We should track those separately if they reproduce independently of this RTD3 issue.

@dagbdagb
Copy link
Copy Markdown

dagbdagb commented Mar 28, 2026

  1. Environment
  • 6.19.10-gentoo
  • TU106M [GeForce RTX 2070 Mobile]
  • Firmware loading enforced on by driver, EnableGpuFirmware=0 has no effect
  • driver version 595.58.03 (unpatched. That is: this PR is not added)
  • tlp not running
  • upower not running
  • acpid not running
  1. Clean rebuild sanity
  • drivers not rebuilt
  1. Idle baseline on AC
  • card in d3cold after reboot
  1. Testing
    1. AC unplug then AC replug while fully idle
    • card remains in d3cold after an unplug/replug cycle
    1. System suspend and wakeup
    • card remains in d3cold after wakeup
    1. Start program utilizing dGPU, exit program
    • card returns to d3cold after program exits
    1. Start program utilizing dGPU, suspend, wakeup, exit program
    • card returns to d3cold after program exits

At this stage, the following occured:

 1580.905299] NVRM: GPU at PCI:0000:01:00: GPU-793f405e-739f-f65c-4045-23dabed6259e
[ 1580.905303] NVRM: Xid (PCI:0000:01:00): 31, pid=822, name=modprobe, channel 0x0a000009, intr 00000000. MMU Fault: ENGINE HOST11 HUBCLIENT_HOST faulted @ 0x1_210d0000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[ 1580.905720] NVRM: nvGpuOpsReportFatalError: uvm encountered global fatal error 0x60, requiring os reboot to recover.
[ 1580.910600] NVRM: Xid (PCI:0000:01:00): 154, GPU recovery action changed from 0x0 (None) to 0x2 (Node Reboot Required)
[ 1581.010052] NVRM: Xid (PCI:0000:01:00): 31, pid=822, name=modprobe, channel 0x09000007, intr 00000000. MMU Fault: ENGINE HOST10 HUBCLIENT_HOST faulted @ 0x1_21070000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[ 1591.172844] Loading firmware: nvidia/595.58.03/gsp_tu10x.bin
[ 1592.120698] NVRM: GPU0 nvAssertOkFailedNoLog: Assertion failed: Invalid data passed [NV_ERR_INVALID_DATA] (0x00000025) returned from PlatformRequestHandler failed to get target temp from SBIOS @ platform_request_handler_ctrl.c:2171
[ 1592.120709] NVRM: GPU0 nvAssertOkFailedNoLog: Assertion failed: Invalid data passed [NV_ERR_INVALID_DATA] (0x00000025) returned from PlatformRequestHandler failed to get platform power mode from SBIOS @ platform_request_handler_ctrl.c:2114
[ 1595.009670] Loading firmware: nvidia/595.58.03/gsp_tu10x.bin
[ 1596.000270] NVRM: GPU0 nvAssertOkFailedNoLog: Assertion failed: Invalid data passed [NV_ERR_INVALID_DATA] (0x00000025) returned from PlatformRequestHandler failed to get target temp from SBIOS @ platform_request_handler_ctrl.c:2171
[ 1596.000279] NVRM: GPU0 nvAssertOkFailedNoLog: Assertion failed: Invalid data passed [NV_ERR_INVALID_DATA] (0x00000025) returned from PlatformRequestHandler failed to get platform power mode from SBIOS @ platform_request_handler_ctrl.c:2114
[ 1598.342854] Loading firmware: nvidia/595.58.03/gsp_tu10x.bin
[ 1599.302842] NVRM: GPU0 nvAssertOkFailedNoLog: Assertion failed: Invalid data passed [NV_ERR_INVALID_DATA] (0x00000025) returned from PlatformRequestHandler failed to get target temp from SBIOS @ platform_request_handler_ctrl.c:2171
[ 1599.302852] NVRM: GPU0 nvAssertOkFailedNoLog: Assertion failed: Invalid data passed [NV_ERR_INVALID_DATA] (0x00000025) returned from PlatformRequestHandler failed to get platform power mode from SBIOS @ platform_request_handler_ctrl.c:2114

I'll retest and check if I can provoke this on command.

    1. Start program utilizing dGPU, unplug AC, replug AC, exit program
    • not tested yet.
  1. Repeatability
  • card remains in d3cold after repeated unplug/replug AC cycles.
  • card remains in d3cold after repeated suspend/wakup cycles.
  1. Evidence to share
    As noted above, this testing was performed with 595.58.03 on top of 6.19.10.
    I wanted to check if this saga is caused by user-space software preventing the card from entering d3cold. As of right now, this does indeed appear to be the case. The proposed PR did nevertheless offer the RTD3 kernel messages highlighting this fact in my earlier testing. I find those messages to carry sufficient value to deserve a separate PR.

@dagbdagb
Copy link
Copy Markdown

I see that options nvidia-drm modeset=1 prevents d3cold with unpatched drivers.
Is this expected?

If not, I can redo this particular test with this PR on top of this driver tree.

@DhineshPonnarasan
Copy link
Copy Markdown
Author

Thank you @dagbdagb for the detailed testing. Your results confirm that, with user-space power managers disabled, the GPU reliably enters D3cold after all tested scenarios, both with and without the patch. The Xid and SBIOS assertion errors appear unrelated to the RTD3 patch and are likely firmware or platform issues.

Regarding nvidia-drm modeset=1, it is expected that enabling KMS can prevent D3cold, as the kernel needs to keep the device active for display management. This is standard behavior.

The main value of the PR is to improve kernel log visibility for RTD3 state transitions and blockers, which your earlier testing confirmed. If you’d like to rerun the modeset=1 test with the PR applied, it would be helpful, but the current results already support the PR’s value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Yet another rtd3/d3cold bug variant with Turing/580

3 participants