Fix: Prevent infinite idle holdoff loop blocking D3cold re-entry after AC replug #1074
DhineshPonnarasan wants to merge 2 commits into NVIDIA:main
The suggested patch applies cleanly on top of main, but it does not build for me. If this is sufficient to get my hardware to behave nicely with 590/main and GSP firmware, I would of course use it. My observations and workarounds in #1071 are made in the context of 580 and with firmware loading disabled.
Hi @dagbdagb , I have restored both fields in nv-priv.h:
So the references in dynamic-power.c now resolve again. Could you please pull the latest commit from PR #1074 and rebuild? Also, thanks for the note about 580 and firmware-disabled behavior in #1071. That context is very useful and I’ll keep it in mind while validating this on 590/main.
First things first: Second: I did as follows:
With this, I saw the following: A couple of observations at this point in time:
I spent all of 14 seconds scanning my process list before I realized TLP was running. After reboot, when I unplug/replug power, dmesg is quiet, and the dGPU now remains in d3cold. Additional testing:
Please let me know what you want me to test, and in what sequence.
Hi @dagbdagb , Your results already show a strong improvement (power reinsert can return to d3cold), but because Please run this exact sequence on the latest PR head and share results:
Also, your notes about GSP behavior, ACPI warnings, |
At this stage, the following occurred: I'll retest and check if I can provoke this on command.
I see that If not, I can redo this particular test with this PR on top of this driver tree. |
Thank you @dagbdagb for the detailed testing. Your results confirm that, with user-space power managers disabled, the GPU reliably enters D3cold after all tested scenarios, both with and without the patch. The Xid and SBIOS assertion errors appear unrelated to the RTD3 patch and are likely firmware or platform issues. Regarding The main value of the PR is to improve kernel log visibility for RTD3 state transitions and blockers, which your earlier testing confirmed. If you’d like to rerun the |
Problem
On Turing GPUs with:
NVreg_DynamicPowerManagement=0x02 (FINE mode)
NVreg_EnableGpuFirmware=0 (GSP disabled)
the GPU fails to return to D3cold after AC power is reinserted.
Observed behavior
This results in persistent power usage (~7–10W) until reboot.
Root Cause Analysis
The issue originates in the idle holdoff removal logic inside: RmRemoveIdleHoldoff()
Failure scenario
After AC replug:
RmCheckForGcxSupportOnCurrentState()
idle_precondition_check_callback_scheduled
In AC-powered mode:
RmCheckForGcxSupportOnCurrentState() repeatedly returns false
As a result, the following loop occurs:
RmRemoveIdleHoldoff()
→ GC6 not available
→ idle precondition callback not scheduled
→ reschedule RmRemoveIdleHoldoff()
→ repeat indefinitely
Consequence
nv_indicate_idle() is never called
pm_runtime_put_noidle() is never triggered
Solution
Introduce a bounded retry mechanism to break the infinite rescheduling loop.
Key idea
Allow a limited number of retries for GC6 eligibility, then force idle indication.
Implementation details
Add a counter: idle_holdoff_reschedule_count
Define a retry limit: MAX_IDLE_HOLDOFF_RESCHEDULES (e.g., 4)
Modify RmRemoveIdleHoldoff()
Behavior
If GC6 becomes available OR idle preconditions are met:
Proceed normally
Call nv_indicate_idle()
Reset counter
If GC6 is still unavailable:
Retry up to N times (~20 seconds total)
After threshold is reached:
Force nv_indicate_idle()
Reset counter
Allow autosuspend fallback
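The behavior described above can be sketched as a small user-space model. This is not the driver code: the struct, the enum, and the function names are hypothetical stand-ins, and the two boolean parameters model the result of RmCheckForGcxSupportOnCurrentState() and the idle_precondition_check_callback_scheduled flag. Only the counter name and the limit of 4 come from the description. If "~20 seconds total" corresponds to 4 retries, the reschedule interval would be roughly 5 seconds; that interval is inferred here, not stated.

```c
#include <assert.h>
#include <stdbool.h>

/* Limit taken from the description ("e.g., 4"); the real value may differ. */
#define MAX_IDLE_HOLDOFF_RESCHEDULES 4

/* Outcome of one holdoff-removal attempt in this toy model (hypothetical). */
typedef enum {
    HOLDOFF_RESCHEDULED,  /* GC6 unavailable, retry budget left: try again later */
    HOLDOFF_IDLE_NORMAL,  /* GC6 available or precondition callback scheduled */
    HOLDOFF_IDLE_FORCED   /* retry budget exhausted: idle indication forced */
} holdoff_result_t;

/* Stand-in for per-GPU state; only the counter name comes from the PR text. */
typedef struct {
    unsigned idle_holdoff_reschedule_count;
    unsigned idle_indications;  /* counts modeled nv_indicate_idle() calls */
} holdoff_state_t;

/* One modeled invocation of the holdoff-removal step. */
static holdoff_result_t remove_idle_holdoff(holdoff_state_t *s,
                                            bool gcx_supported,
                                            bool precondition_cb_scheduled)
{
    if (gcx_supported || precondition_cb_scheduled) {
        s->idle_indications++;                  /* normal path */
        s->idle_holdoff_reschedule_count = 0;
        return HOLDOFF_IDLE_NORMAL;
    }
    if (s->idle_holdoff_reschedule_count < MAX_IDLE_HOLDOFF_RESCHEDULES) {
        s->idle_holdoff_reschedule_count++;     /* bounded retry */
        return HOLDOFF_RESCHEDULED;
    }
    s->idle_indications++;                      /* forced fallback */
    s->idle_holdoff_reschedule_count = 0;
    return HOLDOFF_IDLE_FORCED;
}

/* Repeat the step until idle is indicated; returns the number of calls made. */
static unsigned run_until_idle(holdoff_state_t *s, bool gcx, bool cb)
{
    unsigned calls = 0;
    holdoff_result_t r;
    do {
        r = remove_idle_holdoff(s, gcx, cb);
        calls++;
        /* The counter can never exceed the limit, so the loop terminates. */
        assert(s->idle_holdoff_reschedule_count <= MAX_IDLE_HOLDOFF_RESCHEDULES);
    } while (r == HOLDOFF_RESCHEDULED);
    return calls;
}
```

In this model, a GPU for which both inputs stay false is rescheduled four times and idle is forced on the fifth call, whereas without the counter the loop would never exit.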
Why this works
The fix ensures:
Resulting flow
nv_indicate_idle()
→ pm_runtime_put_noidle()
→ runtime suspend scheduled
→ nv_pmops_runtime_suspend()
→ GPU transitions to D3cold
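Why the missing pm_runtime_put_noidle() call matters can be illustrated with a toy model of the Linux runtime PM usage counter: a device whose usage count is nonzero is not eligible for runtime suspend. The sketch below is a user-space simplification, not kernel code. The toy_* names are hypothetical, and the model covers only the usage-count gate, not autosuspend timers or the D3cold transition itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a device's runtime PM state. */
typedef struct {
    int  usage_count;  /* references that keep the device active */
    bool suspended;    /* true once the modeled runtime suspend has run */
} toy_dev_t;

/* Models taking a reference without resuming the device. */
static void toy_get_noresume(toy_dev_t *d)
{
    d->usage_count++;
}

/* Models dropping a reference without triggering an idle check. */
static void toy_put_noidle(toy_dev_t *d)
{
    assert(d->usage_count > 0);  /* an unbalanced put would be a bug */
    d->usage_count--;
}

/* Models a suspend attempt: refused while any reference is held. */
static bool toy_try_runtime_suspend(toy_dev_t *d)
{
    if (d->usage_count > 0)
        return false;
    d->suspended = true;
    return true;
}
```

While the holdoff reference is held, every suspend attempt in the model is refused; once the reference is dropped, the next attempt succeeds. This mirrors the flow above, where the idle indication has to happen before the runtime suspend callback can run.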
Safety & Impact Analysis
No functional regression
Minimal scope
RmRemoveIdleHoldoff()
Safe fallback behavior
Verification
This fix has been:
Note:
This change has not been tested on real hardware due to environment limitations (WSL).
Expected Outcome
After applying this fix:
Request for Validation
Testing on affected systems (Turing + RTD3 FINE mode) would be greatly appreciated to confirm:
References
Related components:
dynamic-power.c
Summary
This change resolves a timing-dependent infinite rescheduling condition by introducing a bounded retry mechanism, ensuring the GPU can always re-enter a low-power state even when GC6 is unavailable.
Hi @dagbdagb ,
Please review these changes and let me know if any further modifications are needed. If you notice any issues, please leave a comment below and I’ll address them. Thank you!