Laptop keeps eating NVMe drives via 1000 power cycles per hour

One of my notebooks is a Lenovo Yoga Slim 7 13ACN5 with a Ryzen 5800U APU, which actually saw relatively little action: it only had around 280 hours on the clock when it started to run into problems booting Windows while Ubuntu still worked, though it also reported drive errors (both OSes shared the same 500GB WDC SN730 drive).

When I tried to find out what was going on, I saw an insane number of power cycles on the drive: close to 250,000 for those 280 hours! Only around 7.3TB had been read and 6.6TB written, not much use by any measure.

SMART gave a critical warning, so I transferred the Windows installation to a new Micron NVMe drive that came with another Lenovo laptop, while the Ubuntu install was let go.

But on that other drive a similar story keeps repeating: after 1,061 hours of operation the power cycles have risen to almost 80,000, and the other day the machine wouldn't come back from hibernation, evidently because the hibernation file hit a reallocated sector. The available spare percentage was down to 83%, which seems quite low.

I also noticed that out of the nearly 80,000 power cycles, around 60,000 were listed as "unsafe" and that there were 30,000 error log entries, which I haven't yet tried to look at (no nice Windows GUI that I know of for that).
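
For anyone who wants to compare: smartmontools prints exactly these counters from the NVMe health log, both on Linux and with its Windows build (the device naming differs there); a quick check, assuming the drive shows up as /dev/nvme0:

  sudo smartctl -a /dev/nvme0
  # relevant lines in the SMART/Health Information section:
  #   Power Cycles, Power On Hours, Unsafe Shutdowns,
  #   Available Spare, Error Information Log Entries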

The count seems to increase every few minutes while the machine is in its energy-saving state; when it is actually powered off (shut down or hibernated), little seems to change.

I've compared with the various other laptops and systems I operate and their numbers seem entirely sane. The other extreme is a corporate laptop which has run 24x7 for years and has only 134 power cycles for nearly 17,000 hours of operation; most machines tend to have 2-3x as many hours as power cycles and nowhere near these numbers.

My impression is that every time Windows wants to tell the SSD that little is currently going on and that it may want to save some power, it's actually cutting power, and doing so unsafely, too.

It's the only AMD notebook I have in operation, but currently the majority of my desktops are Ryzens and show no such behavior.

All BIOSes are checked and updated on those monthly patch days; likewise the NVMe drives all have their most up-to-date firmware. Energy saving is set to "balanced" and/or "intelligent" wherever I'm given a choice.

I'm also running the newest drivers from Lenovo (laptops) or AMD (desktops).

Of course the machine is from 2021 and out of warranty; the Lenovo online support chat people feign being "currently experiencing technical difficulties" whenever I try that avenue. And I haven't really found a lot of similar stories.

When the topic is raised at all, most responses say not to worry, but at nearly 1,000 power cycles per hour, clearly some drives are throwing in the towel.

To me it sounds like a firmware issue where the wrong commands are sent to the NVMe drive during power management, but as an end user I don't see how I could diagnose that.

So, could you guys have a look and see if you notice similarly high power cycle counts relative to operating hours on some of your machines, and whether certain Lenovos or AMD systems stick out?

2 Likes

Disable PCIe Link State Power Management in the power plan config; this is standard practice in enterprise deployments.
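
If you'd rather do it from an elevated prompt than dig through the GUI, something like this should flip the Link State Power Management setting to off on the current plan (0 = off):

  powercfg /setacvalueindex scheme_current sub_pciexpress ASPM 0
  powercfg /setdcvalueindex scheme_current sub_pciexpress ASPM 0
  powercfg /setactive scheme_current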

Then do the typical:
make sure the BIOS is up to date, etc.

Oh! and replace that drive, it's dead

1 Like

PCIe power management does not power cycle the drive during use, much less cause unclean shutdowns. Drives support sleep and deep-sleep modes for a reason; that's what PCIe power management controls, and those transitions do not count as power cycles.

Unless this is a sensor readout failure (i.e. the reported power cycle count is nonsense), this seems like a hardware issue leading to power delivery failures, and pretty hardcore ones.

The computer I am sitting in front of has fewer logged power cycles over a year of daily use than you rack up in a single hour.

Your laptop is killing your drives; junk it as soon as possible. And avoid that vendor if you can.

Also, I would be wary of those drives, since unclean power-offs are one of the few things that reliably bork an SSD over time. It's kind of surprising you were able to use them for as long as you did.

Client SSDs do not ship with power-loss capacitors the way early models did, so if power is cut during activity, data loss or corruption is to be expected.

6 Likes

Thanks for your replies!

I've tried modifying the power plan, setting both the battery and wall-plug intervals to 0 or "never", but that doesn't do anything noticeable. As soon as I shut the lid (which puts the laptop into suspend-to-RAM energy saving), it adds a power cycle every couple of minutes, easily a dozen within the 30 minutes I tested. Hibernation increases the count by one, as expected.
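
(For the record, if I'm mapping the GUI settings right, what I changed should be equivalent to this on the command line, with 0 meaning "never":)

  powercfg /change disk-timeout-ac 0
  powercfg /change disk-timeout-dc 0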

Junking the laptop isn't really high on my option list; the machine is practically new, having been used very little because my travel habits changed with the pandemic.

I'm trying to prepare it as a student notebook for one of my kids, because it has much better battery life than a Whiskey Lake variant which never managed more than 4 hours even on a fresh battery (but doesn't suffer from the power cycle issue).

I'll next try to see if the power cycle count also increases under continued operation, which would confirm the theory that it's a hardware power delivery issue rather than some BIOS/ACPI misunderstanding between the drive and the laptop. In that case I guess it could be fixed by a mainboard replacement, which unfortunately makes very little economic sense.
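
To make that test less tedious I'll probably just log the counters in a loop while the machine stays awake, something like this under Linux (assuming the drive is /dev/nvme0):

  while true; do
      date '+%F %T'
      sudo smartctl -a /dev/nvme0 | grep -E 'Power Cycles|Unsafe Shutdowns'
      sleep 300
  done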

What's fascinating in that case is the fact that I never really noticed issues during normal operation with the machine; it seems to run just fine, no stutters or data loss with either Windows or Linux. The booting issues on Windows seem mostly related to Microsoft potentially re-enabling "fast start" or hybrid standby on OS updates, and I can imagine how power cycles on the SSD might mess with keeping the in-memory and on-disk copies of RAM in sync as it tries to power down, resulting in a notebook that can't wake up properly without disconnecting storage.

Only Linux gave me the proper SMART warnings, which had me investigate and find those crazy power cycle numbers. I've since put the 'failed' drive in as a secondary in a desktop and did some testing with HD Tune Pro, an older tool, which ran just fine and reported no bad sectors (evidently enough spares left). No amount of vendor testing and wiping returned it to health, but it's not brutally dead either, with only 280 hours on the clock and around 7TB read and written, just those crazy power cycles.

The issue can't be the NVMe drive(s), because the replacement drive is showing the same trend, except it hasn't triggered a health warning from the power cycles yet.

I'll also try to see if I can read some of the 30,000 log messages via smartctl under Linux.
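
(If the man page isn't lying to me, that should be something like:)

  sudo smartctl -l error /dev/nvme0      # dumps the NVMe Error Information log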

I find it hard to go off Lenovo, because their spread of models is so huge. But the next one, due on Friday, is a ThinkPad X13 with a Hawk Point APU, currently selling at 30% over the cost of a mainboard replacement for the Cezanne unit, but with 32GB instead of 16GB of RAM, a better iGPU and three years of fresh warranty.

I'll test that very thoroughly before the return window expires. Let's see if it does better than the cute and shiny models I fell for previously.

2 Likes

RMA it, it's defective

Warranty expired in July 2023… RMA is not an option, I'm afraid.

Did a little more testing:

I deactivated energy saving (suspend to RAM) and had the machine idle on battery for two hours, mostly with the lid closed to turn off the display. Average power consumption according to HWiNFO was slightly below 5 watts and there was no increase in NVMe power cycles.

This seems to confirm that during normal operation there is no issue with power being cut from the drive, e.g. via a flaky power line; the power cycles correlate with attempts to put the SSD into a power-saving state.

My understanding is that this should be a set of commands sent to the SSD to enable and disable potentially different levels of power saving, but that the SoC shouldn't cut power to the NVMe drive (that shouldn't even be physically possible).

Actually, I wonder if anything has the ability to physically cut the SSD's power, e.g. a PMIC, and whether e.g. the service processor could do so as part of its ACPI duties. Unless the hardware is flaky, that would be necessary to explain the numbers, because a soft power-off command shouldn't have the drive record an "unsafe" power-off. And according to the SMART counters the majority of all power-offs are "unsafe", which I guess can only happen via a power cut, not from a wrong sequence of SATA/NVMe commands, right?

Yet if the hardware is flaky, why does it not happen during normal operations?

As a counter-check I'll now re-enable HDD power saving yet keep the rest of the laptop from entering energy saving. If that increases the power cycle count, something definitely goes wrong in the execution of these power-save commands.
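
(In powercfg terms the counter-check should amount to this, assuming a short 5-minute drive timeout as the trigger and sleep suppressed entirely; 0 = never:)

  powercfg /change disk-timeout-ac 5
  powercfg /change disk-timeout-dc 5
  powercfg /change standby-timeout-ac 0
  powercfg /change standby-timeout-dc 0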

3 Likes

I'm surprised it isn't standard for SSDs to have a small capacitor inside explicitly to prevent bricking from sudden power loss.

That was how early SSDs did it, but then stuff got better, so to save a few bucks the capacitors got dropped again.

One consumer-ish SSD (= 2.5" SATA) that has caps is the Kingston DC600m

In a laptop there should never be an issue with SSDs not having enough power to shut down safely: the UPS is built-in, after all.

Of course it needs to be told to shut down, and I guess that's where things go awry here. AFAIK there is no such thing as a power-fail signal line on SATA or PCIe/M.2.

Now, who would actually be able to send such a command, whether it's the OS or the ACPI part of the BIOS, is something I never needed to know before: it would be useful to know now, to help diagnose the issue here.
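
What I can at least inspect from the OS side is which power states the drive itself advertises and whether autonomous power state transitions (APST) are enabled, e.g. with nvme-cli under Linux (a sketch; device path assumed, feature 0x0c is APST per the NVMe spec):

  sudo nvme id-ctrl /dev/nvme0 | grep '^ps '       # power state descriptors the drive supports
  sudo nvme get-feature /dev/nvme0 -f 0x0c -H      # feature 0x0c = Autonomous Power State Transition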

2 Likes

Ok, so here are new datapoints:

I've re-enabled HDD power saving under Windows 11; it kicks in after 5 minutes.

I've also disabled energy saving (suspend to RAM) and left it sitting at an idle desktop for 30 minutes without any application running.

The LED (solidly lit) confirms the laptop never entered suspend during those 30 minutes.

Zero power cycle increase. Even installing the September patches and rebooting afterwards didn't trigger a power cycle, as the system remained physically on the whole time.

I am assuming that Windows asked the drive to do its energy saving (I wouldn't know how to monitor that: can you set event triggers on this, or do you need a checked OS build?), and that did not increase the power cycles.
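
One thing that might at least show what happens while the machine is "asleep" (I haven't tried it on this laptop yet) is the Modern Standby sleep-study report:

  powercfg /sleepstudy /output sleepstudy.html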

So here is the next question: who is in control when the laptop is in suspend to RAM?
Because that's the only time when bad things happen!
Not during normal operations, not in full hibernation when the power is off.

Suspend to RAM/energy saving is trouble, and hybrid standby is even worse (because the laptop sometimes can't even properly restart).

My theory is that it's completely outside OS control; Windows or Linux doesn't matter. The BIOS takes over and it's actually the little ARM service processor (on AMD APUs; it used to be an 80486 on the PCH on older Intels) that is given the task of monitoring the hardware and checking for wake-up events. Now that ARM service processor may wake up the CPU from time to time to do some things, too, perhaps even going as far as waking the OS for "modern standby" things, but it's in this 'mostly firmware' state that bad things happen with the NVMe power cycles.

Any views on this? Anyone with a better understanding of this matter?

Now proceeding to Linux tests…

So the cycles come from suspend?

While the laptop is in suspend the CPU is in a low C-state and the motherboard has to handle the power-down for all the components except the RAM and CPU, because the RAM controller is part of the CPU.

These are all paths that go from the BIOS into the motherboard, the CPU and the other components, the same as for a normal power-saving state.

I'm guessing a component on the motherboard is broken, giving power to the SSD but not enough of it, so the drive keeps cycling off while it thinks it's starting up, which also causes the unsafe shutdowns.

If the laptop has a second M.2 slot you can try it in that.

General advice: always assume all laptops will have buggy BIOSes.

1 Like

I have a drive with similar, interestingly high numbers in my laptop. Didn't even notice until last year when I glanced at the numbers in CrystalDiskInfo while checking other drives. Currently standing at a power-on count of 466,178 :)

Been fine for 5 years. Not sure if it has any PCIe errors, not sure where to check in Windows lol.

So after checking with Linux here is the new picture:

Linux supports full suspend to RAM, which I guess is ACPI S3.
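
(Whether it really is S3 or just s2idle should be verifiable; the kernel marks the active variant in brackets:)

  cat /sys/power/mem_sleep      # e.g. "s2idle [deep]" - "deep" is classic S3 suspend-to-RAM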

That means every suspend results in one power cycle, but they aren't judged "unsafe", nor do they add entries to the NVMe error logs.

BTW, reading the error logs is much easier using the 'nvme error-log' command on Linux, but it doesn't yield anything interesting. The only real error entry complains about invalid commands, which is most likely from me trying to request logs that don't actually exist on the Micron drive.
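
(For reference, with nvme-cli installed that's simply:)

  sudo nvme error-log /dev/nvme0 -e 16      # -e limits how many entries get dumped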

But here's where I'm currently seeing the issue: Modern Standby.

This laptop supports Modern Standby and that's where the power cycles seem to come from: every time it wakes to check whether something interesting is going on, it cycles the NVMe drive.

Perhaps drives should be designed to handle this, after all this isn't exactly a destructive action, and just perhaps that initial WDC drive simply failed, fully independent of the power cycle count.

Unfortunately disabling Modern Standby didn't just return me to "old-fashioned standby", which is actually what I'd prefer, but disabled all standby, because evidently S3/S2/S1 aren't supported by the Slim 7 firmware; it's either S0ix or S4 (hibernation).
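
(For anyone wanting to reproduce this: the commonly cited way to toggle Modern Standby off is a registry override, reboot required and very much at your own risk:)

  reg add HKLM\System\CurrentControlSet\Control\Power /v PlatformAoAcOverride /t REG_DWORD /d 0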

So "Rat", perhaps you could check for Modern Standby?

How do I check for Modern Standby? This isn't something I know much about at all, sorry.

I didn't either, but Google helped:
https://www.tenforums.com/tutorials/145891-how-check-if-modern-standby-supported-windows-10-a.html
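
The short version, if I remember the tutorial right, is to run this in a command prompt and look for "Standby (S0 Low Power Idle)" among the available states:

  powercfg /a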

What looks much better on your disk than in my case is the number of "unsafe shutdowns": those have me a bit worried, even if they never seem to increase during normal operation.

Those are, as mentioned here, potentially very dangerous, even if SSD designs have generally improved to keep the critical sections in their firmware and the on-disk state as short as possible.

And there are no error log entries either, whereas those are a rather reddish flag on my drives.

One power cycle or "online check" every 30 seconds may be what Microsoft deems a proper balance, but I'd really want far more control over this, and certainly the ability to disable these online checks.

I've had laptop batteries killed by Windows when it decided to power the machines up in the middle of a flight while stored in the overhead compartment. Tightly folded into luggage, they managed to overheat and drain batteries that didn't take kindly to that treatment.

That's when I learned to hate hybrid standby or "fast start", which I would never have chosen had I known it was one of those "improvements" Windows 10 slipped in after upgrading from Windows 7…

Ah okay. Thought you were referring to something different, my bad. Here is what I have on it (Asus G15 laptop):

It's an APU, so there isn't much in the way of motherboard components: there are still discrete RAM chips, but very nearly everything else is on-chip. Yes, in all likelihood there is still a power management chip, and I have no idea if it just tries to keep voltages stable while the current draw is within spec or if it can be told to cut physical power.

That doesn't seem very probable to me, because a device entirely without power couldn't soft-resume easily. But honestly, I just don't know how modern laptops handle hibernation, whether it keeps the service processor running at ultra-low power or really just checks some "CMOS" state during a cold boot to reload state from storage.

A broken motherboard component is much more likely to cause trouble during normal operation, where there is no trouble, so I'm not currently leaning in that direction.

1 Like