Post-mortem of server lock-up, caused by ZFS scrub

My VPN/seedbox provider had an outage and I thought it was cool of them to inform users of the issue. Thought some of you might find it useful or interesting. :slight_smile:

Link to the email sent to customers. Copy-pasted below.

Post-mortem and analysis of the outage

What happened and when?

We’ve been analysing the issues which caused yesterday’s (8th January 2023) outage, so we can learn from any mistakes and also put in place fail-safes to ensure it does not happen again. Our goal is always to increase stability, so we’re also looking for ways of developing new systems to ensure everything is running smoothly.

First we’ll look at the timeline of events which occurred (all times are UTC). The issues are quite technical, and we’ve tried to include as much detail as possible:

January 8th

00:00-01:00: All user servers (except one) begin locking up. Our team is notified immediately but does not receive the notifications due to “do not disturb” being enabled on their devices.

08:00: Team members wake up and immediately start working on the issue. Initial diagnosis is hampered by the fact that the servers lock up again immediately after booting.

09:00: The first outage update email is sent out, explaining that we are investigating the cause.

09:00-11:00: Our team identifies the cause of the lock-ups: extreme RAM usage leading to OOM (Out-Of-Memory) killing before the servers even finish booting. We were fairly confident this was caused by the ZFS upgrade rolled out on January 5th.

11:20: The second outage email is sent out, explaining that all servers would need to be rebooted into rescue mode for a fix to be applied.

11:30-14:30: Our team develops automation to launch rescue mode on all affected servers. We were able to do this thanks to our datacenter partner LeaseWeb’s server management API.
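For a sense of what that automation might look like, here is a minimal sketch using curl. The endpoint path, auth header, and payload below are placeholders/assumptions, since the post doesn’t show the actual LeaseWeb API calls:

```bash
#!/bin/bash
# Hypothetical sketch only - the real LeaseWeb endpoint, auth header and
# payload are not shown in the post, so everything below is a placeholder.
API_KEY="..."                 # assumed API credential
SERVERS="srv-001 srv-002"     # assumed list of affected server IDs

for id in $SERVERS; do
    curl -s -X POST \
        -H "X-LSW-Auth: ${API_KEY}" \
        -H "Content-Type: application/json" \
        -d '{"powerCycle": true}' \
        "https://api.leaseweb.com/.../servers/${id}/rescueMode"   # placeholder path
done
```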

We identified the bug causing the issue (https://github.com/openzfs/zfs/issues/11574, more on this below) and the fix (setting zfs_scan_legacy=1 in the ZFS module parameters). We created automation to apply the fix on boot, immediately after importing the zpools. We also added automation to cancel the existing scrubs, since the scan mode cannot be changed while a scrub is running.
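For anyone curious, the fix they describe roughly amounts to the following (a minimal sketch using the standard OpenZFS interfaces, not their actual automation):

```bash
# Switch to the legacy (less memory-hungry) scrub mode at runtime:
echo 1 > /sys/module/zfs/parameters/zfs_scan_legacy

# Make the setting persist across reboots as a module option:
echo "options zfs zfs_scan_legacy=1" > /etc/modprobe.d/zfs-scan-legacy.conf

# Cancel any scrub already in progress (the scan mode can't be changed
# mid-scrub); "zpool scrub -s" stops a running scrub on the named pool:
for pool in $(zpool list -H -o name); do
    zpool scrub -s "$pool" 2>/dev/null || true
done
```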

The ZFS developers were notified and provided input confirming our fix, which was then rolled out across all servers except 9.

We identified a bug in our datacenter partner LeaseWeb’s rescue mode automation system which prevented 9 servers from booting into rescue mode. LeaseWeb was contacted immediately to assist.

14:30: The third outage email is sent out, explaining that we have identified the cause and were rolling out a fix.

A range of errors appeared once the servers began booting back into production mode. We use a containerisation system called Docker, which is not very resilient to hard reboots. Even though we had configured all applications to restart automatically on failure, many apps were stopped with exit code 255. This code is ambiguous, but we believe that after the applications were killed by the OOM killer mentioned above, Docker did not start them again (even though it was supposed to).

We then developed further automation to check all apps for exit codes 255 and 127 and start any apps which were offline.
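Using the plain Docker CLI, that check could look something like this (a rough sketch, not their actual tooling):

```bash
# Find containers that exited with code 255 or 127 and start them again.
for code in 255 127; do
    for name in $(docker ps -a --filter "status=exited" --filter "exited=${code}" --format '{{.Names}}'); do
        echo "Starting ${name} (exited with ${code})"
        docker start "${name}"
    done
done
```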

However, when starting some apps we found a further Docker bug where the Docker daemon was still holding the ports reserved for the app, causing a “port already allocated” error. Further automation was added to account for this bug and restart any appboxes in this state.
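The usual remedy for that class of bug is to retry the container start and, if the daemon is still holding the port, restart Docker itself. A hedged sketch (the provider’s actual fix isn’t shown):

```bash
APP="my-appbox"   # hypothetical container name
if ! out=$(docker start "$APP" 2>&1); then
    if echo "$out" | grep -q "port is already allocated"; then
        # Assumption: restarting the Docker daemon releases the stale
        # docker-proxy port binding, after which the container starts cleanly.
        systemctl restart docker
        docker start "$APP"
    fi
fi
```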

16:30: The fourth outage email is sent out, explaining that 75% of services are now back online and that some services are taking longer due to the above manual tasks needing to be completed.

A contingency plan was made for the 9 servers unable to boot into rescue mode. It would have taken more time, involving manually mounting a rescue image and booting from there; thankfully it was not required. We worked with LeaseWeb to eventually get the remaining 9 servers into rescue mode and applied the fix.

20:18: The fifth outage email is sent out, explaining that all services should be back online and to contact support if that was not the case.

20:30: We identified that one server (m377.ata.ams-1.nl) had not had the Docker-fix automation properly applied, so we went ahead and ran it.

January 9th

08:00: Our team identified a group of disks on server m431.ata.ams-1.nl which had not imported correctly. The automation used the command “zpool import -a”, which assumes the disks have no corruption; “zpool import -aF” was needed in this case to roll back the roughly 5 seconds of corrupted data written just before the server went offline. After importing the disks, the services were started and came back online.
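For context, these are the standard zpool import options involved:

```bash
zpool import -a      # import all pools found on the system's disks
zpool import -aF     # recovery mode: roll back to the last consistent state,
                     # discarding the final few seconds of transactions
zpool import -aFn    # dry run: only report whether the -F rewind would work
```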

09:00: Our team identified further bugs in the Docker system causing some apps to remain offline despite being restarted. One of these was related to the previously mentioned “port already allocated” bug; we added code to our app start system to resolve it whenever an app is started or restarted.

We identified another bug involving containerd (the container runtime we use): leftover run files were preventing some apps from starting. We added further code to our system to resolve this issue when an app is started or restarted.

We are also aware that a few apps may still be offline. This is because they exited with code 137, which Docker also reports for a normal app stoppage (such as when you stop an app via our dashboard), so our automation does not restart them automatically. Restarting the app via our dashboard should resolve these issues. If you’re still having any issues, please open a support ticket here.

Root cause analysis

We use the ZFS filesystem for its resilience to hardware failures, which comes in part from its copy-on-write design. Like most filesystems, ZFS requires periodic maintenance to ensure that any errors on disk are corrected.

The process of scanning the disks for errors is called scrubbing, and it’s set to occur automatically on the second Sunday of every month (Bug #1860228 “addition of zfsutils-linux scrib every 2nd sunday” : Bugs : zfs-linux package : Ubuntu).
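That schedule comes from the cron job shipped with the zfsutils-linux package; it looks roughly like this (paraphrased from memory, the exact timing fields may differ):

```bash
# /etc/cron.d/zfsutils-linux (approximate)
# Scrub the second Sunday of every month.
24 0 8-14 * * root if [ $(date +\%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/scrub ]; then /usr/lib/zfs-linux/scrub; fi
```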

Normally this process runs and nobody notices (it adds less than 10 MB/s of extra load per disk for 3-4 days), and the disks stay extremely resilient to everything except sudden catastrophic failures. We’ve only lost 2 disks to such extreme failures in the past 5 years, and never one to on-disk corruption.

However, in April 2021 a new scrub mode was introduced as the default (Release zfs-2.1.99 · openzfs/zfs · GitHub); up until this point we had been using an older ZFS version on Ubuntu 20.04 (0.8.3). In November we upgraded our test servers to the new 2.1.6 version of ZFS and everything was working fine, so we rolled out the upgrade to a single production server (m377.ata.ams-1.nl) in mid-December. Again, everything ran fine.

On the 5th January 2023 we deployed the upgraded ZFS version to all servers. It wasn’t until the 8th of January that the scheduled scrub started and the RAM issues began. The new scrub mode uses 10% of RAM per zpool, and each of our servers has 12 zpools, so scrubbing tries to use 120% of total RAM, causing catastrophic failure and constant OOM killing of processes.

Having now reverted all servers to the old zfs_scan_legacy mode, which is slightly slower but far less memory intensive, we are confident that the issue will not reoccur.

Lessons learned and roadmap

Going forward we’re going to initiate a number of processes and develop systems to ensure this does not happen again, and make the system more stable and resilient to failure.

Here’s a rough roadmap of what we’ll be doing:

  • Our upgrade testing will now involve testing scrubbing
  • Upgrades will be tested on production-like servers for more than one month before deployment
  • A new synchronisation system will be developed to ensure that applications started or stopped on the dashboard are in the same state on the server, accounting for any bugs in Docker/containerd (see the sketch after this list)
  • More error reporting on application failures will be implemented to alert our team to any apps which are not behaving as they should be
  • Scans will be implemented to ensure that any of the bugs we’ve discovered will not impact users
  • New alert systems have been implemented to ensure that critical issues will be noticed by our team even when “do not disturb” is enabled on their devices
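As a sketch of what the synchronisation system mentioned above could look like (all helper names below are hypothetical, since the dashboard’s internals aren’t public): a loop that compares the desired state from the dashboard with what Docker reports and corrects any drift.

```bash
# Hypothetical reconciliation loop; get_dashboard_apps and get_desired_state
# stand in for whatever the provider's dashboard API actually exposes.
while true; do
    for app in $(get_dashboard_apps); do
        want=$(get_desired_state "$app")   # "running" or "stopped"
        have=$(docker inspect -f '{{.State.Status}}' "$app" 2>/dev/null || echo "missing")
        if [ "$want" = "running" ] && [ "$have" != "running" ]; then
            docker start "$app"
        elif [ "$want" = "stopped" ] && [ "$have" = "running" ]; then
            docker stop "$app"
        fi
    done
    sleep 60
done
```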

We will continue to stay vigilant to any more bugs which may appear following the incident and apply fixes where necessary.


Sounds like a busy couple of days.

Why does scrub need that much RAM? As in, it sounds like it should need almost no RAM, maybe some for in-flight buffers. Why would it benefit from more RAM?

Scrub starts at around the same time on all servers? … hmmm

It helps to check your “pager” / alerting app once in a while against a doc of best practices for how your phone should be set up.

… and then there’s the concept of “secondaries”: if the primary on-call person does not respond within, say, 5 or 30 minutes (whatever the setup), the system issues another alert notification to the secondary. The secondary is frequently another person on the team, or a person on a neighbouring team or similar. It’s their job to figure out what happened to the primary (slipped in a bath tub, didn’t realize they’re on call, broken phone, …) as well as to find someone to start fixing the problem.