VMs going to read-only file system

I am really not sure where to turn here, so I hope some folks here can lend me a hand…

I have a homelab that has been stable for a few years, but I am now running into the strangest of issues. I run ESXi with ~4 Ubuntu 18.04 LTS VMs and FreeNAS with bare-metal HDD access, along with a Win 10 LTSC VM for Veeam backup of the Ubuntu VMs.

A few weeks ago, I SSHed into an Ubuntu VM and ZSH warned me that the file system was read-only. I shrugged it off as a weird one-time thing and restored from Veeam, but then it started happening more often, and to more VMs. Last night, sometime between when I went to sleep and when I woke up, it happened to all 4 VMs…

Initially I thought maybe it was a Veeam backup running during an Ubuntu automatic security update (I do have that enabled), but Veeam did not run last night. I am restoring one of the smaller VMs right now to an earlier state just to try and collect some info, but I am not even sure what to collect.

When I google the issue, the main answer seems to be that a mounted file system has errors, or that fstab is the issue. I am not sure how this could be, as one of the VMs is not mounting any of my FreeNAS storage at all… It has no network links at all. I just got that VM restored (it just runs pihole… pretty simple VM, literally nothing to it at all except pihole), and this is the fstab config:

UUID=2d945db3-1ff7-4c22-8022-0479207bb427 / ext4 defaults 0 0
/swap.img none swap sw 0 0
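
For anyone reading along, here is a minimal Python sketch of how those two fstab lines break down into the six standard fields (device, mount point, type, options, dump, fsck pass). The `FstabEntry` and `parse_fstab` names are made up for illustration. One thing it makes visible is that the root entry has a pass number of 0, which means the boot-time fsck is skipped for that file system.

```python
from typing import List, NamedTuple


class FstabEntry(NamedTuple):
    device: str
    mount_point: str
    fs_type: str
    options: List[str]
    dump: int
    fsck_pass: int  # 0 means "do not fsck this file system at boot"


def parse_fstab(text: str) -> List[FstabEntry]:
    """Split fstab text into entries, skipping blanks and comments."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # malformed line, ignore it in this sketch
        # The last two fields are optional and default to 0.
        dump = int(fields[4]) if len(fields) > 4 else 0
        fsck_pass = int(fields[5]) if len(fields) > 5 else 0
        entries.append(
            FstabEntry(fields[0], fields[1], fields[2],
                       fields[3].split(","), dump, fsck_pass)
        )
    return entries


# The two lines from the post above.
FSTAB = """\
UUID=2d945db3-1ff7-4c22-8022-0479207bb427 / ext4 defaults 0 0
/swap.img none swap sw 0 0
"""

entries = parse_fstab(FSTAB)
for entry in entries:
    note = "no boot-time fsck" if entry.fsck_pass == 0 else "fsck at boot"
    print(f"{entry.mount_point:5} {entry.fs_type:5} "
          f"{','.join(entry.options):10} {note}")
```

Neither entry references network storage, which matches the point that this VM mounts nothing from FreeNAS.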

I run the VMs (and ESXi) on a consumer SSD. I upgraded to it maybe 5 months ago, so it’s not exactly old, and it’s not like my VMs hit it hard at all. It’s a very low-use homelab: it’s on 24/7, but realistically it’s doing almost nothing 100% of the time. Could the VMs be doing some sort of SMART check, seeing an issue with the SSD, and going read-only as some sort of protection? How would I determine this? What else could be happening?
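
As one concrete thing to collect, a rough Python sketch like the one below can list which real file systems are currently mounted read-only, based on text in the `/proc/mounts` format. The function name, the file system type filter, and the sample text are all assumptions for illustration, not output from any of these VMs.

```python
from typing import List, Tuple

# File system types worth checking; pseudo file systems are often
# read-only by design, so they are left out of this sketch.
DISK_FS_TYPES = ("ext2", "ext3", "ext4", "xfs", "btrfs")


def read_only_mounts(mounts_text: str,
                     fs_types: Tuple[str, ...] = DISK_FS_TYPES
                     ) -> List[Tuple[str, str, str]]:
    """Return (device, mount_point, fs_type) for read-only mounts."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        device, mount_point, fs_type, options = fields[:4]
        if fs_type in fs_types and "ro" in options.split(","):
            result.append((device, mount_point, fs_type))
    return result


# Illustrative sample only, written in the /proc/mounts layout.
SAMPLE = """\
/dev/sda2 / ext4 ro,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /run tmpfs rw,nosuid,noexec,relatime 0 0
"""

for device, mount_point, fs_type in read_only_mounts(SAMPLE):
    print(f"{mount_point} ({fs_type} on {device}) is mounted read-only")
```

On an affected VM, the same function could be fed `open("/proc/mounts").read()` instead of the sample, and the result saved alongside the kernel log from that boot.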

Trying some things out, I tried this. So, looks like Ubuntu put itself in read-only as it found some corruption? So is this indicative of a failing boot SSD?

With this information, and the fact the VM seems to be “ok” now that it rebooted, I am inclined to think these are the possible issues I am facing: a bad/loose SATA cable, a dying SSD, or bad RAM.

I am inclined to rule out RAM, as FreeNAS and ZFS would likely be throwing all sorts of errors if they were seeing checksum issues… FreeNAS has bare-metal access to my HBA, and I run ECC RAM. I would like to think that somewhere along this chain FreeNAS would have been the first thing to throw errors at me if it was in fact a RAM issue. This leads me to believe it’s a bad SSD or a bad SATA cable to the SSD.

Some info on the homelab if it will help:
Homelab/ Media Server: ESXi 6.5 - - 250 GB SSD for VM’s/ESXi boot - - FreeNAS 11.2-U5 - -HPE Proliant ML10 Gen 9 backbone - - i3 6100 - - 28 GB ECC - - 10x4 TB WD Red RAID Z2

It could be indicative of a few things.

Was there a power failure or unclean shutdown?

I would recommend checking the smart data on sda.

Let’s start with the SSD. Check smart, then check the cable, then run a memtest.

Do a long test, and I’ll be happy to help you analyze the results.

I just shut all the VMs down, and ESXi is coming down. I will pull the drive and test it on my test bench now. If we want to go further down that rabbit hole, I could badblocks the SSD (once I properly back up everything possible…). I do have the VMs backed up, minus the Veeam VM itself. But if it comes to that, I can spin up a mount point on my test bench as a new datastore to back up all the ESXi VMs to, clone the ESXi install, and then badblocks the SSD. But, I agree, let’s try SMART long first… I will get that going now and report back with the results when it’s finished.

SMART data should be enough. SSDs are quite intelligent. The controllers know exactly what’s going on most of the time, and if not, there are huge performance hits while they try to figure it out.

Gotcha. I just know that on HDDs I have seen write/read errors on bad blocks that SMART didn’t see. But SSDs are much more advanced, so I will take your advice and see how this goes first.

Also, no bad shutdowns as far as I am aware.

Both physical and virtual? Good.

Correct.

This is the SMART long test output. Hmm, how do I paste as code in this forum? That’s ugly to read.

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   000   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       4183
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       20
148 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
149 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
167 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
168 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
169 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       31
170 Unknown_Attribute       0x0000   100   100   010    Old_age   Offline      -       21
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
173 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       1376308
181 Program_Fail_Cnt_Total  0x0032   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0000   100   100   000    Old_age   Offline      -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       -       17
194 Temperature_Celsius     0x0022   072   040   000    Old_age   Always       -       28 (Min/Max 17/40)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
218 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
231 Temperature_Celsius     0x0000   097   097   000    Old_age   Offline      -       97
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       4480
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       2823
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       2643
244 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       21
245 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       52
246 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       164800
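
To make a table like this easier to scan, here is a small Python sketch that parses smartctl-style attribute rows and flags the ones that usually matter for a "dying drive or bad cable" question. The watch list (program/erase failures, uncorrectable errors, reallocation events, CRC errors) and the helper names are my own choices for illustration, not anything Kingston documents. The sample rows are copied from the output above.

```python
import re
from typing import List, NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int
    worst: int
    thresh: int
    raw: int


# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

# Assumed watch list: a non-zero raw count here deserves a closer look.
WATCH_IDS = {181, 182, 187, 196, 199}


def parse_attributes(text: str) -> List[SmartAttr]:
    """Parse attribute rows; header and non-matching lines are skipped."""
    attrs = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attrs.append(SmartAttr(int(m[1]), m[2], int(m[3]),
                                   int(m[4]), int(m[5]), int(m[6])))
    return attrs


def suspicious(attrs: List[SmartAttr]) -> List[SmartAttr]:
    """Watched counters above zero, or a value at/below a real threshold."""
    return [a for a in attrs
            if (a.attr_id in WATCH_IDS and a.raw > 0)
            or (a.thresh > 0 and a.value <= a.thresh)]


SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   000   100   000    Old_age   Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0000   100   100   000    Old_age   Offline      -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   072   040   000    Old_age   Always       -       28 (Min/Max 17/40)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
"""

attrs = parse_attributes(SAMPLE)
flagged = suspicious(attrs)
print(f"parsed {len(attrs)} attributes, {len(flagged)} flagged")
for a in flagged:
    print(f"  {a.attr_id} {a.name}: raw={a.raw}")
```

With these rows nothing is flagged, including the CRC error count that a flaky SATA link would typically bump. That is consistent with the drive not reporting a problem, though it does not rule one out.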

I’ve edited your post to fix the formatting. Feel free to edit it yourself to see what I’ve done.

Additionally, Eden made a nice guide. It’s not fully up to date with all features, but everything in it is accurate.

Now, about the smart stuff:

So your disk has 6 months of power-on time. That’s not bad.
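
As a quick sanity check on that figure, this converts the 4183 power-on hours reported by attribute 9 into days and months. The average month length used here is my own assumption.

```python
POWER_ON_HOURS = 4183          # raw value of SMART attribute 9 above
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 365.25 / 12   # average month length, an approximation

days = POWER_ON_HOURS / HOURS_PER_DAY
months = days / DAYS_PER_MONTH

print(f"{days:.0f} days, about {months:.1f} months")
```

That works out to a little under six months, which lines up with a drive installed roughly five months ago and left running 24/7.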

There are a lot of unknown attrs, some of which have large values. I don’t really know if those are bad, but everything else seems good. I’d give the RAM in your ESXi machine a stress test and double-check the cable.

What model SSD is it?

Kingston A400, not particularly an “amazing drive”, but I figured for the very low use it would see it should suffice.

Memtest for memory testing?

I wasn’t judging the SSD, just going to use the model to see if I can figure out what some of those attrs are.

And yes, I’d just burn an Ubuntu ISO/USB and use the memtest on that.

Yea, just giving you my take on the hardware and the background for the choice. If this system saw massive reads/writes, I likely wouldn’t have gone with it. But my VMs really don’t do anything at all. I am fairly certain a Windows-based machine would make MANY more reads and writes, thus the hardware choice.

And yea, if you can find out info on them, that would be great! I will try and google around as well.

SA400S3 is the model number. I am not finding correct data for this…

I couldn’t find any data either. Sorry.

Any updates on memtest/cable?

I have not tried a new cable yet, and memtest is still running. I figure I will just let it run until I can get back to it later this afternoon, so far no errors after ~16 hours or so. I doubt it will have any, but as I said might as well let it keep going.

I will try a new cable, and try and reach out to Kingston support and see if I can get any answers from them, although I sorta doubt it…

Ran 26 hours, no errors. Rebooting now with a new cable in a new SATA slot on the mobo. I sorta doubt it’s this, but I guess I can try and fix the VMs and see if it happens again. If so, I guess we know the drive is on its way out…

Got the VMs all back up, curiously only one of them “remembered” it was upset, ¯\_(ツ)_/¯. Ran fsck on the one that was upset, it seemed to repair itself… I guess now we wait and see what happens.

Honestly, it’s starting to sound like it’s just a fluke.