(TrueNAS) How Screwed am I? (pool offline, please help)

My array has been having issues for some time now (over a year), with drives becoming unavailable and throwing errors, but the drives all test fine when I check them in Windows, so I've been re-adding them to the array.

Current system:

  • 2 vdevs of 10x WD Red Pro 20TB in RAID-Z2
  • LSI 9305-24i HBA
  • AMD Epyc 7443P
  • ASRock ROMED8-2T motherboard
  • 512GB ECC RAM
  • Silverstone RM43-320-RS case

I've tried another 9305 HBA, new SAS cables, a different case (I switched from a Norco 4224), and re-seating the CPU, but the problem keeps persisting.

Hell broke loose yesterday

The pool went offline, and restarts haven't restored it.

As you can see from the image, 2 drives are offline and 1 was in the process of being replaced. 8 out of 10 drives should still be enough for a RAID-Z2 vdev to keep going, but when I go to import the pool it doesn't give me the option in the GUI, so I tried using

zpool import

in the shell, but I get the following error.

How screwed am I? What can I do to recover my data? (Of course, no backups.)

Only 8TB of the 120TB used is data I need to get off, as it can't be replaced.

Over the course of the year I have replaced every single component in the system: I've tried 2 different HBAs, new SAS cables, a new case (backplanes), a fresh TrueNAS install, and a new CPU platform. I had the same issue with 8TB drives, so I upgraded to 20TB drives. I have run out of things to replace.

Yes, I know I should have had backups, but I hadn't got to the root of the problem with all the "old" parts; once I had, I was going to use them to build a backup system.
It is hard to back up when you have so much data.

1 is none and 2 is 1
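For reference, the usual shell-level sequence when the GUI won't offer an import looks something like this; it's only a sketch (Tank as the pool name, and the -F/-n combination is a non-destructive dry run of a rewind recovery):

zpool import                                  # with no pool name: list importable pools and their reported state
zpool import -F -n Tank                       # dry run: check whether a rewind-style recovery is possible
zpool import -f -o readonly=on -R /mnt Tank   # read-only import under /mnt so data can be copied off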

I've tried zpool import -f -o readonly=on -R /mnt Tank and I was able to see the datasets, but I wasn't able to see the pool on the Storage page or create a share that I could use to access the data:

[EINVAL] sharingsmb_create.path_local: The path must reside within a pool mount point

 Error: Traceback (most recent call last):
   File "/usr/lib/python3/dist-packages/middlewared/main.py", line 198, in call_method
     result = await self.middleware.call_with_audit(message['method'], serviceobj, methodobj, params, self)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1466, in call_with_audit
     result = await self._call(method, serviceobj, methodobj, params, app=app,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1417, in _call
     return await methodobj(*prepared_call.args)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/lib/python3/dist-packages/middlewared/service/crud_service.py", line 179, in create
     return await self.middleware._call(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1417, in _call
     return await methodobj(*prepared_call.args)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/lib/python3/dist-packages/middlewared/service/crud_service.py", line 210, in nf
     rv = await func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 47, in nf
     res = await f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 187, in nf
     return await func(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/lib/python3/dist-packages/middlewared/plugins/smb.py", line 1022, in do_create
     verrors.check()
   File "/usr/lib/python3/dist-packages/middlewared/service_exception.py", line 70, in check
     raise self
 middlewared.service_exception.ValidationErrors: [EINVAL] sharingsmb_create.path_local: The path must reside within a pool mount point
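That validation error is presumably because the pool was imported from the shell, so the TrueNAS middleware doesn't treat /mnt/Tank as a managed pool mount point and refuses to put an SMB share there. A quick shell-side sanity check that the data is actually reachable under the altroot (again just a sketch, with Tank as the pool name):

zpool status -v Tank    # which vdev members are ONLINE/DEGRADED/UNAVAIL
zfs list -r Tank        # datasets and their mountpoints under /mnt
ls /mnt/Tank            # the files should be browsable here from the shell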


I did have similar issues and could finally track mine down to HBA overheating. LSI HBAs (as far as I could determine) don't log their temperature to the OS, so I had to resort to external temperature measurements to identify the issue.
Also, 20 HDDs plus fans put quite some load on the PSU. An earlier issue I had was related to having fans powered by the same cable that powered the HDDs. Running fans and HDDs on separate cables improved my situation.

I have an 80mm Noctua fan pointing directly at the heatsink.

I think the fans are connected to the motherboard and not the backplane of the case, but I'll have to check after work.

My UPS says the whole rack is only drawing 400W or something like that, so it shouldn't be the power supply.

👌

Get a temp probe. I have the same setup and it's still occasionally overheating during extended periods of full load, but much less.

What temp would count as overheating? 80°C?

What sort of temp probe did you use? A non-contact laser thermometer, or one with a wire (K-type digital thermometer)?
I don't have anything like these, so I would have to order one.

https://www.amazon.com.au/Thermocouples-Thermometer-Thermocouple-Industrial-Temperature/dp/B0CQJYB71Q/ref=sr_1_5?crid=SWCSKBT8P80T&dib=eyJ2IjoiMSJ9.OEaXaxMpnNJtwYMgxHlhRnxDfxQ68-m9AtMoTyiKi0wecDiE-kGNaDxXz1gjImQYBWYKu-RP9F8RmurAqgvh9DAD00wNSb-rltZkv-KleqKPg_0gtOZm6qnGI43JijI_DRLzi5XMj2seQXg_RhmMvjIXK9HZnepoqBzGfgejrHU14JCv9AmDO8Lhi1RLonaqFYpUEVfcixrDW3dGpQEgeUMEk8W_16fdJBvv5zJXrw_u3wZRQp6UBOxz0vpABj-tC8ig3n0UMBwqkH_As5CREqW4sJCuo43S6ZzkI635VJo.SCC9FRWGgGSyBJ89Yn1_1yEf3scmG2raYYcB6O--b4o&dib_tag=se&keywords=temperature+probe+thermocouple&qid=1715817340&sprefix=temperature+probe+thermocoupl%2Caps%2C213&sr=8-5

My LSI SAS2308-based HBA will start resetting drives around 60°C, as measured by a K-type digital thermometer.
The resets are clearly visible in the system log (dmesg | less).
Pointing a fan at it drops the temperature to 50°C when idle, but at sustained load (a scrub) temps will creep up until I start seeing events in the log. Much less often though, and without data corruption, so for my purposes I call that good 'nuff.
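If it helps, a rough way to fish those reset events out of the log on SCALE; the grep terms are just guesses for an mpt3sas-based LSI HBA and may need adjusting:

dmesg | grep -iE 'reset|timeout|mpt3sas'   # past events since boot
dmesg -w                                   # follow the kernel log live during a scrub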

I was able to mount the pool using zpool import -f -o readonly=on -R /mnt Tank.
Then, using SFTP over SSH and FileZilla, I was able to connect to the pool and start copying the critical data off.
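For anyone else in the same spot, rsync over SSH is another option for the irreplaceable data, since it can resume if the pool drops mid-copy; the paths and hostname here are just placeholders:

rsync -avh --progress --partial root@truenas:/mnt/Tank/critical/ /local/backup/critical/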

I just have to hope it holds together until I get the data off that I need.

The next thing is finding the root cause and then rebuilding.
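As a starting point for that, long SMART self-tests on each member plus a scrub once the pool is healthy and writable again seem worth doing; the device name below is a placeholder:

smartctl -t long /dev/sda   # start a long self-test on one drive
smartctl -a /dev/sda        # review results and error counters afterwards
zpool scrub Tank            # only once the pool is imported normally (not read-only)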

I've ordered a Synology NAS to set up as a backup so that this doesn't happen again without a backup of my critical files.

I’ve still got a long way to go but it’s looking positive
