My attempt at a perfect (for me) backup solution with TrueNAS, SyncThing and BackBlaze (ongoing)

Hey all!

Well, I’ve had some fun :roll_eyes:

For a long time I’ve been juggling the 20GB of cloud storage that I’ve had with Dropbox for the better part of 10 years. It’s been an amazing and free facility that I’ve genuinely run a business on. Unfortunately though, with this free account I have noticed a reduced service, which is fair enough as I don’t pay for it. I’ve been tempted to stump up the £7-10 a month for 2TB of storage, but I’ve always been conflicted about this and in the end went to the big G and got business email with 1TB of cloud storage.

The trouble is, I’ve had the odd problem with my apps when using files stored in cloud connected folders, so it’s not uncommon for me to pause sync to avoid this, work for a few hours and then un-pause. The other aspect is the speed of sync’ing across devices when the WAN connection is involved, on my 70Mb down/20Mb up connection. Dropbox does have a super facility for LAN sync’ing and if Google could do this too, I’d be keen to keep things as they are.

To network drive or locally sync?

When I first thought about using an ‘always on’ server for daily accessed files, I thought that should be all I need. But then it occurred to me, what if the network went down? What if the server went down? I’d be screwed and unable to work. So I turned to SyncThing.

It has taken some time for me to get my head around it, but I think I’m there now and am just testing with a few text files. It’s a funny rollercoaster of learning, at times I thought it wasn’t going to be for me, too erratic. But then I gave it more effort and time and I’m now incredibly impressed with how it performs (and I was the erratic factor).

So I’ve now gotten the Sync’ing sorted, nice :slight_smile: But my next concern is backup - of course at minimum I have 2 copies (one on workstation, one on daily ‘always on’ TrueNAS server), but shouldn’t I be doing that 3-2-1 thingy-bob? Well, I do have 2 other TrueNAS instances running, so OK, replication it is, sweet. But I typically only turn those machines on every once in a while, one is my media server - I’d love to keep them on all the time, but energy costs, yada yada.

TrueNAS Scale

Now, as a precaution I thought I’d buy a spare CPU for my daily server (an Intel G4560), just in case :slight_smile: It was sitting around, looking a bit sad - AND I didn’t even know if it worked (second hand), so I built a low power machine after sourcing the RAM & motherboard. Then I wondered what I could perhaps use it for - I have plenty of old HDDs so that’s not a problem. Hmm, I’d quite like to play with TrueNAS Command, let’s see what it needs. Ah…bit sad you can’t run it on TrueNAS Core, maybe I’ll learn VMs using XCP-NG, Tom from Lawrence Systems uses it and says it’s alright. Ahh, not so sure I wanna learn about that, how about something closer to home, I wonder how Scale is getting on? Oh, quite stable, let’s give that a crack then!

So the install went well, the GUI is semi-familiar, so I tried a few replications from Core machines to the Scale machine and they went very well. Next was to install SyncThing, god that was easy. Oh, whoops, forgot that my intention was to install TrueNAS Command. Got that working and it was really nice to see all my TrueNAS servers in one place. Of course though, I rarely have them all on, so that was a bit pointless wasn’t it…doh! :roll_eyes:

Back to SyncThing though, that could be quite a nice low power multiple copy solution, so what was a spare/play machine can now be a 3rd on-site location for Sync’d files. Nice.

BackBlaze

I’m fairly confident that I have redundancy, but there’s nothing like having an off-site backup eh? Yes, technically my servers are in the house and the Scale install is in the office - buildings separated by around 20 metres / 60 feet, but maybe it’ll be nice to have a completely off-site copy. So I turned to Backblaze, having used it for many years without issue. However it was the ‘home’ version and backing up around 12TB of data…switching this to B2 would be so expensive, so it’s time to scale back a little on what is sent to BackBlaze.

I’ve managed to ensure that SyncThing folders are sync’d with a Backblaze account, yey! It was a fiddle as those pesky Class C transactions have to be carefully monitored. Many users that have tried to combine SyncThing with Backblaze have suffered issues, so I’m trialling an exclusion policy and so far it seems OK. Of course though, it’s not backing up the superseded files, but hey, I have 2-3 copies of those, so no biggie.
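
The exclusion policy isn’t spelled out above, so this is only a minimal sketch of the idea, not the actual rules in use. It assumes the aim is to keep SyncThing’s own housekeeping files (the `.stfolder` marker, the `.stversions` directory and in-progress temporary files) out of the cloud upload, so they don’t generate extra transactions. The pattern list and the `should_upload` helper are hypothetical names made up for illustration.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical exclusion patterns (an assumption, not the poster's real list):
# names SyncThing creates for its own bookkeeping.
EXCLUDE_PATTERNS = [
    ".stfolder",         # folder marker
    ".stversions",       # superseded file versions kept by SyncThing
    ".syncthing.*.tmp",  # in-progress temporary files (Unix-style name)
    "~syncthing~*.tmp",  # in-progress temporary files (Windows-style name)
]


def should_upload(relative_path: str) -> bool:
    """Return False if any path component matches an exclusion pattern."""
    parts = PurePosixPath(relative_path).parts
    return not any(
        fnmatchcase(part, pattern)
        for part in parts
        for pattern in EXCLUDE_PATTERNS
    )


if __name__ == "__main__":
    for path in [
        "docs/report.txt",
        "docs/.syncthing.report.txt.tmp",
        ".stversions/docs/old-report.txt",
    ]:
        print(path, "->", "upload" if should_upload(path) else "skip")
```

Matching on every path component (rather than just the file name) means anything inside `.stversions` is skipped too, which fits the note above that superseded files are not being backed up.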

I’m still playing with a free BackBlaze account, but so far it’s working well. I will of course have to keep a close eye on it when I put it into production, but fingers crossed! For the time being, it’s nice to simply use txt files and see how that goes, but so far, so good.

So, I think the final solution will be something like this:

TrueNAS Core 1 - media and cold storage pool. Then another pool for daily accessed and SyncThing (Building 1)
TrueNAS Core 2 - replication of above (Building 1)
TrueNAS Core Daily - on all the time and only has my daily pool (Building 1)
TrueNAS Scale 1 (may be on all the time) - SyncThing (Building 2)
Workstation - SyncThing (Building 2)
TrueNAS Core 3 (eventually) - another replication server, turned on every once in a while.
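
As a rough sanity check of that layout against the 3-2-1 idea mentioned earlier (at least three copies, on two kinds of storage, with one off-site), here is a small sketch. The inventory is just one reading of the list for the daily/SyncThing data, the `DataCopy` type and `check_321` helper are made up for illustration, and counting BackBlaze as the only off-site copy is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    host: str
    site: str    # "building-1", "building-2" or "off-site"
    medium: str  # "local-disk" or "cloud"


# One reading of the planned layout (daily/SyncThing data only).
COPIES = [
    DataCopy("TrueNAS Core 1", "building-1", "local-disk"),
    DataCopy("TrueNAS Core 2", "building-1", "local-disk"),
    DataCopy("TrueNAS Core Daily", "building-1", "local-disk"),
    DataCopy("TrueNAS Scale 1", "building-2", "local-disk"),
    DataCopy("Workstation", "building-2", "local-disk"),
    DataCopy("BackBlaze", "off-site", "cloud"),
]


def check_321(copies):
    """Evaluate the three parts of the 3-2-1 rule separately."""
    return {
        "three_copies": len(copies) >= 3,
        "two_media": len({c.medium for c in copies}) >= 2,
        "one_offsite": any(c.site == "off-site" for c in copies),
    }


if __name__ == "__main__":
    print(check_321(COPIES))
    on_site_only = [c for c in COPIES if c.site != "off-site"]
    print(check_321(on_site_only))
```

With the cloud copy removed, the copy count still passes but the other two checks fail, which is the argument for keeping something off-site even with five machines on the property.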

Once I’ve dug up the garden, I’m going to FINALLY lay the 2" conduit and have 1 x 10G fibre connection, 1 x 1G fibre connection and 1 or 2x CAT cable for redundancy. Of course, this might change: if I can shoehorn my servers into the store room, I might just keep the majority of servers in there and put fewer machines in the house (Building 1). Now that I’ve figured out TailScale, there’s always the chance that I put an off-site backup server in my parents’ house, who live in the same village as me. In which case, I could potentially have 20TB of off-site backup at a cost of…well, just electricity, which is cheaper than BackBlaze by a long way!

Anyway, I’d be curious if anyone has read this, but if you have, I welcome any comments.

Cheers! Chris

Why not replace TrueNAS Core on the 2nd PC you built and just run basic Windows on it instead, use SyncThing on it to make that a 3rd replication target (instead of TrueNAS handling it), then install normal BackBlaze to get the whole thing backed up for $8/mo?
That way you have your workstation main backup using SyncThing, TrueNAS as your LAN backup, and the 2nd PC basically acting like a host to the cloud backup.

Do you know Enigma, that’s a really good idea :+1: :clap: It would need a very careful setup, but I’m definitely going to experiment with the concept.

I had seen methods of using iSCSI to make a Windows machine think it’s just a normal external drive, but I wasn’t confident with building the devs.

Cheers though mate, I’m going to give it a go…and I still have a spare windows machine that’s still setup with backblaze too! Thanks again :+1:

Thinking more about this @EniGmA1987 , it really is a good idea. It does involve a fair bit of work though, with the creation of mount points within the jail, but I’m definitely giving it a go!

Ahhhh, testing and practice is the interesting bit…but I’m apprehensive about implementing it. I’ve got a lot of files that I don’t want to put at risk by making a config mistake! :scream: :fearful:

I think I’ll do a little more testing and then go for it…it’ll be a stressful time though, wish I had someone looking over my shoulder for errors!

An exciting day…for me anyway! My first hard drive failure, yey!!! And only on a test system running TrueNAS Scale.

My only issue: a booting Scale seems to be a CPU hog as it updates the catalogues. Shame that can’t easily be disabled?

email:

The number of I/O errors associated with a ZFS device exceeded
acceptable levels. ZFS has marked the device as faulted.

impact: Fault tolerance of the pool may be compromised.
eid: 776
class: statechange
state: FAULTED
host: Scale
time: 2022-10-20 18:07:36+0100
vpath: /dev/disk/by-partuuid/95a93c60-1054-457c-b0d5-6412e7b909f0
vguid: 0xE47F4DD24D42F7E4
pool: 0x56010A1AD3A2E309

Followed by:

Pool Daily-Replication-3TB-twinzies state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected

and finally:

The following devices are not healthy:
Disk WDC_WD30EFRX-68EUZN0 WD-WCC4N5YF8HR5 is FAULTED

It’s the stuff I practiced for…yeah OK, it’s simple stuff to you, but I needed a number of trial runs to feel confident!

So figure out the drive:

select the replacement:

and the physical bit is done!

I turned off all replication tasks, SMART and snapshots, just to ease the load. Also turned off SyncThing that was happily running in the background.

Initially it said 915 days to resilver (at 19:33hrs), at 19:27hrs that came down to 12 hours. At 19:47hrs that became 2.5hrs remaining and as of now (20:00hrs) it’s saying just under 2 hours remaining.

Might be worth noting, these are truly old, worn-out hard drives of the wrong type for a NAS, hence it’s a test system! It includes 2 x 3TB Desktop SMR drives that are mirrored.

Fun stuff anyway, and way less drama than a lonely hard drive that failed on its own! :slight_smile:

My only other disappointment with Scale is that I got this message when the resilver had only just started. Kinda suggests it’s already re-silvered, when it hasn’t!

ZFS has finished a resilver:

eid: 208
class: resilver_finish
host: Scale
time: 2022-10-20 19:18:26+0100
pool: Daily-Replication-3TB-twinzies
state: ONLINE
scan: resilvered 324K in 00:00:01 with 0 errors on Thu Oct 20 19:18:26 2022
config:

NAME                                      STATE   READ WRITE CKSUM
Daily-Replication-3TB-twinzies            ONLINE     0     0     0
  mirror-0                                ONLINE     0     0     0
    a47a08ba-fb29-442f-aeed-83d7492f74fe  ONLINE     0     0     0
    95a93c60-1054-457c-b0d5-6412e7b909f0  ONLINE     0     0     0

errors: No known data errors

Wow, that’s a bit of a misleading message…

Yeah, just a little bit. :roll_eyes: I’m sure iXSystems know about it, I’m sure someone would have mentioned it by now.

Minor update, the background stuff has finished now, and we’re hovering around 9-11% CPU usage while it continues to re-silver.

The only other issue with Scale, is that when the system booted up, it said the pool was healthy. I honestly can’t remember what happened the last time I had a Core hard drive fail, but I’m fairly sure it remained “unhealthy” when the system booted up again.

Well, it finished, and this time the message is valid:

ZFS has finished a resilver:

eid: 655
class: resilver_finish
host: Scale
time: 2022-10-20 21:53:14+0100
pool: Daily-Replication-3TB-twinzies
state: ONLINE
scan: resilvered 961G in 02:32:14 with 0 errors on Thu Oct 20 21:53:14 2022
config:

NAME                                        STATE   READ WRITE CKSUM
Daily-Replication-3TB-twinzies              ONLINE     0     0     0
  mirror-0                                  ONLINE     0     0     0
    a47a08ba-fb29-442f-aeed-83d7492f74fe    ONLINE     0     0     0
    replacing-1                             ONLINE     0     0     0
      95a93c60-1054-457c-b0d5-6412e7b909f0  ONLINE     0     0     0
      5d262524-cfa7-40fc-81c6-afc10d001f4e  ONLINE     0     0     0
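
For anyone wanting to tell the two resilver_finish emails apart without reading them closely, the `scan:` line carries enough information. This is a minimal sketch that pulls the amount, duration and error count out of that line. The `parse_resilver` name is made up, it assumes the size suffixes are powers of 1024, and it only handles the `HH:MM:SS` duration format seen in these two emails.

```python
import re

# Matches e.g. "resilvered 961G in 02:32:14 with 0 errors"
SCAN_RE = re.compile(
    r"resilvered\s+(?P<amount>[\d.]+)(?P<unit>[KMGT])\s+in\s+"
    r"(?P<h>\d+):(?P<m>\d+):(?P<s>\d+)\s+with\s+(?P<errors>\d+)\s+errors"
)

# Assumption: suffixes are binary multiples.
UNIT_BYTES = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}


def parse_resilver(scan_line):
    """Return (bytes_resilvered, seconds, error_count) from a scan: line."""
    match = SCAN_RE.search(scan_line)
    if match is None:
        raise ValueError("not a resilver scan line")
    size = float(match["amount"]) * UNIT_BYTES[match["unit"]]
    seconds = int(match["h"]) * 3600 + int(match["m"]) * 60 + int(match["s"])
    return size, seconds, int(match["errors"])


if __name__ == "__main__":
    early = "scan: resilvered 324K in 00:00:01 with 0 errors on Thu Oct 20 19:18:26 2022"
    final = "scan: resilvered 961G in 02:32:14 with 0 errors on Thu Oct 20 21:53:14 2022"
    for line in (early, final):
        size, seconds, errors = parse_resilver(line)
        print(f"{size / 1e9:.3f} GB in {seconds}s, "
              f"{size / seconds / 1e6:.1f} MB/s, {errors} errors")
```

The first email covers a few hundred kilobytes in one second, while the second works out to roughly 113 MB/s sustained over about two and a half hours, so the amount resilvered is the quickest way to spot the misleading one.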

Just following on from here, I thought I’d do a SMART test of the faulty drive before throwing it away. I’m a bit confused as I don’t see many errors after doing long tests and badblocks tests?

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       41
  3 Spin_Up_Time            0x0027   179   168   021    Pre-fail  Always       -       6041
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1516
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   080   080   000    Old_age   Always       -       14956
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1104
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       163
193 Load_Cycle_Count        0x0032   195   195   000    Old_age   Always       -       15843
194 Temperature_Celsius     0x0022   115   090   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       14

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     14909         -
# 2  Extended offline    Interrupted (host reset)      90%     14894         -
# 3  Extended offline    Completed without error       00%     14872         -
# 4  Extended offline    Interrupted (host reset)      90%     14863         -
# 5  Extended offline    Completed without error       00%     14849         -
# 6  Short offline       Completed without error       00%     14836         -
# 7  Short offline       Completed without error       00%     14656         -
# 8  Short offline       Completed without error       00%     14037         -
# 9  Short offline       Completed without error       00%     13388         -
#10  Short offline       Completed without error       00%     13021         -
#11  Short offline       Completed without error       00%     12158         -
#12  Short offline       Completed without error       00%     11506         -
#13  Short offline       Completed without error       00%     10521         -
#14  Short offline       Completed without error       00%      7829         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I always thought the following were the ones to watch out for, but the drive seems fine?

5 (Reallocated sector count)
196 (Reallocated event count)
197 (Current pending sector)
198 (Offline uncorrectable)

Welcome any thoughts! :+1:
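
Checking those four attributes can be done mechanically on the pasted table. This is a minimal sketch that reads rows in the column layout shown above and reports any watched attribute with a non-zero raw value. The `flagged` helper and the `WATCH_IDS` set are made up for illustration, the sample rows are copied from the output above, and it assumes the raw value is a plain integer in the tenth column.

```python
# Attribute IDs from the list above.
WATCH_IDS = {5, 196, 197, 198}

SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       41
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       14
"""


def flagged(table_text, watch_ids=WATCH_IDS):
    """Return {id: (name, raw_value)} for watched attributes with raw > 0."""
    result = {}
    for line in table_text.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and have at least ten columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id not in watch_ids:
            continue
        raw = int(fields[9])
        if raw > 0:
            result[attr_id] = (fields[1], raw)
    return result


if __name__ == "__main__":
    print(flagged(SAMPLE) or "no watched attribute has a non-zero raw value")
```

On this drive the result is empty, which matches the reading that those four counters look fine. That only says the counters are zero, though; it doesn’t by itself explain why ZFS saw enough I/O errors to fault the disk.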

Well it’s been a couple of months now using SyncThing, I’ve had only a few tiny niggles that were user error derived, but it’s been in full production mode for around 3 months without issue.

My little observations:

  • Using the exclude feature isn’t entirely worth it for workflow with others, if they’re not familiar with the file creation quirk.
  • It could be used to sync specific file types with an Android phone/tablet, but I have some concerns with this as issues did occur (fixed with an update though).
  • File versioning is really helpful, I haven’t had to use it, but it’s nice that it’s there.
  • My hardly ever used laptop is now part of the cluster (perhaps not the right word) and because I haven’t used a spoke layout, if my server dies, there is still at least another copy. I also have another server that I can fire up any time to pick up the duplication slack.
  • Backblaze does work with backing up SyncThing directories very well once you tweak it a little.
  • A big bonus is that it doesn’t sync instantly; some apps didn’t like having files sync’d so quickly when using Dropbox/Google Cloud. No problems at all with SyncThing.
  • Downside with Windows is that the explorer window doesn’t auto-refresh, so if another user is viewing a file that has recently changed, they’d have to press F5 to get the correct file version.
  • SyncThing installed on TrueNAS Scale plays well with Core installs.

So there we go, overall very happy with the shift. My next plan is to have a 4th back-up in a separate structure (though on the same site). I feel this may make Backblaze redundant, though I will probably still use it for critical files.

Only a minor update - nothing to report, SyncThing has been working faultlessly (touch wood).

Well, I thought I had a #SyncThing issue.

But in the end, it was me, just being a good little file manager!

So all still well.

TrueNAS Scale is a bit weird at times lately, but I don’t depend on it for anything other than a SyncThing secondary target:

Failed to sync TRUECHARTS catalog: [EFAULT] Invalid operation: ==

Bit of a shame.

Well, it looks like an overhaul is coming, but in the meantime I needed to replicate around 5.3TB of stuff. Two separate TrueNAS machines did it in 7 hours. That was:

FROM an 8x4TB RAIDZ2

TO a 3x14TB RAIDZ1

You probably guessed it, but it was over a 10G network :slight_smile:
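
To put a number on that, here is the back-of-envelope arithmetic as a short sketch. It assumes 5.3TB means decimal terabytes and treats both figures as the round numbers they are, so the result is only approximate. The `average_rate` helper is just an illustrative name.

```python
def average_rate(total_bytes, seconds):
    """Return the average transfer rate as (MB/s, Gbit/s), decimal units."""
    mb_per_s = total_bytes / seconds / 1e6
    gbit_per_s = total_bytes * 8 / seconds / 1e9
    return mb_per_s, gbit_per_s


if __name__ == "__main__":
    # Around 5.3 TB replicated in about 7 hours.
    mb, gbit = average_rate(5.3e12, 7 * 3600)
    print(f"{mb:.0f} MB/s, {gbit:.2f} Gbit/s")  # 210 MB/s, 1.68 Gbit/s
```

That average is more than a 1G link can carry (125 MB/s at best), so the 10G network did make a difference, but it is a long way short of 10G line rate, which suggests the pools rather than the network set the pace.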