Project Concept - Community Replicated Backup Storage

SgtAwesomesauce · May 7, 2024, 12:25am

I’ve been paying Backblaze for B2 offsite storage for a while now. While it’s something I can afford, I also have excess local storage, somewhere on the order of 15TB. While I don’t need to keep all my data backed up offsite, I’d really like the ability to back up, I don’t know, the most important 500GB or so. With all that said, I have the following idea:

If I were to offer 500GB of storage on my NAS to a community member, and they reciprocated, then I did it again for another 2-3 people, I’d have a more reliable backup, and it would cost me nothing but time!

So, how would we go about doing this? That’s the challenge. My initial thought was to build a container into which SSH keys are loaded and a volume is mounted. This would work best if you’re running ZFS or some other filesystem that supports quotas, so you can give each user a quota. Basically, we’d have some sort of coordination system, which would tell each user who they’re peering with and help distribute username allocations.

Prototype spec for a join request:

name: sarge
email: [email@domain]
endpoint:
  address: [IP addr or DNS record]
  port: [whatever port is open]
spaceOffered: 1.5Ti
spaceRequested: 500Gi
peersRequested: 3
egressThroughput: 500Mi
ingressThroughput: 1Gi
publicKey: |
  asdf
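A minimal sketch of how a coordinator might turn the size fields of such a request into bytes. The names `parse_size`, `JoinRequest` and `from_fields` are hypothetical, only a few of the fields are modelled, and reading `Gi`/`Ti` as binary (power-of-two) units is an assumption the spec does not state.

```python
from dataclasses import dataclass

# Binary (power-of-two) suffixes. This reading of "Gi"/"Ti" is an assumption.
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_size(text: str) -> int:
    """Convert a string such as '1.5Ti' into a whole number of bytes."""
    text = text.strip()
    for suffix, factor in _UNITS.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    return int(text)  # no suffix: treat the value as plain bytes


@dataclass
class JoinRequest:
    """Hypothetical in-memory form of a join request (subset of fields)."""
    name: str
    space_offered: int    # bytes
    space_requested: int  # bytes
    peers_requested: int

    @classmethod
    def from_fields(cls, fields: dict) -> "JoinRequest":
        return cls(
            name=fields["name"],
            space_offered=parse_size(str(fields["spaceOffered"])),
            space_requested=parse_size(str(fields["spaceRequested"])),
            peers_requested=int(fields["peersRequested"]),
        )


request = JoinRequest.from_fields(
    {"name": "sarge", "spaceOffered": "1.5Ti",
     "spaceRequested": "500Gi", "peersRequested": 3}
)
```

The dictionary passed to `from_fields` stands in for whatever a YAML parser would return; no particular parser is assumed here.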

Rules would be as follows:

No illegal content. Nobody wants to get in trouble here. (Pirated material and anything else that would get the police involved are absolutely prohibited.)
Encrypt your backups. Your peering partners are not responsible for encryption, that’s on you. They just host the bits as you send them.
Expectation of 95% uptime. We’re homelabbers, not datacenters; 100% uptime isn’t realistic, but we do need it to be somewhat reliable.
You should be reachable within 48 hours via either forum DM or email.

Requirements:

At least 100 Mbps of upload (so that if someone needs to recover data, they can do it in a reasonable timeframe)
Must run Linux
Must be able to peer for at least 2x the amount of space you’d like to consume.
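To put rough numbers on the rules and requirements above, here is a small worked sketch. The function names are made up for illustration, decimal units are assumed (1 GB = 10^9 bytes, 1 Mbps = 10^6 bits/s), and the 2x rule is read as "offer at least twice what you consume", which is only one possible interpretation.

```python
def restore_hours(size_bytes: float, upload_mbps: float) -> float:
    """Hours to pull size_bytes from a peer uploading at upload_mbps (megabits/s)."""
    seconds = size_bytes * 8 / (upload_mbps * 1_000_000)
    return seconds / 3600


def downtime_budget_hours(uptime_fraction: float, days: int = 30) -> float:
    """Hours a host may be offline over `days` and still meet the uptime target."""
    return (1 - uptime_fraction) * days * 24


def offers_enough_space(offered_bytes: int, requested_bytes: int) -> bool:
    """The 2x requirement, read as: offer at least twice what you consume."""
    return offered_bytes >= 2 * requested_bytes


print(f"{restore_hours(500e9, 100):.1f} h to restore 500 GB at 100 Mbps")
print(f"{downtime_budget_hours(0.95):.0f} h of downtime allowed per 30 days")
```

Under these assumptions a full 500GB restore from a single peer takes about 11 hours at 100 Mbps, and 95% uptime leaves roughly 36 hours of downtime in a 30-day month.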

Any thoughts on this? Would this be fun? Am I off my rocker?

jode · May 7, 2024, 12:47am

How to secure access to endpoint? VPN? SSH?

I monitor all the activity I get on my open ports, and 99%+ of it is probes for known vulnerabilities.

I stopped exposing SSH for that reason.

SgtAwesomesauce · May 7, 2024, 12:52am

It’s just a concept right now. I’m thinking we could do SFTP, MinIO, whatever. We could set up a WireGuard group to interface over a VPN if we want that level of security. That would require multiple layers of auth, though.

Dynamic_Gravity · May 7, 2024, 1:21am

I’ve also thought about this same idea.

Ultimately, it’s a decision of trust.

regulareel · May 7, 2024, 1:33am

Isn’t Storj something like this? It’s a paid offering, and if you allow them access to your homelab, Storj will pay you if you have good uptime or something.

I guess something community based is nice…

LiKenun · May 7, 2024, 1:40am

This rings a very faint bell. Some project of the same sort but with blockchain or something to keep everyone honest, balance storage contributed with storage used, and possibly zero-knowledge encrypted storage.

jode · May 7, 2024, 1:43am

Yes, I like the idea of it. That’s why I am replying to this thread.

Well, most “community members” have a very simple security setup: a firewall that prevents bad actors from accessing resources inside it.
This proposal purposefully allows community members inside the firewall, potentially with SSH/terminal access.

Before I agree to that I will have to think hard about how to make that work in my environment.

regulareel · May 7, 2024, 1:44am

I guess you should be paired with peers whose uptime matches yours, like that old BitTyrant concept (torrenting where you give preferential seeding/leeching to IPs whose upload is as good as yours).

But honestly, personally I’m not that good at maintaining systems, and inviting you guys would be the equivalent of inviting aliens to Earth. I mean, you already know interstellar travel (good IT/infosec skills) and I haven’t figured it out yet (and I am eating glue in the corner right now).

SgtAwesomesauce · May 7, 2024, 1:45am

That’s why I’m considering other options. I had a thought to extend the restic REST server with user management and Auth.

My first thought was SFTP, but that seems like a lot of trust. A REST API would be a lot more reasonable for people; plus, you can put it behind your existing reverse proxy.

SgtAwesomesauce · May 7, 2024, 1:51am

Yeah, but when I played with it, it felt really slow and awful to use.

Also don’t like the whole blockchain thing.

SgtAwesomesauce · May 7, 2024, 2:11am

Home from work; let me try to flesh out some ideas here.

REST API extension:

Two applications: server and coordinator, both open source, BSD 2-clause

Coordinator would be responsible for:

User registration/onboarding process
Aggregation of Server uptimes
Peer selection
Config distribution (which users are allowed to push to which servers)

Server would be in charge of:

serving the API
authenticating users (users could generate a key/secret from the coordinator and it would propagate to the Servers on the next heartbeat)
enforcing quotas
publishing metrics (Prometheus-style, available for local network and coordinator)

Now the big question I have is this: Should the user back up to one peer (let’s call that the leader), whose Server would then push to the other configured peers, or should the user be responsible for pushing to each peer individually (or programmatically)?

Onboarding Process:

A new user would be able to start onboarding via the coordinator webui. The coordinator would be publicly available and probably hosted by myself.

They’d register and make a username/password for the coordinator UI.

From there, existing members would get a notification that a new user is trying to join. We’d see their request and, most importantly, their forum username. My recommendation is we stick to forum members who are known quantities. People who’ve been around for a while and whatnot.

We would have the option to approve or deny them. This could either be a vote or just a queue for an “admin” user to approve or not. Once approved, their quota request and peer count would be taken into consideration to select their peers. The difficult part is this: Joining/Leaving peers would require a reshuffling of the peer allocation. How best to do that? That’s going to be the subject of a lot of thinking. I could design a function to select peers based on available space, network speed and current peer count (don’t want everyone on the one person with 5gig symmetric)
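As a rough sketch of the peer-selection function described above: filter out hosts without enough free space, then prefer hosts that currently serve the fewest members, using speed and free space only as tie-breakers. `Candidate`, `select_peers` and the ordering itself are assumptions for illustration, not a settled design.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    free_bytes: int      # space still unallocated on this host
    upload_mbps: float   # advertised upload speed
    peer_count: int      # how many members already store data here


def select_peers(candidates, space_needed, peers_wanted, exclude=()):
    """Pick hosts with enough free space, preferring lightly loaded ones."""
    eligible = [
        c for c in candidates
        if c.name not in exclude and c.free_bytes >= space_needed
    ]
    # Fewest current peers first, so the member with the fastest line does not
    # become everyone's first choice; speed and free space only break ties.
    eligible.sort(key=lambda c: (c.peer_count, -c.upload_mbps, -c.free_bytes))
    return eligible[:peers_wanted]
```

Sorting on peer count first is one way to avoid piling everyone onto the fastest connection. It does not address the reshuffling problem when members join or leave.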

If denied, that’s the end of the story. They can re-apply 30 days later, but they can’t really do anything in the meantime.

Once approved, they’ll get an email confirmation. Once confirmed, they can log in and get their peer list and generate credentials. After that, they can set up a repository endpoint and either sync existing data or start a new backup.

It’s recommended that the new member set up a scheduling and job-management tool. If this actually happens, I’ll write docs about what’s recommended.

Automated integrity testing: stretch goal

A concept that just popped into my head; the coordinator could, on a schedule, say once a week, do a small backup of something like 1GB of data to all endpoints, and then do a deep verification of the repo, testing that everything’s working.

Thoughts?

Zszywany · May 7, 2024, 6:57am

My immediate thought is TIMEZONES.
Ensure that you don’t push data during business hours of the target. It could be relevant for people working from home.
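One way a scheduler could respect this, sketched under assumptions: each peer publishes its time zone, and "business hours" means 09:00 to 17:00 on weekdays in that zone. The function name and the default window are hypothetical.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_business_hours(now_utc: datetime, peer_tz: tzinfo,
                      start: time = time(9, 0), end: time = time(17, 0)) -> bool:
    """True if now_utc falls on a weekday between start and end in the peer's zone."""
    local = now_utc.astimezone(peer_tz)
    return local.weekday() < 5 and start <= local.time() < end


# Fixed offsets keep the example self-contained; in practice a named zone such
# as zoneinfo.ZoneInfo("Europe/Warsaw") would also handle daylight saving time.
now = datetime(2024, 5, 7, 14, 0, tzinfo=timezone.utc)  # a Tuesday
utc_plus_2 = timezone(timedelta(hours=2))
utc_minus_7 = timezone(timedelta(hours=-7))

push_to_plus_2 = not in_business_hours(now, utc_plus_2)    # 16:00 local
push_to_minus_7 = not in_business_hours(now, utc_minus_7)  # 07:00 local
```

A backup job would then skip or delay pushes to any peer for which `in_business_hours` returns true.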

nutral · May 7, 2024, 7:14am

I think this is only really doable for those who have reasonably fast internet. Then it just becomes a matter of throttling that connection so you don’t saturate someone else’s internet, running at 10% of speed for example.

I don’t think I would do this with random people on the internet, but having a software system in place that I can use with my friends (and where I can offer my storage to them) would be nice.

The easiest segmented way to do this is to run a VM/LXC with Docker that hosts an FTP/file server and a WireGuard instance, has access only to a certain folder, and just makes a peer-to-peer connection over WireGuard. You can also add it to an existing computer with a Docker Compose file that puts the WireGuard and FTP servers in one Docker network with no outside access, only a tunnel.

The question is of course who runs the tunnel? That is also the part that is open to the internet and has to forward ports.
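A minimal sketch of that kind of throttling, assuming the sender paces itself chunk by chunk. The `Throttle` class and `throttled_rate` helper are hypothetical; real backup tools usually ship their own bandwidth limits, which would be the simpler route.

```python
import time


def throttled_rate(link_mbps: float, fraction: float = 0.10) -> float:
    """Bytes per second allowed when using `fraction` of a link given in Mbps."""
    return link_mbps * 1_000_000 / 8 * fraction


class Throttle:
    """Paces a sender so its average rate stays at or below bytes_per_second."""

    def __init__(self, bytes_per_second, clock=time.monotonic, sleep=time.sleep):
        self.rate = bytes_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_free = clock()  # earliest time the next chunk may start

    def wait_for(self, nbytes: int) -> float:
        """Sleep until nbytes may be sent; return the delay that was applied."""
        now = self._clock()
        start = max(now, self._next_free)
        self._next_free = start + nbytes / self.rate
        delay = start - now
        if delay > 0:
            self._sleep(delay)
        return delay
```

Calling `wait_for(len(chunk))` before each send keeps the average rate at or below the configured limit. The `clock` and `sleep` parameters exist only so the pacing can be tested without real waiting.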

vivante · May 7, 2024, 7:46am

I think there is (or was) some backup software that used your friend’s computer to back up. IIRC it was something like use their cloud storage for a fee or your friend’s computer for free. Maybe I’ve hallucinated it, I don’t know.

hdd-housing-eu · May 7, 2024, 8:41am

Interesting idea, but why not simply use S3, e.g. MinIO?
Or Syncthing, which has encryption implemented (but beta according to the docs).
That’s what I’m using for my offsite-backup-company, works great.

MetalizeYourBrain · May 7, 2024, 10:56am

I think it’s pretty spot on as an analysis, but I think the issues that might arise need to be considered:

What happens if someone suddenly drops out due to hardware issues, malice or personal issues?
Are these backups allowed to run constantly or only on a timer?
How would “you/we” keep the people involved from, for example, exploiting this project, besides relying on the honour system?
Is there a way to vet the uptime of participants’ systems in advance?
Does participating require redundancy on the host system?
Are stricter minimum requirements needed to improve the network?
Should users from parts of the world where the internet is excessively monitored by governments be allowed?
Should the project keep the number of users as low as possible so that it does not need to comply with privacy rules?

This is all I could think of, and it’s not meant as a Debbie Downer comment. I have my own “answers” (more opinions, there’s no one good answer if people don’t come together to find one) to these questions and I’d like to discuss them with everyone.

Marandil · May 7, 2024, 11:19am

There are some nice cryptographic concepts that could be applied to this topic, like Proof of Possession/Retrievability and secret sharing. You could, for instance, distribute your backup among N hosts s.t. you need at least K to fully retrieve your data (sort of like RAID5/6 or ZFS RAIDZ1-3).
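To illustrate the K-of-N idea in its simplest form, here is a 2-of-3 sketch using XOR parity, the same trick RAID5 uses. This is not secret sharing or Proof of Retrievability: it only adds redundancy, so the data would still need to be encrypted before splitting. The function names are made up for the example.

```python
def _xor(left: bytes, right: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(left, right))


def split_with_parity(data: bytes):
    """Split data into two halves plus an XOR parity block (one per host)."""
    if len(data) % 2:
        data += b"\x00"  # pad to an even length; the caller keeps the real length
    half = len(data) // 2
    first, second = data[:half], data[half:]
    return first, second, _xor(first, second)


def recover(first, second, parity, length: int) -> bytes:
    """Rebuild the original bytes from any two of the three blocks."""
    if [first, second, parity].count(None) > 1:
        raise ValueError("need at least two of the three blocks")
    if first is None:
        first = _xor(second, parity)
    if second is None:
        second = _xor(first, parity)
    return (first + second)[:length]


original = b"backup data!!"
a, b, p = split_with_parity(original)
restored = recover(None, b, p, len(original))  # host holding `a` is offline
```

Each of the three blocks would go to a different peer, and losing any single peer still leaves enough to rebuild the data. Schemes that tolerate more failures (like RAIDZ2/3) need proper erasure codes rather than a single XOR block.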

In general I’d suggest a distributed system with coordinator/proxy s.t. peers don’t have to expose ports and instead just maintain links to the coordinator (+ maybe some NAT traversal methods for direct P2P connections between nodes).

Zszywany · May 7, 2024, 11:21am

Trying to solve a human problem with technology.

If the people that want to participate don’t trust each other then no software solution will be feasible. The architecture should be as simple as possible. Preferably people participating should meet in person, like “I don’t know my backup peers but I know 3 other people who have met them” thing.

Dynamic_Gravity · May 7, 2024, 1:17pm

The reason it’s about trust is that there is no legal entity shielding you from someone else putting illegal content on your server.

If you don’t 100% trust that person there is technically nothing stopping them from putting CP on there and calling the feds. If you don’t have proper logs/auditing then you’re going to prison.

Nice idea, but there needs to be some form of contract that is approved by a lawyer to CYA.

Also, IDK about you, but I require full-disk at-rest encryption wherever I stick my data. So there needs to be some sort of data stewardship. The simplest thing would be to just expose an S3 API interface for people you trust.

H-i-v-e · May 7, 2024, 6:00pm

It is a nice idea, but I am not sure how much trust I’d place in that project and the forum members.

And this is where I am out, not gonna happen in my backwater first world country with shit tier ISPs for me anytime soon.