Seeking advice on data storage for a website

I’m mainly an HPC/scientific computing/applied math guy, and while I do okay as a sysadmin if I’m all you’ve got, web infrastructure is not my primary area of competence. I’m pondering server infrastructure for a website project that will involve storing and sharing a lot of images. We have an Epyc 7302 system with 128GB of RAM and a few large NVMe drives available; since we’re storing a relatively large amount of data, the cost math favors a colocated server over cloud storage.

What I’m struggling with is how to handle the data storage. Obviously the simplest thing would be to install a bare-metal OS on the server and host the data, the web server, and the database software on the same box. But I have an inkling that I want to separate the web server from the data storage, both for security reasons and to minimize headaches if we need to expand to multiple servers in the future. Right now, the host is a bare-metal Debian install that stores the data on ZFS, and the web server & associated processes run in a QEMU VM which accesses the data volumes over NFS from the host. It was set up this way because that’s the kind of setup I’m most familiar with. We have a separate system to store backups as well.

I’m thinking that it might be preferable (and more future-proof) to install a hypervisor, probably XCP-ng, and run two VMs for now: web server and data storage. The web server VM probably won’t change, but I’d like some advice on how to handle the storage side.

  • A linux VM with ZFS, exporting the data over NFS (like now)?
  • TrueNAS or some other NAS OS?
  • MinIO or some sort of object storage?
  • Something else entirely?

There are two concerns, I guess: what to host the data on (Debian? TrueNAS? Other?), and how to expose it to the web VM (NFS? iSCSI? Object storage/HTTP?). My preexisting knowledge/comfort covers manually administering Linux servers, ZFS, NFS, iSCSI, etc., but I don’t want to make a suboptimal choice just because “it’s what I know”. Advice and input are very welcome.

Edit: I might also host a pfSense VM as a firewall, forbidden router style…

I’m a bit lost… So you have an Epyc 7302 server that’s on-prem and you’re considering moving it to a colocation facility? Or is it already co-located somewhere?

When you say “a lot of images” what does that mean? 100? 100,000? 100,000,000,000?

The server is currently at someone’s house and we’re soon going to colocate it; the site is pre-release/under development, so that’s fine for now. The location isn’t really the relevant bit (maybe I got rambly there); the uncertainty is around whether and how to separate the data storage from the server daemons.

“A lot” = around a million, at least in the short term.

You could consider offloading your image storage into cloud object storage. This goes by a few different names depending on which cloud you’re using: Azure calls it Blob Storage, GCP calls it Cloud Storage, and AWS calls it S3 (Simple Storage Service).

Quick blurb from AWS on this matter:

What is Object Storage? - Cloud Object Storage Explained - AWS.

Storing and serving pictures like yours could be a good fit for this.

The advantage is that you store your files/images in storage buckets and then access or interact with them by URL; the URL structure is similar to a file path. In addition, it can be regionally distributed, and the storage is elastic in the sense that you’ll “never run out of space” (though you can rack up the credit card bill if you don’t keep an eye on it). One way to say it is that object storage is massively scalable. You also don’t have to spend time and resources managing and “sysadmining” servers or physical infrastructure to do this.
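To make the “access by URL” part concrete, here’s a rough sketch of what that looks like with boto3 against S3. The bucket name, object key, and filename are placeholders rather than anything from your project, and it assumes AWS credentials are already configured:

```python
# Minimal sketch: upload one image to a bucket and hand out a URL for it.
# "example-images" and the key below are made-up placeholder names.
import boto3

s3 = boto3.client("s3")

# Upload a local image; the object key becomes its path within the bucket.
s3.upload_file("photo_0001.jpg", "example-images", "gallery/photo_0001.jpg")

# Generate a time-limited URL that a browser can fetch directly,
# so the image bytes never have to pass through your web server.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "example-images", "Key": "gallery/photo_0001.jpg"},
    ExpiresIn=3600,  # valid for one hour
)
print(url)
```

The presigned-URL bit is the part that tends to matter for image-heavy sites: the web tier only generates links, and the bucket serves the actual bytes.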

I can tell you that for certain data profiles this is a common strategy application developers use, specifically for image storage (see the trade-offs section below, though). You’ll still need to invest time in learning how to set up and secure cloud storage, but it’s an even trade-off because you won’t be researching how to set up and secure a Linux server to do the same job. I also think you may save time in day-2 operations, because you won’t have to worry about system maintenance, patching, disk issues, memory problems, etc.; the cloud provider’s infrastructure handles that.

As I mentioned, it can be multi-regional, which gives strong resiliency, and object versioning can be set up to recover files which have been modified or deleted (at least in Google Cloud Platform). You can also remove objects after a certain period of time in an automated way, to help “clean up” older files (again, at least in Google Cloud Platform).
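For example, with the google-cloud-storage Python client both of those can be switched on in a few lines; the bucket name here is just a placeholder:

```python
# Minimal sketch: enable versioning and an age-based delete rule on a bucket.
# Assumes application-default credentials; "example-images" is a placeholder.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("example-images")

# Keep old generations of objects so overwritten/deleted images can be recovered.
bucket.versioning_enabled = True

# Automatically delete objects older than 365 days to "clean up" old files.
bucket.add_lifecycle_delete_rule(age=365)

bucket.patch()  # push both settings to the bucket
```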

If you do consider object storage you’d probably want to be aware of the following two things:

  1. Object storage isn’t a silver bullet. It isn’t the right fit for every use case.
  2. Trade-offs:
  • One trade-off to consider is network latency. If you compare against an NFS filesystem over Fibre Channel or fast Ethernet, calls to object storage will be slower because you have to communicate over the internet. This likely won’t matter if the images are small; however, if they are large image files, you might see performance degradation because the files aren’t stored “as close”. Like I was saying, it’s not a silver bullet, but depending on your requirements and data profile it might be a good fit.
  • Ingress/egress cost. If this service is highly transactional and you do a lot of reads and writes to the storage buckets, you’d want to consider not only the storage cost but also the network cost (ingress/egress charges); see the rough sketch after this list.
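To put a rough shape on the cost question, here’s a back-of-the-envelope sketch. Every number in it (average image size, monthly traffic, and especially the per-GB prices) is a placeholder; swap in the figures from the provider’s pricing page before drawing any conclusions:

```python
# Back-of-the-envelope monthly cost estimate. All numbers are PLACEHOLDERS,
# not current rates from any provider.
STORAGE_PRICE_PER_GB_MONTH = 0.023   # placeholder $/GB-month
EGRESS_PRICE_PER_GB = 0.09           # placeholder $/GB served out to the internet

image_count = 1_000_000              # "around a million", per the thread
avg_image_mb = 2.0                   # placeholder average image size
stored_gb = image_count * avg_image_mb / 1024

served_gb_per_month = 500            # placeholder guess at monthly download traffic

monthly_cost = (stored_gb * STORAGE_PRICE_PER_GB_MONTH
                + served_gb_per_month * EGRESS_PRICE_PER_GB)
print(f"stored ≈ {stored_gb:.0f} GB, estimated bill ≈ ${monthly_cost:.0f}/month")
```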

You may want to take some time to look into the “object storage” offerings the different cloud providers have.


My vote goes to Proxmox and LXC containers: one for the web tier, one for the db tier, with the storage hosted directly on a Proxmox ZFS dataset passed through to the containers as plain filesystems. You don’t want the complexity and dependency of an NFS VM, at least not until you have more than one host…
If your backup solution turns out to be iffy, deploy a second server with Proxmox Backup Server and do proper backups…

You design any infrastructure with a load scenario in mind, yet you have not talked about that at all.

How many visits/hits do you expect your setup to receive over a given time period (e.g. daily)? Secondly, what happens (what’s the impact on the project/company) if/when the load turns out to be 10x or 100x your estimate?
The answer to the first question allows you to design a setup that’s appropriate for your estimated workload.
The answer to the second question tells you if you need your design to be able to scale dynamically.

The Linux VM with ZFS exporting over NFS seems to have the smallest attack surface. You could even restrict the storage VM to accept connections from the local subnet only (since it sounds like it will only be accessed by your web server VM). I think TrueNAS is quite overkill for your use case.

This would allow you to dedicate most of your time to managing the web server component, which is probably going to require more attention anyway.

Why not just bare metal FreeBSD and possibly jail(s) for the httpd? It’s been known to serve a lot of sites with similar setups.

Thanks for the replies so far. I’d love to use bare-metal FreeBSD, but the webdev guys are leaning pretty heavily on Docker, and I have no interest in untangling that mess myself :laughing:. Proxmox is an option I hadn’t considered, but I will now. As for object storage versus an NFS VM, I’m encouraged to see a vote for each one… it sounds like neither choice is obviously garbage.

Cloud storage is the best option, as far as I know!