Rumor has it that iCloud data is split between Google Cloud and AWS.
Facebook, Google, and Microsoft all use homegrown, written-from-scratch, proprietary distributed filesystems.
(Warning: too much detail ahead.)
Facebook used to use HDFS, now mostly HACFS. Both are based on GFS and Colossus from Google in terms of design and architecture.
Google used GFS until around 2010-2012, and has been using Colossus since. These are not file systems in a POSIX "I will mount it on my workstation" kind of way; the best way to think about them is as "object stores", where objects have IDs that look like filenames. They're really good for archival storage - you can't do random writes.
The GFS design is simple: you take 5 of the biggest machines in the cluster, make them hold all the metadata, and keep them in sync using Paxos libraries (think etcd); then you take another 10,000 machines and make them chunkservers. You can:
- “open and create a new file with some attributes”, “write data sequentially”, “close a file”.
- “open an existing file”, “read data at any offset”, “close a file”.
When opening and creating a new file, the attributes include the replication/encoding; engineers who are the end users of GFS/Colossus choose their own encoding. In the GFS master's reply to the file creation, you'll get the IP addresses of some chunkservers you should be contacting.
You'd then contact a chunkserver, give it the data, tell it what file and what offset the data is for, tell it to replicate the data to two of its friends so there are 3 copies total, and give it a crc32 checksum.
r=3 replication was typical for GFS: data is stored 3 times in the same cluster in order to have good availability when reading (no RAID reconstruction is needed, just look at another server). The 3 machines holding the copies would be spread across different racks/failure domains.
When you close a file, you update the metadata on the GFS master with the checksums of all the chunks you've written so far.
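Roughly, the whole write path looks something like this. A sketch in Python, not the real client library - the master/chunkserver objects and their methods are made-up stand-ins:

```python
import zlib

CHUNK_SIZE = 64 * 1024 * 1024  # typical GFS chunk size, 64 MiB

# Hypothetical sketch of a GFS-style sequential write from the client's point
# of view; none of these object names or methods are the real API.
def write_file(master, path, data, replication=3):
    # "open and create a new file with some attributes": the master's reply
    # includes the chunkservers you should be contacting.
    handle, chunkservers = master.create(path, attrs={"replication": replication})

    chunk_checksums = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        crc = zlib.crc32(chunk)
        # Hand the chunk to one chunkserver, tell it which file/offset it
        # belongs to, and ask it to replicate to two of its friends so there
        # are `replication` copies total.
        chunkservers[0].write(handle, offset, chunk, crc,
                              replicate_to=chunkservers[1:replication])
        chunk_checksums.append((offset, crc))

    # "close a file": report the checksums of everything written so the
    # master's metadata matches what landed on the chunkservers.
    master.close(handle, chunk_checksums)
```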
Typical chunk size was 64M.
Chunkservers would store chunks as files on ext4, with each disk having its own filesystem. There'd be no RAID and no read caching; there'd be some write buffering.
Google started experimenting with Reed-Solomon on GFS; a popular encoding was 9.4: each 72M "stripe" of data was split into 9 x 8MiB chunks and 4 recovery chunks were generated. These 13 chunks were spread across 13 racks.
A rack going offline (e.g. the top-of-rack switch going away together with its chunkservers) therefore costs you at most one chunk of any given stripe.
Facebook likes using 10.4.
Google later switched to 6.3 (48M stripes: 6 x 8MiB data + 3 recovery chunks), so BigTable can save RAM and increase concurrency.
Azure used LRC(6,2,2) for a long time.
All of these distributed filesystems in use at these companies have limited semantics and pluggable encoding algorithms that you can change on a case-by-case basis by rewriting individual files with a different encoding.
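For a feel of what the different encodings cost in raw storage, here's some back-of-the-envelope arithmetic using the chunk counts mentioned above. Treat the LRC line as approximate - local-reconstruction codes don't recover from arbitrary failure patterns the way plain Reed-Solomon does:

```python
# (data chunks, redundancy chunks) per stripe for the schemes mentioned above.
schemes = {
    "r=3 replication": (1, 2),    # the same chunk stored three times
    "RS(9,4)":         (9, 4),    # 9 x 8 MiB data + 4 recovery = 72 MiB stripe
    "RS(10,4)":        (10, 4),
    "RS(6,3)":         (6, 3),    # 6 x 8 MiB data + 3 recovery = 48 MiB stripe
    "LRC(6,2,2)":      (6, 4),    # 6 data + 2 local + 2 global parity chunks
}

for name, (d, p) in schemes.items():
    overhead = (d + p) / d        # bytes stored per byte of user data
    print(f"{name:18s} {overhead:.2f}x storage, {p} of {d + p} chunks are redundancy")
```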
Colossus in 2010 was a huge improvement.
Metadata for a cluster was no longer stored on a single machine (… or 5, because of Paxos); it was stored in BigTable. Each row in one of the tables holds one file's metadata (name, attributes including encoding, and the list of chunks with checksums and hostnames for each chunk). BigTable itself is just a Log-Structured Merge (LSM) system for storing large sets of key-values. Data in BigTable (here, Colossus metadata) is sorted by key and split into 64M tablets, such that each tablet is responsible for a key range. So… each tablet is responsible for a small part of the overall filesystem metadata and doesn't have to live on the biggest machine in the cluster; it's also usefully persisted outside of RAM, with tablet data stored in a GFS cluster, disaggregated from the compute serving it. So you have two layers: a smaller GFS cell that houses metadata for a larger Colossus cell.
When writing or overwriting a row of data (Colossus metadata), you send the row to the BigTable tablet server responsible for that key range (filename) and it goes into two places: a hashmap in the tablet server's memory, and an append to an already-open GFS log file (crash recovery/persistence). Eventually, the in-memory hashmap for a tablet gets sorted into an overlay tablet. When reading a row, you contact the tablet server responsible for the row and it looks at the hashmap, then the overlay tablet, then the base tablet. Sometimes you end up with more than one overlay tablet; the tablet server merges them into the base in the background and garbage collects the old ones - they're sorted, so merging is trivial.
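A toy version of that LSM read/write path, just to make the layering concrete - plain Python dicts stand in for the sorted tablet files and the GFS log, so this is obviously not how a real tablet server is written:

```python
# Toy LSM tablet: a mutable in-memory hashmap, immutable overlay tablets, and
# a base tablet. Reads check newest to oldest; compaction merges sorted data.
class Tablet:
    def __init__(self):
        self.memtable = {}    # recent writes; in reality also appended to a GFS log first
        self.overlays = []    # immutable sorted overlay tablets, newest first
        self.base = {}        # the big sorted base tablet

    def write(self, row_key, row):
        self.memtable[row_key] = row

    def read(self, row_key):
        # hashmap first, then overlay tablets (newest first), then the base tablet
        if row_key in self.memtable:
            return self.memtable[row_key]
        for overlay in self.overlays:
            if row_key in overlay:
                return overlay[row_key]
        return self.base.get(row_key)

    def flush(self):
        # the in-memory hashmap gets sorted into a new overlay tablet
        self.overlays.insert(0, dict(sorted(self.memtable.items())))
        self.memtable = {}

    def compact(self):
        # background merge: everything is sorted, so merging overlays into the
        # base is trivial, and the old overlays get garbage collected
        for overlay in reversed(self.overlays):   # oldest first, so newer rows win
            self.base.update(overlay)
        self.overlays.clear()
```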
The first Colossus ran on chunkservers, but they were deemed to be overly smart and complicated and were replaced with D servers (think FTP servers). Chunkservers used to know how to replicate the data in GFS; in Colossus, data replication and distribution is something that the file library does, which allows for higher availability since clients can read/write data independently of any single D server.
What used to be r=3 on GFS is replaced with r=3.2 with 8MiB chunks in Colossus. You send 3 copies (3 x 8MiB) of the data down the network; the Write call in the API returns when any 2 of them work out, and the 3rd write can linger a bit and get there eventually… or not. Some kind of background curator will do a scrub and figure out if it needs to restore the chunk. Those two copies aren't even on disk yet at that point - they just made it to RAM and will make it to disk, probably, eventually.
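The "return after 2 of 3" part might look something like this - a hedged sketch, where the D servers and their put() method are hypothetical stand-ins:

```python
import concurrent.futures

# Push the chunk to three D servers, acknowledge the Write as soon as any two
# succeed, and leave the straggler to finish (or be repaired by a background
# curator) later. Not real Colossus client code.
def write_chunk(d_servers, chunk, want_acks=2):
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(d_servers))
    futures = [pool.submit(s.put, chunk) for s in d_servers]
    acked = 0
    for fut in concurrent.futures.as_completed(futures):
        if fut.exception() is None:
            acked += 1
        if acked >= want_acks:
            break
    # Don't wait for the third copy; if it never lands, the scrubber will
    # notice and re-replicate the chunk in the background.
    pool.shutdown(wait=False)
    return acked >= want_acks
```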
With Reed-Solomon the CPU cost for writing and read-recovery is on the clients. Some reads where data chunks are damaged can be slow.
The key lesson from Colossus is that metadata is data, and you can layer systems to build bigger systems and de-amplify write rates and write volumes. For example, for every 8M of data written across 3 machines, you only store maybe 100 bytes of metadata, so to cause 8M of metadata to be written you'd need roughly 80,000 of those 8M data writes. Clusters are big. If your metadata is larger than, say, what fits on one machine, you can still store it on multiple machines… and if you then need to keep track of multiple machines' worth of metadata - which is itself just data - one machine is probably enough, and that's probably ok. One machine can do a lot these days, and that one machine is not in the critical path of some write happening 2 layers up.
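The arithmetic behind that, roughly:

```python
data_per_write = 8 * 2**20     # 8 MiB of user data per write
metadata_per_write = 100       # ~100 bytes of metadata per write

writes_needed = data_per_write // metadata_per_write  # writes to produce 8 MiB of metadata
print(writes_needed)                                  # ~84,000 (the "80,000" above)
print(writes_needed * data_per_write / 2**40)         # ~0.64 TiB of user data behind it
```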
With Google Cloud VM disks (aka Persistent Disk), the actual data is stored in Colossus files over the network. When you write a block to the device, a log record is appended to a file, and a cached in-memory map entry is modified to say which block is in which log file at which offset. Over time, these log files are compacted so that you end up with contiguous ranges of blocks, which reduces the size of the map and allows old data to be garbage collected, freeing up space. Not sure about these days, but writes were not guaranteed to be on stable storage when the VM kernel issued a sync - they were only guaranteed to be in multiple machines' RAM… but they'd get onto stable storage soon enough. Stable storage is Colossus on SSD, regardless of what you choose in the UI (there's just artificial IOPS rate limiting happening for fairness between different users of the storage).
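In spirit it's the same log-structured trick again, one layer up. A toy sketch of the block map - not Google's code, just the idea:

```python
# Log-structured block device: every block write is appended to the current
# "log file" and an in-memory map remembers where the latest version of each
# block lives. Lists stand in for log files here.
class LogStructuredDisk:
    def __init__(self):
        self.logs = [[]]       # each log file is a list of block payloads
        self.block_map = {}    # block number -> (log index, offset within that log)

    def write_block(self, block_num, data):
        log_idx = len(self.logs) - 1
        self.logs[log_idx].append(data)
        self.block_map[block_num] = (log_idx, len(self.logs[log_idx]) - 1)

    def read_block(self, block_num):
        log_idx, offset = self.block_map[block_num]
        return self.logs[log_idx][offset]

    def compact(self):
        # Rewrite live blocks in block order into a fresh log so ranges become
        # contiguous, the map shrinks, and old logs (stale block versions) can
        # be garbage collected to free up space.
        new_log, new_map = [], {}
        for block_num in sorted(self.block_map):
            new_log.append(self.read_block(block_num))
            new_map[block_num] = (0, len(new_log) - 1)
        self.logs = [new_log]
        self.block_map = new_map
```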
The machine running your VM has no local storage (except a small crappy 64G boot SSD; I'm not sure if it's USB 2.0 or some other interface).
Also, @alpha754293, since you mentioned par2 specifically, keep in mind that it's very slow, and you're probably not going to be uploading each one of your datasets across 100+ tapes anyway.
You may want to look at snapraid instead – it uses virtually no CPU (gigabytes per second per core) while doing 10d+4p-style RAID computation (any 10 of the 14 are enough to recover).
They have this table:
Do I need one or more parity disks?
As a rule of thumb you can stay with one parity disk (RAID5) with up to four data disks, and then using one parity disk for each group of seven data disks, like in the table:
| Parities | Data disks |
| --- | --- |
| 1 / Single Parity / RAID5 | 2 - 4 |
| 2 / Double Parity / RAID6 | 5 - 14 |
| 3 / Triple Parity | 15 - 21 |
| 4 / Quad Parity | 22 - 28 |
| 5 / Penta Parity | 29 - 35 |
| 6 / Hexa Parity | 36 - 42 |