Which Filesystem

Hi All,

I am looking for a filesystem that suits my needs. I have a project that has to analyze a couple dozen terabytes of data. For this I need to set up a filesystem that lets me treat the system as a computing cluster, adding more nodes as needed for computing power. Each node will have to access the same files, so I considered NFS. Since I need to keep files under the same path, I need functionality like that of an LFS, which lets me add physical drives and treat the path as having as much storage as required. But since I will be reading the data heavily, I will also need bandwidth that scales, like that of HDFS.

Is there a solution that lets me use the traditional NFS setup of computing clusters, with the flexibility of adding physical drives to the logical volumes of an LFS and the bandwidth scaling of HDFS (Hadoop Distributed File System)?

Or am I missing the fact that one of them already offers all of that functionality?

Thank you in advance for any help on the matter!

Would something like Gluster fit your needs? I'm still trying to wrap my head around exactly what you're looking for. I'm not quite clear on whether you want a distributed filesystem where the disks are spread across several nodes, or just something like ZFS, where there is a single storage server with the ability to grow the storage pool by simply adding disks. On second glance, it looks like you are after the latter.

ZFS on something like OmniOS or FreeBSD is pretty great. The Linux port is not quite up to par. ZFS handles disks directly and provides a single "pool" from all the available space, striping data across all the disks by default. Alternatively, you can add groups of disks to the pool. For example, if you want improved read performance and drive redundancy, you can add two or more disks at a time as a mirror. Data gets striped across all the mirrors in the pool, so you can have, say, 12 mirrors in the pool. Instead of mirrors you can do parity configurations as well.

Then, on top of the pool, if all you want is one shared dataset, it's a piece of cake to export it via NFS or whatever network filesystem best fits your use case.
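To put rough numbers on the pool-of-mirrors layout, here is a small Python sketch of the capacity arithmetic. It is only a back-of-the-envelope model, not anything ZFS itself runs: the `Mirror` and `Pool` classes are made up for illustration, the 4 TB disk size is an assumption, and metadata and reserved-space overhead are ignored. The one ZFS rule it relies on is that a mirror holds about as much as its smallest member disk, and the pool's space is the sum of its mirrors.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Mirror:
    """Hypothetical model of one mirror group: every disk holds the same data."""
    disk_sizes_tb: List[float]

    @property
    def usable_tb(self) -> float:
        # A mirror can only store as much as its smallest member disk.
        return min(self.disk_sizes_tb)


@dataclass
class Pool:
    """Hypothetical model of a pool that stripes data across its mirrors."""
    mirrors: List[Mirror] = field(default_factory=list)

    def add(self, mirror: Mirror) -> None:
        # Growing the pool means appending another mirror group.
        self.mirrors.append(mirror)

    @property
    def raw_tb(self) -> float:
        return sum(sum(m.disk_sizes_tb) for m in self.mirrors)

    @property
    def usable_tb(self) -> float:
        # Ignores filesystem overhead; real usable space will be somewhat lower.
        return sum(m.usable_tb for m in self.mirrors)


# The "12 mirrors" example, assuming two 4 TB disks per mirror.
pool = Pool()
for _ in range(12):
    pool.add(Mirror([4.0, 4.0]))

print(f"raw: {pool.raw_tb} TB, usable: {pool.usable_tb} TB")
```

Under those assumptions, 24 disks give roughly half their raw space as usable storage, and each extra mirror you add later grows the pool by one disk's worth without changing the path the dataset is shared under.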

https://en.wikipedia.org/wiki/ZFS#Storage_pools