Hello all!
Hearing Wendell use the term “Software Defined Storage” in the latest L1 YT video made me think of the related but much less established term - “Software-Defined Server”. Let me explain.
I’m not entirely sure this is the most appropriate part of the forum to host this thread, but Linux is generally the most stable and open virtualization environment - I’m talking about KVM and everything that comes with it.
“Software-Defined Server” is a term I first came across while researching the idea of a distributed hypervisor - a hypervisor that could reside atop a number of physical machines instead of just one (even if that machine has several physical processors). Knowing how difficult it would most likely be to orchestrate a number of servers to act as a single logical unit, I did not expect to find much. What I did find was a company called Tidal Scale which claims to “Deliver a server of any size, completely on demand.”
https://www.tidalscale.com/what-is-a-software-defined-server/
Combining the computing power of several separate machines has long been a kind of holy grail of computing. High-performance computer clusters - the fastest machines on the planet - employ thousands of CPUs and/or GPUs or completely custom processors to perform all kinds of scientific calculations. The Bitcoin network has nodes that confirm transactions. Folding@home lets you partake in scientific research by sending you a small part of a bigger problem to solve. Tidal Scale promises to use similar methods, combining several servers instead of needing one very powerful and (possibly) prohibitively expensive one.
The virtualization technology embedded inside the Linux kernel allows one to spin up many separate virtual machines by - so far as I understand - treating those VMs as processes on the host machine.
I’m fascinated by the idea of combining a number of relatively weak and affordable computers (nodes) to build a single virtualization host. Despite researching this topic quite intensely, I’ve not managed to find software capable of convincingly achieving this goal. I fully understand that connecting a couple of desktop computers to a gigabit switch would be nowhere near enough to foster reasonable throughput and latency for node communication, but using high speed (40-100Gbit/s), low latency InfiniBand with RDMA might allow monolithic operation. Pure PCI Express switches also exist. Clever scheduling of such a hypervisor’s tasks might allow it to run a single OS instance over two or more physical machines without the guest OS being any the wiser. Again, I understand that the speed with which an OS fetches something from CPU cache, main memory or from a neighbor machine via IB must be wildly different**, but there must be a way to take that into account? (The now infamous) speculative execution employed by modern CPUs is a workaround for a similar problem - could something similar be used among the nodes?
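To make the "take the latency difference into account" idea concrete, here is a tiny back-of-the-envelope model I put together (the hit rates and the idea of averaging are my own illustrative assumptions, not how any real distributed hypervisor works): if most guest memory accesses can be served from the local node and only a small fraction have to cross InfiniBand, the *average* access cost stays close to local DRAM.

```python
# Rough, illustrative model: expected latency per memory access when a
# guest's page may be local DRAM or one hop away over EDR InfiniBand.
# Numbers are the estimates from my footnote below; real systems differ.
LOCAL_DRAM_US = 0.1   # main memory reference, ~100 ns
REMOTE_IB_US = 0.5    # one EDR InfiniBand transfer, ~0.5 usec

def avg_access_us(local_hit_rate: float) -> float:
    """Expected latency (usec) for a given fraction of local hits."""
    return local_hit_rate * LOCAL_DRAM_US + (1 - local_hit_rate) * REMOTE_IB_US

for rate in (0.90, 0.99, 0.999):
    print(f"{rate:.1%} local hits -> {avg_access_us(rate):.4f} usec average")
```

The takeaway (under these made-up numbers) is that at 99.9% locality the average access is almost indistinguishable from plain local DRAM, which is presumably why page placement and migration would matter so much for such a hypervisor.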
I’m obviously out of my depth when it comes to the inner workings of schedulers and hypervisors. I’d love to hear someone more competent explain whether it would be possible to connect, say, 5 Xeon workstations with 100G InfiniBand in a mesh topology, equip them with a stripped-down version of the FreeBSD or Linux kernel, and somehow build a working hypervisor on top of that.
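As a quick sanity check on the 5-workstation full-mesh idea, the port and link counts are just combinatorics (nothing vendor-specific here, I’m only counting point-to-point links):

```python
# A full mesh of n nodes needs n*(n-1)/2 point-to-point links,
# and each node needs n-1 ports (one per peer).
def mesh_links(n: int) -> int:
    return n * (n - 1) // 2

n = 5
print(f"{n} nodes: {mesh_links(n)} links, {n - 1} IB ports per node")
```

So 5 nodes would need 10 cables and a 4-port (or multi-adapter) InfiniBand setup per machine - not cheap, but no switch required.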
Thank you!
**Latency estimates (usec = microseconds):
1 Gb Ethernet (copper) = 29 - 100 usec
10 Gb Ethernet (Mellanox ConnectX, fibre) = 12.51 usec
DDR InfiniBand = 1.72 usec
QDR InfiniBand = 1.67 usec
EDR InfiniBand = 0.5 usec
Main memory reference = 0.1 usec
Mutex lock/unlock = 0.025 usec
Branch mispredict = 0.005 usec
L1 cache reference = 0.0005 usec
It seems that EDR IB has roughly 5x the latency of main memory, and QDR/DDR closer to 17x. Still, I’m guessing it might be possible to predictively feed the local memory of the executing CPU with most of what said CPU might request, thereby keeping the number of longest waits to a minimum. Or not, I don’t know.
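For anyone who wants to check my arithmetic, the interconnect-to-DRAM ratios fall straight out of the estimates above (these are just the quoted figures divided through; sources for such numbers vary a lot):

```python
# Ratio of interconnect latency to one main-memory reference,
# using the rough estimates quoted in the footnote above.
MAIN_MEMORY_US = 0.1

interconnects = {
    "EDR InfiniBand": 0.5,
    "QDR InfiniBand": 1.67,
    "DDR InfiniBand": 1.72,
    "10 GbE (fibre)": 12.51,
}

for name, usec in interconnects.items():
    print(f"{name}: ~{usec / MAIN_MEMORY_US:.0f}x main memory")
```

That puts EDR at ~5x DRAM and even 10 GbE at ~125x - which is why I suspect only RDMA-class fabrics are even worth discussing for this.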