Software Defined Server

Hello all!

Hearing Wendell use the term “Software Defined Storage” in the latest L1 YT video made me think of the related but much less established term - “Software-Defined Server”. Let me explain.

I’m not entirely sure this is the most appropriate part of the forums to host this thread, but Linux offers what is generally the most stable and open virtualization environment. I’m talking about KVM and everything that comes with it.

“Software-Defined Server” is a term I first came across while researching the idea of a distributed hypervisor - a hypervisor that could reside atop a number of physical machines instead of just one (even if that machine does have several physical processors). Knowing how difficult it would most likely be to orchestrate a number of servers to act as a single logical unit, I did not expect to find much. What I did find was a company called TidalScale, which claims to “Deliver a server of any size, completely on demand.”

https://www.tidalscale.com/what-is-a-software-defined-server/

Combining the computing power of several separate machines has long been a kind of holy grail of computing. High-performance compute clusters, the fastest machines on the planet, employ thousands of CPUs and/or GPUs or completely custom processors to perform all kinds of scientific calculations. The Bitcoin network has nodes that confirm transactions. Folding@home lets you take part in scientific research by sending you a small piece of a bigger problem to solve. TidalScale promises to use similar methods, combining several servers instead of needing one very powerful and (possibly) prohibitively expensive one.
The virtualization technology embedded in the Linux kernel allows one to spin up many separate virtual machines by - so far as I understand - treating those VMs as processes on the host machine.
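
To make the “VMs are just processes” point concrete, here is a minimal sketch of the raw KVM API that QEMU and friends build on, modeled on the well-known LWN “Using the KVM API” walkthrough (x86-specific, all error handling omitted). The whole “virtual machine” is a few file descriptors plus some mmap’d memory owned by an ordinary process, and the one-byte guest program is just a `hlt`:

```c
/* kvm_hello.c - gcc kvm_hello.c -o kvm_hello (Linux, needs /dev/kvm access).
 * Error handling omitted for brevity. */
#include <fcntl.h>
#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
    int vm  = ioctl(kvm, KVM_CREATE_VM, 0);

    /* Guest "RAM" is plain anonymous memory in this process's address
     * space, mapped into the guest at physical address 0x1000. */
    uint8_t *mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    mem[0] = 0xf4;                        /* hlt - the entire guest program */

    struct kvm_userspace_memory_region region = {
        .slot            = 0,
        .guest_phys_addr = 0x1000,
        .memory_size     = 0x1000,
        .userspace_addr  = (uint64_t)mem,
    };
    ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
    struct kvm_run *run = mmap(NULL, ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, NULL),
                               PROT_READ | PROT_WRITE, MAP_SHARED, vcpu, 0);

    /* Start in real mode with the instruction pointer at our hlt. */
    struct kvm_sregs sregs;
    ioctl(vcpu, KVM_GET_SREGS, &sregs);
    sregs.cs.base = 0;
    sregs.cs.selector = 0;
    ioctl(vcpu, KVM_SET_SREGS, &sregs);

    struct kvm_regs regs = { .rip = 0x1000, .rflags = 0x2 };
    ioctl(vcpu, KVM_SET_REGS, &regs);

    ioctl(vcpu, KVM_RUN, NULL);           /* blocks until the guest exits */
    printf("exit reason %d (KVM_EXIT_HLT is %d)\n",
           run->exit_reason, KVM_EXIT_HLT);
    return 0;
}
```

If /dev/kvm is accessible, this prints that the guest exited with KVM_EXIT_HLT - and to the host scheduler, the vCPU was just another thread in this process the whole time.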

I’m fascinated by the idea of combining a number of relatively weak and affordable computers (nodes) to build a single virtualization host. Despite researching this topic quite intensely, I’ve not managed to find software capable of convincingly achieving this goal. I fully understand that connecting a couple of desktop computers to a gigabit switch would be nowhere near enough to provide reasonable throughput and latency for node communication, but using high-speed (40-100 Gbit/s), low-latency InfiniBand with RDMA might allow monolithic operation. Pure PCI Express switches also exist. Clever scheduling of such a hypervisor’s tasks might allow it to run a single OS instance over two or more physical machines without the guest OS being any the wiser. Again, I understand that the speeds at which an OS fetches something from CPU cache, main memory, or a neighbor machine via IB must be wildly different**, but there must be a way to take that into account? (The now infamous) speculative execution employed by modern CPUs is a workaround for a similar problem; could something similar be used among the nodes?
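
Incidentally, the core mechanism a system like this needs - trap the first access to a page this node doesn’t hold, fetch it from whichever node owns it, then resume the faulting thread - already exists in the Linux kernel as userfaultfd. Here is a minimal sketch of that trap-and-fill cycle, with error handling omitted and the “fetch over InfiniBand” faked by a local memset; a real distributed hypervisor would presumably resolve the fault with an RDMA read instead:

```c
/* dsm_sketch.c - gcc dsm_sketch.c -o dsm_sketch -lpthread (Linux >= 4.3).
 * Error handling omitted; the "remote fetch" is faked with a memset. */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static int  uffd;
static long page_size;

/* A real DSM node would RDMA-read the missing page from whichever node
 * owns it; here we just fill it locally to show the trap/fill cycle. */
static void *fault_handler(void *arg)
{
    (void)arg;
    for (;;) {
        struct uffd_msg msg;
        if (read(uffd, &msg, sizeof msg) != sizeof msg ||
            msg.event != UFFD_EVENT_PAGEFAULT)
            continue;

        static char page[4096] __attribute__((aligned(4096)));
        memset(page, 'R', sizeof page);   /* pretend this came over IB */

        struct uffdio_copy copy = {
            .dst = msg.arg.pagefault.address & ~(unsigned long)(page_size - 1),
            .src = (unsigned long)page,
            .len = page_size,
        };
        ioctl(uffd, UFFDIO_COPY, &copy);  /* installs page, wakes faulter */
    }
    return NULL;
}

int main(void)
{
    page_size = sysconf(_SC_PAGESIZE);    /* sketch assumes 4 KiB pages */

    uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
    struct uffdio_api api = { .api = UFFD_API };
    ioctl(uffd, UFFDIO_API, &api);

    /* "Remote" memory: address space reserved locally, no pages behind it. */
    char *region = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)region, .len = page_size },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    pthread_t t;
    pthread_create(&t, NULL, fault_handler, NULL);

    /* First touch faults; the handler fills the page; then we carry on. */
    printf("byte 0 = '%c'\n", region[0]);
    return 0;
}
```

The interesting policy questions - prefetch aggressively, migrate pages toward the CPU, or migrate the vCPU toward the data - would all live in that fault handler.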

I’m obviously out of my depth when it comes to the inner workings of schedulers and hypervisors. I’d love to hear someone more competent explain whether it would be possible to connect, say, 5 Xeon workstations with 100G InfiniBand in a mesh topology, equip them with a stripped-down version of the FreeBSD or Linux kernel, and somehow build a working hypervisor atop that?

Thank you!

**Latency estimates [usec = microseconds]:
1 Gb Ethernet (copper) = 29 - 100 usec
10 Gb Ethernet (Mellanox ConnectX, fibre) = 12.51 usec
DDR InfiniBand = 1.72 usec
QDR InfiniBand = 1.67 usec
EDR InfiniBand = 0.5 usec
Main memory reference = 0.1 usec
Mutex lock/unlock = 0.025 usec
Branch mispredict = 0.005 usec
L1 cache reference = 0.0005 usec

It seems that IB has roughly 5 - 17x the latency of main memory (0.5 usec for EDR and 1.72 usec for DDR, against a 0.1 usec memory reference). Still, I’m guessing it might be possible to predictively feed the executing CPU’s local memory with most of what that CPU is about to request, thereby limiting the number of worst-case waits to a minimum. Or not, I don’t know.
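
A quick back-of-the-envelope supports that hunch. Plugging the figures above into the standard effective-access-time formula, eff = hit_rate × local + (1 − hit_rate) × remote, with local DRAM at 0.1 usec and one EDR IB hop at 0.5 usec (the hit rates are made up for illustration):

```c
/* Effective access time vs. prefetch hit rate, using the figures above:
 * local DRAM ~0.1 usec, one EDR InfiniBand hop ~0.5 usec. */
#include <stdio.h>

int main(void)
{
    const double local_us = 0.1, remote_us = 0.5;

    for (int hit_pct = 50; hit_pct <= 100; hit_pct += 10) {
        double hit = hit_pct / 100.0;
        /* eff = hit_rate * local + (1 - hit_rate) * remote */
        double eff = hit * local_us + (1.0 - hit) * remote_us;
        printf("hit rate %3d%% -> %.2f usec effective (%.1fx local DRAM)\n",
               hit_pct, eff, eff / local_us);
    }
    return 0;
}
```

At a 90% local hit rate the blended access time is 0.14 usec - only 1.4x local DRAM - while at 50% it balloons to 3x. So the whole scheme would live or die on how well the prefetch/page-migration policy keeps the working set local.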


This scale-out VM type of thing has been done more than a decade ago, but it worked like crap in practice; the name of the project escapes me … (also, I’m willing to bet there was more than one such project, for the same reason you got here).

Basically you’d log in with any IP, run `cat /proc/cpuinfo` … and you’d see thousands of cores.

I’ve thought about it every once in a while, but the more experience I have with large-scale services, the more I find this impractical and unlikely to ever become practical, for a number of reasons:

a) The $$$ per unit of hardware volume is going up over time - that means paying humans to ensure stuff works well on that hardware is the easy part; issues like lightspeed physics tend to be overcome more easily with sweat equity if you have money to pay for it. It also means that funding the project to make such a “software defined server” setup easier is unlikely to happen.

b) Computing power per machine in a data center is going up and up and up. Per-core/thread hardware is not really improving that much, but compute densities are going up hugely. And with more and more practical use cases for computing being offloaded to machine learning - which is ridiculously parallelizable - it’s easy to justify building custom silicon. There’s just less and less need for monoliths that can’t be split. Up until a decade ago, you’d want to split an app in order to be able to scale each part on its own machines. These days, you’d want to split an app because you don’t want to risk taking down a giant part of your production environment every time you release a new version, or because you want to test things separately.

The above two things are true even in “fancy networking environments” (don’t ask me for details of how I know, but there’s this company that starts with G that’s been building its own network stuff and experimenting with various apps built on top of a custom “remote RAM” type of thing, itself built on custom NICs with a custom low-latency userland/kernel/NIC/switch stack … it might pan out for some use cases).

I think things like Kubernetes, perhaps with something like CRIU alongside, will continue to dominate the market. Once you get something like Istio into the mix, you don’t really need to change the programming style much compared to building an app that runs well on a GPU, or one that runs well across all cores on a socket. (It’s all full of callbacks and buffers and queues either way.)
