I recently threw together a budget server with some parts from my HEDT that weren’t really getting much use and some old HDDs.
It’s a Ryzen 5600X from AliExpress with 64GB of RAM, and since there’s not a ton of lanes on it anyway, I went with the cheapest business-class motherboard that had two M.2 slots and an x4 slot for some of Wendell’s lane-maximization trickery.
The lower M.2 is broken out into a 10Gb SFP+ NIC.
The top M.2 is broken out to U.2 for the 1.5TB Optane 905P.
The x16 slot has a 16GB RTX A4000 (~3070-ish performance).
The x4 slot has an LSI 8i HBA, for a total of 12 pokey SATA3 rust buckets (I’ve only got 4 ATM).
~$500 with a Define R5 all in. GPU, Optane, and rust excluded.
Anyway, I should get to the point huh?
Well, as I’ve fumbled around with it all and loaded it up to test certain things (some of which required a full reinstall), the fact that the boot drive is Optane has been nagging at me: I want to figure out how to use it effectively to really stretch what I can do with my very first little home lab. I do plan on testing it out, but I’m throwing it out there to see if anyone more experienced can poke holes in my theory, has some way to improve it, or can help me figure out a good way to test it.
I’d like to have a number of very low-priority containers with services that won’t be active much but can still be really snappy when they need to be. With the way memory allocation works, as I understand it, it’s much easier to strangle an LXC container than a VM if the intention is to forcibly cache RAM onto the Optane. And on its own that’s an inelegant solution, because it causes aggressive swapping that competes over leftover RAM space for execution.
I think you can make a shared two-tier RAM space
…one that eases the load on all of the other containers you’re running, in one consolidated place. And the best part is that it should basically be doable with no extra software, maybe just a script. There are probably a few ways to improve it, of course.
Create one container that hosts a tmpfs ramdisk as an NFS share, consuming most of its assigned RAM, and give it a healthy swap allocation on the Optane. I’m talking a swap space that could easily be larger than your RAM. Set this container’s swappiness aggressive, very aggressive… we’re talking a makeshift two-tier RAM space.
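Something like this is what I have in mind for the pool container. It’s just a sketch: all the sizes, paths, and the export subnet are made up, and an unprivileged LXC would probably need extra permissions (or this piece may be easier to host on the Proxmox host or a small VM) for the tmpfs mount, NFS export, and swapon:

```bash
# Inside the "pool" container -- sizes and paths are placeholders.

# 1. Carve most of the container's assigned RAM into a tmpfs ramdisk.
mkdir -p /srv/rampool
mount -t tmpfs -o size=4G,mode=1777 rampool /srv/rampool

# 2. Export it over NFS to the other containers on the virtual bridge.
echo '/srv/rampool 10.10.10.0/24(rw,sync,no_subtree_check,no_root_squash)' >> /etc/exports
exportfs -ra

# 3. Give this container a big swap area backed by the Optane
#    (a swap file on an Optane-backed mount here; a partition works too).
fallocate -l 64G /optane/rampool.swap
chmod 600 /optane/rampool.swap
mkswap /optane/rampool.swap
swapon /optane/rampool.swap

# 4. Crank swappiness so idle tmpfs pages age out to the Optane swap quickly.
#    (Note: this is effectively a host-level knob unless per-cgroup swappiness is available.)
sysctl vm.swappiness=100
```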
But why a tmpfs ramdisk, how does that help us?
Well, that way if one of those very restrained containers is suddenly trying to do a lot of work in /tmp or some other tmpfs on the system, you can redirect it to the ramdisk. That cuts out a ton of I/O delay while the process is actually doing work, and settles for the same NVMe-ish speeds on the Optane for the temporary files that don’t get touched as much. It also moves those files out of the dedicated container RAM, freeing up assigned space without messing with each container’s swappiness and keeping that RAM free for execution.
Essentially, any workload that creates a bunch of temp files in RAM in a short period but might leave them there for a while and still use them again (perhaps some generated website stuff?) becomes a candidate. You already know it’s stuff that would take up RAM but probably isn’t as important to performance when it’s not actively being engaged.
If the shared space can cache aggressively enough to keep RAM space open in the NFS share, then it could be used as a RAM pool that’s shared exclusively between the VMs or containers you specify. Not every container will benefit from this idea, but tmpfs filesystems are very common (run df | grep tmpfs and see), so I think my theory has legs.
Could this be a path to over-provisioning with a dedicated two-tier RAM/Optane pool, for better performance under more realistic usage than just seeing how a bunch of overloaded stress tests behave?
Well, at the moment I’ve got containers with Pi-hole, Active Directory, GitLab, and Jenkins (as an agent), for instance. For the most part I want to stack up a bunch of these relatively light containers that don’t get a ton of use and aren’t likely to be active at the same time, and forcibly shove them into swap as much as possible full time, leaving them running while minimizing their use of real-time resources at idle.
More containers will be added for certain tasks: one to periodically check an RSS feed and push it to the DB, one to download a file and drop it in a monitored folder (handing it off to a VM with a lot more resources to uncompress it and process the data), for example.
By creating a pool for a bunch of containers to write their file-based ramdisk items into, any program that uses /tmp files actively while working basically gets a relief valve of more RAM than it’s assigned, by writing to that ramdisk.
Not everything would benefit from it, but since some of those containers will be running custom code, if it works I can abuse it to great effect and have them essentially running in a half-asleep mode until they’re actually needed, while still being responsive when they are.
OK, so what you’re describing is effectively that you want an Optane swap space?
Which then leads me to: why is the built-in swap function of Linux not going to work here, if you put a large swap on the Optane drive (Proxmox is on Debian 12)? You can change the aggressiveness of Linux swap, and I find it’s pretty good at actually swapping things that are used infrequently, the few times I have checked what pages it’s swapping. This would then apply to all containers and VMs; they would all be traded in and out depending on what is needed.
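For reference, the plain version of what I mean is basically just this on the host; the partition name is a guess (check lsblk first) and the numbers are to taste:

```bash
# Turn a spare partition on the Optane drive into plain swap.
mkswap /dev/nvme0n1p4
swapon --priority 10 /dev/nvme0n1p4
echo '/dev/nvme0n1p4 none swap sw,pri=10 0 0' >> /etc/fstab

# Make the kernel more willing to push cold pages out to it.
sysctl vm.swappiness=80
echo 'vm.swappiness=80' > /etc/sysctl.d/99-swap.conf
```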
Not sure about your custom applications’ exact needs, but do you know about the /dev/shm directory in Linux? It’s a filesystem that lives in RAM (there is no backing storage, so PC off equals lost data). I use this for Plex transcoding, so I don’t write the temporary transcoded blobs back to disk just to be erased later.
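Rough idea of what I do; the size is just an example:

```bash
# /dev/shm is a tmpfs most distros mount at half of RAM by default.
df -h /dev/shm

# Give it more headroom for the transcode blobs if needed (example size).
mount -o remount,size=16G /dev/shm

# Then point Plex's transcoder temporary directory setting at /dev/shm.
```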
Umm, no. I’ve got 64GB, and let’s say I want to run a bunch (30) of low-priority services and microservices, and I want to reserve the vast majority of RAM space for the main couple of VMs.
I’m not saying this is the exact intent, but I think it’ll clarify things a bit better. Out of that 64GB, we’re going to dedicate 8GB to 30 containers.
Almost all Linux distros use ramdisks, usually at least /tmp and /run, and instead of those pointing at each container’s tiny bit of actual RAM, they point at the pool. Yes, swappiness can help open up the available RAM on each of the thirty containers by swapping to the Optane cache, but if they’re not pooled then they still only get 256MB each and end up thrashing the swap. Inherently, because these are being written into RAM as files, the data would still have to be retrieved before execution, so it’s a “relatively” safe solution to separate these files out into a larger ramdisk.
The benefit of that ramdisk being a pool is that each of those containers is actually given:
128MB execution RAM + up to 4GB of RAM in the pool
And that served ramdisk can use its aggressive swappiness to age files onto the Optane, freeing up ramdisk space for whatever container needs it. So there’s neither as much pressure on the assigned RAM nor as much thrashing of swap to free space in RAM.
I hope that clears up what I mean. And yeah, /dev/shm would then be a candidate as a way to still be writing those files to RAM while the container is working on them, just not to its own. The container can remove them on its own or let them age into the cache, which would still happen if you cranked up the swappiness on each container. The difference is that the container doesn’t have to fight as hard to keep its assigned memory clear while doing the work. And I think that has the potential for them to behave and perform as if they weren’t so constrained.
Tmpfs is creating a block device from RAM. To then have memory put onto this block device, it would need to be marked as swap, which would just be marking a section of RAM as swap. Does this even need to be in a container? What is a container doing for this idea?
Optane is not RAM; it’s already a block device. You would just directly provision it as swap to have it treated as part of the memory system.
Now we’re adding networking latency and NFS protocol latency, along with the CPU cycles to do both. This just sounds like a recipe for slow.
OK, so if I’m following this idea: in order to funnel a piece of memory to this container using tmpfs and Optane, I would set up a swap space pointing to an NFS share that is really on a tmpfs block device in RAM. Ignoring the networking/NFS part, the first problem I see is that the tmpfs block device has a set size. In order to force the container holding the tmpfs to put its data into the Optane swap, you would have to overfill the tmpfs block device (to fill RAM enough to swap the tmpfs block device into swap), which is impossible; Linux will refuse to write once it’s full. The more direct version of this is an NFS share pointing to the Optane drive, which is then provisioned as swap space after being mounted in the other containers. And this still sounds slow and terrible.
You say microservices, but then each container needs 2.1GB of memory (you can make Ubuntu run in a VM along with an OpenVPN server in 512MB), regardless of whether it’s in RAM or on disk. Most microservices I run in containers need less than 512MB of RAM. So 30 × 512MB is ~15GB, vs ~60GB for 30 × 2GB. Also, most microservices I run spend the majority of their time doing nothing (talking about CPU time here). If it’s not actively doing something, why does it need to be in RAM? Let it live in swap. The fastest way to swap is not via NFS; it’s letting the kernel do its thing.
Why have the CPU jump through all the hoops of using NFS over the network to copy memory to other memory (tmpfs) that’s being treated as a block device, which then gets swapped to Optane (not sure this would ever trigger)? With this level of obfuscation it sounds like the kernel will have a hard time telling which way is up.
I think the takeaway here is don’t reinvent the swap wheel.
Let’s assume you build a tiered swap system. First, I would use zram on some portion of your RAM if you are actually short by as much RAM as you’re implying; it usually nets 2-3x the RAM space using compression. Then I’d set up your Optane swap area.
Now for the questions:
Do you actually think the services running in your 30 containers need 60GB of memory collectively?
Why not mount /tmp or /dev/shm in every container to a folder on the Optane drive (see the sketch after these questions)? That prevents the containers from using memory as a block device, freeing memory for actual program needs.
What are the memory requirements of your VMs?
Will the VMs get their own swap space on the Optane drive?
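For question 2, something like this is what I’m picturing on the Proxmox host; the container ID and paths are made up for the example:

```bash
# Bind-mount a folder on the Optane into the container's /tmp.
mkdir -p /optane/tmp/ct101
pct set 101 -mp0 /optane/tmp/ct101,mp=/tmp
pct reboot 101
```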
Yes, tmpfs and ramfs are the ways that Linux addresses RAM through the filesystem. You keep saying “swap to Optane” like it wouldn’t already have to clear pages from memory into Optane swap, read new pages out of Optane, and then work on them; so there’s already latency added. Your solution breaks the premise of reserving the vast majority of system resources, of intentionally over-provisioning while still creating something that improves their performance when they actually need to work.
I literally said the scenario was a hypothetical and the size of the swap on the LXC containers really has nothing to do with it. They’re kinda cartoonishly large because they’re not that important to the idea. Not sure why you’re hung up on that.
Instead of having 30 containers with cranked-up swappiness (cycles…) all living on 256MB only, the proposal has them each with 128MB plus up to 4GB of RAM in the pool. And yeah, there’s some overhead to write it to another container’s RAM, but there’s more bandwidth on the virtual network than to the Optane drive. The data is already being formatted into a tmpfs in RAM either way.
Um, no… something between 128MB and, let’s say, half of the pool (2GB) at any given time doesn’t seem unreasonable. But I don’t yet know the habits of things removing their own files in temp, so I just made the hypothetical network cache large enough that, with files swapped out to free up space in the ramdisk, it wouldn’t be likely to run out.
Um, because I’m trying to actually give those containers more access to RAM.
Assume the VMs only leave 8GB of RAM for the containers.
Question 2 is kinda more on track though. In theory you could just mount what would be the ramdisk tmpfs straight to the Optane. The result would be that the container keeps more memory free while working, but the work would be slowed by read/write time.
Still, given that the container is constrained anyway, if it were in tmpfs those files would be swapping to the Optane anyway, so they’re probably pretty equal.
Pointing them, instead, to another container’s larger tmpfs keeps them out of their dedicated RAM. So if one of the containers is doing a bunch of work in temp, those files fill up the pool’s RAM and are accessed from the pool’s RAM, only getting swapped down to the Optane (in the pool’s swap) as the container idles.
Not sure where you’re missing this point: Optane is already a block device, it’s not RAM. A ramdisk or tmpfs is turning existing RAM into a block device. You can’t provision Optane as a ramdisk or tmpfs because it’s not RAM. Also, why would you want to when it’s already a block device? Ramdisks and tmpfs turn memory into block devices, not the other way around.
Next point: turning your memory into block devices means it can’t be used for memory things, unless you add it back to the memory system by adding that block device (which is memory) as swap.
Moving where swap lives doesn’t fix the whole swapping-pages part you’re trying to avoid with the bog-standard swap setup. Your system is only adding latency and processing on top, along with consuming RAM to build the system.
Back to my proposal:
Instead of provisioning 8GB to containers to make this convoluted swap area, provision 8GB to zram to make a compressed RAM area. Now you have 56GB of RAM + approximately ~24GB of RAM space that gets compressed into that 8GB + say a 128GB Optane swap area. That gives you ~80GB of memory in your RAM, and an additional 128GB on the Optane swap.
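Roughly like this; the sizes, compression algorithm, and Optane partition name are all just examples:

```bash
# Tier 1: compressed swap in RAM, highest priority so it's used first.
modprobe zram
zramctl --size 8G --algorithm zstd /dev/zram0
mkswap /dev/zram0
swapon --priority 100 /dev/zram0

# Tier 2: swap on a spare Optane partition, used when zram fills up.
mkswap /dev/nvme0n1p4
swapon --priority 10 /dev/nvme0n1p4

# Sanity check the tiers.
swapon --show
```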
The point is not to treat the Optane as RAM. The point is to pool the parts of RAM where BLOCK DEVICES are already in use into a larger hosted TMPFS in a container with more RAM. The point is to give them more RAM in a pool, so they swap to Optane LESS. And yes… in order to keep that ramdisk available for the containers, it will swap out inactive pages.
If I lock them all to 256MB and take up the 8GB, what happens when one needs 512MB? It has to swap to the Optane.
Giving them a pool of 4GB that’s hosted on the virtual network points all the memory that’s used through a block device, like tmpfs or ramfs, at a network-hosted block device instead, because block devices are much easier to handle across the network. And since that block device happens to also be RAM, the container gets additional RAM that isn’t accounted for in its execution space, making it more likely to swap less.
Whatever latency is introduced is not likely to be worse than the Optane, given the bandwidth on the virtual network and the fact that it’s stored in RAM. tmpfs isn’t addressed by memory addresses; it’s addressed through the filesystem and treated as a filesystem. So why do those temporary files need to be in that container’s RAM, so long as they’re still addressable?
I think this is the rub point, where we are not understanding each other.
What I am saying is you cannot point a container at a block device and have it treat it as RAM. You can tell a container to use a block device as its own personal swap. So, in summary, that 256MB container would still only have 256MB of memory, and then a swap space as big as whatever block device you pointed it at. If a program in said container exceeds the 256MB of RAM, it still either has to crash because it’s out of memory or get its pages swapped to a block device.
To be clear, no matter what is backing this block device, a tmpfs RAM block device or an Optane partition block device, that does not change the fact that the block device cannot be treated as RAM, only as swap space.
In the grand scheme of the system other things may swap less due to the choked nature of these containers, but the containers themselves would be more likely to swap, just now to a very specific block device.
Also, I’m still not clear on the purpose of these 30 containers. Are they purely there to provide these tmpfs block devices back to other services?
With NFS, the networked tmpfs ramdisk can have a folder mounted in place of the /dev/shm tmpfs that would otherwise be created in the container’s RAM. Any program writing to or reading from that area would still be writing and reading from RAM, with some overhead, I’ll grant you.
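Roughly this, on the client side; the server address and paths are placeholders (matching the pool sketch above), and again an unprivileged LXC would likely need extra permissions to do NFS mounts:

```bash
# Mount the pool container's exported ramdisk.
mkdir -p /mnt/rampool
mount -t nfs -o vers=4.2,noatime 10.10.10.5:/srv/rampool /mnt/rampool

# Give this container its own folder in the pool and lay it over /dev/shm (or /tmp).
mkdir -p /mnt/rampool/$(hostname)
mount --bind /mnt/rampool/$(hostname) /dev/shm
```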
If the program is active and doing work, then those files are held in RAM. Idle, and they swap out to make room for some other container to do work in RAM.
There’s still a problem of clearing the files on the network ramdisk when a container restarts but that’s not really that hard to solve.