I’ll have to look into that as well (and also study what performance impact it may have during the course of an FEA run, where the CPU will also have to manage the zswap compression).
I like both ideas though. It’s worth it for me to read up further on both of these suggestions and I appreciate that.
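(For when I do get around to testing it, something like the sketch below should at least show whether zswap is engaged and what compression ratio it’s actually getting during a run. This is just a rough sketch on my part; it assumes the standard sysfs/debugfs paths and 4 kiB pages, and reading debugfs typically needs root.)

```python
#!/usr/bin/env python3
# Rough sketch: dump zswap parameters and estimate the current compression
# ratio from the debugfs counters. Assumes 4 kiB pages; needs root for debugfs.

from pathlib import Path

PARAMS = Path("/sys/module/zswap/parameters")
STATS = Path("/sys/kernel/debug/zswap")
PAGE_SIZE = 4096  # assumption; check with `getconf PAGESIZE`

for param in sorted(PARAMS.glob("*")):
    print(f"{param.name:>20}: {param.read_text().strip()}")

if STATS.exists():
    stored_pages = int((STATS / "stored_pages").read_text())
    pool_bytes = int((STATS / "pool_total_size").read_text())
    if stored_pages and pool_bytes:
        ratio = stored_pages * PAGE_SIZE / pool_bytes
        print(f"{'compression ratio':>20}: {ratio:.2f}x "
              f"({stored_pages} pages in {pool_bytes / 2**20:.1f} MiB of pool)")
```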
Could also be shit tier SSDs, I’ve had an SSD used as boot drive for TrueNAS (at home) fail with filesystem corruption after 1 month.
Was not impressed, lol… only ran an SSD because new install of TrueNAS warned me about my thumb drive being used for boot… which had been in continual use for 10 years across many installs. Brand new SSD, fails after 1 month. Go figure.
OP also said SMART usage indicated some tiny amount of writes.
So the SSD didn’t die from excessive swap - because it had some tiny amount of writes.
also, the write limits are usually conservative:
Bear in mind that test is from 10 years ago, on consumer drives of 240-250 GB in size, including some using TLC NAND. Even so, the drives lasted WAY in excess of their rated endurance.
You’re mixing sequential transfer writes with random writes.
Writing 4 GB of data, sequentially, every 5 minutes, would still only be about 1.15 TB of sequential writes per day.
Writing 4 GB of RANDOM data works entirely differently on SSDs, because what matters is the incremental volume of data being changed every second.
~4 GiB / 300 s = 14,316,558 bytes/s of random data being written.
If you assume SSDs have a 4 kiB sector, that works out to around 3495 sectors being written per second.
If you’re writing them at random (you either find a new sector each time you write, or you have to flush/erase an existing sector before you can program/write to it again), you’re doing almost 3.5k sector writes every second. Multiply that by 86400 seconds/day and you’re looking at approximately 301,968,000 sectors written to the SSD, per day.
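(If you want to sanity-check that arithmetic, it’s only a few lines, assuming the 4 kiB sector size from above:)

```python
# Back-of-the-envelope check: 4 GiB of random writes every 5 minutes,
# expressed as 4 kiB sectors per second and per day.

GIB = 1024 ** 3
SECTOR = 4 * 1024  # assumed 4 kiB sector

bytes_per_sec = 4 * GIB / 300            # 4 GiB every 300 s
sectors_per_sec = bytes_per_sec / SECTOR

print(f"{bytes_per_sec:,.0f} bytes/s")                   # 14,316,558 bytes/s
print(f"{sectors_per_sec:,.0f} sectors/s")               # ~3,495 sectors/s
print(f"{int(sectors_per_sec) * 86_400:,} sectors/day")  # 301,968,000
```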
In other words, you’re constantly programming, flushing/erasing, and then re-programming, the same number of sectors, every day.
GamersNexus actually has an article on How SSDs Work, published in 2014, but the general structure and architecture hasn’t fundamentally changed since then.
In his example, he references “Micron’s 16nm, 128Gb Flash”.
4 GiB in bits is 32 Gib.
That means that each time you write ~4 GiB of random data to it, you’re writing about 32 Gib of data to the flash.
The example that GamersNexus cites is also a dual-plane design, where each plane contains 1024 blocks and each block contains 512 pages of 16 kiB each.
As he also notes, in this example, the SSD can only write the data in 16 kiB pages (and it can move them around, with wear levelling).
Rounding up to 4 GiB, divided by 16384 byte pages, that would be 262144 pages.
If a sector is 4 kiB and a page is 16 kiB, therefore; one page = 4 sectors.
If one block is 512 pages of 16 kiB, then therefore; one block is also equal to 512 * 4 sectors = 2048 sectors/block.
There are 1024 blocks/plane in the example that GamersNexus is using, therefore; 2048 sectors/block * 1024 blocks/plane = 2097152 sectors/plane.
It’s a dual-plane die, in GamersNexus’ example, therefore; it would be 4194304 sectors per 128 Gb die.
The 480 GB drive in the example itself has 32 dies, and therefore; 4194304 sectors/die * 32 dies = 134217728 sectors.
Since we had previously calculated the number of sectors written per day (writing 4 GB every 5 minutes = 301968000 sectors written/day), therefore; dividing that by 134217728 sectors gives you around 2.25 DWPD (using this 480 GB example that GamersNexus cites).
Even if you gave it the referenced 3000 program/erase cycles per NAND flash die as its rated write endurance limit, you’re programming and erasing the entire drive more than twice a day.
Therefore; 3000 P/E cycles / 2.25 DWPD = 1333.333 days of life. Divide by 365 days/year and you’re looking at about 3.65 years.
And that’s JUST with writing 4 GB every 5 minutes, or about 14.3 MB/s, which is a tiny fraction compared to what the rated spec is, even for random I/O.
If you end up just writing 16 GB every 5 minutes (so about 57.2 MB/s, which again, compared to even the random I/O spec, is only a fraction of what an SSD would be capable of doing), you’d wear out the drive in 0.91 years.
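(Same calculation, scripted, using the drive geometry from the GamersNexus example. It’s a rough sketch rather than exact NAND accounting, but the numbers line up:)

```python
# Lifetime estimate for the 480 GB example drive: 16 kiB pages, 512 pages/block,
# 1024 blocks/plane, 2 planes/die, 32 dies, 3000 rated P/E cycles,
# with 4 kiB sectors as the accounting unit.

SECTORS_PER_PAGE = (16 * 1024) // (4 * 1024)      # 4
SECTORS_PER_BLOCK = 512 * SECTORS_PER_PAGE        # 2,048
SECTORS_PER_PLANE = 1024 * SECTORS_PER_BLOCK      # 2,097,152
SECTORS_PER_DIE = 2 * SECTORS_PER_PLANE           # 4,194,304
DRIVE_SECTORS = 32 * SECTORS_PER_DIE              # 134,217,728
PE_CYCLES = 3000

def years_of_life(sectors_written_per_day: float) -> float:
    """Rated P/E cycles divided by drive writes per day, converted to years."""
    dwpd = sectors_written_per_day / DRIVE_SECTORS
    return PE_CYCLES / dwpd / 365

print(f"{years_of_life(301_968_000):.2f} years")      # ~3.65 (4 GiB / 5 min)
print(f"{years_of_life(4 * 301_968_000):.2f} years")  # ~0.91 (16 GiB / 5 min)
```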
57.2 MB/s isn’t that difficult to achieve, especially if you use it for swap.
And you don’t even have to maintain that 57.2 MB/s write rate for the entire year.
As long as you average out to 57.2 MB/s during that whole time, you can wear out the drive in less than a year.
it’s really not that hard.
And 480 GB drives, even with enterprise SSDs, some of which are only rated for 1 DWPD, are very easy to come by.
It’s not that hard.
How many people who use Proxmox, are also using it for HPC/CAE/CFD/FEA applications?
I mean, it’s not like everybody who’s using Proxmox, is using it to figure out what the cross-flow Weber Number needs to be for spray deposition and atomisation.
In other words, I’ve already told you why.
Once again, you don’t read.
Again, if I don’t write it, that’s my fault. If you don’t read it, it’s yours.
It has to do with how HPC/CAE/CFD/FEA applications behave when they try to run without swap.
I’d invite you to study, using FOSS examples, how Salome Meca runs, so that you have an idea of what that actually looks like/what it entails.
I doubt that you’d actually put in the work that is required, but you can either study how that works and/or you can study how OpenRADIOSS runs, especially if you are using it as an implicit solver.
Again, I doubt that you’re actually going to put in the work that’s required to study it, but as I’ve already told you, I already know why it crashes when swap is disabled.
If I don’t write it, that’s my fault. If you don’t read it, that’s yours.
The SSD was used only for swap, and that’s how I knew swapping was what killed it.
Swapping to HDDs often means async writes, and async writes are often only about 5 MB/s. (I’ve been using the h2benchw tool since, oh gosh…like 2004/5/6? Something like that. I used to use it to benchmark the Fujitsu 10k rpm 2.5" SAS HDDs that are in the SunFire X4200, and even then, they could still only barely muster 5 MB/s when running the swap part of the benchmark.)
Random I/O is precisely that – random I/O. It will write to the pages of the NAND flash modules, randomly, rather than sequentially.
People often use SSDs to accelerate their sequential file transfers and that’s great and all, because you can realise significant gains over HDDs for this use case where you’re going from like ~100 MB/s to say ~550 MB/s with a SATA 6 Gbps SSD. (And of course, even faster if you’re using any form of NVMe SSD (3.0, 4.0, 5.0, U.2, M.2, AIC, E1.S EDSFF)).
But that’s only a 5.5x throughput increase.
If you look at random I/O though, even a lowly drive (Intel 545s Series 1 TB) can pull off up to 90k random I/O/s.
Assuming that a sector is 4 kiB, 4 kiB * 90000 = 360000 kiB/s, which is roughly 352 MiB/s (~369 MB/s). Going from ~5 MB/s random write speed on an HDD to roughly 360 MB/s on an SSD is about a 72x increase.
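(Or, in script form, if you want to play with the sector size or the IOPS figure. Both values are assumptions about the workload/spec sheet, not measurements:)

```python
# Random-write throughput implied by an IOPS figure, assuming 4 kiB per I/O.

SECTOR = 4 * 1024   # assumed bytes per random write
IOPS = 90_000       # spec-sheet random write IOPS (Intel 545s class)
HDD_SWAP = 5e6      # ~5 MB/s async swap writes to an HDD

throughput = SECTOR * IOPS
print(f"{throughput / 1e6:.0f} MB/s  ({throughput / 2**20:.0f} MiB/s)")  # 369 MB/s (352 MiB/s)
print(f"~{throughput / HDD_SWAP:.0f}x faster than HDD swap")  # ~74x (~72x if rounded to 360 MB/s)
```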
Again, I know that many people use it to boost their sequential transfer performance. And that’s fine. You get a nice 5.5x boost, in performance.
But I purposely and deliberately used it exclusively as a swap drive, because very early on, I realised that the ~5 MB/s write speed whilst swapping stood to gain the most significant improvement: even a SATA 6 Gbps SSD that can do up to 90k I/O/s would be a ~72x speed-up of my swapping performance.
This is why the SSDs in my compute nodes, were dedicated swap drives, entirely because of this.
And the only people who would even bother with running this analysis, are the people who are facing this problem.
For everybody else, if they don’t run into this problem, then they’re not going to bother studying it, and then devising a solution to address said problem.
Again, you’re also more than welcome to run the tests yourself.
Salome Meca is FOSS, as is OpenRADIOSS now.
You can download it, and the models to run, and you can watch it go (and how it behaves on your system).
And typically, the FOSS benchmark models that you’re going to download are relatively simple ones, versus having a three-part assembly with frictional contacts everywhere (rather than bonded contacts, which are vastly easier for the solver to solve), where at least one of the parts is an anisotropic, glass-filled polymer. IIRC, I was using the Johnson-Cook non-linear material model to deal with the fact that it was an anisotropic, glass-filled, injection-moulded nylon.
In other words, you can take the basic model that you’ll be able to download for free, modify it to introduce these non-linearities into the problem, re-run your benchmark sim, and then examine how the solver runs during the course of the solution process for this example problem.
(There’s a chance that you might actually have better luck trying to solve this using the implicit solver with OpenRADIOSS than Salome Meca, you can try either one and/or both.)
It’s pretty easy to say “swap killed my SSDs” when said SSDs were dedicated swap drives and it wasn’t used for anything else.
(Why burn up the finite number of P/E cycles on stuff that doesn’t need it? That’d be a waste.)
The old HPC cluster used to be a Supermicro 2027TR-HTRF, which has a Supermicro PWS-1K62P-1R, so if the PSU had issues, it would likely have taken other parts of the system down with it, given the way the 2U, 4-node system is set up.
System is too old for M.2 slots.
FWIW, they’re SATA 6 Gbps SSDs. (Intel 545s Series 1 TB SSDs to be precise).
At the time, there was insufficient data to suggest that Intel’s 545s Series SSDs were crap.
For sequential writes, it’s probably fine.
For random writes, however, Intel originally didn’t want to approve the RMA request because they, likewise, initially thought that I had blown through the drive’s daily/overall write endurance limits.
However, I just pulled the SMART report straight from the drive and sent it to them, showing that the host write volume, divided by the power-on hours and multiplied by 24, was well under their TBW/DWPD limits, so they ultimately approved the RMAs.
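(For reference, the arithmetic behind that check is trivial to script. The numbers below are hypothetical placeholders, not the actual figures from my drives; pull yours from the drive’s SMART report, e.g. via smartctl:)

```python
# Average drive writes per day (DWPD), worked out from SMART figures.
# Hypothetical placeholder values - substitute the host-writes total and
# power-on hours reported by your drive.

host_writes_tb = 18.4      # hypothetical: lifetime host writes, in TB
power_on_hours = 26_280    # hypothetical: ~3 years powered on
capacity_tb = 1.0          # drive capacity (1 TB)

tb_per_day = host_writes_tb / (power_on_hours / 24)
dwpd = tb_per_day / capacity_tb

print(f"{tb_per_day:.3f} TB/day written on average")
print(f"{dwpd:.3f} drive writes per day")
```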
Ok, but the question still stands… same port, or not? Sometimes SATA ports go bad. Sometimes data and power cables go bad as well. Was there a common physical element (i.e. port, data cable, power cable) for ALL seven SSDs that died?
The evidence on that front is that if it were a port issue, Linux, luckily (or perhaps unluckily), will report port/device timeout issues.
(e.g. happens with dead/dying SAS drives)
The Intel SSDs, when they died, had another rather interesting/peculiar problem: normally, when an SSD exhausts its finite number of P/E cycles, it is supposed to put itself into a read-only state, so that you can still read the last set of data that you wrote to the drive; you just can’t write any new data to it.
However, the problem that I ran into was that it locked itself up so badly, that I couldn’t even get it to do that.
I was barely able to get it to dump out the SMART data before it completely died such that I couldn’t even read from it anymore.
It was a weird case that I filed an RMA with Intel for, which they approved.
And then additionally, because it was also a dedicated swap drive, and since swap kicks in on boot, it created other problems as well.
So yeah.
Dead SSDs, sometimes, can be a PITA to deal with. And at the time, I didn’t know how to disable swap via modifying the Linux kernel parameters at boot.
(Port issues, would sometimes, result in like a “cable unplugged” kind of message that’s written to the sys logs.)
(Sidebar: if you ever want to see people beat the crap out of computers – talk to an engineering team. It’s been said that the data for like the latest Intel CPUs, just the lithographic layout data, is somewhere on the order of several TBs.
When I used to run the aerodynamics CFD on the model of the semi (shown above), each run of that model for 2 seconds of simulation time (the fastest I ever got it to run was about a week with 64 CPU cores) wrote out about 5.1 TB of data.
For the stuff we run at work, some of our thermal models now use 1920 CPU cores and run for 20 days straight (or thereabouts), and this is why we even have HPC nodes with 4 TB of RAM now in our HPC cluster: prior to that, after the runs were done, we couldn’t post-process the CFD results, because when we tried to load the results in, the node would crash. (Even with 4 TB of RAM, we’ve gotten the 4 TB RAM nodes to crash as well.)
You wanna see people beat the crap out of computers? Talk to an engineering team.
One of the guys that works at SpaceX that I know from college, one of his Monte Carlo sims will eat up 2% of the write endurance limit from a E1.S EDSFF SSD. And he has to run it over and over again, for whatever problem he’s working on solving. My dad wanted me to get into computers. I went into engineering instead. After graduating, I told him that as an engineer, we will never have “too much” computing power/resources. We’ll just keep making our simulation models more and more detailed and we’ll suck up any and all available compute power you give us.))
I suggest OP go and pay for professional consultation.
This forum looks to me like a hive of ‘plebs’. Your combative attitude and verbal abuse aren’t helping you either. Those who do speak can’t truly satisfy or help; those who know better won’t speak to you. Others might just enjoy the schadenfreude.
This thread demonstrates an S&M relationship I haven’t seen in a long while in online discourse.
Okay, lots of thrashing against the wall, and now putting swap into ram
Has the contents of swap been checked?
And seeing if the data is coming from particular programs?
OP mentioned that FEA runs filled the ram, but presumably others would too?
Mostly, has a larger ram drive / partition been used, to see what size swap actually grows to when it’s allowed to use the space requested, instead of being constrained to the stupidly small default of 8gb?
Of course the free ram means applications/system should not need to use the swap, as there is no memory pressure.
But the fact is, that swap is being used, and it seems it often fills the current drive.
So how full, and of what.
Can even be tested with a ramdisk swap drive by using a larger one, which is okay, because swap is used even when ram is half full, so you can use almost half the ram for swap… (Okay, that was a slight joke)
Again, the rule of thumb was always 1x swap to ram, sometimes higher.
Obviously OP has a lot of ram, but again, if the tiny 8gb swap drive is full, it is not protecting against out of memory, as the system would basically have to stop and try to fit a city bus through the letterbox of available space…
And it seemed at the time, that VM’s were populating the swap partition.
If that is still the case, how do the VM stats line up?
Are they also having most of their ram free?
Are they also using a lot of the swap that they have enabled?
I apologise if I missed you already reporting this; the back-and-forth with fish was tedious to skim through.
Obviously, the default swap is no use. So that should be well out of the window.
If using memory for swap works, then it is cheaper than throwing away SSDs. But 8gb in use sounds inefficient.
Also, the FEA workload: does that run native, or in a container/VM? If a container/VM, have you played with the swap inside that?
I posted my question here, hoping that someone might have experience from their professional work environment, in using and/or dealing with large(r) memory systems.
I don’t expect people to have a system with 768 GB of RAM, in their homelab.
(At least not based on the homelabbing tech YouTubers.)
(I mean, I had four compute nodes with 128 GB of RAM each, 512 GB total, and a 100 Gbps Infiniband network, running in my basement, since around July 2017. I know that most people aren’t going to have 100 Gbps IB running in their homelab. I mean, it’s 2025, and there are still homelabbing tech YouTubers that are still talking only about 10 GbE networking, for their said homelab.)
But I took a chance that maybe, just maybe, people might have experience with large(r) memory system, via their work or something.
It was a gamble.
I think that you should spend more time, studying the attitude of the responses that results in said “combative attitude”.
People are always so quick to blame the ends, but never study what causes/results in the apparent “ends”.
It’s like the little kid who complains about the older sibling, reacting, when in reality, it was the younger sibling who was perpetually poking at the older sibling, that resulted in the older sibling finally reacting to said perpetual poking.
What you are talking about is when the younger sibling then tells the parents “Mom!!! Brother hit me!!!” (but leaves out the part where the younger sibling was perpetually poking the older brother, all along)
I’ll have to google how to do that. (I didn’t know that it can be done, so I’ll have to look this up on how to do that.)
Varies.
I mean, currently RAM utilisation is about 350 GiB, so presumably I’ve got enough stuff running that’s consuming 350 GiB of RAM, besides FEA.
Indirectly?
When I turn swap off, it dumps the contents from swap, back into RAM.
But when I ran that quick experiment, it didn’t increase the RAM utilisation dramatically/significantly.
But again, I haven’t run this test for an extended duration.
100% agree, hence the basis behind my question.
(But @level1 I think, really helped to explain why I might still be seeing swapping, because if I exceed any NUMA node/zone, then that would help to explain why it was swapping despite free RAM being available, at the whole system level, but not accessible between NUMA zones.)
I’ll have to look into how to query what is the contents of swap.
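(A first pass at this, going by the VmSwap field in /proc/&lt;pid&gt;/status, would probably look something like the sketch below; run it as root to see every process.)

```python
#!/usr/bin/env python3
# Rough sketch: per-process swap usage, read from the VmSwap field in
# /proc/<pid>/status. Run as root to see all processes; prints the top 20.

import glob

usage = []
for status_path in glob.glob("/proc/[0-9]*/status"):
    try:
        with open(status_path) as f:
            fields = dict(line.split(":", 1) for line in f if ":" in line)
    except (FileNotFoundError, PermissionError):
        continue  # process exited, or we can't read it; skip
    swap_kb = int(fields.get("VmSwap", "0 kB").split()[0])
    if swap_kb:
        pid = status_path.split("/")[2]
        usage.append((swap_kb, fields["Name"].strip(), pid))

for swap_kb, name, pid in sorted(usage, reverse=True)[:20]:
    print(f"{swap_kb / 1024:10.1f} MiB  {name} (pid {pid})")
```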
It’s not a bad idea/joke.
I can certainly try it.
Agreed.
Will need to study/investigate.
Varies.
The reporting back to the host can be unreliable, even with the qemu-guest-agent installed.
Windows will report the RAM as being 90% used, but if you check the Windows Task Manager, inside the VM, it might be like in the tens-of-percent.
And then for Linux guest VMs, Linux caches aggressively, so they will also be reported back to the host as >90% RAM usage, but again, the actual used amount can be significantly less than that.
I’ll have to check.
I’ll have my codestral:22b AI/LLM try and whip up a script that would be able to query the VMs for me, and spit out a report of some kind.
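(If nothing else, a starting point for that script would be to pull the per-VM memory figures out of the Proxmox API via pvesh, with the caveat above that the host-side numbers include the guest page cache, so they’re really an upper bound. Something along these lines:)

```python
#!/usr/bin/env python3
# Starting-point sketch: list per-VM memory usage on a Proxmox host by
# querying the cluster resources endpoint via pvesh. Host-side "mem"
# includes guest page cache, so treat it as an upper bound.

import json
import subprocess

result = subprocess.run(
    ["pvesh", "get", "/cluster/resources", "--type", "vm",
     "--output-format", "json"],
    capture_output=True, text=True, check=True,
)

vms = json.loads(result.stdout)
for vm in sorted(vms, key=lambda v: v.get("mem", 0), reverse=True):
    mem_gib = vm.get("mem", 0) / 2**30
    max_gib = vm.get("maxmem", 0) / 2**30
    print(f"{vm['vmid']:>6}  {vm.get('name', '?'):<24} "
          f"{mem_gib:6.1f} / {max_gib:6.1f} GiB  ({vm.get('status', '?')})")
```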
No worries, it’s all good. Thank you.
Agreed.
(Again, I used to play with trying to create a large RAM disk (440 GB) using Gluster v3.7, and then exporting the distributed striped volume over my 100 Gbps IB network via NFSoRDMA. It had some success (i.e. it worked, but not as well as I had hoped).)
The last time I was experimenting with that, it was in a LXC (because it was sharing a GPU).
But previously, I used to run it on bare metal.
It’s been a bit of a long(er) road trying to get these apps containerised, because Debian’s opensm doesn’t support virtualisation.
So, I haven’t quite figured out what I am going to do about this yet. One option is to pass the ConnectX-4 through to a CentOS VM and let it run opensm (being a RHEL-derived distro, virtualisation would be supported), but I’ve already found out that virtio-fs ↔ NFSoRDMA doesn’t play nicely together. Alternatively, I can maybe just pass one of my dual ports over to said CentOS VM and let that run opensm, but that seems like such a waste. The third option that I’ve been contemplating is to spin up another system and have that be the “dedicated” opensm host, but that idea is also kinda dumb, because I’d be spinning up an entirely separate system just to do that. (It might be less dumb if I buy one of those ITX or microATX boards from AliExpress that has an N100 processor soldered onboard, with a PCIe x16 phy/x16 elec slot, drop one of my six ConnectX-4s in that, and then let that be the dedicated opensm host.)
I dunno.
Haven’t figured it out yet.
Once I get IB SRIOV/virtualisation up and running, then I can continue with containerising my HPC apps.
But that project’s been suspended now, for quite some time, because I can’t decide on what to do about the fact that the Debian version of opensm doesn’t support virtualisation.
But that really, has nothing to do with my OP here.
Whole 'nother set of issue with that.
(edit: I didn’t see the link for how to check what’s using swap until after. Thank you for providing the link to me already. Cyberciti.biz is a great resource for me to learn how to do stuff, in Linux. I’ve used that site a few times and I appreciate you linking me.)
I didn’t know that you can probe/check the contents of swap, so as I mentioned, I’ll have to look up how to do that, cuz it’d be worth at least looking at said contents of swap.
The fill and size are probably related - because that’s the default when you “next” your way through the Proxmox installer (and the installer for the most common Linux distros).
I might test that when wife and kids have gone camping.
It is very dependent on the size of the problem that I am trying to solve.
I think that the last model that I was running only had something like maybe 563k elements, but FEA solvers, generally, are linear algebra matrix solvers, and as such, they usually/generally work best when they can run the entire solve process, from RAM.
(But of course, with that, the system then “freaks out” and thinks that I’m about to run out of memory, and so it starts swapping stuff out, because swap doesn’t know (and probably doesn’t care) that the memory load is temporary.)
And then when it can’t swap out, it sees the solver processes as the high memory consumers, and kills them off, as “OOM” errors. Something like that. (I haven’t studied it in depth, I just know/see it happen via top, where the solver processes would go from running to pulling off a disappearing trick.)
How about swapping not to a partition but onto a file on a flash-friendly F2FS within your root FS ?
Make it with a compression option active.
F2FS should act as an intermediate layer with its COW behaviour and spread the writes in a more optimal way, and compression should condense the data and break up the destructive patterns that excessively wear & tear NAND flash.
NAND flash REALLY doesn’t like changing bits in a sector - it has to rewrite the sector entirely.
And when writing it to a new place, it can only write 0s over 1s - so it has to write into an empty (erased) sector.
And it can only erase whole blocks of sectors.
So it’s easy to see how even small writes can be destructive and cause a whole lot of re-writing, erasing, etc. if done in a flash-unfriendly manner.
Controllers do their best to alleviate the problem, but their resources and possibilities are limited.
So F2FS+file compression might just solve the problem.
On second thought, I’d skip zswap etc., because they are a workaround/kludge that, in the end, doesn’t compress evicted pages on the disk - only in RAM.
Allegedly it also helps to partition the drive so that you leave some space free at the end (unpartitioned space). The controller should be able to recognise that and use that space as spare area for wear levelling and defective-sector replacement.
People recommend ~7-20% as optimal.
I’ll have to research this further and test it out.
Thank you.
Gotcha.
I didn’t know that.
Thank you for sharing your expertise. I appreciate this.
Ah…okay…I see.
So it would, on the surface, look like an extra layer of overprovisioning for wear levelling, basically, above and beyond what the drive already has built in?
His recent remarks were talking a bit about the SSDs side of things, rather than the swap side of things.
Prior to that, he commented about how 768 GB of RAM would have cost thousands of dollars, but I pulled my invoices from eBay and showed that the upgrade only cost me $354.81: going from 256 GB and adding an additional 512 GB (for said $354.81) to bring my system to its current total of 768 GB of RAM installed.
And then I ran through the arithmetic, which also showed the TARR calculations for needing to replace an SSD annually: if I maintained his assumption of ~$100/TB (of SSD), an 800 GB drive would be $80 (it’s not, as my citation from Newegg shows, but I ran with his numbers anyway), and showed the TARR result.
If I ran with the actual cost of an 800 GB SSD at $278.88, then the TARR timeline would be shortened by 3.5x, so instead of taking 7 years to TARR out, it would be closer to 2 years at that price, beyond which it becomes a recurring cost, whereas the additional RAM is a buy-once-and-forget-about-it proposition.
The TARR analysis drives the decision making process, at least in part.