Devember 2021: Introducing UM, the Universal Modeler

Hi there, I am not really an active forum participant, but I’ve been watching L1T for a few years now, and this year I decided to enter the Devember challenge with a project of mine that I’ve been prepping for quite a while.

So what’s UM exactly?

Well… ummm… it’s my take on computing. It is inspired by the slew of weaknesses, limitations and design flaws that riddle software all the way from programming tools to end user software.

How is UM different?

  • It is not text based, neither compiled nor interpreted – it more or less works directly on raw memory, with some basic safeguards

  • It must be usable through any generic input means – keyboard, mouse, touch, and further down the line voice – for the lazy and for those with disabilities

  • No syntax means no syntax errors; logical errors are still possible, but for all “design time constructs” they must never result in crashes, only graceful failures

  • It is designed with efficiency in mind, the design goal is to provide good user experience even on affordable phone and tablet devices

  • It must lower the bar of hardware requirements, so that even those who can’t afford powerful laptops or desktops can be productive with it, and I mean on the actual device rather than some front end to a web service

  • Its base ABI must be portable across every contemporary platform type and dependency free, with modest core requirements – 32 bit integer support and heap memory allocation, basically everything all the way down to MCUs

  • It is based on a few primitive built-ins, which can be used to create compound user types on the go – there is no compilation necessary

  • It must be almost entirely programmed within itself, there should be a very minimalistic pre-written core, which also means you can easily implement or extend language level features

  • It is not just for programming, as the “Universal” part of the name should imply. Initially it will be, but the plan is to extend it into a whole ecosystem of products and services that deliver the various benefits computing can provide. You know… much like how there is quite a lot of benefit to being able to speak and write outside of being a professional speaker or writer

I do realize all this sounds overly ambitious, especially for a single dude to implement, but I’ve been contemplating this, learning, practicing, researching and testing various components of it for nearly 10 years now. My primary language is C++, although I am fairly proficient with C and JS as well, and I’ve also analyzed a number of other languages.

Here is the gist of my findings to disagree with – featuring two representatives each of 3 language groups – the powerhouse old timers, the declining mid-rangers and the popular kids: 6 languages across 6 key metrics, with user friendliness in the top area of the chart and hardware friendliness in the bottom:

Obviously, stuff like that is somewhat subjective, and there is no need to start a language war here, but I think everyone sufficiently competent can agree that:

  • C and C++ are the performance kings, but that comes at a cost – they are quite tedious to work with, for those who know or can imagine better. C is rather simple as a language, but its usage idioms require a lot of boilerplate, since it lacks any high level features. You can write almost anything in C, but for the vast majority of tasks C is actually a very poor choice. C++ has been trying to improve on that weakness, and as a result it has gotten horrendously complex – and I don’t mean the basic stuff, I mean learning the language in its entirety

  • Java and C# are IMO the product of two different companies and the same motivation – to address the weaknesses of C++. They did manage to provide a balanced solution with fairly proportional trade-offs, and for a long time they were very popular, but recently they have been losing ground to even easier and more user-friendly languages

  • JS and Python are two of the most popular languages today, something they owe entirely to their ease of use, which unfortunately came at the cost of terrible performance and efficiency, to the extent that they are not at all applicable for performance critical work or resource constrained platforms.

Also note that there are many other important aspects to programming; those are just the basic fundamentals. And yes, there are many more types of languages, and newer ones that are quite promising – Rust being an honorable mention that does solve some of the C/C++ problems, but IMO not enough.

My design goal for UM is to basically fill the hexagon all the way. My research didn’t identify any technical reasons for the trade-offs between different metrics, but rather unrelated design choices, and I consider it practically feasible to design a tool that checks all those boxes, and then some.

Obviously, it is totally unrealistic to expect to implement UM in its entirety in a little over a month, so here is my list of functionality that I think is doable:

  • Implement the Default Data Engine – this is basically a glorified container of data resources with IDs that are unique and persistent across sessions. It is a custom design, optimized for both memory efficiency and look-up performance, plus some integrity checks. It can contain up to 2^32 discrete data resources, each of which can address up to 16 gigabytes of memory, for a total of up to 64 exabytes of theoretical addressing space (see the quick sketch after this list)

  • Implement the Abstract Object Model – the set of logic that defines and structures data resources into traversable object trees

  • Write extensive tests for DDE and AOM – for both performance and stability

  • Implement a basic GUI for user interaction, until the project reaches a state it can produce its own and more feature-rich GUI

  • Implement a minimal set of built-in primitives

  • Implement the static core language types

  • If there is still time, implement the Pooled Data Engine – an abstraction that works on top of the DDE to further improve efficiency for short data resources
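
To put those addressing numbers into perspective, here is the back-of-the-envelope arithmetic as a tiny snippet – purely illustrative, it says nothing about the actual DDE internals:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const std::uint64_t max_resources = 1ull << 32; // 2^32 discrete data resources
    const std::uint64_t gib_per_res   = 16;         // 16 GB per resource (2^32 items x 4 bytes)

    // 2^32 resources x 16 GiB = 2^36 GiB = 64 EiB of theoretical addressing space
    const std::uint64_t total_gib = max_resources * gib_per_res;
    std::printf("theoretical space: %llu EiB\n",
                static_cast<unsigned long long>(total_gib >> 30)); // prints 64
    return 0;
}
```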

Note that there won’t be any code initially, not until the internals are thoroughly tested and finalized. The source will become open when the API is complete and stable, and the project is ready for contributions. Due to the nature of the project though, I do not expect a lot of contributions to its written code base, but rather internally, through the tool itself. The initial C/C++ code base will eventually be re-implemented in UM as well.

I will provide progress updates, test results, case studies and even binaries to run tests and benchmarks. I will need help testing things at scale, because my system has only 24 gigs of RAM.

I will try to provide an alpha release before the end of November and a functionally usable beta version before the year ends.

Now back to work, I will come back when there is progress to report.


220 lines of code later, I have a reference implementation of the DDE and some of its tests. It is naive, totally unoptimized and not overly efficient, and that’s actually intentional. The idea is to use that initially, so later on I can test it against the real deal.

It is based on std::vector, and as such has two main drawbacks:

  • for short vectors, the idiomatic “3 pointer” layout plus the heap allocation it requires constitute significant overhead; it just doesn’t show up in the initial test loops, because Windows apparently preallocates some memory for each new running process, and it takes a number of allocations to chew through that
  • for long vectors, the automatic resizing policy tends to allocate a lot of excess memory, so while the pointer bookkeeping overhead is amortized as the vector grows, that excess memory degrades efficiency even further (see the snippet right below)
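
To make both drawbacks concrete, here is a tiny standalone snippet – the exact numbers are compiler and standard library dependent, so treat it as an illustration rather than a measurement:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // Drawback 1: the "3 pointer" head alone is 24 bytes on a typical
    // 64 bit implementation, before a single element is stored.
    std::printf("sizeof(std::vector<std::uint32_t>) = %zu\n",
                sizeof(std::vector<std::uint32_t>));

    // Drawback 2: the default growth policy overshoots. After pushing
    // 10 million elements one by one, capacity() usually ends up well above size().
    std::vector<std::uint32_t> v;
    for (std::uint32_t i = 0; i < 10'000'000; ++i)
        v.push_back(i);

    std::printf("size = %zu, capacity = %zu, excess = %zu elements\n",
                v.size(), v.capacity(), v.capacity() - v.size());
    return 0;
}
```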

Here is a chart of the access latency:

The x axis is the effective footprint, the y axis is access latency in nanoseconds. It is fairly flat until 8 MB, which is how much L3 cache my computer has, and which according to AIDA64 has around 10 ns of latency. The measured access latency is below 20 ns; there is some overhead due to cache pollution and the DDE access guards, but honestly it is not that bad.

As the footprint increases, access latency plateaus at about 65 ns, which is exactly as much overhead over my RAM latency as there was over the L3 cache latency – the penalty is constant and amortized, even when going into full cache miss mode.

Memory efficiency is quite low: this test peaks at 268435458 data items, split across around 27718037 data resources, averaging around 10 items per resource, with a measured storage efficiency of 58% and a quite significant overhead of 42%.

Not great, not terrible, perfectly usable.

Here is a Windows executable, on the off chance someone is keen on running random exes from the internet. This test doesn’t push very far, only about 2.5 GB in total, but it may be of value to see whether everything is as platform portable as it should be – the test will fail if the integrity checks do not pass. Also, it will be nice to see how it runs on a more modern system; I am running this on a friggin 3770K. The test shouldn’t take more than a minute, unless you are running something even slower.

umtest.zip (301.2 KB)

I’d appreciate it if those who decide to run it post at least the last few lines.

Next up I will investigate for low hanging fruit optimization-wise, before moving ahead with the next goal.

I messed up that last one a bit.

I forgot that I changed the formula for eff to actually show the overhead, but never changed the label.

So the overhead in the end of the test is not 42% but 58%, making it slightly worse.

What misled me was that I didn’t really expect an out-of-the-box vector solution to waste more memory than it actually uses. We do look up to C++ for efficiency, after all…
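
For clarity, since I managed to confuse myself with my own label, here is the relationship between the two numbers – simplified, the real test tracks the footprint through the allocator, this only shows the formula I’m using:

```cpp
#include <cstdio>

int main() {
    // Simplified version of the metric from the test output (numbers illustrative).
    const double logical_bytes  = 1.0;   // bytes of actual payload
    const double physical_bytes = 2.38;  // bytes actually consumed to store it

    const double efficiency = logical_bytes / physical_bytes; // ~0.42 -> 42%
    const double overhead   = 1.0 - efficiency;               // ~0.58 -> 58%

    std::printf("efficiency = %.0f%%, overhead = %.0f%%\n",
                efficiency * 100.0, overhead * 100.0);
    return 0;
}
```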

Witnessing the overhead of the reference implementation, I couldn’t resist pitting it against a custom proof of concept solution I wrote a couple of months back, which hadn’t been thoroughly tested until now. It seemed like a good opportunity to check the integrity of the latter against the reference implementation, since std::vector is, after all, a thoroughly tested industry standard. So I modified the APIs of both implementations so I could run them through the same test suite.

Since the custom solution proved to be reliable by producing data checksums identical to the reference’s, I decided to do a case study on the differences in their performance characteristics.

First things first – memory usage efficiency, which is by far my biggest priority:

I also expanded the test to push a bit further, up to 2^31 items, with a logical size of 8 GB.

Now that’s a hefty improvement right there: the reference implementation used 15.4 GB at the end of the test, whereas the custom one used only about 11 GB – the reference needs a good 40% more memory.

The linear chart also indicates that even though the reference implementation’s efficiency improves with scale, its memory usage still retains a steeper upward trend, making it an even worse candidate in the long run.

There is also a chart of the storage efficiency of both implementations over time:

Reinforcing the notion that the reference implementation only diverges more over time. As for the reason why, that will be explained in the summary.

But what about performance? What is the cost of this improvement? Well, here it is:

The test results (that’s nanoseconds on the y axis) indicate that both implementations start off with identical random access performance, and early on, std::vector shows a minor advantage. The tables are turned pretty fast, and the custom solution overtakes it and retains an advantage throughout the test.

The full sweep test shows both implementations trading blows, without any distinctive winner. A full sweep is generally not something that will be utilized very often, except when doing whole state dumps / snapshots.

Here is another chart to combine the differences in random access and memory efficiency:

The custom solution maintains its memory usage advantage throughout the test; for the design’s intents and purposes, I’d say it is 40% more efficient overall, with an anomaly that causes it to reach 60-70% early on.

As for access time, this chart does a better job of illustrating the difference between the two implementations, putting the short lived advantage of std::vector at up to 10%, followed by another anomaly that is massively in favor of the custom solution, after which the latter maintains a performance advantage of 5-10% for the duration of the test.

All in all, despite the increased complexity, the custom solution doesn’t suffer any performance degradation that matters.

For the sake of full disclosure, the entire test takes about 50% longer to complete for the custom solution, so the complexity increase price is ultimately paid, but it is a non-issue, because in the intended typical scenario, the items will be populated mostly by human interaction, so the cost will be completely amortized and hidden.

So, what are the reasons for the advantage of the custom solution? It has to do with the implementation strategy, and the fact that it is optimized for a very specific purpose, whereas std::vector is a general purpose, run of the mill container. And the jack of all trades is rarely a master of any.

The idiomatic vector is usually a “3 pointer implementation”: one pointer to where the data begins, one to the end of the used elements, and one to the end of the allocated capacity. That’s 3 pointers, or a total of 24 bytes on a 64 bit system.

It must allocate memory on the heap in order to actually store data, meaning that every access incurs another pointer indirection. And heap allocation is not very efficient for small chunks of data – keep in mind that a heap allocation must be suitably aligned for all primitive types, which usually means a 16 byte granularity, plus a header with metadata about the allocation.

The custom solution is different. It features an optimization for short bits of data – 1 or 2 elements – of which I expect to have plenty. The “body” of the resource features two 32 bit counters for size and capacity, plus another 64 bits of payload space. As long as the container holds only 1 or 2 elements, it uses that space to store the data locally, and it only overflows into the heap when the size grows larger, using the payload as a pointer to the data instead. The “head” of the custom container is only 16 bytes and can even store 2 elements in place, whereas std::vector’s is 24 bytes without the ability to store any local data – the vector head is 50% larger, and for short data the savings are far greater.
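
For the curious, here is a bare-bones sketch of that head layout – heavily simplified (no shrink logic, error handling or destructor hygiene), and not the actual implementation, but it shows where the 16 bytes go and how the 1-2 element case stays local:

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Simplified sketch of the custom container head - 16 bytes total.
// The real thing has more bookkeeping; this only illustrates the layout.
struct ShortVec32 {
    std::uint32_t size     = 0;
    std::uint32_t capacity = 2;
    union {
        std::uint32_t  local[2]; // payload space: holds 1 or 2 elements in place
        std::uint32_t* heap;     // ...or a pointer once the data overflows
    };

    std::uint32_t* data() { return capacity <= 2 ? local : heap; }

    void push_back(std::uint32_t v) {
        if (size == capacity) {  // grow to size + 1, no implicit over-provisioning
            auto* p = static_cast<std::uint32_t*>(std::malloc((size + 1) * sizeof(std::uint32_t)));
            std::memcpy(p, data(), size * sizeof(std::uint32_t));
            if (capacity > 2) std::free(heap);  // release the previous heap block, if any
            heap     = p;
            capacity = size + 1;
        }
        data()[size++] = v;
    }
};

static_assert(sizeof(ShortVec32) == 16, "head stays at 16 bytes");
```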

In light of this, the test results are beginning to make sense.

The short advantage the reference solution has in the beginning has to do with the very small data footprint – it fits entirely into the CPU cache, so at this stage the extra complexity of the custom solution causes it to slip behind a bit.

Also, because the custom solution is far more compact in its footprint, it takes another doubling of the data set before it spills out of the CPU cache, so it enjoys L3-cache levels of performance for longer. The reference solution spills out of the 8 megs of cache at 2^19 items, whereas the custom one does so at 2^20.

In the beginning of the test, the ratio of short data that the custom solution can store locally is very high; here are some more charts to illustrate that:

The top chart shows how the composition of data changes during the test, the x axis is the number of elements in each data resource, the y axis is the number of resources of each particular element count. Each color indicates a subsequent level up.

It is clear that over time, the ratio of short to long resources plummets. This is illustrated by the bottom left chart – we start with a 97% probability of a direct hit, which quickly drops as the data set is populated further.

The bottom right chart shows the ratio of probabilities for direct and indirect access over time. I seem to have accidentally deleted the x axis, but the two lines intersect at 2^21 elements.

Now back to the two anomalies in the performance difference chart. The memory advantage is due to the fact that the custom solution can store a sizable chunk of the data locally, so there are far fewer heap allocations. As the test progresses and the direct to indirect access ratio shifts, that advantage gradually diminishes.

The roughly 40% constant memory advantage of the custom solution is actually mostly due to the default growth policy of std::vector, which preallocates an increasing number of elements in proportion to its current size. The custom solution only preallocates explicitly; on single item inserts it resizes to size + 1. Note that the OS memory manager will not necessarily reallocate – if the current chunk happens to have spare capacity, which it almost always does, it is reused – it is just typically more modest and in a much narrower range than the resizing policy of std::vector.

This is also confirmed by the fact that the custom solution’s test run takes 50% more time, which again is not an issue, but it is proof that there is a considerably higher number of reallocations.

I haven’t tried it out, but simple logic dictates you could get most of that advantage with std::vector if you simply reserve size + 1 on every insert, or find some middle ground by tweaking the resizing policy. Of course, with the vector there will always be an additional level of indirection, not to mention the 50% larger head, so it may approach, but never match, the efficiency of the custom solution.
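
A quick illustration of the idea – this is just the raw std::vector behavior on a toy loop, not the actual workload, and the exact capacities are implementation dependent:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const std::uint32_t n = 100'000;  // kept modest: the "tight" loop below copies on every insert

    // Default growth policy: capacity overshoots size.
    std::vector<std::uint32_t> a;
    for (std::uint32_t i = 0; i < n; ++i)
        a.push_back(i);

    // "Tight" policy: reserve size + 1 before every insert. On common
    // implementations capacity then tracks size exactly, at the cost of
    // reallocating and copying the whole buffer on every push.
    std::vector<std::uint32_t> b;
    for (std::uint32_t i = 0; i < n; ++i) {
        b.reserve(b.size() + 1);
        b.push_back(i);
    }

    std::printf("default: size=%zu capacity=%zu\n", a.size(), a.capacity());
    std::printf("tight:   size=%zu capacity=%zu\n", b.size(), b.capacity());
    return 0;
}
```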

The reasons for the access time anomaly also become clear – 2^18 to 2^22 is the sweet spot, where the data composition is ideal for maximizing the design advantage of the custom solution: data is far more localized and compact, which is great for CPU cache friendliness, and it goes to show that a bit of data-oriented design can go a long way in performance critical scenarios. Unsurprisingly, the point where this advantage diminishes is 2^21 – exactly the point at which the probabilities for direct vs indirect hits intersect.

The lingering 5-10% access time advantage of the custom solution is due to its smaller head (remember, the vector’s is 50% larger) and the presence of locally stored data in the set. The small penalty of the increased complexity is completely mitigated and even reversed into a small gain.

Note that this test is a bit unfair to the custom solution: it sees many of the short resources being expanded into the heap, which will not be the case for the actual object model it is going to service, where a considerable number of resources will stay 1 or 2 items long in perpetuity – hence the respective optimization.

Also, in the actual production code, only user created objects will result in resizing inserts, so the insertion overhead will be completely hidden by the many orders of magnitude slower user interaction – that’s why the custom solution doesn’t automatically over-provision anything beyond what the OS memory allocator happens to provide implicitly. The bulk of the data population that the design anticipates will be able to leverage explicit preallocation.

Just to explicitly clarify my stance – std::vector doesn’t “suck” in any sense of the term; again, it is simply a general purpose solution, and as such it is inevitably inferior to a purpose specific solution when used for that particular task.

I think I will keep both solutions inside the project for now, just in case, but the way to go is definitely the custom solution, as std::vector struggles to break 50% efficiency and doesn’t really bring anything needed to the table. As mentioned in the OP, I also plan on creating pooled storage later on, so eventually storage efficiency will improve well beyond even the ~75% that the custom solution currently offers.

Now, moving onto the data model.

It’s been a while since my last update, I was a bit busy having fun at the pool.

Nah, JK, I just had my third bout with c19, so I was out of commission for a few days.

Meanwhile I did actually write the aforementioned memory pool, and it is great:

This chart shows the raw memory efficiency of the C++ std::vector solution in blue, the custom implementation in red, and the new pool implementation in green. I also changed the data composition scheme a bit so the numbers are a little different from the previous test runs.

What makes an immediate impression is that the pool solution is rather inefficient up to about 256 MB of data. This is because it allocates memory in chunks of 64 MB, and initially it has a lot of idle capacity relative to the actual data set. Naturally, this is user configurable, and the pool does well even with far more modest preallocation; 64 MB just seems like a balanced value for a typical workhorse system. I also ran tests with far greater chunk sizes, and it made very little difference – memory copy operations are quite fast nowadays, so there isn’t that much value in spreading them out, but it is still an option.

But then it shoots up by a decent margin, convincingly beating the custom data implementation and almost doubling the storage efficiency of the standard vector.

At the end of the test, the standard vector implementation uses almost 12 gigs of RAM, while the pool uses a mere 6.5 gigs. 5.5 gigs of RAM saved… why not?

But what about performance? The custom solution did take some hits in the actual book keeping in order to save memory, so what does this further improvement cost?

With the new data composition settings, the standard vector actually does a bit to redeem itself – now that there aren’t that many data resources sized 1 or 2 elements for the custom solution to store in place, the vector manages to pull ahead of the latter, thanks to its simpler driving logic.

The pool implementation, however, doesn’t sacrifice any performance for efficiency. The improved data locality allows it to remain faster than either of the two previous implementations throughout the duration of the test.

What’s more, last time the custom solution took most of its toll not in terms of access performance, but in terms of populating the data: unlike the standard vector, it didn’t do any preallocation whatsoever, so despite being a tad faster to access data, it ended up with a significantly longer total run time for the test.

The pool solution doesn’t use any preallocation on an individual data resource basis, so even though it does preallocate memory at the pool level, every data insert causes the target data chunk to be reallocated within the pool, so there is quite a lot of bookkeeping and moving of memory around. Here is a test run time comparison:


Even so, the pool is still faster than the standard vector, hinting at the severe cost of heap memory allocation – bookkeeping logic plus memory moving on every insert turns out significantly faster than simpler logic plus far fewer, but heap-backed, reallocations, as indicated by the total runtime. The custom solution is still the slowest one in this metric, but the pool doesn’t have to make any sacrifices in order to attain even greater memory efficiency.

The pool is better in my key target metrics, but it doesn’t happen by miracle - it does have a couple of limitations:

  • currently it maxes out at 16 gigs of data total for 32 bit value storage – at least in theory, provided the system is able to provide a contiguous memory fragment of that size, which is the physical limit. This is not that bad, considering its intended use, and of course it is possible to use an arbitrary number of pools. There are strategies to address that limitation, but I don’t think they will be necessary for the time being; still, it remains an open option

  • certain excessive usage patterns can lead to high fragmentation. I can also live with that, because those usage patterns are not something the intended use is likely to encounter, mainly because populating that much data through user interaction would take many, many years of use. The pool automatically defragments each time it is saved, and comes back pristine when loaded, so it takes nothing short of deliberate, high frequency, gross abuse in a single session to run into fragmentation issues. In the current test scenario, even though every fragment is reallocated on every insert, the implementation does a good job of tracking and recycling those fragments. This process can be refined further down the line, when there is enough actual user data to analyze

Last but not least, unlike the previous custom solution, which was tailored specifically for 32 bit values, the pool is a proper polymorphic template. Aside from integers, it can be used for any sort of variable size collection of user classes with proper move semantics implemented, or even as a generic substitute for heap allocation, where it does significantly better than standard malloc / new. That comes in quite handy and allows keeping off dependencies on third-party “above standard” memory allocators, as long as one can live within the bounds of 32 bit item addressing per pool. IMO a limit of ~4.3 billion entries per pool is plenty, and there really isn’t any need to bloat item IDs to 64 bit, as that would double the bookkeeping overhead. Additionally, the pool has an auxiliary component, which can be used directly for pools of fixed size elements – see the rough sketch below.
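
To give an idea of the shape of that fixed size auxiliary component, here is a minimal sketch of a pool with 32 bit handles and slot recycling – purely illustrative, the actual Umium pool is chunk based and considerably more involved:

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch only, not the actual Umium pool API: a fixed size
// element pool with 32 bit handles and a free list. Handles stay valid
// even when the underlying storage grows and moves, unlike raw pointers.
template <typename T>
class FixedPoolSketch {
public:
    using Handle = std::uint32_t;          // ~4.3 billion entries per pool is plenty

    Handle acquire() {
        if (!free_.empty()) {              // recycle a previously released slot
            Handle h = free_.back();
            free_.pop_back();
            return h;
        }
        items_.emplace_back();             // otherwise grow the pool
        return static_cast<Handle>(items_.size() - 1);
    }

    void release(Handle h) { free_.push_back(h); }

    T&       operator[](Handle h)       { return items_[h]; }
    const T& operator[](Handle h) const { return items_[h]; }

private:
    std::vector<T>      items_;            // simplified: one growable region
    std::vector<Handle> free_;             // recycled slots, reused before growing
};

// Usage sketch:
//   FixedPoolSketch<std::uint64_t> pool;
//   auto h = pool.acquire();
//   pool[h] = 42;
//   pool.release(h);
```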

Meanwhile, before I got sick I also managed to complete the base of the object model, although there isn’t much to show for it right now. My next update will be when I have some GUI to actually inspect and modify the object tree.

Wow, 7 months… it’s been that long. I got sidetracked into writing benchmarks and case studies, and then some other stuff, so I couldn’t actually present something Devember-worthy. Who knows, maybe this year will cut it. I’ve been dealing with other things recently, but last week I was able to resume work, and there are lots of changes and actual progress.

First of all, it is now “Umium” – I will not go into the etymology right now. I just figured that having it sound like “ummm” wasn’t serious enough, although I will keep using “um” as a shorthand. I also got this experimental logo:


Now onto the technical changes.

I’ve revamped the data core a bit – it is no longer fixed grain. Individual memory allocations are no longer limited to 16 GB; they are still limited to 2^32 logical objects, but the overall size now depends on the object size. Due to its design, Umium can actually address up to the full 64 bit address space in a single allocation (2^32 objects times a sufficiently large object size spans the whole 2^64 bytes), even though the hardware limit on current CPUs is 48 bits for all addressable memory. It is future-ready!

The object model and runtime perform great: during stress testing, I am able to allocate objects at a rate of 10+ million per second in a single thread. The engine is still single threaded, but it has been designed so that it can be upgraded seamlessly when the time is right to add the complexity of concurrent access.

There’s also finally some GUI – the so-called root matrix. It is an early-stage programming and debugging tool; basically a matrix of every defined type and each of its instances. Pushing notifications to the GUI does slow things down a bit, but it is still able to allocate around 1 million items per second – not too shabby. The actual GUI responsiveness is good too.

Here I am casually scrolling through about 10 gigabytes worth of data in 1 billion items. We can introspect every item, with its individual fields and their raw byte values. The values are arbitrary because the objects are only allocated, not initialized to anything, so they show whatever content was left over from when the memory was last used.

What’s curious about Umium is that it doesn’t even come with a set of primitives. Yes, it obviously has stuff under the hood, but on first run it presents nothing but a clean slate. When the root matrix is first shown, it is completely empty – an implementation requirement mandated by the design goal that Umium is (for all intents and purposes) entirely programmed within itself. We are yet to create even the most fundamental data types and instructions, which the runtime will eventually be shooting blind into the logic core where the magic happens.

Before we get to the automatic programming, we need to manually program the foundation. The root matrix is kinda like “kernel level” – it is not really intended to be used much in the future, it is just there to provide the higher user abstractions. One would only want to work at that level with the intent to extend or provide new “language” features, or even an entirely new language for that matter. To put it even shorter – it is the language design level, as opposed to the “product” design level. As a result, it is also pretty barebones – there are no label identifiers or any of that stuff.

But before I proceed to lay the foundations of the actual language, I need to complete the raw data view – another auxiliary view, showing raw memory allocations in real time, needed for debugging purposes.

Next come the primitives, and at long last – the “actual” / default workspace view, the logical tree view. Once those are implemented, I expect Umium to be at a stage where it is self-definition ready, and functional and stable enough to begin building up the higher-level functionality.

Also, now that the internal core design is finalized, things are finally moving at a decent pace, and most importantly – I actually have some time to dedicate – so expect faster progress and more frequent updates.