[Devember2020] Brawl AI - cockfights but for AI. A way to capture better AI for computer games?

Player1 · December 23, 2020, 11:33pm

TL;DR: A minimalistic Java Spring service requires almost 500MB of memory to even start up. An equivalent Golang program with the same functionality happily runs in less than 50MB of memory.

The longer story:
I ran into an interesting out-of-memory problem. The kubernetes cluster has two nodes with 2GB of memory each, which seems ample for a few small services. So, 4GB in total across the cluster.

Originally two of the services written with Java Spring ran in redundancy for better uptime and accessibility in case one of the cluster nodes went down. They were even given a lower memory request value and a higher memory limit to absorb any sudden peak in traffic. I had hoped that not all services would climb to the upper memory limit at the same time, so both nodes ended up overprovisioned.

Since the beginning (in November) I had noticed that some of the requests to the API kept failing occasionally and the website felt buggy. It was working “well enough” for me not to bother with it immediately, and it could wait until I got some other tasks done, but then at some point the API failed completely. One of the nodes had totally run out of memory and the other node did not have enough spare memory to launch any more pods.

It turns out that the optimised Java Spring docker builds (using their builder) require a minimum of about 410MB of memory to even begin starting. To the Spring team’s (Pivotal’s) credit, the builder does advertise the minimum amount of memory that it requires. It isn’t 100% accurate, as assigning just the exact minimum prevented my services from starting. These services were allocated 512MB of memory as the upper limit. I had also tested some time ago that they require a minimum of about 270MB at run time. The old numbers were foolishly taken when the service was run outside the docker container, and some memory was probably shared with the kernel etc. Anyway, the numbers were wrong by a long shot!

There are 5 Spring services and they consume a combined 2.5GB of the total 4GB of available memory to even start up. In practice there is only just over 3GB of memory available for pods after the kubernetes services and the OS (the kernel and system programs) have taken their share.
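A rough sketch of the arithmetic above, written in Python purely for illustration. It assumes the 2.5GB figure corresponds to five services at their 512MB limit and treats "just over 3GB" as 3072MB; the variable names are made up for this example.

```python
# Rough memory budget for the cluster, using rounded figures from the post.
# All names here are illustrative, not taken from the real setup.
NODE_COUNT = 2
NODE_MEMORY_MB = 2048        # 2GB per node
USABLE_FOR_PODS_MB = 3072    # "just over 3GB" left after the OS and kubernetes
SPRING_SERVICES = 5
SPRING_LIMIT_MB = 512        # upper memory limit per Spring service


def spring_total_mb(services: int, per_service_mb: int) -> int:
    """Memory claimed if every Spring service reaches its limit."""
    return services * per_service_mb


total = spring_total_mb(SPRING_SERVICES, SPRING_LIMIT_MB)
headroom = USABLE_FOR_PODS_MB - total
print(f"Spring services: {total}MB, left for everything else: {headroom}MB")
```

Under these assumptions only about 512MB remains for the database, the Golang services and any redundancy.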

At this point there are also a number of Golang programs in the cluster, including the database backup service and GeoIP database service. They each run in less than 50MB of memory. The difference here could be that the Golang programs are based on an empty image (scratch) and contain only the single executable file, whereas the Java ones need a JVM image as the base, which typically also includes an OS such as Debian.

Linode also sent an email a few days ago kindly telling me that I’ve blown over 75% of the promotional credit. I’m currently consuming $40 per month and I have been looking to downscale that number significantly. But it turns out that downscaling the nodes is not possible with the Java Spring services. The plan was to eventually replace the Java Spring services with Golang services, but this rewrite task might have just skipped ahead in the priority queue.

Now the new task list seems like:

Finish the CI/CD (I’m currently integrating Drone CI!)
Rewrite Java Spring programs with Golang.
Downscale, including the provisioned space for the database. I initially allocated a 100GB disk in case there would be a huge influx of projects, but as it stands the database is storing less than 1MB after over a month. I’m probably reducing it to a 10GB disk which means a manual data copy. Disk allocations can only be increased in Linode.
Then resume everything else on the original TODO list!

A little sad note / cost evaluation:
Kubernetes is fun and makes a lot of things very simple and straightforward, such as adding services and complexity to your existing project, and rolling out new services. Keeping a service accessible has never been easier… but a single monolith service for a small website like this would only cost $5 a month vs $40 for the kubernetes cluster. Is the fun worth the money?

Current cost: $10 for the mandatory node balancer (Linode’s name for load balancer), 2x $10 for the smallest kubernetes node (2GB of RAM each) and 100GB persistent disk space at $10 a month. While I haven’t had to pay for this yet, thanks to the promotion, this will start to come out next month.
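The cost breakdown above, as a small Python sketch. The amounts are the ones itemised in the post; the dictionary keys are just labels chosen for this example.

```python
# Monthly cost of the cluster as itemised in the post (USD).
cluster_costs = {
    "node balancer": 10,
    "kubernetes nodes (2 x $10)": 2 * 10,
    "100GB persistent disk": 10,
}
MONOLITH_COST = 5  # single small server, as estimated in the post

cluster_total = sum(cluster_costs.values())
print(f"cluster: ${cluster_total}/month, monolith: ${MONOLITH_COST}/month")
print(f"ratio: {cluster_total / MONOLITH_COST:.0f}x")
```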

Player1 · January 6, 2021, 1:42pm

Reviewing resource usage in the efforts of scaling down:
Based on the previous post, which of the following do you think are Java Spring services and which are Golang?:

$ kubectl top pods
NAME                                  CPU(cores)   MEMORY(bytes)   
gateway-deployment-7d99cff64d-kkrm7   1m           178Mi           
geoip-deployment-7868c7f876-d4k8x     0m           6Mi             
login-deployment-5d6794cb88-xzszh     1m           204Mi           
nethook-slave-pod                     1m           6Mi             
open-deployment-c6b94d94-qrc8j        1m           196Mi           
postgres                              1m           83Mi            
referee-deployment-57c8d999dc-4gpkd   1m           210Mi           
secure-deployment-5f87ffcd85-qmpjj    1m           201Mi

The JVM does scale back down to below 270MB, which I had previously calculated. But it takes over 400MB to boot up the service (the oversight that got me into trouble previously). Running many JVMs requires a lot of memory. Each one comes with an almost 200MB tax and requires more than 400MB to boot. Sure, in large multi-gigabyte applications it makes almost no difference, but when running small, a single Java Spring Boot application takes as much memory as about 30 Golang services.

Here are the nodes:

$ kubectl top nodes
NAME                          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
lke-xxxx                      280m         28%    1531Mi          80%       
lke-yyyy                      262m         26%    1451Mi          76%

When comparing the nodes and the pods, the pods in the default namespace (listed above) claim a total of 1084Mi of memory and the pods in other namespaces (such as kube-system) claim an additional 348Mi. The total memory used by the pods over the two nodes is about half of the total memory that is being used: 1427Mi vs 2982Mi.
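The totals can be re-derived from the two `kubectl top` listings. The sketch below is in Python only for convenience and uses shortened pod names. Summing the listed readings gives 1432Mi rather than the 1427Mi quoted, which may simply mean the readings were taken at slightly different moments (an assumption); either way the pods account for a little under half of the memory in use.

```python
# Memory readings (Mi) copied from the `kubectl top` output in the post.
default_pods = {
    "gateway": 178, "geoip": 6, "login": 204, "nethook-slave": 6,
    "open": 196, "postgres": 83, "referee": 210, "secure": 201,
}
OTHER_NAMESPACES_MI = 348  # kube-system and friends, as stated in the post
nodes = {"lke-xxxx": 1531, "lke-yyyy": 1451}

default_total = sum(default_pods.values())
pods_total = default_total + OTHER_NAMESPACES_MI
nodes_total = sum(nodes.values())
share = pods_total / nodes_total
print(f"default namespace: {default_total}Mi")
print(f"all pods: {pods_total}Mi of {nodes_total}Mi in use ({share:.0%})")
```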

The OS and the automatic kubernetes installation in Linode take quite a big chunk of the available memory. There is also not enough room in the two 2GB nodes to run two copies of the services due to the JVM’s boot-up memory requirements. If no two pods started at the same time, and each waited long enough for the garbage collector to reclaim the excess memory from the pod that had just launched before the next one started, then there would be enough room for duplicates. This is the kind of fine tuning that can obviously be avoided by running nodes with more memory, or adding another node to the cluster (recommended for money making production clusters).
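A simplified model of why staggered start-up would fit while simultaneous start-up does not. The figures are rounded from the post (about 410MB to boot, about 200MB once settled, just over 3GB for pods). The model ignores per-node placement and the non-JVM pods, so treat it as a sketch rather than a capacity plan.

```python
# Peak memory when starting JVM pods all at once versus one at a time.
# Figures are rounded from the post; the model itself is a simplification.
BOOT_MB = 410      # approximate memory a Spring pod needs while starting
STEADY_MB = 200    # approximate memory after the JVM has settled
BUDGET_MB = 3072   # "just over 3GB" available for pods across both nodes


def peak_simultaneous(pods: int) -> int:
    """Every pod is booting at the same moment."""
    return pods * BOOT_MB


def peak_staggered(pods: int) -> int:
    """One pod boots while all the earlier ones have already settled."""
    if pods == 0:
        return 0
    return (pods - 1) * STEADY_MB + BOOT_MB


pods = 5 * 2  # five services, two copies each
for label, peak in (("simultaneous", peak_simultaneous(pods)),
                    ("staggered", peak_staggered(pods))):
    verdict = "fits" if peak <= BUDGET_MB else "does not fit"
    print(f"{label}: {peak}MB, {verdict} in {BUDGET_MB}MB")
```

With ten JVM pods the simultaneous peak is 4100MB while the staggered peak is 2210MB.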

The point of running two nodes was to provide redundancy by running two copies of the key services in case one node would crash and have to reboot.

The downscale plan:
Rewriting the 5 Java Spring Boot services in Golang would allow all the services to run in less than 100MB of memory, combined, before serving requests and having any load. The largest single request that I could think of is a single upload from the secondary on-prem cluster which can be a maximum of about 20MB, but the uploads are always sequential and never parallel. Therefore, it’s not a big deal. A typical request would only be a few kilobytes. Having some hundreds of megabytes spare for small spikes should be plenty. I might need to do a load test on the new system once it is up and running!

What about running a single node kubernetes “cluster”, like Minikube or equivalent on a 1GB server, where each microservice takes about 6MB of memory after the Golang rewrite? That wouldn’t be any less redundant than the current approach and will still have a single point of failure. Not great for a reliable production server, but would be very affordable (see the end of the previous post) for a hobby project like this. A fully automated way to restart the production server and to restore all the data would minimise downtime when the production server goes down. So, this is the new plan. It will also allow me to keep using kubernetes without having to move to a monolith! Horizontal scaling should be much easier if the server gets a traffic spike.

Revised task list:
The Drone-CI has been dropped due to it being proprietary and running into limitations with the credentials management (external stores are only supported on their enterprise plan). Also, you must compile and build your own drone containers if you do not wish to use their enterprise version. So, moving to an actually open-source CI system: Concourse-CI.

Re-write the memory hungry Java Spring Boot apps in Golang.
Terraform to automatically start a single node kubernetes cluster and automatically download the database data from a backup. (The production server’s database is already backed up/streamed in almost real-time with less than 1 second delay.)
If the linode goes down, automatically apply the above step and update the DNS records.
Automate all of this in Concourse-CI (the restarting of production server may not end up in concourse, but we will see).
Concourse doesn’t support running workers on ARM architecture. The ARM support would be nice for testing my on-prem cluster, which means that I may need to add this feature.
Then resume everything else on the original TODO list!

Player1 · January 19, 2021, 6:12pm

2nd Place!?! Come on! Yeaaah!

Ok, now that the screaming is over and my throat hurts. There is some very well deserved feedback on the announcement video:

First of all, apologies to @Wendell for the use of the term cockfights. It is a terrible name and a horrible sport. It is kind of what it felt like when I was developing this and when I considered what A.I. might feel like if A.I. had feelings. Ever since I put the website up, I’ve had an apology for the A.I. in the comments of the front page’s HTML. Probably doesn’t mean much when it comes to it. (Also noted: will never use that term again to describe this project.) There are definitely no roosters involved at all, the game is in XCOM style. Here is an example match (it ends with the red team giving up after running out of grenades):
https://brawl.ai/results?ap=Shooter-Improved-v6&au=Player1&dp=Grenadiers-v2&du=Player1
This forum thread is just a wall of text without pictures or videos. To fix that, I’ve created a short clip of the following fight: https://brawl.ai/results?ap=Test25&au=asbjorn&dp=Shooter-Improved-v7&du=Player1
You can play back the exact same brawl in your browser in 3D, both desktop and mobile should work. In my opinion games should really allow us to replay all matches (including let’s plays) in full 3D capabilities. The short clip:

Ok, how do you get to the point of actually participating? I’ve created another longer clip. Here is the process of writing a bot and getting feedback / debug output:

I’m planning on writing an example code/program for anyone to use who prefers to start from a complete example. The example code will likely not rank well and will need improvements to climb the leaderboards, but it will be enough to get there.

Note: I had to make a new youtube account to post the videos (they were too large for the forums). It was literally created today. Does this now make me a youtuber? Also, the about page now mentions the 2nd place Devember 2020 award: https://brawl.ai/about

Player1 · January 30, 2021, 10:51pm

A small update: CI/CD, downscaling and moving from Java to Golang are done!

Ok, the kubernetes metrics-server is lying in this setup. The micro services are not taking 0Mi of memory and postgres is not 10Mi. I decided to add a lot of swap!

$ kubectl top pods
NAME                                  CPU(cores)   MEMORY(bytes)   
external-dns-7b5d874f95-9687k         1m           13Mi            
geoip-deployment-7d8cbbf65d-k6lq8     0m           0Mi             
login-deployment-55f568fb77-x28h7     1m           0Mi             
open-deployment-5586f87579-c4tmq      0m           0Mi             
postgres                              1m           10Mi            
referee-deployment-659f5d9b56-xwzk7   1m           3Mi             
secure-deployment-5875c6969f-7tvsw    0m           1Mi

And there isn’t really this much spare memory either:

$ kubectl top nodes
NAME           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
api.brawl.ai   281m         28%    715Mi           80%

This one tells a truer story:
[Image: resource-use]

The official minimum amount of memory recommended for Kubernetes is 2GB per node. Then naturally I decided to squeeze the worker and master nodes into a shared 1GB of memory.

Without having anything else running on the system, the k0s master and worker nodes can fit into 1GB of memory. Just. But then once you add anything to it, including a load balancer and other services, it immediately goes over. That’s where the swap comes into play.

To help illustrate what I ended up installing with the k0s: I made a little system diagram. It shows all the important parts of the system.

The metrics server is lying because I’m using quite a bit of swap. I figure that many of the pods don’t really need to access all of their allocated memory all the time such as cert-bot and external-dns. They are there to react to events, but large parts of them are just swapped out of the main memory. I’ve done some very primitive manual testing and the app feels as responsive as before. ReactJS attempts to hide a lot of the lag.

The Golang processes take between 6 and 12MB of memory and can quickly be brought back from the swap when requested. A lot of the Postgres is on the swap too.

So, it almost fits in the 1GB of memory, but not quite… but this is now $5/month vs the previous setup of $40/month. However, and a BIG however, this is a pain in the rear end to set up. The Linode Kubernetes setup was just a couple of clicks, but this was about 1 week’s worth of figuring all the configs out after work. And k0s is much quicker to set up than kubeadm!

All in all, I’m rather happy with the setup and how kubernetes with all the neat tools that it brings can be used in very inexpensive setups (with some manual work!). This isn’t very resilient, but I can bring the whole (single-node) cluster up with almost just a single script. I currently have to do it in two steps as the first step also runs updates and reboots, and I haven’t yet included the reboot “polling” into the script. A test environment is quick to set up as I can just bring more clusters for testing with (almost) a single script. I’m writing more comprehensive integration tests where this is done automatically.

For anyone interested, here are some of the top swap users (and their swap use). Note: the “starter” programs are the Golang services. I should have really picked better names for the binary files. They are addressed in the manifests by their Docker image names, so it doesn’t matter for that, but for printouts like these it is difficult to tell.

$ for file in /proc/*/status ; do awk '/VmSwap|Name/{printf "%s %s", $2, $3}END{ print ""}' "$file"; done | sort -k 2 -n -r | less
kube-apiserver 322344 kB
etcd 38036 kB
cainjector 24152 kB
k0s 22248 kB
konnectivity-se 10408 kB
kube-controller 9940 kB
containerd 9216 kB
controller 7648 kB
k0s-v0.10.0-bet 7492 kB
systemd-journal 6944 kB
traefik 6476 kB
calico-node 6256 kB
calico-node 4728 kB
starter 3812 kB
kube-scheduler 3748 kB
metrics-server 3592 kB
external-dns 3520 kB
containerd-shim 3368 kB
kube-proxy 3344 kB
haveged 3156 kB
controller 3124 kB
postgres 3116 kB
containerd-shim 3040 kB
kubelet 3012 kB
calico-node 2988 kB
kube-controller 2812 kB
webhook 2796 kB
coredns 2668 kB
postgres 2640 kB
starter 2404 kB
postgres 2396 kB
postgres 2144 kB
rsyslogd 2004 kB
speaker 2000 kB
postgres 1884 kB
containerd-shim 1856 kB
starter 1852 kB
postgres 1800 kB
postgres 1768 kB
postgres 1708 kB
postgres 1708 kB
containerd-shim 1648 kB
postgres 1640 kB
containerd-shim 1592 kB
postgres 1540 kB
containerd-shim 1540 kB
containerd-shim 1480 kB
containerd-shim 1444 kB
containerd-shim 1432 kB
containerd-shim 1428 kB
containerd-shim 1420 kB
containerd-shim 1392 kB
containerd-shim 1388 kB
containerd-shim 1360 kB
containerd-shim 1348 kB
containerd-shim 1328 kB
containerd-shim 1324 kB
containerd-shim 1320 kB
systemd 1272 kB
containerd-shim 1256 kB
proxy-agent 1168 kB
...
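Since several entries above share a name ("starter", "postgres", "containerd-shim"), it can help to sum swap per process name. The following is a hedged Python sketch of that idea, not the command used for the listing above; it assumes a Linux `/proc` filesystem, and the function names are invented for this example.

```python
# Sum swap usage per process name from /proc/<pid>/status (Linux only).
import glob
from collections import Counter


def parse_status(text: str):
    """Return (name, swap_kb) from the contents of a /proc/<pid>/status file."""
    name, swap_kb = "", 0
    for line in text.splitlines():
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("VmSwap:"):
            swap_kb = int(line.split()[1])  # the value is reported in kB
    return name, swap_kb


def swap_by_name(pattern: str = "/proc/[0-9]*/status") -> Counter:
    """Add up VmSwap for all processes that share the same name."""
    totals = Counter()
    for path in glob.glob(pattern):
        try:
            with open(path) as handle:
                name, swap_kb = parse_status(handle.read())
        except OSError:
            continue  # the process exited while we were reading
        totals[name] += swap_kb
    return totals


if __name__ == "__main__":
    for name, swap_kb in swap_by_name().most_common(20):
        print(f"{name} {swap_kb} kB")
```

Kernel threads have no VmSwap line, so they are counted as 0 kB here.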

Player1 · January 31, 2021, 7:58pm

It is now literally the end of January and Devember 2020 is over. Is there a requirement to stop posting here?

This is a project that I will continue to develop even further. Is it OK with everyone if I continue to post any meaningful updates here? Or must I find a new community for this project from somewhere else? I would like to continue to post meaningful updates here as this is the place where this project first saw the light of day and Devember 2020 kickstarted the whole thing.

If I don’t hear any objections, then I assume everyone is OK with more posts going forward.

My TODO list currently looks like this (not necessarily in order):

I want to address some of the usability feedback that I’ve received and make the interface friendlier and more intuitive.
The above might include an ability to choose an example working bot from a template to get started. I believe it would enable more people to get quicker onto the leaderboards.
Then add the email password recovery option (still missing). I hate email spam, so the plan is to only use it for password recoveries initially. Maybe there could be an opt-in feature to be notified if your top ranking bot has lost its rank to someone else, or some other similar features. This will be totally opt-in only.
An .onion service! I’d really like to include a service that provides TOR access. It might be a bit of a squeeze for the already overflowing RAM situation. (Maybe just add more swap!?!)
I want to write more bots too for the leaderboards!
Graphics need a serious upgrade! Should the units look like humans or robots? Or something else? Currently they are just cylinders. I haven’t made up my mind about these yet.
Then start thinking about what could be included in the next “season” (a content patch!) i.e. maybe more items, weapons, unit types. I called them seasons as rebalancing and adding more options will likely break existing bots or at least the leaderboards will no longer be accurate. I figured the leaderboards will need to be reset, but those who got to the top on previous “seasons” should not be forgotten and I’ll need to make some badges such as: “Ranked 1st on Season 1”. I’d also like to keep the ability to replay old matches from previous seasons.

TheCakeIsNaOH · February 1, 2021, 3:56pm

Absolutely, that would be great.

You can keep posting to this thread, or a new thread would be fine as well.

Player1 · February 2, 2021, 12:18am

Splendid!

Player1 · April 6, 2021, 10:24pm

A small update!

There are now two working bots (examples) to choose from when you start a new project.

I decided to write a couple of examples to help anyone get started if they were feeling stuck. One is more complex than the other. They aim to be simple and easy to expand on. After talking to a couple of people about writing the bots, I found out that if you’ve never actually written a bot, then it can be quite daunting to get started. Hopefully this lowers the barrier!

If there are any ways I could improve the example bots by making them even simpler to understand, please let me know. The same goes if there is an even simpler bot that could be made.

On an even smaller side note: I’ve updated all the packages, libraries and software to the latest version. It seems k0s is being developed quite fast and things tend to change quite drastically. It is also getting simpler to set up! And it ran for 63 days without crashing or needing a reboot on just 1GB of memory! (I only rebooted it during the upgrade.)

I’m looking into next:

Improving the warnings / error messages in the UI
Cleaning docker related messages that sometimes appear in the bot’s output log after an update / cluster-node reboot.
Adding a read-only mode to the UI when browsing the source-code of an old bot that cannot be modified. Currently the source-code could be edited in the UI, but not saved to the server. This behaviour is not clear on the UI, which needs improvement (i.e. the read-only mode).

… and then I’ll get back to the old TODO list.

Player1 · April 18, 2021, 11:02pm

A small usability / security update.

The editor no longer “resets the view” when changing files within the project (this was very annoying!). It now remembers your view and undo / redo stacks for each file while you have the project open.
Added the read-only mode for read-only projects (where they have already been launched) and added some notes to highlight what is happening.
Improved some warnings / error messages in the UI.
Removed the FiraCode font and went with Roboto Mono. Fewer ligatures for those who are unfamiliar with them.
The email task from the TODO list is done! Accounts can now be recovered via email (i.e. forgotten passwords). Emails can now be added to or removed from the account. For security, there is a 7-day delay when deleting an email from an account, but this delay can be skipped by entering a code sent to that address, if you still have access to it. You also get a warning at your email address if its removal from your account has been requested.
Oh, and passwords can now also be changed / updated.
Added a contact email (and a slight bot-obfuscation) on the about page.
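The email removal rule from the list above could be sketched as follows. This is an illustration in Python, not the site's actual code: the function name is hypothetical, and whether the boundary is inclusive at exactly 7 days is an assumption.

```python
# When may an email address be removed from an account?
# Rule from the post: 7-day delay, skipped if the owner confirms with a code.
from datetime import datetime, timedelta

REMOVAL_DELAY = timedelta(days=7)


def can_remove_email(requested_at: datetime, now: datetime,
                     code_confirmed: bool) -> bool:
    """True once the delay has passed or the emailed code was confirmed."""
    if code_confirmed:
        return True
    return now - requested_at >= REMOVAL_DELAY


requested = datetime(2021, 4, 18, 12, 0)
print(can_remove_email(requested, requested + timedelta(days=3), False))  # False
print(can_remove_email(requested, requested + timedelta(days=3), True))   # True
print(can_remove_email(requested, requested + timedelta(days=7), False))  # True
```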

While this was not a very exciting update, it now allows focusing on the other more exciting tasks on the TODO list.

What should be next? Opening up Blender and improving graphics? Or TOR for the .onion address and ultimate anonymity?

Player1 · April 11, 2022, 11:30pm

It has been about a year since the last update. There was also a warning about this before posting. I apologise if I’m breaking rules here!

I don’t have any glorious new features, but a lot has happened in the background.

In summary: the brawl.ai website now runs on Cloudflare Pages + Workers, and Oracle Cloud. Currently the monthly cost is $0. The only expense is the registrar fee for the domain name with the .ai TLD.

I’ve detailed some of the migration/conversion process for those who may be interested. Before starting, here is a short list of the changes in the tech behind the scenes over the lifetime of the brawl.ai website:

API:

DB: MongoDB → PostgreSQL → Cloudflare KV
Language: Java Spring Boot → Golang → JavaScript Workers
Engine: Google Kubernetes Engine (GKE) → Linode Kubernetes Engine (LKE) → k0s → microk8s → k3s → k0s → Cloudflare Workers (serverless)
Auth: OAuth2 → JWT → sessions
CI: scripts → Concourse CI → GitHub Actions
CD: scripts → Terraform → Ansible → ArgoCD → GitHub Actions
Email: Sendgrid → Postmark

Workers / Game host:

Raspberry Pi cluster → Oracle Cloud

The webpage hasn’t changed, but here is its tech stack if interested:

reactjs + monaco (VSCode) + threejs + fontawesome + flag-icon-css + fonts

Email:
Sendgrid closed my account as I wasn’t sending enough emails. Once I had a month or two without a single email being sent from my account (nobody reset their password), that was too long for them to keep my account open. Also, they didn’t notify me that they had closed my account. I was running tests after upgrading Kubernetes and found out that their API just happily responds with 200 for an email delivery request even if your account is closed, but without actually delivering the requested email. This is also when I found out that my account had been closed, and the reason for it.

I contacted Postmark and told them what I’m using emails for and what happened with Sendgrid and they were happy to have me.

Kubernetes:
I’ve come to the conclusion that managed Kubernetes (k8s) is nice. Very nice! I’ve tried k0s, microk8s and k3s. Here is a summary comparison between them and the issues I ran into.

Both k0s and k3s ran reliably, but microk8s automatically updates itself as part of the snap package system (part of modern Ubuntu). It managed to somehow eat all the memory on the server after some automatic restarts and failed to start up again. I had to purge snap (the package manager itself) by removing all its files from the system before I could get microk8s to start up again. It wasn’t enough for me to restart or even reinstall just the microk8s.

Upgrading k0s has rarely been pain-free. Almost every time I made the effort to upgrade it (every couple of months), something broke. I used to run External-DNS… for the scalability and just for being able to declare domain names and automatically get the correct DNS settings for automatic certs. This is entirely pointless for a project of this magnitude! For example, setting up some DNS records would have been the least painful part in scaling this cluster from a single node to multiple nodes under load.

I used Cert-Manager for the certs. However, getting Traefik and MetalLB to work nicely with External-DNS and Cert-Manager requires careful research and experimentation to find the correct version numbers for both the software and the YAML configuration files. This includes scouring through source-code and reported issues for all these components. On every upgrade, some YAML configs had to be rewritten and often in a new layout as something was always broken between versions.

Additionally, I had to refresh the k8s ingress config every 3 months as Cert-Manager just didn’t do that by itself for some reason. This meant that the ingress controller didn’t stop using the old certificates by itself and start using the new certificates from Let’s Encrypt, and obviously new requests served with the old expired certificates would fail. Not very automatic…

I ended up dropping both External-DNS and Cert-Manager after a while as Traefik could just get the certs by itself, or I thought so. After the service had been happily running for about two weeks, the log files started getting filled with these:

Jan 08 16:23:46 api.brawl.ai k0s[449]: time="2022-01-08 16:23:46" level=info msg="E0108 16:23:46.968824 469 cacher.go:420] cacher (*unstructured.Unstructured): unexpected ListAndWatch error: failed to list cert-manager.io/v1beta1, Kind=Certificate: conversion webhook for cert-manager.io/v1, Kind=Certificate failed: Post \"https://cert-manager-webhook.cert-manager.svc:443/convert?timeout=30s\": service \"cert-manager-webhook\" not found; reinitializing..." component=kube-apiserver

Jan 08 16:23:46 api.brawl.ai k0s[449]: time="2022-01-08 16:23:46" level=info msg="E0108 16:23:46.963754 469 cacher.go:420] cacher (*unstructured.Unstructured): unexpected ListAndWatch error: failed to list acme.cert-manager.io/v1alpha3, Kind=Order: conversion webhook for acme.cert-manager.io/v1, Kind=Order failed: Post \"https://cert-manager-webhook.cert-manager.svc:443/convert?timeout=30s\": service \"cert-manager-webhook\" not found; reinitializing..." component=kube-apiserver

Jan 08 16:23:46 api.brawl.ai k0s[449]: time="2022-01-08 16:23:46" level=info msg="E0108 16:23:46.076516 469 cacher.go:420] cacher (*unstructured.Unstructured): unexpected ListAndWatch error: failed to list cert-manager.io/v1beta1, Kind=CertificateRequest: conversion webhook for cert-manager.io/v1, Kind=CertificateRequest failed: Post \"https://cert-manager-webhook.cert-manager.svc:443/convert?timeout=30s\": service \"cert-manager-webhook\" not found; reinitializing..." component=kube-apiserver

Jan 08 16:23:46 api.brawl.ai k0s[449]: time="2022-01-08 16:23:46" level=info msg="E0108 16:23:46.074678 469 cacher.go:420] cacher (*unstructured.Unstructured): unexpected ListAndWatch error: failed to list cert-manager.io/v1alpha2, Kind=CertificateRequest: conversion webhook for cert-manager.io/v1, Kind=CertificateRequest failed: Post \"https://cert-manager-webhook.cert-manager.svc:443/convert?timeout=30s\": service \"cert-manager-webhook\" not found; reinitializing..." component=kube-apiserver

Yeah, that is probably a misconfiguration on my part, but it was happy for two weeks and then started spitting these! The certificates worked too, but it just spammed too much into the log files.

Traefik would write so much to the log files that the disk space would fill up. That was close to 20GB of log files on the 25GB nanode (at Linode). Kubernetes will start killing pods once the system starts to run out of resources; for disk space this happens when less than 10% is available.
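To put numbers on that threshold, here is a small Python sketch using the sizes from the post, rounded to decimal megabytes. The 10% figure is the one quoted above; the helper names are made up.

```python
# How close do 20GB of logs bring a 25GB disk to the eviction threshold?
DISK_MB = 25_000
LOGS_MB = 20_000
EVICTION_PERCENT = 10  # pods get evicted when less than this is free


def min_free_mb(disk_mb: int, percent: int) -> int:
    """Free space below which eviction starts."""
    return disk_mb * percent // 100


def evicting(disk_mb: int, used_mb: int, percent: int) -> bool:
    """True when the free space has dropped below the threshold."""
    return disk_mb - used_mb < min_free_mb(disk_mb, percent)


room_left = DISK_MB - LOGS_MB - min_free_mb(DISK_MB, EVICTION_PERCENT)
print(f"space left for OS, images and data before eviction: {room_left}MB")
```

So with the logs alone at 20GB, everything else on the disk has to fit in roughly 2.5GB before pods start getting killed.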

I ran the latest k0s, latest Traefik and the problem still persisted. I could roll-back, but this also wasn’t the first time I ran into problems. I attempted to debug this a couple of times, but the errors didn’t show up until the service had run for about two weeks. I gave up in the end. This wasn’t worth it.

After spending a significant amount of my free time on just fixing upgrades and on maintenance, version after version, I decided it was time to look for a different platform. Managed Kubernetes, like the ones from Linode, Google, Amazon or elsewhere, could be an answer, but they also cost a lot more money (for a good reason, and I’m starting to think it is well worth it when you must use Kubernetes).

Why not k3s? I run k3s on my NAS and I’m a happy user. It is by far the most polished one out of the three with the least amount of maintenance. Upgrades have been painless and things just work for the most part. Except you cannot get k3s to forward the real client IP address [1] [2], which I’d use to determine the country and flag for each user. This isn’t a big deal for my NAS/home server, but it is for the brawl.ai website. It is the CNI k3s has chosen that prevents this. k8s is plug-and-play (like LEGO), and it is possible to change the CNI, but then you add this to the maintenance burden when you upgrade from version to version. This feels like going back to k0s.

If I continued with self-hosting and managing Kubernetes, I would not have enough free time to also work on improving brawl.ai. In the end, I decided to go as far managed as possible, without breaking the bank. I looked at many different serverless platforms, and the response times of Cloudflare Workers were what won me over. It was time for another rewrite. I merged all the microservices into a monolith JavaScript application and turned the Postgres queries into a “query-less” key-value format.

Cloudflare Workers:
The hardest part was to figure out how to store the data in a way that minimises read and write operations and doesn’t produce conflicts in an eventually consistent platform. If I wanted to stay on the free tier, I couldn’t use Durable Objects.

There are a couple of key points to CF Workers KV store that let you use it without the Durable Objects. One is that if you just wrote data to the KV, then it is immediately consistent if you read it from that exact same location (data centre/edge node). It can take up to a minute to propagate elsewhere, but it is consistent for the same client at the same location, immediately. I found out that while it isn’t mentioned anywhere, the list operation isn’t consistent immediately, but the put and get operations are. For example, if I created a new bot and then immediately listed all the bots that I had by the KV list operation, it would not necessarily find my new bot. This makes caching difficult and breaks the user interface. Even if you refresh the page, it may still show you old data, but not always.

The solution ended up being to separate users’ data into their own key-value pairs. There are no shared tables to query. Each user’s data is stored in its own document or value. For example, when a user requests data, their session information is behind one key, which needs to be requested first to verify the username in the request. Then another key is used, based on the request, to get the actual data that is being requested. All data is aimed to be as pre-processed as possible to reduce cross-referencing. Ideally, a single key-value read or write would fulfil the entire request.
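A toy model of that two-read pattern, with a Python dict standing in for the KV store. The real service is a JavaScript Worker; the key formats ("session:…", "user:…:bots") and the idea of keeping a user's bot list in a single value are assumptions made for this sketch.

```python
# Toy model of the per-user key layout, with a dict standing in for the KV store.
# Key formats ("session:...", "user:...") are invented for this example.
import json

kv = {}


def put(key: str, value) -> None:
    """Store a JSON-serialised value, as a KV store would hold strings."""
    kv[key] = json.dumps(value)


def get(key: str):
    """Return the decoded value, or None when the key does not exist."""
    raw = kv.get(key)
    return None if raw is None else json.loads(raw)


def list_bots(session_token: str):
    """Two reads: the session resolves the user, then one key holds the answer."""
    session = get(f"session:{session_token}")
    if session is None:
        return None  # unknown or expired session
    return get(f"user:{session['username']}:bots") or []


put("session:abc123", {"username": "Player1"})
put("user:Player1:bots", ["Shooter-Improved-v7", "Grenadiers-v2"])
print(list_bots("abc123"))
```

Keeping the list inside one value means the request never depends on the list operation, which is the one that is not immediately consistent.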

All users also write only to their own keys, to avoid two users racing on the same key (remember: the latest write wins). This is harder for the worker that actually runs the users’ programs and produces results. The worker also writes to its own files only. Without Durable Objects, I’m limited to just a single worker, but with the current amount of traffic, that is still plenty. When a user submits a bot for the brawl or for testing, the submission writes a copy of that bot for the worker to find and marks the user’s own copy as read-only. The worker gets notified immediately. Once the worker is finished, it modifies its own copy only, and only once. The user’s read-only copy remains untouched. When the user next requests information about its read-only copy, the API checks whether the worker has updated its version. If there is an update on the worker’s copy, it means the worker is now finished, so the user’s read-only copy can be updated from the worker’s copy and the worker’s copy can be deleted. This only works because there is a single update (flag) on the worker’s copy.
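The hand-off above can be sketched roughly as follows. This is an illustration under assumptions, not the actual brawl.ai code: the key names, the `readOnly`/`finished` fields, and the function names are all hypothetical, and a plain `Map` stands in for the KV store (the real KV calls are asynchronous).

```javascript
// Hypothetical key names: user-side copies under "user:", worker-side copies under "worker:".
const userCopyKey = (user, bot) => `user:${user}:bot:${bot}`;
const workerCopyKey = (user, bot) => `worker:${user}:bot:${bot}`;

// User side: submitting locks the user's copy and leaves a fresh copy for the worker.
function submit(store, user, bot, code) {
  store.set(userCopyKey(user, bot), { code, readOnly: true, result: null });
  store.set(workerCopyKey(user, bot), { code, finished: false, result: null });
}

// Worker side: the single update the worker ever makes, and only to its own copy.
function finish(store, user, bot, result) {
  const key = workerCopyKey(user, bot);
  const copy = store.get(key);
  if (!copy || copy.finished) return false; // unknown job, or already written once
  store.set(key, { ...copy, finished: true, result });
  return true;
}

// User side, on read: fold a finished worker copy back in, then delete it.
function readBot(store, user, bot) {
  const mine = store.get(userCopyKey(user, bot));
  if (!mine) return null;
  const theirs = store.get(workerCopyKey(user, bot));
  if (mine.readOnly && theirs && theirs.finished) {
    const merged = { ...mine, readOnly: false, result: theirs.result };
    store.set(userCopyKey(user, bot), merged);
    store.delete(workerCopyKey(user, bot));
    return merged;
  }
  return mine; // worker not done yet (or nothing pending)
}
```

At any moment only one side is allowed to write a given key: the user side creates and later deletes the worker's copy, and the worker touches it exactly once in between, so "latest write wins" never has two competing writers.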

What about backups? The list operation doesn’t return values with keys, only keys, and reading the value for each key is rather expensive and slow. However, each key can have metadata associated with it, which is returned with the list operation. So, you can get the keys and their metadata with the list operation. To find out which keys’ values need reading, I wrapped the KV write operation so that it updates the metadata on each write. More specifically, each write adds a “writeTime” value, which is very similar to what all common filesystems do. The backup software can check from the metadata’s “writeTime” whether a key’s value has changed and can then query that key to back it up automatically. What about all the keys that I don’t want to back up, such as session data or other temporary data? You can list by prefix, and all permanent/backup keys have the “permanent:” keyword as a prefix. This filters out any temporary data.
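A minimal sketch of this backup scheme, assuming `kv` is a Workers KV namespace binding. The `put(key, value, { metadata })` and `list({ prefix, cursor })` calls and the `keys`/`list_complete`/`cursor` fields are the KV API; the function names and the `lastBackupTime` bookkeeping are assumptions for illustration.

```javascript
// Wrap put so that every write records its time in the key's metadata.
async function putWithWriteTime(kv, key, value, now = Date.now()) {
  await kv.put(key, value, { metadata: { writeTime: now } });
}

// Pure helper: from one page of list() results, pick the key names whose
// writeTime is newer than the last backup. Keys without metadata are skipped
// here, on the assumption that every permanent key goes through the wrapper.
function keysChangedSince(listedKeys, lastBackupTime) {
  return listedKeys
    .filter((k) => k.metadata && k.metadata.writeTime > lastBackupTime)
    .map((k) => k.name);
}

// Walk all "permanent:" keys page by page and collect those that need reading.
async function findKeysToBackUp(kv, lastBackupTime) {
  const changed = [];
  let cursor;
  do {
    const options = cursor ? { prefix: "permanent:", cursor } : { prefix: "permanent:" };
    const page = await kv.list(options);
    changed.push(...keysChangedSince(page.keys, lastBackupTime));
    cursor = page.list_complete ? undefined : page.cursor;
  } while (cursor);
  return changed;
}
```

Only the keys returned by `findKeysToBackUp` need a `get`, so an unchanged store costs just the list calls.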

The implementation is still not perfect, and there are still some bugs with the UI that I haven’t managed to solve yet. For example, if two browser windows are open, and both are logged in, they can go out of sync. Adding a project or launching a project in one window doesn’t necessarily appear in the other window, even when refreshed. I’m guessing that this could be due to the two browsers hitting different endpoints or edge servers at Cloudflare. Sometimes even in a single browser window, a refresh doesn’t get the latest data from the server. So, I still have some bugs to solve…

Workers / Game host:
I recently listened to “EP 86: The LinkedIn Incident” from Darknet Diaries, and it convinced me that I wanted the self-hosted brawl.ai workers, which run untrusted user-submitted code, out of my home network.

I searched for an affordable solution to host the worker software. Ideally it would run in a serverless environment and on demand. This would allow it to scale when needed and not accrue expenses when it was not needed. As far as I can tell, none of the serverless offerings out there can meet the odd requirements of the brawl.ai website/project. For example: there are either one or two untrusted applications running, which need to be isolated and resource limited (including no network access), but they are also controlled by another program which talks to them via STDIN, STDOUT and STDERR.

The target platform would need to be a multi-core node/computer/VPS, ideally with a consistent performance profile for each core rather than the shared-core types that vary significantly in performance over time. This would give both applications an equal amount of processing power (for fairness) and allow limiting each one to its own core (or set of cores) to prevent one from affecting the other (for fairness again).

Something like a Linode/VPS instance that is automatically started and automatically destroyed could work. Cloudflare Workers are limited to HTTP requests/fetches and cannot control instances via SSH. This rules out Ansible and Terraform. However, there are REST APIs and StackScripts. While I was working on this, I stumbled across Oracle Cloud’s free-forever tier: a 4-core arm64 instance with 24GB of memory. Perfect. I’ll run that one forever, and it does all the worker stuff now.

In my previous implementation of the workers, I was annoyed with the delays after submitting code for a bot: it could take up to a minute for the workers to even pick it up and start processing it. I had my original worker cluster in my home network, behind a firewall. I didn’t want to open up a port for it, so I left it polling. It would poll once a minute, 1,440 times a day. For the Cloudflare Workers free tier this was too frequent and had to be reduced. I made some changes after merging the whole cluster code together for the single VPS: it has an open port and gets notified immediately when code is submitted. It also polls every 10 minutes (144 times a day) in case the initial notification failed and it missed something. It is now much more responsive than before!
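The polling numbers, and the "notify first, poll as a safety net" arrangement, can be sketched like this. The function names are hypothetical, and `checkForWork` stands for whatever the worker does to fetch pending submissions from the API.

```javascript
// Polls per day for a given polling interval in minutes.
const pollsPerDay = (intervalMinutes) => (24 * 60) / intervalMinutes;

// Hypothetical worker-side scheduler: run checkForWork as soon as a
// notification arrives, and also on a slow timer in case one was missed.
function startWorker(checkForWork, intervalMinutes = 10) {
  const timer = setInterval(checkForWork, intervalMinutes * 60 * 1000);
  return {
    notify: () => checkForWork(), // called by the HTTP handler behind the open port
    stop: () => clearInterval(timer),
  };
}

console.log(pollsPerDay(1));  // 1440
console.log(pollsPerDay(10)); // 144
```

Going from a 1-minute to a 10-minute fallback cuts the background requests against the API tenfold, while the notification path keeps the pick-up delay short.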

But, now the API… doesn’t update reliably (see the previous section)! Aaargh! The worker now responds within seconds, but the API doesn’t show the results immediately. I’ll get to the bottom of this sooner or later. For now, you win some and you lose some.

Web UI:
Not that much has changed.

One user requested an ability to drag and drop files to the UI from a home computer. I added this capability last summer.

Another user asked why a failed project/bot could not be edited and had to be duplicated instead. I’ve changed this too. Now, once a bot has failed the initial test, it can be edited and resubmitted without making any duplicates.

Visual Studio Code can now run in a browser. The brawl.ai website uses VSCode’s text-editor part (called Monaco) and I’ve been looking into adding the rest of the VSCode capability if possible. The editor hosted by Microsoft can even run a Python language server in the browser version (for auto-completion and navigation). However, the web version of VSCode is closed-source and the language server that runs in the browser (Pylance) is also closed-source. The odds are not in my favour.

Now that brawl.ai isn’t so memory and space limited when running inside Oracle Cloud, I can start looking into adding other language options for the bots. The language choices probably need to be platform/hardware architecture agnostic, in case I need to migrate to x86 or some other architecture in the future.

Visitor counts
After moving the site from GitHub Pages to Cloudflare Pages, I got to view some analytics. In February 2022 (the first full month on Cloudflare Pages), the webpage had the following number of requests according to Cloudflare:

United States: 5,977
Czech Republic: 3,470
United Kingdom: 2,433
Australia: 1,292
Other: 4,101

That’s 17,273 in a month. Unexpected!

In March 2022, the following traffic was reported:

United States: 3,442
United Kingdom: 2,935
Czech Republic: 2,027
Poland: 1,930
Other: 6,671

Totalling 17,005 for that month.

At this point I suspect that these are mainly bots and crawlers on the website, but are there actually that many crawlers? Cloudflare thinks that there are about 500 unique visitors per month. That’s all for the static pages portion of the site. The API gets just over 5,000 requests a month.

The main site is a single-page application (SPA), built with React. Webpack splits the JavaScript portion of the website into around 327 individual JavaScript files. That’s the whole site. So, if each of those 500 unique visitors browsed every part of the site once (not counting CSS or fonts), that would be 163,500 requests. The request count is probably this high due to the multitude of files.
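As a quick check of that arithmetic, using only the figures quoted above (the variable names are just for illustration):

```javascript
const uniqueVisitors = 500; // Cloudflare's monthly estimate
const jsFiles = 327;        // JavaScript files produced by Webpack

// Upper bound: every visitor loads every JavaScript file once.
const worstCase = uniqueVisitors * jsFiles;

// March 2022 requests by country, as listed above.
const march = [3442, 2935, 2027, 1930, 6671].reduce((a, b) => a + b, 0);
const perVisitor = march / uniqueVisitors;

console.log(worstCase);  // 163500
console.log(march);      // 17005
console.log(perVisitor); // 34.01
```

So the observed traffic works out to roughly 34 requests per unique visitor, about a tenth of the full bundle, which is at least consistent with visitors loading only part of a heavily split site.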

Cloudflare also blocked 4 attacks in March, and 1 attack in February. March attacks came from the US, and the February attack from Japan. I don’t know more about these incidents.

Going forward:
Hopefully all this effort will actually end up reducing the maintenance required for this website. Note: the maintenance (or updating k0s/microk8s/k3s) wouldn’t really be that much if I were doing this full-time, but for a hobby project that I work on in my free time, it eats up way too much.

Moving the API to Cloudflare Workers probably means that the .onion address (Tor) is postponed quite a bit.

I find it incredible that I can run the whole site now with zero monthly cost (apart from the domain name registration cost). The whole site should also be a lot snappier for everyone around the world, as it should always be available at the closest location. There is a cold-start, but after the first query things should fly!

Final words
I obviously didn’t explain every change in the list at the beginning of the post. A lot has happened in a year. A lot of decisions and thoughts. If you’d like to know more about anything specific, just let me know and I’m happy to dig deeper.

I’m not going to leave another TODO list, I’ll just let you know when noteworthy things are in!