ThatGuyB's rants

I don’t know what I was doing wrong, but it seems like forwarding works in bind9 now. I tried the same settings multiple times, but it only worked after setting dnssec-validation to no, then back to auto. Before that I was getting broken chain of trust messages.
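
For reference, the forwarding part of my named.conf options block ended up looking roughly like this (the forwarder IPs are placeholders for whatever upstream you use):

options {
        // forward anything we are not authoritative for to the upstream resolvers
        forwarders {
                9.9.9.9;
                1.1.1.1;
        };
        forward only;
        // the setting I had to flip to no and back before the chain of trust rebuilt
        dnssec-validation auto;
};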

Kinda crazy how long a DNS query can take sometimes. Typically it’s just 200ms, but sometimes I get this:

time ping level1techs.com -c 1
PING level1techs.com (172.67.73.46) 56(84) bytes of data.
64 bytes from 172.67.73.46 (172.67.73.46): icmp_seq=1 ttl=56 time=201 ms

--- level1techs.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 201.485/201.485/201.485/0.000 ms

real    0m2.238s
user    0m0.002s
sys     0m0.004s

ping itself took about 200ms, meaning roughly 2 seconds went to the DNS query and response. Repeating the test, the lookup only adds about 0.007s, which is what you should expect.

time ping level1techs.com -c 1
PING level1techs.com (172.67.73.46) 56(84) bytes of data.
64 bytes from 172.67.73.46 (172.67.73.46): icmp_seq=1 ttl=56 time=255 ms

--- level1techs.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 254.551/254.551/254.551/0.000 ms

real    0m0.262s
user    0m0.002s
sys     0m0.004s

Just look at this! Random all over the place!

time getent ahostsv4 level1techs.com
real    0m0.011s
user    0m0.011s
sys     0m0.000s

time getent ahostsv4 craigslist.org
real    0m0.946s
user    0m0.007s
sys     0m0.005s

time getent ahostsv4 youtube.com
real    0m0.554s
user    0m0.005s
sys     0m0.006s

I don’t blame bind, just my connection, but this can contribute to a slow internet experience. I remember when our main DNS failed and everything felt slow and we couldn’t figure out why. All hosts were querying ns1, timing out because of no response, then querying ns2. Apparently one of the scripts we were using to reload the zones broke and left the config unworkable (I don’t remember what line we had to remove, but it was something minor - if only we had used named-checkconf before applying the changes…).
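
For posterity, a pre-flight check like this before reloading would have caught it (the zone name and file path are made up for the example):

named-checkconf /etc/named.conf \
    && named-checkzone example.com /var/named/example.com.zone \
    && rndc reload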

Anyway, I got a working config inside a DNS container, now I need to apply it on my router. I’ll probably end up making one or two DNS containers anyway, so I can have some redundancy (and use keepalived to fail over automatically and avoid the slowness I mentioned earlier if the main DNS dies).
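
The keepalived part should just be a VRRP instance on each DNS container, with clients pointed at the virtual IP instead of ns1/ns2 directly; something like this, where the interface name and addresses are placeholders:

vrrp_instance DNS_VIP {
    state MASTER                # BACKUP on the second container
    interface eth0
    virtual_router_id 53
    priority 100                # lower on the backup
    advert_int 1
    virtual_ipaddress {
        10.0.0.53/24
    }
}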

1 Like

I was trying to learn more about iSCSI and how I could implement it in my infrastructure. My general idea was that I could use iSCSI as just a block device and pass it to a VM, but it seems like Proxmox is incapable of that. Or maybe I’m just a brainlet (nothing new, lol) and I can’t figure out how.

I was hoping maybe there was a way to tell a VM “use this device for your OS and you manage it” and whenever I wanted to live migrate a VM, tell the hypervisor “pause the VM, transfer data from RAM to another host, detach iSCSI” and on the other host “attach iSCSI, resume VM activity with the RAM contents you received.”

This is in fact how I was doing it, but with NFS. I had an NFS share configured at the datacenter level in Proxmox, all VMs had qcow2 disks, and when I live migrated I didn’t have to copy the disk contents to another location (like you would with LVM/-Thin or local ZFS); I just copied the RAM contents and the disk was already there, mounted in the same directory, waiting for the VM to be resumed.
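
For reference, the whole setup boils down to something like this (storage name, NAS address, export path, VM ID and target node are all placeholders; treat it as a sketch from memory):

# add the NFS share as shared storage at the datacenter level
pvesm add nfs vmstore --server 10.0.0.10 --export /tank/vmstore --content images
# live migrate; only the RAM gets copied, the qcow2 stays on the share
qm migrate 101 pve2 --online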

I was really hoping I could have something similar with iSCSI, because it would make management a bit easier. With iSCSI targets on a zvol, I can just snapshot the ZFS volume for each VM and send the data over to another pool or another NAS. But with NFS, I would have to either snapshot the entire share containing all the qcow2 disks (or raw disks maybe, since I wouldn’t be using qcow2 for anything because ZFS takes care of snapshotting), or create an NFS share for each VM, which would be nuts.
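
That per-VM granularity is the appealing part: with one zvol per VM behind an iSCSI target, backing a single VM up is just something like this (pool and dataset names are made up):

zfs snapshot tank/vms/vm101-disk0@nightly
zfs send -i tank/vms/vm101-disk0@lastnight tank/vms/vm101-disk0@nightly \
    | ssh backup-nas zfs recv -F backup/vms/vm101-disk0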

I wouldn’t really blame proxmox for this; in all fairness, I’m struggling to do this with LXD too, which is the reason I tried it on Proxmox. The idea was the same: have an iSCSI target mounted on the host and have each container use its own target. It is technically doable, but in LXD, the concept of pools is what boggles my mind. I have to create a pool for each container, which is dumb. And then, for each pool, I have to use the dir driver, which makes things overly complicated.

NFS, while having its downsides, is still the easiest and most straightforward way of sharing the VM / container with other hosts and easily live migrating them. From a storage resources perspective, it really makes way more sense than having 2 different VMs / containers and using things like keepalived or corosync + pacemaker. The only thing coming close to it is diskless-booting VMs with readonly access to the rootfs and basically their own /var and /etc locations. You still waste some space that way, unlike just moving a VM over, but it depends on the scenario and requirements; sometimes having another VM always up makes more sense than doing HA (especially if you already have another way to load-balance them and not just have an inactive standby VM).

Back to iSCSI, I found this thread on proxmox forums.

Multiple ways work:
- adding a disk in PVE (host), create LVM on top and use for multiple VMs
- using iSCSI to have the backing device for on VM
- adding iSCSI inside of the VM, completely bypassing PVE

The broken English is a bit hard to grasp; I can only guess that the second option was meant to be exactly what I am trying to do. But I found no way of doing that. When I add iSCSI in datacenter storage, if I check “use LUN directly,” then Proxmox treats the target as its own storage, i.e. it can format it with LVM and use it for VMs.

If I don’t check that, I’m still unable to add it to VMs, neither when creating the VMs, nor when they are already created.

I wonder if Ceph has a better handler and whether it can do what I’m thinking of. Well, if I were to run ceph anyway, I’d probably set up the hosts with local storage and take a shot at a hyperconverged infrastructure. I think Ceph RBD might be that?


For now, it seems like if I want to be able to take snapshots for each VM or container separately, I need to use NFS. For LXD, I need to have a pool for each container, which makes things a bit annoying. Well, now that I think about it, technically if I used iSCSI, I would’ve had to create the target, then add the initiator names in the host conf. Not that different from having to create a new zfs mount-point, set the nfs share value on it, add the entry in fstab or something on the lxd host, then create a new pool using the dir driver.

For iSCSI, I’d also have to format the disk and mount it - unless… I think I’ve got an idea… kinda hacky, but doable. In either case, whether I create an iSCSI target, format the disk on the host and mount it, or just mount an NFS share, I can do it without having to create multiple pools in LXD. I just need to create the container, stop it, move its data to another directory, mount either the iSCSI disk or the NFS share into the target path, then move the data back onto the freshly mounted point. Ugh, I don’t like this, but it’s what I might end up doing anyway.
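
Roughly what I have in mind, assuming a non-snap LXD with a plain dir pool (all the paths and names below are made up, and this is untested):

lxc launch images:debian/12 ct1
lxc stop ct1
# stash the rootfs lxd created
mv /var/lib/lxd/storage-pools/default/containers/ct1 /root/ct1-orig
mkdir /var/lib/lxd/storage-pools/default/containers/ct1
# mount the per-container share (nfs here, an iscsi-backed fs would work the same way)
mount -t nfs nas:/tank/ct/ct1 /var/lib/lxd/storage-pools/default/containers/ct1
# move the original contents onto the new backing storage
cp -a /root/ct1-orig/. /var/lib/lxd/storage-pools/default/containers/ct1/
lxc start ct1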

I probably want to experiment with this in VMs before I go live with it. Thankfully, making a VM template and cloning via zfs should be easy. Now’s one time where I wish I had a low-power hypervisor, like a NUC, another odroid h3/+ or a zimaboard. I wonder how qemu/kvm works on aarch64 nowadays, since the lxd --vm option still doesn’t properly work on aarch64. Or maybe I can get away with lxd vms on my odroid h3+ itself; I have the RAM and probably the CPU for it anyway, just need to install qemu… I was always curious how lxd vms compare to firecracker in terms of resource consumption. Maybe I’ll take some time to install opennebula in a VM and run firecracker on it, just to see how it goes.

1 Like

Seems like that’s also the case on x86_64, at least on void. Despite me installing qemu and qemu-user-static, neither of these allows lxd vms to run.

Instance type "virtual-machine" is not supported on this server: QEMU command not available for CPU architecture

My websearch-fu is not strong tonight, I can’t find the reason this doesn’t work.

1 Like

According to the debian wiki, VMs in LXD require the package qemu-system-${ARCH} (x86_64 / aarch64).
https://wiki.debian.org/LXD
I do have it installed, because I have the package “qemu,” but it still doesn’t work on my system. Not to worry though, because I’m soon going to test firecracker.

Yep, basically zfs send / receive a snapshot of a zvol to another zvol and make the VM template in virt-manager. I’d say it’s simpler than even proxmox (because you don’t have to go through as many steps; most things are pre-selected sensibly in virt-manager). But you still need a quick one-liner for whatever you’re cloning, in any hypervisor:

rm -f /etc/machine-id && dbus-uuidgen --ensure=/etc/machine-id
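
The zfs side of the cloning, for reference, is just a snapshot plus a local send/receive into a new zvol (dataset names are made up):

zfs snapshot rpool/vms/tmpl-alma9@base
zfs send rpool/vms/tmpl-alma9@base | zfs recv rpool/vms/alma9-test1
# then point the new virt-manager VM at /dev/zvol/rpool/vms/alma9-test1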

I have 3 alma vms, planning to make a cluster of some sort and test firecracker (probably in opennebula), then wipe it out and test lxd vms. If neither is better than qemu, or if firecracker requires too much work to get working (I think it does need a kernel module or something), then I’ll just drop them (well, the ideal is to only use containers, because of hardware limitations, especially RAM). I still think resource usage should be lower for both lxd vms and firecracker compared to qemu (at least the latter), but not all of the container OSes are also available as VM options in lxd.

1 Like

Not that it really mattered, since they’re containers (the memory is shared with the system; even if you put a limit on it, it doesn’t allocate all of it to the container), but I decreased the maximum allocated RAM on the squid container to 512MB and on the yt-dlp container to 256MB. They barely use 14 and 31MB respectively after being rebooted to apply the changes (although the changes might apply live, not quite sure, but I had to reboot squid anyway, because it had nesting enabled by default, which it didn’t need).
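
On proxmox that part is just a one-liner per container (the container IDs here are made up):

pct set 101 --memory 512    # squid
pct set 102 --memory 256    # yt-dlp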

The OS overhead is basically negligible, but I might still have a hard time running all the programs I’m planning to, due to RAM limitations. Proxmox says it’s using 5GB of RAM (out of 8), despite only having 2 containers running and nothing else (free says 977MB is cached, meaning proxmox is actually using about 3.5GB, which is nuts - uptime is 8 days).

Thankfully I won’t be using proxmox (not even sure why I’m seeing such high consumption, it shouldn’t be using more than 2GB of RAM, really), but I’ll be limited to 4GB of RAM on my n2+. But then again, most containers should probably be fine with anywhere between 256 and 512MB, with maybe a DB container requiring 1GB. I should still be able to run anywhere between 5 and 8 containers on the n2+. I didn’t think I’d need a homelab upgrade so soon, but then again, I was planning to buy more sbcs to play with after the initial lab was set up (which I obviously never got around to).

Well, the initial plan was to have some kind of load balance and HA (when half the cluster goes down, to have the other side take care of the load and max out the system until the other side comes back up and containers or services can be moved back there).

I might need to look into newer SBCs, but I really don’t look forward to more SBC adventures. I’d love to get a quartzpro64, but that thing won’t be available anytime soon (not to mention the lack of support from any distro at this point - I’m struggling enough with the hc4 as it is). I don’t even want to touch stuff like the Rock Pi 5 or Orange Pi 5 without first knowing what they use to boot (I think the rocks have a weird bios-like thing anyway). But the Rock 5 and Orange Pi 5 seem to be the only ones on the market with 16GB of RAM. Then again, the quartz64 with 8GB of RAM is pretty cheap… I don’t want to hear about any bootloaders other than u-boot and petitboot. I found a single reference online that the orange pi 5 might be using u-boot (which would be nice).

1 Like

Hi! I am gonna go back through the backlog of this thread. I have been watching and reading a lot of linux and lab stuff lately.

2 Likes

This is just my crazy talk place. Use the second comment to navigate to the usable materials.

2 Likes

Did you give this a try? IIUC there isn’t really any snapshotting or disk data transferring (only RAM is sent over the network); qemu has an rbd backend that Proxmox just happens to configure.

How many hosts do you have?

1 Like

My homelab consists of 1 proxmox host, 1 libvirt host (with a proxmox VM if I ever want to test stuff locally), 1 LXD host, 1 NAS (with iSCSI and NFS) with 2 zfs pools and since yesterday, my own PC also on libvirt.

I don’t have the resources to deploy a ceph cluster, I might try it under virtualization, just to see a theoretical setup and how it would work.

In proxmox, with an NFS backend and qcow2 vdisks, there is no snapshotting happening either. Just the RAM contents get moved over to another host when live migrating and the disks stay in place, as all the hosts can access the NFS backend concurrently, but they won’t touch a vdisk unless they are the one running the VM.

I believe with a ceph block-device backend, the same would apply here. You’d export the vdisk in ceph and when a live migration happens, you don’t have to snapshot, send the contents over, then change the disk location on the hypervisor, it will all just be seamlessly pointing to the same location.

The problem I’m dealing with is that iSCSI doesn’t seem to have the same feature? Which really makes me wonder why I can’t figure it out. The procedure should be just as straightforward. Unlike NFS, iSCSI can only be mounted on one host at a time, so it’s understandable that you’d first need to disconnect the target, then connect to it from the other host, then resume the VM operations.

1 Like

Successfully deployed OpenNebula, but as mentioned in “what are you working on today” thread, seems like you can’t have more than one type of hypervisor per server, i.e. you can’t have kvm, lxc/lxd and firecracker all on one host, you need different hosts for each.

The yum packages opennebula-node-* are set to conflict with one another. I can see why (in a datacenter you probably care more about grouping compute by type), but this kinda makes it pretty useless for homelab setups, unless you’re just going with one type of hypervisor anyway. I’m testing firecracker just for fun. The opennebula front-end is easy to configure, but I haven’t figured out firecracker yet. I might need to launch a second VM and install the opennebula-node-kvm package, because I’m more familiar with that.

I really don’t remember these services, I wonder if these are something new to opennebula 6, but I’m not curious enough to actually look into it. I just remember that previously on 5.x I was running just libvirtd. I could be wrong, opennebula seemed a bit too much to learn back then, when all we needed was proxmox (which we migrated to).

It’s kinda fun to configure a cloud though. You can set up a template, like a t2.nano, and configure the price per month for the template (vm), plus the price per hour for cpu, ram and storage. I don’t know how you’d make use of that, but I suppose there’s an api somewhere to integrate with a payment portal, which is integrated with a payment processor (and if you don’t pay, your PaaS gets powered off and your account locked - pay and it gets unlocked). Or maybe there isn’t one, lmao, haven’t checked, but I doubt the big companies that use opennebula would be manually billing their customers based on exports from the web interface.

From what I can tell, while not exactly complicated, setting up opennebula is not a walk in the park either; you need a few hours or days to have it set up and running (depending on what you plan on deploying). But I think this can be useful for the future. I wouldn’t create a how-to guide on this, because the documentation on how to set it up is good. It’s a lot to read, but most stuff can be skipped. And the links at the end of each page point you to the next part of the process without having to read all the in-between, which, I gotta say, for such a massive piece of software, makes the documentation top-notch, at least as far as setup is concerned. Reading about features doesn’t seem to be as polished though (e.g. I tried to find more info about firecracker and how to set it up, but I couldn’t find much - although it could’ve been just me).

2 Likes

Quick rant: I did a browser cleanup from 200+ tabs to about 75. Will do another cleanup this or next week to get it to around 20, after I’ve read all the things I’ve been saving up.

I am absolutely shocked how many tabs I had saved over the past 4 years or so (since 2019) that have disappeared from the internet. The internet archive still has some of them. I kinda wish to make my own internet archive and found just the tool for it.

Not sure when I’ll be able to deploy this, but it’s an interesting topic. Alongside a wiki download, this might be a good tool to save information.

There were many pages I saw that were old news things that I didn’t care about saving, but a lot of the articles I’ve read or wanted to read were gone, especially tutorials on how to do tech stuff.


Quick note on OpenNebula: I dropped all my Alma VMs after the recent Pink Hat fiasco. I’ll try that again on Debian when I get the chance. Also, I kinda needed more storage space, so next time I’ll make the storage backend on the flash pool on my NAS instead of running local (I’ve no need for speed or redundancy for this; I kinda wish I had another scratch storage medium to use for test VMs or containers, like the usb hdd I use for ytdlp).

I’m kinda sad that x86 is still the easiest platform to deploy new stuff on. Compiling from source doesn’t always work OOTB for many projects (especially for things that contain java, like libreoffice - java can be removed at compile time, but if you need that particular feature, you’ll have to do some code changes yourself). Even when we’re talking plain make and make install, x86 is still the most supported. My hand feels kinda forced to buy a low-ish power x86 build (<200W max power draw) as a home server (nas and hypervisor), only to have it always on (and to get a bit more performance out of my ssd pools).

1 Like

Seems like OpenNebula has a hard dependency on systemd. That sucks. I tried to install it on Devuan 4, but the dpkg process quit when trying to mess with some systemd-tmpunit or something like that.

I don’t feel like extracting and maintaining a version without systemd. There’s technically zero reason to need systemd integration, other than having the service unit files start everything automatically. You can make your own services in other process supervision suites.

I’m sure the debs could be extracted and worked on, but I’m not going to bother.


This leads to another thing I kinda feel like talking about: all the fuss about “universal packages” like snap and flatpak. I was thinking, let’s say GNOME Shell started shipping entirely as a flatpak. Of course, it would have to be non-sandboxed, so it could actually interact with other things on the system (like having access to the actual /usr/bin to launch your programs).

But this does not solve the problem of universal dependencies. GNOME Shell still makes use of systemd components, like user slices and session scopes. If you tried to launch a gnome shell flatpak on a system like OpenRC Gentoo to avoid compiling it, you’d be in for a lot of disappointment.

And you can’t just bundle the entirety of systemd inside a flatpak without literally reinventing lxc and docker. And then, if you want a display manager, you need to integrate that with the flatpak’ed GS to launch it. And if the DM is also flatpak’ed, you’d need to further make your system start it (although this wouldn’t be as hard, you’d just need your process supervision suite to launch the DM).

These “universal packages” might work well for desktop programs, but for bundling desktops themselves, I don’t see how they’d work without bundling a bunch of crap in them, and even then, you’d be hard pressed to get them to work.

Many projects assume certain dependencies are met from the get-go. Which is fine, but then treating flatpak and snap as a “universal package” becomes disingenuous. Don’t get me started on trying to run GS on the BSDs or other unixes.

And yes, I know Gentoo has its own version of GS and they could technically compile that into a flatpak and then offer people other integration tools based on your process supervision and installed tools, but this will also require some dependencies, like elogind, meaning if you tried to install this flatpak on debian, you’d be in for the same kind of disappointment.

I’m not a dev, but I believe that for a package to be truly standalone, it needs to have as few external dependencies as possible. But then, depending on the OS, you’ll need to make some integrations to make the package work. Which leads to… package maintainers. So we’ve come full circle, back to having package maintainers do the integration with the OS.

Of course, not having to deal with dependency conflicts, and potentially having an atomic OS, would make the life of maintainers way smoother. All you’d have to do is basically make the integration and forget about it. But you’d still have to have a local package manager, like say apt on debian, whose purpose is just to install the minimal dependencies, like the systemd unit files for a theoretical gdm flatpak that launch it on system startup.

A generic software store could also do some checks on what the underlying components are (systemd / openrc / s6, dbus, systemd-logind / elogind etc.) and configure that for you automatically. That would also require some integration outside of the distros, but it would make package maintenance a bit more centralized and easier (since the dependencies wouldn’t be as hard, like requiring a particular version of a program, but would instead mostly be API calls or some scripts to set up the startup sequence).

And of course, now comes the obligatory shilling for the nix package manager and how revolutionary it is, having made dependency resolution a thing of the past before any of the “universal” package managers. You’re still going to have to do some integrations, but you could technically have anything run side by side, possibly even systemd (if you remove that crap about it requiring to be PID 1 in order to launch any kind of services).

1 Like

Seems like I can’t focus on one thing at a time; I’m such a mess, jumping from one technology to another. I do get around to coming back to all of them at some point (eventually… maybe? I’m sorry for not fixing my forum wiki articles sooner).

Now I’ve just tested importing a rootfs image into lxd. You can’t really do it just by importing; you need to create a metadata.yaml file, but it’s easy to make, it only has 7 lines. I’ve used the nixos rootfs x86_64 image from hydra (the nixos build site).

These 2 websites were useful in dealing with lxd and with the nixos container respectively.

tl;dr

cat <<EOF > metadata.yaml
# this is the metadata.yaml file
architecture: "x86_64"
creation_date: 1690679693
properties:
  architecture: "x86_64"
  description: "NixOS 23.05"
  os: "nixos"
  release: "23.05"
EOF

tar -cJvf metadata.yaml.tar.xz metadata.yaml
lxc image import tmp/metadata.yaml.tar.xz tmp/nixos-system-x86_64-linux.tar.xz --alias nixos2305
# create a container named nixos from the imported image
lxc launch nixos2305 nixos
lxc exec nixos -- /usr/bin/env poweroff
lxc config set nixos security.nesting true
lxc start nixos
lxc exec nixos -- /usr/bin/env bash

cat <<EOF > /etc/nixos/configuration.nix
{ config, pkgs, ... }:

{
  imports = [ <nixpkgs/nixos/modules/virtualisation/lxc-container.nix> ];
  # Suppress systemd units that don't work because of LXC
  systemd.suppressedSystemUnits = [
    "dev-mqueue.mount"
    "sys-kernel-debug.mount"
    "sys-fs-fuse-connections.mount"
  ];


  environment.systemPackages = with pkgs; [
    neovim
  ];

  system.stateVersion = "23.05";
}
EOF

nixos-rebuild switch

(the container has nano preinstalled, I had to use it to configure the nix cfg - and I can’t believe how bad muscle memory can interfere with that)

I’d like to move to pure lxc, but unfortunately, it seems like lxd has a lot of really good tooling, like live migration, built into it that never got ported to lxc. Not sure if live migration even works outside of ubuntu; never tested it (I’m not using ubuntu). This was more of a playground, to see what I can achieve on my own, but also kind of a simulation in case Canonical decides to close the images repo like docker did and charge for using it. It’d be understandable, because the bandwidth consumption is not insignificant and I’m not entitled to it, but I wanted to see what can be done with your own local repo.

1 Like

I’ve been dealing with containers lately, as can be seen above. Out of curiosity, I wanted to see if gitlab was able to run in lxd. It seems there’s some limitation to redis when run inside a container. Gitlab has a systemd service dependency on redis-gitlab, which never starts.

I’ve tried both nixos and debian 12 to run gitlab in lxd, with no luck - literally the exact same issue. That led me to try lxd vms, which are supposedly full VMs using qemu, but with a lot less cruft than normal libvirt. Well, obviously lxd won’t run as it should on non-supported systems. In my case, it won’t launch VMs because it fails some qemu check, despite me running qemu 7.1, lxd 5.9 and lxc 5.0.1.

This is part of the reason why I wanted a firecracker hypervisor, but the only ways I knew to get one were either aws or opennebula, which is why I was meddling with opennebula a few weeks back. But now I’ve just discovered microvm.nix. It can run anything from full qemu to firecracker and crosvm, along with other tools I’ve never heard of. crosvm is interesting; I’d need to read more about it vs firecracker. Seems like firecracker started as a fork of crosvm, which I wasn’t aware of.

Unfortunately, it seems like the only way to live migrate OSes right now is qemu’s tooling, and for linux and OCI containers, CRIU. Technically, CRIU should be able to handle any kind of RAM data migration, but I haven’t yet found people trying to migrate firecracker. The idea of “serverless” workloads is that they are ephemeral and short-lived, meaning you should be spinning up microVMs on-demand when needed and killing them when the demand is lower.

But that’s lame. My idea is to use existing technologies to build redundant home self-hosting highly available infrastructure without breaking the bank and using as few resources as possible.

If the hypervisor is unable to live-migrate the programs one wants to run, then one should move the task over to the guest OS, coupled together to form a cluster resource. Pacemaker and corosync are certainly good tools for this, but I don’t believe they react as fast to disruption. The idea is: a service goes down, it gets immediately relaunched on another host.

When a host dies, with HA at the hypervisor level, you already have the same data copied in memory on another system and you just resume activity pointing to the same disk resources. In an OS-level cluster, your program dies and you immediately launch another identical instance that takes over the burden. But your service needs to be able to account for disruptions. For example, if you have a webserver, it dying and getting launched again on another host won’t impact much, except maybe a user’s current session, but it’s not going to be very disruptive.

But for something like a DB, say postgres, you launch the service on another host, postgres sees it crashed and it re-reads its last activity logs. This can cause a massive disruption while pg tries to figure out what happened and resume activity. A PG cluster with synchronous streaming replication helps, but if your service doesn’t have its own built-in support for something like that, you’re kinda screwed without live migration when a host dies.

Typically hosts dying is not as big of a problem as people make it out to be, but it can happen, and I’m pretty sure it happens more frequently in a home environment. Many people with homelabs don’t do any kind of HA because of the high costs, even when their services are important, like self-hosted email.

Well, for that matter, many VPSes don’t offer you HA either; if the host ever goes down, they have an SLA saying you’ll have your VM up in at most n hours. And if they can’t respect their SLAs, you don’t get compensated, they just offer you a deal to reduce your costs as an apology, but your users won’t be happy when the service they use is unavailable. I put the emphasis on VM, because if your service doesn’t come back online after a crash, it’s still up to you to fix it.


Anyway, this rant has devolved into too many things. I mostly wanted to talk about linux containers and lightweight virtualization. NixOS certainly does a lot of interesting things. I’m undecided now whether I want to run a nixos hypervisor or continue pursuing opennebula. On one hand, opennebula is cloud infrastructure software. Managing clouds is not exactly my thing. It can work as datacenter software, but you’re using a sledgehammer to hammer in small nails. It can work, but it’s overkill.

Given my discovery of microvm.nix, I could realistically run a single host for my needs, unlike opennebula, which requires multiple hosts (or at least VMs) for different kinds of contexts, like one hypervisor for vms, one for microvms and another for containers, when microvm.nix and containers.nix could do fine for what I need.

Don’t get me started on containers.nix. It uses systemd-nspawn, but I haven’t looked into how to make it work with bridges instead of tap / NAT. I prefer everything to be on the main network I assign it to, not follow the docker hidden / inaccessible network model (I can’t be bothered to deal with port-forwarding and, furthermore, I’d rather use default ports and different IPs and hostnames per container, instead of having different ports mapped to each container).
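
That said, the declarative container options do list hostBridge / localAddress, so bridging might be as simple as something like this (untested; the bridge name, addresses and the nginx bit are just placeholders):

containers.webthing = {
  autoStart = true;
  privateNetwork = true;
  hostBridge = "br0";            # veth gets attached to the host bridge
  localAddress = "10.0.0.21/24";
  config = { config, pkgs, ... }: {
    services.nginx.enable = true;
    system.stateVersion = "23.05";
  };
};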

The only problem I see with containers.nix is that the containers are mapped to the host’s nix store. It saves space, but might cause conflicts on access. If microvm.nix proves to be lightweight enough compared to containers, it’d probably be the way forward. A kernel image shouldn’t be too large and emulating a lot of legacy cruft shouldn’t be necessary.

Thinking about it, I feel I’m very biased towards nix now. Seems I’ve dug myself into a hole I can’t escape. I’m also interested to learn more about nix flakes and nixops. From what I can tell, some people already did the work on a terraform-nixos provider, because terraform is more of a standard than nixops. There’s a lot to learn, but I feel nixos would be easier and more technologically advantageous than ansible (because you don’t have to write more code to remove things previously installed; nix takes care of that and later you can garbage-collect).

In the pursuit of efficiency, let’s say we use linux containers (or containers.nix) for most of the services. MicroVMs still allow us to overcome the limitations of containers, so at the very least, they should be a complementary tool.

1 Like

I’ve been struggling for a while now to get a VM backend in virt-manager to work properly. Normally NFS with qcow2 works fine, but for some reason it didn’t, and I really wanted to avoid the current setup becoming a permanent temporary fix, where the temp fix is, e.g., a windows VM having NTFS on a qcow2 vdisk on ext4 / xfs on iSCSI on a zfs zvol.

@redocbew you might be interested in this post, maybe.

I realized that for certain VMs with local zfs, I was just passing through the zvol directly, with /dev/zvol/pool/whatever-vol as the source disk, instead of using qcow2. It didn’t occur to me that I could just log in to the iscsi portal and, instead of formatting and mounting a local fs on the hypervisor, pass the whole sdX disk through to the VM, just like I’m doing with zvols. This is actually the same method of editing the vm xml file as used in Wendell’s Fedora 26 Ryzen passthrough guide.
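
The hypervisor side of hooking up the target is just the usual open-iscsi dance (the portal address and IQN are placeholders):

iscsiadm -m discovery -t sendtargets -p 10.0.0.10
iscsiadm -m node -T iqn.2023-08.lan.nas:vm-disk0 -p 10.0.0.10 --login
# the LUN then shows up as a plain /dev/sdX block device on the host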

After doing a zfs send of a VM that doesn’t get powered on often from the local nvme zpool to my nas spinning rust zpool and modifying the xml, I started the vm as normal. The VM won’t need fast local storage and is basically “archived” for all intents and purposes. To make sure I wasn’t just booting off the local copy somehow (the config can lie - although I could hear the rust spinning when powering it on initially), I powered off the VM, destroyed the local copy with zfs destroy and started the VM flawlessly again.

The xml part looks like this
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
<!-- 
discard='unmap'/>
 -->
      <source dev='/dev/sdX'/>
      <target dev='sda' bus='sata'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>

Quite happy with my realization, but the shortcoming right now is that the disks might come up in a random order. I have at least 3 iscsi targets for this host alone and another one or two for another host. Sometimes the 500GB disk is sda, sometimes sdb. Maybe I’m doing iscsi wrong; I have a target for each lun and I treat each target as a disk. This is because I want the ability to later orchestrate a target change via the auth group with a simple service reload, to switch the host it is running on.

I’d like to find out how to add a custom WWN to the iscsi LUN, so I can assign /dev/disk/by-id instead of /dev/sdX in the qemu vm xml file. @diizzy maybe you know something, since I’m using freebsd ctld for the iscsi target config.


With this out of the way, now I’m having trouble thinking of a good solution for containers using a similar block device backend; not sure if there is a driver for something like this. Worst case, I can just fall back to an individual nfs share per container, to be able to just zfs snap the fs.

1 Like

Mapping by id is the trick that I used also, but in my case they’re all local drives so no re-mapping required.

2 Likes

Sadly no one bothers to back up anything very much anymore.
There are simple methods, such as just copying to external media, up to advanced backup and restore techniques. Each serves a purpose, as the backed up and archived information can then be removed from the main drive.
Given the large volumes of today's drives, users do not feel the need to do backups, only to pi$$ and moan when the drive decides to go to camp tookash!t.
I do out of force of habit, but that's me.

2 Likes

I’m pretty sure somewhere in the thread I complained about backups and archival. Not sure which comment you are referring to here, but I agree. I’ve got 2 copies of my important data (just using zfs send, but I want to set up restic for that sweet deduplication - I was also thinking of bacula, because enterprise software, but I just want something simple for now).
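
The restic part should be as simple as something like this (the repo location is a placeholder):

restic -r sftp:backup-nas:/backups/homelab init
restic -r sftp:backup-nas:/backups/homelab backup /tank/important
restic -r sftp:backup-nas:/backups/homelab forget --keep-daily 7 --keep-weekly 4 --prune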

1 Like

I always forget, and am always shocked by, how good the man pages are in the bsds… a man ctl.conf later and I found the device-id entry for the ctld conf. Lo and behold, a service ctld reload later, I now have custom WWNs that show up in /dev/disk/by-id and I can just add fstab entries for the mountpoints. Brilliant!
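
For anyone searching later, the lun block in ctl.conf ends up looking something like this (the target name, portal / auth groups and the zvol path are just examples):

target iqn.2023-08.lan.nas:vm-disk0 {
        auth-group no-authentication
        portal-group pg0
        lun 0 {
                path /dev/zvol/flash/vm-disk0
                device-id "vm-disk0"
        }
}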

3 Likes

I was working at iXsystems (primary developers of FreeNAS) a few years ago, and they liked to have at least 2 extra data drives in any given commercial data set. Where raidz2 was not cutting it, i.e. people using a large raid array instead of an SSD for a database or to back the boot drives of VMs, they would have pools of mirrors.

ZFS does mirrors differently than most hardware mirrors. Write events go to all of the drives, but each read event goes to a single drive, so a mirror in practice works much faster during read events than a striped array of the same number of drives.

They put 3 to 5 drives in a mirror, then made a pool of those, often up to 32 vdevs in a pool, then a bunch of spares, and some flash caching drives. An array like that would have the normal redundancy with HA (High Availability), i.e. 2 motherboards (hosts) in an HA chassis, 2 cards per host, 2 data paths all the way to the drive in each drive shelf. Each motherboard can bring to bear 32 SAS channels to the drives. Write events would consume a lot of channels, but the amazing part was during read events. The read event goes to the drive with the data. With a mirrored array, all of the drives in the vdev can be the drive with the data.

When you look at how much hardware is committed to making the data sets highly available and performant, it just does not make financial sense to make the vdevs into raidz arrays instead of mirrors. With 3+ drive mirrors you can have a hard drive failure and rebuild that vdev while it still stays performant for read events. There are many vdevs in the pool, and ZFS will give write events to a mirrored vdev that is not busy performing a resilver, so the pool stays performant.

If you are going to have several pools per HA pair, you might as well spread your mirrored vdevs amongst the disk shelves, so if a disk shelf gets lost due to someone dropping it down a flight of stairs while you are moving it to a different rack, you don’t lose any data.

Also every disk shelf should have at least 3 hot spares of every drive type that it contains upon deployment. The hot spares can temporarily decrease as drives die and get RMAed, but if you go down to 1 or 0, you should buy some more drives to add to that enclosure (or keep nearby to add) so that you maintain a safe number of hot spares.

1 Like