ThatGuyB's rants

I’ve been struggling to get a VM backend in virt-manager to work properly for a while now. Normally qcow2 on NFS works fine, but for some reason it didn’t, and I really wanted to avoid the current setup becoming a permanent temporary fix, where the temporary fix is, e.g., a Windows VM with NTFS on a qcow2 vdisk on ext4/xfs on iSCSI on a ZFS zvol.

@redocbew you might be interested in this post, maybe.

I realized that for certain VMs with local ZFS, I was just passing the zvol through directly with /dev/zvol/pool/whatever-vol as the source disk, instead of using qcow2. It hadn’t occurred to me that I could just log in to the iSCSI portal and, instead of formatting and mounting a local filesystem on the hypervisor, pass the whole sdX disk through to the VMs, just like I do with zvols. This is actually the same method of editing the vm.xml file as used in Wendell’s Fedora 26 Ryzen passthrough guide.

After doing a zfs send of a VM that doesn’t get powered on often from the local NVMe zpool to my NAS spinning rust zpool and modifying the XML, I started the VM as normal. The VM won’t need fast local storage and is basically “archived” for all intents and purposes. To make sure I wasn’t just booting off the local copy somehow (the config can lie, although I could hear the rust spinning when powering it on initially), I powered off the VM, destroyed the local copy with zfs destroy, and started the VM flawlessly again.
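The migration itself is just a send/receive plus the XML edit; a rough sketch of the steps above, with hypothetical pool and dataset names (the NAS is reached over ssh here):

```shell
# Snapshot the zvol backing the rarely-used VM on the local NVMe pool
zfs snapshot nvme/vms/archive-vm@migrate

# Send it to the spinning rust pool on the NAS
zfs send nvme/vms/archive-vm@migrate | ssh nas zfs recv -u rust/vms/archive-vm

# After pointing the VM XML at the iSCSI-exported disk and test-booting it,
# the local copy can go away
zfs destroy -r nvme/vms/archive-vm
```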

The XML part looks like this:
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <!-- optionally add discard='unmap' to the driver element above -->
      <source dev='/dev/sdX'/>
      <target dev='sda' bus='sata'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>

Quite happy with my realization, but the shortcoming right now is that the disks might come up in a random order. I have at least 3 iSCSI targets for this host alone and another one or two for another host. Sometimes the 500GB disk is sda, sometimes sdb. Maybe I’m doing iSCSI wrong: I have a target for each LUN and I treat each target as a disk. This is because I want the ability to later orchestrate a target change via the auth group with a simple service reload, to switch the host it is running on.

I’d like to find out how to add a custom WWN to the iSCSI LUN, so I can reference /dev/disk/by-id instead of /dev/sdX in the qemu VM XML file. @diizzy maybe you know something, since I’m using the FreeBSD ctld for the iSCSI target config.


With this out of the way, now I’m having trouble thinking of a good solution for containers using a similar block device backend; I’m not sure there is a driver for something like this. Worst case, I can fall back to an individual NFS share per container, to be able to just zfs snapshot the filesystem.

1 Like

Mapping by id is the trick that I used also, but in my case they’re all local drives so no re-mapping required.

2 Likes

Sadly, hardly anyone bothers to back up anything anymore.
There are simple methods, like just copying to external media, up to advanced backup and restore techniques. Each serves a purpose, as the backed up and archived information can then be removed from the main drive.
Given the large volumes of today’s drives, users don’t feel the need to do backups, only to pi$$ and moan when the drive decides to go to camp tookash!t.
I do, out of force of habit, but that’s me.

2 Likes

I’m pretty sure somewhere in the thread I complained about backups and archival. Not sure which comment you are referring to here, but I agree. I’ve got 2 copies of my important data (just using zfs send, but I want to set up restic for that sweet deduplication; I was also thinking of bacula, because enterprise software, but I just want something simple for now).

1 Like

I always forget, and am always shocked by, how good the man pages are in the BSDs… a man ctl.conf later and I found the device-id entry for the ctld config. Lo and behold, a service ctld reload later, I now have custom WWNs that show up in /dev/disk/by-id, and I can just add fstab entries for the mountpoints. Brilliant!
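For reference, the relevant knob is the device-id option inside the lun block of ctl.conf(5); a minimal sketch with hypothetical target, pool, and LUN names:

```
# /etc/ctl.conf — hypothetical names throughout
target iqn.2012-06.com.example:vm-disk0 {
        portal-group pg0
        lun 0 {
                path /dev/zvol/tank/vm-disk0
                # shows up under /dev/disk/by-id on the Linux initiator
                device-id "vm-disk0"
        }
}
```

followed by a `service ctld reload` on the target host.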

3 Likes

I was working at iXsystems (the primary developers of FreeNAS) a few years ago, and they liked to have at least 2 extra data drives in any given commercial data set. Where raidz2 was not cutting it, i.e. people using a large raid array instead of an SSD for a database or to back the boot drives of VMs, they would have pools of mirrors.

ZFS mirrors differently than most hardware mirrors. Writes go to all of the drives, but each read is served by a single drive, so different reads can be spread across the members; in practice a mirror works much faster during reads than a striped array of the same number of drives.

They put 3 to 5 drives in a mirror, then made a pool of those, often up to 32 vdevs in a pool, then a bunch of spares, and some flash caching drives. An array like that would have the normal redundancy plus HA (High Availability): 2 motherboards (hosts) in an HA chassis, 2 cards per host, and 2 data paths all the way to each drive in each drive shelf. Each motherboard can bring to bear 32 SAS channels to the drives. Writes would consume a lot of channels, but the amazing part was reads. A read goes to the drive with the data, and with a mirrored array, every drive in the vdev can be the drive with the data.

When you look at how much hardware is committed to making the data sets high availability and performant, it just does not make financial sense to make the VDEVs into raidz arrays instead of mirrors. With 3+ drive mirrors you can have a hard drive failure, and rebuild that array while it still stays performant to read events. There are many vdevs in the pool, ZFS will give write events to a mirrored vdev that is not busy performing a resilver, so the pool stays performant.
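In zpool terms, a layout like that is just a stripe of N-way mirrors plus hot spares; a small-scale sketch with hypothetical device names:

```shell
# Two 3-way mirrored vdevs striped together, plus hot spares
zpool create tank \
  mirror da0 da1 da2 \
  mirror da3 da4 da5 \
  spare da6 da7
```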

If you are going to have several pools per HA pair, you might as well spread your mirrored vdevs amongst the disk shelves, so that if a disk shelf gets lost because someone drops it down a flight of stairs while moving it to a different rack, you don’t lose any data.

Also every disk shelf should have at least 3 hot spares of every drive type that it contains upon deployment. The hot spares can temporarily decrease as drives die and get RMAed, but if you go down to 1 or 0, you should buy some more drives to add to that enclosure (or keep nearby to add) so that you maintain a safe number of hot spares.

1 Like

I appreciate the insight, I might pin your comment, but mere mortals can’t afford to lose 3 to 5 times the capacity to redundancy (that’s what, 33% and 20% usable capacity respectively? and that’s not counting hot spares) for our home labs, or even for small production setups, so we’re trying to squeeze all we can out of raidz. We are aware resilvering tanks the performance and might kill some drives in the process, so we try to keep the pool small (that’s why I never recommend people go above 11 drives per vdev in their own setups, and then just do striped vdevs).

Sure, when it comes to important large production boxes, going balls to the wall with mirrors makes tons of sense, where impacted performance loses you money (like if people can’t click that “add to cart” button quickly and place the order).

At the previous company I worked for, working with small budgets and making do with LACP (and balance-alb when running out of LACP groups) on 2x gigabit ports, we never went above 6 drives in raid6 (I wanted to put the cards in HBA mode and use ZFS, but got outvoted 2 to 1 by my colleagues; if we had gone ZFS, I would still do 6-drive vdevs and stripe 3 of them). Our small production was pretty snappy despite the somewhat underpowered backend.

Not sure if working for a large company like iX has allowed you to see the craziness that happens in low-budget departments. It’s a fun world in itself, but virtually all companies that cut corners will eventually find themselves in a situation where they lost money because of it, so next time around they will go with a saner config (like a 3-way mirror). The ones that didn’t had an IT manager who either saw it happen before, or lied about seeing it just to get approval for a higher budget to avoid it entirely (our department didn’t have an IT manager; we used to, but he left the company and we restructured under a CTO with a developer background instead, so all decisions were kinda democratic there).

1 Like

At the time iXsystems was about 65 people.

I was only at iXsystems for 3 weeks, for a support job. I learned a bit, and gave them some ideas that completely changed the way they diagnose and replace potentially bad hardware. I also wrote a shell script to automate a boring and error-prone hand data-analysis task. I was more concerned about stopping them from doing stupid shit than being friendly.

I was let go after I left in the middle of the day to get my girlfriend (now wife) to the hospital. I had followed procedure, but my boss hadn’t, and lost face when looking for me. She had a hernia; the doc said if I had gotten her there 2 hours later it would have become life threatening. Her disability (quad amputee) and vanity (she wasn’t put together, literally and figuratively) prevented her from just calling an ambulance.

Unlike most people, I read the SAS spec cover to cover when it was initially released. I talked with their CEO about some issues in the FreeBSD driver that did not match spec and were costing them a lot of money. I also talked to him about some of the reasons behind designing SAS, and ways they could leverage that to get the software they needed. They talked to LSI, who came back to them a week later and trained the senior staff on a related software package (I was not in the meeting and did not know the details). Their daily shipping costs for replacement hardware were around $12k. If they got the tools that the SAS spec required, their replacement hardware costs should drop to 20% of that, and become much more convenient for customers. I know that they stopped all shipments right after that meeting.

They were trying to get into markets like lawyer offices, doctor offices, etc. that did not have machine rooms. The hardware they supply is noisy. I suggested that instead of investing ever more money in vibration isolation of hard drives and fan profiles, they just buy some acoustic enclosures, and gave a few examples that reduced sound levels by 30 dB for a 4U at under $500, up to 56 dB at $7,000, available in oak, beech, and teak. They said it was ridiculous, but now they are providing that very item. It completely solves the issue they were encountering, and frees up 5 full-time staff who wanted to work on other projects.

I was supposed to be getting trained, but the person training me was on a different shift, and we only overlapped by 2 hours a day. I noticed that much of the staff was spending the majority of their time hand-analyzing logs. Since I had several hours a day of idle time, I wrote them a 3k-line shell script which machine-analyzed the logs and output an HTML file. During one of the daily meetings they brought up one of the logs I had analyzed, and announced that the solution was to replace the faulty drive. I asked about the other faulty drive and the questionable drive. They ran my script, quickly found the faulty drives, and decided to replace the questionable drive too. My script should have reduced the time needed to analyze logs from 6 hours per day per person to hopefully less than 2 (including phone calls and arranging shipping). In 3 days the support department went from using lots of overtime and being run ragged to being caught up and having idle time. As the most junior member of the staff, I was let go.

2 weeks later I had a 6 month contract at double the hourly rate.

Updates to FreeNAS that were on track to be released in 3 years were released in 7 months.

2 Likes

Have any experience with SCC (SCAP Compliance Checker) and making a custom benchmark (XML) or custom OVAL content for it?

1 Like

I don’t, but I have xml experience.

If you can show examples of a source and destination file I can write something that does that.

2 Likes

I won’t hijack ThatGuyB’s thread, but I’ll @ you in another thread, maybe in my ‘rant’ thread haha.

2 Likes

By the way, when making a stripe of mirrors, you can start with dual-drive mirrors, then add drives to existing mirrors if your need for read speed increases, or add more mirrored vdevs as you need more space. The reason for 3 drives per mirrored vdev is so that the pool remains online and performant even as drives fail, get swapped out, and resilvered. Also, during writes the ZFS server knows which vdevs are busy and directs writes to different vdevs; if it is a stripe of mirrors, the pool is forced to perform writes to a degraded vdev, which can get messy. I think a jbod (just a bunch of disks) where each disk is a vdev is a better strategy. Also, one of the vdevs in a pool may hold frequently accessed data while the others are accessed less often. It is possible to increase the performance of a single vdev by either adding an SSD device to that vdev, or just adding more rotational drives to it.
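The grow-as-you-go part maps onto two different zpool commands that are easy to confuse; a sketch with hypothetical pool and device names:

```shell
# Start with a single 2-way mirror
zpool create tank mirror da0 da1

# More read speed: attach a third drive to the EXISTING mirror vdev
zpool attach tank da0 da2

# More space: stripe in a NEW mirrored vdev
zpool add tank mirror da3 da4
```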

on a 3 drive vdev during resilver:
drive being used for reads
drive being used for source to resilver
blank drive being filled with data.

on a 3 drive vdev after resilver:
drive being used for reads and writes
drive being used for reads and writes
drive being used for reads and writes

on a 5 drive vdev during resilver:
drive being used for reads
drive being used for reads
drive being used for reads
drive being used for source to resilver
blank drive being filled with data.

Unfortunately, if a drive is going to fail, it usually fails while it is the source drive during a resilver, hence the usefulness of more than 3 drives per mirrored set.

on a 3 drive vdev during resilver and second drive fails:
drive being used for source to resilver
DEAD: drive being used for source to resilver - during resilver drive fails
blank drive being filled with data.
Notice that there are now no drives available for reads. The entire pool may go offline until the resilver is complete, which may take more than 5 hours.

on a 5 drive vdev during resilver and the source drive fails:
drive being used for reads
drive being used for reads
drive being used for source to resilver
DEAD: drive being used for source to resilver - during resilver drive fails
blank drive being filled with data.

You can see why it is worth it to spend more money on more independent drives if the data needs to be high availability.

It is a good idea to have at least one of the drives in each mirrored array on an independent disk shelf.

Also remember that redundancy is not backup; backup needs to occur independently of redundancy. It is usually a good idea for the backup server to pull data from the storage server instead of the storage server pushing data to the backup server: in case of ransomware, if all of the data on the storage server gets compromised, you don’t want the backup server’s data to also become compromised. Also, in the case of dedupe, if the dedupe table becomes larger than system memory, you don’t want the data on the backup server to become unavailable too.

2 Likes

I agreed thus far, up until this point.

I don’t think ZFS dedup is worth the hassle for budget stuff (which is basically my expertise), which is why I want to get into solutions like restic or potentially bacula. If you have lots of RAM (again, big enterprise customers), then maybe; dedup can save you TBs of data if you have a hypervisor with, say, Windows VMs in the upper double digits (even with 10 VMs you potentially save 40GB × 10, so 400GB in one shot, if all VMs run the same version of Windows; and if deduped enough it can also give a speed boost, since the same DLLs are cached in ARC and don’t need to be read again).

But generally my ZFS pools run on low RAM (iX doesn’t even look at you if you run ZFS on less than 8GB of RAM, or at least that was the case a few years ago; I run ZFS on devices with 4GB of RAM and also run other things on top, like NFS and iSCSI, and it’s still fine). What was the recommendation for dedup, something like 1GB of RAM per TB of total (not usable) capacity?
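One way to answer that question for a specific pool is to simulate dedup before enabling it; `zdb -S` walks the pool and prints the would-be dedup table histogram and overall ratio without changing anything (pool name hypothetical):

```shell
# Dry-run dedup statistics; does not modify the pool.
# The "dedup = X" line at the end is the ratio you'd get, and the
# table size gives a rough idea of the RAM cost of enabling it.
zdb -S tank
```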

And I want to rant about RAM. Why is RAM so dang expensive on consumer hardware compared to stuff like CPUs? You pay $250 for 64GB of RAM and $200 for a CPU. It’s ridiculous. So is SSD storage if you want to go over 2TB per drive, even if you go with bare basic 2.5" drives, let alone NVMe. You get to 8TB QLC stuff and suddenly you are paying $900 for a storageless build and $1,000 for 4 drives?!

In the enterprise, it makes more sense. You spend like $2,000 on RAM and $8–10k on a CPU. Consumer memory is too darn expensive. So is flash storage. And I know that if you run gigabit you can just use HDDs, but for the energy savings it makes tons of sense to get SSDs. I’m planning a new build and I’m salty about how much I need to spend on storage (and I already need to spend more on some spinning rust for a dedicated backup server; I currently have copies of my important data on my main PC and on my NAS as a backup, but I want to move everything to the NAS and have a separate backup server).

I can afford it, but I’m not made of gold. I’d rather not spend as much money if I can help it, but I don’t trust the used market, unless what I buy is so cheap that it doesn’t matter if it dies or not.

I’m getting tired and I feel like I’m losing focus on the above every two sentences or so, so I’m going to stop here.


In other news, my Threadripper system worked flawlessly yesterday, but today booting appeared to disable the network card (I suspect a kernel crash). I tried booting into an older kernel, and it still failed. I can’t figure it out and I’m too lazy to reflash hrmpf onto a USB stick to see if there’s anything to fix, like the vfio script or dracut. And I’m without a dGPU to troubleshoot with, since my GPU is passed through (although when I had 2 GPUs, both had their drivers blacklisted and I didn’t have a tty before either, yet I never had this problem). I’ll probably buy a used 710 or something that sips power as a troubleshooting GPU (interestingly, I can’t find any GT 1010 GPUs around, which should be the cut-down version of the 1030, neither new nor used; people who sell 1030s for $70 must be crazy).

1 Like

ZFS dedupe is crazy dangerous. iXsystems had a client who backed up virtual machines to a ZFS server with dedupe. They had over 60-to-1 data reduction, until they filled up their RAM with the dedupe table. The problem was that the computer was already the most high-end computer on the market. They had to wait 5 months for Intel to release a new server line that could hold more memory before they could read any data from that pool.

Have you seen the Epyc 8004?
AMD is making an Epyc mini. It is half an Epyc, using cut-down Epyc controller chips: 1P only, with up to 6 RAM channels and 96 PCIe lanes. The CPUs start at $409 for an 8-core, or $639 for a 16-core.

I am currently running the Epyc 9124, which is the cheapest CPU that would light up that motherboard. For an extra $800 of system cost I don’t have to worry about running out of RAM channels, PCIe lanes, or SATA ports, and I get ECC.

1 Like

You do take regular snapshots right? Can’t you just roll the system back a day or so?

1 Like

There was no update on the system; I only powered it on to launch a Windows VM, then powered it off. The change is not in software, but in hardware: I removed 2 PCIe cards (a GPU and a USB controller). I probably got lucky on the first 2 bootups (one when I still had the side panels open, and a second time to test that it was OK).

And no, snapshots are not taken on the Threadripper; it’s a playbox that has no important data on it, and storage is limited. I will troubleshoot it when I feel like it.
:man_shrugging:

2 Likes

This was among the dumbest things I’ve done in a while. All because of this:

@MikeGrok yes, a ZFS snapshot would have definitely saved me there, but I wouldn’t have known the root cause. I was dumb and added an fstab entry without noauto for an iSCSI LUN that only shows up on demand (whenever I run the iSCSI login script).

It took me like 5 minutes to find; it took longer to write a hrmpf ISO to a USB (I should really make myself a Ventoy USB and just slap the ISOs on there; I used to use Easy2Boot ages ago, but I’d prefer open source if I can help it).

I obviously couldn’t find a log where the mount of some mountpoints failed, because the system fails to boot properly, but in /var/log/socklog/kernel/current (which is how socklogd / svlogd logs stuff) I found an entry about iSCSI not automatically logging in, and then a bridge port going down, which made me look into fstab.

I didn’t even realize I had edited that; the event was completely erased from my memory. I would still have mounted iSCSI manually (although I might’ve used the WWN, as I remembered I got that working, so I don’t have to guess by the disk size). I kinda wish I maintained a changelog.
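The kind of fstab entry that avoids this, referencing the custom WWN and keeping the disk out of boot-time mounting (device id and mountpoint are hypothetical):

```
# /etc/fstab — iSCSI-backed disk; noauto keeps boot from hanging on it,
# _netdev marks it as network-dependent for mount helpers that honor it
/dev/disk/by-id/wwn-0x6589cfc000000abc  /mnt/iscsi0  ext4  noauto,_netdev  0 2
```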

2 Likes

These past couple of days I’ve been messing with NixOS microvm.nix. It seems promising if I’m planning to only run NixOS. Probably something to look into in the future (or maybe sooner).

I’ve also messed with OpenNebula and the Firecracker node, but I can’t figure out how this thing actually works. The weirdest part is that I’m getting stuck on a network message saying the (to-be) instantiated VM can’t get an IP / MAC lease from vmbr0. Which doesn’t make any sense, because it works just fine for the host. Setting this up is really not easy, unless you use the silly demo version, miniONE (which runs everything on a single host).

I guess I’ll need to get the KVM host to work flawlessly (I haven’t even attempted to test it), then troubleshoot Firecracker. This shouldn’t be necessary, but I think I might be missing a step (probably NIC-settings related, and maybe the storage backend).

2 Likes

Today I want to rant about USB. OMG, USB can suck @$$ sometimes. Here’s the deal: I’ve got an old Windows x86 tablet with a single micro-USB 2.0 port that’s OTG capable and also used for charging. My original setup for this was:

  • micro-USB B cable from tablet going to;
  • USB Y splitter, with a female A port and 2 male A ports (it was used to power an ancient 3G USB SIM modem to connect to the internet in the WinXP era; that thing needed more power than the mere 0.5A a standard USB A port could provide);
  • 1x male A goes to a power brick;
  • 1x male A goes to a USB A female-to-female adapter;
  • inside the other female A port, plug in a USB stick, a keyboard, or a USB hub, powered or unpowered;

This setup works, but the tablet still gets discharged (albeit slower) when using a USB device, even with a stinking powered USB hub.

How to solve this? Well, buy an el-cheapo micro-USB OTG Y splitter: 1x micro-B female, 1x A female, 1x micro-B male. No matter the order you connect things in, the tablet does not charge at all, but it can see the attached devices. Plug this dinky adapter into an old Android phone: boom, works perfectly and charges.

The adapter failed. What other options do we have? Well, there are USB type-C hubs that also have PD passthrough. I bought 2, to make sure (different brands and models). Originally I used a charging adapter, type-C female to micro-B male. This one showed neither the devices nor charging on the tablet. Got another micro-B adapter (different brand), still nothing. Tested it with my newer(-ish) phone: works flawlessly, and I can connect USB sticks to it while charging.

Then I tried the hub with a type-C female to A male adapter. The setup used a micro-B cable to a double female A to mc2fA to the hub. With the hub plugged into power, nothing gets detected. Without wall power, devices get recognized.

Ok, fine. Try using another adapter, specifically soldered for this particular scenario ages ago (but I had to discard the 5V brick and cut the wire). This one is kind of a Y-in splitter for data and power: a female A to a male A that only passes data, plus another cable that used to lead to a 5V 5A brick (hard wired).

Guess what happens next.

  • micro-b male to female A cable;
  • Y splitter fA->2xmA;
  • 1x end to power brick;
  • other mA end to f2f adapter;
  • data usb A male cable from the other Y p&d (power and data) splitter coming in the f2f adapter;
  • usb A normal powered hub in;

No devices recognized, despite getting the power. It used to work when the p&d splitter had power coming in (well, it served more as a powered hub itself; it has 2 female A ports, but ignore that). It seems a bare basic powered USB A hub doesn’t work if it doesn’t receive power from the USB data port. It works without wall power when plugged into a normal USB port (or through an OTG adapter), but when plugged into the wall without power from USB, no devices get recognized on the PC (or tablet in this case).

Replace the powered A hub with the type-C hub that has USB PD (Power Delivery) passthrough, and this thing works. But I am now left with a lot of very janky adapters to make this work. Adapter-ception, I’d say.

This is not portable at all; it’s a wire mess. Thankfully it’s not a fire hazard (as everything is soldered or tightly coupled), but I need a brick with 2 ports (A and C female). I might be able to do something like a “black-box cube” into which to throw all the adapters, but I’d still be left with a brick that has 2 cables coming out (one to the C PD hub, the other to the black box), to finally combine into one cable going to the tablet.

Both C PD hubs that I ordered kinda work, but one is smaller than the other (and despite that, it also has an audio out that doesn’t work on Linux, lmao, so I’m using a FOSS USB audio out from ThinkPenguin).


There’s no moral to the story (besides USB sucking), but a few big questions are left. Why does the powered USB A hub fail when connected to a host only via data, while the type-C “powered hub” (the one with USB PD passthrough) works? And why did the micro-B OTG splitter not work for the tablet, but work for the phone (both are supposed to be OTG capable)?

2 Likes

This is why usb over wifi is a thing.

1 Like