I’m building a datacenter

If I’m going to use software RAID, why use mdadm instead of ZFS? That’s not to say there’s no place for mdadm, just that in most server use cases, ZFS is a better bet IMO.

2 Likes

You’re not. It’s just a fallback option if your controller dies on you.

1 Like

I appreciate the pointer, I’ll give it a shot. I got one off eBay, but it will be a while before it gets here with the holiday and all that. I had no luck with the H710, but I imagine a discrete card may do better than the “mini” card stuck to the motherboard. My only real concern is maybe needing to grab different cables; the 610 and 810 look to use the older-style connector, and the 720 cable is a bit short.

In the meantime, it looks like I might not be at a total loss for testing things out as a cluster. I realized that I had two “working” 600GB drives in RAID-1 in one server and a 64GB SSD in one of the others. I may not be great at math, but that’s three drives and I have three servers.
I’ll see about standing up a cluster and using Gluster as the test backend with one drive in each server since Gluster is file-based rather than device-based.
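For my own notes, the rough plan for that test volume looks something like this (the hostnames pve1/pve2/pve3 and /dev/sdb are placeholders, not my actual names):

# on each node: format the test drive and mount it as a brick (assuming /dev/sdb here)
mkfs.xfs -f /dev/sdb
mkdir -p /data/brick1
mount /dev/sdb /data/brick1

# from one node: join the peers and create a 3-way replicated volume
gluster peer probe pve2
gluster peer probe pve3
gluster volume create gv0 replica 3 pve1:/data/brick1/gv0 pve2:/data/brick1/gv0 pve3:/data/brick1/gv0
gluster volume start gv0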

2 Likes

A search turned up one hit, specifically saying this model (IBM 45W9609) is 520b sectored by default: https://picclick.com/Ibm-45W9609-900Gb-10K-Sas-Self-Encrypting-25-142704937308.html

NOTE: The H700 card won’t handle the 520b sectors any better than the PERC6/i.

You’ll need a non-RAID SAS card (or flash the H700 to IR firmware) to change that, like so: Drives "Formatted with type 1 protection" eventually lead to data loss | TrueNAS Community

1 Like

If the drives are 520 then just follow the guide using SCSI tools to rewrite them. It’s not hard, it just takes time.

4 Likes


Well done you, is all I can say :+1:

2 Likes

When I was looking at the drive info in iDRAC, I think I remember it saying they were 512, but that might also be the controller not knowing what to do with them. It’s worth a shot at least!

Would you have any recommendations on a budget SAS controller that would talk to these drives so I can reformat them? So far I have a variety of controllers; they all recognize the drives, but none of them will expose them to the OS.

1 Like

Just about any non-RAID SAS card should work.

If you’re just looking for something that’ll let you format and test the drives, eBay is flooded with SAS3801E cards that’ll cost you all of $10:

The cheapest connectivity seems to be an SFF-8088 to SFF-8482 cable, like so:

A lot of footnotes, but that should be good enough for your current purpose. Boot up with System Rescue CD, find the drives with lsscsi, and then you should be able to run the sg_format commands; afterwards, test them with pv /dev/zero > /dev/sda (and check dmesg after that’s done). Be sure to aim a fan at the drives when running them this way, as well.
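For reference, the whole sequence is roughly this (the /dev/sg2 and /dev/sda names below are just examples, double-check the lsscsi output before formatting anything):

# list SCSI devices together with their generic (sg) nodes
lsscsi -g

# reformat a 520-byte-sector drive to 512-byte sectors; --fmtpinfo=0 also drops
# type 1 protection info if present (destroys all data and takes hours per drive)
sg_format --format --size=512 --fmtpinfo=0 /dev/sg2

# afterwards, write the whole drive once and watch for errors
pv /dev/zero > /dev/sda
dmesg | tail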

2 Likes

If anyone here does end up in a situation like I did with IBM SAN SSD drives, which work up to a certain point but then misbehave, look into hdparm -N to over-provision the drives. You should be able to make the drives appear a bit smaller, hiding the portion of the drive near the end that you can’t write to.
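As a rough sketch (the sector count here is made up; read the drive’s reported native max first, and note some hdparm builds also want --yes-i-know-what-i-am-doing for this):

# show the current and native max sector counts
hdparm -N /dev/sdb

# clip the visible capacity to a value a bit below the native max;
# the 'p' prefix makes the new limit persist across power cycles
hdparm -N p1465000000 /dev/sdb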

3 Likes

Now that’s some redneck computing

3 Likes

While waiting for stuff to come in the mail, I am doing a “dress rehearsal” of the setup. I have a 3-node setup with Gluster installed as the backing file system. I stood up a DNS server and an HTTP proxy server and have been playing with some of the Proxmox features.

The one that blew me away this evening was the live migration command. I ssh’d into my DNS server and started a little shell script.

while true; do echo -n "."; sleep 1; done

Then from my local PC, I just started a simple ping command to this VM’s IP address.
Finally, I initiated a VM migration from one server to another. I was interested in seeing if my SSH connection would break, if the script would crash, if any packets were lost, anything of that nature. I had to tab back over to my Proxmox tab to make sure it was working, and the VM had already migrated without me noticing.
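(For reference, the migration itself is just one command from a node’s shell; the VM ID 100 and target node name pve2 below are placeholders:)

# live-migrate a running VM to another node in the cluster
qm migrate 100 pve2 --online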

Just to be double sure I didn’t miss anything, I rebooted the cluster and all the VMs and tried again. This time, instead of a simple “migrate” command, I gave the “shutdown node” command and tabbed back over to my terminal. The VM migrated and the node shut down, and it was entirely seamless from my perspective. It absolutely blows me away how cool that is, having a running machine migrate like that.

Just for extra good measure I might install a Minecraft server and see if I can migrate that VM without anyone disconnecting. That would be neat.


Performance-wise everything seems fine. Disk speed is a bit lacking, but I won’t judge that too harshly before I get the disk array properly functional. I did verify that the link aggregation on the storage backend was doing its thing: with some cache-to-cache benchmarks I managed to hit about 1.8 gigabits of throughput, which is more or less what I was expecting.
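If anyone wants to sanity-check a bond the same way, a plain iperf3 run between two nodes over the storage network gives a similar picture (the address below is a placeholder); a single TCP flow usually hashes onto one physical link, so you need parallel streams to see the aggregate:

# on the receiving node
iperf3 -s

# on the sending node: several parallel streams so the LACP hash spreads them across both links
iperf3 -c 10.0.30.12 -P 4 -t 30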

7 Likes

Live migration is much harder when you have lots of RAM, lots of system activity (RAM, disk and CPU), and a slow or busy network for that SAN to replicate the data over.

It’s not a surprise that with a 1-second sleep you wouldn’t notice anything at all. With something that outputs constantly, like make on a big piece of software, video playback, or games, you’d be more likely to notice the slowdown and then the fraction-of-a-second hiccup as it preps and live migrates.

2 Likes

Wall-of-text of managing a small data center (slightly ranty)

I have live migrated Oracle DB VMs in Proxmox without users ever noticing. Lots of times. And lots of VMs at one time (well, I never went over 12 parallel migrations and I think 40 VMs, so ok, maybe not “many,” but not few either). When there were OS updates for Proxmox requiring reboots, I would move VMs, reboot the systems, move them back alongside other VMs, update the other system, then re-balance the VMs on the hosts a little. My hosts never saw more than 9 months of uptime (on average, 3 to 4 months of uptime). I have been using 2x 1 Gbps ports in LACP (mode access) for the NAS connections and 2x 1 Gbps in LACP (mode trunk) for all other traffic including management (the network Proxmox uses to talk to other hosts, migrate VMs, receive commands etc.).

Before my colleagues and I standardized every host on Proxmox and got rid of 10+ year old hardware that was still in production (eek), I had some hosts with libvirtd on CentOS 6 and others on CentOS 7. I live migrated using Virt-Manager, but you could only migrate between hosts running the same major version of CentOS. And everything was behind OpenNebula, but it wasn’t configured that great (and, just like the old CentOS 6, it had barely seen any updates). We did the move to Proxmox before CentOS 6 went EOL, but some hosts and VMs didn’t see updates (they were like CentOS 6.4 - 6.7 or something) and the move happened in the summer of 2019 (fun times). Still, both libvirtd and Proxmox’s qm use KVM and QEMU underneath, so live migration works flawlessly on both.

Our small data center was used mostly for development of a suite of programs, but we did have some services that were considered “production,” like our GitLab and Jenkins servers, Samba (albeit this one was a physical server), our VPNs and our public websites (the latter of which were in the DMZ) and a few others. Despite this, we managed to survive without HA. None of our VMs were highly available. And we only used NASes; each VM had its disk(s) on them. They were running NFS and all Proxmox hosts were connected to them, so we didn’t have to shuffle disks around when live migrating, just the RAM contents between the hosts. We did have backups for basically everything, even the test DBs. While we were in charge, no host or NAS decided to die on its own. And if we noticed issues like failed disks or other errors, we got to moving everything somewhere else. I heard, however, that there was one time when an old hardware RAID card died, leaving all the data on that host unusable (backups were being done fine even before me and my team came, but I can only imagine the pain of restoring all that). There was no replacement for that card. The server had 12x SAS Cheetah 500 GB 15k RPM drives, so you can imagine how old that thing was (it was just lying around when I came).

We used Zabbix, then Centreon. I preferred Zabbix personally, it seemed more solid (and I prefer agent-based monitoring), but I did a lot of custom alerts in Centreon with nothing but bash and ssh, which was cool in itself. Then again, Centreon uses SNMP and I was using SSH to monitor stuff, so this wasn’t ideal. At home I’ve installed Prometheus and Grafana, but haven’t gotten around to configuring alerting and other stuff (still haven’t set up a mail server). So make sure you configure monitoring early. It’s not a fun chore when you have to manage 100s of VMs, but it should be more doable when you only have to manage the hosts themselves (we were only 3 people with both Linux and networking knowledge, with no NOC).

Having things on NASes is not so bad, but it’s definitely not as resilient as Ceph / Gluster. However, that hasn’t been an issue at our small scale (9 chunky virt hosts and about 7 NAS boxes). We started with 5 (not full) racks and ended with almost 2 full racks after decommissioning the dinosaurs.

Also, having just an aggregated 2 Gbps pipe (give or take, more take than give) wasn’t ideal either. Frankly, we started experiencing some network issues when we moved from balance-alb on 1 switch (well, there were 3 switches: 1 for management and VM connections and 2 non-stacked switches for the storage network, the latter 2 being connected by 1x 10G port on the back) to everything on 2 stacked switches with LACP (the switches got very loaded by LACP; we had 12 aggregation groups on a 48x2 stacked switch, the maximum allowed). These issues led to the very last few DB backups not completing in time and failing the backup jobs. Aside from the issues with nightly backups, we encountered some latency / slow-responding VMs during the day. These issues weren’t apparent before the switch (pun intended).

Depending on how much resiliency you plan your data center to have, you can do some tricks to save some money now and upgrade later when you can afford it, like not having stacked switches and using balance-alb link aggregation. But that implies having some additional costs later, so it’s either an issue of cash flow, or an issue of total investment costs. Your choice. However, 1 thing I highly recommend is that if you don’t use 10G switches on every connection (which I highly recommend at least for storage), at the very least use minimum 2.5 Gbps on the storage and management interfaces. And split them into 3 different port groups: 1 group for storage access (and have a separate network for it), 1 group for management and 1 group for the VM connection to the internet, so at minimum you need 6 ports for the hosts and 4 for the NASes. Proxmox doesn’t necessarily recommend that the management interface / group have high bandwidth (so not necessarily 10G), but they recommend having low latency, because Corosync is latency sensitive. That said, it still worked for us without a hassle on a shared 2 G pipe between the hosts and the VMs connections (there wasn’t really much traffic going on between the VMs during the day, so YMMV).

My inner nerd tells me that it’s bad to be an engineer (i.e. create a barely working infrastructure - for reference, it’s an old joke: “people create bridges that stand; engineers create bridges that barely stand,” which is about being efficient with materials and budgeting and stuff). But nobody will blame you if you don’t go all-out when building your data center. I don’t remember seeing anything related to your LAN speed, but in the pictures I only saw 2 switches, 1 with 24 ports and another with probably 12.

I know people who worked at some VPS providers, and their company was acquired by a big conglomerate alongside 4 other VPS providers. Apparently this is the norm for VPSes: none of them had HA for their VMs. Even if you bought the best option they had, there was no HA. They were the only one out of the 5 where you could order dedicated servers, request Proxmox or CentOS to be installed on them, and make an HA cluster yourself. But 3 dedicated servers would be costly.

To improve their density, VPS providers don’t bother with HA. They just give you a VM and make snapshots regularly, and if something goes down, they usually have an SLA with at most 4h response time (so they can take up to 4h just to respond / notify you that something is wrong on their side). When you agree to their services, they allow themselves a total downtime of 2 weeks / year for your services, so plenty of time to just restore VMs from backups if something went wrong. If you lose money as a customer during that time, it’s not their problem; it only becomes their problem if, over the course of 1 year, you get more than 2 weeks of total downtime. So VPS providers can have a big profit margin if they don’t do HA. One issue I’ve been told about was when one of their hosts just went down and wouldn’t power on, so they just mailed their customers and restarted their VMs on other hosts (basically 20 minutes of work). Another time, one of their NASes went down. VMs obviously crashed. What did they do? Just mailed their customers, took the HDDs out of it, put them in another box, started everything back up and they were off to the races. At most 2h of work, if even that. And people couldn’t complain; they agreed to this.

VPSes depend a lot on snapshots, backups and resilient hardware. They could use cheaper hardware and do HA, but that wouldn’t be as economically feasible unless they charged a lot more for features like HA and a better SLA. So the work the level 1 NOC people do at a VPS provider is monitoring hosts and connectivity, swapping hardware around (like failed disks), physically installing new racks and servers, and responding to customers. Unless they have other issues, the biggest work they face is monitoring and balancing VMs on hosts. They do actually take their time to look at which VMs use more resources and balance their hosts accordingly, like 2-10 resource-hogging VMs alongside 20-30-40 (however many fit) less demanding VMs on 1 host. The level 2 people are rarely needed (they’re mostly on-call); they just configure new hosts, and if the customer is paying for additional support and the level 1 guys can’t solve the issue in a timely manner (say, 1 hour), they grab the steering wheel and solve the issue themselves.

So, conclusion time: aside from the layer 2+3 design choices, you also need a monitoring system (Zabbix / Prometheus), a ticketing system (OTRS), a page that customers can use to access their VMs, and some automation tied to a payment system that will stop or pause VMs if customers don’t pay (you could do it manually at the beginning) and maybe mail them 4 days before the due date (OpenNebula has a “funds / credits” system built in; I never took advantage of it, because my data center was for internal use only). On the customer-facing page: I never allowed people to access Proxmox’s management interface and I’m not sure what it’s capable of in a VPS scenario; OpenNebula is more catered towards IaaS and VPS stuff. The VPS providers I talked about above were using Proxmox with cPanel on the VMs, and I can guess Cockpit could be an alternative, but I have no idea how to run a VPS business.

There may be more stuff that I don’t remember that I probably should mention, but I’ll leave some time to digest this long-ass comment for now.

6 Likes

I have a very similar setup, glad to know I might be on the right track! I currently have four networks.

+------+---------------------------+
| vlan |          purpose          |
+------+---------------------------+
|   10 | client to client + NAT    |
|   20 | infrastructure management |
|   30 | storage                   |
|   40 | publicly routable         |
+------+---------------------------+

Each of my hosts has 4 ports. I have two bonded (LACP) for storage, one (mode general) for client and public traffic, and the final one (mode access) for infrastructure. I figure that storage should be on its own network; I don’t mind if the link is saturated, but I do mind if it’s unpredictable. I also want management on its own interface, just so that if there is a DoS attack or something I have a guaranteed available connection to the infrastructure.
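The storage bond itself is just the stock ifupdown config on the Proxmox side, roughly like this (the interface names and address are illustrative, not copied from my hosts):

# /etc/network/interfaces (excerpt) -- LACP bond for the storage network
auto bond0
iface bond0 inet static
        address 10.0.30.11/24
        bond-slaves eno3 eno4
        bond-mode 802.3ad
        bond-miimon 100
        bond-xmit-hash-policy layer3+4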

You can see the rough idea with the picture of the servers I posted above. Orange is management (the “danger” network), blue is client/public, grey is storage.

I definitely want to grab some 10G cards for client and storage traffic, this will free up more ports for better link aggregation on the management network.

I have a 10-port MikroTik router and a 24-port Dell switch. I needed lots of WAN ports on my router and this ended up being a nice product. The switch leaves some to be desired, but it was dirt cheap. All ports are gigabit, with the exception of a 10G SFP+ port on the router.

I definitely want to get some redundancy at the switch level, this is one area I know I am lacking in. I actually have another 24-port switch sitting on a shelf at home. I just couldn’t think of a good way to get switch redundancy with only 4 ports per server, while still having link aggregation options and a dedicated interface for management. Current plan is that if this Dell switch fails, I just copy the config over to my backup and drop it in.

To extend your bridge analogy: there is enough room in my budget, after building a cheap bridge, to have another one on standby!

Yup. This is a problem for “future Judah.” I like the pointers though.

For monitoring I am hoping to give Proxmox -> InfluxDB -> Grafana a shot. I have seen some dashboards from some larger companies and I am amazed at the density of information that can be displayed. I am also looking into writing a script that will read information from the UPS systems and environmental controls. I’ll have to look into a notification system for when things go wonky, but one step at a time.
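From what I’ve read so far, the Proxmox side is just a metric-server entry along these lines (the name, address and port are placeholders, and the exact fields depend on the Proxmox and InfluxDB versions; I haven’t actually set it up yet):

# /etc/pve/status.cfg -- send node/VM metrics to InfluxDB over the UDP line protocol
influxdb: homelab-metrics
        server 10.0.20.50
        port 8089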

I can throw together a billing panel easily enough, I have built many, but I am a bit fuzzy on VM automation. I am sure I can secure VNC access fine, but automating installation of a variety of OSes without allowing the client to get into a situation where they can cause IP conflicts has got me stumped. Most likely I’ll have to do that part manually. The Proxmox API itself seems pretty robust; the challenge will be in automating it in a secure way.

Bless me with your knowledge! I would much rather learn from your mistakes than from mine.

I really appreciate folks like you who love to share your wisdom and hard-earned experience.

5 Likes

This guy makes YouTube videos on SAS, flashing LSI HBAs, and storage stuff: https://artofserver.com/ . He also sells (relatively?) cheap HBAs, tested and pre-flashed.

5 Likes

I don’t know what VPS providers are doing with their layer 2, but I’m guessing they just yolo it with 1 switch (i.e. if one fails, have a spare and swap it in, just like you mentioned). Ports rarely fail on switches, and switches dying altogether is even rarer, so a 30-minute to 1-hour disturbance while you replace the broken one is made up for by the layer 2 simplicity, lower cost and fewer rack units used.

To do fully redundant layer 2 without doing active-standby stuff like spanning tree (i.e. to use all your resources), you need stackable switches with dual PSUs. The way you connect everything is like a mirror. The stacked switches are “bonded” through some software voodoo, so you only see 1 “software” switch with double the number of ports (so if you have 2x 48-port switches stacked, you’d see only 1 switch that you can remote into). The interfaces (for gigabit) would look like gi0/1 - gi0/48 and gi1/1 - gi1/48 (or similar, depending on the vendor). You would connect srv1-port1 to gi0/1 and srv1-port2 to gi1/1 and do LACP between those 2 links. So if 1 switch port dies, the other stays up (but this works on a single switch too), or, in the scenario stacking is meant to prevent, if 1 whole switch goes off, the other is still up and you get half the speed. Dell has some nice stackable switches; you can probably find them cheap 2nd hand (maybe).

Dual PSUs are just a bonus: if one UPS dies, the other is still up. UPSes dying happens pretty often, so it’s important to replace them regularly (maybe every 5 to 6 years or so, and batteries every 2 to 3 years at most). There have been UPSes that lasted 10+ years (and I’m the owner of one), but I wouldn’t rely on it for mission-critical applications. Another newer UPS from the same brand, but still old, blew up in my face; that was terrifying.

Speaking of UPSes

I don’t have experience with monitoring other brands, albeit I’ve had both Ablerex and Eaton, but APC has a FOSS daemon for monitoring their UPSes in Linux (apcupsd, the APC UPS daemon). Maybe look into that.
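Once the daemon is talking to the UPS, apcaccess dumps the state in an easily scriptable form, e.g. (field names are from APC units; other brands via NUT will look different):

# query the local apcupsd daemon and pull out the fields worth alerting on
apcaccess status | grep -E 'STATUS|BCHARGE|TIMELEFT|LINEV'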

There are “cheap” (like $200) ethernet-enabled temperature sensors that you can set up in your data room: give it an IP address, go into its web UI, give it a mail server and it will send you alerts whenever the temperature goes above a threshold that you set (this saved our butts countless times).
We used exactly this thing:
https://www.ute.de/produkte/ueberwachung/serverraum-ueberwachung/ip-thermometer.html
I don’t remember its name though, just “IP Thermometer.” You could combine the mail with an alert in your monitoring system. Speaking of NMSes, I feel (so, being really subjective here) that InfluxDB’s and ELK’s business models went a little sideways. On a small scale they should be fine though. I don’t tell people what to use, because I don’t take on the responsibility of managing people’s systems, but I personally try to stay away from those two (which is why I went with Prometheus at home). And Grafana is the best GUI stack; Chronograf and Centreon’s and Zabbix’s own UIs don’t come close to it.

I remembered something about the VPS level 1 NOC people. Most VPSes have limits on how many mails you can send from an instance. It’s ideal to have some packet inspection to avoid spammers getting your IPs blacklisted. Still, spammers got around it. I’m not sure how they enforced their policies / restrictions, but if you tried sending more than 50 (?) emails at a time, they’d get blocked and you’d get a mail back telling you what happened (and you could talk with the NOC if you still needed to send more mails). Spammers would most likely move to more targeted attacks, and the IP address of the VPS would still end up on a public blacklist somewhere. So the level 1 NOC had to unblock the IPs for new customers when they requested it, or give them new IPs (public IPs were assigned automatically in software; end users couldn’t choose which IPs they got).

Hmm, I would still choose 6 ports (2 for each) and use balance-alb on things like the management and client LAGGs, with LACP only on storage (just to save on LACP groups and avoid the switch CPU / ASICs getting too close to 100% usage). But for starters, 4 ports should be fine, I guess. Again, at the beginning you don’t need to over-engineer. Just make sure you’ve got enough bandwidth for them.

For a guaranteed connection, unless you are on the local network and not remote, you’d need an out-of-band connection, i.e. your management network being completely separate from the customer link to the ISP. You can get away with using a 4G connection for that, as long as it’s reliable, as you only need to account for a VPN, the web GUI, SSH, Proxmox updates and just a few more things. These aren’t bandwidth intensive. But they can go over a cap, if you have one, so using a Squid proxy and making it cache anything that comes from the Debian and Proxmox domains wouldn’t be too far-fetched (you can do that even on a Pi with an SSD attached to it; it’s basically disposable hardware that you can always replace). What you can’t get away with over this 4G connection is backups. You’d definitely need something with more oomph for off-site backup.
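A minimal Squid config for that kind of repo caching might look something like this (the domains, sizes and paths are just a starting point, not something I’ve tuned against current Squid defaults):

# /etc/squid/squid.conf (excerpt) -- cache packages from the Debian / Proxmox mirrors
cache_dir ufs /var/spool/squid 20000 16 256
maximum_object_size 1024 MB
acl repo_domains dstdomain .debian.org .proxmox.com
cache allow repo_domains
cache deny all
# hold on to downloaded .deb files instead of re-fetching them over the 4G link
refresh_pattern -i \.(deb|udeb)$ 129600 100% 129600 refresh-ims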

And speaking of off-site backups, check with the ISP you will connect to whether they rent rack space in their data centers; you could put your own backup servers in their rack and it might end up cheaper in the long run than using backup services, but again, YMMV.

3 Likes

… or if you’re getting fiber to an IX, you could rent a rack or half a rack plus a little bit of bandwidth for $500-1000 a month at a colo - chances are that any nearby colo provider will be at your nearby IX too. This could double as your oh-sh* backup transit / admin access networking.

3 Likes

In theory, if you have it deploy something to handle NATing in addition to whatever OS they’d like, such as a pfSense VM, then I think you could get it to work (just have the pfSense VM’s WAN set to DHCP). Not ideal, but I think it’d work.

Also, you may be able to use the IBM drives with a flashed Dell H200/H310, NOT in the storage slot.

Also also, I read about the Ubiquiti problems. Ubiquiti is fine for Layer 2, but stick with something reputable for handling the Layer 3 stuff :stuck_out_tongue:

Also also also, a MikroTik 4-port 10G SFP+ switch and some Mellanox ConnectX-2 cards might be an easy way to get some 10 gig in the network for the core stuff, and you can use the 1G bonds as a backup in the event something goes kaput.

4 Likes

I don’t know if you’ve seen it already, but with Proxmox there’s also a way to get SDN and provision VXLANs. With that you’ll be able to just route into a public net.

https://pve.proxmox.com/pve-docs/chapter-pvesdn.html
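Under the hood these are ordinary Linux VXLAN interfaces; the manual equivalent of what the SDN layer generates is roughly this (the VNI, device names and address are placeholders, and a real setup also needs FDB/peer entries or an EVPN control plane, which Proxmox handles for you):

# create a VXLAN interface (VNI 100) on top of the underlay NIC
ip link add vxlan100 type vxlan id 100 dstport 4789 local 10.0.20.11 nolearning
# bridge it so guest NICs can be attached to the overlay
ip link add br-tenant type bridge
ip link set vxlan100 master br-tenant
ip link set vxlan100 up
ip link set br-tenant up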

Also do yourself a favor and deploy a Source of Truth like NetBox!

I’m not using the SDN features of Proxmox, but on my OpenStack cluster I deployed a separate VLAN for basic internet access, and OpenStack will just create an OVN router (SNAT) for every tenant network that needs an internet connection.

2 Likes