What's the craziest tech catastrophe you have faced?

AviatorMaverick · May 28, 2022, 5:38am

What’s the worst tech catastrophe that has happened to you?

I think mine would be losing nearly my entire ~2 TB of storage, but I did have backups for what mattered most so it’s okay, I guess.

Argone · May 28, 2022, 5:48am

Plugging a PCIe power plug into the CPU power connector. End result: dead system.

AviatorMaverick · May 28, 2022, 5:55am

Everything died?

GigaBusterEXE · May 28, 2022, 6:25am

I shoved 2.3 V into my brand new Athlon II X2 because my friend didn’t elaborate that “go up one” meant 10 mV, not 1 V.

PaintChips · May 28, 2022, 6:44am

In a work environment they had pre-fab PCs, so using a Molex splitter or SATA splitter was common; we weren’t given permission to remove the power from the optical drive to power a 2nd HDD. On a developer test box one of those Molex-to-SATA splitter adapters caught fire and the short took out the boot and 2nd HDD. Web developers kept their work on their own laptops, so the “developer box” going toast was a day off but no actual work was lost. It did give the ownership an excuse to replace a few older desktops for fear that more cable adapters would melt and cause fires.

Argone · May 28, 2022, 6:52am

yea.

E-Wasted · May 28, 2022, 7:07am

Ran Large FFTs with AVX instructions by way of the Prime95 blend test on my brand new FX-8120 Bulldozer on an overclocking motherboard with an “unlocked” TDP. Within seconds the CPU overheated @99C+ and tripped OTP on the motherboard, giving a BSOD.

About a decade later I managed to do the same exact thing with my brand new 10900KF. Once again OTP protected the CPU from any harm. I now always set OTP to the minimum 100C on any unlocked board for this reason.

regulareel · May 28, 2022, 7:34am

I’ve fried 2 computers before:

I played with the input voltage switch at the back of the PSU and forgot the correct setting. It was supposed to be 220V but I switched it to 110V.

I also was disassembling our Athlon X2 and turned on the PC without the CPU heatsink and cooler…

judahnator · May 28, 2022, 11:04pm

Once while working in PC repair I was assigned a laptop. It was brought in claiming “water damage,” so it was left on a shelf in the summer heat for a week to dry out. When I opened the lid I discovered that it wasn’t water damage, it was fruit loops damage. I could tell from the moldy fruit loops.

Another time a former employer decided that off-site backups were too expensive and deleted the whole S3 bucket, in the process also deleting many terabytes worth of files in Glacier that incurred the early deletion fee. I’m sure not having backups will come back to haunt them later too.

thro · May 29, 2022, 4:15am

Back in 2003.

Set up a DFS share between two servers as they were going to be in two different locations.

Other tech took one of the servers to the secondary site before it had finished replicating.

Other tech got impatient with it at the other site and figured he’d restore (via robocopy) a week old backup into the DFS share in order to “speed it up”.

DFS did what DFS does and replicated his 1 week old backup back to the primary HQ site, wiping out a bunch of data.

Nice!

DavieDavieDavie · May 29, 2022, 4:16am

A five line batch script taking out an entire airport.

thro · May 29, 2022, 4:16am

Haha, what were the 5 lines?

In a similar, seemingly innocuous action, i took out the company ERP server via nmap during an audit of it.

Turns out SCO openserver (at the time) had a bug in the kernel and NMAP basically crashed it (what a piece of shit :D).

DavieDavieDavie · May 29, 2022, 4:33am

OK, story time. Strap in, this could be interesting. This was a number of years ago, so I think I can now safely tell the story.

So I used to work for a company that manages airport IT systems. I was on a team responsible for check-in, gate, and self service systems. Check-in and gates were known as CUTE and the self service is CUSS (CUTE & CUSS are IATA standards).

Anyway, one day a site admin of a European airport reached out to my team saying they were having issues with Amadeus DCS software, so I wrote a simple 5 line batch script to run at logout and clean up temporary files left by the application.

This was a Friday (since then I’ve imposed read-only Fridays on myself). On Saturday I got a call from one of the new guys on the team who was covering the weekend shift. He called me because he knew I had deployed a script: part of our procedure when deploying anything is to share it with the rest of the team and go through the ITIL change control process.

So then I’m told that for whatever reason, they log-in, it works, they logout. The next airline to come and use a CUTE workstation, they cannot login – okay, that’s a bit odd, I thought to myself.

So I went and reverted my change. When I then reviewed my batch script, I realized I had made a minor, yet surprisingly critical mistake. A typo. I was missing one letter.

Now, the working directory of log-on and log-off scripts is %SystemRoot%\System32. Because of the typo, my batch script didn’t change to the directory of the temporary files. That, combined with legacy airline applications requiring “Power User” privileges, a throwback to Windows NT (Power Users grants write access to System32), meant the script proceeded to wipe out the contents of System32.

I spent all of my Saturday coordinating with the on-site admin and techs to restore the affected machines (basically re-imaging them back to a working state). After 7 hours, functionality was restored and the airport was able to operate again.

The moment I realized the screw-up, I called my manager and told them that I screwed up. The site manager and some customer service managers called for me to be fired, but I had the back of my department managers and senior management – I got lucky.

It was an expensive mistake, but I learned from it. To this day I double and triple check scripts, fully test them before deploying.
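
Editor's sketch: the failure mode described above (a cleanup that silently runs in whatever the working directory happens to be) can be guarded against by never deleting relative to the current directory. The original script was a Windows batch file and its five lines are not given, so this is only a Python illustration of the guard; `clean_temp_dir` and `allowed_root` are hypothetical names, not anything from the original.

```python
from pathlib import Path


def clean_temp_dir(target: Path, allowed_root: Path) -> list:
    """Delete regular files directly inside `target`, but only if `target`
    is an existing directory strictly below `allowed_root`."""
    target = target.resolve()
    allowed_root = allowed_root.resolve()

    # Refuse outright instead of falling back to the current working directory.
    if not target.is_dir():
        raise FileNotFoundError(f"refusing to clean: {target} is not a directory")
    if allowed_root not in target.parents:
        raise PermissionError(f"refusing to clean: {target} is outside {allowed_root}")

    removed = []
    for entry in target.iterdir():
        if entry.is_file():
            entry.unlink()  # absolute path, never relative to the working directory
            removed.append(entry)
    return removed
```

With a guard like this, a mistyped directory name raises an error and deletes nothing, rather than cleaning the directory the script happened to start in.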

thro · May 29, 2022, 4:39am

I had a similar login script problem a few months ago with powershell.

Aim: remove email sig from end user PC unless they are in an active directory group (as it was managed by server upstream).

Worked fine in my test environment. Due to experience and paranoia (i’ve done this sort of thing for decades now) I was super anal about logging everything and backing everything up in-script, just in case.

What happened?

When deployed to prod (well, to subsets of end users via AD GPO applied to an OU) - none (well almost none) of the end users had the AD powershell module installed.

So the check for AD group membership failed (as i was relying on AD powershell module in script) which then kicked off the email signature removal every time.

Luckily i had backed up all the artefacts on their PCs so i could run another script to revert, edit my script to write my own AD group membership check via WMI (that didn’t rely on AD module) and re-deploy.
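
Editor's sketch: the root problem above is that "the membership lookup failed" was treated the same as "the user is not in the group". One way to avoid that is to keep the two outcomes separate and only act on a definite answer. The original was PowerShell and is not shown, so this is a Python illustration only; `check_membership`, `should_remove_signature` and the `lookup` callable are hypothetical stand-ins for whatever directory query is actually available.

```python
from enum import Enum


class Membership(Enum):
    MEMBER = "member"
    NOT_MEMBER = "not_member"
    UNKNOWN = "unknown"  # the lookup itself failed


def check_membership(lookup, user, group):
    """Run a (hypothetical) directory lookup and keep 'failed' distinct from 'no'."""
    try:
        return Membership.MEMBER if lookup(user, group) else Membership.NOT_MEMBER
    except Exception:
        # e.g. the directory module is not installed on this machine
        return Membership.UNKNOWN


def should_remove_signature(status):
    """Only a definite NOT_MEMBER answer justifies the destructive step."""
    return status is Membership.NOT_MEMBER
```

A machine without the lookup module then ends up in `UNKNOWN` and is left alone, which is the safe default for a removal script.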

Lesson learned (never think there’s not stuff to learn!):
Just because it works on your admin workstation, doesn’t mean it works on an end user machine. TEST on a representative machine…

Also… for others (saved my ass):
Even if you think your script works, even if you think nothing can go wrong: log the shit out of everything you do (i.e., if you’re about to change something, log what you’re doing including the content of all the relevant variables at that point) and ALWAYS ensure you’ve backed things up before you change them in order to revert. Just in case. Storage is cheap. And if you can’t log - throw an exception and die.

Yup and as above. Have a test lab/machine that actually represents the real world. Also deploy to small test groups first. Be paranoid!
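
Editor's sketch: the "log it, back it up, then change it, and die if you can't" advice above might look roughly like this in Python. This is not the original script; the logger name, `remove_with_backup` and `restore` are hypothetical, and the size comparison is only a cheap sanity check, not a real integrity check.

```python
import logging
import shutil
from pathlib import Path

log = logging.getLogger("sig_cleanup")  # hypothetical logger name


def remove_with_backup(target: Path, backup_dir: Path) -> Path:
    """Copy `target` into `backup_dir`, log it, and only then delete it.

    Any failure to back up raises before the original file is touched."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    backup = backup_dir / target.name
    log.info("backing up %s -> %s", target, backup)
    shutil.copy2(target, backup)  # raises OSError if the copy fails
    if backup.stat().st_size != target.stat().st_size:
        raise RuntimeError(f"backup of {target} looks incomplete; aborting")
    log.info("removing %s", target)
    target.unlink()
    return backup


def restore(backup: Path, original: Path) -> None:
    """Revert a removal by copying the backup back into place."""
    log.info("restoring %s -> %s", backup, original)
    shutil.copy2(backup, original)
```

The ordering is the point: the delete is the last step, so every failure path before it leaves the user's file where it was.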

thro · May 29, 2022, 4:53am

Another one from my ISP sysadmin noob days.

Small regional ISP
I was a junior admin (circa 1997-1998)

Logged into a web server. I had a habit of using su to root, but didn’t know (at the time) about su - to actually log into root’s profile rather than just elevate. So anything i ran which stored dotfiles was storing them owned by root but in my profile.

So, over time i’d been running things like screen, etc. which resulted in a bunch of dot-files being owned by root in my home directory.

This meant that when i tried to run screen as a normal user, it couldn’t write to the files as they were owned by root.

I’ll fix this, i thought!

# cd /home/jrose
# chown -r jrose .*

“Hmm… that’s taking a while…”

It was an old Sun Netra with a few hundred users’ home directories on it.

After 5 seconds i realised that shouldn’t take that long (and my mistake) and killed it.

Guess what?

“.*” includes “…” (weird, i can’t do two ., it auto replaces to three)
when combined with -r … that meant it went UP a directory and recursed back down. The only difference between chown / -r jrose and the above is that it would do bottom up instead of top down…

Good thing i had a senior admin/mentor who was handy with awk on hand to help revert ownership of the rest of /home, and good thing i killed it before it got any further up the tree

Lesson: be CAREFUL with -r and wildcards!
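
Editor's sketch: the wildcard behaviour in this story can be reproduced without touching a real filesystem. The snippet below uses Python's `fnmatch` to emulate how a shell of that era would expand the pattern against a directory listing that includes the special entries for the current and parent directory. The listing is made up, and some newer shells skip those two entries by default, so treat this as an illustration of the matching rule rather than of any particular shell.

```python
import fnmatch

# Names a shell sees when expanding a pattern inside a home directory,
# including the special entries "." and "..". (Hypothetical listing.)
entries = [".", "..", ".bashrc", ".screenrc", "notes.txt"]


def expand(pattern, names):
    """Emulate old-shell expansion: return every name matching the pattern."""
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]


risky = expand(".*", entries)      # also matches "." and ".."
safer = expand(".[!.]*", entries)  # second character must not be a dot

print(risky)
print(safer)
```

The stricter pattern leaves out the parent directory, at the cost of also skipping the rare file whose name starts with two dots.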

AviatorMaverick · May 29, 2022, 10:19am

This makes me happy I saw undervolt/overclock guides before trying anything haha

Trooper_ish · May 29, 2022, 10:30am

Did the mentor/experienced tech do something like compare the user ownership to the name of the user folder or something? And if mis match, then change?
Or something to do with some history or something?

The three full stops/ periods .. Is something I noticed before, so one can put something in angle braces between 2 dots, like .<test>. unless you need to put it in preformatted text… then .. works anyway Or even just .<t>.

thro · May 29, 2022, 10:55am

pretty much.

i think he did it in one line of awk, in about 1-2 minutes (he used to write all his stuff in awk because perl was new and wasn’t installed everywhere, awk was - we were a Solaris/Linux shop), feeding in the name of the folder into chown.

But yeah i was shitting myself

Now, still not knowing awk that well (as unix admin is now very much part time) i’d probably just dump a directory listing into excel and do it the dumb way with copy/paste
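
Editor's sketch: the repair described above (feed each folder name under /home into chown) can be approximated in Python as a dry run. The mentor's actual awk one-liner is not given, so this is a reconstruction of the idea only. It assumes every home directory is named after its owner, and the function names are hypothetical; it prints nothing and executes nothing, it just builds the commands for review.

```python
import os


def plan_ownership_fix(home_root):
    """Return one (user, path) pair per directory directly under `home_root`,
    assuming each home directory is named after its owner."""
    plan = []
    for name in sorted(os.listdir(home_root)):
        path = os.path.join(home_root, name)
        # Skip stray files and symlinks; only real directories are treated as homes.
        if os.path.isdir(path) and not os.path.islink(path):
            plan.append((name, path))
    return plan


def as_commands(plan):
    """Render the plan as chown commands for review (dry run, nothing is executed)."""
    return [f"chown -R {user} {path}" for user, path in plan]
```

Reviewing the generated list first matters here: directories that do not correspond to a user would need to be filtered out by hand, and group ownership is not handled at all.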

thunderysteak · May 29, 2022, 12:24pm

At work?

Took down an entire datacenter, which took a full week to get back up and running, because I was given permission to do something I was never supposed to have permission to do in the first place.

Personal?

Wiped 2TB of data on an FTP server because Debian is horrible and I had no backups.

Server grenaded itself from apt-get installing a broken udev package and a friend just told me to reinstall using debootstrap from the rescue OS via ssh to keep the data in the /home directory, because there was no IPMI or a place to dump 2TB of data.

Debootstrap ignored my settings and just wiped the entire drive. This is where my hate for anything Debian-based started.

tin · May 29, 2022, 11:49pm

My worst catastrophe was probably a mirrored pair of Maxtor drives in a mail server both failing at the same time…
Backups existed, but on tape and I didn’t want to deal with that. I installed a new pair of drives, did an rsync of the first failed drive to the new ones, made a list of unrecoverable files, then copied those manually from the second drive… Had something like 2 or 3 that were busted on both drives. All in users’ spam folders (it was a Linux system with maildir storage).

Got super lucky, if I’m honest.
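
Editor's sketch: the recovery approach above (copy everything from the first failed drive, note what could not be read, then pull those files from the second drive) is easy to express as a loop. The original used rsync plus manual copying; this Python version is only an illustration, `recover` is a hypothetical name, and it treats any `OSError` during a copy as "unreadable on this drive".

```python
import shutil
from pathlib import Path


def recover(names, primary: Path, secondary: Path, dest: Path):
    """Copy each file from `primary`; on a read error fall back to `secondary`.
    Returns the names that could not be read from either source."""
    lost = []
    for name in names:
        target = dest / name
        target.parent.mkdir(parents=True, exist_ok=True)
        for source in (primary, secondary):
            try:
                shutil.copy2(source / name, target)
                break  # this copy succeeded, move on to the next file
            except OSError:
                continue  # unreadable here, try the other drive
        else:
            lost.append(name)  # failed on both drives
    return lost
```

A failed copy can leave a partial file at the destination, so anything in the returned list should be checked and cleaned up by hand.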