Sleepless nights and enterprise storage

Hi everyone,

I don’t think this has been discussed much in the forum… I did search first, so here goes.

A high school site I look after is looking at upgrading their hosts and SAN.

Currently they have:

2x Dell PowerEdge R740, Intel Xeon Silver 4110 @ 2.10GHz, 192GB of RAM
1x Dell SCv3020 which is out of support
2x Dell S4048T-ON linking the storage up

Currently I am looking at the cost of replacing this with
1x Dell PowerStore 500T
2x Dell PowerEdge R7525 servers with 2x EPYC 7343 3.2GHz 16C, 256GB DDR4-3200
and keep the switches, but move the SAN to fiber using QSFP breakout cables.
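
For reference, the switch-side plan is to split the S4048T-ON's 40GbE QSFP+ uplink ports into 4x 10GbE for the array. Something like this on OS10, if I've read the docs right (the port number is just an example, adjust to your layout):

    OS10# configure terminal
    OS10(config)# interface breakout 1/1/49 map 10g-4x
    OS10(config)# exit
    OS10# copy running-configuration startup-configuration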

The issue I have is the cost. The school can’t really afford to buy the drives from Dell - they are so expensive. And you know they lock you in!

I was looking at the SuperMicro CyberStore 248N 48x NVMe all-flash storage array (based on their SSG-2028R-NR48N), but I’ve always played it safe when it comes to storage for my clients; the big players are a safer choice, I normally say…

Obviously this does not have redundant controllers like an actual SAN, but the price is less than half for the same amount of NVMe.

Have any of you deployed anything ‘home made’ in production and snubbed the big players in storage?

How did you do it and still manage to sleep?

Long-time viewer, first-time poster, so go easy, eh?

SuperMicro is a big player; they just don’t have the advertising and consumer brands like HP/Dell/Lenovo.

Their servers are the go-to for white-label providers, as they are often the OEM/ODM for many other companies and products.

My company sells a mix of Dell and SuperMicro, and I personally use SuperMicro at home because of the great value on the used enterprise market.

Where things get different is that with SM you’re buying hardware, while with the big guys you’re often getting hardware and software. So have a plan for how to manage the data, and validate your software stack on the platform before fully investing.

FOSS can often make up most of the gaps; the problem is having a number to call for support. That is why companies like Proxmox have paid enterprise versions.
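
To put that in context: with Proxmox the paid tier is essentially a support contract plus access to the more conservatively updated enterprise repository; the software is the same. Roughly like this (repo lines from memory for PVE 8 on bookworm, check them against the version you deploy):

    # /etc/apt/sources.list.d/pve-enterprise.list - needs a subscription key
    deb https://enterprise.proxmox.com/debian/pve bookworm pve-enterprise

    # community alternative, same software, no one to call:
    # deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription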

2 Likes

For small installations I really prefer SANs; they make things so simple. Something like Ceph can be a headache if something goes wrong.

Do you have access to a good VAR? Some will simply quote their standard kit, but others will attempt to find deals and alternatives that suit your requirements. I’ve seen some good deals on storage come along occasionally. I know education can be a challenge though due to fluctuating budgets.

I haven’t had a chance to play with SuperMicro storage, but having used their servers I get the feeling there may be a few compromises, maybe in the software department. Though in the end the hardware has always worked; it’s all the same chips, after all.

Yes, but we had the manpower and know-how to find and fix issues. Though proprietary stuff has bugs all the time too, and you often hit limits far sooner.

I’ve had to wait six months for an AP vendor to apply a Linux kernel patch to their AP. It was a super-subtle bug they couldn’t even replicate; we had to find the cause and guess at the fix for them.

1 Like

Have you considered looking at iXsystems TrueNAS Enterprise or similar?

I have started a conversation with them but am not sure if they can supply/support the United Kingdom yet.

I am the VAR :slight_smile:

1 Like

Maybe it’s me, but I have found Proxmox far more reliable and stable than TrueNAS.

It’s still built on ZFS for local storage, has far better virtualisation support, and at least on the surface it looks like they respond to paid enterprise accounts.

But then, I have never paid for either TrueNAS or Proxmox, so I can’t really speak to their customer service.

1 Like

Are you talking about Proxmox Backup Server or Virtual Environment vs TrueNAS Enterprise/Core (based on FreeBSD), or something else? They’re not the same type of product, and I’m not sure how that’s relevant as we’re talking about storage.

I am talking about Proxmox as an overall solution vs TrueNAS as an overall solution. I am not speaking specifically about any one of their “products”.

Also, while we are talking specifically about storage, they are relevant because they are both backed by the same tech: both use Linux (with SCALE, anyway) and both use ZFS.

With Proxmox you get the added advantage of converged technology where you can run both virtual machines and containers, as well as robust storage with both ZFS and Ceph if you want to cluster/scale.

While the webUI for TrueNAS is better optimized for “just storage”, I find that it’s overall just too limited next to Proxmox. It’s just as easy to spin up an LXC, drop Webmin or something similar on it, and expose a large block of storage.
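
Roughly what that looks like from the CLI; a quick sketch, assuming a Debian template already downloaded to local storage and a ZFS dataset at /tank/share (the VMID, hostname and paths are made up):

    # create a small container to front the storage (template name is an example)
    pct create 200 local:vztmpl/debian-12-standard_12.7-1_amd64.tar.zst \
        --hostname fileshare --rootfs local-zfs:8 --memory 1024 \
        --net0 name=eth0,bridge=vmbr0,ip=dhcp

    # bind-mount the big ZFS dataset into the container, then start it
    pct set 200 -mp0 /tank/share,mp=/srv/share
    pct start 200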

Ceph also, while having a learning curve, allows for growth, as you can expand pools much more simply than with ZFS. Proxmox can also run both ZFS and Ceph, so you can start with a single server on ZFS and grow into an expanded cluster as required.
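
The growth difference in a nutshell (device and pool names made up):

    # Ceph grows a disk at a time: initialise a new drive as an OSD
    pveceph osd create /dev/nvme1n1

    # ZFS grows in whole vdevs: e.g. add another mirror pair to the pool
    zpool add tank mirror /dev/nvme2n1 /dev/nvme3n1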

Don’t get too caught up in “the person asked for X and you’re talking about Y”. We’re not SIs or VARs, this is an open discussion, and if I were looking for a solution that offers better value than a storage-only proprietary software stack, Proxmox is the first place I would go.

2 Likes

I do not trust my home-brew chaos at home.

Big-Boy storage and two different backup solutions.

No, and I would strongly advise against it.

If you think you can do it a lot cheaper, etc., you’re probably missing things: timely replacement of like-for-like spares, support if you get hit by a bus or go on holiday, and so on (storage is guaranteed to fail while you’re on leave).

I’ve got a couple of all-flash Pure Storage arrays at work and am very happy with the performance. Depending on their dataset, the deduplication will help a lot.

Performance is great and the shutdown and power-up process is simple. Coming from NetApp… so much nicer.

Shutdown: power down all your VMs and pull plug
Power up: apply power

It literally does not have a shutdown process. So long as your workload accessing the SAN isn’t actually running, you’re good. If it is - well, it’s just like the VM losing power. There’s nothing array-specific to worry about.

Pretty sure they do two lines of all-flash arrays now: capacity-based and performance-based.

1 Like

Supermicro support, at least where I am, in my experience is… crap. Decent hardware, but if anything goes tits up, good luck getting any support in a timely manner.

We still use them, but if you plan on running their gear I’d suggest keeping your own spares on hand.

I almost didn’t believe you about not needing to shut down the Pure Storage, but damn. Radical, coming from SANs that take 15-60 minutes to shut down. Though I can’t help but feel that there must be some performance or space compromises to make that happen.

1 Like

They have onboard battery backup or capacitors or something.

But seriously, performance is great.

Here’s a graph from one of my lightly loaded arrays (currently migrating workload to it, so it’s half prod right now):

Check the latency stats… :smiley:

I’ve had 2 gigabytes/sec out of it over dual 10GbE per controller in testing, which is close to line rate (dual 10GbE tops out around 2.5 gigabytes/sec). The limiting factor is the SFPs I’m using; pretty sure it can do 25GbE if I get some cables for it.

And here’s my in-house (I wrote it) NetApp shutdown process for comparison:

How to shut down and restart Clustered Data ONTAP filers

Currently, XXX-fas02 is a clustered-mode filer setup.

The process to shut down and restore is different from the previous-generation filer (7-Mode, XXX-fas01).

If you run into any issues, the official NetApp documentation is here:

https://kb.netapp.com/app/answers/answer_view/a_id/1003836/~/what-is-the-procedure-for-graceful-shutdown-and-power-up-of-a-storage-system

Shutting down the filer

Shut down the workload (i.e., running VMs, the UCS cluster).

Log into the Service Processor on node 1:

IP: XXXXXXXXXX (XXXXXXX-1-sp, but DNS will be down)

Username and password are in XXXXXXX. Node names must be UPPERCASE.

  • system console
  • Log into the system console using admin and the filer admin password
  • Disable HA so that the shutdown command doesn’t try to fail over and prevent the shutdown:
    • cluster ha modify -configured false
  • halt local -inhibit-takeover true
  • Be sure of what you are doing, then agree to the warning message

The console will drop to the system boot loader. You can optionally issue

  system power off

to prevent the machine from auto-powering on when power is restored.

Repeat the process for the Service Processor on node 2:

IP: XXXXXXXXXX (XXXXXXXXX-sp, but DNS will be down)

Username and password are in XXXXXXX. Node names must be UPPERCASE.

  • system console
  • halt local -inhibit-takeover true -skip-lif-migration true

The console will drop to the system boot loader. You can optionally issue

  system power off

to prevent the machine from auto-powering on/auto-booting when power is restored.

Power up

If you issued the command “system power off”, the filer will not boot when power is restored.

You will need to log into each Service Processor in turn and issue:

  • system power on
  • At the LOADER prompt, issue the command “boot_ontap”
  • Monitor boot progress. Be aware that some failure-to-initiate-takeover messages during boot, before the network comes up, are normal.

If in doubt, call NetApp for support.

1 Like

Good to know; I have never contacted them for support since we do it all internally.

Yeah, internal support is great until you have a failure :slight_smile:

Spares, and more spares!

We do broadcast, so generally we have a hot spare for nearly everything, and then a shelf spare so we can fix the primary if it goes offline while the backup is running.

1 Like

I mean, it depends primarily on the expectations regarding HA and the TCO of the solution, and on how good the local team is: can they do the support themselves, or does an external party have to be paid on top?

In your example, if your single-node SAN is down, everything is down.
With next-business-day support from Dell or HP, this could mean Friday till Monday afternoon, and if you are unlucky and the action plan from your SR owner is wrong, a second attempt is needed.

I would use at least a second SAN system for host-based mirroring, or better, if your local support is capable enough, a two-node Proxmox ZFS-based cluster plus a third Proxmox system for the virtual cluster witness VM and Proxmox Backup Server.

But it must also be said that if a node fails in the two-node cluster, you lose the data written since the last replication snapshot, and you have to take care of your databases’ recovery process if you do not put your databases into hot-backup mode for each snapshot.
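
For illustration, the moving parts of that two-node-plus-witness setup on the Proxmox side look roughly like this (the VMID, node name, schedule and witness IP are all made up):

    # replicate guest 100's ZFS volumes to the second node every 5 minutes
    pvesr create-local-job 100-0 pve2 --schedule '*/5'

    # enrol the third box as an external quorum witness (QDevice),
    # so the cluster keeps quorum when one of the two main nodes dies
    pvecm qdevice setup 192.0.2.30

The shorter the replication schedule, the less you can lose between snapshots.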

With Ceph storage this is better, but then you have other problems, and you need at least one more server.

For the switches I would go with 25Gb/s, especially for Ceph, and cross-chassis LACP (MLAG) is nice to have.
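
On the host side, cross-chassis LACP just looks like one 802.3ad bond with a leg to each switch in the MLAG pair; a minimal sketch for ifupdown2 (as Proxmox uses), with interface names and addressing made up:

    # /etc/network/interfaces - one 25GbE port to each switch
    auto bond0
    iface bond0 inet manual
        bond-slaves enp65s0f0 enp65s0f1
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        bond-miimon 100

    auto vmbr0
    iface vmbr0 inet static
        address 10.0.0.11/24
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0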
