Brocade switch intermittently restarting

Hi All,

I've been happily running a Brocade ICX7250 48P-2X10G network switch 24/7 for a few months now. Recently I noticed my Proxmox cluster nodes regularly rebooting my VMs, and I think I have traced the problem to the Brocade switch. It seems to reboot intermittently, multiple times per day, and I don't see an obvious pattern to it.

I have remote syslog logging to a server, so those logs persist, whereas the show logging command output doesn't seem to survive a reboot. I don't see anything unusual in there other than regular warnings along the lines of:

May 26 19:42:24:A:System: Stack unit 1 Temperature 67.0 C degrees,

I see this steadily rise to about 80 C, then at some stage a reboot occurs and the temperature drops back to around 60 C before starting to climb again.
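
In case anyone wants to do the same digging, here is a minimal Python sketch of one way to pull the temperature trend out of the remote syslog file and flag the big drops that seem to mark a reboot. The log file path and the 15 C drop threshold are assumptions of mine; the regex matches the warning format shown above.

import re
from datetime import datetime

# Assumed path to the file the remote syslog server writes the switch logs to.
LOGFILE = "/var/log/brocade-switch.log"

# Matches lines like:
#   May 26 19:42:24:A:System: Stack unit 1 Temperature 67.0 C degrees,
TEMP_RE = re.compile(r"^(\w{3}\s+\d+\s+[\d:]{8}).*Stack unit 1 Temperature\s+([\d.]+)\s*C")

samples = []
with open(LOGFILE) as fh:
    for line in fh:
        m = TEMP_RE.search(line)
        if m:
            # Syslog timestamps omit the year, so assume the current one.
            ts = datetime.strptime(m.group(1), "%b %d %H:%M:%S").replace(year=datetime.now().year)
            samples.append((ts, float(m.group(2))))

# A sudden large drop between consecutive readings suggests the switch
# cold-started somewhere in between.
for (t1, v1), (t2, v2) in zip(samples, samples[1:]):
    if v1 - v2 > 15:  # arbitrary threshold, tune to your data
        print(f"possible reboot between {t1} and {t2}: {v1} C -> {v2} C")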

I wonder, then, whether these issues are actually thermal related despite staying below the shutdown temperature of 105 C. I also recently discovered that one of my ports with an SFP+ to 10GBASE-T adapter was not blinking its LEDs or providing a network link. On replacing the adapter, the switch quickly went into a boot loop with all LEDs amber. I left the system powered off overnight, and the next day it powered on OK with the adapter in, aside from this more intermittent boot looping.

To test the thermals I tried blocking the fans: the temperature reached around 90 C according to the logs, but the switch just kicked into high fan mode, cooled down, and did not reboot. So maybe under heavy load the switch could be overheating to the shutdown temperature so quickly that it has no time to send a remote syslog message (the highest I see in the logs in normal usage is around 80 C before it drops back, e.g. to 50 C on the next reboot)? But that seems unusual. Are there any other explanations/fixes people can think of?

I'm hoping someone can make sense of this, or has seen something similar on their own switch, and can suggest how I can fix it. Any advice would be very much appreciated at this stage, as aside from this new issue I have been very happy with this device. Thanks for your help!

Here is some further debug output:

show version

Copyright (c) Ruckus Networks, Inc. All rights reserved.

UNIT 1: compiled on Aug 8 2023 at 23:06:54 labeled as SPR08095m

(33554432 bytes) from Primary SPR08095m.bin (UFI)

SW: Version 08.0.95mT213

Compressed Primary Boot Code size = 786944, Version:10.1.26T215 (spz10126)

Compiled on Tue Nov 29 23:13:15 2022

HW: Stackable ICX7250-48-HPOE

==========================================================================

UNIT 1: SL 1: ICX7250-48P POE 48-port Management Module

Serial #UK3845L1DZ

Software Package: ICX7250_L3_SOFT_PACKAGE (LID: fwmINJKnGfb)

Current License: l3-prem-8X10G

P-ASIC 0: type B344, rev 01 Chip BCM56344_A0

==========================================================================

UNIT 1: SL 2: ICX7250-SFP-Plus 8-port 80G Module

==========================================================================

1000 MHz ARM processor ARMv7 88 MHz bus

8 MB boot flash memory

2 GB code flash memory

2 GB DRAM

STACKID 1 system uptime is 3 hour(s) 44 minute(s) 17 second(s)

The system started at 19:38:19 CST Mon May 26 2025

The system : started=cold start

show chassis

The stack unit 1 chassis info:

Power supply 1 (AC - PoE) present, status ok

Power supply 2 not present

Power supply 3 not present

Fan 1 ok, speed (auto): [[1]]<->2

Fan 2 ok, speed (auto): [[1]]<->2

Fan 3 ok, speed (auto): [[1]]<->2

Fan controlled temperature:

Rule 1/2 (MGMT THERMAL PLANE): 91.3 deg-C

Rule 2/2 (AIR OUTLET NEAR PSU): 40.5 deg-C

Fan speed switching temperature thresholds:

Rule 1/2 (MGMT THERMAL PLANE):

Speed 1: NM<-----> 95 deg-C

Speed 2: 85<----->105 deg-C (shutdown)

Rule 2/2 (AIR OUTLET NEAR PSU):

Speed 1: NM<-----> 41 deg-C

Speed 2: 34<----->105 deg-C (shutdown)

Fan 1 Air Flow Direction: Front to Back

Fan 2 Air Flow Direction: Front to Back

Fan 3 Air Flow Direction: Front to Back

Slot 1 Current Temperature: 91.3 deg-C (Sensor 1), 40.5 deg-C (Sensor 2)

Slot 2 Current Temperature: NA

Warning level…: 85.0 deg-C

Shutdown level…: 105.0 deg-C

Color me a sceptic on this one.

I think this is a case of correlation not being causation.

What “adapter” is currently in the switch (are you referring to a “transceiver” - what’s the make and model)? Maybe a compatibility issue?
Did you try a different adapter? DAC, Brocade branded transceiver?

Is the switch boot looping? Or the VM? I was under the impression the issue was a VM boot looping.

Thanks for the swift response, and I appreciate the skepticism; you are quite right that the real issue could be entirely elsewhere!

This is the module I am currently using from fs.com:
Brocade Compatible 10GBASE-T SFP+ Copper 30m RJ-45 Transceiver Module (LOS)

The previous module had been working OK for months but does appear to have died; it was a FlyproFiber transceiver.

Here is the switch monitoring for the current one in case it gives clues:

show optic 1/2/8

Port   Temperature   Voltage        Tx Power        Rx Power        Tx Bias Current
+-----+-------------+--------------+---------------+---------------+---------------+
1/2/8  42.0000 C     3.2500 volts   -004.1930 dBm   -002.9174 dBm   6.016 mA
       Normal        Normal         Normal          Normal          Normal
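
For my own debugging I am also thinking of polling show optic on a schedule, to see whether those DOM readings drift before a crash. A rough sketch is below; it assumes netmiko's ruckus_fastiron device type works against this firmware, and the host and credentials are placeholders.

import time
from datetime import datetime

from netmiko import ConnectHandler  # pip install netmiko

# Placeholder connection details; adjust for your switch. Depending on the
# account you may also need a "secret" plus conn.enable() before the command.
switch = {
    "device_type": "ruckus_fastiron",  # assumption: this profile matches the ICX7250
    "host": "192.0.2.10",
    "username": "admin",
    "password": "password",
}

while True:
    with ConnectHandler(**switch) as conn:
        output = conn.send_command("show optic 1/2/8")
    with open("optic-1-2-8.log", "a") as fh:
        fh.write(f"--- {datetime.now().isoformat()} ---\n{output}\n")
    time.sleep(300)  # poll every 5 minutes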

The only error I see in the latest switch startup logs is:
May 29 11:16:15:C:hmond[733]: Application logmgr.py failed recovery and functionality provided by it will not be available until failure reason is remedied (may require manual intervention)

But I'm not sure what that means?

I do believe the switch is restarting; e.g. here is the latest from show version:
STACKID 1 system uptime is 26 minute(s) 7 second(s)
The system started at 11:14:23 CST Thu May 29 2025

The system : started=cold start

which says it has only been up about half an hour despite no intervention from me. It is intermittent, so hard to debug, but I notice the fans running on high multiple times a day and expect that is when it is rebooting. It is also hard to tell because the logs don't seem to persist across a reboot, despite me forwarding syslog to a remote server (which does at least receive the thermal warnings shown previously).
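
Since show logging is wiped on every restart, one low-effort way I can think of to catch the reboots independently of syslog would be to poll the standard SNMP sysUpTime object and flag whenever it goes backwards. A rough sketch, assuming net-snmp's snmpget is installed and an SNMP v2c read-only community (here "public") is enabled on the switch:

import subprocess
import time
from datetime import datetime

SWITCH = "192.0.2.10"   # placeholder switch IP
COMMUNITY = "public"    # placeholder read-only community

# SNMPv2-MIB sysUpTime.0: timeticks since the SNMP agent (re)started.
SYSUPTIME_OID = "1.3.6.1.2.1.1.3.0"

def uptime_ticks():
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqvt", SWITCH, SYSUPTIME_OID],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

last = uptime_ticks()
while True:
    time.sleep(60)
    try:
        now = uptime_ticks()
    except subprocess.CalledProcessError:
        print(f"{datetime.now().isoformat()}  switch not answering SNMP")
        continue
    if now < last:
        print(f"{datetime.now().isoformat()}  sysUpTime went backwards: reboot detected")
    last = now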

As for the coinciding Proxmox issues:

Here are the most recent Proxmox logs showing when the VM reboots happen (all status OK except for the one with the error indicated):

May 29th 10.15 AM: Node1: Bulk start VMs and Containers (this is the most recent reboot of VMs and coincides with the switch reboot shown above)

May 29th 10.14 AM: Node2: Bulk start VMs and Containers
May 29th 9.16 AM: Node1: Bulk start VMs and Containers
May 29th 9.16 AM: Node2: Bulk start VMs and Containers
May 29th 8.28 AM: Node1: Bulk start VMs and Containers
May 29th 8.25 AM: Node2: Bulk start VMs and Containers

May 29th 8.00 AM: Node1: Bulk start VMs and Containers
May 29th 8.00 AM: Node2: Bulk start VMs and Containers
May 29th 7.13 AM: Node1: Bulk start VMs and Containers
May 25th 9.53 AM: Node4: Bulk start VMs and Containers
May 25th 9.47 AM: Node3: Bulk start VMs and Containers
May 25th 9.47 AM: Node5: Bulk start VMs and Containers
May 24th 6.43 PM: Node4: Bulk start VMs and Containers:
TASK ERROR: cluster not ready - no quorum?
May 24th 6.32 PM: Node2: Bulk start VMs and Containers:
May 24th 6.30 PM: Node5: Bulk start VMs and Containers:

These logs actually surprise me and, as you suggest, point to a different issue. I have noticed the switch seemingly rebooting more regularly over the last few days (based on checking the uptime when I log in and regularly hearing the fans in high-speed mode for short intervals, just as they run when booting). E.g. on May 26th the reported uptime of the switch was only 3 hours, suggesting it had rebooted, yet there is seemingly no coinciding Proxmox VM reboot.

I initially observed the Proxmox rebooting problem. This led me to notice that the SFP+ module for the network link on Node4 was dead. I figured quorum was lost in my 6-node cluster since that node wasn't seen by the others. I keep one node cold/offline, which I realise is problematic, and have recently removed its votes, so quorum is now reached with 3 of the 6 nodes. I have since replaced the SFP+ module as mentioned in my previous post and it has a network link again, aside from the problems reported.
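
(For anyone checking the arithmetic: corosync quorum is just a strict majority of the expected votes, so with one node's vote removed it works out as in the tiny sketch below; the vote counts are simply how my cluster is set up.)

# corosync quorum is a strict majority of the expected votes:
#   quorum = floor(expected_votes / 2) + 1
def quorum(expected_votes: int) -> int:
    return expected_votes // 2 + 1

print(quorum(6))  # 4 -> with all six nodes voting, the cold offline node counts against me
print(quorum(5))  # 3 -> with its vote removed, 3 of the remaining 5 voters keep quorum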

Then I noticed that the network links on the other 4 nodes were going up and down (Dell OptiPlex micro servers with the same hardware), so I considered whether their Realtek network drivers were failing. A bit weird for all of them to fail at once, though. At that point I realised the switch was seemingly rebooting intermittently, as reported, and it makes more sense to me that the network itself is going down with the switch rather than each individual NIC on the nodes failing.

I should also note that I am running HA replication between 3 of the nodes for some VMs, plus regular backup jobs to Proxmox Backup Server, which uses storage from an NFS share served by one of the VMs on the node that originally had the failing SFP+ module (Node 4). Whilst probably complex/non-ideal, this setup had been working fine for a few months until the recent issues identified about a week ago.

I don’t see an immediate pattern in the Proxmox reboot logs above. It looks like each of the nodes is being rebooted at different times and again I did not expect to see that they had actually stayed up over the last few days…

On the other hand, Uptime Kuma over the last week shows the attached logs for Node1. This shows more yellow outages than the Proxmox logs report, so maybe they aren’t telling the full story? I am just reporting the ones shown in the Proxmox Web GUI…


Hopefully with this more detailed overview of my setup you can offer some insight into what I can do to debug this? Thanks again for your help!

Thanks for sharing more data.

So, the switch really does reboot. Well, THAT's a problem. Let's try to stop that. These types of switches are designed to run 24x7 for years on end.

This makes me suspect the transceiver change may be the root cause of your issues.

First observation: all "10GBASE-T SFP+" transceivers are known to be challenging in switches, because they consume more power than regular fiber-optic transceivers.

I searched the fs.com website for the product name and was shown three different products with very different power characteristics, ranging from 1.5 W to 2.9 W, the one consuming the most power being the most affordable (duh). I suspect that is the very one you have.

Do you have more than one of these (10GBASE-T SFP+) transceivers in the switch?

Would you be able to remove that transceiver and run your Proxmox node on an on-board 1 Gb NIC for a while (albeit slower) to see whether the switch stops rebooting?
I am looking for a way to isolate the root cause of your most pressing issue (the switch rebooting).


This seems like a sound approach, thanks! I have now removed the potentially offending transceiver and will report back tomorrow about the reboots.

It is indeed the SFP-10G-T-30 model, which the website says has a maximum power draw of 2.9 W, as you say. Do you reckon it's clear that this is too much for the switch to handle?

I only have one of these; I also use two DAC cables in other ports for 10G links, but those don't run hot. The only reason I use this one is to get a 10G link to the RJ45 port on my motherboard, which doesn't have any spare PCIe slots. It does have 1G links, so I have swapped to using those for now.

I am also considering updating the switch firmware, since there is a newer version available now. Do you reckon this is worth trying? It's just strange that the problem appeared so suddenly after working fine for months…


From your description in your first post, it sounds like the Brocade doesn't like that transceiver, wattage or not.

I am also using a 10GBASE-T SFP+ transceiver in a Brocade (ICX6450) and have temporarily tried that one in an ICX7250 without issues. So, it’s not generally this type of transceiver.
However, I went with the cheapest one I could find on Amazon, just for reference, not an endorsement.

I removed the transceiver and have been monitoring temperature and CPU stats via SNMP since then:

Zoomed view over latest outage:

There look to be two outages/reboots over this period, visible as discrete dropouts in the temperature curve. The CPU curve may not be logging correctly, but it does at least correlate with these events.

So even without the adapter we seem to have the problem. It's interesting to me that the two events occurred at similar times on the morning of each day; I am wondering whether that is significant. If so, it is tricky for me to see what could be causing it. There are no obvious clues in that temperature sensor, at least to me. The power comes from a clean UPS, and its logs show no irregularities in the supply. I wonder whether network demand spikes around those times regularly, due to e.g. some backup job, and that higher demand triggers a crash? Or perhaps my OPNsense router or some VM has the power to trigger such a reboot cycle on the switch, but I don't see how…
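
For reference, here is a rough sketch of how the samples could be captured to a timestamped CSV so the raw data behind the plots can be inspected later. The OIDs are only my best guess at the temperature and 1-minute CPU objects from FOUNDRY-SN-AGENT-MIB; verify them against an snmpwalk of the 1.3.6.1.4.1.1991 subtree on your firmware before trusting the numbers. The host and community are placeholders.

import csv
import os
import subprocess
import time
from datetime import datetime

SWITCH = "192.0.2.10"   # placeholder switch IP
COMMUNITY = "public"    # placeholder read-only community
CSV_PATH = "switch-metrics.csv"

# Best-guess OIDs; confirm with: snmpwalk -v2c -c public <switch> 1.3.6.1.4.1.1991
OIDS = {
    "temperature": "1.3.6.1.4.1.1991.1.1.1.1.18.0",  # believed to be snChasActualTemperature; verify
    "cpu": "1.3.6.1.4.1.1991.1.1.2.1.52.0",          # believed to be snAgGblCpuUtil1MinAvg; verify
}

def snmp_value(oid):
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", SWITCH, oid],
        capture_output=True, text=True,
    )
    return out.stdout.strip() if out.returncode == 0 else ""  # empty cell = no answer

write_header = not os.path.exists(CSV_PATH) or os.path.getsize(CSV_PATH) == 0
with open(CSV_PATH, "a", newline="") as fh:
    writer = csv.writer(fh)
    if write_header:
        writer.writerow(["time"] + list(OIDS))
    while True:
        writer.writerow([datetime.now().isoformat()] + [snmp_value(o) for o in OIDS.values()])
        fh.flush()
        time.sleep(60)  # one sample per minute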

Do you have any further ideas or things to help with debugging? I guess one option would be to try upgrading the firmware, but I am baffled as to why that would suddenly be needed. Maybe some damage occurred from the adapter and I now have an intermittent hardware fault, so I need to replace the switch? Weird, because it otherwise still works fine…

I don't know anything about Brocade products specifically, but I wonder if a failing temperature sensor would cause the system to reboot when it loses its signal, and the temperature cycling gets it working again temporarily? Though I would kind of expect that to get logged somewhere.

If you have enough resolution on your graph, can you tell if the temp drops first or if the crash happens first?

No idea if that’s how it would work, just an idea.


Good theory! Unfortunately I don’t see a way to tell which comes first from the logs I have…

Two more crashes have now occurred since my last graph (one at 6 pm and one at 9 am the next day). It's interesting that they seem to keep coming at roughly the same time in the morning, though not exclusively, given the 6 pm one.

It is starting to look like the reboots occur at round-number times, e.g. 9.05 am or 6.40 pm (rather than, say, 6.42 pm or 9.03 am), which maybe points to something with a pattern/scheduled event rather than just random failures? But I guess we don't have enough data to confirm that yet.
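
If more events pile up, a quick way to test the scheduled-event hunch would be to bucket the reboot times by hour of day and see whether they cluster. A throwaway sketch; the two entries are just the example times mentioned above:

from collections import Counter

# Reboot times of day observed so far, rounded to the minute; extend this
# list as more events occur.
reboot_times = ["09:05", "18:40"]

by_hour = Counter(t.split(":")[0] for t in reboot_times)
for hour in sorted(by_hour):
    print(f"{hour}:00  {'#' * by_hour[hour]}")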

Another interesting clue is that there are actually two reboots in rapid succession for each event. Not sure what this tells us though.

I assume that you didn’t acquire the switch new but that it had a life prior to making it into your home. Also, it seemed happy with you for a while.

A common issue that develops over time is fans and other parts getting dirty and gunked up, potentially hindering airflow to components that really need it.

Did you give this switch a cleaning at some point (open cover, blow compressed air)? Would this be something to explore?

Is there a way to export the data behind the plots? They could be time-stamped in a way that is human readable.

Yeah, I did get it second-hand and have not cleaned it yet. I will try that as my next step, I reckon! I would love to swap the fans for something quieter too, so hopefully this will help me scope out that project a bit as well. Will have a go and report back!

As for the data export, I am currently improving my logging. Here is a screenshot of the event from my latest dashboard setup. Unfortunately, the logged data drops out for all metrics at the same timestamp, so I can't tell which event precedes the other:
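
For anyone wanting to do the same comparison on exported data, it boils down to something like the sketch below: for each metric, find the last good sample before the outage and see which stops earliest. The CSV layout and column names are assumptions about my own export.

import csv
from datetime import datetime

# Assumed export layout: one row per sample with columns "time", "temperature",
# "cpu"; an empty cell means that metric returned nothing at that poll.
OUTAGE = datetime.fromisoformat("2025-06-02T09:05:00")  # placeholder: rough time of the outage

last_before_outage = {}
with open("switch-metrics.csv") as fh:
    for row in csv.DictReader(fh):
        ts = datetime.fromisoformat(row["time"])
        if ts >= OUTAGE:
            continue
        for metric in ("temperature", "cpu"):
            if row[metric]:
                last_before_outage[metric] = ts

# Whichever metric has the earliest "last good sample" dropped out first.
for metric, ts in sorted(last_before_outage.items(), key=lambda kv: kv[1]):
    print(f"{metric}: last good sample before the outage at {ts}")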


I've now cleaned the internals and fans; they were quite dusty. Since doing so, the switch is running about 10 C cooler than before, but it has still rebooted on its own again. The reboot seemed to occur exactly when I logged into my Grafana share to view the stats, which may or may not be a coincidence (triggered by even a small network load?). But I guess this means the issue is still present…

Maybe the dusty fans led to overheating and some component is now failing intermittently? In any case, I guess I should just buy a replacement switch and hope to avoid this in future.


One power supply? These fail because of age and temperature, and may glitch. Electrolytic capacitors are usually the culprit.

Yeah, I now believe the issue to be an intermittent fault in the single power supply. It was quite dusty, and while cleaning it has not improved the situation, the dust may have contributed to the power supply failing in the first place.

I have identified that the power supply LED on the front is amber and according to the manual indicates failure:

Power LED is Amber: Internal power supply has failed. Contact Technical Support.

So I guess the remaining question is, does anyone know if I can replace/fix the power supply or is it time to buy a new switch?

From a cursory Google search:

Link to PDF: https://gzhls.at/blob/ldb/a/b/e/7/6633931b0f02f6c5019f7d4e6b199bc36d38.pdf