"New" DS4243 + TrueNAS Core = Checksum errors across every disk [Solved!]

I recently picked up a NetApp DS4243 from a local PC recycler to use for my home lab and threw a bunch of 2TB drives into it. The drives were all recognized, with the correct metadata showing up, so I created a pool with two 5-drive vDevs in raidZ-1 and another drive as a hot spare. This wasn’t going to be my final configuration; I just wanted to flex on my coworkers. To verify everything was working as it should, I transferred one or two hundred gigs of files to the new array, and everything was going swimmingly until I noticed a new notification in the TrueNAS GUI warning of a critical issue and that my pool was unhealthy:
Pool BigChungus state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

Fast forward through me trying to figure out exactly what the errors are: zpool status -v and its equivalent in the GUI both show no read or write errors, but a non-zero number of checksum errors. I’ve reseated drives in the caddies, shuffled drives around, and tried various combinations of raidZ, stripe, and single-drive vDevs, all with the same result. After transferring around 200 gigs of files, either internally on the server or from another machine on the same network, I’ll get between one and four checksum errors logged.
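For anyone following along at home, it looks roughly like this. The layout is what zpool status -v prints, but the device names and error counts below are made up for illustration (and trimmed to one vDev):

zpool status -v BigChungus

  pool: BigChungus
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
config:

        NAME            STATE     READ WRITE CKSUM
        BigChungus      ONLINE       0     0     0
          raidz1-0      ONLINE       0     0     0
            da0p2       ONLINE       0     0     2
            da1p2       ONLINE       0     0     1
            da2p2       ONLINE       0     0     0
            da3p2       ONLINE       0     0     0
            da4p2       ONLINE       0     0     0

errors: No known data errors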

The drives in the shelf right now are 11x 2TB Toshiba SAS drives, a 2TB SATA drive, a 1TB Western Digital Green, and a 1TB Western Digital Black. All the SATA drives are using interposers. The SAS drives are all recent pickups from Craigslist, but I get the exact same errors on the WD SATA drives I’ve had since new. The 2TB SAS drives in the server’s hot-swap bays are also identical to what’s in the shelf (they even came from the same CL seller), so I’m strongly suspecting something that isn’t the drives themselves.

The server is an Intel SR2600 with 2x Xeon X5677 CPUs, 96GB of ECC RAM, running TrueNAS Core 12.0-U1 from a USB 2.0 drive. The HBA I’m using to interface with the shelf is a PMC/Sierra PM8001 v3.0. That’s connected to a QSFP or QSFP+ cable that came from the same recycler. It’s largely unlabeled and I found it in a bin of misc. server cables, so that’s a pretty big unknown. Should probably clean the contacts with CRC? The DS4243 has two IOM3 cards. From what I remember the top card was slightly dislodged when I was looking at it at the recycler, so even though I put it back in, maybe it needs to be reseated. The server lives in the warehouse at my work, so I can log into it remotely, but I don’t have hands-on access at the moment to retry anything physical. The shelf has two of the four PSUs installed. I don’t think the PSUs would create such a specific issue, but I can try running with a single unit at a time to see if there’s any change in behavior.

Is there anything else I’m missing? The shelf was unpopulated when I bought it, so all the drives came from elsewhere and are formatted to the standard 512 bytes per sector (rather than NetApp’s usual 520). They’re otherwise recognized and seem to be writing data at full speed (91-130MB/s), at least in my most recent testing with only a small number of drives installed. I’m running a long SMART test on the drives right now (in theory, anyway, since I haven’t found any way to monitor the progress of the test), so I’ll see if anything turns up in the morning once those have finished.
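For reference, this is roughly how I’ve been kicking the tests off from the shell. The device names are just examples, and whether you can actually watch the progress depends on the drive reporting it:

# Start a long (extended) self-test on a drive; repeat per device:
smartctl -t long /dev/da0

# On drives that report it, progress shows up in the self-test log
# or in the self-test execution status:
smartctl -l selftest /dev/da0
smartctl -a /dev/da0 | grep -i "self.test"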

Edit: I should also mention that the checksum errors do not appear simultaneously across all the drives in the pool. They appear one at a time, and do not necessarily show up the same number of times across the drives.

Probably not the drives. I’d first suspect the SAS cables, the HBA (usually overheating), and the backplane, in that order.

3 Likes

When you say the SAS cables, do you mean the QSFP cable between the HBA and the shelf? Or internal to the shelf?

I tried to scrub the pool that had the single 1TB drive and it’s, uh… having a bad time lol.

I’ve read that the PM8001 and similar PMC/Sierra cards have issues when the bus is saturated, so maybe I should order a different HBA and the appropriate cable.

If the QSFP cable were bad, you’d be getting lost packets or some network-related error; the data sent would fail the checksum and simply never make it to the disk.

I’m talking about the cables connecting the backplane to the motherboard, or whatever the fuck is in a netapp box. Never bothered with their extra proprietary mess so I don’t know the details.

The single biggest source of errors I see is bad cables. Much like you posted, I’ll see random errors show up in ZFS on various disks after scrubs, despite all evidence indicating the disks are fine. The SAS controller (which often has a RAID or HBA mode) can also overheat or go bad, also producing errors. Backplanes go bad too, but typically it’s a specific port that’ll produce problems.

There’s some special error counter in SMART that shows up for data-transit issues, but I forget what it is. Possibly a CRC error.
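From memory it’s something along these lines; treat the exact attribute/log names as a guess until you check it against a real drive (device names are examples):

# SATA drives: watch attribute 199 (UDMA_CRC_Error_Count) for link problems:
smartctl -A /dev/ada0

# SAS drives: the PHY event counters (invalid DWORDs, loss of sync, etc.):
smartctl -l sasphy /dev/da0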

This morning I pulled and reseated the QSFP cable, swapped the IOM3 cards and made sure they were securely seated, reinstalled all the drives into the shelf, and recreated the pool in a configuration not dissimilar to what I might use in production.

The first file transfer (42.5GB) went without a hitch, but during the second transfer (95.3GB) I saw 1 checksum error each on drives da0 and da1. Did another file transfer (65GB) and still only the pair of checksum errors. I thought the same tests produced more errors before, but maybe the size of the array slowed down the transfer so it was less prone to over-saturating the controller (if that’s indeed the problem). So I went to remove the directories I had copied over, in preparation for a 1.5TB transfer just to see what a very long run looked like, and noticed that I had many, many more checksum errors. On a second look, it appears those came from the resilvering process.
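For anyone keeping score, resetting the per-device error counters and making ZFS re-read everything already on the pool between runs is just the following (pool name from the first post):

# Zero out the READ/WRITE/CKSUM counters before the next test run:
zpool clear BigChungus

# Re-read and verify all data currently on the pool:
zpool scrub BigChungus
zpool status -v BigChungus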

I ran smartctl -a against the two drives that had the single checksum error (da0 and da1) and didn’t see any errors reported, but at the same time, it doesn’t look like these drives are even capable of logging them?

=== START OF INFORMATION SECTION ===
Vendor:               TOSHIBA
Product:              MG03ACA200    SM
Revision:             4321
Compliance:           SPC-3
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x500605ba001d0030
Serial number:        15G2K792F
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Mon Jan 25 11:31:12 2021 PST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature:     0 C
Drive Trip Temperature:        0 C

Read defect list: asked for grown list but didn't get it
Error Counter logging not supported

Device does not support Self Test logging

I guess next I can pop the lid on the shelf and see if anything is obviously out of place, but because I suspect the controller, I ordered an LSI card and the appropriate SFF-8088/QSFP cable. Hopefully they’ll be here by Friday; otherwise it won’t be until Monday that I can try again.

Not familiar with NetApp, but QSFP+ is typically network/InfiniBand. SFF-8088 is SAS. The cables between the shelf/backplane and the HBA should be SAS. As @Log said, the symptoms you’re experiencing are consistent with bad SAS cables (or HBA, or backplane).

I’m not sure if you’re confusing a QSFP+ cable with a SAS cable, if NetApps transmit SAS across QSFP+ for some reason, or if maybe you have some sort of Fibre Channel there and not SAS at all.

From what I understand, the DS4243 uses QSFP+ for its SAS connections.

The IOM3 is the controller that lives in the shelf (one of two) and connects to my PM8003. The PM8003 card is what’s used in the NetApp controllers, so unless I’m missing something, the hardware should all be compatible.

This might be a dumb question, but if I was using an incompatible cable, would I even be able to recognize and interface with the individual disks in the shelf?

The reason I mention the SFF-8088-to-QSFP cable is because that’s what would be required if I swapped the PM8003 with a more standard LSI HBA. Currently I’m using a straight-through QSFP cable.

Edit: From one of the NetApp service manuals for the DS4243 and similar:

SAS optical multimode QSFP-to-QSFP cables can be used for controller-to-stack and shelf-to-shelf connections, and are available in lengths up to 50 meters.

https://library.netapp.com/ecm/ecm_download_file/ECMP1119629

1 Like

Ah, TIL that they run SAS over QSFP.

As for whether an incompatible cable would even let you see and talk to the disks: probably not, but who knows.

Do SFF-8088-to-QSFP cables even exist? I imagine it would need to be an active adapter.


If I were you, I’d follow @Log’s suggestion and swap parts in that order (cables, then HBA, then backplane) until it works.

For the next troubleshooting step I focused on the HBA again because it’s the easiest thing to access right now. It was installed in the bottom-most PCIe slot with the heat sink pretty close to the bottom of the chassis, so I put it in the top slot and moved my SFP card to the bottom slot in an attempt to get more air across the cooler. Once I did that, I rebuilt the pool and started a 1.5TB transfer to put a long, sustained load on the card. It took 15 minutes of sustained transfer before any checksum errors popped up (1 on da2 and 1 on da5). It’s been almost 25 minutes total with no additional errors, so I’m still suspecting the HBA. If it’s just an airflow issue making the card overheat, I think I’d still rather replace it with one that doesn’t have these thermal issues, since I don’t want to ramp my fans just to cool the card.
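If anyone wants to reproduce the sustained-load part without shuffling real files around, something like this works as a stand-in. The dataset path and size are made up, and /dev/urandom keeps compression from making the write trivial:

# Write ~100GB of incompressible data to keep the HBA busy for a while,
# then see whether any checksum errors accumulated:
dd if=/dev/urandom of=/mnt/BigChungus/loadtest.bin bs=1m count=100000
zpool status -v BigChungus
rm /mnt/BigChungus/loadtest.bin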

Thought that maybe vibration was causing issues since I only had one screw holding each drive in its caddy, so I bought a ton of new screws, put 4 in each drive, and re-ran the test. Still getting errors. :frowning: I’ll pop the lid on the shelf tomorrow and take a look at the internal SAS connections, but I can’t imagine that being the problem since I’m (eventually) getting errors across all the drives in the shelf.

It seems a lot more like a problem with the HBA or maybe (but less likely) a problem with the IOM3/6 controller. The DS4243 doesn’t use internal cables. The controllers are connected directly to the backplane. Did yours come with two controllers? If so you could try swapping to controller B. Make sure you use the port marked with a square and don’t connect anything to the circle port. Also make sure you aren’t connecting anything to the RJ-45 ports.

The above info about the cable is right. The IOM3 and IOM6 NetApp controllers use a QSFP cable. I believe with a NetApp HBA you would just use a regular QSFP DAC. For connecting to any other HBA, QSFP-to-SFF-8088 and QSFP-to-SFF-8644 cables probably exist for exactly that reason. The IOM12 uses mini-SAS connections, but that controller is quite expensive.

Edit: I had a chance to read over the post again. If you get access to the shelf, pull both controllers out as well as the blue blanks and shine a flashlight in to inspect the pins. With the lever locked in place, the controller either seats fully or not at all.

2 PSUs are fine: one in slot 1 and the other in slot 4. You should only need to connect one for the shelf to work, but connecting both is probably good if you don’t have easy access to it. If you want to connect both controllers to the HBA, I believe you use the square port on controller A and the circle port on controller B.

Hope that helps

4 Likes

I’m new to this forum software so pardon if I do something stupid with the quoting.

I’ve got an LSI SAS9201-16e on the way, and the SFF-8088/QSFP cable was delivered today, so once the card shows up, hopefully all this will go away.

Yes, it has both IOM3 controllers. I swapped them around last time I reseated the cable with no change in behavior. Confirmed that the QSFP cable was plugged into the square port, not the circle port. The RJ45 ports are empty. I’ve also not connected the controllers together since I only have the one QSFP cable.

Pins and backplane all look good from what I can tell.

The shelf came with PSUs installed in 1 and 4 (top-left and bottom-right when looking at the back of the shelf). Haven’t bothered messing with these at all since the LED status lights seem to indicate they’re working correctly. Both are plugged in but to the same power strip since I don’t have multiple circuits to plug the PSUs into.

I appreciate the input. So far everything seems to be pointing to the HBA. New one is being delivered on the 30th, so hopefully February is the year of mass, stable storage.

2 Likes

New card was delivered while I was at work, and the cable has been here for a few days, so tomorrow I’ll drop it in and see what happens.

For those interested, the QSFP+ cable I had been using is a Juniper Networks 740-038623 Rev 01.

The HBA I had been using is a PMC-Sierra PM8003 SSC Rev 3.0. I had originally ordered a rev 5.0 card because I heard about issues running the 3.0 cards in servers that weren’t specifically the NetApp servers, but the seller sent this instead. The issues I’d read about with the 3.0 were the card not being recognized or throwing all kinds of errors on boot; that seemed totally different from what I saw here, so I dunno if the 3.0 cards are generally cursed or if I just happened to have an odd issue.

Now that I’m looking at the card, trying to decide if I should remove the heatsink and clean/reapply the thermal paste, I noticed the A/B/C/D port labels on the back of the card. Fairly sure I was using port D instead of port A. You think that matters at all? Seems like, if it did matter, it wouldn’t work at all, but who knows. Might pop it back in and try it out.

I generally reapply the thermal paste on everything I get used, once I confirm the item works, as who knows how long it’s been sitting there. I do this because it makes me feel better, but does it make a difference? No idea. I’d have to figure out whatever pain-in-the-ass software is needed to look at the temps AND come up with some sort of consistent test. Generally, I don’t think it matters too much for cards with tiny IHS/chips to cover.

I personally prefer IC Diamond paste for things I intend to forget about, as it gives temps similar to top-tier pastes like Kryonaut, while also being seemingly stable for many years in my own experience. In comparison, most other pastes should probably be reapplied after about 2 years.

Also thanks for posting pics of the internals. The thing goes way beyond what I was expecting when it comes to being “special”.

Reinstalled the card in a different slot and with the QSFP+ cable in port A instead of port D, with no change (if anything, slightly worse performance since the card was closer to my SFP+ NIC). Decided against reconditioning the cooler since my options are a) tinker with a card I’m going to replace tomorrow morning or b) go home and have dinner lol.

Got the card installed. Here’s some pics of the server and disk shelf. Let’s see if this thing works.

1 Like

So everything looks like it’s been resolved by installing the new HBA. When I booted the server I immediately saw all the disks in the shelf, was able to set up 2 vDevs of 5 disks in a new pool, and ran my usual suite of file transfers to try to provoke an error. The biggest transfers were a 1.5TB folder and a 39GB ZIP file, among other things, and I got zero checksum errors, so that seems to be all resolved. Using Windows file transfer to copy files across our 10gig fiber network I saw peaks at 1GB/s and sustained rates of 350-420MB/s.
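For anyone doing the same swap, the quick sanity checks from the TrueNAS shell are just stock FreeBSD tools (the pool name here is a placeholder):

# Confirm the new HBA sees every disk:
camcontrol devlist

# Map the enclosure (shelf) slots to device names:
sesutil map

# After the big transfers, force a full re-read and make sure the
# CKSUM column stays at zero:
zpool scrub tank
zpool status -v tank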


Next I’ll decide how I want to configure the drives, rebuild the pool, and migrate the data from the drives in the server onto the disk shelf, then move those drives over to the NAS to expand the storage there (right now it’s 5 striped disks with no redundancy lol).

2 Likes

Very good news, glad it worked out!

Expansion planning is always fun… especially when using ZFS at home. If you can find 2 more disks (referring to the two 5-disk vdevs), you could go with two 6-disk vdevs so you still get some striped-write performance benefit and have the option to expand up to 4 vdevs on that shelf. On my shelf I decided to do a 12-disk raidz2 so I could expand the pool with 12 more disks for consistency. Though even with how cheap drives are, buying 12 at a time still hurts… I’m actually thinking about trimming down to fewer, larger drives. The IOM6 I have does support the 8TB drives I have now, and there have been reports it has no problems supporting even larger drives.

Something else… if you’re not planning to use that second controller, you can save about 15-20 watts by disconnecting it. You could just leave it partially in the slot so airflow is unaffected. The second power supply also draws about 30 watts from what I’ve been able to measure. If it is critical that it stays up then the extra power draw is cheap insurance. Otherwise that extra 50ish watts is a pretty good power savings IMO.

2 Likes

Last night I backed up all the current files on my box to a NAS at work so I can wipe the whole system. I have a total of 16 2TB drives, plus a bunch of misc. drives. Based on that, my options are

vDevs  RaidZn  Drives/vDev  Usable Capacity  Hot Spares  Max Capacity  Hot Spares (at max)
2      Z1      8            28 TB            0           42 TB         0
2      Z2      8            24 TB            0           36 TB         0
3      Z1      5            24 TB            1           32 TB         4
3      Z2      5            18 TB            1           24 TB         4
4      Z1      4            24 TB            0           36 TB         0
4      Z2      4            16 TB            0           24 TB         0

These figures assume I’m only using the disk shelf for storage. I still have 5 bays in my server, so taking those into account, the maximum capacity figures and the number of available hot spares change quite a bit:

vDevs  RaidZn  Drives/vDev  Usable Capacity  Hot Spares  Max Capacity  Hot Spares (at max)
2      Z1      8            28 TB            0           42 TB         5
2      Z2      8            24 TB            0           36 TB         5
3      Z1      5            24 TB            1           40 TB         4
3      Z2      5            18 TB            1           30 TB         4
4      Z1      4            24 TB            0           42 TB         1
4      Z2      4            16 TB            0           28 TB         1
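(The usable-capacity column is just vDevs × (drives per vDev − parity drives) × 2TB, before any ZFS overhead. Quick sanity check of one row from the shell:)

# Check the "2 vDevs, raidZ2, 8 drives per vDev" row:
echo $(( 2 * (8 - 2) * 2 ))   # prints 24 (TB, raw, before ZFS overhead)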

That being said, I’m torn between “This is already plenty of storage” and “I’ll probably just get bigger disks in the future”, so I’m not sure how much planning I want/need to do versus building for speed + resiliency. :thinking: If I go with 2 vDevs of RaidZ2, I feel like I’m getting a good compromise of speed and data-loss protection, and I can always buy one or two more drives as hot spares. Is there a benefit to going with more/smaller vDevs beyond making it less expensive to expand the pool? Better odds of a multi-drive failure not all landing in the same vDev?
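If it helps to picture it, the 2x raidZ2 layout I’m leaning towards would look something like this from the CLI (pool and device names are placeholders; in practice I’d build it through the GUI):

# Hypothetical layout: two 8-disk raidZ2 vdevs striped into one pool,
# plus one extra drive attached as a hot spare:
zpool create tank \
    raidz2 da0 da1 da2 da3 da4 da5 da6 da7 \
    raidz2 da8 da9 da10 da11 da12 da13 da14 da15 \
    spare da16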

The smaller the vdev, the more disks you’ll lose to parity and/or redundancy. The most performance would probably come from a stripe of mirrors. I tend to go for the largest number of disks in a vdev for maximum capacity. Hot spares for a home user aren’t really necessary IMO. I have my TrueNAS email me any time there’s an issue, and I use raidZ2, which still gives me 10 disks of capacity with two disks’ worth of parity.

Each vdev added to a pool is striped with the others, so it’s more important that every vdev has enough parity to survive the resilvering process in the event of a disk failure. But when it comes to data security, nothing beats frequent backups…