Need Advice: V100 ECC retired page count exceeds 63 and InfoROM is corrupted

I just bought a used V100 16GB PCIe as a learning tool for CUDA/deep learning.
My long-term plan is to get a second one and use NVLink to learn how to do DL data/model parallelism.

I ran NVIDIA’s diagnostic test and it reported two errors (picture attached):

  1. Retired Page Error exceeds 63
  2. Corrupted InfoROM

During the test, the GPU temperature was about 62C with the TDP set to the maximum (250W).

Questions:

  1. Should I be worried?
  2. Are there any other quick tests I should run?
  3. Is there any way to tell if it has been used for mining? (Are the retired pages caused by mining?)

Thank you all in advance.

Disclaimer: I’ve never used one, but I’ve dealt with enough enterprise Nvidia hardware to know just enough to be dangerous.

With that out of the way, I know a bit about page retirement/row remapping. When the card's memory hits a double-bit error or a cell is failing, the address gets blacklisted and the entry is stored in the InfoROM. Nvidia has documentation here which backs that up. That alone would worry most people, but more telling is that the InfoROM itself seems to be corrupt, which may also be what is causing the page retirement/row remapping error. So you should try using nvflash to reflash the card with a V100 BIOS (TPU has a couple on offer here), which may get rid of the InfoROM error. Run the diagnostic again afterwards to make sure everything is good. That should quickly tell you whether or not your used V100 is junk.
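For what it's worth, here is how I understand the retirement policy, written as a toy Python model rather than anything the driver actually runs. I'm assuming an address is blacklisted after one double-bit error or two single-bit errors at the same address, which is my reading of Nvidia's page retirement documentation. `RetirementModel` and `SBE_LIMIT` are names I made up for illustration, and the model tracks addresses instead of whole pages to keep it short.

```python
from collections import Counter

# Assumption: two single-bit errors at one address trigger retirement.
SBE_LIMIT = 2


class RetirementModel:
    """Toy model of ECC page retirement (hypothetical, for illustration)."""

    def __init__(self):
        self.sbe_counts = Counter()  # single-bit errors seen per address
        self.retired = {}            # address -> cause of retirement

    def record_error(self, address, double_bit):
        """Record one ECC error and retire the address if the policy says so."""
        if address in self.retired:
            return  # already blacklisted, nothing more to do
        if double_bit:
            # Assumption: a single double-bit error retires immediately.
            self.retired[address] = "double_bit"
            return
        self.sbe_counts[address] += 1
        if self.sbe_counts[address] >= SBE_LIMIT:
            self.retired[address] = "single_bit"
```

On a real card the retired list is what ends up stored in the InfoROM, which is why a corrupt InfoROM makes the reported count hard to trust.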

A few retired pages over the lifetime of a card is normal, but more than 63, as dcgmi reports, probably indicates mining if it still sticks around. If you use the card as is, the worst case is that your job completes but the results contain a bunch of corruption and miscalculations. For hobby purposes it may be borderline fine, but that would be enough to get me to return it.
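If you want to keep an eye on the count without rerunning the full diagnostic, something like the Python sketch below could help. It assumes your driver's nvidia-smi supports the `retired_pages.*` query fields (check `nvidia-smi --help-query-gpu` to confirm). `parse_retired_pages`, `query_gpu`, and `RETIRED_PAGE_LIMIT` are my own names, and the limit of 63 is simply taken from the error message in this thread.

```python
import csv
import io
import subprocess

# Taken from the diagnostic message in this thread, not from any API.
RETIRED_PAGE_LIMIT = 63

# Query fields assumed to be supported by the installed nvidia-smi.
QUERY_FIELDS = (
    "retired_pages.single_bit_ecc.count,"
    "retired_pages.double_bit.count,"
    "retired_pages.pending"
)


def _to_int(field):
    """Return the field as an int, or None for values such as "[N/A]"."""
    field = field.strip()
    return int(field) if field.isdigit() else None


def parse_retired_pages(csv_text):
    """Parse `--format=csv,noheader` output, one row per GPU."""
    gpus = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        sbe, dbe = _to_int(row[0]), _to_int(row[1])
        total = None if sbe is None or dbe is None else sbe + dbe
        gpus.append({
            "single_bit": sbe,
            "double_bit": dbe,
            "pending": row[2].strip(),
            "total": total,
            "over_limit": total is not None and total > RETIRED_PAGE_LIMIT,
        })
    return gpus


def query_gpu():
    """Run nvidia-smi and parse its output (needs the NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_retired_pages(result.stdout)
```

Calling `query_gpu()` on the machine with the card returns one dict per GPU. If `over_limit` stays true after a reflash, or the counts keep climbing between runs, I would treat that as a sign the memory really is failing.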

Thank you very much for the tip.
I will do a bit more testing as you suggested.
