UCS: Blade is stuck on discovery after UCS firmware upgrade (unidentified FRU)

12 Nov

Here is a pretty common problem in the UCS 2.0 release.
At any stage of a UCS upgrade, one or more blades can go into discovery and never finish it. Depending on the version they can get stuck at any percentage, but usually between 4% and 40%.
Most of the time the cause is corruption in the SEEPROM of the M81KR CNA card. Because of the corruption the checksum fails, UCS can no longer recognize the mezzanine card, and this prevents discovery from finishing.
You can see the following errors when this happens:
Configuration Error: adaptor-inoperable. Discovery State: Insufficiently Equipped.
Adapter 1 in server 1/1 has unidentified FRU 

There are multiple Cisco bugs for this issue (CSCub16754, CSCty34034, CSCub48862, CSCub99354), and I’ve seen it happen on the 2.0(1q), 2.0(2r) and 2.0(3a) releases.
Unfortunately the issue is not fixed and there is no workaround. The good news is that when it occurs the fix is simple and quick and no hardware replacement is needed, but only Cisco TAC, or someone with access to their internal resources, can apply it.

To verify whether corruption has occurred, you can do the following:

  1. SSH to UCSM IP
  2. Enter connect cimc x/y (Chassis/Blade)
  3. Enter mezz1fru (on versions starting from 2.0(3a), enter fru instead)
    If corruption has occurred, the last line of the output will show something like
    ‘Checksum Failed For: Board Area!’
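
If you save the CLI output for several blades, the check in step 3 can be automated. Below is a minimal Python sketch, assuming the failure line always looks like the message quoted above. The function name failed_fru_areas is made up for illustration and is not part of any Cisco tool.

```python
import re
from typing import List

# Pattern modelled on the message quoted in the post:
# "Checksum Failed For: Board Area!"
# Assumption: other FRU areas are reported in the same format.
CHECKSUM_FAILURE = re.compile(r"Checksum Failed For:\s*([^!\r\n]+)", re.IGNORECASE)


def failed_fru_areas(cli_output: str) -> List[str]:
    """Return the FRU areas reported as failing their checksum (empty if none)."""
    return [m.group(1).strip() for m in CHECKSUM_FAILURE.finditer(cli_output)]
```

An empty list means no checksum failure line was found in the captured text. It does not prove the SEEPROM is healthy if the capture was incomplete.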

The other method to check is to look at the logs.

  1. Generate show-tech support logs for the chassis
  2. Extract the log bundle. Inside you’ll see files called MEZZxy_TechSupport.tar, where x is the blade ID and y is the adapter ID.
  3. Extract this file. Inside you’ll find a file called debugdump.
  4. In the debugdump file, look for the line starting with ‘fruprom -s’. After this line you’ll see the FRU values (see below).
    fruprom -s
    Mezz Internal Use Area
    CARD_TYPE : 3
    NUMBER OF MACS : 6
    MAC : 70:81:05:43:55:0D
    Board Info Area (96)
    MFG DATE : 08/25/11
    MFG INFO : Cisco Systems Inc
    PRODUCT NAME : N20-AC0002
    SERIAL NUM : ABC12345678
    PART NUM : 73-11789-09
    FRUFILE ID : AC02
    PART NUM REV : A0
    FAB VERSION : 07
    VID : V03
    CLEI : 0000000000

    The example above shows healthy values. When they are not healthy, you’ll see that some information is missing or that there are odd characters, in most cases ÿ. See below:
    MFG INFO      : ‘Cisco Syÿtems Inc’   <-- note the ÿ here
    MFG INFO      : Cisco Syÿ   <-- ÿ again, and the rest of the line is missing
    PART NUM      : 73-11ÿ89-09   <-- here the part number got corrupted
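
Checking debugdump by eye gets tedious with many blades, so here is a rough Python sketch that flags FRU fields containing characters outside printable ASCII, such as ÿ. It assumes the FIELD : value layout shown above. The function name suspicious_fru_fields is hypothetical. The sketch does not know where the fruprom section ends, so it checks every field-like line after it, and it cannot detect values that are simply truncated.

```python
import re
from typing import List, Tuple

# Matches lines such as "MFG INFO      : Cisco Systems Inc".
# Assumption: field names are upper-case words separated by spaces.
FIELD = re.compile(r"^\s*([A-Z][A-Z ]*?)\s*:\s*(.*)$")


def suspicious_fru_fields(debugdump_text: str) -> List[Tuple[str, str]]:
    """Return (field, value) pairs found after the 'fruprom' line whose
    value contains characters outside printable ASCII."""
    findings = []
    in_fru = False
    for line in debugdump_text.splitlines():
        if line.strip().startswith("fruprom"):
            in_fru = True  # FRU values follow this line
            continue
        if not in_fru:
            continue
        match = FIELD.match(line)
        if not match:
            continue
        name, value = match.group(1), match.group(2)
        # Printable ASCII is the range 32..126; anything else is suspicious.
        if any(not (32 <= ord(ch) <= 126) for ch in value):
            findings.append((name, value.strip()))
    return findings
```

When reading the file, something like open(path, encoding="latin-1").read() avoids decode errors, because latin-1 maps every byte to a character (byte 0xFF becomes ÿ).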

I strongly advise using one of the methods above to check for corruption, and getting it fixed, before doing a UCS upgrade; otherwise you may have an unexpected outage. As there is no permanent fix at the moment, this does not guarantee that it will not happen again, but at least the system is healthy before you start upgrading it.

 
