UCS: Warning: there are pending SEEPROM errors on one or more devices, failover may not complete

03 Dec

After issuing the ‘show cluster state’ command in the UCS CLI, a warning is reported for one of the chassis.

UCS-B # show cluster state
Cluster Id: 0xf122a7f83dba11e0-0x9a4c123573c4f1c4

B: UP, PRIMARY
A: UP, SUBORDINATE

HA READY
Detailed state of the device selected for HA storage:
Chassis 1, serial: FOX1234567A, state: active
Chassis 2, serial: FOX1234567B, state: active
Chassis 5, serial: FOX1234567C, state: active with errors

Fabric B, chassis-seeprom local IO failure:
FOX1234567C READ_FAILED, error: TIMEOUT, error code: 10, error count: 7
Warning: there are pending SEEPROM errors on one or more devices, failover may not complete
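
If you capture this output from a script (for example over SSH), a few lines of Python are enough to flag the affected chassis. This is only a rough sketch: the function names `parse_cluster_state` and `chassis_with_errors` are made up for illustration, and the regular expressions assume the output looks exactly like the sample above, which may differ between UCS firmware versions.

```python
import re

# Sample taken from the 'show cluster state' output shown above.
SAMPLE = """\
HA READY
Detailed state of the device selected for HA storage:
Chassis 1, serial: FOX1234567A, state: active
Chassis 2, serial: FOX1234567B, state: active
Chassis 5, serial: FOX1234567C, state: active with errors

Fabric B, chassis-seeprom local IO failure:
FOX1234567C READ_FAILED, error: TIMEOUT, error code: 10, error count: 7
Warning: there are pending SEEPROM errors on one or more devices, failover may not complete
"""

# Assumed line formats, based only on the sample above.
CHASSIS_RE = re.compile(
    r"^[ \t]*Chassis (\d+), serial: (\S+), state: (.+?)[ \t]*$", re.M
)
FAILURE_RE = re.compile(
    r"^[ \t]*(\S+) (\w+), error: (\w+), error code: (\d+), error count: (\d+)[ \t]*$",
    re.M,
)


def parse_cluster_state(text):
    """Extract chassis states and SEEPROM failure lines from the CLI output."""
    chassis = [
        {"id": int(m.group(1)), "serial": m.group(2), "state": m.group(3)}
        for m in CHASSIS_RE.finditer(text)
    ]
    failures = [
        {
            "serial": m.group(1),
            "status": m.group(2),
            "error": m.group(3),
            "code": int(m.group(4)),
            "count": int(m.group(5)),
        }
        for m in FAILURE_RE.finditer(text)
    ]
    return {
        "chassis": chassis,
        "failures": failures,
        # True when the warning line from this post is present.
        "seeprom_warning": "pending SEEPROM errors" in text,
    }


def chassis_with_errors(state):
    """Return the IDs of chassis whose state is anything other than 'active'."""
    return [c["id"] for c in state["chassis"] if c["state"] != "active"]


if __name__ == "__main__":
    state = parse_cluster_state(SAMPLE)
    print("Chassis needing attention:", chassis_with_errors(state))
    print("SEEPROM warning present:", state["seeprom_warning"])
```

Matching the chassis serial from the failure line against the chassis list tells you which chassis holds the IO module to reset.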

In the sam_techsupportinfo log you’ll see the following message:
Creation Time: 2012-10-12T01:12:21.217
ID: 2712562
Description: device FOX1234567C, error accessing shared-storage
Affected Object: sys/mgmt-entity-B
Trigger: Oper
User: internal
Cause: Device Shared Storage Io Error
Code: E4196537

This is a known Cisco bug, CSCtu17144, and here is what needs to be done.

If the fault condition stays on or keeps being cleared and re-raised, try the following workarounds:
1. Reboot the IO module.
2. Remove and re-seat the IO module. Make sure the module is in contact with the backplane firmly.

I’ve had this problem a couple of times, and resetting the IO module was enough in both cases.


3 Responses

  1. mark hreha says:

    I am experiencing these same issues on one of my clusters that runs firmware 2.02q. This may be the cause of the issues with multipath that crippled the Oracle RAC cluster. In an attempt to rule out this particular blade/chassis I have provisioned another server on a different chassis.

    Host details
    Oracle rac
    OS: rhel6x64
    kernel: 2.6.32-279.14.1.el6.x86_64
    Blade: B200 M3

    UCS4sjc-A# show cluster state
    Cluster Id: 0x84dbbf20c37911e0-0x89d2547fee05f1c4

    A: UP, PRIMARY
    B: UP, SUBORDINATE

    HA READY
    Detailed state of the device selected for HA storage:
    Chassis 1, serial: FOX1502GV2E, state: active with errors
    Chassis 2, serial: FOX1527GQY7, state: active
    Chassis 3, serial: FOX1618GE3L, state: active

    Fabric A, chassis-seeprom local IO failure:
    FOX1502GV2E READ_FAILED, error: TIMEOUT, error code: 10, error count: 8

    Warning: there are pending SEEPROM errors on one or more devices, failover may not complete
    UCS4sjc-A#

    I am planning on upgrading infrastructure firmware to 2.11 tonight.

    Mark Hreha

    • Andrius says:

      Reseat the IOMs one by one on chassis 1. This should clear the errors that you are seeing when running the ‘show cluster state’ command.

      For Oracle RAC there is a known issue with dropped storage. Upgrade to UCS 2.1(1) and fnic driver 1.5.0.20.

  2. Tim H. says:

    This is most frustrating… A bug which puts the environment into a degraded state in everyone’s monitoring application and has been ongoing without a fix for three years. Tens of thousands shipped and the only thing customers are told is to live with it or reseat the hardware periodically. Cisco makes Microsoft look reliable. Going back to c7000 and Flex-10 is looking more likely every day.
