V6.2.0.x, V6.3.0.x and V1.3.0.x Node or Node Canisters Will Automatically Reboot After 208 Days of Uninterrupted Uptime

Flash (Alert)


Abstract

Due to a known Linux kernel issue, Storwize V7000 and Storwize V7000 Unified block node canisters, and SAN Volume Controller nodes, will reboot after running for 208 continuous days since their last power on or software upgrade.

Content

A widely documented Linux kernel issue can result in a kernel panic occurring after 208 days of continuous uptime, due to an internal counter overflow. For Storwize V7000, Storwize V7000 Unified block or SAN Volume Controller nodes or node canisters running V6.2.0.x, V6.3.0.x or V1.3.0.x, this can trigger a self-recovering reboot event once this amount of time has elapsed.
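The widely reported root cause is an overflow in the kernel's scheduler clock arithmetic after roughly 2^54 nanoseconds of uptime, which is where the 208 day figure comes from. The following is a simplified illustration of the arithmetic, not an exact description of the affected code path:

  2^54 nanoseconds = 18,014,398,509,481,984 ns
                   ≈ 18,014,399 seconds
                   ≈ 208.5 days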

Although each reboot event is only expected to take between five and ten minutes, for clusters in which both nodes in an I/O group were last booted in close proximity to one another there is a risk that a loss of host access may occur after 208 days. A typical scenario would be a new installation in which both nodes were powered on simultaneously.

If the node or node canister exceeds an uptime of 208 days, it should continue to operate normally until it is shut down or rebooted. Whilst shutting down or rebooting, there is a risk that the Linux operating system will shut down without allowing the SVC or V7000 software to save its critical metadata to disk. If this happens on two nodes or node canisters at the same time (for example, during a power loss), then cache data may be lost.

All nodes or node canisters which have an uptime of greater than 208 days must be rebooted before attempting a software upgrade to avoid complications during the software upgrade process.

The Software Upgrade Test Utility can be used to determine the uptime of all nodes in a cluster that is exposed to this issue. The following URL contains download and usage instructions for this utility:


http://www-01.ibm.com/support/docview.wss?uid=ssg1S4000585
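As an illustration only, once the utility has been installed on the cluster it is typically run from the cluster CLI, naming the intended target level; the exact upload and invocation steps for each release are described at the URL above, and the target version shown here is an example:

  svcupgradetest -v 6.3.0.2

The output reports the uptime of each node, so any node approaching the 208 day limit can be identified before an upgrade is attempted.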

This issue was first introduced in the V6.2.0.1 release, which shipped on 10 June 2011. For convenience, the following table lists the release date for all code levels that are exposed to this issue, and the earliest date at which 208 days of uptime could have elapsed.

Version Number    Release Date          Earliest possible date that a system running
                                        this release could hit the 208 day reboot

SAN Volume Controller and Storwize V7000 Version 6.2
6.2.0.0           Never shipped to customers
6.2.0.1           10 June 2011          04 January 2012
6.2.0.2           11 July 2011          04 February 2012
6.2.0.3           05 September 2011     31 March 2012
6.2.0.4           07 November 2011      02 June 2012
6.2.0.5           Not Affected

SAN Volume Controller and Storwize V7000 Version 6.3
6.3.0.0           30 November 2011      25 June 2012
6.3.0.1           24 January 2012       19 August 2012
6.3.0.2           Not Affected

Storwize V7000 Unified Version 1.3
1.3.0.0           23 December 2011      18 July 2012
1.3.0.1           06 January 2012       01 August 2012
1.3.0.2           08 February 2012      03 September 2012
1.3.0.3           29 February 2012      24 September 2012
1.3.0.4           Never shipped to customers
1.3.0.5           Not Affected

Workaround:

A full reboot of a node or node canister will reset the affected counter to zero, effectively extending the length of time until the automatic reboot event occurs by an additional 208 days.

Customers who are nearing the 208 day limit but are unable to upgrade to a level containing the fix for this issue are strongly advised to perform a controlled reboot of each node or node canister in the cluster, leaving no less than one hour between rebooting the two nodes in a given I/O group.

Before proceeding with controlled node reboots, it is important to ensure that all host multipathing drivers are correctly configured, as these reboots will trigger multipath failover events between the nodes in an I/O group.
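For example, on a host that uses the native Linux device-mapper multipath driver, the path state can be checked with the following command before each node reboot; equivalent commands exist for SDDPCM and other multipathing drivers, and this is offered only as an illustration rather than the complete set of supported checks:

  multipath -ll

Every mapped volume should show healthy paths through both nodes of the I/O group before the first node is rebooted.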

For each node that is approaching the 208 day limit, the following sequence of steps will perform a full reboot of the node.

Note: Customers with a 2 node SVC cluster, or a Storwize V7000 with only one Control Enclosure, running V6.2.0.3 or earlier are advised to run the following procedure before restarting the nodes. This action plan can only be carried out using the Command Line Interface.

  1. Establish a CLI session to the management IP address.
  2. Run the following commands
    1. svctask chquorum -active 0
    2. svctask chquorum -active 1
    3. svctask chquorum -active 2
  3. If there is a requirement for the active quorum to reside on a particular quorum disk, then the active quorum should be moved back to the desired location (an example session is shown after this list).
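A minimal sketch of this sequence from the cluster CLI is shown below. The lsquorum command is included only to illustrate how to confirm which MDisk or drive holds the active quorum before and after the change; the quorum index numbers are the standard 0, 1 and 2 used in the steps above:

  svcinfo lsquorum
  svctask chquorum -active 0
  svctask chquorum -active 1
  svctask chquorum -active 2
  svcinfo lsquorum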

Reboot using the Service Assistant GUI

  1. Connect to the service assistant GUI using http://<service-ip-address>/service
  2. Select the radio button for the node to be rebooted
  3. From the Actions drop down box, select "Enter Service State" and press Go
  4. Wait for the node to enter service state. This is shown in the 'Node Status' column
  5. From the Actions drop down box, select "Reboot" and press Go
  6. Wait until the node has completed its reboot
  7. From the Actions drop down box, select "Exit Service State" and press Go
    Note that during this procedure it may be necessary to log back in to the service assistant GUI

Reboot using the Command Line Interface

  1. Establish a CLI session to the node's Service Assistant IP address.
  2. Place the node into the Service State by entering "satask startservice".
  3. Wait until the node has successfully entered the Service State.
  4. Enter the command "satask stopnode -reboot".
  5. Wait until the node has completed its reboot.
  6. Remove the node from the Service State by entering "satask stopservice" (a sketch of the complete command sequence is shown below).
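A minimal sketch of this sequence, run against the Service Assistant IP address of the node being rebooted, is shown below. The sainfo lsservicenodes command is included only as one way of confirming the node status between steps and is not part of the required procedure:

  satask startservice
  sainfo lsservicenodes          (confirm the node has entered the Service State)
  satask stopnode -reboot
                                 (wait for the node to complete its reboot, then reconnect)
  sainfo lsservicenodes
  satask stopservice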

Fix:


This issue was resolved in the V6.2.0.5 and V6.3.0.2 PTF releases for SAN Volume Controller and Storwize V7000, and in the V1.3.0.5 PTF release for Storwize V7000 Unified block systems.


Cross reference information

Segment                  Product                             Component  Platform                Version   Edition
Storage Virtualization   SAN Volume Controller               6.3        SAN Volume Controller   6.2, 6.3
Disk Storage Systems     IBM Storwize V7000 Unified (2073)   1.3        IBM Storwize V7000      1.3


Document information

More support for: IBM Storwize V7000 (2076) 6.3
Version: 6.2, 6.3
Operating system(s): IBM Storwize V7000
Reference #: S1004038
Modified date: 2012-04-05
