V6.2.0.x, V6.3.0.x and V1.3.0.x Nodes or Node Canisters Will Automatically Reboot After 208 Days of Uninterrupted Uptime
Due to a known Linux kernel issue, Storwize V7000, Storwize V7000 Unified block node canisters and SAN Volume Controller nodes will reboot after running for 208 continuous days since their last power on or software upgrade.
A widely documented Linux kernel issue can result in a kernel panic occurring after 208 days of continuous uptime, due to an internal counter overflow. For Storwize V7000, Storwize V7000 Unified block or SAN Volume Controller nodes or node canisters running V6.2.0.x, V6.3.0.x or V1.3.0.x, this can trigger a self-recovering reboot event once this amount of time has elapsed.
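As a rough illustration of the arithmetic, the 208-day figure is consistent with a nanosecond counter that wraps at 2^54. The 54-bit width is an assumption taken from public descriptions of this kernel issue, not something stated in this document. The Python sketch below only converts that assumed value to days.

```python
# Assumption: the affected kernel counter counts nanoseconds and wraps at 2**54.
# This document only says "an internal counter overflow"; the width is not given here.
NS_PER_DAY = 24 * 60 * 60 * 10**9


def days_until_wrap(bits: int = 54) -> float:
    """Return how many days a nanosecond counter of the given bit width lasts."""
    return 2**bits / NS_PER_DAY


if __name__ == "__main__":
    print(f"{days_until_wrap():.1f} days")
```

Under that assumption the counter lasts a little over 208 days, which matches the uptime limit described above.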
Although each reboot event is only expected to take between five and ten minutes, there is a risk of a loss of host access after 208 days in clusters where both nodes in an I/O group were last booted at around the same time. A typical scenario would be if both nodes in a new installation were powered on simultaneously.
If a node or node canister exceeds an uptime of 208 days, it should continue to operate normally until it is shut down or rebooted. Whilst shutting down or rebooting, there is a risk that the Linux operating system will shut down without allowing the SVC or V7000 software to save its critical metadata to disk. If this happens on two nodes or node canisters at the same time (for example, during a power loss), cache data may be lost.
All nodes or node canisters with an uptime greater than 208 days must be rebooted before attempting a software upgrade, to avoid complications during the software upgrade process.
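A monitoring script could classify each node by its uptime along the lines below. This is a minimal Python sketch: the function name, the status labels and the 14-day warning margin are illustrative assumptions, and the uptime values have to be obtained separately.

```python
LIMIT_DAYS = 208  # uptime limit described in this document


def classify_uptime(uptime_days: float, warn_margin_days: float = 14) -> str:
    """Classify a node by uptime in days.

    The labels and the default warning margin are hypothetical, chosen only
    for illustration.
    """
    if uptime_days > LIMIT_DAYS:
        # Past the limit: the node must be rebooted before a software upgrade.
        return "reboot-before-upgrade"
    if uptime_days >= LIMIT_DAYS - warn_margin_days:
        # Close to the limit: plan a controlled reboot or an upgrade.
        return "approaching-limit"
    return "ok"


if __name__ == "__main__":
    for days in (30, 200, 215):
        print(days, classify_uptime(days))
```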
The Software Upgrade Test Utility can be used to determine the uptime of all nodes in a cluster which is exposed to this issue. The following URL contains download and usage instructions for this utility:
This issue was first introduced in the V184.108.40.206 release, which shipped on 10 June 2011. For convenience, the following table lists the published date for all code levels that are exposed to this issue, and the date at which 208 days will have elapsed.
| Version Number | Release Date | Earliest date a system running this release could reach the 208-day reboot |
| SAN Volume Controller and Storwize V7000 Version 6.2 |
| 220.127.116.11 | Never shipped to customers | |
| 18.104.22.168 | 10 June 2011 | 04 January 2012 |
| 22.214.171.124 | 11 July 2011 | 04 February 2012 |
| 126.96.36.199 | 05 September 2011 | 31 March 2012 |
| 188.8.131.52 | 07 November 2011 | 02 June 2012 |
| SAN Volume Controller and Storwize V7000 Version 6.3 |
| 184.108.40.206 | 30 November 2011 | 25 June 2012 |
| 220.127.116.11 | 24 January 2012 | 19 August 2012 |
| Storwize V7000 Unified Version 1.3 |
| 18.104.22.168 | 23 December 2011 | 18 July 2012 |
| 22.214.171.124 | 06 January 2012 | 01 August 2012 |
| 126.96.36.199 | 08 February 2012 | 03 September 2012 |
| 188.8.131.52 | 29 February 2012 | 24 September 2012 |
| 184.108.40.206 | Never shipped to customers | |
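The dates in the last column can be reproduced by adding 208 days to each release date. The short Python sketch below shows the calculation for three of the rows; the function name is illustrative.

```python
from datetime import date, timedelta

UPTIME_LIMIT = timedelta(days=208)


def earliest_reboot_date(start: date) -> date:
    """Return the date on which 208 days have elapsed since `start`."""
    return start + UPTIME_LIMIT


if __name__ == "__main__":
    # Release dates taken from the table above.
    for released in (date(2011, 6, 10), date(2011, 9, 5), date(2011, 11, 30)):
        due = earliest_reboot_date(released)
        print(released.strftime("%d %B %Y"), "->", due.strftime("%d %B %Y"))
```

The same calculation applies to an individual node if the release date is replaced by the date of that node's last full reboot.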
A full reboot of a node or node canister will reset the affected counter to zero, effectively extending the length of time until the automatic reboot event occurs by an additional 208 days.
Customers that are nearing the 208 day limit but are unable to upgrade to a level containing the fix for this issue are strongly advised to perform a controlled reboot of each node or node canister in the cluster, leaving at least one hour between rebooting each of the two nodes in a given I/O group.
Before proceeding with controlled node reboots, it is important to ensure that all host multipathing drivers are correctly configured, as these reboots will trigger multipath failover events between the nodes in an I/O group.
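One way to plan the staggered reboots is to schedule the nodes one at a time with a fixed gap between them. The Python sketch below assumes a hypothetical mapping of I/O group names to node names and only produces a timetable; it does not contact the cluster.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

MIN_GAP = timedelta(hours=1)  # minimum gap recommended in this document


def plan_reboots(
    io_groups: Dict[str, List[str]],
    start: datetime,
    gap: timedelta = MIN_GAP,
) -> List[Tuple[datetime, str, str]]:
    """Return (start time, I/O group, node) entries, one node at a time.

    Scheduling every node sequentially guarantees that the two nodes of any
    I/O group are never rebooted less than `gap` apart.
    """
    if gap < MIN_GAP:
        raise ValueError("gap must be at least one hour")
    schedule = []
    when = start
    for group, nodes in io_groups.items():
        for node in nodes:
            schedule.append((when, group, node))
            when += gap
    return schedule


if __name__ == "__main__":
    # Group and node names are made up for illustration.
    groups = {"io_grp0": ["node1", "node2"], "io_grp1": ["node3", "node4"]}
    for when, group, node in plan_reboots(groups, datetime(2012, 1, 2, 9, 0)):
        print(when.isoformat(), group, node)
```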
For each node that is approaching the 208 day limit, the following sequence of steps will perform a full reboot of the node.
Note: Customers with a two-node SVC cluster, or a Storwize V7000 with only one control enclosure, running V220.127.116.11 or earlier are advised to run the following procedure before restarting the nodes. This procedure can only be carried out using the command line interface.
- Establish a CLI session to the management IP address.
- Run the following commands:
- svctask chquorum -active 0
- svctask chquorum -active 1
- svctask chquorum -active 2
- If there is a requirement for the active quorum to reside on a particular quorum disk, then the active quorum should be moved back to the desired location.
Reboot using the Service Assistant GUI
- Connect to the service assistant GUI using http://<service-ip-address>/service
- Select the radio button for the node to be rebooted
- From the Actions drop down box, select "Enter Service State" and press Go
- Wait for the node to enter service state. This is shown in the 'Node Status' column
- From the Actions drop down box, select "Reboot" and press Go
- Wait until the node has completed its reboot
- From the Actions drop down box, select "Exit Service State" and press Go
Note that it may be necessary to log back in to the service assistant GUI during this procedure.
Reboot using the Command Line Interface
- Establish a CLI session to the node's Service Assistant IP address.
- Place the node into the Service State by entering "satask startservice".
- Wait until the node has successfully entered the Service State.
- Enter the command "satask stopnode -reboot".
- Wait until the node has completed its reboot.
- Remove the node from the Service State by entering "satask stopservice".
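The CLI sequence above could be scripted roughly as follows. This Python sketch uses the three satask commands from the steps above. The `run` and `wait_for` callables are hypothetical hooks that the caller must supply (for example an SSH wrapper and a status poll), because this document does not describe how to detect the node state programmatically.

```python
from typing import Callable, List, Optional, Tuple

# (command, condition to wait for afterwards); the condition names are
# illustrative labels passed to the caller's wait_for hook.
REBOOT_STEPS: List[Tuple[str, Optional[str]]] = [
    ("satask startservice", "service-state"),      # place node in service state
    ("satask stopnode -reboot", "reboot-complete"),  # perform the full reboot
    ("satask stopservice", None),                  # leave service state
]


def reboot_node(
    run: Callable[[str], None],
    wait_for: Callable[[str], None],
) -> List[str]:
    """Issue the reboot commands in order and return the commands issued."""
    issued = []
    for command, condition in REBOOT_STEPS:
        run(command)
        issued.append(command)
        if condition is not None:
            wait_for(condition)  # block until the caller confirms the state
    return issued


if __name__ == "__main__":
    # Dry run: print the commands instead of sending them to a node.
    reboot_node(run=print, wait_for=lambda condition: print("  wait:", condition))
```

Keeping the transport and the state check outside the function means the same sequence can be dry-run, as in the example, before it is pointed at a real node.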
This issue was resolved in the V18.104.22.168 and V22.214.171.124 PTF releases for SAN Volume Controller and Storwize V7000, and the V126.96.36.199 PTF release for V7000 Unified block systems.
|Storage Virtualization||SAN Volume Controller||6.3||SAN Volume Controller||6.2, 6.3|
|Disk Storage Systems||IBM Storwize V7000 Unified (2073)||1.3||IBM Storwize V7000||1.3|