Top 5 Things To Do To Avoid A System Outage Crisis
Server and Storage service outages and disruptions can cost companies anywhere from thousands to millions of dollars per incident. They damage your company's image and, most certainly, client satisfaction. One example is a recent, highly publicized outage within Amazon Web Services (AWS). According to several estimates, this four-hour outage cost S&P 500 companies $150 million and US financial services companies $160 million. Another recent example is the British Airways outage, which resulted in the disruption or cancellation of over 700 flights and a loss of over $100 million.
The following are key steps you should take to avoid Storage system failures and minimize the disruption to your business:
1) Be current on firmware and software levels
Your IT vendor regularly releases fixes and upgrades to their systems' firmware, often in response to problems reported by other companies. Keeping your systems' firmware up to date will help you avoid those same problems in your own systems and will accelerate resolution when you do have a problem. The same concept applies to software. Make sure you are keeping up with any recommended patches and fixes for your business critical software and applications.
a) Find out if your IT vendor has a notification service available via their support web site. If so, sign up for it and make sure your colleagues in the department sign up too! This is a great way to be automatically notified if a critical fix for your environment is available and to also be notified when new upgrades are released.
b) Follow your IT vendor’s support Twitter account. More and more companies today are releasing time sensitive and critical support information using Twitter. Don’t miss out on this great communication vehicle!
c) Upgrading firmware and software levels involves many considerations, including the features and functions you want and need, interoperability, etc. Many companies have tools available to help you make the upgrade decisions that best suit the needs of your specific environment. Speak with your vendor representative or consult their support web site to learn more.
d) Look for a minimum or recommended firmware level page on your IT vendor’s website. Remember that these levels change constantly, typically every 90 days or so.
e) If your system is sending diagnostic data to your vendor, you can likely access very useful information about your systems via web tools made available by your vendor: contractual coverage, health analysis, firmware levels, open tickets by system, etc. Speak with your vendor representative or consult their support web site to learn more.
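As a rough illustration of step d), the sketch below compares installed firmware levels against a vendor's recommended minimum. It assumes levels are plain dotted numbers (for example 8.3.1.2); the system names and levels are invented for the example, and real vendors may use other level formats, in which case the vendor's own tools are the better guide.

```python
from dataclasses import dataclass
from typing import List, Tuple


def parse_level(level: str) -> Tuple[int, ...]:
    """Turn a dotted level such as '8.3.1.2' into (8, 3, 1, 2) so levels compare numerically."""
    return tuple(int(part) for part in level.strip().split("."))


@dataclass
class System:
    name: str
    installed: str    # firmware level currently running (hypothetical data)
    recommended: str  # minimum recommended level published by the vendor


def systems_below_recommended(systems: List[System]) -> List[str]:
    """Return the names of systems whose installed level is below the recommended level."""
    return [
        s.name
        for s in systems
        if parse_level(s.installed) < parse_level(s.recommended)
    ]


# Hypothetical inventory; in practice this would come from your own records
# or from the vendor's web tools.
inventory = [
    System("storage-a", "8.3.1.2", "8.4.0.0"),
    System("storage-b", "8.4.0.1", "8.4.0.0"),
    System("server-c", "1.10.0", "1.9.5"),
]

print(systems_below_recommended(inventory))  # prints ['storage-a']
```

Comparing the levels as tuples of integers rather than as strings matters: as text, "1.10.0" sorts before "1.9.5", which would wrongly flag server-c.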
2) Enable “Call Home” for Your Systems
Server and Storage systems come with the ability to detect emerging situations and notify your IT vendor or you (the client), often before these conditions adversely affect your business operation. This approach often enables all parties to pro-actively address these incidents before they become a crisis, and it also accelerates support response and helps pinpoint issues. Net result: fewer incidents adversely affecting system availability and faster resolution of problems as they occur.
a) Refer to your product documentation to understand what “Call Home” options are available for each specific system you deploy
b) Work with your IT vendor representative to configure “Call Home” for those systems
c) Pro-actively validate and test successful operation within your environment to ensure diagnostic data notifications are working as you expect. Simulate failures, and make sure your firewalls and networking rules are not preventing notifications from reaching your IT vendor. Speed and proactivity are of the essence.
d) Speak with your vendor representative or consult their support web site to learn what additional tools are available for you to control what is being sent to your vendor and how you can leverage useful features from the diagnostic data sent.
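One part of step c), checking that firewalls and networking rules are not blocking outbound notifications, can be sketched as a simple TCP reachability test. The endpoint below is a placeholder, not a real vendor address; the actual hosts and ports for "Call Home" are listed in your product documentation. A successful connection only shows that the network path is open, so the system's own "Call Home" test function remains the real validation.

```python
import socket
from typing import List, Tuple


def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def blocked_endpoints(
    endpoints: List[Tuple[str, int]], timeout: float = 5.0
) -> List[Tuple[str, int]]:
    """Return the endpoints that could not be reached."""
    return [(host, port) for host, port in endpoints if not can_reach(host, port, timeout)]


# Hypothetical endpoint list; replace with the hosts and ports named in
# your vendor's documentation.
CALL_HOME_ENDPOINTS = [("callhome.vendor.example", 443)]
```

Running `blocked_endpoints(CALL_HOME_ENDPOINTS)` from the same network segment as the system gives a quick list of destinations that your firewall or routing rules may be blocking.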
3) Have a Resilient architecture for Business Critical Applications
Unfortunately, despite our best efforts, system and component failures will occur. Deploying system configurations with built-in redundancy and resiliency will prevent those failures from affecting the availability of your critical systems and business. True resiliency is often a multi-tiered solution, involving recovery from component failures (e.g. disk failures, using RAID), system/controller failures (using mirroring and redundancy), and even site failures such as power failures (using disaster recovery strategies). And the only way to ensure you truly have a resilient architecture is to test it, regularly.
a) Having to explain to your superiors that your business is down because you did not consider a resilient architecture for business critical applications may cost you your job. Work with your IT vendor representatives to develop a resiliency architecture that aligns with the availability requirements of your business applications.
b) Work with your IT vendor representatives to understand the resiliency capabilities available with your Server and Storage systems and to deploy them in production. Refer to product documentation, reference architectures, and solution briefs.
c) Develop and deploy a solution that meets your business requirements, expectations and architecture. Clearly define and document your RPO (recovery point objective) and RTO (recovery time objective).
d) Pro-actively validate the resiliency of the system by simulating different failures and validating the continued availability of the business application.
e) Not everything can have a resilient architecture. Businesses often make decisions based on cost/benefit analysis. Document the architecture decisions made in your company, as they can help you change minds in the future, after a crisis.
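To make steps c) and d) concrete, here is a minimal sketch of how the results of a simulated failure could be checked against documented objectives. It assumes RPO is expressed as the maximum tolerable window of lost data and RTO as the maximum tolerable time to restore service; the application name, objectives, and timestamps are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Objective:
    application: str
    rpo: timedelta  # maximum tolerable data loss, expressed as a time window
    rto: timedelta  # maximum tolerable time to restore service


def evaluate_test(
    objective: Objective,
    last_recovery_point: datetime,
    failure_time: datetime,
    service_restored: datetime,
) -> dict:
    """Compare one failure test against the documented RPO and RTO."""
    data_loss = failure_time - last_recovery_point
    downtime = service_restored - failure_time
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= objective.rpo,
        "rto_met": downtime <= objective.rto,
    }


# Hypothetical objectives and test timings.
billing = Objective("billing", rpo=timedelta(minutes=15), rto=timedelta(hours=1))
result = evaluate_test(
    billing,
    last_recovery_point=datetime(2017, 10, 17, 9, 50),
    failure_time=datetime(2017, 10, 17, 10, 0),
    service_restored=datetime(2017, 10, 17, 11, 30),
)
print(result["rpo_met"], result["rto_met"])  # prints True False
```

In this made-up test, ten minutes of data were lost (within the 15-minute RPO), but service took 90 minutes to return, which misses the one-hour RTO. Recording results like these after every test also gives you the documentation recommended in step e).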
4) Know Your Client Support Plan (CSP)
Client Support Plans are developed to help clients understand their current support options, including how to engage the IT vendor's support team most effectively when they have a problem. Using the procedures documented within the CSP will accelerate problem resolution when assistance from Support is required. Who to call, what number to call, contract entitlement numbers, which systems are under warranty and which are not, and whether you have basic or enhanced support are all useful pieces of information to have at your fingertips and well communicated across your team.
a) Work with your IT vendor representative to develop and obtain a Client Support Plan for your Server and Storage Systems
b) Familiarize yourself with the information and procedures within the CSP before problems occur
c) Prepare, prepare and prepare ahead.
5) Effectively Report Problems and Provide Needed Diagnostic Information When Reporting a Problem
The number one factor influencing the time it takes to resolve a reported problem is the amount of time needed to obtain the required diagnostic information from the system. It is difficult to pro-actively prepare for the collection of diagnostic information for all types of problems, given the variety of problems and the possible impact on system performance when capturing this information. Steps can be taken, though, to minimize this impact.
a) Familiarize yourself with the diagnostic tools and procedures available with your systems using the product documentation
b) Pro-actively validate your ability to use these tools and procedures to capture and copy this information from your systems
c) Familiarize yourself with the procedures used to deliver this information to your IT Vendor.
d) Use the information within the Client Support Plan and product documentation to pro-actively understand and deliver the diagnostic data needed for the type of problem you are reporting when you open the problem report with your IT vendor
e) Many IT vendors offer enhanced service and support offerings under which most of this work can be done for you by a certified service representative. Speak with your vendor representative or consult their support web site to learn more.
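Steps b) and c) can be rehearsed ahead of time with something as simple as the sketch below, which packs a directory of already collected diagnostic files into one compressed archive and computes a checksum so the copy can be verified after transfer. This is only a generic illustration: the function name is made up for the example, and the actual collection tools, file formats, and upload procedure are defined by your vendor's documentation.

```python
import hashlib
import tarfile
from pathlib import Path


def bundle_diagnostics(source_dir: Path, archive_path: Path) -> str:
    """Pack source_dir into a gzip-compressed tar archive and return its SHA-256 digest.

    archive_path should be outside source_dir so the archive does not include itself.
    """
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(source_dir, arcname=source_dir.name)

    # Hash the archive in chunks so large bundles do not need to fit in memory.
    digest = hashlib.sha256()
    with open(archive_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Practicing this on a non-production system shows how long the copy takes and how much space the data needs, before you are under the pressure of a real incident.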
By following these simple guidelines, your company will stay out of the press headlines and your CEO will thank you for that. In short, Be Prepared!
About the author
This document was authored by Fabricio Amorim (firstname.lastname@example.org), the Global Director of IBM Storage Technical Support. The information provided is based on his extensive 20-year career handling support crises at IBM.
Version: Version Independent
Operating system(s): Platform Independent
Reference #: S1010709
Modified date: 17 October 2017