One of the most difficult issues that a technology leader faces is addressing systems availability or ‘production’. This reference page provides best practices that have worked in both large and medium size organizations, even when the availability issues and performance seemed intractably poor. And such performance is not an isolated situation, as constrained IT shops trying to manage complex systems often end up increasing the technical debt and miss solving the root causes.
Production issues, as Microsoft and Google can tell you, impact even cloud email apps. Earlier this summer in 2013, Microsoft took an entire weekend to full recover its cloud Outlook service. Perhaps you noted the issues earlier in 2013 in financial services where Bank of America experienced internet site availability issues. Unfortunately for Bank of America that was their second outage in 6 months, though they are not alone in having problems as Chase suffered a similar production outage on their internet services the week following. And these are regular production issues, not the unavailability of websites and services due to a series of DD0S attacks.
Perhaps 10 or certainly 15 years ago, such outages with production systems would have resulted in far less notice by their customers as the front office personnel would have worked alternate systems and manual procedures until the systems were restored. But with customers accessing the heart of most companies systems now through internet and mobile applications, typically on a 7×24 basis, it is very difficult to avoid direct and widespread impact to customers in the event of a system failure. Your production performance becomes very evident to your customers. And your customers’ expectations have continued to increase such that they expect your company and your services to be available pretty much whenever they want to use them. And while being available is not the only attribute that customers value (usability, feature, service and pricing factor in importantly as well) companies that consistently meet or exceed consumer availability expectations gain a key edge in the market.
So how do you deliver to current and future rising expectations around availability of your online and mobile services? And if both BofA and Chase, which are large organizations that offer dozens of services online and have massive IT departments have issues delivering consistently high availability, how can smaller organizations deliver compelling reliability?
And often, the demand for high availability must be achieved in an environment where ongoing efficiencies have eroded the production base and a tight IT labor market has further complicated obtaining adequate expertise. If your organization is struggling with availability or you are looking to achieve top quartile performance and competitive service advantage, here’s where to start:
First, understand that availability, at its root, is a quality issue. And quality issues can only be changed if you address all aspects. You must set quality and availability as a priority, as a critical and primary goal for the organization. And you will need to ensure that incentives and rewards are aligned to your team’s availability goal.
Second, you will need to address the IT change processes. You should look to implement an ITSM change process based on ITIL. But don’t wait for a fully defined process to be implemented. You can start by limiting changes to appropriate windows. Establish release dates for major systems and accompanying subsystems. Avoid changes during key business hours or just before the start of the day. I still remember the ‘night programmer’ at Ameritrade at the beginning of our transformation there. Staying late one night as CIO in my first month, I noticed two guys come in at 10:30 PM. When I asked what they did, they said ‘ We are the night programmers. When something breaks with the nightly batch run, we go in and fix it.’ And done with no change records, minimal testing and minimal documentation. Of course, my hair stood on end hearing this. We quickly discontinued that practice and instead made changes as a team, after they were fully engineered and tested, not in the middle of the night on the fly. I would note that combining this action with a number of other measures mentioned here enabled us to quickly reach a stable platform that had the best track record for availability for all online brokerages.
Importantly, you should ensure that adequate change review and documentation is being done by your teams for their changes. Ensure they take accountability for their work and their quality. Drive to an improved change process with templates for reviews, proper documentation, back out plans, and validation. Most failed changes are due to issues with the basics: a lack of adequate review and planning, poor change documentation of deployment steps, or missing or ineffective validation, or one person doing an implementation in the middle of the night when you should have at least two people doing it together (one to do, and one to check).
Also, you should measure the proportion of incidents due to change. If you experience mediocre or poor availability and failed changes contribute to more than 30% of the incidents, you should recognize change quality is a major contributor to your issues. You will need to zero in on the areas with chronic change issues. Measure the change success rate (percentage of changes executed successfully without production incident) of your teams. Publish the results by team (this will help drive more rapid improvement). Often, you can quickly find which of your teams has inadequate quality because their change success rate ranges from a very poor mid-80s percentage to a mediocre mid-90s percentage. Good shops deliver above 98% and a first quartile shop consistently has a change success rate of 99% or better.
Third, ensure all customer impacting problems are routed through an enterprise command center via an effective incident management process. An Enterprise Command Center (ECC) is basically an enterprise version of a Network Operations Center or NOC, where all of your systems and infrastructure are monitored (not just networks). And the ECC also has capability to facilitate and coordinate triage and resolution efforts for production issues. An effective ECC can bring together the right resources from across the enterprise and supporting vendors to diagnose and fix production issues while providing communication and updates to the rest of the enterprise. Delivering highly available systems requires an investment into an ECC and the supporting diagnostic and monitoring systems. Many companies have partially constructed the diagnostics or have siloed war rooms for some applications or infrastructure components. To fully and properly handle production issues requires consolidating these capabilities and extending their reach. If you have an ECC in place, ensure that all customer impacting issues are fully reported and handled. Underreporting of issues that impact a segment of your customer base, or the siphoning off of a problem to be handled by a local team, is akin to trying to handle a house fire with a garden hose and not calling the fire department. Call the fire department first, and then get the garden hose out while the fire trucks are on their way.
Fourth, you must execute strong root cause and followup. These efforts must be at the individual issue or incident level as well as at a summary or higher level. It is important to not just get focused on fixing the individual incident and getting to root cause for that one incident but to also look for the overall trends and patterns of your issues. Are they cluster with one application or infrastructure component? Are they caused primarily by change? Does a supplier contribute far too many issues? Is inadequate testing a common thread among incidents? Are your designs too complex? Are you using the products in a mainstream or unique manner – especially if you are seeing many OS or product defects? Use these patterns and analysis to identify the systemic issues your organization must fix. They may be process issues (e.g. poor testing), application or infrastructure issues (e.g., obsolete hardware), or other issues (e.g., lack of documentation, incompetent staff). Track both the fixes for individual issues as well as the efforts to address systemic issues. The systemic efforts will begin to yield improvements that eliminate future issues.
These four efforts will set you on a solid course to improved availability. If you couple these efforts will diligent engagement by senior management and disciplined execution, the improvements will come slowly at first, but then will yield substantial gains that can be sustained.
You can achieve further momentum with work in several areas:
- Document configurations for all key systems. If you are doing discovery during incidents it is a clear indicator that your documentation and knowledge base is highly inadequate.
- Review how incidents are reported. Are they user reported or did your monitoring identify the issue first? At least 70% of the issues should be identified first by you, and eventually you will want to drive this to a 90% level. If you are lower, then you need to look to invest in improving your monitoring and diagnostic capabilities.
- Do you report availability in technical measures or business measures? If you report via time based systems availability measures or number of incidents by severity, these are technical measures. You should look to implement business-oriented measures such as customer impact availability. to drive great transparency and more accurate metrics.
- In addition to eliminating issues, reduce your customer impacts by reducing the time to restore service (Microsoft can certainly stand to consider this area given their latest outage was three days!). For mean time to restore (MTTR – note this is not mean time to repair but mean time to restore service), there are three components: time to detect (MTTD), time to diagnose or correlation (MTTC), and time to fix (time to actually implement the fix to restore service or MTTF). An IT shop that is effective at resolution normally will see MTTR at 2 hours or less for its priority issues where the three components each take about 1/3 of the time. If your MTTD is high, again look to invest in better monitoring. If your MTTC is high, look to improve correlation and diagnostic tools, systems documentation or engineering knowledge. And if your MTTF is high, again look to improve documentation (especially of the fix path) or engineering knowledge but also look to automate recovery or failover procedures.
- Consider investing in greater resiliency for key systems. It may be that customer expectations of availability exceed current architecture capabilities. Thus, you may want to invest in greater resiliency and redundancy or build a more highly available platform.
As you can see, providing robust availability for your customers is a complex endeavor. By implementing these steps, you can enable sustainable and substantial progress to top quartile performance and achieve business advantage in today’s 7×24 world.
What would you add to these steps? What were the key factors in your shop’s journey to high availability?
Best, Jim Ditmore