Frequently, IT shops have a difficult time convincing their business users of the value of IT. It is straightforward for a businessperson to look at the cost of IT, which is readily available, but much harder to find a good reference for the benefits and overall value of IT. This lack of a good reference then leads to further concerns about IT budgets and investment. As CIO, it is imperative that you provide transparency on the benefits and impacts of IT so your customers and the business team can understand the value of IT and properly allocate budget and investment to make the business more successful. Yet IT frequently does a very poor job of measuring its impact and benefits, and when they are measured, it is often only in technical metrics that are not easily translated into understandable business metrics.
A good place to start providing transparency to your business partners is your production services. Typical IT metrics here are expressed only in technical terms, not business terms. For example, you often find availability measured as outage time, as a percentage of time, or as the number of incidents per month. While these metrics are useful for technical analysis, they are poor at communicating business impact. Instead, you should provide your production metrics to your customer in business form. For example, you should report on the key service channels that are delivered to the end customer (e.g., ATMs, retail points of sale, call centers, etc.), and the underlying metric should be customer impact availability. You derive this metric by dividing the number of customer interactions or transactions that were successful by the total number of possible customer interactions or transactions for that time period.
Specifically, take an ATM channel as the example. Let’s assume there are normally 100,000 customer transactions in a month. If the ATMs were down for 1 hour that month, and 1,000 customer transactions would normally have been completed in that hour, then the customer impact availability was 99% (= (100,000 - 1,000)/100,000). Just as importantly, your business also knows that 1,000 customers were impacted by the systems issue. And the timing of the outage matters: a 1 hour outage at peak usage on a Friday evening might impact 2,000 customers, while the same outage at 3 AM on a Sunday might impact only 100. Your business-oriented metric, customer impact availability, would show this difference, whereas a system time availability metric would report the same figure in both cases. This is critical knowledge for you and for your business partners. You will have to collect information on how your customers use your service channels and understand daily, weekly, and seasonal variations, but you should know this anyway so you can better plan capacity upgrades and schedule changes and releases.
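To make the arithmetic concrete, here is a minimal Python sketch of the ATM example. The function names and the 30-day month are my own assumptions for illustration; the transaction counts are the ones used in the example above.

```python
def customer_impact_availability(expected_transactions, failed_transactions):
    """Fraction of expected customer transactions that completed successfully."""
    if expected_transactions <= 0:
        raise ValueError("expected_transactions must be positive")
    if not 0 <= failed_transactions <= expected_transactions:
        raise ValueError("failed_transactions must be between 0 and expected_transactions")
    return (expected_transactions - failed_transactions) / expected_transactions


def time_availability(outage_hours, hours_in_period):
    """Traditional clock-based availability, shown for comparison."""
    return (hours_in_period - outage_hours) / hours_in_period


MONTHLY_TRANSACTIONS = 100_000  # from the ATM example
HOURS_IN_MONTH = 30 * 24        # assumption: a 30-day month

# Transactions lost during a 1 hour outage, depending on when it happens.
scenarios = {
    "typical hour": 1_000,
    "Friday evening peak": 2_000,
    "Sunday 3 AM": 100,
}

for label, lost in scenarios.items():
    cia = customer_impact_availability(MONTHLY_TRANSACTIONS, lost)
    ta = time_availability(1, HOURS_IN_MONTH)
    print(f"{label}: customer impact {cia:.2%}, time-based {ta:.2%}, customers impacted {lost:,}")
```

The time-based figure is the same 99.86% in all three scenarios, while the customer impact figure moves from 99.90% for the Sunday outage to 98.00% for the Friday evening one.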
Further, you should avoid setting internal service agreements (sometimes called OLAs or Operational Level Agreements) and instead focus on Service Level Agreements for the key business services. Otherwise, your team will waste valuable time and resources monitoring internal measures that do not reflect how well you are doing overall (which is what really matters). In fact, allocating and measuring availability among internal teams will often just result in a ping-pong effect, where each group tries to show that the real cause of an incident was another team so that it does not end up counting against them. If you hold everyone accountable for meeting end-to-end key service SLAs and an overall availability mark in terms of customer impact, you will encourage teamwork and better performance.
To further improve performance, identify problem areas, and increase accountability, implement goals around key production performance measures. These would include:
- Change success rate – change is typically a major cause of issues, and to reach strong production performance you need well-designed and well-executed change activity. A good goal here is usually that 99% of all changes are successful (no production issue or customer impact).
- Median time to restore – not only should you look to eliminate issues, but when they occur, the time of service impact should be as brief as possible. It is important to measure the time to restore (and within it, understand the time to detect, correlate, and then fix and restore service). A good goal here is usually 1 hour or less for critical incidents (significant customer impact).
- Chronic or repeat issues – track issues and identify areas where multiple issues are occurring. Often a significant portion of your production issues occur in a handful of applications or infrastructure components. Or it may be that clusters of issues occur in a specific process or phase (e.g., deployment or configuration build) where focused effort can substantially improve production performance.
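As a rough illustration of how these three measures could be computed, here is a small Python sketch. The record layout, field names, and sample data are hypothetical and invented for this example; in practice the inputs would come from whatever change and incident tracking your shop already uses.

```python
from collections import Counter
from statistics import median

# Hypothetical sample records; field names are illustrative assumptions.
changes = [
    {"id": "CHG-1", "caused_incident": False},
    {"id": "CHG-2", "caused_incident": False},
    {"id": "CHG-3", "caused_incident": True},
    {"id": "CHG-4", "caused_incident": False},
]
incidents = [
    {"component": "payments-api", "minutes_to_restore": 35, "critical": True},
    {"component": "payments-api", "minutes_to_restore": 80, "critical": True},
    {"component": "atm-gateway", "minutes_to_restore": 50, "critical": True},
    {"component": "branch-wifi", "minutes_to_restore": 15, "critical": False},
]


def change_success_rate(changes):
    """Share of changes that caused no production issue or customer impact."""
    if not changes:
        raise ValueError("no changes recorded for the period")
    successful = sum(1 for change in changes if not change["caused_incident"])
    return successful / len(changes)


def median_time_to_restore(incidents, critical_only=True):
    """Median minutes to restore service, or None if there were no matching incidents."""
    durations = [
        incident["minutes_to_restore"]
        for incident in incidents
        if incident["critical"] or not critical_only
    ]
    return median(durations) if durations else None


def chronic_components(incidents, min_incidents=2):
    """Components with repeat incidents, most frequent first."""
    counts = Counter(incident["component"] for incident in incidents)
    return [(name, n) for name, n in counts.most_common() if n >= min_incidents]


print(f"Change success rate: {change_success_rate(changes):.0%} (goal: 99%)")
print(f"Median time to restore: {median_time_to_restore(incidents)} minutes (goal: 60 or less)")
print(f"Chronic components: {chronic_components(incidents)}")
```

With this sample data the sketch reports a 75% change success rate, a 50 minute median time to restore for critical incidents, and payments-api as the one component with repeat incidents. The threshold of two incidents for "chronic" is an arbitrary choice for the example and should be tuned to your own volume.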
By implementing better production metrics, particularly business-relevant statistics, you will enable greater understanding for your team and the business. Your business partners will understand the impact and value of IT on their customers and services. This will drive better decisions on how much and where to invest in technology. It will also raise the level of business awareness of your team. When they know an outage impacted 32,000 customers, that comes across as much more material and important than a 40-minute router outage. You can adjust or denote the metrics as appropriate for your business. For example, if you are handling large financial sums, then system outages may be expressed in terms of the sum of the financial amounts delayed or not processed. Further, several large financial services firms with robust shops report not just the customer impacts but also whether there was an industry cause and impact, as well as whether a portion of the functionality remained available (e.g., you could access the ATM and withdraw cash but not see your balance). And with the right intermediate production metrics like change success rate, you can drive accountability and improved quality within your team.
What other production metrics have you used that would augment or improve the list above?
Best, Jim Ditmore