Metrics Roundup: Key Techniques and References

As an IT leader either of a function or of a large IT team with multiple functions, what is the emphasis you should place on metrics and how are you able to leverage them to attain improvement and achieve more predictable delivery? What other foundational elements must be in place for you to effectively leverage the metrics? Which metrics are key measures or leading indicators and which ones are lagging or less relevant?

For those of you just joining, this is the fourth post on metrics and in our previous posts we focused on key aspects of IT metrics (transparency, benchmarking, and a scientific approach). You can find these posts in the archives or, better yet, in the pages linked under the home page menu of Best Practices. The intent of the RecipeforIT blog is to provide IT leaders with useful, actionable practices, techniques and guidance to be more successful IT leaders and enable their shops to achieve much higher performance. I plan to cover most of the key practices for IT leaders in my posts, and as a topic is covered, I try to migrate the material into Best Practices pages. So, now back to metrics.

In essence there are three types of relevant metrics:

  • operational or primary metrics – those metrics used to monitor, track, and make decisions on the daily or core work. Operational metrics are the base of effective management and are the fundamental measures of the activity being done. It is important that these metrics are collected inherently as part of the activity, and it is best if the operational team collects and understands these metrics and changes direction based on them.
  • verification or secondary metrics – those metrics used to verify that the work completed meets standards or is functioning as designed. Verification metrics should also be collected and reviewed by the same operational team, though potentially by different members of the team or as part of a broader activity (e.g., a DR test). Verification metrics provide an additional measure of either overall quality or critical activity effectiveness.
  • performance or tertiary metrics – those metrics that provide insight as to the performance of the function or activity. Performance metrics enable insight as to the team’s efficiency, timeliness, and effectiveness.

Of course, your metrics for a particular function should consist of those measures needed to successfully execute and manage the function as well as those measures that demonstrate progress towards the goals of your organization. For example, let’s take an infrastructure function: server management. What operational metrics should be in place? What should be verified on a regular basis? And what performance metrics should we have? While this will vary based on the maturity, scale, and complexity of the server team and environment, here is a good subset:

Operational Metrics:

  • Server asset counts (by type, by OS, by age, location, business unit, etc) and server configurations by version (n, n-1, n-2, etc) or virtualized/non-virtual and if EOL or obsolete
  • Individual, grouped and overall server utilization, performance, etc by component (CPU, memory, etc)
  • Server incidents, availability, customer impacts by time period, trended with root cause and chronic or repeat issues areas identified
  • Server delivery time, server upgrade cycle time
  • Server cost overall and by type of server, by cost area (admin, maintenance, HW, etc) and cost by vendor
  • Server backup attempt and completion, server failover in place, etc

Verification metrics:

  • Monthly sample of the configuration management database server records for accuracy and completeness; ongoing scan of the network for servers not in the configuration management database; regular reporting of all obsolete server configs, with callouts on those exceeding planned service or refresh dates
  • Customer transaction times; regular (every six months) capacity planning and performance reviews of critical business service stacks, including servers
  • Root cause review of all significant customer-impacting events; ratio of auto-detected server issues to manually or user-detected issues
  • DR Tests, server privileged access and log reviews, regular monthly server recovery or failover tests (for a sample)
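
To make the configuration management database (CMDB) checks above concrete, here is a minimal sketch in Python. It assumes you can export two plain lists of server names, one from the CMDB and one from a network scan; the function name, the field names and the sample servers are all hypothetical, and a real shop would pull these lists from its own tooling.

```python
# A minimal sketch, assuming two plain lists of server names are available.
# All names below (reconcile, app01, web09, ...) are hypothetical examples.

def reconcile(cmdb_servers, scanned_servers):
    """Compare CMDB server records against the results of a network scan."""
    cmdb = set(cmdb_servers)
    scanned = set(scanned_servers)
    return {
        # Servers seen on the network that have no CMDB record
        "not_in_cmdb": sorted(scanned - cmdb),
        # CMDB records the scan did not find (retired, powered off, or stale)
        "not_on_network": sorted(cmdb - scanned),
        # Share of scanned servers that do have a CMDB record
        "cmdb_coverage": len(scanned & cmdb) / len(scanned) if scanned else 1.0,
    }


result = reconcile(
    cmdb_servers=["app01", "app02", "db01"],
    scanned_servers=["app01", "db01", "web09"],
)
print(result["not_in_cmdb"])     # servers to add to the CMDB
print(result["not_on_network"])  # records to investigate
print(f"{result['cmdb_coverage']:.0%} of scanned servers are in the CMDB")
```

The coverage figure is one possible verification metric to trend month over month; the two lists are the work queue for the team.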

Performance metrics:

  • Level of standardization or virtualization, level of currency/obsolescence
  • Level of customer impact availability, customer satisfaction with performance, amount of headroom to handle business growth
  • Administrators per server, Cost per server, Cost per business transaction
  • Server delivery time, man hours required to deliver a server
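
Most of the performance metrics above are simple ratios derived from base operational numbers. The sketch below shows one way that derivation might look; the class, the field names and the sample figures are illustrative assumptions, not benchmarks.

```python
from dataclasses import dataclass


@dataclass
class ServerFunctionTotals:
    """Base numbers for the server function (hypothetical field names)."""
    servers: int
    virtualized_servers: int
    administrators: int
    annual_cost: float
    business_transactions: int


def performance_metrics(t):
    """Derive unit measures from the base operational numbers."""
    return {
        "administrators_per_server": t.administrators / t.servers,
        "cost_per_server": t.annual_cost / t.servers,
        "cost_per_business_transaction": t.annual_cost / t.business_transactions,
        "percent_virtualized": 100.0 * t.virtualized_servers / t.servers,
    }


# Made-up figures, for illustration only
totals = ServerFunctionTotals(
    servers=400,
    virtualized_servers=300,
    administrators=8,
    annual_cost=2_000_000.0,
    business_transactions=50_000_000,
)
metrics = performance_metrics(totals)
for name, value in metrics.items():
    print(f"{name}: {value:.4g}")
```

Because every figure here is derived, the quality of the performance metrics depends entirely on the accuracy of the base counts and costs.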

Obviously, if you are just setting out, you will collect only some of these metrics first. As you incorporate their collection and automate the work and reporting associated with them, you can then tackle the additional metrics. And you will vary them according to the importance of different elements in your shop. If cost is critical, then reporting on cost and efficiency plays such as virtualization will naturally be more important. If time to market or availability is critical, then those elements should receive greater focus. Below is a diagram that reflects the construct of the three types of metrics and their relationship to the different metrics areas and score cards:

[Diagram: Metrics Framework]

So, you have your metrics framework. What else is required to be successful in leveraging the metrics?

First and foremost, the culture of your team must be open to alternate views and support healthy debate. Otherwise, no amount of data (metrics) or facts will enable the team to change direction from the party line. If you and your management team do not lead regular, fact-based discussions where course can be altered and different alternatives considered based on the facts and the results, you likely do not have the openness needed for this approach to be successful. Consider leading by example here and emphasizing fact-based discussions and decisions.

Also, you must have defined processes that are generally adhered to. If your group’s work is heavily ad hoc and different each time, measuring what happened the last time will not yield any benefits. If this is the case, you need to first focus on defining, even at a high level, the major IT processes and help your teams adopt them. Then you can proceed to metrics and the benefits they will accrue.

Accountability, sponsorship and the willingness to invest in the improvement activities are also key factors in the speed and scope of the improvements that can occur. As a leader you need to maintain a personal engagement in the metrics reviews and score card results. They should feed into your team’s goals, and you should monitor the progress in key areas. Your sponsorship, and senior business sponsorship where appropriate, will be major accelerators to progress. And hold teams accountable for their results and improvements within their domain.

How does this correlate with your experience with metrics? Any server managers out there that would have suggestions on the server metrics? I expect we will have two further posts on metrics:

  • a post on how to evolve the metrics you measure as you increase the maturity and capability of your team,
  • and one on unit costing and allocations

I look forward to your suggestions.

Best, Jim

 

A Scientific Approach to IT Metrics

In order to achieve world class or first quartile performance, it is critical to take a ‘scientific’ approach to IT metrics. Many shops remain rooted in ‘craft’ approaches to IT where techniques and processes are applied in an ad hoc manner to the work at hand and little is measured. Or, a smattering of process improvement methodologies (such as Six Sigma or Lean) or development approaches (e.g., Agile) are applied indiscriminately across the organization. Frequently, due to a lack of success, managers then tar the process methods or the metrics focus as ineffective.

Most organizations that I have seen that were mediocre performers typically have such craft or ad hoc approaches to their metrics and processes. And this includes not just the approach at the senior management level but at each of the 20 to 35 distinct functions that make up an IT shop (e.g., networking, mainframes, servers, desktops, service desk, middleware, etc., and each business-focused area of development and integration). In fact, you must address the process and metrics at each distinct function level in order to then build strong CIO level process, governance and metrics. And if you want to achieve 1st quartile or world-class performance, a scientific approach to metrics will make a major contribution. So let’s map out how to get to such an approach.

1) Evaluate your current metrics: You can pick several of the current functions you are responsible for and evaluate them to see where you are in your metrics approach and how to adjust to apply best practices. Take the following steps:

  • For each distinct function, identify the current metrics that are routinely used by the team to execute their work or make decisions.
  • Categorize these metrics as either operational metrics or reporting numbers. If they are not used by the team to do their daily work or they are not used routinely to make decisions on the work being done by the team, then these are reporting numbers. For example, they may be summary numbers reported to middle management or reported for audit or risk requirements or even for a legacy report that no one remembers why it is being produced.
  • Is a scorecard being produced for the function? An effective scorecard would have quantitative measures for the deliverables of the function as well as objective scores for function goals that have been properly cascaded from the overall IT goals.

2) Identify gaps with the current metrics: For each of the IT functions there should be regular operational metrics for all key dimensions of delivery (quality, availability, cost, delivery against SLAs, schedule). Further, each area should have unit measures to enable an understanding of performance (e.g., unit cost, defects per unit, productivity, etc.). As an example, the server team should have the following operational metrics:

    • all server asset inventory and demand volumes maintained and updated
    • operational metrics such as server availability, server configuration currency, server backups, server utilization should all be tracked
    • also time to deliver a server, total server costs, and delivery against performance and availability SLAs should be tracked
    • further secondary or verifying metrics such as server change success, server obsolescence, servers with multiple backup failures, chronic SLA or availability misses, etc. should be tracked as well
    • function performance metrics, including cost per server (by type of server), administrators per server, administrator hours to build a server, percent virtualized servers, percent standardized servers, etc., should also be derived

3) Establish full coverage: By comparing the existing metrics against the full set of delivery goals, you can quickly establish the appropriate operational metrics along with appropriate verifying metrics. Where there are metrics missing that should be gathered, work with the function to incorporate the additional metrics into their daily operational work and processes. Take care to work from the base metrics up to more advanced:

    • start with base metrics such as asset inventories and staff numbers and overall costs before you move to unit costs and productivity and other derived metrics
    • ensure the metrics are gathered in as automated a fashion as possible and as an inherent part of the overall work (they should not be gathered by a separate team or subsequent to the work being done)

Ensure that verifying metrics are established for critical performance areas for the function as well. An example of this for the server function could be for the key activity of backups for a server:

    • the operational metrics would be perhaps backups completed against backups scheduled
    • the verifying metric would be twofold:
      • any backups for a single server that fail twice in a row get an alert and an engineering review as to why they failed (typically, for a variety of reasons, 1% or fewer of your backups will fail, and this is reasonable operational performance; but if one server does not get a successful backup for many days, you are likely putting the firm at risk if there is a database or disk failure, thus the critical alert)
      • every month or quarter, 3 or more backups are selected at random, and the team ensures they can successfully recover from the backup files. This will verify everything associated with the backup is actually working.
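
The backup example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the backup history is assumed to be a simple mapping of server name to a list of pass/fail results (oldest first), and the function names, server names and results are all hypothetical.

```python
import random


def completion_rate(history):
    """Operational metric: backups completed against backups attempted."""
    attempts = [ok for results in history.values() for ok in results]
    return sum(attempts) / len(attempts) if attempts else 0.0


def servers_needing_review(history, threshold=2):
    """Verifying metric: servers whose last `threshold` backups all failed."""
    flagged = []
    for server, results in history.items():
        recent = results[-threshold:]
        if len(recent) == threshold and not any(recent):
            flagged.append(server)
    return sorted(flagged)


def pick_restore_tests(servers, sample_size=3, seed=None):
    """Verifying metric: random sample of servers for a recovery test."""
    rng = random.Random(seed)  # pass a seed only if you need a repeatable draw
    pool = sorted(servers)
    return rng.sample(pool, min(sample_size, len(pool)))


# Hypothetical history: True = backup succeeded, oldest result first
history = {
    "db01": [True, True, True],
    "app02": [True, False, False],
    "web03": [False, True, True],
    "web04": [True, False, True],
}
rate = completion_rate(history)
flagged = servers_needing_review(history)
sample = pick_restore_tests(history, sample_size=3)
print(f"completion rate: {rate:.0%}")
print("engineering review needed:", flagged)
print("restore test sample:", sample)
```

Note that web04 has as many failures as it has reason to worry about on its own, yet it is not flagged: only consecutive failures trigger the review, which is the point of the verifying metric.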

4) Collect the metrics only once: Often, teams collect similar metrics for different audiences. The metrics that they use to monitor, for example, configuration currency or configuration to standards can be mostly duplicated by risk data collected against security parameter settings or by executive management data on percent server virtualization. This is a waste of the operational team’s time and can lead to confusing reports where one view doesn’t match another view. I recommend that you establish an overall metrics framework that includes risk and quality metrics as well as management and operational metrics so that all groups agree to the proper metrics. The metrics are then collected once, distilled and analyzed once, and congruent decisions can then be made by all groups. Later this week I will post a recommended metrics framework for a typical IT shop.

5) Drop the non-value numbers activity: For all those numbers that were identified as being gathered for middle management reports or for legacy reports with an uncertain audience: if there is no tie to a corporate or group goal, and the numbers are not being used by the function for operational purposes, I recommend that you stop collecting the numbers and stop publishing any associated reports. It is non-value activity.
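
The categorization from step 1 and the drop rule from step 5 amount to a simple triage. A rough sketch follows; the three yes/no questions are taken from the steps above, while the dictionary keys, the category labels and the sample metrics are hypothetical.

```python
def triage(metric):
    """Classify one metric record (a dict of hypothetical yes/no answers)."""
    if metric["used_in_daily_work"] or metric["used_for_routine_decisions"]:
        return "operational"
    if metric["tied_to_goal"]:
        return "reporting (keep)"
    # Not used by the team and not tied to a corporate or group goal
    return "drop candidate"


# Made-up inventory of numbers a server team might be producing today
inventory = [
    {"name": "backups completed vs scheduled",
     "used_in_daily_work": True, "used_for_routine_decisions": True,
     "tied_to_goal": True},
    {"name": "percent virtualized servers",
     "used_in_daily_work": False, "used_for_routine_decisions": False,
     "tied_to_goal": True},
    {"name": "legacy weekly ticket tally",
     "used_in_daily_work": False, "used_for_routine_decisions": False,
     "tied_to_goal": False},
]

triaged = {m["name"]: triage(m) for m in inventory}
for name, category in triaged.items():
    print(f"{category:18} {name}")
```

The value is less in the code than in forcing an explicit answer to each question for every number the team produces.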

6) Use the metrics in regular review: At both the function team level and function management level the metrics should be trended, analyzed and discussed. These should be regular activities: monthly, weekly, and even daily depending on the metrics. The focus should be on how to improve and on whether, based on the trends, current actions, staffing, processes, etc. are enabling the team to improve and be successful on all goals. A clear feedback loop should be in place to enable the team and management to identify actions to correct issues apparent through the metrics as quickly and as locally as possible. This gives control of the line to the team, and the end result is better solutions, better work and better quality. This is what has been found in manufacturing time and again and is widely practiced by companies such as Toyota in their factories.

7) Summarize the metrics from across your functions into a scorecard: Ensure you identify the key metrics within each function and properly summarize and aggregate the metrics into an overall group score card. Obviously the score card should match your goals and the key services that you deliver. It may be appropriate to rotate in key metrics from a function based on visibility or significant change. For example, if you are looking to improve the overall time to market (TTM) of your projects, it may be appropriate to report on server delivery time as a key subcomponent and, hopefully, a leading indicator of your improving TTM. Including key metrics from the various functions on your score card, even at a summarized level, will result in greater attention and pride being taken in the work, since there are very visible and direct consequences. I also recommend that, on a quarterly basis, you provide an assessment of the progress and perhaps highlights of the team’s work as reflected in the score card.
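
One simple way to roll key function metrics up into a group score card is to compare each against its target and summarize by function. The sketch below assumes each key metric comes with an actual value, a target, and a flag for whether higher is better; the tuple layout, function names, metric names and numbers are all made up for illustration.

```python
def meets_target(actual, target, higher_is_better=True):
    """True when the actual value is at or better than the target."""
    return actual >= target if higher_is_better else actual <= target


def scorecard(key_metrics):
    """Summarize (function, metric, actual, target, higher_is_better) tuples."""
    summary = {}
    for function, metric, actual, target, higher in key_metrics:
        entry = summary.setdefault(function, {"met": 0, "total": 0, "misses": []})
        entry["total"] += 1
        if meets_target(actual, target, higher):
            entry["met"] += 1
        else:
            entry["misses"].append(metric)
    return summary


# Hypothetical key metrics nominated by two functions
card = scorecard([
    ("servers", "availability %", 99.95, 99.9, True),
    ("servers", "delivery days", 12, 10, False),
    ("service desk", "first-call resolution %", 71, 70, True),
])
for function, entry in card.items():
    print(f"{function}: {entry['met']}/{entry['total']} on target, "
          f"misses: {entry['misses'] or 'none'}")
```

Listing the misses by name, rather than only a count, keeps the summarized view tied back to the function that owns the metric.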

8) Drive better results through proactive planning: The team and function management, once the metrics and feedback loop are in place, will be able to drive better performance through ongoing improvement as part of their regular activity. Greater increases in performance may require broader analysis and senior management support. Senior management should hold proactive planning sessions with the function team to enable greater improvement to occur. The assignment for the team should be to take key metrics and determine what would be required to set them on a trajectory to a first quartile level in a certain time frame. For example, you may have a cost reduction goal overall, and within the server function there is a subgoal to achieve greater productivity (at a first quartile level) and reduce the need for additional staff. By asking the team to map out what is required and by holding a proactive planning session on some of the key metrics (e.g., productivity), you will often identify a path that meets local objectives and also contributes to the global objectives. Here, in the server example, you may find that with a moderate investment in automation, productivity can be greatly improved and staff costs reduced substantially. Thus both objectives could be attained by the investment. By holding such proactive sessions, where you ask the team to identify what would need to be done to achieve a trajectory on their key metrics while also considering the key goals and focus at the corporate or group level, you can often identify such doubly beneficial actions.
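
When setting a trajectory in a planning session, it can help to translate the gap into a required rate of improvement per period. A minimal sketch, assuming steady compounding improvement (one of several reasonable ways to draw the trajectory); the function names and the productivity figures are invented, and the target is not a real benchmark.

```python
def required_rate(current, target, periods):
    """Compounded improvement per period needed to move current to target."""
    if current <= 0 or target <= 0 or periods <= 0:
        raise ValueError("current, target and periods must all be positive")
    return (target / current) ** (1 / periods) - 1


def milestones(current, target, periods):
    """Expected metric value at the end of each period along the trajectory."""
    rate = required_rate(current, target, periods)
    return [current * (1 + rate) ** p for p in range(1, periods + 1)]


# Illustrative only: move from 50 to 100 servers per administrator in 8 quarters
rate = required_rate(50, 100, 8)
path = milestones(50, 100, 8)
print(f"required improvement per quarter: {rate:.1%}")
for quarter, value in enumerate(path, start=1):
    print(f"Q{quarter}: {value:.1f} servers per administrator")
```

The quarterly milestones give the team and management a concrete line to review progress against, rather than a single end-state number.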

By taking these steps, you will employ a scientific approach to your metrics. If you add a degree of process definition and maturity, you will make significant strides toward controlling and improving your environment in a sustainable way. This will build momentum and enable your team to enter a virtuous cycle of improvement and better performance. And then, if you add process improvement techniques to the mix (in moderation and with the right technique for each process and group), you will accelerate your improvement and results.

But start with your metrics and take a scientific approach. In the next week, I will be providing metrics frameworks that have stood well in large, complex shops along with templates that should help the understanding and application of the approach.

What metrics approaches have worked well for you? What keys would you add to this approach? What would you change?

Best, Jim