In the Heat of Battle: Command Center Best Practices

If you have been in IT for any stretch, you will have experienced a significant service outage and the blur of pages, conference calls, analysis, and actions to recover. Usually such a service incident call occurs at 2 AM, and a fog sets in as a diverse and distributed team tries to sort through a problem and its impacts while seeking to restore service. Often, poor decisions are made or ineffective directions taken in this fog, which extends the outage. Further, as part of the confusion, there can be poor communication with your business partners or customers. Even for large companies with a dedicated IT operations team and command center, the wrong actions and decisions can be made in the heat of battle as work is being done to restore service. While you can chalk many of the errors up to either inherent engineering optimism or a loss of orientation after working a complex problem for many hours, to achieve outstanding service availability you must enable crisp, precise service restoration when an incident occurs. Such precision and avoidance of mistakes in ‘the heat of battle’ comes from a clear command line and operational approach. This ‘best practice’ clarity includes defined incident roles and an operational approach communicated and ready well before such an event. Then everyone operates as a well-coordinated team to restore service as quickly as possible.

We explore these best practice roles and operational approaches in today’s post. These recommended practices have been derived over many years at IT shops that have achieved sustained first quartile production performance*. The first step is to have a production incident management process based on an ITIL approach. Some variation and adaptation of ITIL is of course appropriate to ensure the best fit for your company and operation, but ensure you are leveraging these fundamental industry practices and that your team is fully up to speed on them. Further, it is preferable to have a dedicated command center which monitors production and has the resources for managing a significant incident when it occurs.

Assuming those capabilities are in place, there should be clear roles for your technology team in handling a production issue. The incident management roles that should be employed include:

  • Technical leads — there may be one or more technical leads for an incident depending on the nature of the issue and its impact. These leads should have a full understanding of the production environment and be highly capable senior engineers in their specialty. Their role is to diagnose and lead the problem resolution effort in their component area (e.g., storage, network, DBMS). They also must reach out and coordinate with other technical leads to solve those issues that lie between specialties (e.g., DBMS and storage).
  • Service lead — the service lead is also an experienced engineer or manager and one who understands all systems aspects and delivery requirements of the service that has been impacted. This lead will help direct what restoration efforts are a priority based on their knowledge of what is most important to the business. They would also be familiar with and be able to direct service restoration routines or procedures (e.g. a restart). They also will have full knowledge of the related services and potential downstream impacts that must be considered or addressed. And they will know which business units and contacts must be engaged to enact issue mitigation while the incident is being worked.
  • Incident lead — the incident lead is a command center member who is experienced in incident management, has strong command skills, and understands problem diagnosis and resolution. Their general knowledge and experience should extend from the systems monitoring and diagnostic tools available to the application and infrastructure components and engineering tools, as well as a base understanding of the services IT must deliver for the business. The incident lead will drive all problem resolution actions as needed, including:
    • engaging and directing component and application technical leads, teams, and restoration efforts,
    • collecting and reporting impact data,
    • escalating as required to ensure adequate resources and talent are focused on the issue.
  • Incident coordinator – in addition to the incident lead, there should also be an incident coordinator. This command center member is knowledgeable about the incident management process and procedures and handles key logistics, including setting up conference calls, calling or paging resources, drafting and issuing communications, and, importantly, managing to the incident clock for both escalation and task progress. The coordinator can be supplemented by additional command center staff for a given incident, particularly if multiple technical resolution calls are spawned by the incident.
  • Senior IT operations management – for critical issues, it is also appropriate for senior IT operations management to be present on the technical bridge, ensuring proper escalation and response occurs. Further, communications may need to be drafted for senior business personnel providing status, impact, and prognosis. If it is a public issue, it may also be necessary to coordinate with corporate public relations and provide information on the issue.
  • Senior management – as is often the case with a major incident, senior management from all areas of IT and perhaps even the business will look to join the technical call and discussions focused on service restoration and problem resolution. While this should be viewed as a natural desire (perhaps similar to slowing down and staring at a traffic accident), business and senior management presence can be disruptive and prevent the team from reaching a timely resolution. So here is what they are not to do:
    • Don’t join the bridge, announce yourself, and ask what is going on; this will deflect the team’s attention from the work at hand and waste several minutes bringing you up to speed, extending the problem resolution time (I have seen this happen far too often).
    • Don’t look to blame; the team will likely slow or even shut down due to fear of repercussions when honest, open dialogue is needed most to understand the problem.
    • Don’t jump to conclusions on the problem; the team could be led down the wrong path. Few senior managers are up to date enough on the technology and have strong enough problem resolution skills to provide reliable suggestions. If you are one of them, go ahead and let the team leverage your experience, but be careful if your track record says otherwise.

Before we get to the guidelines to practice during an incident, I also recommend ensuring your team has the appropriate attitude and understanding at the start of the incident. Far too often, problems start small or the local team thinks it has them well in hand. The team then avoids escalating the issue or reporting it as a potential critical issue. Meanwhile, critical time is lost, and mistakes made by the local team can compound the issue. By the time escalation to the command center does occur, the customer impact has become severe and the options to resolve are far more limited. I refer to this as trying to put out the fire with a garden hose. It is important to communicate to the team that it is far better to over-report an issue than to report it late. There is no ‘crying wolf’ when it comes to production. The team should first call the fire department (the command center) with a full potential severity alert, and then can go back to putting out the fire with the garden hose. Meanwhile, the command center will mobilize all the needed resources to arrive and ensure the fire is put out. If everyone arrives and the fire is already out, all will be happy. And if the fire is raging, you now have the full set of resources to properly overcome it.

Now let’s turn our attention to best practice guidelines to leverage during a serious IT incident.

Guidelines in the Heat of Battle:

1. One change at a time (and track all changes; a simple change-log sketch follows this list)

2. Focus on restoring service first, but list out the root causes as you come across them. Remember, most root cause analysis and remediation work comes long after service is restored.

3. Ensure configuration information is documented and maintained through the changes

4. Go back to the last known stable configuration (back out all changes if necessary to get back to the stable configuration). Don’t let engineering ‘optimism’ push the team to forward-engineer a new solution unless it is the only option.

5. Establish clear command lines (one for the technical effort, one for the business interface) and ensure full command center support. It is best for the business not to participate in the technology calls — it is akin to watching sausage get made (no one would eat it if they saw it being made). Your business partners will feel the same way about technology if they are on the calls.

6. Overwhelm the problem (escalate and bring in the key resources – yours and the vendor’s). Don’t dribble in resources because it is 4 AM. If you work in IT and you want to be good, this is part of the job. Get the key resources on the call and ensure you hold the vendor to the same bar as your own team.

7. Work in parallel wherever reasonable and possible. This should include spawning parallel activities (and technical bridges) to work multiple reasonable solutions or backups.

8. Follow the clock and use the command center to ensure activities stay on schedule. You must be able to decide when a path is not working and focus resources on better options, and the clock is a key component of that decision. Escalation and communication must also occur with rigor to maintain confidence and bring the necessary resources to bear.

9. Peer plan, review, and implement. Work done in an emergency (here, to restore service and fix a problem) is highly likely to inject further defects into your systems. Too many issues have been compounded during a change implementation when a typo occurs or a command is executed in the wrong environment. Peer planning, review, and implementation will significantly improve the quality of the changes you implement.

10. Be ready for the worst: have additional options and a backout plan for the fix. You will save time and be more creative in driving better solutions if you address potential setbacks proactively rather than waiting for them to happen and then reacting.
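To make guideline 1 concrete (and the clock discipline of guideline 8 along with it), below is a minimal sketch of the kind of change log a command center might keep during an incident. The class names, fields, and output format are illustrative only, not a prescribed tool; most shops capture the same information in their ticketing or chat tooling.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentChange:
    """One change applied during the incident - exactly one at a time."""
    description: str   # what was changed (e.g., "restarted DB listener")
    owner: str         # engineer who made the change
    environment: str   # target environment, confirmed before execution
    backout: str       # how to reverse the change if it does not help
    applied_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class IncidentLog:
    """Running record the incident coordinator keeps against the clock."""
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    changes: list[IncidentChange] = field(default_factory=list)

    def record(self, change: IncidentChange) -> None:
        # Elapsed time since incident start supports escalation decisions.
        elapsed = change.applied_at - self.started_at
        self.changes.append(change)
        print(f"[{elapsed}] {change.environment}: {change.description} "
              f"(owner: {change.owner}, backout: {change.backout})")


# Example usage during an incident call (values are hypothetical):
log = IncidentLog()
log.record(IncidentChange(
    description="Failed primary DB over to standby",
    owner="DBMS technical lead",
    environment="production",
    backout="Fail back to primary once storage issue is cleared",
))
```

Even a record this simple makes it far easier to back out to the last known stable configuration (guideline 4) and to judge from the clock when a resolution path is not converging (guideline 8).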

Recall that the ITIL incident management objective is to ‘restore normal operations as quickly as possible with the least possible impact on either the business or the user, at a cost-effective price.’ These guidelines will help you build a best practice incident management capability.

What would you add or change in the guidelines? How have you been able to achieve excellent service restoration and problem management? I look forward to hearing from you.

P.S. Please note that these best practices have been honed over the years in world-class availability shops for major corporations, with significant contributions from such colleagues as Gary Greenwald, Cecilia Murphy, Jim Borendame, Chris Gushue, Marty Metzker, Peter Josse, Craig Bright, and Nick Beavis (and others).

11 Responses to In the Heat of Battle: Command Center Best Practices

  1. tsholo says:

    Hi,

    I work in an IT operations environment where we are not really required to be proactive. If something breaks, we escalate it. How do we transform such an environment into one where people are more proactive and are allowed to resolve issues before escalating them?

    • Jim D says:

      Tsholo,

      I think there are two sides to how to improve things in your environment. One is leading by example, and this is typically the most important. Good leaders gain influence because their teams and peers know that they will back up what they say with action. People are much more inclined to follow someone with a ‘Do as I do’ approach rather than a ‘Do as I say’ approach. So start first in your own work or your team. You can become more proactive by having the right data. Collect information on the issues and problems in your area. Like most things with systems, the data will indicate clusters. Investigate the clusters further and understand why the issues are occurring. Once they are understood, you can be proactive and initiate action to address the underlying causes.

      The other side is to be a proponent of change. Having the courage to say the right things in the right settings is also a key element of leadership. Be measured and have data when speaking in these situations. Express the effects of the desired change in terms of how it will help your enterprise be more successful (for example, ‘our customers will not experience these service losses if we do X’). Identify well-understood ways to improve your current practices (e.g., ITIL methods), and be a proponent of adopting these best practices, which have helped other companies (likely including your competitors) improve. Find allies and join forces to improve accountability and pride in the work your organization does.

      I think these are the best ways to tackle a reactive and complacent environment where external influences have not yet made change compelling. I wish you the best on this journey.

      Jim Ditmore

  2. Dhruv Ghosh says:

    Hello Jim

    What are the best proactive monitoring methods we can introduce from IT networking standpoint when dealing with contact center environment

    Thanx

    • Jim D says:

      Dear Dhruv,

      I think you want to start with the normal workload monitoring and routing capabilities within the contact center. For example, your call center managers will be anxious to know queue depth and wait time in real time throughout the work day. These capabilities are included with any traditional call center management software. From a network perspective, you particularly want to ensure you have proactive circuit quality and utilization monitoring for all connections to the call center. Call abandonment rates are another key indicator that can identify whether you have issues with either quality or wait time. Monitoring these metrics will give you a solid understanding of the environment. Remember, though, that the appropriate practice for call centers is to make the network connections, the call management software, and the servers highly resilient so that on any single failure you have rapid failover with minimal impact to the call center.
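      As a rough sketch of how those metrics could drive proactive alerts (the thresholds, names, and function below are placeholders to adapt to your own targets, not any particular monitoring product’s API):

      ```python
      # Illustrative thresholds only - tune these to your own contact center targets.
      QUEUE_WAIT_WARN_SECONDS = 120      # average wait time before raising a warning
      ABANDON_RATE_WARN = 0.05           # 5% call abandonment rate
      CIRCUIT_UTILIZATION_WARN = 0.80    # 80% utilization on any call center circuit

      def check_contact_center(offered_calls: int,
                               abandoned_calls: int,
                               avg_wait_seconds: float,
                               circuit_utilization: dict[str, float]) -> list[str]:
          """Return warning messages for the monitoring dashboard or pager."""
          warnings = []
          if offered_calls and (abandoned_calls / offered_calls) > ABANDON_RATE_WARN:
              warnings.append(f"Abandon rate {abandoned_calls / offered_calls:.1%} exceeds target")
          if avg_wait_seconds > QUEUE_WAIT_WARN_SECONDS:
              warnings.append(f"Average queue wait {avg_wait_seconds:.0f}s exceeds target")
          for circuit, utilization in circuit_utilization.items():
              if utilization > CIRCUIT_UTILIZATION_WARN:
                  warnings.append(f"Circuit {circuit} at {utilization:.0%} utilization")
          return warnings

      # Example: one polling interval's worth of metrics (hypothetical values)
      print(check_contact_center(
          offered_calls=400,
          abandoned_calls=28,
          avg_wait_seconds=95,
          circuit_utilization={"MPLS-A": 0.62, "MPLS-B": 0.84},
      ))
      ```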

      I hope this answers your question. Best, Jim Ditmore

  3. Dhruv Ghosh says:

    Thank you Jim……!!

  4. Jeannine Holper says:

    We incorporate all the best practices in the article (so nice to see a confirming document). We are looking for information on best practices for command center logistical support, i.e., onsite vs. offsite. Any thoughts on that?
    We do “follow the sun” on the physical command center support (California, India, Singapore, and Texas). Most of our support today consists of gathering key leads on-site for the six teams that are heavily involved with the installation/activation. We can have up to 14 staff in one room during the first 24 to 36 hours, with additional staff onsite or on call. Our management wants to know if there are best practices for the “lower support” periods. We run a 4-day / 24-hour onsite command center from Friday to Monday, but Sunday is slow going and then Monday picks up with activity. Are there any best practices on supporting major installations for teams in multiple remote settings, where no one would be on-site (i.e., working from a hotel, home, Starbucks, or other remote sites)? Outside of being on your laptop during your support shift, holding conference calls, using tools like Meeting Place, or using IM, are there any other tools that happen to work well and are considered part of “best practice” when operating in this mode?

    • Jim D says:

      Dear Jeannine,

      Good questions. I have some initial thoughts but want to check with some of the IT Operations managers in the industry first. Just one clarification: are you running a 4-day / 24-hour schedule or a 7-day / 24-hour schedule?

      Best, Jim Ditmore

  5. Roy McBrayer says:

    Sir,

    Have you published any additional information on this topic? This has direct bearing on a new Command Center initiative within our organization.

    Thanks,
    Roy M

    • Jim D says:

      Roy,

      I can get you plenty of additional information. What topics or areas are of particular interest? I also recommend you or your team visit recent successful implementations of command centers. JPMorgan has an excellent center, as does Allstate. If you need a contact there, I may be able to arrange one.

      Best, Jim

  6. Daniel P says:

    Hi Jim,
    I’ve been working on the feasibility phase of building an operations centre and we are at a point where we would like to dig deeper in a few areas.

    The area where we have the least amount of consensus is an in-house vs. outsourced Ops Center. This is particularly evident when comparing the security monitoring function (who would prefer to look for an outsourced solution) and data center monitoring and operations (who would prefer to build the capability in house). Right now I’m thinking that if the business cases are truly different, then maybe having a hybrid solution isn’t a bad thing. I’d like to hear more from you on the two options and under what scenarios each makes sense.

    • Jim D says:

      Dear Daniel,

      My apologies for the delay in getting back to you. I think it is reasonable to do a hybrid where the primary command center is in-house and the SOC is outsourced, particularly if your IT budget is stretched. Typically, your applications and underlying support infrastructure are quite custom and require significant knowledge and experience to understand and manage correctly. But most security attacks and threats follow patterns that a supplier who is on top of the latest trends and activity will be able to detect and understand how to thwart or roll back. It is very important that the SOC operate 7×24, which may also be easier to attain with a supplier that has critical mass. I hope you and your team have been able to progress this effectively; don’t hesitate to write me directly if you have further questions. Best, Jim Ditmore
