Special thanks to Matt Minor (MattMinor86) for the submission.
In the tech field, we often find ourselves in reactive situations, with staff, managers, and team leads breathing down our necks while we try to figure out what happened and get it corrected immediately. It’s stressful. We’re fighting the panic and struggling with whether to fight or flee. We’ve had better times.
If only there were some kind of checklist or guideline to follow in situations like these. What if there were a list of ten steps to better organize our troubleshooting procedure? Luckily, it’s right here.
The steps below outline a methodology: The 10 Steps of Effective Troubleshooting. This process is not specific to technology and could easily be modified to work for just about any profession or trade.
The 10 Steps of Effective Troubleshooting
1. Prepare
This could mean getting required tools ready or pulling out product documentation. Though valuable, these are not my primary focus in a troubleshooting scenario. The preparation stage for me is mental: getting my attitude in check and shifting to a CCC mindset (cool, calm, collected).
There is a solution and I can fix this. Jumping straight into a situation without any preparation can be catastrophic. Rebooting a server the second the Internet goes down isn’t the best approach. Get yourself prepared to tackle the situation first, and you just might realize that your network cable was unplugged.
2. Outline a Damage Control Plan
We’re often under huge amounts of pressure to just get things back up-and-running. It’s a mad-dash to the server room and anyone in our way is getting trampled. Kick open the door and hear the fire alarm going off… grab the water-bucket as outlined by the disaster-recovery steering committee and toss it into the server rack! Success! Lives saved.
What you failed to realize was that it was just a fire-drill, and now the backup tapes from last week are water-logged. It’s a little extreme – but the point is, critical data needs to be considered before jumping into corrective action. We know our networks, and we know what matters most. Consider the worst-case scenario for your actions, and have a backup plan.
3. Get the Symptom Description
This is an essential step in the troubleshooting process. It might sound trivial: of course we need to know what’s happening, otherwise we wouldn’t know there’s a problem! Though that may be true, it’s also true that many of the technology issues we deal with are symptomatic of another problem, and we need to carefully distinguish the symptom from its cause. When you walk into your physician’s office because your ear hurts, chances are they are not going to give you antibiotics without first verifying the absence of a foreign object.
Get as much information as possible. This is where a script or flow chart comes in handy for front-line staff who take problem calls. The quality of the information passed to the people doing the troubleshooting directly affects how well the incident is handled.
Be sure to ask the right questions:
- When did it start happening?
- What else happened around that time?
- Any installations or configuration changes done around that time?
- The who/what/where/when and why
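For teams that take problem calls through a script, the questions above could be captured in a small intake record. The following is only a sketch: the `SymptomReport` class and all of its field names are hypothetical, and it assumes that `None` means "not asked yet" while an empty list means "asked, nothing to report".

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class SymptomReport:
    """Hypothetical intake record; every field name here is illustrative."""
    reported_by: str                             # who
    description: str                             # what
    location: str                                # where
    started_at: Optional[datetime] = None        # when did it start happening?
    related_events: Optional[List[str]] = None   # what else happened around that time?
    recent_changes: Optional[List[str]] = None   # installations or configuration changes
    suspected_reason: str = ""                   # why (the caller's best guess)

    def unanswered(self) -> List[str]:
        """Return the intake questions that have not been asked yet."""
        gaps = []
        if self.started_at is None:
            gaps.append("When did it start happening?")
        if self.related_events is None:
            gaps.append("What else happened around that time?")
        if self.recent_changes is None:
            gaps.append("Any installations or configuration changes done around that time?")
        return gaps


report = SymptomReport(
    reported_by="front desk",
    description="Internet is down",
    location="main office",
)
for question in report.unanswered():
    print("Still need to ask:", question)
```

Keeping "not asked" separate from "nothing happened" matters here: a confirmed "no recent changes" is useful information, while a skipped question is a gap.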
4. Reproduce the Symptom
This one is simple, but still vital. You can’t possibly begin to implement corrective measures if you don’t have a full understanding of the problem. Using the information gathered in Step 3, try to recreate the issue so it can be witnessed first-hand. When this isn’t possible, the alternative is to see it from the end-user perspective. Whether that’s a walk to their office or a remote assistance session, you should see for yourself what you’re working with. Many people naturally develop a more solid understanding of concepts and ideas by seeing them than by reading or hearing about them. Having this first-hand knowledge of the problem will greatly aid in carrying out Step 5.
5. Take Corrective Action
This step is what I believe to be the easiest, strange as it may sound. Steps 1-4 have you collecting appropriate information, developing your contingency/back-out/backup plans, and generally just getting ready to tackle the task-at-hand – solving the problem. A lot of the time, IT admins and technicians jump straight to this step, setting themselves up for a world of potential new issues, and lost time.
Based on the detailed information you now have, this is where you make your best-informed decision regarding what the issue is, and what actions to take towards resolution.
6. Narrow it Down (Isolate the Root-Cause)
During this step, I perform a final validation of what the problem was by reviewing the pertinent Event Logs, application logs, device logs, and more to pin down exactly what caused the issue. In the ITIL world, this step is critical to the Service Operation stage, because what you find here is brought to the table for review and root-cause analysis.
This analysis is not done during this step, of course, because we still have work to do!
The general idea here is to iron out a plan to:
- Stop this from occurring again
- Understand how we can handle this better in case it does happen again
7. Replace or Repair Defective Equipment
The ugly truth of working in technology is that sometimes we don’t have any clue of an issue until it explodes in our face. This stage is about making sure you don’t get blindsided again. If faulty equipment or a bad configuration was the issue, correct that or install a replacement device – whatever you need to do to decrease the chance of a repeat problem.
8. Test
Once the fire is out and the smoke has cleared, it’s time to reflect and verify that the response to the problem was the correct one. Ask the following questions:
- Did the symptom go away?
- Did the right symptom go away?
- Did I fix the right cause?
- Did I create any other problems?
Having just dealt with a crisis, we’re starting to feel the relief and users are back to work as usual. We don’t want any unexpected surprises surfacing as a result of the incident, so asking ourselves these questions and performing any corresponding validation will help keep those users happy, and will ultimately help prevent a relapse.
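The four verification questions can also be treated as an explicit sign-off checklist, so that none of them gets skipped in the relief of the moment. This is a minimal sketch under that assumption; the `FixVerification` class and its field names are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FixVerification:
    """Hypothetical sign-off record for the four post-fix questions."""
    symptom_gone: bool
    right_symptom_gone: bool
    right_cause_fixed: bool
    new_problems_created: bool

    def failed_checks(self) -> List[str]:
        """Return the questions whose answers mean more work is needed."""
        failed = []
        if not self.symptom_gone:
            failed.append("Did the symptom go away?")
        if not self.right_symptom_gone:
            failed.append("Did the right symptom go away?")
        if not self.right_cause_fixed:
            failed.append("Did I fix the right cause?")
        if self.new_problems_created:
            failed.append("Did I create any other problems?")
        return failed

    def verified(self) -> bool:
        """The fix counts as verified only when no check has failed."""
        return not self.failed_checks()


check = FixVerification(
    symptom_gone=True,
    right_symptom_gone=True,
    right_cause_fixed=False,
    new_problems_created=False,
)
print(check.verified())       # False: the root cause is still open
print(check.failed_checks())  # ['Did I fix the right cause?']
```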
9. Take Pride
Though not directly linked to the troubleshooting process itself, this step is, indeed, vital. You’ve just been involved in a stressful situation, with people coming at you from every-which-way looking for updates and ETAs.
Now that things are back up and running, take some time to talk about the incident. Tell your co-workers and/or managers/team-leads the process you went through to arrive at the solution. Brag to your teammates, respectfully of course. In the technology field, burnout is very real. This step is a great way to help prevent it from happening to you: a chance to gloat, a chance to feel great about getting to the bottom of things. Always take this “debrief” period for your own mental sanity, because these situations are nerve-racking.
10. Prevent Future Occurrences
This stage is all about communication and documentation. Document the issue including initial symptoms, affected systems, and any other pertinent details. Document your corrective measures and your root-cause analysis. Meet with your team and discuss the findings so that everyone is on the same page. Make sure that there is plenty of supporting detail, enough that you are confident that should this issue recur, your colleagues would have a much easier time with diagnosis and resolution.
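One way to keep this documentation consistent is to give every write-up the same shape. The sketch below assumes a plain-text incident record; the `IncidentRecord` class, its fields, and the sample values (borrowed from the unplugged-cable example in Step 1) are illustrative only, not a prescribed template.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentRecord:
    """Hypothetical write-up structure covering the details listed above."""
    title: str
    initial_symptoms: List[str]
    affected_systems: List[str]
    corrective_measures: List[str]
    root_cause: str

    def to_text(self) -> str:
        """Render the record as plain text that can be shared with the team."""
        sections = [
            ("Initial symptoms", self.initial_symptoms),
            ("Affected systems", self.affected_systems),
            ("Corrective measures", self.corrective_measures),
        ]
        lines = [f"Incident: {self.title}"]
        for heading, items in sections:
            lines.append(f"{heading}:")
            lines.extend(f"  - {item}" for item in items)
        lines.append(f"Root cause: {self.root_cause}")
        return "\n".join(lines)


# Sample values are invented for illustration.
record = IncidentRecord(
    title="Internet outage",
    initial_symptoms=["No Internet access"],
    affected_systems=["Office workstation"],
    corrective_measures=["Reconnected the network cable"],
    root_cause="Network cable was unplugged",
)
print(record.to_text())
```

A fixed structure like this makes it harder to forget a section, and it gives colleagues a familiar layout to scan when the issue recurs.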
The 10 steps outlined above were adapted from The Universal Troubleshooting Process, a methodology I have been using throughout my career, and also throughout my adult life. I have held many roles in the technology field since graduating from college, and in each one I have experienced “cart-before-horse” troubleshooting very often.
The audience for these tips is unrestricted. It doesn’t matter if you are an expert, a novice, intermediate, or a plumber. These steps can be applied to any industry or field, which is one of the things I like most.
Matt Minor (MattMinor86) has been a Technical Specialist at a hospital for seven years, where he works in all things IT. Hailing from a small Northern-Ontario town in Canada, when he isn’t helping out on Experts Exchange, he can be found roughhousing with his son, Jackson.