This post is based on a training presentation that I have given a few times in an effort to improve the debugging skills of my fellow engineers. The aim of the presentation was to give people a strategy to follow when trying to identify the cause of "hard problems" on production systems. The habits are based on my experience and what I have learnt in my career. A lot of that experience came from doing integration work on combat systems and from working on embedded systems. When it takes 45 minutes to load the software onto a system before you can run the debugger, you have a lot of time to think about your next step. You don't want to waste another 45 minutes waiting for a reload because the system crashed and you did not have the breakpoint set in the correct place.
The seven habits are:
- Understand the system
- Understand the problem
- Gather information
- Identify the cause
- Implement the solution
- Test the solution
- Document the solution
Understanding the system is the first thing you need to do when presented with a problem. If you don't know the following then you will probably waste a lot of time looking for the cause of an issue:
- What the system is meant to be doing.
- How the system is put together - what are the components of the system? You will need to know which parts of the system are being executed when the problem occurs, and which version of the software/hardware is being used. I have seen cases where this was not checked and people wasted a lot of time looking at the wrong code.
- How is the system configured - Over the years I have found that incorrect configuration parameters are one of the major causes of problems.
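As a rough illustration of checking configuration, the sketch below compares a deployed configuration against a known-good baseline and reports every difference. The function name `diff_config` and the parameter names and values are invented for this example; a real system would load both sides from its own configuration files.

```python
def diff_config(expected: dict, actual: dict) -> dict:
    """Return {key: (expected_value, actual_value)} for every mismatch."""
    missing = object()  # sentinel so that a stored None is not confused with "absent"
    diffs = {}
    for key in sorted(set(expected) | set(actual)):
        exp = expected.get(key, missing)
        act = actual.get(key, missing)
        if exp != act:
            diffs[key] = (
                "<missing>" if exp is missing else exp,
                "<missing>" if act is missing else act,
            )
    return diffs


# Hypothetical baseline and deployed settings, for illustration only.
baseline = {"timeout_s": 30, "retries": 3, "log_level": "INFO"}
deployed = {"timeout_s": 5, "retries": 3, "debug": True}

for key, (exp, act) in diff_config(baseline, deployed).items():
    print(f"{key}: expected {exp!r}, found {act!r}")
```

Running a comparison like this before reading any code is a cheap way to rule configuration in or out as the cause.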
Once you have a good understanding of the system, the next step is to understand the problem. I have this as one of the steps because, in my experience, issues logged by help desks and users are often not precisely reported. Often you don't get a set of repeatable steps to reproduce the issue but rather a description such as "the button did not work". You should try to reproduce the issue for yourself if possible. If you can, it generally makes solving the problem much easier. Remember that this strategy was put forward to help solve "hard" problems on production systems. These were issues/bugs that had not been detected during the testing of the system. Many of the issues being reported were intermittent and some could not be reproduced on test systems.
Okay, now that you feel you have sufficient knowledge of the system and the issue, the next steps are to gather and analyse information and then use it to find the cause of the issue. In practice steps 3 and 4 would be done together. There is no magic answer for identifying the cause of an issue. The exact path you follow needs to be based on what information you have. I generally look at the following things:
- The log files. Generally, if asked about an issue, the first thing I will ask is "Is there anything in the log files?". I once worked alongside another senior engineer in a small office. The junior engineers would come in to ask about issues. The first thing he would ask them was to show him the log file. If they did not have the log file he would not talk to them. Even though the junior engineers knew what to do, it was amazing how many times they had to be sent back to get a log file before they learnt what was expected of them.
- The code. One of the most important lessons I learnt very early in my career was: Believe the system behaves as it has been configured and programmed. Software generally behaves in the way you tell it. It is only in rare circumstances that this assumption is wrong.
- The configuration files.
- The database.
- The web.
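When starting with the log files, it often helps to narrow them down to warnings and errors close to the time the issue was reported. The following is a minimal sketch, assuming log lines of the form `YYYY-MM-DD HH:MM:SS LEVEL message`; the format, the function name `lines_near` and the sample lines are all made up for illustration and will differ on a real system.

```python
import re
from datetime import datetime, timedelta

# Assumed line format: "2024-01-01 10:05:10 ERROR write failed: disk full"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.*)$")


def lines_near(log_lines, when, window_s=60, levels=("WARNING", "ERROR")):
    """Return lines at the given levels within window_s seconds of 'when'."""
    window = timedelta(seconds=window_s)
    hits = []
    for line in log_lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that do not follow the assumed format
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if abs(stamp - when) <= window and match.group(2) in levels:
            hits.append(line)
    return hits


# Invented sample data standing in for a real log file.
sample = [
    "2024-01-01 10:00:00 INFO service started",
    "2024-01-01 10:04:30 WARNING queue depth 950",
    "2024-01-01 10:05:10 ERROR write failed: disk full",
    "2024-01-01 10:30:00 ERROR unrelated failure",
]
reported_at = datetime(2024, 1, 1, 10, 5, 0)

for line in lines_near(sample, reported_at):
    print(line)
```

Widening `window_s` is worth trying when nothing shows up, since the time a user reports is rarely exact.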
If all of the above fails to find the answer then sometimes it helps to discuss the issue with your fellow engineers. Remember "two heads are better than one".
If all of the above fails then you may have to look at modifying the code to add more logging information.
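As an example of what adding more logging might look like, the sketch below records the inputs, the unusual branch and the result of a function using Python's standard `logging` module. The function `apply_discount` and the logger name `orders` are hypothetical stand-ins for whatever code is under suspicion.

```python
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("orders")  # hypothetical component name


def apply_discount(price, percent):
    """Hypothetical function under investigation, with extra logging added."""
    # Record the inputs so an intermittent failure can be tied to real values.
    logger.debug("apply_discount called: price=%r percent=%r", price, percent)
    if not 0 <= percent <= 100:
        # Log the unexpected branch rather than clamping silently.
        logger.warning("percent out of range, clamping: %r", percent)
        percent = max(0, min(100, percent))
    result = price * (100 - percent) / 100
    logger.debug("apply_discount result=%r", result)
    return result


apply_discount(200, 25)
apply_discount(200, 150)
```

Logging at debug level means the extra output can be left in the code and only switched on in production when it is needed.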
An important thing I try and drill into engineers is that during this stage it is very important that you keep detailed information of what you are doing written down somewhere - preferably attached to the issue report in the issue tracking system.
If you have not found the cause then don't continue reading; go back and identify the cause. Once you are confident that the cause has been identified then you can fix it. This may involve modifying code, configuration data, or data in the database. If a code change is required, be careful that you don't break anything else. You should do at least one of the following to test the solution:
- Run all of your regression tests
- Have the code change inspected by someone else
- Test the system again with the change in place and then without the change to confirm that the "fix" does resolve the issue.
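Testing with and without the change can be captured in a small check that is run against both versions of the code. The sketch below is only an illustration: `parse_port_old` and `parse_port_fixed` are invented functions standing in for the code before and after a fix, and the bug (rejecting input with surrounding whitespace) is hypothetical.

```python
def parse_port_old(text):
    """Hypothetical code before the fix: rejects surrounding whitespace."""
    if not text.isdigit():
        raise ValueError(f"invalid port: {text!r}")
    return int(text)


def parse_port_fixed(text):
    """Hypothetical code after the fix: strips whitespace first."""
    text = text.strip()
    if not text.isdigit():
        raise ValueError(f"invalid port: {text!r}")
    return int(text)


def reproduces_issue(parse):
    """Return True if the reported failure occurs with the given parser."""
    try:
        parse(" 8080\n")  # the input taken from the (invented) issue report
    except ValueError:
        return True
    return False


# Without the change the issue must reproduce; with the change it must not.
print("old code fails:", reproduces_issue(parse_port_old))
print("fixed code fails:", reproduces_issue(parse_port_fixed))
```

If the check does not fail against the old code, then the test is not exercising the reported issue and passing with the fix proves nothing.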
With the solution implemented and tested and hopefully deployed there is one remaining step - document the solution. Doing this will:
- Save time later if a similar issue occurs - It should give others information on what to look for as the cause of the issue.
- Prevent other similar issues occurring - If you have spotted a coding problem that may be present in other systems or
So hopefully following this strategy will make you more efficient when confronted with hard problems on production systems.