Saturday, May 16, 2009

“Use the code, Luke”

One of the things I stress to young engineers is to believe that the code does what it is written to do. If the code is meant to log a message and you don’t see the message, then the code is not being executed. The following example from a few days ago shows how effective this idea can be when debugging relatively complex issues.

A customer reported that some status messages were not being sent to them. This started after the system was upgraded to the latest version of our server. The issue was escalated to engineering and I had a look at it with the engineer responsible for the customer. We had a look at the log files and could not see the entries that indicate status messages being sent. We checked the change log for the code that sends the messages and nothing had been modified in the latest server release. Aside: each of our customers normally has differing requirements, so our system consists of a core set of functionality supplemented with customer-specific functionality implemented in what we call customisation classes.

I suggested to the engineer that because we could not see the log file entry for sending the status message, I did not think that the code was being called – there must be another path through the system that was now executing. I had to go and check on something else, so I left him to see what he could find out.

Fifteen minutes later he came back with the cause of the problem identified. A change had been made in the core code that stopped the customisation code being called under certain circumstances. The way the system was meant to work was that the customisation class extended the core class and overrode certain methods. The code below shows this idea.


class Core
{
    public void methodA()
    {
        methodB();
    }

    public void methodB()
    {
        doSomething();
    }

    protected void doSomething() { }
}

class Customisation extends Core
{
    // Core.methodA() calls this override via dynamic dispatch.
    @Override
    public void methodB()
    {
        // customer-specific behaviour, e.g. send the status message
    }
}


What had happened was that a new method signature had been introduced for methodB, and methodA now called this new methodB. Because the customisation code did not override the new methodB, the code for sending the status message was never called. The problem is shown in the code sample below.



class Core
{
    public void methodA()
    {
        methodB("No longer overridden");
    }

    public void methodB()
    {
        // normally overridden
    }

    public void methodB(String a)
    {
    }
}
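One way the old execution path could be restored (a sketch only – the post does not show the actual fix, and the names here are placeholders) is to have the new overload delegate back to the no-argument methodB(), so a subclass override still takes effect:

```java
class Core
{
    public void methodA()
    {
        methodB("delegates back");
    }

    public void methodB()
    {
        // normally overridden by the customisation class
    }

    public void methodB(String a)
    {
        // Delegate to the no-argument version so that any subclass
        // override of methodB() still takes effect.
        methodB();
    }
}

class Customisation extends Core
{
    boolean sent = false;

    @Override
    public void methodB()
    {
        sent = true; // stands in for sending the status message
    }
}

public class DispatchDemo
{
    public static void main(String[] args)
    {
        Customisation c = new Customisation();
        c.methodA();
        System.out.println("override called: " + c.sent); // override called: true
    }
}
```

With the delegation in place, methodA() again ends up in the customisation class even though it now calls the new overload.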


The key reasons the cause was found so quickly were:



  1. Good knowledge of the system – the engineer knew which parts of the system were related to the issue.

  2. He looked at what had changed.

  3. He used the knowledge that there must be a different path through the system.

  4. He “used the code”.



A fix was produced that resolved the issue by restoring the original execution path through the system – Issue Resolved.

Wednesday, May 13, 2009

The missing hour

For one of our customers we produce a series of reports on the transactions performed in their shops. We only capture some of the data on the mobile client and the rest of the information is obtained by querying a web service that accesses the customer’s point of sale system.

During customer acceptance testing they reported that we had the times for some of the transactions wrong. The transactions were an hour off, putting them in the wrong day. Daylight saving had ended during the acceptance testing, so someone suggested that this might be the cause. Unfortunately we could not tie this to the cause, and times were still being recorded incorrectly a few weeks later.

The web service ran over HTTPS, so initially it was hard to see what was in the messages – we could see what was received and stored in our database, but we couldn't see what they were sending us. They eventually sent us a message log from their server. We were then able to confirm that the time string in the message for the transaction time did not match the one we were storing in the database.

It looked like the problem was somewhere in the decoding of the SOAP message. This was in code generated from the WSDL. We just got a Calendar object back from the generated code. I looked up the W3C spec for SOAP encoding (remember steps 1–3: understand the system, understand the problem, gather information). I noticed that the format used for transmitting times in SOAP messages cannot be easily converted back to a Calendar or Date object using the standard Java date format converters. From the JDK documentation I saw that there is a special class for converting W3C date/time data types. I built a small test program and ran the date/time string from the WS call through the JDK converter class. The program returned the correct value. So why didn’t the generated code work correctly?
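The test program would have looked something like the sketch below. The date string is made up, and the post does not name the exact JDK class used; javax.xml.datatype.DatatypeFactory is one JDK converter for W3C schema date/time types that handles the timezone component:

```java
import javax.xml.datatype.DatatypeFactory;
import javax.xml.datatype.XMLGregorianCalendar;
import java.util.GregorianCalendar;

public class XsdDateTimeDemo
{
    public static void main(String[] args) throws Exception
    {
        // A made-up xsd:dateTime value with a timezone offset, in the
        // lexical format used in SOAP messages.
        String lexical = "2009-05-13T09:30:00+10:00";

        // DatatypeFactory parses W3C schema date/time types, including
        // the timezone component at the end of the string.
        XMLGregorianCalendar xml =
                DatatypeFactory.newInstance().newXMLGregorianCalendar(lexical);

        GregorianCalendar cal = xml.toGregorianCalendar();
        System.out.println(xml.getTimezone());      // timezone offset in minutes: 600
        System.out.println(cal.getTimeInMillis()); // the correct absolute instant
    }
}
```

Running the suspect string through a converter like this, and comparing against what ends up in the database, immediately shows whether the decoding step is losing the timezone.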

We looked through the generated code and discovered that it used another class to convert the string in the message to a Calendar object. We modified the test program to use the other class and it gave the incorrect result that had been reported. (Now we had a reproducible problem.) Looking at the source for the problem class, it appeared that the class did not handle the timezone component in the date/time string from the WS call. We were in a rush, so we just modified the generated code and swapped in the JDK class. Issue Resolved.

Tuesday, May 5, 2009

The worst day of my life

The title of this article should really be "The worst day of my professional life". This is also an article about trust.

It was in 2003, I don't remember the exact date but I could go back and check my log books. I was working on a system being deployed overseas with 400 mobile users. It was our first large deployment. Our software was not really up to the task. There were numerous performance problems with the design of the code that handled the data synchronisation between the server and the mobile clients.

After working for 10+ years, this was the first time I had actually been around when a system I had worked on was live and needed to be up and running all the time. Sure, I had worked on a lot of mission-critical stuff before, but I was never around when the software finally got deployed. The customer had been slowly increasing the user numbers and the performance issues had been getting worse.

I performed some analysis of the code and had come up with some design changes that reduced the number of messages exchanged between the mobile devices and servers. These changes wouldn't solve the fundamental design issue - we effectively had a single threaded server handling 400+ clients. However the changes would reduce some of the queuing issues that we were seeing.

We had a test system in Sydney with a fraction of the power of the production hardware, and I had used it to test the changes and show that the synchronisation time could be reduced significantly. A major drawback to the changes was that the mobile software needed to be upgraded as well. It was a labour-intensive activity to upgrade a few hundred devices with the new software.

So the day arrived and the new software with my changes was turned on and the field workers started logging on. As the number of users increased to around 20 the system ground to a halt. This was not what I was expecting - we had run a test client on the software that simulated a few hundred users without problems. After a few minutes we realised that we had hundreds of synchronisation requests backed up. The system would take over an hour to process the backlog and during this time more requests would be added to the queue. We made the decision to reset the system. We couldn't quite work out why the performance was nowhere near what we had experienced in Sydney.

After resetting the system, it once again ground to a halt after the number of users reached more than twenty. So after about half an hour we had to reset the system again. I was just sitting there watching this live system continually failing, helpless to do anything about it. This was not something I had ever experienced before. I was not used to live software that I had worked on failing like this. It was the worst day of my professional life. As expected, the customer was not very happy about the situation. After a few hours of this, the decision was made to roll back the changes, which meant all the mobile device software also needed to be rolled back.

The next day the old version of the software was back online and the synchronisation times were slow but the server was quite happy since the number of messages exchanged in the old protocol effectively meant the server idled most of the time waiting for data to be transferred over the mobile phone network.

I trawled through the log files to try and find out what had happened and saw that the database queries were running slower by a factor of 10 on the production hardware vs. the time we had achieved on our test system. I asked the customer if they were running anything on the database that would affect the times like this. Their answer was no. We started having daily phone calls with the irate customer where we were meant to report our progress on what had happened. We ran our tests again and could not see anything wrong, so we didn't have a lot to report. We kept asking if they had done anything to the system that could affect the performance of our queries. The answer was always no.

One morning I created a new instance of the database on their production hardware and re-ran the performance test against the new database. Lo and behold, the performance matched what had been seen in Sydney. After some further examination of the production database, I found that there were a whole lot of triggers added by the customer. This was the reason our queries were running so slowly – Issue resolved! My boss angrily reported the facts to the customer – they had wasted a lot of our time. I think there were some mumbled apologies. Lesson learned: never trust someone when what they are telling you does not fit with what you are experiencing – go and investigate it for yourself if you can.

Postscript: After the failed upgrade the customer did not want to upgrade the server and mobile software again. To this day, over five years later, I think they are still running the old version of the software. Last year they approached us to port the mobile software to a new device. When we suggested that they upgrade to our new version of the server, which is probably a hundred times faster than the version they are running, their response was “no thanks”.

Thursday, April 23, 2009

The 3 C’s: Configuration, configuration and configuration

One of the biggest causes of issues is configuration parameters. I try to drill the three Cs into all the young engineers I work with: configuration, configuration and configuration.

I was approached by one of the operations staff recently about a performance problem that had started occurring in the production environment. Some of the queries to validate data were taking a lot longer than they should and this was happening intermittently. The guy was concerned because they had recently migrated the system from a physical server to a new virtual server. He was worried there was some issue with the server setup that was not expected. So I knew something must have changed. I didn't think it was the virtual server because we have other deployments on virtual servers.

I wanted to have a look at the log files for the server, because we had introduced performance logging around the operations in question. This would allow me to confirm what the customer was saying. Anyway, he wanted to look at some other stuff first and I had other things to do, so I left it. A while later he approached me again, saying that he needed more help.

So I opened up the log files and was able to confirm that some of the operations were running slowly intermittently. The operations normally complete in about 300ms so when they start taking 25 seconds people notice. It also looked like the customer operators were hitting the buttons again because they thought the first button press had not worked.

What I noticed was that every time the operation ran slowly, a new connection to the database was also being made. Our database server is configured to do reverse DNS lookups for client connections. I don't know why this is, but it does mean a new connection sometimes takes 20 seconds or more when the client machine does not have a DNS entry. The operations guy said all the production servers had DNS entries, the DNS was configured, and other database connections were happening quickly. I had a look at the JDBC connection string. It was wrongly configured to point to the old database server, not the new one. The old database server does not use the same DNS server as the new database server. When the system was migrated, the database parameters were not updated – Issue resolved. Once again the culprit was the 3 Cs.
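The mechanism is easy to illustrate with a small sketch (this is not our actual code – it just shows where a reverse lookup happens from Java, and why it can stall when the address has no DNS entry):

```java
import java.net.InetAddress;

public class ReverseDnsDemo
{
    public static void main(String[] args) throws Exception
    {
        InetAddress addr = InetAddress.getByName("127.0.0.1");

        long start = System.nanoTime();
        // getCanonicalHostName() triggers a reverse (PTR) lookup. If the
        // address has no DNS entry, this call can block until the
        // resolver times out -- which is where multi-second delays on
        // "new connection" operations can come from.
        String host = addr.getCanonicalHostName();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println(host + " resolved in " + elapsedMs + " ms");
    }
}
```

A database server doing the equivalent lookup on each new client connection behaves the same way: fast when the entry exists, and stalled for the resolver timeout when it doesn't.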

It reminded me of another issue that had come up recently, when a test server was being upgraded. Someone had made a mistake and configured the server with the configuration files for the production server. I was called over when they could not work out why all the data was going into the production database, when no users were connecting to it and all the users seemed to be connecting to the test server. Again the culprit was the 3 Cs.

Wednesday, April 15, 2009

Take a step back

I often have to tell engineers who are very focussed on an issue to “take a step back” and think about why they are seeing/having the issue.

Recently an issue was noticed with one of the dialogs on our scheduling product. The issue only occurred on Firefox 3. The dialog displays a list of items and each item’s row height was growing to fill the available space in the dialog rather than the row height being set based on the height of the contents. It was only a cosmetic problem. One of the engineers tracked the cause back to a known issue in Firefox 3 with table rendering. I asked him if there were any known workarounds and he said no, we would just have to wait until the issue was resolved in Firefox.

Although it was only a cosmetic problem it bugged me! We were not doing anything special on this dialog. The dialog just displayed a list of items in an HTML table. If there was a widespread problem with table rendering in Firefox 3, then everyone would be talking about it. Take a step back and think about it – why were we being affected by this very specific issue?

On a Friday afternoon when I had a few minutes spare, I got one of my team mates to show me the Firefox issue report, which detailed the very specific situation that led to the problem. The problem only occurred when the table or cell height was explicitly set. This was something we did not need to do, so we should not have been having the problem. Our table was being dynamically generated with JavaScript, so we had a look at the relevant code and found it was setting the height of the table body element, which was not needed in our case. We patched the code on a server and tested the dialog again and it displayed perfectly – Issue Resolved.

It took about 10 minutes to resolve this issue, which had been sitting around for a couple of months because no one had taken a step back to ask why we were having this issue. We knew what caused it but had decided it was a Firefox issue and would wait for it to be fixed.

The speed with which it was resolved in the end was due to:

  • Good system knowledge – we knew where the code was that was causing the issue
  • A reproducible problem
  • Two heads are better than one.

Wednesday, April 8, 2009

The curse of daylight savings time

Daylight saving is a nightmare for software companies in the south eastern Australian states. The reason for this is that the state governments have been modifying the start and end dates of daylight saving for the last few years. This all started in 2000 for the Sydney Olympics.

This year the end of daylight saving has been moved back a week. I saw in the paper that some people had glitches with alarms on mobile phones waking them up an hour later.

We had an issue reported where one of our customers could not see some of their field worker jobs on their scheduling screens. The support desk had been looking at it, confirmed there was an issue, and escalated it to engineering.

I asked the normal questions:

  • Has anything changed? This was the first time the issue had been reported and this customer's system had been running for about six months. No, nothing had been changed that day.
  • Have you checked the log files? They were still looking at it – it always amazes me how long it takes people to check the log files.

I checked the log files and there was no indication of the cause of the issue, although something had gone wrong at 11:30 AM, when the customer's browsers reported an issue with the data. The error reported was a pretty vague JavaScript error.

Could we replicate it? I got the identifier for the job that was not appearing on the scheduling screen and ran up a scheduling screen on Firefox, and the job was there. However, the customer runs IE7. When I accessed the screen in IE7 the page failed to load. I got one of the other engineers to load the same screen on a machine with a debugger attached and we found part of the puzzle – the job had a negative duration because the start time for the job was later than the end time. The duration controls the width of the box used to display the job on the screen, so a negative duration was not good. We set the start time for the job when the field worker starts the job and the end time when they complete it. Why would they fiddle with the clock on their tablet during a job – they shouldn't change the time at all. Then it clicked: daylight saving. The support desk rang the technician and confirmed that he had changed the tablet clock during the job. The clock on his tablet was wrong because something thought daylight saving had already ended.
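A negative duration like this is easy to reproduce (the times below are made up for illustration – they just show a clock wound back an hour mid-job):

```java
import java.time.Duration;
import java.time.LocalDateTime;

public class JobDurationDemo
{
    public static void main(String[] args)
    {
        // Hypothetical job times: the field worker wound the tablet
        // clock back an hour during the job, so the recorded end time
        // is earlier than the recorded start time.
        LocalDateTime start = LocalDateTime.of(2009, 4, 5, 11, 30);
        LocalDateTime end   = LocalDateTime.of(2009, 4, 5, 10, 45);

        Duration d = Duration.between(start, end);
        System.out.println(d.toMinutes()); // -45: a negative width for the job's box
    }
}
```

Any code that turns that duration directly into a pixel width for the job's box on the scheduling screen will misbehave on a negative value.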

We updated the database entries for the job so that it now had a positive duration - Issue resolved.

The problem reoccurred half an hour later when another field worker did exactly the same thing. We applied the same fix – Issue resolved again.

We will see what happens next year.

Wednesday, March 18, 2009

The seven habits of highly effective debuggers – with apologies to Stephen Covey

This post is based on a training presentation that I have given a few times in an effort to improve the debugging skills of my fellow engineers. The aim of the presentation was to give people a strategy to follow when trying to identify the cause of "hard problems" on production systems. The habits are based on my experience and what I have learnt in my career. A lot of that experience came from doing integration work on combat systems and also working on embedded systems. Debugging on a system where it takes 45 minutes to load the software before you can run the debugger means you have a lot of time to think about what your next step is, so you don't waste another 45 minutes waiting for a reload because the system crashed and you did not have the breakpoint set at the correct place.

The seven habits are:

  1. Understand the system
  2. Understand the problem
  3. Gather Information
  4. Identify the cause
  5. Implement the solution
  6. Test the solution
  7. Document the solution

Understanding the system is the first thing you need to do when presented with a problem. If you don't know the following then you will probably waste a lot of time looking for the cause of an issue:

  • What the system is meant to be doing.
  • How the system is put together – what the components of the system are. You will need to know which parts of the system are being executed when the problem occurs, and which version of the software/hardware is being used. I have seen cases where this was not checked and people wasted a lot of time looking at the wrong code.
  • How the system is configured – over the years I have found that incorrect configuration parameters are one of the major causes of problems.

Once you have a good understanding of the system, the next step is to understand the problem. I have this as one of the steps because, in my experience, issues logged by help desks and users are often not precisely reported. Often you don't get a set of repeatable steps on how to reproduce the issue, but rather a description such as "the button did not work". You should try to reproduce the issue for yourself if possible. If you can do this, it generally makes solving the problem much easier. Remember, this strategy was put forward to help solve "hard" problems on production systems. These were issues/bugs that had not been detected during the testing of the system. Many of the issues being reported were intermittent and some could not be reproduced on test systems.

Okay, now you feel that you have sufficient knowledge of the system and the issue. The next steps are gathering and analysing information, and then using this to find the cause of the issue. In practice, steps 3 and 4 are done together. There is no magic answer for identifying the cause of an issue. The exact path you follow needs to be based on what information you have. I generally look at the following things:

  • The log files. Generally, the first thing I will ask about an issue is "Is there anything in the log files?". I once worked alongside another senior engineer in a small office. The junior engineers would come in to ask about issues. The first thing he would ask them was to show him the log file. If they did not have the log file, he would not talk to them. Even though the junior engineers knew what to do, it was amazing how many times they had to be sent back to get a log file before they learnt what was expected of them.
  • The code. One of the most important lessons I learnt very early in my career was: believe the system behaves as it has been configured and programmed. Software generally behaves in the way you tell it to. It is only in rare circumstances that this assumption is wrong.
  • The configuration files.
  • The database.
  • The web

If all of the above fails to find the answer, then sometimes it helps to discuss the issue with your fellow engineers. Remember, "two heads are better than one".

If all of the above fails then you may have to look at modifying the code to add more logging information.

An important thing I try and drill into engineers is that during this stage it is very important that you keep detailed information of what you are doing written down somewhere - preferably attached to the issue report in the issue tracking system.

If you have not found the cause, then don't continue reading – go back and identify the cause. Once you are confident that the cause has been identified, you can fix it. This may involve modifying code, configuration data, or data in the database. If a code change is required, be careful that you don't break anything else. You should do at least one of the following to test the solution:

  • Run all of your regression tests
  • Have the code change inspected by someone else
  • Test the system again with the change in place and then without the change to confirm that the "fix" does resolve the issue.

With the solution implemented and tested and hopefully deployed there is one remaining step - document the solution. Doing this will:

Save time later if a similar issue occurs – it should give others information on what to look for as the cause of the issue.

Prevent other similar issues occurring – you may have spotted a coding problem that is also present in other systems.

So hopefully following this strategy will make you more efficient when confronted with hard problems on production systems.