issue-resolved: May 2009

Saturday, May 16, 2009

“Use the code, Luke”

One of the things I stress to young engineers is to believe that the code does what it is meant to. If the code is meant to log a message and you don’t see the message then the code is not being executed. The following example from a few days ago shows how effective this idea can be debugging relatively complex issues.

An issue was reported by a customer that some status messages were not being sent to them. This was happening after the system was upgraded to the latest version of our server. The issue was escalated to engineering and I had a look at the issue with the engineer responsible for the customer. We had a look at the log files and could not see the entries that indicated status messages being sent. We checked the change log for the code that sent the messages and nothing had been modified in the latest server release. Aside: Each of our customers normally has differing requirements so our system consists of core set of functionality supplemented with with customer specific functionality implemented in what we call customisation classes.

I suggested to the engineer that because we could not see the log file entry for sending the status message I did not think that the code was being called – there must be another path through the system that was now executing. I had to go and check on something else so I left him to see what he could find out.

15 minutes later he came back with the cause of the problem identified. A change had been made in the core code that stopped the customisation code being called under certain circumstances. The way the system was meant to work was that the customisation class extended the core class and over rode certain methods. The code below shows this idea.


class Core
{
    public void methodA()
    {
        methodB();
    }
    
    public void methodB()
    {
        doSomthing();
    }
}

What had happened was that a new method signature had been introduced for methodB and methodA now called this new methodB. Because the customisation code did not over ride method the new methodB the code for sending the status message was never called. The problem is shown in the code sample below.

class Core
{
    public void methodA()
    {
        methodB("No longer overriden)
    }

    public void methodB()
    {
        // normally overriden
    }
    
    public void methodB(String a)
    {
    }
}

The key reasons the cause was found so quickly were:

Good knowledge of the system – The engineer knew what parts of the the system were related to the issue.

He looked at what had changed

He used the knowledge that there must be a different path through the system

He “used the code”

A fix was produced that resolved the issue by reverting the functionality to what it had been in relation to the execution path through the system – Issue Resolved.

Wednesday, May 13, 2009

The missing hour

For one of our customers we produce a series of reports on the transactions performed in their shops. We only capture some of the data on the mobile client and the rest of the information is obtained by querying a web service that accesses the customer’s point of sale system.

During customer acceptance testing they reported that we had the times for some of the transactions wrong. The transactions were an hour off, putting them in the wrong day. Daylight saving had ended during the customer acceptance so someone suggested that this may be the cause of the issue. Unfortunately we could not tie this to the cause and times were still being recorded incorrectly a few weeks later.

The web service ran over HTTPS so it was hard to see what was in the message initially - we could see what was received and stored in our database but couldn't see what they were sending us. They eventually sent us a message log from their server. We then were able to confirm that the time string in the message for the transaction time did not match the one we were storing in the database.

It looked like the problem was somewhere in the decoding of the SOAP message. This was in code generated from the WSDL. We just got a Calendar object back from the generated code. I looked up the W3 spec for SOAP encoding (Remember steps 1-3: Understand the system, Understand the problem, gather information). I noticed that the format used for transmitting times in SOAP messages cannot be easily converted back to a Calender or Date object using the standard java date format converters. From the JDK documentation I saw that there is a special class for converting W3 date/time data types. I built a small test program and ran the date time string from the WS call through the JDK converter class. The program returned the correct value.So why didn’t the generated code work correctly?

We looked through the generated code and and discovered that the generated code used another class to convert the string in the message to a calendar object. We modified the test program to use the other class and it gave the result that had the issue reported. (Now we had a reproducible problem). Looking at the source for the problem class it looked like the class did not handle the timezone component in the date/time string from the WS call. We were in a rush so we just modified the generated code and swapped in the JDK class. Issue Resolved.

Tuesday, May 5, 2009

The worst day of my life

The title of this article should really be "The worst day of my professional life". This is also an article about trust.

It was in 2003, I don't remember the exact date but I could go back and check my log books. I was working on a system being deployed overseas with 400 mobile users. It was our first large deployment. Our software was not really up to the task. There were numerous performance problems with the design of the code that handled the data synchronisation between the server and the mobile clients.

After working for 10+ years this was the first time I had actually been around when a system I had worked on was live all the time and needed to up and running all the time. Sure I had worked on a lot of mission-critical stuff before but I was never around when the software finally got deployed. The customer had been slowly increasing the user numbers and the performance issues had been getting worse.

I performed some analysis of the code and had come up with some design changes that reduced the number of messages exchanged between the mobile devices and servers. These changes wouldn't solve the fundamental design issue - we effectively had a single threaded server handling 400+ clients. However the changes would reduce some of the queuing issues that we were seeing.

We had a test system in Sydney with a fraction of the power of the production hardware and I had used this to test the changes and show that the synchronisation time could be reduced significantly with these changes. A major drawback to the changes was that the mobile software needed to upgraded as well . It was a labour intensive activity to upgrade a few hundred devices with the new software.

So the day arrived and the new software with my changes was turned on and the field workers started logging on. As the number of users increased to around 20 the system ground to a halt. This was not what I was expecting - we had run a test client on the software that simulated a few hundred users without problems. After a few minutes we realised that we had hundreds of synchronisation requests backed up. The system would take over an hour to process the backlog and during this time more requests would be added to the queue. We made the decision to reset the system. We couldn't quite work out why the performance was nowhere near what we had experienced in Sydney.

After resetting the system, the system once again ground to a halt after the number of users reached more than twenty. So after about 1/2 hour we had to reset the system. I was just sitting there watching this live system continually failing helpless to do anything about it. This was not something I had ever experienced before. I was not used to live software that I had worked on failing like this. It was the worst day of my professional life. As expected the customer was not very happy about the situation. After a few hours of this, the decision was made to roll back the changes, which also meant all the mobile device software also needed to be rolled back.

The next day the old version of the software was back online and the synchronisation times were slow but the server was quite happy since the number of messages exchanged in the old protocol effectively meant the server idled most of the time waiting for data to be transferred over the mobile phone network.

I trawled through the log files to try and find out what had happened and saw that the database queries were running slower by a factor of 10 on the production hardware vs. the time we had achieved on our test system. I asked the customer if they were running anything on the database that would affect the times like this. Their answer was no. We started having daily phone calls with the irate customer where we were meant to report our progress on what had happened. We ran our tests again and could not see anything wrong, so we didn't have a lot to report. We kept asking if they had done anything to the system that could affect the performance of our queries. The answer was always no.

One morning I created a new instance of the database on their production hardware and re ran the performance test against the new database. Low and behold the performance matched what had been seen in Sydney. After some further examination of the production database I found that there were a whole lot of triggers added by the customer. This was the reason our queries were running so slowly - Issue resolved! My boss angrily reported the facts to the customer -they had wasted a lot of our time. I think there were some mumbled apologies. Lesson learned: never trust someone when what they are telling you does not fit in with what you are experiencing , go and investigate it for yourself if you can.

Postscript: After the failed upgrade the customer did not want to upgrade the server and mobile software again. To this day over five years later I think they are still running the old version of the software. Last year they approached us to port the mobile software to a new device. When we suggested that they upgrade to our new version of the server, which is probably a hundred times faster than the version they are running, their response was "no thanks"