The title of this article should really be "The worst day of my professional life". This is also an article about trust.
It was in 2003; I don't remember the exact date, but I could go back and check my log books. I was working on a system being deployed overseas with 400 mobile users. It was our first large deployment, and our software was not really up to the task: there were numerous performance problems with the design of the code that handled the data synchronisation between the server and the mobile clients.
After working for 10+ years, this was the first time I had actually been around when a system I had worked on was live and needed to be up and running all the time. Sure, I had worked on a lot of mission-critical stuff before, but I was never around when the software finally got deployed. The customer had been slowly increasing the user numbers, and the performance issues had been getting worse.
I analysed the code and came up with some design changes that reduced the number of messages exchanged between the mobile devices and the servers. These changes wouldn't solve the fundamental design issue - we effectively had a single-threaded server handling 400+ clients - but they would reduce some of the queuing issues that we were seeing.
We had a test system in Sydney with a fraction of the power of the production hardware, and I had used this to test the changes and show that the synchronisation time could be reduced significantly. A major drawback was that the mobile software needed to be upgraded as well, and it was a labour-intensive activity to upgrade a few hundred devices with the new software.
So the day arrived, the new software with my changes was turned on, and the field workers started logging on. As the number of users increased to around 20, the system ground to a halt. This was not what I was expecting - we had run a test client against the software that simulated a few hundred users without problems. After a few minutes we realised that we had hundreds of synchronisation requests backed up. The system would take over an hour to process the backlog, and during this time more requests would be added to the queue, so we made the decision to reset the system. We couldn't quite work out why the performance was nowhere near what we had experienced in Sydney.
After the reset, the system once again ground to a halt when the number of users passed twenty, so after about half an hour we had to reset it again. I was just sitting there watching this live system continually failing, helpless to do anything about it. This was not something I had ever experienced before; I was not used to live software that I had worked on failing like this. It was the worst day of my professional life. As expected, the customer was not very happy about the situation. After a few hours of this, the decision was made to roll back the changes, which meant all the mobile device software needed to be rolled back as well.
The next day the old version of the software was back online. The synchronisation times were slow, but the server was quite happy, since the number of messages exchanged in the old protocol meant that the server spent most of its time idle, waiting for data to be transferred over the mobile phone network.
I trawled through the log files to try and find out what had happened, and saw that the database queries were running a factor of 10 slower on the production hardware than on our test system. I asked the customer if they were running anything on the database that could affect the times like this. Their answer was no. We started having daily phone calls with the irate customer, where we were meant to report our progress on working out what had happened. We ran our tests again and could not see anything wrong, so we didn't have a lot to report. We kept asking if they had done anything to the system that could affect the performance of our queries. The answer was always no.
One morning I created a new instance of the database on their production hardware and re-ran the performance tests against it. Lo and behold, the performance matched what we had seen in Sydney. After some further examination of the production database I found a whole lot of triggers that had been added by the customer. This was why our queries were running so slowly - issue resolved! My boss angrily reported the facts to the customer - they had wasted a lot of our time. I think there were some mumbled apologies. Lesson learned: never trust someone when what they are telling you does not fit with what you are experiencing; go and investigate it for yourself if you can.
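(If you ever need to do a similar check, most databases let you enumerate triggers straight from their system catalogs rather than taking anyone's word for it. The sketch below is purely illustrative - it assumes a PostgreSQL database and the psycopg2 driver, which is not necessarily what we were running, and the connection details are placeholders - but it shows the kind of query that would have surfaced the problem in minutes.)

```python
# Illustrative sketch only: list every trigger visible in a database,
# assuming PostgreSQL and the psycopg2 driver (connection details are placeholders).
import psycopg2

conn = psycopg2.connect(host="prod-db-host", dbname="sync_db",
                        user="readonly", password="change-me")
try:
    with conn.cursor() as cur:
        # information_schema.triggers is defined by the SQL standard and is
        # available in PostgreSQL and several other databases.
        cur.execute("""
            SELECT event_object_table, trigger_name, action_timing, event_manipulation
            FROM information_schema.triggers
            ORDER BY event_object_table, trigger_name
        """)
        for table, name, timing, event in cur.fetchall():
            print(f"{table}: {name} ({timing} {event})")
finally:
    conn.close()
```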
Postscript: After the failed upgrade the customer did not want to upgrade the server and mobile software again. To this day, over five years later, I think they are still running the old version of the software. Last year they approached us to port the mobile software to a new device. When we suggested that they upgrade to our new version of the server, which is probably a hundred times faster than the version they are running, their response was "no thanks".