Thursday, April 23, 2009

The 3 Cs: Configuration, configuration and configuration

One of the biggest causes of issues is configuration parameters. I try to drill the three Cs into all the young engineers I work with: configuration, configuration and configuration.

I was approached by one of the operations staff recently about a performance problem that had started occurring in the production environment. Some of the queries to validate data were taking a lot longer than they should, and this was happening intermittently. The guy was concerned because they had recently migrated the system from a physical server to a new virtual server, and he was worried there was some unexpected issue with the server setup. So I knew something must have changed. I didn't think it was the virtual server itself because we have other deployments on virtual servers.

I wanted to have a look at the log files for the server because we had introduced performance logging around the operations in question. This would allow me to confirm what the customer was saying. Anyway, he wanted to look at some other stuff first and I had other things to do, so I left it. A while later he approached me again saying that he needed more help.

So I opened up the log files and was able to confirm that some of the operations were running slowly intermittently. The operations normally complete in about 300ms so when they start taking 25 seconds people notice. It also looked like the customer operators were hitting the buttons again because they thought the first button press had not worked.

What I noticed was that every time the operation ran slowly, a new connection to the database was also being opened. Our database server is configured to do reverse DNS lookups for client connections. I don't know why this is, but it does mean a new connection sometimes takes 20 seconds or more when the client machine does not have a DNS entry. The operations guy said all the production servers had DNS entries, the DNS was configured and other database connections were happening quickly. I had a look at the JDBC connection string. It was wrongly configured to point to the old database server, not the new one. The old database server does not use the same DNS server as the new database server. When the system was migrated the database parameters were not updated - Issue resolved. Once again the culprit was the 3 Cs.
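A check along these lines at startup is one way a stale connection string could be caught during a migration. It is only a minimal sketch: it assumes the JDBC URL has the common `jdbc:<driver>://host:port/database` shape, and the function names, host names and URL are all hypothetical rather than taken from the real system.

```python
from urllib.parse import urlsplit


def jdbc_host(jdbc_url):
    """Return the host name embedded in a JDBC URL.

    Assumes the common 'jdbc:<driver>://host:port/database' form;
    other vendor-specific forms are not handled by this sketch.
    """
    if not jdbc_url.startswith("jdbc:"):
        raise ValueError("not a JDBC URL: " + jdbc_url)
    host = urlsplit(jdbc_url[len("jdbc:"):]).hostname
    if host is None:
        raise ValueError("no host found in: " + jdbc_url)
    return host


def check_database_host(jdbc_url, expected_host):
    """Return a warning string if the URL points at an unexpected host, else None."""
    actual = jdbc_host(jdbc_url)
    if actual.lower() != expected_host.lower():
        return "database host is %s, expected %s" % (actual, expected_host)
    return None


# Hypothetical values: a connection string left over from the old server.
print(check_database_host(
    "jdbc:postgresql://db-old.example.com:5432/scheduling",
    "db-new.example.com"))
```

Logging the warning when the server starts would put the mismatch in the same log file we were already reading to find the slow operations.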

It reminded me of another issue that had come up recently, when a test server was being upgraded. Someone had made a mistake and configured the server with the configuration files for the production server. I was called over when they could not work out why all the data was going into the production database, even though no users were connecting to the production server and all the users seemed to be connecting to the test server. Again the culprit was the 3 Cs.
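One cheap guard against this kind of mix-up is to have every configuration file declare which environment it was written for, and to refuse to start if that does not match the server. The sketch below assumes a simple `key=value` properties format and a hypothetical `environment` key; neither is a description of our actual configuration files.

```python
def parse_properties(text):
    """Parse simple 'key=value' lines, skipping blanks and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def check_environment(config, expected_environment):
    """Raise if the configuration was written for a different environment."""
    declared = config.get("environment")
    if declared is None:
        raise RuntimeError("configuration does not declare an environment")
    if declared != expected_environment:
        raise RuntimeError(
            "configuration is for '%s' but this server is '%s'"
            % (declared, expected_environment))
    return True


# Hypothetical example: production settings copied onto a test server.
copied_config = parse_properties("# copied by mistake\nenvironment = production\n")
try:
    check_environment(copied_config, "test")
except RuntimeError as error:
    print("refusing to start:", error)
```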

Wednesday, April 15, 2009

Take a step back

I often have to tell engineers who are very focussed on an issue to “take a step back” and think about why they are seeing/having the issue.

Recently an issue was noticed with one of the dialogs on our scheduling product. The issue only occurred on Firefox 3. The dialog displays a list of items and each item’s row height was growing to fill the available space in the dialog rather than the row height being set based on the height of the contents. It was only a cosmetic problem. One of the engineers tracked the cause back to a known issue in Firefox 3 with table rendering. I asked him if there were any known workarounds and he said no, we would just have to wait until the issue was resolved in Firefox.

Although it was only a cosmetic problem it bugged me! We were not doing anything special on this dialog. The dialog just displayed a list of items in an HTML table. If there was a widespread problem with table rendering in Firefox 3, then everyone would be talking about it. Take a step back and think about it – why were we being affected by this very specific issue?

On a Friday afternoon when I had a few minutes spare I got one of my teammates to show me the Firefox issue report, which detailed the very specific situation that led to the problem. The problem occurred when the table or cell height was explicitly set. This was something we did not need to do, so we should not have been having the problem. Our table was being dynamically generated with JavaScript, so we had a look at the relevant code and found it was setting the height of the table body element, which was not needed in our case. We patched the code on a server and tested the dialog again and it displayed perfectly – Issue Resolved.

It took about 10 minutes to resolve this issue, which had been sitting around for a couple of months because no one had taken a step back to ask why we were having this issue. We knew what caused it but had decided it was a Firefox issue and would wait for it to be fixed.

The speed with which it was resolved in the end was due to:

  • Good system knowledge – we knew where the code was that was causing the issue
  • A reproducible problem
  • Two heads are better than one.

Wednesday, April 8, 2009

The curse of daylight savings time

Daylight saving is a nightmare for software companies in the south-eastern Australian states. The reason for this is that the state governments have been modifying the start and end dates of daylight saving for the last few years. This all started in 2000 for the Sydney Olympics.

This year the end of daylight saving has been moved back a week. I saw in the paper that some people had glitches with alarms on mobile phones waking them up an hour later.

We had an issue reported where one of our customers could not see some of their field worker jobs on their scheduling screens. The support desk had been looking at it, had confirmed there was an issue and was escalating it to engineering.

I asked the normal questions:

  • Has anything changed? - This was the first time this issue had been reported and this customer's system had been running for about six months. No, nothing had been changed that day.
  • Have you checked the log files? - They were still looking at it. It always amazes me how long it takes people to check the log files.

I checked the log files and there was no indication of the cause of the issue, although something had gone wrong at 11:30AM, when the customer's browsers were reporting an issue with the data. The error reported was a pretty vague JavaScript error.

Could we replicate it? I got the identifier for a job that was not appearing on the scheduling screen, opened a scheduling screen in Firefox, and the job was there. However, the customer runs IE7, and when I accessed the screen in IE7 the page failed to load. I got one of the other engineers to load the same screen on a machine with a debugger attached and we found part of the puzzle - the job had a negative duration because the start time for the job was later than the end time. The duration controls the width of the box used to display the job on the screen, so a negative duration was not good. We set the start time for the job when the field worker starts the job and the end time when they complete it. Why would they fiddle with the clock on their tablet during a job? They shouldn't change the time at all. Then it clicked: daylight savings. The support desk rang the technician and confirmed that he had changed the tablet clock during the job. The clock on his tablet was wrong because something thought daylight savings had already ended.
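The failure mode is easy to show in a few lines. The sketch below is not our display code; the function name, the pixels-per-minute scale, the minimum width and the timestamps are all made up for illustration. It shows one way a display routine could flag a job whose end is before its start instead of handing a negative width to the browser.

```python
from datetime import datetime


def job_box_width(start, end, pixels_per_minute=2.0, min_width=4.0):
    """Return (width_in_pixels, is_suspect) for a job's box on the schedule.

    A job whose end is before its start is flagged as suspect and drawn at
    the minimum width instead of being given a negative width.
    """
    minutes = (end - start).total_seconds() / 60.0
    if minutes < 0:
        return min_width, True
    return max(minutes * pixels_per_minute, min_width), False


# Hypothetical job: started at 11:30, then the tablet clock was put back
# an hour before the job was completed, so the recorded end is 11:10.
start = datetime(2009, 4, 1, 11, 30)
end = datetime(2009, 4, 1, 11, 10)
print(job_box_width(start, end))
```

Flagging the job rather than silently correcting it keeps the bad data visible, so the support desk can still chase up why the clock was changed.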

We updated the database entries for the job so that it now had a positive duration - Issue resolved.

The problem recurred half an hour later when another field worker did exactly the same thing. We applied the same fix - Issue resolved again.

We will see what happens next year.