Although Microsoft responded in a timely manner to an outage of its Azure Compute service last week, it was left having to admit to a software bug that really should not have occurred.
In a post on the Windows Azure blog, Microsoft conceded that the series of outages its Azure cloud platform suffered on Wednesday was caused by software that did not correctly handle the leap year.
Bill Laing, Microsoft Vice President for Server and Cloud, said that although the investigation was still under way, the most likely cause appeared to be February’s extra day:
"Windows Azure operations became aware of an issue impacting the compute service in a number of regions. The issue was quickly triaged and it was determined to be caused by a software bug. While final root cause analysis is in progress, this issue appears to be due to a time calculation that was incorrect for the leap year. Once we discovered the issue we immediately took steps to protect customer services that were already up and running, and began creating a fix for the issue."
It is, of course, highly embarrassing that a company of Microsoft’s size and resources apparently still hasn’t worked out how to make its software cope with a leap year, and episodes like this do nothing to increase public trust in IT. On the other hand, most of us who write code have made a mess of date arithmetic (leap years included) at some point; it’s just that we have usually only had to admit the mistake to a few people, not half the world.
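Microsoft’s post does not show the faulty code, so the following is not a reconstruction of it. It is a hypothetical Python sketch of one classic leap-year slip in date arithmetic: computing "one year from today" by simply bumping the year, which breaks when today is February 29. The function names are invented for illustration. Clamping to February 28 is only one possible policy; some systems roll forward to March 1 instead.

```python
import calendar
from datetime import date

def add_one_year_naive(d: date) -> date:
    # Hypothetical naive approach: keep month and day, bump the year.
    # Raises ValueError for February 29, since the next year has no such day.
    return d.replace(year=d.year + 1)

def add_one_year_clamped(d: date) -> date:
    # One possible fix (an assumed policy, not a universal rule): if the same
    # month/day does not exist in the target year, use the last valid day
    # of that month instead.
    target_year = d.year + 1
    days_in_month = calendar.monthrange(target_year, d.month)[1]
    return date(target_year, d.month, min(d.day, days_in_month))

leap_day = date(2012, 2, 29)
print(add_one_year_clamped(leap_day))  # 2013-02-28

try:
    add_one_year_naive(leap_day)
except ValueError as exc:
    print("naive version failed:", exc)
```

The naive version works on 365 days out of 366 in a leap year, which is exactly why it can survive testing that only uses "typical" dates.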
It is a little late for a New Year’s resolution but, in the spirit of getting dates wrong, let’s all try to remember to test the outlier cases as well as the median. Even though a leap year comes round only about once every four years, and affects only one day in one month, it is still an important test case.
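As a small illustration of testing the outliers, here is a minimal Python sketch of the Gregorian leap-year rule with a handful of edge-case checks. The function name `is_leap_year` and the chosen years are illustrative assumptions. Note that century years such as 1900 and 2100 are *not* leap years while 2000 is, so "every four years" is itself an approximation worth testing.

```python
import calendar

def is_leap_year(year: int) -> bool:
    # Gregorian rule: divisible by 4, except century years,
    # which are leap years only if divisible by 400.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

# Typical and outlier cases: an ordinary year, an ordinary leap year,
# a century that is not a leap year, and one that is.
cases = {2011: False, 2012: True, 1900: False, 2000: True, 2100: False}

for year, expected in cases.items():
    assert is_leap_year(year) == expected, year
    # Cross-check against the standard library's own implementation.
    assert calendar.isleap(year) == expected, year
    # February's length should follow from the rule.
    assert calendar.monthrange(year, 2)[1] == (29 if expected else 28), year

print("all leap-year cases pass")
```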