Now we know why Microsoft suffered an outage of its Azure Compute service last month. And it really was the dumbest sort of error!
If you are like the majority of programmers, you can't help but want to know about other people's bugs and snigger at how stupid they were. When we heard that Microsoft's Azure service crashed because of a leap year bug, we probably all felt that, well, dates are hard and any of us could have made such an error. So, after a little gloating and reflecting on bugs in general, we probably went about business as usual.
Now we can enjoy the whole thing over again as Microsoft gives details of the exact nature of the date error - and it is about as silly as it gets. To quote from Azure’s Bill Laing:
When the GA creates the transfer certificate, it gives it a one year validity range. It uses midnight UST of the current day as the valid-from date and one year from that date as the valid-to date. The leap day bug is that the GA calculated the valid-to date by simply taking the current date and adding one to its year. That meant that any GA that tried to create a transfer certificate on leap day set a valid-to date of February 29, 2013, an invalid date that caused the certificate creation to fail.
Don't worry about what is going on with certificates and so on; concentrate on the important part:
The leap day bug is that the GA calculated the valid-to date by simply taking the current date and adding one to its year
It doesn't get more embarrassing than this.
To be fair, adding one to the year in isolation in this way only goes wrong on February 29th, but notice that on that day it always goes wrong; there are no other complicating factors. The programmer concerned must have been completely unaware of leap years while working on the problem.
The only safe way to work with dates is to convert the multi-base date into days from a fiducial point, do whatever arithmetic you need to do, and then convert back to a valid multi-base date. However, this makes operations like deriving a date that is one year or one month on more difficult - what do you add, 365 or 366?
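As an illustration only (this is not the Azure guest agent's code, which Microsoft has not published here), the failure mode is easy to reproduce in Python. The function names `naive_add_year` and `safe_add_year` are hypothetical, and clamping to February 28th is just one possible policy; rolling forward to March 1st would be equally defensible.

```python
from datetime import date

def naive_add_year(d: date) -> date:
    # Mirrors the described bug: bump the year, keep month and day as-is.
    # Raises ValueError for Feb 29 because the target year has no such day.
    return d.replace(year=d.year + 1)

def safe_add_year(d: date) -> date:
    # Assumption: when the same calendar day does not exist next year,
    # fall back to Feb 28. Only Feb 29 can trigger this branch.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)

if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    try:
        naive_add_year(leap_day)
    except ValueError as err:
        print("naive version failed:", err)
    print("safe version:", safe_add_year(leap_day))
```

Whichever fallback is chosen, the point is that the leap day case has to be a deliberate decision rather than something left to chance.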
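A minimal sketch of the convert, compute, convert back approach, using Python's standard `datetime` module. The helper name `add_days` is made up for this example; `toordinal` and `fromordinal` are the standard library's day-count conversions. It also shows why "one year on" is ambiguous once you are counting days.

```python
from datetime import date

def add_days(d: date, days: int) -> date:
    # Convert the year/month/day date to a plain day count
    # (January 1 of year 1 has ordinal 1), do the arithmetic there,
    # then convert back. The result is always a valid calendar date.
    return date.fromordinal(d.toordinal() + days)

if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    print(add_days(leap_day, 365))  # 2013-02-28
    print(add_days(leap_day, 366))  # 2013-03-01
```

Neither answer is February 29, 2013, which is exactly the point: the day count can never produce an invalid date, but the code still has to decide what "a year" means.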
Interestingly, another of Microsoft's big public bugs, when the Zune (the music player it no longer supports) fell over, was also due to a leap year error. The mysterious part is that in this case the bug caused the Zune to become a brick on December 31st, 2008 - but the sort of problem that caused it becomes easier to guess at when you realize that this is the last day of a leap year. Yes, it's also due to the 365 or 366 days in a year problem.
You would think that after the Zune bug caused so much trouble, Microsoft might have a directive in force that all date software was to be checked for leap year problems.
In the case of the Azure problem, Microsoft has decided to refund 33% for February, irrespective of whether or not any problems were caused to the customer by the SSL certificates not working. The whole reliability of cloud infrastructure still depends on the ability of the programmer to create bug-free or bug-tolerant code - it's not just down to UPSs and backup servers.
There is more information on how the bug was handled in the Azure blog and it too is an interesting story of panic under pressure leading to incompatibilities with new software:
Unfortunately, in our eagerness to get the fix deployed, we had overlooked the fact that the update package we created with the older HA included the networking plugin that was written for the newer HA, and the two were incompatible.
It is refreshing, and to its credit, that Microsoft decided to share so much information. It also outlined the steps it is taking to make sure this never happens again. This particular bug might not happen again, but my guess is that this isn't the last time the leap year will generate a bug.