OK, I should be in bed by now. But as I was browsing my blogroll, I noticed Dave Sifry's post relating his not-so-cool weekend spent fixing Technorati's infrastructure after a fire at his colo. Unusual? Unique? Hardly so...
This must be the tenth story I've heard about a colo facility that (supposedly) had all the required redundancy to "ensure" reliability, including the (infamous) diesel generators that are meant to kick in and take over from the short-term UPSs. The problem is that those generators never seem to actually kick in (I hope hospitals use a different brand than office buildings and datacenters do).
One of my former portfolio companies, an ASP, faced exactly the same issue; it did not go public about it, but wrote to its rather unhappy clients, and the note from the CEO contained statements very similar to Dave's. Here is a brief excerpt, from which I have only removed named references:
On Monday morning at 9.45am there was a complete power outage at ..., our datacenter provider. Although our Uninterruptible Power Supplies (UPSs) were triggered and ran, ...’s diesel generators failed to start before the batteries in the UPSs ran out. As a result, all our servers abruptly lost power. The power was restored by 10.45am, but some severe damage had been done to our infrastructure by the abrupt power failure, and the surge which took place when power was restored.
[Pages of detailed explanations deleted]
On behalf of the whole ... team, I would like to apologize most sincerely to our clients for this severe lapse in our service. Rest assured that we will be working extremely hard in the coming days and weeks to review every aspect of the resilience of our service, and to ensure that an incident like this cannot happen again.
Interestingly similar, ain't it?
As for the underlying issue, the disaster recovery plan: it seems to be a common mistake to believe that having some level of redundancy leads to reliability. The result is that a lot of time is spent designing a technical infrastructure (comms, power, servers, disks, backups,...) that will "always" work, as opposed to defining the processes and procedures to be applied when the s..t hits the fan, the database is completely corrupted, and the servers won't restart. In other words, when Murphy's law kicks in (aka the "buttered slice of bread theorem", or "Théorème de la Tartine Beurrée", which states that if you butter a slice of bread and the slice falls, it has a %$@^%*$ tendency to repeatedly land on the wrong side).
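To make that last point a bit more concrete, here is a purely illustrative sketch (in Python, with made-up hostnames, ports and commands) of what I mean by a written-down procedure rather than more hardware: a small post-outage checklist a team could actually run once power comes back, instead of discovering piecemeal that something didn't restart.

#!/usr/bin/env python3
"""Illustrative post-outage checklist -- every hostname, port and command
below is a made-up placeholder. The point is that the recovery procedure
is written down and executable, not left in someone's head."""

import socket
import subprocess
import sys

# Hypothetical critical dependencies; a real runbook would list them all.
TCP_CHECKS = [
    ("db.internal.example.com", 5432),    # primary database
    ("app01.internal.example.com", 443),  # application front end
]

def tcp_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def database_answers():
    """Placeholder sanity check: run a trivial read-only query.
    A real procedure would run the vendor's integrity checker here."""
    try:
        result = subprocess.run(
            ["psql", "-h", "db.internal.example.com", "-c", "SELECT 1;"],
            capture_output=True, timeout=30,
        )
        return result.returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False

def main():
    failures = []
    for host, port in TCP_CHECKS:
        if not tcp_reachable(host, port):
            failures.append("unreachable: %s:%d" % (host, port))
    if not database_answers():
        failures.append("database did not answer a trivial query")

    if failures:
        # Escalate to a human -- the written procedure says who gets paged,
        # in what order, and what the fallback is if nobody answers.
        print("RECOVERY INCOMPLETE:\n  " + "\n  ".join(failures))
        return 1
    print("All post-outage checks passed.")
    return 0

if __name__ == "__main__":
    sys.exit(main())

Trivial as it looks, the exercise of writing such a list is what forces the "what if the generator doesn't start" conversation to happen before the outage rather than during it.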