We’ve been having a rough couple of days here in IT land. We experienced two unexpected outages of our service in as many days. Every failure is a chance to learn something, so I’m going to talk about what happened and what we can do better going forward.
But first, the non-technical executive summary:
Non-Technical Executive Summary
Tuesday: Between 11:13 AM and 11:22 AM Eastern US Time, we experienced a system slowdown and then a brief service outage. Our database server had a hardware failure in which half of its memory became unavailable. We failed over to our redundant database server (we operate a database cluster), and after that everything was fine. The failover took just a couple of minutes.
Wednesday: Between 4:53 PM and 6:45 PM Eastern US Time, we experienced a system slowdown, then a brief outage for some customers and an extended outage for others. The root cause was completely different from Tuesday’s issue. On Wednesday, the Storage Area Network (SAN) on which our services rely degraded, slowing everything down. We mistook this for a database failure and opted to fail over to our newly rebuilt database server. The failover did not solve the problem. Because of the SAN’s slow disks, many of our largest customer databases took a very long time to complete the automatic consistency checks that follow a failover before they came back online. By 6:45 PM, our hosting provider had resolved the issue with the SAN and performance returned to normal.
Now, on to the details and the lessons learned.