Well, put simply, we had a bad day on Wednesday. Frankly, I haven’t heard from anyone who definitively noticed the downtime, but we’ve said all along that we would be direct and clear, and so we shall.

<AdmissionOfFailure>

On Wednesday, we were doing a bit of housekeeping. Frankly, Test Track users upload a lot of content. Periodically, through a manual process, we seek out orphaned courses and those that haven’t been accessed at all in the last 6 months. In doing this, we’re able to limit the vast quantity of SCORM content we have to keep up with.
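
For the curious, that housekeeping pass amounts to something like the sketch below. To be clear, this isn’t our actual tooling; the content path and the 6-month cutoff are placeholders standing in for whatever a real content store would use.

```python
# Illustrative sketch only, not our actual purge tooling.
# Walks a hypothetical content directory and lists course packages
# whose most recent access time is older than roughly six months.
import os
import time

CONTENT_ROOT = "/mnt/content/courses"   # hypothetical mount point
SIX_MONTHS = 60 * 60 * 24 * 180         # ~6 months in seconds


def find_stale_courses(root=CONTENT_ROOT, max_age=SIX_MONTHS):
    cutoff = time.time() - max_age
    stale = []
    for entry in os.listdir(root):
        path = os.path.join(root, entry)
        # Use the newest access time found anywhere inside the course folder.
        newest_access = max(
            (os.stat(os.path.join(dirpath, name)).st_atime
             for dirpath, _, names in os.walk(path) for name in names),
            default=os.stat(path).st_atime,
        )
        if newest_access < cutoff:
            stale.append(path)
    return stale


if __name__ == "__main__":
    # Review the list before deleting anything. As we learned, a pile of
    # aggressive deletes deserves more care than we gave it.
    for course in find_stale_courses():
        print(course)
```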

Well, when we went to hit “delete” on the old, unaccessed courses, things went badly. We run multiple Amazon Web Services Elastic Compute Cloud (EC2) instances. First, a secondary instance noticed the failure: it discovered it could no longer access content and alerted us by email. Second, the instances, in automated coordination, attempted to rectify the problem. Third, we discovered that this self-healing process had failed, and that the Elastic Block Store (EBS) volume that houses our content could not be remounted, due to unexpected file system corruption.
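
That first step, a secondary instance noticing it can’t read the content and firing off an email, is nothing exotic. It’s roughly the watchdog sketched below. Again, this is illustrative rather than our production code; the mount point, the addresses, and the local mail relay are all assumptions.

```python
# Illustrative sketch of the kind of watchdog a secondary instance might run.
# Paths and addresses are placeholders, not our real configuration.
import os
import smtplib
from email.message import EmailMessage

CONTENT_MOUNT = "/mnt/content"          # hypothetical EBS mount point
ALERT_FROM = "watchdog@example.com"     # placeholder addresses
ALERT_TO = "ops@example.com"


def content_is_reachable(path=CONTENT_MOUNT):
    """Return True if the content volume is mounted and readable."""
    try:
        return os.path.ismount(path) and bool(os.listdir(path))
    except OSError:
        return False


def send_alert(subject, body):
    msg = EmailMessage()
    msg["From"] = ALERT_FROM
    msg["To"] = ALERT_TO
    msg["Subject"] = subject
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:   # assumes a local mail relay
        smtp.send_message(msg)


if __name__ == "__main__":
    if not content_is_reachable():
        send_alert("EC2 Failover Event",
                   "Secondary instance can no longer read the content volume.")
```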

An instance going down is no problem… we have more. In fact, this happens occasionally, and it’s handled without human intervention. The EBS volume going down and becoming unmountable… that’s a problem. Ultimately, it meant that people were unable to access any of the content hosted on SCORM Cloud, including all Test Track content, until we intervened and mounted an older volume… one that was known to function. This was done quickly; old content and new uploads were available in less than 30 minutes. But here’s the kicker… content uploaded between December 10, 2009 and January 6, 2010 wasn’t available. This incident led us to discover a flaw in our backup scheme that meant recovering that content wasn’t a 10-minute job. In fact, it required recovering from a fatal flaw in the file system we use. The reconstruction/recovery process was kicked off right away, and all content was restored some 11 hours later. So, as of 2 a.m. CT on January 7, all content was back, all users were made whole, and all was well, in a manner of speaking.
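
For those wondering what “mounted an older volume” involves, here is a rough, present-day sketch of those steps using boto3. These are not the exact commands we ran that night, and the snapshot ID, instance ID, availability zone, and device name are placeholders.

```python
# Present-day sketch of "mount an older volume that was known to function":
# build a volume from a known-good snapshot and attach it to the primary.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

KNOWN_GOOD_SNAPSHOT = "snap-0123456789abcdef0"   # placeholder snapshot ID
INSTANCE_ID = "i-0123456789abcdef0"              # placeholder instance ID
DEVICE = "/dev/sdf"                              # placeholder device name

# 1. Build a fresh volume from the last snapshot known to be healthy.
volume = ec2.create_volume(
    SnapshotId=KNOWN_GOOD_SNAPSHOT,
    AvailabilityZone="us-east-1a",
)
volume_id = volume["VolumeId"]
ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

# 2. Attach it to the instance; mounting the file system is then an
#    ordinary mount command on the instance itself.
ec2.attach_volume(VolumeId=volume_id, InstanceId=INSTANCE_ID, Device=DEVICE)
print(f"Attached {volume_id} to {INSTANCE_ID} at {DEVICE}")
```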

</AdmissionOfFailure>

<PostgameAnalysis>

Well, once we addressed the symptom and had everyone up and running again, we thought we should do a bit of analysis. We didn’t like how everything played out, and we didn’t like that people were down for varying amounts of time, so we thought we’d go digging. It was time for a little game of 5 Whys.

To play 5 Whys, we started by asking, “What happened?”

We received notification from a secondary machine of an EC2 Failover Event on the primary (a.k.a. “The Sh*t Hit the Fan”).

WHY?

The primary couldn’t access the EBS volume (where the content is stored).

WHY?

Something caused the EBS volume’s XFS file system to crash/become corrupt.

WHY?

The EBS volume’s file system had an inconsistency (which we’ve since found dates back months), and a series of aggressive deletes was run in succession from a secondary machine.

WHY?

This why results in many questions…

  • Why was there a series of aggressive deletes? Did we need to be purging courses?
  • Why was there an inconsistency in the file system dating back several months?
  • Why does XFS have trouble freeing space in certain circumstances? Should we continue to use XFS?

Here, though, is a more interesting/actionable string of 5 Whys…

Content uploaded between December 10 and January 5 was unavailable for 11 hours.

WHY?

The EBS volume’s file system failed, and our backup scheme didn’t allow for immediate or near-immediate recovery of recently uploaded files.

WHY?

Our recovery scheme included reconstructing the drive, rather than simply using a more frequent/recent snapshot.

WHY?

Because we didn’t consider this eventuality sufficiently. We made a mistake.

HOW DO YOU REMEDY THAT MISTAKE?

We have already changed our scheme to persist a remountable snapshot of the EBS content volume every hour. This means we can return to a snapshot that is no more than an hour old in a matter of minutes.
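
In today’s terms, the mechanics of that hourly scheme look roughly like the sketch below. The volume ID is a placeholder, and in practice something like cron kicks this off every hour; treat it as an illustration of the idea, not our exact setup.

```python
# Roughly the mechanics of the new scheme: snapshot the content volume every
# hour so recovery is a restore, not a reconstruction.
# The volume ID is a placeholder; a scheduler runs this hourly.
import datetime
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
CONTENT_VOLUME = "vol-0123456789abcdef0"   # placeholder EBS volume ID


def snapshot_content_volume():
    stamp = datetime.datetime.utcnow().strftime("%Y-%m-%d %H:%M UTC")
    snap = ec2.create_snapshot(
        VolumeId=CONTENT_VOLUME,
        Description=f"Hourly content snapshot {stamp}",
    )
    print(f"Started snapshot {snap['SnapshotId']} of {CONTENT_VOLUME}")
    return snap["SnapshotId"]


if __name__ == "__main__":
    snapshot_content_volume()
```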

</PostgameAnalysis>

So, in total, we had ourselves a bad day on Wednesday. Did we recover completely? Yes. We’re pleased with that. Did we do so as quickly as we feel we should have? We did not. Hopefully none of you were actually impacted. If you were, we’re sorry. If you weren’t, we hope we’ve taken the right steps to make you feel comfortable about our approach to mistakes.

Tim is the chief innovation and product officer with our parent company LTG, though he used to be CEO here at Rustici Software. If you’re looking for a plainspoken answer to a standards-based question, or to just play an inane game, Tim is your person.