Well, put simply, we had a bad day on Wednesday. Frankly, no one has told us they definitively noticed the downtime, but we’ve said all along that we would be direct and clear, and so we shall.
On Wednesday, we were doing a bit of housekeeping. Test Track users upload a lot of content, so periodically we run a manual process to seek out orphaned courses and those that haven’t been accessed in any way during the last six months. In doing this, we’re able to limit the vast quantity of SCORM content we have to keep up with.
Well, when we went to hit “delete” on the old, unaccessed courses, things went badly. We run multiple Amazon Web Services Elastic Compute Cloud (EC2) instances. First, the failure was noticed by a secondary instance, which discovered it could no longer access content and told us via email. Second, the instances, in automated coordination, attempted to rectify the problem. Third, we discovered that this self-healing process had failed and that the Elastic Block Store (EBS) volume which houses our content could not be remounted, due to unexpected file system corruption.
An instance going down is no problem… we have more. In fact, this happens occasionally and is handled without human intervention. The EBS volume going down and becoming unmountable… that’s a problem. Ultimately, this meant that people were unable to access any of the content hosted on SCORM Cloud, including all Test Track content, until we intervened and mounted an older volume… one that was known to function. This was done quickly: old content and new uploads were available in less than 30 minutes. But here’s the kicker… content uploaded between December 10, 2009 and January 6, 2010 wasn’t available. This incident led us to discover a flaw in our backup scheme that meant recovering that content wasn’t a 10-minute job… in fact, it required recovering from a fatal flaw in the file system we use. The reconstruction/recovery process was kicked off right away, and all content was restored some 11 hours later. So, as of 2am CT on January 7, all content was restored, all users were made whole, and all was well, in a manner of speaking.
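For the curious, the detect/heal/alert sequence described above can be sketched in a few lines. To be clear, this is our illustration of the general pattern, not the actual SCORM Cloud monitor; every name in it is made up:

```python
def check_content_accessible(fetch):
    """Probe the content volume by reading a known sentinel object.

    `fetch` is any callable that returns the object's contents, or None
    if it can't be found. Returns True only when the read succeeds.
    """
    try:
        return fetch("healthcheck/sentinel.txt") is not None
    except OSError:
        return False


def handle_failure(attempt_remount, alert):
    """Mirror the incident sequence: try to self-heal, then page a human."""
    if attempt_remount():
        return "recovered"
    alert("EBS volume could not be remounted; manual intervention required")
    return "needs-human"


# Simulated incident: the volume is corrupt, so the automated remount fails
# and the secondary machine "emails" us instead.
alerts = []
status = handle_failure(attempt_remount=lambda: False, alert=alerts.append)
```

The key design point is the last step: self-healing is attempted first, and a human is only paged when it fails, which is exactly what happened on Wednesday.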
Well, once we addressed the symptom and had everyone up and running again, we thought we should do a bit of analysis. We didn’t like how everything played out, and we didn’t like that people were down for varying amounts of time, so we went digging. It was time for a little game of 5 Whys.
To play 5 Whys, we started by asking, “What happened?”

- We received notification from a secondary machine of an EC2 Failover Event on the primary (a.k.a. “The Sh*t Hit the Fan”).
- Why? The primary couldn’t access the EBS volume (where the content is stored).
- Why? Something caused the EBS volume’s XFS file system to crash/become corrupt.
- Why? The EBS volume’s file system had an inconsistency (one we’ve since found dates back months), and a series of aggressive deletes was called in succession from a secondary machine.
That last “why” raises many questions…
- Why was there a series of aggressive deletes? Did we need to be purging courses?
- Why was there an inconsistency in the file system dating back several months?
- Why does XFS have trouble freeing space in certain circumstances? Should we continue to use XFS?
Here, though, is a more interesting/actionable string of 5 Whys…
- Content uploaded between Dec 10 and January 5 was unavailable for 11 hours.
- Why? The EBS volume’s file system failed, and our backup scheme didn’t allow for immediate or near-immediate recovery of recently uploaded files.
- Why? Our recovery scheme relied on reconstructing the drive rather than simply restoring from a more frequent/recent snapshot.
- Why? Because we didn’t consider this eventuality sufficiently. We made a mistake.
- How do you remedy that mistake? We have already changed our scheme to persist remountable EBS content volumes hourly. This means that we can return to a snapshot that is no more than an hour old in a matter of minutes.
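To make the hourly scheme concrete, here’s a minimal sketch of the restore-selection and pruning logic it implies. The function names and the 24-hour retention window are assumptions for illustration, not details of our actual implementation:

```python
from datetime import datetime, timedelta


def latest_snapshot_before(snapshots, failure_time):
    """Return the most recent snapshot taken at or before failure_time.

    `snapshots` is a list of datetimes, one per hourly snapshot.
    Returns None when no snapshot predates the failure.
    """
    candidates = [s for s in snapshots if s <= failure_time]
    return max(candidates) if candidates else None


def prune_snapshots(snapshots, now, keep_hours=24):
    """Drop snapshots older than the (assumed) 24-hour retention window."""
    cutoff = now - timedelta(hours=keep_hours)
    return sorted(s for s in snapshots if s >= cutoff)


# Hourly snapshots through January 6, then a failure at 14:30:
snaps = [datetime(2010, 1, 6, hour) for hour in range(24)]
restore_point = latest_snapshot_before(snaps, datetime(2010, 1, 6, 14, 30))
# restore_point is the 14:00 snapshot, so at most ~30 minutes of uploads are at risk.
```

With a scheme like this, the worst case after a volume failure is losing the current partial hour of uploads, rather than weeks of content.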
So, in total, we had ourselves a bad day on Wednesday. Did we recover completely? Yes. We’re pleased with that. Did we do so as quickly as we feel we should? We did not. Hopefully none of you were actually impacted. If you were, we’re sorry. If you weren’t, we hope we’ve taken the right steps to make you feel comfortable about our approach to mistakes.
We are considering the possibility of an update to Test Track and want to get your feedback on what it should include. Please take a few minutes to take our Test Track Improvements Survey.
When we released SCORM Test Track a few years ago, we had no idea it would be this popular. We currently have over 10,000 users with dozens more signing up every day. Test Track has come to be a critical application for many in the community and we take that responsibility seriously. Please help us to make it even more valuable.
To travel hopefully is a better thing than to arrive, and the true success is to labor.
–Robert Louis Stevenson
OK, fine, he’s right. Creating the new version of a product is fun. But shipping it, polishing it, finishing it… that’s pretty awesome too.
Today is the confluence of a bunch of different work around Rustici Software.
- SCORM Engine? 2009.1 released
- SCORM Driver? A new release including SCORM 2004 4th Edition
- SCORM TestTrack? 2009.1 has already been applied, so this is fresh and clean too.
I’m psyched. Great new versions of old products out, and a clean slate to start on some new stuff. Thanks to the guys at the office, and, importantly, the cookie intern.
Last night we released an implementation of SCORM 2004 4th Edition to the public TestTrack server. For all of you chomping at the bit to take advantage of the new features in 4th Edition, now’s your chance to give it a whirl. Ok, so maybe it’s not all that exciting, but we’re happy to have it out there. As far as we know, we are the first to release a 4th Edition conformant LMS product. Our plan is to make 4th Edition available to all of our active SCORM Engine customers as soon as ADL opens up certification for 4th Edition (last indication was that would be in August).
Note: Details of the 4th Edition changes are available here.
Yes, as we’ve mentioned recently, several things are different about TestTrack. We upgraded our server, we applied a skin that makes it look like our new website, and most recently, we switched TestTrack over to our “beta” of the SCORM Cloud.
Truth be told, I’m remarkably happy with how well TestTrack has held up through all of these iterations. It’s undoubtedly a testament to the developers who are working on it. The SCORM Cloud, while built around existing SCORM Engine code, is really a very different architecture from what we had been running. What’s different, you may ask?
- Use of Amazon’s SimpleDB for storing aggregate registration and package information
- Use of Amazon’s S3 storage for per-registration detail
- Use of memcached to address certain eventual consistency issues with S3
- Use of Amazon’s Elastic Block Store for persisting content and managing the FTP access to TestTrack (read about that here)
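That memcached item merits a quick illustration. S3 at the time was eventually consistent, so a read immediately after a write could return stale data; putting a consistent cache like memcached in front of it is one common workaround. Here’s a rough sketch of that write-through pattern (our illustration, not the actual SCORM Cloud code), with plain dicts standing in for memcached and S3 so it runs anywhere:

```python
class ConsistentStore:
    """Write-through cache over an eventually consistent store.

    `cache` stands in for memcached and `s3` for an S3 bucket; both are
    plain dicts here so the sketch runs without AWS credentials.
    """

    def __init__(self):
        self.cache = {}  # fast and consistent (memcached stand-in)
        self.s3 = {}     # durable but eventually consistent (S3 stand-in)

    def put(self, key, value):
        # Write to S3 for durability, then to the cache so the value is
        # immediately readable even before S3 reads converge.
        self.s3[key] = value
        self.cache[key] = value

    def get(self, key):
        # Prefer the cache; fall back to S3 for keys that have been evicted.
        if key in self.cache:
            return self.cache[key]
        return self.s3.get(key)


store = ConsistentStore()
store.put("registration/42", "completed")
value = store.get("registration/42")  # served from the cache, never stale
```

Real memcached entries expire, which is fine here: by the time a key has been evicted, S3 has almost certainly converged, so the fallback read is safe.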