Scalable Storage Using Amazon's Elastic Block Store

We’ve recently completed development of a hosted version of our SCORM Engine. In the coming weeks we will be transitioning TestTrack over to using the hosted Engine to enable much greater scalability than the single server install can currently provide. This project will involve liberal use of several of the Amazon Web Services, due largely to their ease of use, low cost and high scalability. Using the Elastic Compute Cloud (EC2) was an obvious choice for us, as we can easily create and destroy machines as our load fluctuates. Less obvious, however, was how we should go about storing content files. These were our requirements for a storage device:

1) easy ability to upload large files using standard FTP clients, and possibly other commonly available protocols. Files over 1 GB aren’t uncommon, and larger files than that are certainly possible.
2) real-time file updates (i.e. when I upload a new version of a file and then immediately request it, there should be no chance that I get back an older version)
3) small amount of storage for now (so it’s cheap) with the ability to grow to a few TB or more as demand requires it
4) storage should all be accessible at a single root location. We currently have files spread over two drives, which requires a small bit of one-off code to determine which files from which users go on which drive.
5) ability to access the content directly from any one of (potentially) several web servers
6) some form of backup and/or redundancy to prevent data loss

Amazon currently provides two different mechanisms for persistent storage: Simple Storage Service (S3) and Elastic Block Store (EBS). Each of these storage methods has its advantages, but at first look, neither will fulfill all of our requirements straight out of the box.

S3 provides limitless storage in “buckets” of up to 5 GB each, while only charging for the amount that you’re actually using (#3). It provides access to your files via HTTP from anywhere in the world (#5), while promising 99.99% availability because of its decentralized, redundant, fault-tolerant architecture (#6). However, it doesn’t directly support FTP (#1), and file uploads don’t necessarily propagate to other nodes instantaneously (#2). We’d also have to span multiple buckets, meaning that we’d have to track which customers were stored in each one (#4). We could potentially overcome the FTP issue by writing an FTP server that uses S3 as its back end, but that’s time-consuming and inelegant, and it makes the cost of switching to another protocol (like SFTP or SCP) extremely high. The other problems are inherent to the S3 architecture, so we’d just have to deal with those.

EBS behaves just like a traditional block storage system. You can think of an EBS volume as a virtual external hard drive attached to your virtual machine. File access is only limited by the security settings on your machine (#1), and files written to the device are immediately available (#2), just as they would be on a “regular” hard drive. Drives can range in size from 1GB to 1TB, and you can mount several EBS drives to one machine (up to 20, I believe), thus providing enough storage to meet our needs for the foreseeable future on a single server (#3). EBS volumes are only available in one Amazon Availability Zone, potentially making them less reliable than S3 storage. However, you can create snapshots of drives and store them in S3, whereafter you can restore those snapshots to any EBS volume in any availability zone. Since EBS volumes behave just like hard drives, you can also mirror them or take any number of other traditional steps to protect your data (#6). The downside to EBS is that you have to pay for all space that’s allocated to you, whether or not you’re actually using it (#3). Additionally, multiple drives mean multiple mount points (#4), and as with regular drives, they can only be attached to one machine at a time (#5).

Ultimately, we decided to go with EBS because all of its shortcomings can be overcome with widely available, common solutions. We can start with small volumes to keep the cost down, and then grow them whenever we need to by taking a snapshot and then restoring it to a new, larger volume (#3). We can overcome having to deal with multiple mount points by using LVM to join multiple volumes into one big one (#4), and we can overcome the single machine limitation by exposing the whole thing with NFS.

With all of the decisions out of the way, it’s now time to actually combine all of these things together. I found a good bit of info about each of these pieces in various places on the web, but it doesn’t appear that anyone has set out to put all of this in one stack (or if they have, they didn’t write about it). As such, I thought I’d take some extra time to record the things I’ve done to make all of these pieces work together.

LVM Config

Before we can begin setting up our storage solution, we first need a machine in the cloud to host the drives. The easiest way to start one up is to use the ElasticFox plugin for Firefox. If you’re not familiar with ElasticFox, go take a minute to play around with it and see how it works. We’ll be using it for quite a few different things throughout this document, so you’ll need to become familiar with it.

Open up ElasticFox and fire up a VM. At some point we may start using home-baked machine images so that everything we need is already installed, but right now we’re using a public Ubuntu 8.04 image from Alestic (ami-51709438). Since we’ll be installing most of the required packages as we go, these steps should work on other distributions as well, but I haven’t tried it.

Once your VM is up and running, take note of which Availability Zone it’s in. Now click over to the Volumes and Snapshots tab in ElasticFox and create two new volumes, making sure they’re in the same Availability Zone as your VM. For the purposes of this demo, I made my drives 1 GB each. Since this demo is starting off as a proof-of-concept, there’s really no use in paying for more storage until we’re actually going to use it. We can always come back and grow (or replace) those volumes later.

If you’re both good at math and observant, you’ve probably noticed that there’s not really any need for creating two separate elastic block drives that are only 1 GB each. Why not just create a single 2 GB drive? In practice, the single drive is probably the way to go. But one of the points of this exercise is to prove that we can combine two of these things together and make them look like one drive. If you’d rather just take my word for it, you can always go through these steps with just one drive, but if you’re looking for a (practically) infinitely growable, single-volume storage drive, you’ll still need to get LVM set up.

Once your drives are created, use ElasticFox to attach them to your VM. On the Ubuntu image that I’m using, partitions already exist as /dev/sda1, /dev/sda2, and /dev/sda3, so I attached my two drives at /dev/sdb and /dev/sdc to avoid any confusion. It’s probably a good idea to record which drive is attached as which device for future reference. I have tested reconnecting the drives as different devices without any problems, but my testing wasn’t thorough and I can’t guarantee that it’ll always work. If nothing else, it seems like a really bad idea to go switching them around, so I’d recommend coming up with some way to make sure they’re always connected as the same device.

Now SSH to your VM and connect as root. ElasticFox allows you to do this easily by right-clicking on a running VM and selecting “Connect to Public DNS Name.”

The stock image that I’m using doesn’t come with all of the LVM and NFS packages that we’ll need, so before we can begin configuring our drives, there are a few things that need to be installed. Let’s update the apt-get cache and then install everything that we’ll need for LVM.

> apt-get update
> apt-get install lvm2 dmsetup dmapi dmraid

Setting up LVM involves several steps, but they all make sense if you take a step back and look at an overview of what they’re actually doing. LVM allows you to take a group of physical drives and combine them into one giant virtual drive. You can then partition and/or use that drive in any way you see fit. This allows you to create partitions that are actually larger than any of the physical drives you’re using, and it also gives you the ability to expand your drive in the future. Based on that explanation, the following steps should be pretty self-explanatory.

First, create a physical volume on each drive. This allows them to be recognized by LVM. Most LVM tutorials that I’ve read say that you first need to partition the drive with an LVM partition, but that’s only the case if you plan to use parts of this drive for other purposes. In our case, we want LVM to use the entire drive, so we can just create the physical volume directly on it:

> pvcreate /dev/sdb /dev/sdc

Next, create a volume group, which tells LVM which physical volumes should be grouped together. As you would expect, you can add drives to or remove drives from this group in the future.

> vgcreate elastic_drive /dev/sdb /dev/sdc

Finally, create a logical volume (or multiple logical volumes) that can then be formatted and mounted just like a “regular” drive. Here, I’m specifying the size as 100% of the volume group, but you can also specify an absolute size if you’d rather.

> lvcreate -n content -l 100%VG elastic_drive
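
If you’d rather script these three steps, here is a minimal sketch of how they fit together. It only prints the commands rather than running them, so it is safe to try anywhere. The function name lvm_setup_plan is made up for this example and is not part of LVM.

```shell
#!/bin/sh
# Dry-run sketch: prints the LVM commands for joining several devices
# into one logical volume. Nothing is executed; review the output and
# run the commands as root yourself.
# Usage: lvm_setup_plan VOLUME_GROUP LOGICAL_VOLUME DEVICE...
# (lvm_setup_plan is a hypothetical helper, not an LVM tool.)
lvm_setup_plan() (
    vg=$1
    lv=$2
    shift 2
    if [ "$#" -eq 0 ]; then
        echo "lvm_setup_plan: at least one device is required" >&2
        exit 1
    fi
    # Register every device as an LVM physical volume.
    echo "pvcreate $*"
    # Group the physical volumes together.
    echo "vgcreate $vg $*"
    # Carve one logical volume out of the whole group.
    echo "lvcreate -n $lv -l 100%VG $vg"
)

lvm_setup_plan elastic_drive content /dev/sdb /dev/sdc
```

Printing the plan first makes it easy to double-check the device names before anything touches a disk.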

Now that we have a usable drive created, we’re finally ready to put a filesystem on it. There are a number of options you can use, the most popular of which is probably ext3. ReiserFS and XFS are also pretty popular. After doing a little bit of research, we decided to go with XFS because you can resize it without unmounting it (ext3 can be resized, but only after it’s been unmounted). You can also freeze it at any time to allow for safe snapshots, but LVM already provides that functionality. Before we can create our filesystem, though, we need to install the necessary XFS packages:

> apt-get install xfsprogs

To create our XFS filesystem:

> mkfs.xfs /dev/elastic_drive/content

The last step in creating our super-large, expandable drive is to create a mount point for the new drive and then mount it. Right now, I’m mounting it at /var/content.

> mkdir /var/content
> mount /dev/elastic_drive/content /var/content

To make sure that everything is working, let’s put a couple of files out on our giant drive.

> cp /etc/fstab /var/content
> cp /etc/rc.local /var/content

Now let’s check to make sure they disappear and reappear when they’re supposed to:

> ls /var/content

[should see your files listed]

> umount /var/content

> ls /var/content

[should get no results]

> mount /dev/elastic_drive/content /var/content

> ls /var/content

[files should be back]

OK, that’s progress, but we’re not finished yet. The final step is to make sure our mount persists when we reboot. You _should_ just be able to add the following line to /etc/fstab:

/dev/elastic_drive/content /var/content xfs defaults 0 0

… but that’s not working for me. For some reason, the logical volume isn’t coming up as active, so the mount fails when I reboot. If you’re having the same problem, here’s a little hack that’ll make it work. Just add the following two lines to your /etc/rc.local file:

lvchange -ay /dev/elastic_drive/content
mount /dev/elastic_drive/content /var/content

I’d highly recommend rebooting your server now and making sure that your mount comes back up. It’s much better to discover any problems now before you’re relying on this shared volume in a production environment.
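
One possible refinement of the rc.local hack is to check whether the mount point is already in use before activating and mounting, so the lines can be re-run without errors. The sketch below is only an illustration: is_mounted and mount_plan are names I’ve invented, and the plan is printed rather than executed.

```shell
#!/bin/sh
# Sketch of a re-runnable guard for the rc.local hack.
# is_mounted and mount_plan are hypothetical helper names.

# is_mounted MOUNT_POINT [MOUNTS_FILE]
# Succeeds if MOUNT_POINT is listed as a mount point in MOUNTS_FILE,
# which defaults to /proc/mounts (field 2 of each line is the mount point).
is_mounted() {
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}" 2>/dev/null
}

# mount_plan DEVICE MOUNT_POINT [MOUNTS_FILE]
# Prints (does not run) the commands that would still be needed.
mount_plan() {
    if is_mounted "$2" "${3:-/proc/mounts}"; then
        echo "# $2 is already mounted; nothing to do"
    else
        echo "lvchange -ay $1"
        echo "mount $1 $2"
    fi
}

mount_plan /dev/elastic_drive/content /var/content
```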

NFS Config

Now that our giant storage drive is configured, the next step is to configure NFS to share it amongst all of our other machines. First, let’s load all of the NFS packages we’ll need:

> apt-get install portmap nfs-kernel-server

Next we need to add an entry in /etc/exports to expose the drive:

/var/content *.compute-1.internal(rw,no_subtree_check,sync)

A few things to note about the above line:

1) We’re technically exposing our drive with read/write access to anyone in our portion of the Amazon cloud. However, the security group that we’re in will prevent anyone from outside the group from accessing our machine on the NFS port. As long as that firewall holds, only machines in our security group can reach the share. I’ve elected to open myself up to anyone in my security group because I don’t want to have to come back and edit this file every time we spin up another machine. If you would like an extra layer of security, you can specify specific machine names here instead.

2) For additional security, you can also add entries to the hosts.allow and hosts.deny files to further prevent unauthorized access. Again, this is redundantly securing something already taken care of by the security group, so it’s not strictly necessary (but it’s not a horrible idea, either).
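
If you go the route of naming specific machines, the exports entry lists each client with its own set of options. Here is a small sketch that builds such a line. The helper name exports_line and the host names are invented for illustration; the options are the same ones used in the wildcard entry.

```shell
#!/bin/sh
# Sketch: build an /etc/exports line that names each client explicitly
# instead of using the *.compute-1.internal wildcard.
# Usage: exports_line EXPORTED_DIR CLIENT...
# (exports_line is a hypothetical helper, not an NFS tool.)
exports_line() (
    dir=$1
    shift
    opts="rw,no_subtree_check,sync"   # same options as the wildcard entry
    line=$dir
    for client in "$@"; do
        line="$line $client($opts)"
    done
    printf '%s\n' "$line"
)

# The host names below are made up for illustration.
exports_line /var/content web1.compute-1.internal web2.compute-1.internal
```

The output can be pasted into /etc/exports in place of the wildcard line, followed by the usual exportfs -a.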

Now we just have to refresh which shares have been exported, since apt-get was nice enough to have already started the NFS server:

> exportfs -a

Technically, we’re finished now, but let’s verify that our NFS share is actually working. Fire up another VM in the cloud to serve as our client machine, making sure that it’s in the same security group and availability zone as the server. Note that if you didn’t set up your exports file to allow anyone in your security group to connect, you’ll have to go specifically add this new machine to your exports file on the server. Once the machine is up and running, we’ll need to install some NFS packages to allow it to run as a client:

> apt-get update
> apt-get install nfs-common

Once the installation is complete, create a directory to serve as your mount point and mount the remote filesystem. I’m mounting mine at /var/content_server.

> mkdir /var/content_server
> mount nfs_server_name:/var/content /var/content_server

Finally, test to make sure that your files are showing up:

> ls /var/content_server

[should see files from your remote drive]

One final note on security. For the purposes of this document, I’ve assumed that you both trust and don’t mind sharing your files with all machines in your security group. The alternative steps that I briefly discussed (using hosts.allow and hosts.deny) should further lock down your server, but the one thing I didn’t discuss is sharing your files with a machine outside your security group. To do that, beyond the steps outlined here, you’ll need to add an entry to your security group to open up port 2049 (the default NFS port) to the IP address of your client machine (DNS names won’t work when configuring security groups).

Server Restore

Now we have our file server up and running, with a nice expandable drive for files that’s easily recoverable even if our host machine crashes or is terminated. That all sounds nice, but how do we know that any of that stuff actually works? We don’t… yet. So let’s find out.

Let’s assume that everything is set up as it was at the end of the configuration document: you have a “server” VM running that has two 1GB EBS volumes attached to it. Those volumes are combined into one logical volume that is then mounted on the “server” and shared via NFS. You also have a “client” VM running that has the logical volume mounted via NFS. So what happens if the server restarts?

Terminate your server VM. ElasticFox has a nice terminate button that makes this easy.

Now fire up a new VM, reattach your elastic drives, and SSH to it.

Since this is a brand new VM, all of the packages that we installed on the old one aren’t there. Let’s get those back:

> apt-get update
> apt-get install lvm2 dmsetup dmapi dmraid xfsprogs portmap nfs-kernel-server

Now let’s look and see if our logical volume is still set up across the two elastic drives:

> lvs

Sweet! It’s still there, so we don’t have to go through all of those configuration steps again. Unfortunately, though, it’s not active. Let’s fix that:

> lvchange -ay /dev/elastic_drive/content
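
If you want a restore script to make that decision for you, one option is to look at the attribute string that lvs reports, where the fifth character is the state flag and “a” means active. The sketch below only parses sample strings; the function name lv_attr_is_active is invented, and the real lvs call is shown in a comment because it needs root and the volumes attached.

```shell
#!/bin/sh
# Sketch: decide from an lv_attr string whether a logical volume is active.
# The fifth character of lv_attr is the state flag; "a" means active.
# Usage: lv_attr_is_active ATTR
# (lv_attr_is_active is a hypothetical helper name.)
lv_attr_is_active() {
    case "$1" in
        ????a*) return 0 ;;
        *)      return 1 ;;
    esac
}

# On the server, the attribute string could be read like this (needs root):
#   attr=$(lvs --noheadings -o lv_attr elastic_drive/content | tr -d ' ')
#   lv_attr_is_active "$attr" || lvchange -ay /dev/elastic_drive/content

# Sample strings only, so this runs anywhere.
for attr in "-wi-a-" "-wi---"; do
    if lv_attr_is_active "$attr"; then
        printf '%s\n' "$attr: active"
    else
        printf '%s\n' "$attr: inactive"
    fi
done
```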

Now let’s mount it again:

> mkdir /var/content
> mount /dev/elastic_drive/content /var/content
> ls /var/content

[should see the files you put on there before]

So our drive is back up and running. Now we just need to make it come back after reboots by redoing our changes to /etc/fstab:

/dev/elastic_drive/content /var/content xfs defaults 0 0

… or to /etc/rc.local if you had to use my little hack:

lvchange -ay /dev/elastic_drive/content
mount /dev/elastic_drive/content /var/content

And finally, let’s share it back over NFS with an edit to /etc/exports:

/var/content *.compute-1.internal(rw,no_subtree_check,sync)

and refresh our exported filesystems:

> exportfs -a

Now all that’s left is to update our client machine(s) with the new location of the content server.

> umount /var/content_server
> mount nfs_server_name:/var/content /var/content_server
> ls /var/content_server

[should see the files from your elastic drives]

There are a couple of remaining problems that I see with this setup, and the only way I know of to solve them is to write my own scripts. First, what happens on a client machine when it tries to write to the remotely mounted directory while the server is down? How do we make sure that no data is lost while we wait for the server to come back up? And second, how can we make the client machines aware when the server’s location changes? My thought is to write a script that monitors the server. When it detects that the server can’t be reached, it funnels writes into a temp directory until the server directory can be remounted.
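
To make that idea a little more concrete, here is a rough sketch of just the decision step: pick the shared directory when a probe says the server is usable, and a local spool directory otherwise. Everything here is an assumption for illustration, including the function name choose_write_dir, the /var/spool/content path, and the use of true and false as stand-in probes. It does not solve the harder problems of replaying buffered writes or discovering the new server.

```shell
#!/bin/sh
# Sketch of the monitoring idea: pick where writes should go based on a probe.
# PROBE is any command that succeeds when the NFS share is usable; it is a
# parameter so the logic can be tried without a real server.
# Usage: choose_write_dir MOUNT_DIR SPOOL_DIR PROBE [ARG...]
# (choose_write_dir and the spool path are hypothetical.)
choose_write_dir() (
    mount_dir=$1
    spool_dir=$2
    shift 2
    if "$@" >/dev/null 2>&1; then
        echo "$mount_dir"      # server reachable: write straight to the share
    else
        echo "$spool_dir"      # server down: buffer locally until remounted
    fi
)

# Stand-in probes: "true" simulates a healthy server, "false" a dead one.
choose_write_dir /var/content_server /var/spool/content true
choose_write_dir /var/content_server /var/spool/content false
```

A real probe would need a timeout of some kind, since commands that touch a dead NFS mount can hang rather than fail.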

Volume Growth

One of the most important features of our storage setup is that it’s easily expandable, so it’s probably a good idea to make sure we can actually expand it. There are several ways to do this, so let’s outline a few of them now.

The easiest way to increase our storage is to just add another EBS volume, so let’s try that now. Go to the Volumes and Snapshots tab in ElasticFox and create a new volume that’s equal to the size you want to add.

Attach the volume to your server. Since we’re currently using /dev/sdb and /dev/sdc, logic dictates that we should probably connect this one at /dev/sdd.

Now we go through steps that are similar to the initial setup, except that we’ll be growing existing entities rather than creating new ones.

First create a new physical volume on your new device:

> pvcreate /dev/sdd

Now add the physical volume to our existing volume group rather than creating a new group:

> vgextend elastic_drive /dev/sdd

Next, extend the logical volume to consume 100% of the free space available (unless you’re planning on saving some of the space for some other purpose):

> lvresize -l 100%VG /dev/elastic_drive/content

And lastly, expand the filesystem to fill the logical volume (note that the path is to the mounted volume, not to the device):

> xfs_growfs -d /var/content

Now let’s check our work:

> df -h

Our giant drive should show up as /dev/mapper/elastic_drive-content, with a total capacity equal to the sum of the capacity of the three individual drives.
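
Since this add-a-volume routine is likely to be repeated, the four commands can be gathered into one place. The sketch below prints them for a given volume group, logical volume, mount point, and new device, without running anything. The function name grow_plan is made up for this example.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for growing the storage by adding
# one more device. Nothing is executed.
# Usage: grow_plan VOLUME_GROUP LOGICAL_VOLUME MOUNT_POINT NEW_DEVICE
# (grow_plan is a hypothetical helper name.)
grow_plan() {
    echo "pvcreate $4"                    # make the new device an LVM physical volume
    echo "vgextend $1 $4"                 # add it to the existing volume group
    echo "lvresize -l 100%VG /dev/$1/$2"  # grow the logical volume over the whole group
    echo "xfs_growfs -d $3"               # grow the mounted XFS filesystem to match
}

grow_plan elastic_drive content /var/content /dev/sdd
```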

That was nice and easy, but it’ll only work for so long, as Amazon currently limits each customer to 20 volumes. Assuming that you haven’t made each of your drives the maximum size (currently 1 TB), you can utilize a lot more space before having to beg Amazon for special treatment.

The easiest way to expand the capacity of your existing drives is to take advantage of Amazon’s snapshot feature. Amazon allows you to take a snapshot of an EBS drive at any time, automatically storing it to S3. The transfer cost to S3 is free, but you will be charged for the S3 storage space at the standard rate. Once you’ve created a snapshot, you can restore it to your drive at any time. However, you can also restore that snapshot to a new, larger drive, which is what we’re going to do here. The downside to this method is that it requires taking down your filesystem. That’s simply not an option for some people, but if you don’t mind some downtime, this is definitely the easiest way to expand a single EBS volume.

First, unmount your filesystem.

> umount /var/content

Now set the logical volume to inactive. This is more just a safeguard to make sure that nothing can mount or modify anything on the elastic drive while we’re expanding it.

> lvchange -an /dev/elastic_drive/content

Now open up ElasticFox and go to the Volumes and Snapshots tab. Find the volume you wish to replace, make a snapshot of it, and then detach it. You can either delete it now or wait until later if you want to be extra-safe (your data is already backed up in the snapshot). Now create a new volume from the snapshot you just took, making sure to specify your new, larger size, and attach it back to your machine at the same point as the old one. (As stated earlier, it’s probably not 100% necessary to reattach at the same point, but I’m not willing to say that for sure.)

Back in your SSH window, you should now be able to look for and find all of your physical volumes.

> pvs

At this point, you can go ahead and set the logical volume back to active and remount the drive in order to minimize downtime:

> lvchange -ay /dev/elastic_drive/content
> mount /dev/elastic_drive/content /var/content

Unfortunately, the physical volume on the new drive is still the same size. That’s because we have to explicitly tell it to grow into the new space (I’m using /dev/sdd here; substitute the device of whichever volume you replaced):

> pvresize /dev/sdd

Since this new drive was created from a snapshot of the old one, it’s already a member of the volume group, so we don’t have to make any changes there. However, we do still have to expand the logical volume and the filesystem to take up the rest of the space:

> lvresize -l 100%VG /dev/elastic_drive/content
> xfs_growfs -d /var/content

That’s it! To check our work, we can run:

> df -h

So that was pretty easy, but what do we do if it’s entirely unacceptable to take the filesystem down? The answer to that question is only slightly more complex, but it’s a good bit more time-intensive. Even though we’ve used LVM to configure our multiple physical drives as one logical one, LVM provides facilities to guarantee that a particular physical volume is no longer in use (and therefore safe for removal). In order to clear off a volume, however, we have to first have enough unallocated space available in the volume group to be able to hold all of the data from the physical volume that we wish to remove. The easiest way to accomplish that is to go ahead and create a new EBS drive and add it to the volume group, so go to ElasticFox and create and attach a new drive that’s equal to the amount of space you want to add PLUS the size of the drive you’re going to remove. For example, if you want to add 100 GB worth of space, but you’re going to remove a 50 GB drive in the process, your new drive needs to be 150 GB.
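
The sizing rule is simple addition, but it is easy to get wrong when you’re in a hurry, so here it is as a tiny helper. The function name replacement_size_gb is invented for this example.

```shell
#!/bin/sh
# Sketch: size of the new EBS drive = space to add + size of the drive
# being removed (all in whole GB).
# Usage: replacement_size_gb ADD_GB OLD_DRIVE_GB
# (replacement_size_gb is a hypothetical helper name.)
replacement_size_gb() (
    for n in "$1" "$2"; do
        case "$n" in
            ''|*[!0-9]*)
                echo "replacement_size_gb: sizes must be whole numbers of GB" >&2
                exit 1 ;;
        esac
    done
    echo $(( $1 + $2 ))
)

replacement_size_gb 100 50   # prints 150
```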

Once the new drive is attached, we need to set it up for use by LVM. That means creating a physical volume on it and then adding it to the volume group.

> pvcreate /dev/sdd
> vgextend elastic_drive /dev/sdd

It’s important to note that we DO NOT want to extend our logical volume onto the new physical volume just yet. Right now, we need that unallocated space in order to clear off the drive we’re going to remove.

Now we’re ready to clear off the old drive using pvmove. If you’re curious to know more about how this works, the pvmove man page is really good.

> pvmove /dev/sdb

When I tried pvmove the first time, I got this error: “mirror: Required device-mapper target(s) not detected in your kernel.” This is because pvmove uses the device mapper mirroring module, which isn’t loaded by default. If you get the same error, try loading that module and trying again.

> modprobe dm-mirror
> pvmove /dev/sdb

Now that our physical volume is empty, we can remove it from the volume group:

> vgreduce elastic_drive /dev/sdb

Note that if you try to call vgreduce on a volume that isn’t empty, it will NOT get removed. Instead, you’ll get a warning telling you that the physical volume is still in use. This should give you some peace of mind, as you can rest assured that vgreduce won’t mess with the integrity of your data.

At this point, /dev/sdb is ready to be repurposed onto something else, or destroyed altogether. If you’re planning on adding it to another volume group, it’s ready to be added using the vgextend command. If you plan to use it as a “regular” drive, you’ll first need to remove the physical volume information from it using pvremove. However, if you’re planning on just destroying the volume, all you need to do is go to ElasticFox, detach the volume, and delete it. It should be noted that deleting the EBS volume does not destroy any snapshots that were made from it. This is a good thing, as those snapshots are still a vital part of any backups you’ve made. Should you need to restore your drive from an older point, you can always restore one of those snapshots to another drive.

So far, we’ve added our replacement EBS volume, copied some data over to it, and removed/destroyed the volume that we’re replacing. However, we haven’t actually upped the capacity of our logical volume and filesystem, which was the whole point of this exercise in the first place. Let’s do that now.

> lvresize -l 100%VG /dev/elastic_drive/content
> xfs_growfs -d /var/content

And finally, we can check our work for that last bit of reassurance:

> df -h
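
Because the order of operations matters in the no-downtime approach (the logical volume must not be extended until after pvmove has finished), it can help to see the whole sequence in one place. This sketch prints the commands in order without running them. The function name replace_plan is made up, and the modprobe line is included on the assumption that the mirror module isn’t loaded yet.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for swapping an old physical volume
# for a new, larger one while the filesystem stays mounted.
# Usage: replace_plan VOLUME_GROUP LOGICAL_VOLUME MOUNT_POINT OLD_DEVICE NEW_DEVICE
# (replace_plan is a hypothetical helper name.)
replace_plan() (
    vg=$1 lv=$2 mnt=$3 old=$4 new=$5
    echo "pvcreate $new"                     # prepare the new drive for LVM
    echo "vgextend $vg $new"                 # add it, leaving its space unallocated
    echo "modprobe dm-mirror"                # pvmove relies on the mirror target
    echo "pvmove $old"                       # move all data off the old drive
    echo "vgreduce $vg $old"                 # drop the now-empty drive from the group
    echo "lvresize -l 100%VG /dev/$vg/$lv"   # only now claim the extra space
    echo "xfs_growfs -d $mnt"                # grow the mounted filesystem to match
)

replace_plan elastic_drive content /var/content /dev/sdb /dev/sdd
```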

Backup Strategies

Thus far, we’ve put a lot of effort into creating a flexible, expandable, and accessible file storage device, but there are still three key attributes that we need to address before our drive is ready to use: performance, redundancy, and recovery. I’m moving forward with the assumption that performance of the drive itself is already about as good as it’s going to get. Behind the scenes, Amazon has to be using some pretty hefty hardware, and there’s probably some degree of striping going on as well. Anything we do on top of that is going to introduce a good bit of complexity, and probably won’t yield much, if any, performance gain. Certainly we could run some benchmarks to validate or refute those claims, but at this point in time, I don’t see the need.

Redundancy is also something that we’re leaving up to Amazon. As they state in their description of EBS, “Because Amazon EBS servers are replicated within a single Availability Zone, mirroring data across multiple Amazon EBS volumes in the same Availability Zone will not significantly improve volume durability.” As such, it’s hard to envision any scenario in which mirroring (or other forms of redundancy like RAID 5) would be worth the trouble. If you disagree with my assessment, there are several tutorials out there that describe combining LVM and various flavors of RAID.

The third item that merits discussion is recovery, and that’s definitely something that requires a plan. Amazon has a nice snapshotting feature in place that makes backing up single EBS volumes quick, easy, and inexpensive, but it fails to account for situations like ours, where multiple EBS volumes are directly tied together. Fortunately, we can still take advantage of the EBS snapshots if we do a little bit of legwork before and after.

The reason that we can’t rely on random snapshots taken from our EBS drives is that those snapshots are from different points in time, and the filesystem could have been in a different state at each of those points. Therefore, restoring your EBS drives from snapshots taken even a fraction of a second apart could potentially result in unstable behavior. However, there’s no way to guarantee that multiple snapshots are taken at the same time, so if we’re going to use the Amazon snapshot feature, we need to first figure out a way to guarantee that our filesystem remains stable and unchanged across those snapshots. Fortunately, XFS includes a way to make such a guarantee.

XFS includes a utility to “freeze” the filesystem at a given point in time. When a filesystem is frozen, all writes that were ongoing before the freeze happened are forced to finish, and all writes that are initiated after the freeze are blocked until the filesystem is “thawed.” Any thread attempting to write to the frozen filesystem will simply block until it’s allowed to complete. From a data integrity standpoint, this is just as safe as unmounting the filesystem, and from an application standpoint, it’s much, much better, as applications will now just wait for the filesystem to be available again rather than throwing errors because it appears to be missing.

So now, making our backups becomes a relatively simple process. First, freeze the filesystem:

> xfs_freeze -f /var/content

Next, go to ElasticFox and take a snapshot of each elastic drive.

Now go back to your terminal and thaw the filesystem:

> xfs_freeze -u /var/content
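
The one thing you really don’t want is to forget the thaw, or to have a script die between the freeze and the thaw. Here is a hedged sketch of a wrapper that always thaws, whatever happens to the command it runs in between. The function name with_frozen_fs and the FREEZE_CMD variable are my own inventions; FREEZE_CMD exists only so the flow can be rehearsed without root or an XFS filesystem, which is what the example at the bottom does.

```shell
#!/bin/sh
# Sketch: run a command while the filesystem is frozen, and always thaw.
# FREEZE_CMD defaults to xfs_freeze; it is overridable only so the flow
# can be rehearsed safely. Both names below are hypothetical.
# Usage: with_frozen_fs MOUNT_POINT COMMAND [ARG...]
with_frozen_fs() (
    mnt=$1
    shift
    freeze=${FREEZE_CMD:-xfs_freeze}
    # If the freeze itself fails, stop before running anything.
    $freeze -f "$mnt" || exit 1
    # Thaw no matter how the command below ends.
    trap '$freeze -u "$mnt"' EXIT
    "$@"
)

# Rehearsal with a stand-in that just prints what xfs_freeze would be asked to do.
rehearse_freeze() { printf 'xfs_freeze %s\n' "$*"; }
FREEZE_CMD=rehearse_freeze
with_frozen_fs /var/content echo "take the EBS snapshots here"
```

In real use the wrapped command would be whatever triggers your snapshots; the snapshot step itself is still done through ElasticFox in this walkthrough.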

A couple of things to point out here:

1) LVM also has a snapshot feature, but it doesn’t really buy us anything. It requires setting aside extra space in the volume group for a copy-on-write snapshot volume, which has to be big enough to hold every change made while the snapshot exists (up to the full size of the volume we want to back up). Beyond being a giant pain, we’d still also need a way to actually save the snapshots, and that would presumably involve S3. So the end result would be that we’re using up to twice as much disk space and a significantly more complex setup in order to back up our data to the same place.

2) There is a bit of complexity to restoring from this backup scenario, as you now have multiple snapshots (one for each volume) that represent a single backup. When you do a restore, you’ll have to make sure that you restore all of the snapshots back to the correct drives. This is pretty easy to do at first, as each snapshot includes the Volume ID representing the volume from which it was taken. However, should you ever replace one volume with a new one (as demonstrated in one of the growth strategies), the Volume ID on the snapshot will no longer correspond to the volume that it represents. Therefore, it’s imperative that, whenever you “replace” a volume, you keep a mapping of the old and new Volume IDs somewhere.
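
One low-tech way to keep that mapping is a plain text file with one “old new” pair per line. The sketch below records replacements and resolves a snapshot’s original Volume ID to whichever volume currently fills that slot. The file format, the function names, and the sample Volume IDs are all invented for illustration.

```shell
#!/bin/sh
# Sketch: keep a plain-text log of volume replacements ("old new" per line)
# and resolve a snapshot's recorded Volume ID to the current volume.
# The file format and function names are hypothetical.

# record_replacement MAP_FILE OLD_ID NEW_ID
record_replacement() {
    echo "$2 $3" >> "$1"
}

# current_volume MAP_FILE VOLUME_ID
# Follows the chain old -> new -> newer until no further replacement exists.
# Assumes IDs are never reused, so the chain cannot loop.
current_volume() (
    id=$2
    while next=$(awk -v id="$id" '$1 == id { print $2; exit }' "$1" 2>/dev/null) \
            && [ -n "$next" ]; do
        id=$next
    done
    echo "$id"
)

# Example with made-up Volume IDs.
map=$(mktemp)
record_replacement "$map" vol-aaaa1111 vol-bbbb2222
record_replacement "$map" vol-bbbb2222 vol-cccc3333
current_volume "$map" vol-aaaa1111    # prints vol-cccc3333
rm -f "$map"
```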