Today's guest post comes from David Mytton. David is the CEO of Server Density, a cloud-based monitoring service and Google Cloud Platform Services partner.

Managing backups is a critical task. Many people run backup scripts, but few ever test whether those backups actually work. The backup job might be running successfully every day, but how do you know there is no corruption, missing data, or some other problem that will only reveal itself when you actually need to do a restore?

Actually taking a successful backup of the data you need is only the first of two key steps. The second, just as important, is verifying that the backup can be successfully restored.

This post will look at how Server Density built its own automated MongoDB backup restore process using Google Cloud Storage and Google Compute Engine.

Step 1: Retrieving the backups

We make use of the MongoDB Management Service (MMS) backup product. This acts as an offsite replica that is only ever a few seconds behind real time, giving us point-in-time restores.

Using the MMS API, we can trigger a restore job, which gives us a URL to download the backup tar file. Our system has a simple Python script that connects to the API, triggers a restore from the latest snapshot, and retrieves the archive URL ready for download.
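
In outline, that script looks something like the sketch below. The endpoint paths, IDs, and response fields here are illustrative assumptions rather than the documented MMS API, so treat it as the shape of the solution rather than a drop-in implementation:

# Sketch of the restore-trigger script; endpoint paths and response
# fields are assumptions, not the documented MMS API.
import requests
from requests.auth import HTTPDigestAuth

BASE = "https://mms.mongodb.com/api/public/v1.0"        # assumed base URL
GROUP, CLUSTER = "<group-id>", "<cluster-id>"           # placeholders
AUTH = HTTPDigestAuth("user@example.com", "<api-key>")  # MMS uses digest auth

def latest_archive_url():
    # Find the most recent snapshot for the cluster (assumed endpoint)
    snapshots = requests.get(
        "%s/groups/%s/clusters/%s/snapshots" % (BASE, GROUP, CLUSTER),
        auth=AUTH).json()["results"]

    # Trigger a restore job from that snapshot (assumed endpoint and body)
    job = requests.post(
        "%s/groups/%s/clusters/%s/restoreJobs" % (BASE, GROUP, CLUSTER),
        json={"snapshotId": snapshots[0]["id"]}, auth=AUTH).json()

    # In practice we would poll until the job is ready; once it is, it
    # exposes the URL to download the backup tar file
    return job["results"][0]["delivery"]["url"]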

Step 2: Downloading the backups to Google Cloud Storage

Google Cloud Storage is the best place to stage files we want to make available to Google Compute Engine because it’s within Google’s network and makes setting up permissions for Compute Engine instances easy.

Therefore, when retrieving the backups from the MMS API above, we actually stream them directly into a Google Cloud Storage bucket. We use the same filename (one for each database), enable versioning, and set up the appropriate lifecycle management – letting Google Cloud Storage take over so that we don’t have to build our own retention scripts and naming conventions.
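
As a rough sketch of that staging step: gsutil can stream an upload from stdin ("gsutil cp -"), so the archive never needs to touch the build server's disk. The bucket and object names below are the illustrative ones used throughout this post:

# Stream the MMS archive straight into the versioned bucket; with
# versioning enabled, each run creates a new generation of the object.
import subprocess

def stage_snapshot(archive_url,
                   dest="gs://our-bucket-name/honshuuPerm.tar.gz"):
    download = subprocess.Popen(["curl", "-sfL", archive_url],
                                stdout=subprocess.PIPE)
    # "gsutil cp -" performs a streaming upload from stdin
    subprocess.check_call(["gsutil", "cp", "-", dest],
                          stdin=download.stdout)
    download.stdout.close()
    if download.wait() != 0:
        raise RuntimeError("archive download failed")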

And since we know that MMS is run from within the US (as are our own data centers), we use regional buckets to store the snapshots in Europe, for geographical redundancy.

This allows us to keep the latest backups ready to be restored directly from MMS, while retaining historical snapshots at a much lower cost in Google Cloud Storage, which essentially acts as a low-cost archiving solution.
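
The one-off bucket configuration can be scripted too. Here is a minimal sketch, assuming the Cloud SDK is installed; the 90-day rule on noncurrent versions is an example policy, not necessarily the one we use:

# One-time bucket setup: enable versioning and apply a lifecycle rule.
import json
import subprocess
import tempfile

BUCKET = "gs://our-bucket-name"

subprocess.check_call(["gsutil", "versioning", "set", "on", BUCKET])

# Example policy: delete noncurrent (archived) versions after 90 days
lifecycle = {"rule": [{"action": {"type": "Delete"},
                       "condition": {"age": 90, "isLive": False}}]}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(lifecycle, f)
subprocess.check_call(["gsutil", "lifecycle", "set", f.name, BUCKET])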

It also makes it trivial to list the available snapshots:

$ gsutil ls -la gs://our-bucket-name
 506749326  2014-12-25T16:04:06Z  gs://our-bucket-name/honshuuPerm.tar.gz#1419523446377000  metageneration=1
 506584457  2014-12-26T04:01:43Z  gs://our-bucket-name/honshuuPerm.tar.gz#1419566503155000  metageneration=1
 506394051  2014-12-26T16:03:36Z  gs://our-bucket-name/honshuuPerm.tar.gz#1419609816648000  metageneration=1

And it’s just as simple to retrieve a specific one:

$ gsutil cp gs://our-bucket-name/honshuuPerm.tar.gz#1419523446377000 .

Step 3: Launching a test Compute Engine instance

Now that the snapshots are stored on Google Cloud Storage, the final step is to set up a MongoDB instance with a restored copy so we can test it.

Using the gcloud compute command line interface, we create a new SSD persistent disk and then a new Compute Engine instance, attaching the SSD disk to it on boot. Google offers several disk types, and after some testing we found that the SSD persistent disk (PD) best fit our performance vs. cost needs.
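
From Python, the provisioning step looks roughly like the following; the zone, disk size, and resource names are illustrative, and startup.sh stands in for the startup script described below:

# Create the SSD persistent disk and boot a test instance with it
# attached; names, zone, and size are illustrative.
import subprocess

ZONE = "us-central1-a"  # assumed zone

subprocess.check_call([
    "gcloud", "compute", "disks", "create", "restore-test-disk",
    "--type", "pd-ssd", "--size", "200GB", "--zone", ZONE])

subprocess.check_call([
    "gcloud", "compute", "instances", "create", "restore-test",
    "--zone", ZONE,
    # Attach the SSD PD on boot; device-name controls the /dev/disk path
    "--disk", "name=restore-test-disk,device-name=restore-disk",
    # Hand the instance the startup script that performs the restore
    "--metadata-from-file", "startup-script=startup.sh"])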

The instance is given a startup script, which does all the key tasks on boot:

  1. Format and mount the SSD PD
  2. Use the preinstalled gsutil to copy over the latest snapshot from Google Cloud Storage
  3. Untar the snapshot archive
  4. Install MongoDB

The mongod server automatically starts after installation and picks up the snapshot files, which are just the actual MongoDB database files. This means restoring the backup is simply a case of extracting the archive into the MongoDB data directory and starting a mongod instance, as the sketch below shows.
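
A stripped-down version of such a startup script might look like this; the device path, bucket name, data directory, and install command are assumptions that depend on your instance image and MongoDB packaging:

#! /bin/bash
# Illustrative startup script; device path, bucket name, and dbpath
# are assumptions for this sketch.

# 1. Format and mount the SSD PD (device name set at instance creation)
mkfs.ext4 -F /dev/disk/by-id/google-restore-disk
mkdir -p /data && mount /dev/disk/by-id/google-restore-disk /data

# 2. Copy the latest snapshot down with the preinstalled gsutil
gsutil cp gs://our-bucket-name/honshuuPerm.tar.gz /data/

# 3. Untar the snapshot archive (the raw MongoDB database files)
tar -xzf /data/honshuuPerm.tar.gz -C /data

# 4. Install MongoDB and point its dbpath at the restored files; the
#    package's init script then starts mongod automatically
apt-get update && apt-get install -y mongodb
echo "dbpath=/data" >> /etc/mongodb.conf && service mongodb restart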

Step 4: Running test queries

By this stage, we know that the backup can be restored and the database started, but we need to verify that the database contains what we were expecting. Testing this is important because the backup could be blank or missing values. We have written a test script which simply connects to the local MongoDB instance and issues queries against the databases and collections.

For the moment, this test script is hard coded. Periodically, we issue a set of queries on the live environment to get a sample of what data we expect to be there, then hard code those as queries in the test file. We test against the expected query results (e.g. collection counts) and query specific document _id values we know will be present.
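
Here is a skeletal version of such a test script using pymongo; the database and collection names, the expected count, and the _id value are placeholders standing in for the samples we take from production:

# Skeletal restore-verification script; names and expected values are
# placeholders, refreshed periodically from the live environment.
from bson import ObjectId
from pymongo import MongoClient

db = MongoClient("localhost", 27017)["ourdb"]  # assumed database name

# Check collection counts against what production told us to expect
assert db["devices"].count_documents({}) >= 12000, "devices collection too small"

# Spot-check specific documents we know must be present
known_id = ObjectId("54a4e1f2e4b0a1b2c3d4e5f6")  # placeholder _id
assert db["devices"].find_one({"_id": known_id}) is not None, "known document missing"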

In the future, we want to make this process more intelligent so that we can run live comparison queries against the production databases and then run the same queries against the restored test instance. This would automatically keep the test queries up to date as we change the schema.

Once all of this has completed, we use gcloud compute to delete the instance and the associated SSD, wiping the data. The whole process takes around 10 minutes to complete; most of that time is spent transferring the snapshot archive from MMS to Google Cloud Storage via our build server and extracting the compressed snapshot archive (this is where the SSDs come in handy).
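
The teardown itself is a single call; passing --delete-disks all removes the attached SSD PD along with the instance (resource names and zone match the earlier provisioning sketch):

# Delete the test instance and its disks, wiping the restored data
import subprocess

subprocess.check_call([
    "gcloud", "compute", "instances", "delete", "restore-test",
    "--zone", "us-central1-a", "--delete-disks", "all", "--quiet"])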

Connecting it all up

We need a way to run all these commands and keep track of the status of each one, because we want to be alerted if any of the steps fail, particularly the test queries. Server Density already has a range of tests that run using Buildbot as part of our build process, so the above steps were written in Python and hooked into a Buildbot process, which sends us email alerts and ties into HipChat so everyone can see the status.

Any build system could be used to achieve the same results. Something like Jenkins or even one of the SaaS build tools would work well, so long as you can execute a series of scripts and run the gcloud commands.

We run our backup tester twice per day – once at the end of the day (to catch any changes that may have gone into production that day) and once first thing in the morning (so we can be sure we have a clean state ready to start the day).
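
In Buildbot terms, that schedule is a single Nightly scheduler; the builder name and the hours below are illustrative:

# Illustrative Buildbot scheduler running the restore test twice a day;
# the builder name and hours are assumptions.
from buildbot.schedulers.timed import Nightly

backup_tester = Nightly(
    name="backup-restore-test",
    builderNames=["backup-restore-test"],
    hour=[7, 19],   # first thing in the morning and end of day
    minute=0)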

This system allows us to rest easy, knowing that not only do we have offsite backups stored securely with both MongoDB MMS and Google Cloud Storage, but that if we ever needed to restore them, we would have no problems doing so!

-Posted by David Mytton, CEO of Server Density