May 2017 Major Outage

The blog suffered a major outage today - it was offline for around six hours. It took me around 90 minutes to get it working once I had time to do so.

Why

A few days ago I created a new droplet for a homebrew scraping project I’ve been working on lately.

Today I decided its time to throw it, and pressed the big, red, destroy button. Then I noticed I deleted the wrong droplet and accidentally deleted my blog!

Disaster Recovery Plan

I had daily backups setup already (remember Poor mans daily blog backups?) which were almost up to date.

All I needed to do is to add a few updates to the latest post. Fortunately, every time I post to LinkedIn, they cache my posts at “oded-ninja.cdn.ampproject.org/c/s/oded.ninja/…”

Up until now I kind of hated that cache. Every time I updated a blog post, it took forever to refresh, which was a real pain. Only this time I was actually grateful that cache existed.

Anyway, have you ever heard of the AMP Project?

The AMP Project is an open-source initiative aiming to make the web better for all. The project enables the creation of websites and ads that are consistently fast, beautiful and high-performing across devices and distribution platforms.

Ghost has pre-baked AMP support, Which I’ve set up to automatically redirect mobile clients.

Snapshots

DigitalOcean provide a droplet snapshot service for only 20% of the cost of the virtual server.

The problem is that snapshots wouldn’t help me at this point, because they get deleted with the droplet. Good thing I created that backup service.

Restore

I forked Ghost a few months ago, and instead of testing my changes on production, I put all my configuration in a private git repository.

That proved really useful for local testing, but especially for restoration. I had a habit of checking changes on up-to-date backups, which meant I restored the blog locally every few days.

Plan Execution

  1. Create a new Droplet
  2. Create new Read-only Deploy Keys
  3. Clone the repository, and build it.
  4. Restore the backup from Google Drive
  5. Pause Cloudflare
  6. Check the website is working
  7. Setup Lets Encrypt
  8. Resume Cloudflare
  9. Re-create my Keybase website proof.

Issues I Encountered

  1. I forgot to update the new droplet’s IP at namecheap, my DNS provider.
  2. When checking the website, before turning on SSL, I kept getting redirected to the https endpoint. That’s because it was saved in the browsers HSTS set. Fix? clear my browser’s HSTS settings
  3. Chrome has an internal DNS cache which needed to be refreshed. Fix? flush DNS records and sockets at chrome://net-internals/#dns and chrome://net-internals/#sockets respectively.
  4. I didn’t read the Lets Encrypt setup instructions thoroughly and got blocked for an hour after several failed attempts to set it up. Conclusion? RTFM!
  5. I forgot to update my website proof on Keybase.

Lessons Learned

First of all, DON’T PRESS THE BIG, RED, DESTROY BUTTON BEFORE MAKING SURE YOU ARE REMOVING THE RIGHT DROPLET!

Second, I need to make a few adjustments -

  • Enhance my backup script to include nginx’s configuration as well.
  • Setup a mechanism to perform backup after every update, or at least reduce the interval between backups.
  • Automate recovery steps, or at least document them. When I’m under pressure (or drunk), I forget steps :|