The blog suffered a major outage today: it was offline for around six hours. Once I had time to deal with it, it took me around 90 minutes to get it working again.
Why
A few days ago I created a new droplet for a homebrew scraping project I’ve been working on lately.
Today I decided it’s time to throw it away, and pressed the big, red, destroy button. Then I noticed I had picked the wrong droplet and accidentally deleted my blog!
Disaster Recovery Plan
I already had daily backups set up (remember Poor mans daily blog backups?), which were almost up to date.
All I needed to do was add a few updates to the latest post. Fortunately, every time I post to LinkedIn, they cache my posts at “oded-ninja.cdn.ampproject.org/c/s/oded.ninja/…”
Up until now I kind of hated that cache. Every time I updated a blog post, it took forever to refresh, which was a real pain. This time, though, I was actually grateful that cache existed.
Anyway, have you ever heard of the AMP Project?
The AMP Project is an open-source initiative aiming to make the web better for all. The project enables the creation of websites and ads that are consistently fast, beautiful and high-performing across devices and distribution platforms.
Ghost has pre-baked AMP support, which I’ve set up to automatically redirect mobile clients.
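For the curious, here is a rough sketch of how a page URL maps to an AMP Cache address like the one above. It is a simplified version of the subdomain rule (escape dashes, turn dots into dashes) and skips the parts of the real algorithm that deal with internationalized or very long domain names. The function name `amp_cache_url` and the example path are made up for illustration.

```python
from urllib.parse import urlsplit


def amp_cache_url(page_url: str) -> str:
    """Build an AMP Cache URL for a page (simplified sketch, not the full spec)."""
    parts = urlsplit(page_url)
    host = parts.hostname or ""
    # Simplified subdomain rule: escape existing dashes, then turn dots into dashes.
    # IDN (punycode) hosts and over-long hosts are NOT handled here.
    subdomain = host.replace("-", "--").replace(".", "-")
    # "/c/" marks an AMP document; an extra "s/" is added for HTTPS origins.
    prefix = "/c/s/" if parts.scheme == "https" else "/c/"
    path = parts.path or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"https://{subdomain}.cdn.ampproject.org{prefix}{host}{path}{query}"


# Hypothetical post path, used only to show the shape of the result.
print(amp_cache_url("https://oded.ninja/some-post/amp/"))
```

With a cached copy at a predictable address, recovering the text of a recent post is mostly a matter of knowing the original URL.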
Snapshots
DigitalOcean provides a droplet snapshot service for only 20% of the cost of the virtual server.
The problem is that snapshots wouldn’t help me at this point, because they get deleted with the droplet. Good thing I created that backup service.
Restore
I forked Ghost a few months ago, and instead of testing my changes on production, I put all my configuration in a private git repository.
That proved really useful for local testing, but especially for restoration. I had a habit of checking changes on up-to-date backups, which meant I restored the blog locally every few days.
Plan Execution
- Create a new Droplet
- Create new Read-only Deploy Keys
- Clone the repository and build it
- Restore the backup from Google Drive
- Pause Cloudflare
- Check the website is working
- Set up Let’s Encrypt
- Resume Cloudflare
- Re-create my Keybase website proof
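The steps above could be turned into a small checklist runner, so that none of them gets skipped under pressure. This is only a sketch: `run_checklist`, `ask` and `RECOVERY_STEPS` are names invented here, and every step is treated as a manual action that I confirm by hand rather than something the script performs.

```python
from typing import Callable, List, Tuple

# (step name, action returning True on success)
Step = Tuple[str, Callable[[], bool]]

# Step names mirror the plan above.
RECOVERY_STEPS = [
    "Create a new Droplet",
    "Create new read-only deploy keys",
    "Clone the repository and build it",
    "Restore the backup from Google Drive",
    "Pause Cloudflare",
    "Check the website is working",
    "Set up Let's Encrypt",
    "Resume Cloudflare",
    "Re-create the Keybase website proof",
]


def run_checklist(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    done: List[str] = []
    for name, action in steps:
        print(f"[ ] {name}")
        if not action():
            print(f"[!] stopped at: {name}")
            break
        done.append(name)
        print(f"[x] {name}")
    return done


def ask() -> bool:
    """Manual step: wait for the operator to confirm it was done."""
    return input("    done? [y/N] ").strip().lower() == "y"
```

Calling `run_checklist([(name, ask) for name in RECOVERY_STEPS])` walks through the list interactively. Individual steps can later be swapped for real automation without changing the runner.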
Issues I Encountered
- I forgot to update the new droplet’s IP at Namecheap, my DNS provider.
- When checking the website, before turning on SSL, I kept getting redirected to the https endpoint. That’s because the domain was saved in the browser’s HSTS set. Fix? Clear my browser’s HSTS settings.
- Chrome has an internal DNS cache which needed to be refreshed. Fix? Flush DNS records and sockets at chrome://net-internals/#dns and chrome://net-internals/#sockets respectively.
- I didn’t read the Let’s Encrypt setup instructions thoroughly and got blocked for an hour after several failed attempts to set it up. Conclusion? RTFM!
- I forgot to update my website proof on Keybase.
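The forgotten DNS update is the kind of thing a tiny pre-flight check could catch. Below is a minimal sketch that asks the operating system's resolver (not Chrome's internal cache) whether a hostname points at an expected IPv4 address. The function names and the example IP are placeholders, and the check is only meaningful while Cloudflare is paused, assuming that DNS then answers with the droplet's own address.

```python
import socket
from typing import Callable, Set


def resolve_ipv4(hostname: str) -> Set[str]:
    """Return the IPv4 addresses the system resolver reports for hostname."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    return {info[4][0] for info in infos}


def dns_points_to(
    hostname: str,
    expected_ip: str,
    resolver: Callable[[str], Set[str]] = resolve_ipv4,
) -> bool:
    """True if expected_ip is among the addresses hostname resolves to."""
    return expected_ip in resolver(hostname)


# Example call (203.0.113.10 is a documentation-only placeholder address):
# dns_points_to("oded.ninja", "203.0.113.10")
```

The `resolver` parameter exists so the check can be exercised without touching the network, for example by passing a lambda that returns a fixed set of addresses.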
Lessons Learned
First of all, DON’T PRESS THE BIG, RED, DESTROY BUTTON BEFORE MAKING SURE YOU ARE REMOVING THE RIGHT DROPLET!
Second, I need to make a few adjustments:
- Enhance my backup script to include nginx’s configuration as well.
- Set up a mechanism to perform a backup after every update, or at least reduce the interval between backups.
- Automate recovery steps, or at least document them. When I’m under pressure (or drunk), I forget steps :|
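As a starting point for the first adjustment, here is a rough sketch of a backup step that packs several directories, nginx’s configuration included, into one timestamped archive. It is not my existing backup script: `create_backup`, the archive name and the paths in the comment are assumptions for illustration, and uploading the archive to Google Drive is left out.

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterable


def create_backup(sources: Iterable[Path], dest_dir: Path) -> Path:
    """Pack every existing source path into one timestamped .tar.gz archive."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive = dest_dir / f"blog-backup-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for src in sources:
            if src.exists():
                # Store each source under its own folder name, not its absolute path.
                # Note: two sources with the same final name would collide.
                tar.add(str(src), arcname=src.name)
            else:
                print(f"skipping missing path: {src}")
    return archive


# Hypothetical paths; adjust to wherever Ghost content and nginx config actually live:
# create_backup([Path("/var/www/ghost/content"), Path("/etc/nginx")], Path("/tmp/backups"))
```

Running this from a post-update hook instead of a daily timer would also cover the second adjustment, at the cost of more frequent archives.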