An hour or so ago, I tried to check a Wikipedia entry and my browser told me it couldn’t find en.wikipedia.org. Surely that’s wrong, I thought, but pushed “Check Wikipedia” onto the stack and went on to something else. Then, coincidentally, while searching for DNS-related news articles to inspire my next blog entry, I ran across this one from PC Magazine. Turns out Wikipedia’s European data center had an overheating problem that caused many of their servers to shut themselves down. To shunt European traffic to their servers in Florida, they enacted their failover procedure, which modifies their DNS records.
Unfortunately, that failover mechanism was broken (they didn’t specify how), and broken so badly that it interrupted DNS resolution for all Wikimedia sites globally. While they quickly recognized and fixed the problem, it took as long as an hour for the corrected data to reach everyone, because recursive name servers that had cached the bad records wouldn’t refresh them until their TTLs expired.
There’s certainly nothing to crow or chuckle about here. I use Wikipedia all the time; why, it’s how I learned that telepathy is real! Ha! I’m just kidding. (But you telepaths out there knew that I was, right?)
No, Wikipedia’s sad fate should remind us to check and double-check our own DNS-related disaster recovery plans. And by “DNS-related disaster recovery plans,” I mean two things: how we’ll use DNS during a disaster to re-route traffic to our backup servers and data centers, and how we’ll recover the ability to manage DNS and keep it running through that same disaster.
Many of us use scripts that send dynamic updates, make API calls or rewrite zone data files to switch from primary to backup addresses in the event of a catastrophe. First, make sure your script incorporates sanity checks whenever possible. In particular, if you’re rewriting zone data files using a script, you should use a program like BIND’s named-checkzone to validate the result before trying to load it. Then make sure you test your script periodically to ensure it functions as expected. If you have addresses that you’ll need to change quickly in a disaster, consider reducing the TTLs on those records ahead of time. If you use small but sane TTLs, the added load on your authoritative name servers won’t be much, but you’ll gain the ability to fail over more quickly. If you can’t afford to reduce the TTLs ahead of time, you could still use low TTLs in the “backup” records your script adds, which would at least enable you to fail back quickly – which might have helped in Wikipedia’s case.
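To make that concrete, here’s a minimal sketch of what such a script might look like if you’re rewriting zone data files on a BIND primary. The zone name, file paths, and the pre-staged failover file are all hypothetical; the point is that named-checkzone runs before the live file is touched, and the reload only happens if validation succeeds.

```python
#!/usr/bin/env python3
"""A minimal sketch of a zone-rewriting failover script with sanity checks.

Assumes BIND 9 tools (named-checkzone, rndc) are installed on the primary.
The zone name, file paths, and the idea of a pre-staged "failover" version
of the zone data file (with low TTLs on the backup records and a higher
SOA serial number) are hypothetical.
"""
import shutil
import subprocess
import sys

ZONE = "example.com"                                  # hypothetical zone
LIVE_FILE = "/var/named/db.example.com"               # file named.conf loads
FAILOVER_FILE = "/var/named/db.example.com.failover"  # pre-staged backup records


def main() -> int:
    # Sanity check: validate the failover zone data *before* touching
    # anything the running name server depends on.
    check = subprocess.run(
        ["named-checkzone", ZONE, FAILOVER_FILE],
        capture_output=True,
        text=True,
    )
    if check.returncode != 0:
        sys.stderr.write("Failover zone failed validation; aborting:\n")
        sys.stderr.write(check.stdout + check.stderr)
        return 1

    # Keep a copy of the current zone so we can fail back later, then
    # swap in the failover version. (Its SOA serial must be higher than
    # the live zone's, or the secondaries won't transfer the change.)
    shutil.copy2(LIVE_FILE, LIVE_FILE + ".prefailover")
    shutil.copy2(FAILOVER_FILE, LIVE_FILE)

    # Ask named to reload just this zone.
    return subprocess.run(["rndc", "reload", ZONE]).returncode


if __name__ == "__main__":
    sys.exit(main())
```

Testing it periodically is as simple as pointing it at a scratch copy of the zone on a lab name server and checking that the right addresses come back afterward.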
Forgetting to make disaster recovery plans for your DNS infrastructure is all too common – in fact, it’s what got me into this business. (HP lost the primary hp.com name server in the Loma Prieta earthquake, and my colleague David and I spent the night resurrecting it.) You need to make sure that DNS keeps running after (or during) the disaster, which means running a distributed set of name servers in diverse physical locations. If you’re lucky enough to have multiple connections to the Internet, make sure you have a forwarder and an external authoritative name server near each. And don’t forget your seat of administration: The ability to change your data to fail over to backup servers will be critical, but you need a primary name server to do that. If you use BIND name servers, you might just copy zone data files, as well as scripts used to manage or generate them, to a backup primary periodically. You could also configure your secondaries to use your production primary as their first master, and then to try your backup. If you use a commercial product to manage DNS, make sure it supports disaster recovery and that you’ve got the DR features configured. (In my opinion, Infoblox really shines in this area.)
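For the secondary configuration I just mentioned, the named.conf change is small. A sketch, assuming BIND 9 and hypothetical addresses, with the production primary listed first and the backup primary second:

```
zone "example.com" {
    type slave;                         // "type secondary" in newer BIND 9
    masters { 192.0.2.1; 192.0.2.2; };  // production primary first, then backup
    file "db.example.com.bak";
};
```

The secondary tries those addresses in order when it refreshes the zone, so if the production primary is unreachable, it picks up changes from the backup primary without any reconfiguration on your part.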
If you do all this, you’ll be much more likely to emerge from the next disaster unscathed. And you can tell people you learned it all from Wikipedia.