Noticed some issues loading web applications in recent weeks? The recent major cloud outage was a reminder of something we should all know by now but seem to keep forgetting until the next crisis: if DNS stumbles, everything that depends on it stumbles too.
This time it was a problem at one of the major public cloud providers. But every digital service provider (really, every business running a network) has dealt with DNS issues at some point. In this incident, the domain name for a database service stopped resolving, which prevented many of the provider's customers, and many of its own internal services, from reaching that database, causing widespread outages.
We’ve seen this movie before. In 2016, a withering distributed denial-of-service (DDoS) attack on Dyn, one of the largest DNS hosting providers, knocked many well-known web properties offline for hours. Last year’s big CrowdStrike outage was actually caused by a software bug that had nothing to do with DNS; it crashed the Windows servers and endpoints running the CrowdStrike agent. It still caused global network failures, because many of those servers were also running DNS.
Why does this problem keep biting us? Because virtually every digital connection starts by asking a DNS server how to reach some destination. As the “digital phonebook” for IP networks, DNS handles the mundane but extremely important task of translating human-readable domain names into IP addresses. When DNS fails, applications can’t find other endpoints’ addresses. No address, no connection. No connection, no business.
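The "no address, no connection" dependency is easy to see in code. Here's a minimal Python sketch, where a hypothetical in-memory hosts table stands in for a real recursive resolver (the names and addresses are made up for illustration):

```python
# Hypothetical in-memory "phonebook" standing in for a recursive resolver.
HOSTS = {"app.example.com": "203.0.113.10"}

def resolve(name: str) -> str:
    """Translate a human-readable name into an IP address."""
    try:
        return HOSTS[name]
    except KeyError:
        # When DNS fails, there is no address to connect to --
        # the connection attempt can't even begin.
        raise LookupError(f"cannot resolve {name}") from None

ip = resolve("app.example.com")  # with an address, the connection can proceed
```

Every TCP or TLS handshake an application makes is gated on a lookup like this one succeeding first.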
Why Does DNS Fail?
As a protocol, DNS is designed to be fairly resilient (if one authoritative DNS server fails, for example, recursive DNS servers on the internet will automatically find another to resolve queries). The problem is that DNS is so fundamental to digital communication that when there’s an issue, it tends to have a large blast radius, rippling out to many other connections and dependencies. And as one of the foundational elements of a business’s online presence, DNS is a prime target for cyberattacks.
Common causes of DNS failures include:
- Attacks on DNS Infrastructure: Threat actors relentlessly probe public-facing authoritative servers. This subjects the servers to all manner of threats, from volumetric DDoS attacks, which overwhelm DNS servers with massive traffic, to exploits like DNS hijacking, which alter authoritative DNS records to direct queries to malicious sites.
- Configuration Errors: Unfortunately, many DNS issues are self-inflicted, because DNS servers can be unforgiving of mistakes. Seemingly minor slips, such as transposing digits when entering zone header information, can cause major outages. Introduce an error into an authoritative DNS server's configuration and it may start sending non-authoritative responses, which recursive DNS servers on the internet won't accept.
- Automation Gone Sideways: At cloud scale, DNS changes are scripted. If bad information somehow finds its way into a zone, that bad data spreads widely and instantly.
- Single Points of Failure: It’s never a good idea to put all your eggs in one basket—even if it’s the best basket money can buy. If one provider handles all your authoritative DNS, and that provider has an outage, you’re out of commission until they resolve it.
- Sharing Space with Other Services: Many organizations still run DNS from the same servers that handle identity or other services—effectively handcuffing DNS to another system’s health. As we saw during the CrowdStrike outage, problems don’t have to originate with DNS itself. If the box DNS lives in breaks, you get the same results.
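To make the configuration-error point concrete, here's a hypothetical BIND-style zone fragment (all names and addresses are illustrative). A single transposed digit in the A record sends every client to the wrong address, and forgetting to bump the SOA serial means secondary servers never pick up the fix:

```zone
$TTL 3600
example.com.  IN SOA ns1.example.com. admin.example.com. (
                  2025040101 ; serial - must increase on every change,
                             ;   or secondaries keep serving stale data
                  3600       ; refresh
                  900        ; retry
                  1209600    ; expire
                  300 )      ; negative-caching TTL
example.com.  IN NS  ns1.example.com.
example.com.  IN NS  ns2.example.net.   ; second server on a separate network
app           IN A   203.0.113.10       ; transpose two digits here and every
                                        ; client connects to the wrong host
```

With automation pushing changes like these at cloud scale, a bad value in one template can land in thousands of records at once, which is the "automation gone sideways" failure mode above.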
A Practical Playbook for Resilient DNS
The good news is that there are many things organizations can do to avoid these kinds of problems. The simplest steps take advantage of DNS’s built-in resiliency: Give recursive servers multiple authoritative DNS servers to choose from. Give stub resolvers, or DNS clients, multiple recursive servers to choose from. And use DNS anycast to automatically route queries to the nearest healthy resolver in the event of a failure.
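On the client side, the stub-resolver half of that advice often amounts to a few lines of configuration. A hypothetical `/etc/resolv.conf` (illustrative addresses) giving clients two recursive servers to fail over between might look like:

```
# /etc/resolv.conf - example addresses only
nameserver 192.0.2.53     # primary recursive resolver
nameserver 198.51.100.53  # second resolver, different subnet/site
options timeout:2 attempts:2
```

If the first resolver stops answering, the stub resolver retries against the second after the timeout, so a single recursive-server failure doesn't strand the client.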
Here are some other best practices I’ve recommended over the years:
- Diversify Authoritative DNS: The best way to eliminate single points of failure is to run a heterogeneous set of authoritative DNS servers—some hosted by a reputable cloud or SaaS provider, some self-hosted on-premises. That way, if your cloud or SaaS DNS service ever goes down, the self-hosted authoritative DNS servers are still available—and your sites and services stay online. Back when the Dyn outage happened, I spoke with a team using this exact architecture. (They supported online trading and monitored their site reachability minute by minute.) While other companies went dark for up to eight hours, they saw barely a hiccup.
- Separate Roles, Decouple Shared Fate: Don’t co-host authoritative DNS with identity or other services on the same infrastructure. DNS is mission-critical for modern businesses and should be treated that way, with strict separation of duties. New draft recommendations from the National Institute of Standards and Technology (NIST) make this explicit: “Even if a DNS is run on a secure operating system, vulnerabilities in other software programs on that OS can be used to compromise the security and availability of DNS software. Hence, the infrastructure that hosts DNS services should be dedicated to that task and hardened for this purpose.”1 Along those lines, don’t use the same external DNS servers for both authoritative and recursive roles. Always separate them.
- Protect External Authoritative DNS against DDoS and Abuse: A surprising number of companies leave external authoritative DNS servers unprotected. Public-facing name servers should be able to absorb DDoS floods and block protocol exploits while continuing to serve legitimate queries.
- Diversify Internal Name Servers: To ensure resiliency of internal DNS, host critical zones on multiple DNS authoritative servers on different subnets—ideally at different sites. And deploy them as close as possible to the clients that use them, so queries don’t have to traverse the whole network, adding latency.
- Continuously Test and Verify: Any time you make any non-trivial change to zone data or DNS server configuration, confirm your servers are responding—and responding authoritatively. Run periodic probes from the internet or remote parts of your network to verify that key domain names are resolving and services are reachable.
- Centralize DNS Management: If you run a hybrid and/or multi-cloud environment with teams constantly swiveling between dedicated DNS systems and tools, you’re always one fat finger away from an outage. Wherever possible, consolidate DNS services within a unified, cloud-agnostic point of management, control and visibility. When you can monitor all DNS from one place and use the same consistent workflows and automation across all environments, there’s a much lower chance of something going wrong.
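The "continuously test and verify" step can be sketched in a few lines of Python. This is a hedged sketch, not a production monitor: the query function is injected so the classification logic stands alone, and in practice you'd plug in a real DNS client that reports whether each answer carried the authoritative (AA) flag.

```python
from typing import Callable, Dict, List, Tuple

# A query function returns (answered, authoritative) for one server.
QueryFn = Callable[[str, str], Tuple[bool, bool]]

def probe_zone(name: str, servers: List[str], query: QueryFn) -> Dict[str, str]:
    """Classify each authoritative server's response for a critical name."""
    results = {}
    for server in servers:
        answered, authoritative = query(server, name)
        if not answered:
            results[server] = "DOWN"   # no response at all
        elif not authoritative:
            results[server] = "LAME"   # answering, but not authoritatively
        else:
            results[server] = "OK"
    return results

# Usage with a fake query function (a stand-in for a real DNS client):
fake = {"ns1.example.com": (True, True),
        "ns2.example.net": (True, False),   # misconfigured: non-authoritative
        "ns3.example.org": (False, False)}  # unreachable
status = probe_zone("www.example.com", list(fake), lambda s, n: fake[s])
```

Run a check like this from several vantage points on a schedule and alert on anything other than "OK", so a lame or dead server is caught before recursive resolvers start failing over around it.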
Here’s a video, “It’s Always DNS,” where I discuss some of the best practices above.
The most important recommendation: Take these steps when your network is healthy, instead of waiting until there’s a problem. The internet’s been resilient for decades because DNS is resilient—but only if you design it that way. Invest in heterogeneity, separation and verification while skies are clear, and the next time there’s a DNS hiccup, you’ll keep right on resolving.
Footnotes
1. S. Rose, C. Liu, R. Gibson, “Secure Domain Name System (DNS) Deployment Guide,” draft NIST SP 800-81r3, NIST, April 2025.



