Crux Read real dig output, a zone file, and a resolver config, predict the DNS behaviour, and pick the diagnosis or fix a senior engineer would reach for first.
DNS incidents are diagnosed in dig output and zone files, not in slides. Read each artefact, predict what it tells you, and choose the call a senior engineer would make first.
Goal
Practise the loop you run in every DNS incident: read the trace or zone file, locate the broken link, and reach for the diagnosis or the highest-leverage fix before guessing.
Snippet 1 — a +trace referral
$ dig +trace shop.example.com A
;; received referral from the .com TLD:
example.com. 172800 IN NS ns1.example.net.
example.com. 172800 IN NS ns2.example.net.
;; (Additional section empty)
;; query to ns1.example.net times out
;; query to ns2.example.net times out
;; SERVFAIL
Quiz
The .com referral lists the nameservers but the lookup SERVFAILs. The NS names are ns1/ns2.example.net. What is the most useful first read?
Heads-up Glue is only required for in-bailiwick nameservers (ns1.example.com). These are in example.net, an out-of-bailiwick zone the resolver can resolve independently, so empty Additional is expected, not a bug.
Heads-up 172800 s (2 days) is a normal NS TTL and has nothing to do with a timeout. The lookup fails because both example.net nameservers are unreachable, not because of TTL.
Heads-up A TLD never holds leaf A records; it returns the delegation (NS) and refers onward. The failure is downstream at the example.net nameservers.
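The in-bailiwick test behind the glue reasoning above can be sketched in a few lines of Python. This is a minimal illustration, not a resolver implementation: the function names `normalize`, `is_in_bailiwick`, and `glue_required` are hypothetical, and the check compares whole labels so that `notexample.com` is not mistaken for a child of `example.com`.

```python
def normalize(name: str) -> str:
    """Lower-case a DNS name and drop the trailing root dot."""
    return name.rstrip(".").lower()


def is_in_bailiwick(ns_name: str, zone: str) -> bool:
    """Return True when ns_name is the zone apex or sits below it.

    Comparison is label by label, so a plain string suffix match
    (which would accept 'notexample.com') is avoided.
    """
    ns_labels = normalize(ns_name).split(".")
    zone_labels = normalize(zone).split(".")
    if len(ns_labels) < len(zone_labels):
        return False
    return ns_labels[-len(zone_labels):] == zone_labels


def glue_required(ns_names, zone):
    """Map each nameserver name to whether the parent must supply glue."""
    return {ns: is_in_bailiwick(ns, zone) for ns in ns_names}


if __name__ == "__main__":
    # The delegation from Snippet 1: both nameservers live in example.net.
    print(glue_required(["ns1.example.net.", "ns2.example.net."], "example.com."))
```

For the Snippet 1 delegation both values come out False, which matches the reading that an empty Additional section is expected there and the fault lies with the unreachable example.net servers.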
Snippet 2 — a zone file edit
$ORIGIN example.com.
@ IN SOA ns1.example.com. admin.example.com. (
        2026051300 ; serial (was 2026051905)
        7200       ; refresh
        3600       ; retry
        1209600    ; expire
        300 )      ; minimum (negative-cache TTL)
@ IN NS ns1.example.com.
@ IN CNAME shop.cdnprovider.net.
www IN CNAME shop.cdnprovider.net.
Quiz
This zone edit has TWO defects that will bite in production. What are they?
Heads-up 300 s is a fine negative-cache TTL, and an in-zone NS like ns1.example.com gets glue from the parent .com delegation, not from this file. The real defects are the decremented serial and the apex CNAME.
Heads-up Refresh (7200) above retry (3600) is the normal ordering. The actual breakers are the decremented serial and the forbidden apex CNAME.
Heads-up A CNAME on a non-apex label like www pointing to another domain is completely legal and common for CDNs. The illegal one is the apex (@) CNAME; the other defect is the decremented serial.
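Both defects in this zone edit can be caught mechanically before deployment. The sketch below is a hypothetical pre-commit check, not a real zone parser: `serial_advanced` applies the wrap-around serial comparison that secondaries use (RFC 1982 serial number arithmetic), and `apex_cname_conflicts` takes an already-parsed list of `(owner, type)` pairs, which is an assumed input shape.

```python
def serial_advanced(old: int, new: int) -> bool:
    """Return True when `new` counts as newer than `old`.

    SOA serials are 32-bit and compared with wrap-around arithmetic:
    `new` is newer only if it is ahead of `old` by less than 2**31.
    A serial that does not advance is ignored by secondaries.
    """
    diff = (new - old) % 2**32
    return 0 < diff < 2**31


def apex_cname_conflicts(records, apex="@"):
    """Return the record types that coexist with a CNAME at the apex.

    `records` is an iterable of (owner, rtype) pairs. A CNAME may not
    share its owner name with other data, and the apex always carries
    SOA and NS, so any apex CNAME yields a non-empty result.
    """
    apex_types = {rtype for owner, rtype in records if owner == apex}
    if "CNAME" not in apex_types:
        return []
    return sorted(apex_types - {"CNAME"})


if __name__ == "__main__":
    # Values taken from Snippet 2.
    print(serial_advanced(old=2026051905, new=2026051300))
    zone = [("@", "SOA"), ("@", "NS"), ("@", "CNAME"), ("www", "CNAME")]
    print(apex_cname_conflicts(zone))
```

With the Snippet 2 values the serial check reports that the serial did not advance, and the apex check lists NS and SOA as the records the apex CNAME collides with. The `www` CNAME raises no conflict.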
Snippet 3 — a DNSSEC diagnosis
$ dig @1.1.1.1 +dnssec api.bank.example A
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 41832
;; flags: qr rd ra; QUERY: 1, ANSWER: 0
$ dig @1.1.1.1 +cd api.bank.example A    # +cd = checking disabled
;; ->>HEADER<<- status: NOERROR; flags: qr rd ra cd; ANSWER: 1
api.bank.example. 60 IN A 203.0.113.42
Quiz
A validating resolver returns SERVFAIL, but the same query with +cd returns a clean A record. What does this prove, and what is the customer impact?
Heads-up +cd returns a live NOERROR answer, so the authoritative server is up and serving data. The failure is validation, not availability.
Heads-up DNSSEC verifies signatures, not whether an IP is 'right'. +cd shows the record is being served whether or not its signatures validate; the break is in the signature chain (DS/KSK), not a wrong IP.
Heads-up Any DNSSEC-validating resolver (8.8.8.8, Quad9 included) returns the same SERVFAIL because the chain is broken upstream. The +cd success rules out rate limiting.
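The two-query comparison above reduces to a small decision table. The sketch below encodes it as a pure function over the two status strings read from dig; the function name `classify_dnssec_probe` and the result labels are hypothetical, and the mapping covers only the cases this snippet discusses.

```python
def classify_dnssec_probe(status_validating: str, status_cd: str) -> str:
    """Classify a pair of dig results for the same name and resolver.

    status_validating: header status from the normal (validating) query.
    status_cd:         header status from the same query sent with +cd.
    """
    if status_validating == "SERVFAIL" and status_cd == "NOERROR":
        # Data is reachable; only validation rejects it.
        return "dnssec-validation-failure"
    if status_validating == "SERVFAIL" and status_cd == "SERVFAIL":
        # Still failing with checking disabled: not a validation problem.
        return "resolution-failure"
    if status_validating == status_cd:
        # Same answer either way: validation is not changing the outcome.
        return "no-validation-difference"
    return "inconclusive"


if __name__ == "__main__":
    # The pair observed in Snippet 3.
    print(classify_dnssec_probe("SERVFAIL", "NOERROR"))
```

For Snippet 3 the result is the validation-failure branch, which means clients behind validating resolvers cannot resolve the name while clients behind non-validating resolvers still can.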
Snippet 4 — a resolver config and a latency reading
serve-expired is on, the resolver's nearest Cloudflare PoP is AMS, yet example.com costs 70 ms on every query and never drops to sub-ms. Which two readings are correct?
Heads-up serve-expired serves stale data faster during outages; it never adds latency to a healthy query. The constant 70 ms is a caching failure (TTL=0 or validation discard), unrelated to this directive.
Heads-up AMS is just the PoP that answered (Amsterdam); a single anycast hop is fine. The 70 ms recurring on every query is about the answer not being cached, not PoP selection.
Heads-up A warm cache answers in sub-millisecond to low-millisecond time. A steady 70 ms means every query goes upstream again: the answer is not landing in cache (TTL=0, expiry, or DNSSEC discard).
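Why a TTL of 0 produces a constant upstream cost on every query can be shown with a toy cache. This is a simplified model under stated assumptions, not how any particular resolver is implemented: `TtlCache` and `upstream_calls_for` are hypothetical names, the upstream is a stub that always returns the same answer, and time is driven by a fake clock so the run is deterministic.

```python
class TtlCache:
    """Toy resolver cache: stores an answer until its TTL runs out."""

    def __init__(self, upstream, clock):
        self._upstream = upstream    # callable: name -> (answer, ttl_seconds)
        self._clock = clock          # callable: () -> current time in seconds
        self._entries = {}           # name -> (answer, expiry_time)
        self.upstream_calls = 0

    def lookup(self, name):
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and entry[1] > now:
            return entry[0]          # cache hit: no upstream round trip
        answer, ttl = self._upstream(name)
        self.upstream_calls += 1
        if ttl > 0:                  # a TTL of 0 is never stored
            self._entries[name] = (answer, now + ttl)
        return answer


def upstream_calls_for(ttl, queries=5, spacing=1.0):
    """Count upstream round trips for `queries` lookups `spacing` seconds apart."""
    now = [0.0]
    cache = TtlCache(lambda name: ("203.0.113.42", ttl), lambda: now[0])
    for _ in range(queries):
        cache.lookup("example.com.")
        now[0] += spacing
    return cache.upstream_calls


if __name__ == "__main__":
    print("TTL 0:  ", upstream_calls_for(0))
    print("TTL 300:", upstream_calls_for(300))
```

In this model five lookups with a TTL of 300 cost one upstream round trip, while five lookups with a TTL of 0 cost five, which is the pattern behind a query time that never drops.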
Recap
Every DNS incident is read in artefacts: a +trace referral tells you whether glue is even required (in-bailiwick vs out-of-bailiwick) and where the walk stalls; a zone file exposes decremented serials and forbidden apex CNAMEs at a glance; a SERVFAIL that clears under +cd is a DNSSEC chain break, almost always a DS-vs-KSK mismatch after a rollover; and a constant query time means the answer is not being cached, while serve-expired only helps during outages. Diagnose from the artefact, name the broken link, then fix the one mechanism it points to.