Observation and triage — dig / logs / common misdiagnoses
Look at the actual responses first and rule out misdiagnoses such as negative caching, split-horizon, and browser cache before anything else.
Recap of the previous chapter: In Chapter 5 we covered the scope of DNSSEC, the last-hop problem, and how to build defense in depth. This chapter extends that defense to the question of what to observe on a running resolver so that you notice when it breaks down. The basics of dig (the range covered in DNS101) are only the foundation; here we go further, to patterns for spotting poisoning indicators in logs.
Look at the actual responses before touching the admin console
When name resolution goes wrong, comparing what the affected resolver is returning right now with what the authoritative path looks like gets you to the cause faster than taking a screenshot of the record management console.
$ dig @192.0.2.53 www.example.com A
$ dig +trace www.example.com A
$ dig +dnssec www.example.com A
The 192.0.2.53 used here is a TEST-NET-1 documentation address (192.0.2.0/24, RFC 5737) for illustration only. In real checks, query only resolvers you operate yourself or targets you are authorized to observe.
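The comparison described above can be made mechanical. The following is a minimal sketch, assuming you have already captured the output of `dig +short` once against the affected resolver and once against an authoritative server. The function names and the sample outputs are hypothetical and use documentation addresses only.

```python
def parse_short(output: str) -> set[str]:
    """Turn `dig +short` output into a set of answers, ignoring blanks and comments."""
    return {
        line.strip().rstrip(".").lower()
        for line in output.splitlines()
        if line.strip() and not line.lstrip().startswith(";")
    }


def compare_answers(resolver_out: str, authoritative_out: str) -> dict[str, set[str]]:
    """Report which answers appear on only one side and which appear on both."""
    resolver = parse_short(resolver_out)
    authoritative = parse_short(authoritative_out)
    return {
        "only_resolver": resolver - authoritative,
        "only_authoritative": authoritative - resolver,
        "both": resolver & authoritative,
    }


# Hand-written sample outputs (illustrative, not real captures).
resolver_out = "203.0.113.66\n"
authoritative_out = "192.0.2.10\n192.0.2.11\n"

diff = compare_answers(resolver_out, authoritative_out)
if diff["only_resolver"]:
    print("resolver returns answers the authoritative side does not:",
          sorted(diff["only_resolver"]))
```

A difference is a reason to keep looking, not a verdict: round-robin records, CDN steering, and split-horizon designs can all produce different answer sets legitimately.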
Practice 6-1 — The first triage step
Compare actual responses before trusting the impression from an admin console.
Q1. A user reports that "only a particular name jumps to a strange IP." Which is the most appropriate first safe check?
The appropriate first check is to compare what the affected recursive resolver returns against what the authoritative path / +trace shows. Don't judge from the impression of an admin console alone; look at the actual responses.
Q2. Which best describes what dig +trace is suited for?
dig +trace is suited for walking the delegation path from the root downward, one zone at a time, and seeing where it bends. It lets you observe each delegation boundary and which NS records hand the query to the next set of servers.
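To make "seeing where it bends" concrete, here is a small sketch that splits saved `dig +trace` output into hops, one per `;; Received ... from ...` line, so that each delegation step can be read on its own. The sample text is hand-written in the general shape of dig's output, and the example.com values in it are illustrative. Real output differs between dig versions and usually includes DNSSEC records as well, so treat the parser as an assumption-laden helper rather than a complete one.

```python
import re

# Matches lines such as: ";; Received 239 bytes from 198.41.0.4#53(a.root-servers.net) in 20 ms"
RECEIVED = re.compile(r"^;; Received \d+ bytes from (\S+)#\d+\((\S+)\)")


def trace_hops(trace_output: str) -> list[dict]:
    """Group record lines by the server that returned them."""
    hops, records = [], []
    for line in trace_output.splitlines():
        match = RECEIVED.match(line)
        if match:
            hops.append({"server": match.group(2), "records": records})
            records = []
        elif line.strip() and not line.startswith(";"):
            fields = line.split()
            if len(fields) >= 5:
                # owner, TTL, class, type, rdata...
                records.append((fields[0], fields[3], " ".join(fields[4:])))
    return hops


SAMPLE_TRACE = """\
. 518400 IN NS a.root-servers.net.
;; Received 239 bytes from 192.0.2.53#53(192.0.2.53) in 1 ms

com. 172800 IN NS a.gtld-servers.net.
;; Received 1170 bytes from 198.41.0.4#53(a.root-servers.net) in 20 ms

example.com. 172800 IN NS ns1.example.com.
;; Received 200 bytes from 192.5.6.30#53(a.gtld-servers.net) in 25 ms

www.example.com. 300 IN A 192.0.2.10
;; Received 60 bytes from 192.0.2.1#53(ns1.example.com) in 30 ms
"""

hops = trace_hops(SAMPLE_TRACE)
for hop in hops:
    owners = sorted({(owner, rtype) for owner, rtype, _ in hop["records"]})
    print(hop["server"], "->", owners)
```

Reading the hops in order shows which server delegated to which zone. An unexpected NS set at one step is the place to look more closely.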
Q3. Which of the following situations makes you more likely to suspect a shared recursive cache problem than a per-browser cache?
If multiple users sharing the same resolver see the same wrong answer with similar remaining TTL, you should suspect the shared recursive cache rather than a per-browser issue.
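The "similar remaining TTL" clue can be checked with simple arithmetic: if every client is reading the same cache entry, the observation time plus the remaining TTL should point to roughly the same expiry moment. The sketch below assumes you have collected (time, answer, remaining TTL) triples from several clients. The function names, the tolerance, and the sample values are hypothetical.

```python
from datetime import datetime, timedelta


def implied_expiry(observed_at: datetime, remaining_ttl: int) -> datetime:
    """When the cached entry would expire, judging from one observation."""
    return observed_at + timedelta(seconds=remaining_ttl)


def looks_like_shared_cache(observations, tolerance_s: int = 2) -> bool:
    """observations: list of (observed_at, answer, remaining_ttl) from different clients.

    True when all clients saw the same answer and their implied expiry
    times agree within tolerance_s seconds.
    """
    answers = {answer for _, answer, _ in observations}
    if len(observations) < 2 or len(answers) != 1:
        return False
    expiries = [implied_expiry(t, ttl) for t, _, ttl in observations]
    spread = (max(expiries) - min(expiries)).total_seconds()
    return spread <= tolerance_s


t0 = datetime(2024, 1, 1, 13, 0, 0)  # arbitrary date for illustration
observations = [
    (t0, "203.0.113.66", 240),
    (t0 + timedelta(seconds=30), "203.0.113.66", 210),
    (t0 + timedelta(seconds=90), "203.0.113.66", 151),
]
print(looks_like_shared_cache(observations))
```

A resolver service backed by several independent cache instances may show more than one expiry time, so a negative result does not rule out the shared cache by itself.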
Triaging common symptoms
| Symptom | First suspect | First check |
|---|---|---|
| Multiple users sharing the same resolver see the same wrong answer | shared recursive cache | Compare affected resolver with authoritative path |
| Only internal users see a different IP | split-horizon / internal-facing design | Design intent and presence of an internal zone |
| A name just created is still NXDOMAIN | negative caching | TTL on the negative response |
| Anomalies involving RRSIG / AD | DNSSEC validation / last hop | dig +dnssec and the trust chain |
Negative caching and misdiagnoses
"The name I just created isn't there yet," "internal and external see different things" — these are classic situations that get confused with poisoning. Ruling out negative caching and split-horizon first cuts down on unnecessary false alarms.
Hint: Under RFC 2308, the TTL for negative caching is the smaller of the SOA record's own TTL and the SOA MINIMUM field, as carried in the SOA record in the authority section of the negative response. Don't forget that even a "does not exist" result has a lifetime.
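As a worked example of the RFC 2308 rule (the negative TTL is the smaller of the SOA record's TTL and its MINIMUM field), the sketch below reads a single-line SOA record as it would appear in the authority section of a response from the authoritative server. The function name and the SOA values are hypothetical.

```python
def negative_ttl_from_soa(soa_line: str) -> int:
    """Return min(SOA record TTL, SOA MINIMUM) from a one-line SOA record."""
    fields = soa_line.split()
    # owner TTL class SOA mname rname serial refresh retry expire minimum
    if len(fields) != 11 or fields[3].upper() != "SOA":
        raise ValueError("expected a single-line SOA record")
    record_ttl = int(fields[1])
    soa_minimum = int(fields[10])
    return min(record_ttl, soa_minimum)


# Illustrative SOA record, not taken from a real zone.
soa = ("example.com. 3600 IN SOA ns1.example.com. hostmaster.example.com. "
       "2024010101 7200 900 1209600 300")
print(negative_ttl_from_soa(soa))
```

When the same SOA is read through a recursive resolver, the TTL shown is already counting down, and resolvers may apply their own upper limit on negative caching, so the value seen there can be lower.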
Practice 6-2 — negative caching and misdiagnoses
Rule out classic look-alikes for poisoning first.
Q4. At 13:00:00 an NXDOMAIN was negative-cached with a 300-second TTL. At 13:01:00 you created the record for that name, but at 13:02:00 it is still NXDOMAIN. Which explanation is closest?
It is because the negative caching TTL has not yet expired. Even if you create the record, until the NXDOMAIN lifetime in the resolver runs out, the visible state may not change immediately.
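The timeline in Q4 can be worked through directly. This short sketch uses the times from the question; the date is arbitrary.

```python
from datetime import datetime, timedelta

cached_at = datetime(2024, 1, 1, 13, 0, 0)    # NXDOMAIN negative-cached (date is arbitrary)
negative_ttl = timedelta(seconds=300)         # TTL from the question
checked_at = datetime(2024, 1, 1, 13, 2, 0)   # the check after the record was created

expires_at = cached_at + negative_ttl
remaining_s = (expires_at - checked_at).total_seconds()

print("negative entry expires at", expires_at.time())
print("seconds still remaining at the 13:02:00 check:", remaining_s)
```

The entry lasts until 13:05:00, so at 13:02:00 the resolver still has 180 seconds during which it can keep answering NXDOMAIN from cache.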
Q5. Which of the following is the closest hint that points to split-horizon / internal-facing design rather than cache poisoning?
If only the internal resolver returns an internal IP, while the public authoritative side returns a different answer as intended, suspect split-horizon / internal-facing design first. Verify the design intent before concluding it is an anomaly.
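One quick hint for the split-horizon case is whether the internal answer falls in private (RFC 1918) address space while the public answer does not. The sketch below is a rough heuristic under that assumption; the function names and sample addresses are hypothetical, and a match only tells you to go and confirm the design intent.

```python
import ipaddress

# RFC 1918 private IPv4 ranges.
RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def is_rfc1918(addr: str) -> bool:
    """True if addr is an IPv4 address inside one of the RFC 1918 ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in RFC1918)


def split_horizon_hint(internal_answers, public_answers) -> bool:
    """True when internal answers are all private and public answers are all non-private."""
    if not internal_answers or not public_answers:
        return False
    internal_private = all(is_rfc1918(a) for a in internal_answers)
    public_private = any(is_rfc1918(a) for a in public_answers)
    return internal_private and not public_private


print(split_horizon_hint(["10.20.0.5"], ["203.0.113.10"]))     # internal view vs public view
print(split_horizon_hint(["203.0.113.66"], ["203.0.113.10"]))  # both non-private, answers differ
```

Internal zones do not always use private addresses, so the absence of this hint does not exclude split-horizon either.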
Patterns for spotting anomalies in logs
On top of one-shot dig checks, routinely keeping an eye on resolver / query logs helps you notice "something is happening now" or "something happened in the past" earlier. At the intermediate level, three patterns are worth remembering: random-label concentration, RRSIG validation failure streaks, and port distribution collapse.
Practice 6-3 — Patterns for spotting anomalies in logs
At the intermediate level, go one step further and review what to look for when picking up poisoning indicators from resolver / query logs.
Q6. A flood of queries to unfamiliar subdomains under a particular zone is concentrated in a short period of time, and each one returns NXDOMAIN. From a cache poisoning standpoint, which is the most plausible hypothesis?
Concentrated queries to random-looking, non-existent labels are consistent with someone deliberately generating fresh cache misses, because each miss makes the resolver send a new upstream query and so opens another opportunity to attempt a forged response. To make the call, look at three things together: the rise in the NXDOMAIN ratio for that zone, how skewed the querier IPs are, and how concentrated the queries are in time.
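The three signals named above (NXDOMAIN ratio, querier bias, time concentration) can be pulled out of a query log with a small aggregation. The sketch below assumes log entries have already been parsed into (timestamp, client IP, query name, response code) tuples. That tuple layout, the function name, and all thresholds are hypothetical and would need tuning against your own baseline.

```python
from collections import defaultdict


def nxdomain_bursts(entries, zone, window_s=60, min_queries=20, min_ratio=0.9):
    """Find time windows where queries under `zone` are mostly NXDOMAIN."""
    suffix = "." + zone.rstrip(".").lower()
    buckets = defaultdict(list)
    for ts, client, qname, rcode in entries:
        name = qname.rstrip(".").lower()
        if name.endswith(suffix):  # subdomains of the zone only
            buckets[int(ts) // window_s].append((client, name, rcode))

    findings = []
    for bucket, rows in sorted(buckets.items()):
        nx_names = {name for _, name, rcode in rows if rcode == "NXDOMAIN"}
        nx_count = sum(1 for _, _, rcode in rows if rcode == "NXDOMAIN")
        ratio = nx_count / len(rows)
        if len(rows) >= min_queries and ratio >= min_ratio:
            findings.append({
                "window_start": bucket * window_s,
                "queries": len(rows),
                "nxdomain_ratio": ratio,
                "unique_nx_names": len(nx_names),
                "clients": len({client for client, _, _ in rows}),
            })
    return findings


# Synthetic entries for illustration: 30 random-looking misses plus one normal query.
entries = [(1200 + i, "198.51.100.7", f"x{i:04d}.example.com.", "NXDOMAIN")
           for i in range(30)]
entries.append((1205, "198.51.100.8", "www.example.com.", "NOERROR"))

for finding in nxdomain_bursts(entries, "example.com"):
    print(finding)
```

Many unique names from very few clients inside one window is the shape to watch for. The same shape can also come from a misbehaving application, so it is a hypothesis to verify, not a conclusion.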
Q7. On a validating resolver, you see RRSIG validation failures against the same zone in a row over a short period. Which combination of hypotheses is most appropriate to investigate first?
RRSIG validation failures can come from (1) operational issues on the zone side (key rollover, expired signatures), (2) tampering or forged-response acceptance on the path, or (3) clock skew on the validating side. First triage on these three axes; check your own clock and the zone's operational status, then chase what remains as poisoning or man-in-the-middle suspicion.
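For the time-related part of this triage, the inception and expiration fields of an RRSIG (shown by `dig +dnssec` as YYYYMMDDHHmmSS in UTC) can be compared with the validator's clock. The sketch below is one way to sort a failure into "the validity window explains it" or "look elsewhere". The function names, labels, and the five-minute skew allowance are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def parse_rrsig_time(value: str) -> datetime:
    """Parse an RRSIG timestamp in YYYYMMDDHHmmSS form as UTC."""
    return datetime.strptime(value, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def classify_rrsig(now: datetime, inception: str, expiration: str,
                   skew: timedelta = timedelta(minutes=5)) -> str:
    """Classify the signature validity window relative to `now`."""
    start, end = parse_rrsig_time(inception), parse_rrsig_time(expiration)
    if now > end + skew:
        return "expired"        # check zone re-signing, and whether the local clock runs ahead
    if now < start - skew:
        return "not-yet-valid"  # check whether the local clock runs behind
    if now > end or now < start:
        return "within-skew"    # borderline: small clock difference is enough to explain it
    return "time-ok"            # window is fine: look at key rollover or path tampering


now = datetime(2024, 6, 15, 12, 0, 0, tzinfo=timezone.utc)  # fixed time for illustration
print(classify_rrsig(now, "20240601000000", "20240610000000"))
print(classify_rrsig(now, "20240601000000", "20240701000000"))
print(classify_rrsig(now, "20240616000000", "20240716000000"))
```

Only when the result is "time-ok" and the zone's key rollover status also checks out does the remaining suspicion shift toward tampering on the path.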
Key takeaways from this chapter
- Start by comparing the affected resolver with the authoritative path
- dig +trace is well suited for observing the delegation path
- Rule out misdiagnosis sources (negative caching, split-horizon, browser cache) first
- For logs, continuously watch "random-label concentration," "RRSIG validation failure streaks," and "port distribution collapse"
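For the "port distribution collapse" item, one possible measurement is how widely the resolver's outbound query source ports are spread. The sketch below assumes you already have a list of source ports extracted from a packet capture or log; the function name and sample data are hypothetical. It reports the number of distinct ports and the Shannon entropy of the distribution, which drops toward zero when the ports stop varying.

```python
import math
from collections import Counter


def port_spread(ports):
    """Summarize how widely a list of observed source ports is spread."""
    counts = Counter(ports)
    total = len(ports)
    # Shannon entropy in bits; 0.0 when every sample uses the same port.
    entropy = sum(-(n / total) * math.log2(n / total) for n in counts.values())
    return {"samples": total, "distinct": len(counts), "entropy_bits": entropy}


# Synthetic samples for illustration.
healthy = list(range(20000, 20256))   # 256 different ports
collapsed = [53000] * 256             # the same port every time

print(port_spread(healthy))
print(port_spread(collapsed))
```

A sudden fall in distinct ports or entropy compared with your own baseline is worth investigating, since a device on the path that rewrites source ports can cause it without any change on the resolver itself.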