KomuraSoft LLC
Chapter 6

Observation and triage — dig / logs / common misdiagnoses

Look at the actual responses first, and rule out common sources of misdiagnosis such as negative caching, split-horizon, and browser cache before suspecting poisoning.

Recap of the previous chapter: In Chapter 5 we organized the scope of DNSSEC, the last-hop problem, and how to assemble defense in depth. In this chapter we extend that defense to "what to observe on a running resolver to notice when it breaks down." The basics of dig (the range covered in DNS101) are only the foundation; from there we move on to patterns for spotting poisoning indicators in logs.

Look at the actual responses before touching the admin console

When name resolution goes wrong, comparing "what the affected resolver is returning right now" with "what the authoritative path looks like" gets you to the cause faster than taking a screenshot of the record management console. The console shows what is configured, not what clients are actually being told.

$ dig @192.0.2.53 www.example.com A
$ dig +trace www.example.com A
$ dig +dnssec www.example.com A

The 192.0.2.53 used here is an address from the TEST-NET-1 documentation range (192.0.2.0/24, RFC 5737) and is for illustration only. In real checks, use only resolvers you operate yourself or legitimate targets of observation.
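The comparison between the affected resolver and the authoritative path can be sketched in a few lines of Python. This is a minimal illustration, not a finished tool: it assumes you have already captured the text output of two `dig ... +short` runs (one against the affected resolver, one against an authoritative server), and the helper names `parse_short_answers` and `compare_answers` are hypothetical.

```python
import ipaddress


def parse_short_answers(dig_short_output: str) -> set[str]:
    """Collect the IP addresses found in `dig +short` output."""
    addresses = set()
    for line in dig_short_output.splitlines():
        token = line.strip()
        if not token:
            continue
        try:
            addresses.add(str(ipaddress.ip_address(token)))
        except ValueError:
            # Not an address, for example a CNAME target; skip it.
            continue
    return addresses


def compare_answers(resolver_output: str, authoritative_output: str) -> dict[str, set[str]]:
    """Split the answers into resolver-only, authoritative-only, and common."""
    resolver = parse_short_answers(resolver_output)
    authoritative = parse_short_answers(authoritative_output)
    return {
        "only_resolver": resolver - authoritative,
        "only_authoritative": authoritative - resolver,
        "common": resolver & authoritative,
    }


# Illustrative captured outputs (documentation addresses, not real data).
diff = compare_answers("203.0.113.10\n", "cdn.example.net.\n192.0.2.10\n")
print(diff["only_resolver"])       # addresses only the affected resolver returned
print(diff["only_authoritative"])  # addresses only the authoritative side returned
```

A non-empty "only_resolver" set is a prompt for further triage, not proof of poisoning: split-horizon designs, CDNs, and round-robin records can all produce legitimate differences.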

Practice 6-1 — The first triage step

Compare actual responses before trusting the impression from an admin console.

Q1. A user reports that "only a particular name jumps to a strange IP." Which is the most appropriate first safe check?

Q2. Which best describes what dig +trace is suited for?

Q3. Which of the following situations makes you more likely to suspect a shared recursive cache problem than a per-browser cache?

Triaging common symptoms

Symptom | First suspect | First check
Multiple users sharing the same resolver see the same wrong answer | shared recursive cache | Compare affected resolver with authoritative path
Only internal users see a different IP | split-horizon / internal-facing design | Design intent and presence of an internal zone
A name just created is still NXDOMAIN | negative caching | TTL on the negative response
Anomalies involving RRSIG / AD | DNSSEC validation / last hop | dig +dnssec and the trust chain

Negative caching and misdiagnoses

Two classic situations get confused with poisoning: "the name I just created isn't there yet" and "internal and external see different things." Ruling out negative caching and split-horizon first cuts down on false alarms.

Hint: Per RFC 2308, a negative answer is cached for the smaller of the SOA record's own TTL and the SOA MINIMUM field. Don't forget that even a "does not exist" result has a lifetime.
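The lifetime of a negative answer can be worked out with simple arithmetic. The sketch below applies the RFC 2308 rule (negative TTL is the smaller of the SOA record's TTL and the SOA MINIMUM field) to an illustrative timestamp. The function names and the example values are assumptions made for this illustration, and a real resolver may additionally cap the value through its own configuration.

```python
from datetime import datetime, timedelta


def negative_ttl(soa_ttl: int, soa_minimum: int) -> int:
    """RFC 2308: negative TTL is min(SOA record TTL, SOA MINIMUM field)."""
    return min(soa_ttl, soa_minimum)


def negative_cache_expiry(cached_at: datetime, soa_ttl: int, soa_minimum: int) -> datetime:
    """Time at which a negative answer cached at `cached_at` stops being served."""
    return cached_at + timedelta(seconds=negative_ttl(soa_ttl, soa_minimum))


def still_negative_cached(now: datetime, cached_at: datetime,
                          soa_ttl: int, soa_minimum: int) -> bool:
    """True while the resolver may still answer NXDOMAIN from its cache."""
    return now < negative_cache_expiry(cached_at, soa_ttl, soa_minimum)


# Illustrative values: SOA TTL 3600 s, SOA MINIMUM 900 s, cached at 09:00:00.
cached_at = datetime(2024, 1, 1, 9, 0, 0)
print(negative_cache_expiry(cached_at, 3600, 900))  # 2024-01-01 09:15:00
print(still_negative_cached(datetime(2024, 1, 1, 9, 5, 0), cached_at, 3600, 900))  # True
```

With these values, a record created at 09:02 can stay invisible through that resolver until 09:15, which is expected behavior and not an indicator of poisoning.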

Practice 6-2 — negative caching and misdiagnoses

Rule out classic look-alikes for poisoning first.

Q4. At 13:00:00 an NXDOMAIN was negative-cached with a 300-second TTL. At 13:01:00 you created the record for that name, but at 13:02:00 it is still NXDOMAIN. Which explanation is closest?

Q5. Which of the following is the closest hint that points to split-horizon / internal-facing design rather than cache poisoning?

Patterns for spotting anomalies in logs

On top of one-shot dig checks, routinely reviewing resolver / query logs helps you notice earlier that "something is happening now" or "something happened in the past." At the intermediate level, the following three patterns are worth remembering.

1. Concentration of random labels under a single zone
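A burst of queries for unfamiliar subdomains under the same parent zone arrives in a short period, with a high NXDOMAIN ratio. This is consistent with someone deliberately generating cache misses, since each miss makes the resolver send a fresh upstream query and so opens another window for a forged response. Also check whether the querier IPs are skewed toward a few sources and whether the queries cluster at particular times of day.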
2. Streaks of RRSIG validation failures
When validation failures against the same zone appear in a row over a short period, triage on three axes: (a) key rollover / signature expiration on the zone side, (b) tampering or forged-response acceptance on the path, (c) clock skew on the validating side. Check your own clock and the zone's operational status first, then treat whatever remains unexplained as possible poisoning or man-in-the-middle activity.
3. Degradation in NAT / port observation
Use NetFlow / pcap to build a distribution of outbound UDP source ports, and check periodically that it has not collapsed from the expected high ephemeral range into a narrow sequential band. Predictable source ports remove much of the guessing an off-path attacker would otherwise face, so this check catches NAT configuration changes or middlebox behavior changes early.
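Patterns 1 and 3 above lend themselves to simple automated checks. The following is a rough sketch under stated assumptions: the log has already been parsed into (qname, rcode) tuples for one time window, the parent zone is guessed naively from the last two labels, and all function names and thresholds are hypothetical starting points rather than recommended values.

```python
from collections import defaultdict


def parent_zone(qname: str, labels: int = 2) -> str:
    """Naive parent-zone guess: the last `labels` labels of the query name."""
    parts = qname.rstrip(".").lower().split(".")
    return ".".join(parts[-labels:])


def random_label_suspects(records, min_names=50, min_nxdomain_ratio=0.9):
    """Flag zones with many distinct names and a high NXDOMAIN ratio.

    `records` is an iterable of (qname, rcode) tuples from one time window.
    Returns {zone: (distinct_name_count, nxdomain_ratio)}.
    """
    names = defaultdict(set)
    total = defaultdict(int)
    nxdomain = defaultdict(int)
    for qname, rcode in records:
        zone = parent_zone(qname)
        names[zone].add(qname.rstrip(".").lower())
        total[zone] += 1
        if rcode == "NXDOMAIN":
            nxdomain[zone] += 1
    suspects = {}
    for zone, count in total.items():
        ratio = nxdomain[zone] / count
        if len(names[zone]) >= min_names and ratio >= min_nxdomain_ratio:
            suspects[zone] = (len(names[zone]), ratio)
    return suspects


def port_range_collapsed(source_ports, min_span=10000):
    """True when the observed UDP source ports span less than `min_span`."""
    ports = list(source_ports)
    if not ports:
        return False  # nothing observed, nothing to judge
    return max(ports) - min(ports) < min_span


# Illustrative input: 60 distinct random-looking names, all NXDOMAIN.
window = [(f"r{i}.example.com.", "NXDOMAIN") for i in range(60)]
window += [("www.example.org.", "NOERROR")] * 5
print(random_label_suspects(window))
print(port_range_collapsed(range(40000, 40200)))
```

The last-two-labels guess misjudges zones under multi-label suffixes, and the span check can be fooled by a few outlier ports, so treat both results as prompts to look at the raw logs rather than as verdicts.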

Practice 6-3 — Patterns for spotting anomalies in logs

Going one step beyond the basics, confirm what to look for when picking up poisoning indicators from resolver / query logs.

Q6. A flood of queries to unfamiliar subdomains under a particular zone is concentrated in a short period of time, and each one returns NXDOMAIN. From a cache poisoning standpoint, which is the most plausible hypothesis?

Q7. On a validating resolver, you see RRSIG validation failures against the same zone in a row over a short period. Which combination of hypotheses is most appropriate to investigate first?

Key takeaways from this chapter

  • Start by comparing the affected resolver with the authoritative path
  • +trace is well suited for observing the delegation path
  • Rule out misdiagnosis sources — negative caching, split-horizon, browser cache — first
  • For logs, continuously watch "random-label concentration," "RRSIG validation failure streaks," and "port distribution collapse"