Chapter 6

Observation and triage — dig / logs / common misdiagnoses

Look at the actual responses first and rule out misdiagnoses such as negative caching, split-horizon, and browser cache before anything else.

Recap of the previous chapter: In Chapter 5 we organized the scope of DNSSEC, the last-hop problem, and how to assemble defense in depth. This chapter extends that defense to observation: what to watch on a running resolver so that you notice when the defense breaks down. The basics of dig (the range covered in DNS101) are only the foundation; we go further, to patterns for spotting poisoning indicators in logs.

Look at the actual responses before touching the admin console

When name resolution goes wrong, compare what the affected resolver is returning right now with what the authoritative path looks like. That comparison gets you to the cause faster than capturing screenshots of the record management console.

$ dig @192.0.2.53 www.example.com A
$ dig +trace www.example.com A
$ dig +dnssec www.example.com A

The 192.0.2.53 used here comes from TEST-NET-1 (192.0.2.0/24), a range reserved for documentation, and is for illustration only. In real checks, query only resolvers you operate yourself or targets you are authorized to observe.
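
The comparison itself can be made mechanical. The following is a minimal sketch, assuming you have saved the output of two `dig +short` runs (one against the affected resolver, one against the authoritative side) as text. The function names and the sample addresses are hypothetical. A difference is only a starting point for investigation, since round-robin records or intentional split-horizon answers also produce differences.

```python
def parse_short_answers(dig_short_output: str) -> set[str]:
    """Turn `dig +short` output into a set of normalized answer strings."""
    answers = set()
    for line in dig_short_output.splitlines():
        # Normalize case and the trailing dot so CNAME targets compare equal.
        value = line.strip().lower().rstrip(".")
        if value and not value.startswith(";"):
            answers.add(value)
    return answers


def compare_answers(resolver_output: str, authoritative_output: str) -> dict[str, set[str]]:
    """Report which answers appear on only one side of the comparison."""
    resolver = parse_short_answers(resolver_output)
    authoritative = parse_short_answers(authoritative_output)
    return {
        "only_resolver": resolver - authoritative,
        "only_authoritative": authoritative - resolver,
        "common": resolver & authoritative,
    }


if __name__ == "__main__":
    # Hypothetical saved outputs; both addresses are documentation addresses.
    diff = compare_answers("198.51.100.7\n", "203.0.113.10\n")
    print("only on resolver:", diff["only_resolver"])
    print("only on authoritative:", diff["only_authoritative"])
```

An answer that shows up only on the resolver side is the one to carry into the triage that follows.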

Triaging common symptoms

Symptom	First suspect	First check
Multiple users sharing the same resolver see the same wrong answer	shared recursive cache	Compare affected resolver with authoritative path
Only internal users see a different IP	split-horizon / internal-facing design	Design intent and presence of an internal zone
A name just created is still NXDOMAIN	negative caching	TTL on the negative response
Anomalies involving RRSIG / AD	DNSSEC validation / last hop	`dig +dnssec` and the trust chain
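
Two of the first checks in the table, the response status and the AD flag, can be read from the header lines of dig's default output. The sketch below assumes the usual `status:` and `;; flags:` header lines; the function name and the sample text are hypothetical, and the sample is hand-written rather than captured from a real query.

```python
import re

STATUS_RE = re.compile(r"status:\s*([A-Z]+)")
# Matches the message header line only, not the "flags: do" in the EDNS line.
FLAGS_RE = re.compile(r";; flags:([^;]*);")


def summarize_dig_header(dig_output: str) -> dict:
    """Extract the response status and header flags from dig's default output."""
    status_match = STATUS_RE.search(dig_output)
    flags_match = FLAGS_RE.search(dig_output)
    flags = flags_match.group(1).split() if flags_match else []
    return {
        "status": status_match.group(1) if status_match else None,
        "flags": flags,
        "ad": "ad" in flags,  # AD set means the resolver reports validated data
    }


if __name__ == "__main__":
    # Hypothetical, hand-written sample of dig header lines.
    sample = (
        ";; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 4242\n"
        ";; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1\n"
    )
    print(summarize_dig_header(sample))
```

The AD flag only tells you what the queried resolver claims, so treat it as an input to the last-hop discussion rather than as proof of validation.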

Negative caching and misdiagnoses

Complaints such as "The name I just created isn't there yet" or "internal and external see different things" are classic situations that get mistaken for poisoning. Ruling out negative caching and split-horizon first cuts down on false alarms.

Hint: The negative caching TTL is the smaller of the SOA record's own TTL and the SOA MINIMUM field, as described in RFC 2308. Don't forget that even a "does not exist" result has a lifetime.
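
As a worked example of that lifetime, the sketch below computes when a negative answer should age out of the cache. The function names and the numbers are hypothetical, and the calculation assumes the resolver applies the RFC 2308 rule without its own additional cap, which real resolvers may impose.

```python
from datetime import datetime, timedelta


def negative_ttl(soa_record_ttl: int, soa_minimum: int) -> int:
    """Negative caching TTL: the smaller of the SOA's TTL and its MINIMUM field."""
    return min(soa_record_ttl, soa_minimum)


def negative_cache_expiry(cached_at: datetime, soa_record_ttl: int, soa_minimum: int) -> datetime:
    """Time at which a negative answer cached at `cached_at` should expire."""
    return cached_at + timedelta(seconds=negative_ttl(soa_record_ttl, soa_minimum))


if __name__ == "__main__":
    # Hypothetical values: SOA TTL of 3600 s, MINIMUM of 600 s, cached at 09:00:00.
    cached_at = datetime(2024, 1, 1, 9, 0, 0)
    print(negative_cache_expiry(cached_at, soa_record_ttl=3600, soa_minimum=600))
```

With these values the NXDOMAIN stays cached until 09:10:00, so a record created at 09:02:00 remains invisible through that resolver for another eight minutes.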

Practice 6-2 — negative caching and misdiagnoses

Rule out classic look-alikes for poisoning first.

Q4. At 13:00:00 an NXDOMAIN was negative-cached with a 300-second TTL. At 13:01:00 you created the record for that name, but at 13:02:00 it is still NXDOMAIN. Which explanation is closest?

Because the authoritative server returns only MX
Because the negative caching TTL has not expired yet
Because DNSSEC has automatically been disabled
Because of the browser's favicon cache

Q5. Which of the following is the closest hint that points to split-horizon / internal-facing design rather than cache poisoning?

On the same shared resolver, random-label queries are surging
Logs show Additional out-of-bailiwick data being accepted carelessly
Signature validation failures happen in a row
Only the internal resolver returns an internal IP, while the public authoritative side returns a different answer as intended

Patterns for spotting anomalies in logs

On top of one-shot dig checks, routinely reviewing resolver / query logs helps you notice earlier that "something is happening now" or "something happened in the past." At the intermediate level, the following three patterns are worth remembering.

1. Concentration of random labels under a single zone

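Queries to unfamiliar subdomains under the same parent zone cluster in a short period, and most of them return NXDOMAIN. This is consistent with an attacker deliberately generating fresh cache misses: each miss makes the resolver send a new upstream query, which gives another opportunity to race in a forged response. Also check whether the querier IPs are skewed toward a few sources and whether the queries are concentrated at particular times of day.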
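
One way to turn this pattern into a routine check is sketched below. It assumes the query log has already been parsed into `(timestamp_seconds, qname, rcode)` tuples, which is a hypothetical format; the function name and the thresholds are also illustrative and would need tuning against your own baseline traffic. Querier IPs are left out to keep the sketch short.

```python
from collections import defaultdict


def random_label_windows(records, zone, window_seconds=60,
                         min_distinct=50, min_nxdomain_ratio=0.9):
    """Flag time windows with many distinct names under `zone` and mostly NXDOMAIN.

    Returns a list of (window_start, distinct_labels, nxdomain_ratio) tuples.
    """
    suffix = "." + zone.lower().rstrip(".")
    buckets = defaultdict(lambda: {"labels": set(), "total": 0, "nxdomain": 0})
    for timestamp, qname, rcode in records:
        name = qname.lower().rstrip(".")
        if not name.endswith(suffix):
            continue  # outside the zone (the zone apex itself is also skipped)
        bucket = buckets[int(timestamp // window_seconds)]
        bucket["labels"].add(name[: -len(suffix)])
        bucket["total"] += 1
        if rcode == "NXDOMAIN":
            bucket["nxdomain"] += 1
    flagged = []
    for index in sorted(buckets):
        bucket = buckets[index]
        ratio = bucket["nxdomain"] / bucket["total"]
        if len(bucket["labels"]) >= min_distinct and ratio >= min_nxdomain_ratio:
            flagged.append((index * window_seconds, len(bucket["labels"]), ratio))
    return flagged


if __name__ == "__main__":
    # Synthetic data: 60 never-seen labels in one minute, then ordinary traffic.
    burst = [(i, f"x{i}.example.com.", "NXDOMAIN") for i in range(60)]
    normal = [(120 + i, "www.example.com.", "NOERROR") for i in range(5)]
    print(random_label_windows(burst + normal, "example.com"))
```

A flagged window is a prompt to look at the querier IPs and timing for that window, not a verdict by itself.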

2. Streaks of RRSIG validation failures

When validation failures against the same zone appear in a row over a short period, triage on three axes: (a) key rollover / signature expiration on the zone side, (b) tampering or forged-response acceptance on the path, (c) clock skew on the validating side. Check your own clock and the zone's operational status first, then treat whatever remains unexplained as suspected poisoning or man-in-the-middle activity.
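
Axes (a) and (c) can be partly separated by comparing the signature validity window with the validator's clock. The sketch below assumes the inception and expiration values are copied from an RRSIG record in `dig +dnssec` output, where they appear as `YYYYMMDDHHmmSS` timestamps in UTC with expiration printed before inception. The function names and the sample values are hypothetical.

```python
from datetime import datetime, timezone


def parse_rrsig_time(value: str) -> datetime:
    """Parse an RRSIG timestamp in YYYYMMDDHHmmSS presentation format (UTC)."""
    return datetime.strptime(value, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def classify_rrsig_window(inception: str, expiration: str, now: datetime) -> str:
    """Place `now` relative to the signature validity window."""
    start = parse_rrsig_time(inception)
    end = parse_rrsig_time(expiration)
    if now > end:
        return "expired"        # stale signature, or the local clock runs ahead
    if now < start:
        return "not-yet-valid"  # the local clock is probably behind
    return "within-window"      # timing does not explain the failure


if __name__ == "__main__":
    # Hypothetical values; pass the validating resolver's own idea of "now".
    now = datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
    print(classify_rrsig_window("20240401000000", "20240415000000", now))
```

If the window is fine according to a trusted clock and the zone operator reports no rollover in progress, the remaining explanation on the list is axis (b).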

3. Degradation in NAT / port observation

Use NetFlow / pcap to take a distribution of outbound UDP source ports, and check periodically that it has not collapsed from the expected high ephemeral range into a narrow sequential band. This catches NAT configuration changes or middlebox behavior changes early.
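
A periodic check of this kind could look like the sketch below. It assumes you have already extracted the source ports of the resolver's outbound DNS queries, in observation order, from NetFlow or pcap data. The function names and the thresholds are hypothetical and should be calibrated against what your resolver normally produces.

```python
def port_distribution_summary(ports):
    """Summarize observed source ports, given in observation order."""
    ports = list(ports)
    if len(ports) < 2:
        raise ValueError("need at least two observed ports")
    steps = [later - earlier for earlier, later in zip(ports, ports[1:])]
    sequential = sum(1 for step in steps if step == 1)
    return {
        "distinct": len(set(ports)),
        "span": max(ports) - min(ports),
        "sequential_ratio": sequential / len(steps),
    }


def looks_collapsed(summary, min_span=10000, max_sequential_ratio=0.5):
    """Heuristic: a narrow band or mostly +1 steps suggests lost randomization."""
    return (summary["span"] < min_span
            or summary["sequential_ratio"] > max_sequential_ratio)


if __name__ == "__main__":
    # Synthetic samples: a sequential low band versus scattered high ports.
    rewritten = list(range(2000, 2050))
    scattered = [49152, 61000, 53211, 64999, 50010, 58000]
    print(looks_collapsed(port_distribution_summary(rewritten)))
    print(looks_collapsed(port_distribution_summary(scattered)))
```

Measuring at the point where packets leave the network matters, because a NAT device can rewrite well-randomized ports after the resolver has chosen them.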

Practice 6-3 — Patterns for spotting anomalies in logs

Since this is an intermediate course, go one step further and confirm what to look for when picking up poisoning indicators from resolver / query logs.

Q6. A flood of queries to unfamiliar subdomains under a particular zone is concentrated in a short period of time, and each one returns NXDOMAIN. From a cache poisoning standpoint, which is the most plausible hypothesis?

Users are coincidentally making a lot of typos
It may be a sign of behavior similar to the Kaminsky type, deliberately generating fresh misses to pile up attempts
Negative caching is malfunctioning
It is evidence that DNSSEC validation succeeded

Q7. On a validating resolver, you see RRSIG validation failures against the same zone in a row over a short period. Which combination of hypotheses is most appropriate to investigate first?

All browser caches should be cleared
(1) Key rollover / signature expiration on the zone side, (2) tampering or forged-response acceptance somewhere on the path, (3) clock skew on your own resolver
Putting the DS record on the child side will solve it
Setting the TTL to 0 will make validation pass

Key takeaways from this chapter

Start by comparing the affected resolver with the authoritative path
+trace is well suited for observing the delegation path
Rule out misdiagnosis sources — negative caching, split-horizon, browser cache — first
For logs, continuously watch "random-label concentration," "RRSIG validation failure streaks," and "port distribution collapse"
