draft-ietf-dnsop-serve-stale-03.txt   draft-ietf-dnsop-serve-stale-04.txt 
DNSOP Working Group D. Lawrence DNSOP Working Group D. Lawrence
Internet-Draft Oracle Internet-Draft Oracle
Updates: 1034, 1035 (if approved) W. Kumari Updates: 1034, 1035 (if approved) W. Kumari
Intended status: Standards Track P. Sood Intended status: Standards Track P. Sood
Expires: August 27, 2019 Google Expires: September 10, 2019 Google
February 23, 2019 March 09, 2019
Serving Stale Data to Improve DNS Resiliency Serving Stale Data to Improve DNS Resiliency
draft-ietf-dnsop-serve-stale-03 draft-ietf-dnsop-serve-stale-04
Abstract Abstract
This draft defines a method for recursive resolvers to use stale DNS This draft defines a method (serve-stale) for recursive resolvers to
data to avoid outages when authoritative nameservers cannot be use stale DNS data to avoid outages when authoritative nameservers
reached to refresh expired data. It updates the definition of TTL cannot be reached to refresh expired data. It updates the definition
from [RFC1034], [RFC1035], and [RFC2181] to make it clear that data of TTL from [RFC1034], [RFC1035], and [RFC2181] to make it clear that
can be kept in the cache beyond the TTL expiry and used for responses data can be kept in the cache beyond the TTL expiry and used for
when a refreshed answer is not readily available. One of the responses when a refreshed answer is not readily available. One of
motivations for serve-stale is to make the DNS more resilient to DoS the motivations for serve-stale is to make the DNS more resilient to
attacks, and thereby make them less attractive as an attack vector. DoS attacks, and thereby make them less attractive as an attack
vector.
Ed note Ed note
Text inside square brackets ([]) is additional background Text inside square brackets ([]) is additional background
information, answers to frequently asked questions, general musings, information, answers to frequently asked questions, general musings,
etc. They will be removed before publication. This document is etc. They will be removed before publication. This document is
being collaborated on in GitHub at <https://github.com/vttale/serve- being collaborated on in GitHub at <https://github.com/vttale/serve-
stale>. The most recent version of the document, open issues, etc stale>. The most recent version of the document, open issues, etc
should all be available here. The authors gratefully accept pull should all be available here. The authors gratefully accept pull
requests. requests.
skipping to change at page 1, line 48 skipping to change at page 2, line 4
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/. Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on September 10, 2019.
This Internet-Draft will expire on August 27, 2019.
Copyright Notice Copyright Notice
Copyright (c) 2019 IETF Trust and the persons identified as the Copyright (c) 2019 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of (https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
skipping to change at page 2, line 27 skipping to change at page 2, line 28
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License. described in the Simplified BSD License.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3
3. Background . . . . . . . . . . . . . . . . . . . . . . . . . 3 3. Background . . . . . . . . . . . . . . . . . . . . . . . . . 3
4. Standards Action . . . . . . . . . . . . . . . . . . . . . . 4 4. Standards Action . . . . . . . . . . . . . . . . . . . . . . 4
5. Example Method . . . . . . . . . . . . . . . . . . . . . . . 4 5. Example Method . . . . . . . . . . . . . . . . . . . . . . . 4
6. Implementation Caveats . . . . . . . . . . . . . . . . . . . 6 6. Implementation Considerations . . . . . . . . . . . . . . . . 6
6.1. Implementation Considerations . . . . . . . . . . . . . . 7 7. Implementation Caveats . . . . . . . . . . . . . . . . . . . 8
7. Implementation Status . . . . . . . . . . . . . . . . . . . . 8 8. Implementation Status . . . . . . . . . . . . . . . . . . . . 9
8. EDNS Option . . . . . . . . . . . . . . . . . . . . . . . . . 8 9. EDNS Option . . . . . . . . . . . . . . . . . . . . . . . . . 9
9. Security Considerations . . . . . . . . . . . . . . . . . . . 9 10. Security Considerations . . . . . . . . . . . . . . . . . . . 10
10. Privacy Considerations . . . . . . . . . . . . . . . . . . . 9 11. Privacy Considerations . . . . . . . . . . . . . . . . . . . 10
11. NAT Considerations . . . . . . . . . . . . . . . . . . . . . 9 12. NAT Considerations . . . . . . . . . . . . . . . . . . . . . 10
12. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 10 13. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 10
13. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 10 14. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 10
14. References . . . . . . . . . . . . . . . . . . . . . . . . . 10 15. References . . . . . . . . . . . . . . . . . . . . . . . . . 11
14.1. Normative References . . . . . . . . . . . . . . . . . . 10 15.1. Normative References . . . . . . . . . . . . . . . . . . 11
14.2. Informative References . . . . . . . . . . . . . . . . . 10 15.2. Informative References . . . . . . . . . . . . . . . . . 11
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 11 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 12
1. Introduction 1. Introduction
Traditionally the Time To Live (TTL) of a DNS resource record has Traditionally the Time To Live (TTL) of a DNS resource record has
been understood to represent the maximum number of seconds that a been understood to represent the maximum number of seconds that a
record can be used before it must be discarded, based on its record can be used before it must be discarded, based on its
description and usage in [RFC1035] and clarifications in [RFC2181]. description and usage in [RFC1035] and clarifications in [RFC2181].
This document proposes that the definition of the TTL be explicitly This document proposes that the definition of the TTL be explicitly
expanded to allow for expired data to be used in the exceptional expanded to allow for expired data to be used in the exceptional
circumstance that a recursive resolver is unable to refresh the circumstance that a recursive resolver is unable to refresh the
information. It is predicated on the observation that authoritative information. It is predicated on the observation that authoritative
server unavailability can cause outages even when the underlying data answer unavailability can cause outages even when the underlying data
those servers would return is typically unchanged. those servers would return is typically unchanged.
We describe a method below for this use of stale data, balancing the We describe a method below for this use of stale data, balancing the
competing needs of resiliency and freshness. competing needs of resiliency and freshness.
2. Terminology 2. Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in BCP "OPTIONAL" in this document are to be interpreted as described in BCP
skipping to change at page 4, line 8 skipping to change at page 4, line 8
capped. (It also has the curious suggestion that a value in the capped. (It also has the curious suggestion that a value in the
range 2147483648 to 4294967295 should be treated as zero.) It closes range 2147483648 to 4294967295 should be treated as zero.) It closes
that section by noting, "The TTL specifies a maximum time to live, that section by noting, "The TTL specifies a maximum time to live,
not a mandatory time to live." This is again not [RFC2119]-normative not a mandatory time to live." This is again not [RFC2119]-normative
language, but does convey the natural language connotation that data language, but does convey the natural language connotation that data
becomes unusable past TTL expiry. becomes unusable past TTL expiry.
Several major recursive resolver operators currently use stale data Several major recursive resolver operators currently use stale data
for answers in some way, including Akamai (in three different for answers in some way, including Akamai (in three different
resolver implementations), BIND, Knot, OpenDNS, and Unbound. Apple resolver implementations), BIND, Knot, OpenDNS, and Unbound. Apple
can also use stale data as part of the Happy Eyeballs algorithms in MacOS can also use stale data as part of the Happy Eyeballs
mDNSResponder. The collective operational experience is that it algorithms in mDNSResponder. The collective operational experience
provides significant benefit with minimal downside. is that it provides significant benefit with minimal downside.
4. Standards Action 4. Standards Action
The definition of TTL in [RFC1035] Sections 3.2.1 and 4.1.3 is The definition of TTL in [RFC1035] Sections 3.2.1 and 4.1.3 is
amended to read: amended to read:
TTL a 32-bit unsigned integer number of seconds that specifies the TTL a 32-bit unsigned integer number of seconds that specifies the
duration that the resource record MAY be cached before the source duration that the resource record MAY be cached before the source
of the information MUST again be consulted. Zero values are of the information MUST again be consulted. Zero values are
interpreted to mean that the RR can only be used for the interpreted to mean that the RR can only be used for the
transaction in progress, and should not be cached. Values SHOULD transaction in progress, and should not be cached. Values SHOULD
be capped on the orders of days to weeks, with a recommended cap be capped on the orders of days to weeks, with a recommended cap
of 604,800 seconds. If the authority for the data is unavailable of 604,800 seconds. If the data is unable to be authoritatively
when attempting to refresh, the record MAY be used as though it is refreshed when the TTL expires, the record MAY be used as though
unexpired. it is unexpired.
Interpreting values which have the high order bit set as being Interpreting values which have the high order bit set as being
positive, rather than 0, is a change from [RFC2181]. Suggesting a positive, rather than 0, is a change from [RFC2181]. Suggesting a
cap of seven days, rather than the 68 years allowed by [RFC2181], cap of seven days, rather than the 68 years allowed by [RFC2181],
reflects the current practice of major modern DNS resolvers. reflects the current practice of major modern DNS resolvers.
When returning a response containing stale records, the recursive
resolver MUST set the TTL of each expired record in the message to a
value greater than 0, with 30 seconds RECOMMENDED.
Answers from authoritative servers that have a DNS Response Code of
either 0 (NoError) or 3 (NXDomain) and the Authoritative Answers (AA)
bit set MUST be considered to have refreshed the data at the
resolver. Answers from authoritative servers that have any other
response code SHOULD be considered a failure to refresh the data and
therefor leave any previous state intact.
5. Example Method 5. Example Method
There is conceivably more than one way a recursive resolver could There is conceivably more than one way a recursive resolver could
responsibly implement this resiliency feature while still respecting responsibly implement this resiliency feature while still respecting
the intent of the TTL as a signal for when data is to be refreshed. the intent of the TTL as a signal for when data is to be refreshed.
In this example method four notable timers drive considerations for In this example method four notable timers drive considerations for
the use of stale data, as follows: the use of stale data, as follows:
o A client response timer, which is the maximum amount of time a o A client response timer, which is the maximum amount of time a
recursive resolver should allow between the receipt of a recursive resolver should allow between the receipt of a
resolution request and sending its response. resolution request and sending its response.
o A query resolution timer, which caps the total amount of time a o A query resolution timer, which caps the total amount of time a
recursive resolver spends processing the query. recursive resolver spends processing the query.
o A resolution recheck timer, which limits the frequency at which a o A failure recheck timer, which limits the frequency at which a
failed lookup will be attempted. failed lookup will be attempted again.
o A maximum stale timer, which caps the amount of time that records o A maximum stale timer, which caps the amount of time that records
will be kept past their expiration. will be kept past their expiration.
Most recursive resolvers already have the query resolution timer, and Most recursive resolvers already have the query resolution timer, and
effectively some kind of resolution recheck timer. The client effectively some kind of failure recheck timer. The client response
response timer and maximum stale timer are new concepts for this timer and maximum stale timer are new concepts for this mechanism.
mechanism.
When a request is received by the recursive resolver, it SHOULD start When a request is received by the recursive resolver, it SHOULD start
the client response timer. This timer is used to avoid client the client response timer. This timer is used to avoid client
timeouts. It SHOULD be configurable, with a recommended value of 1.8 timeouts. It SHOULD be configurable, with a recommended value of 1.8
seconds as being just under a common timeout value of 2 seconds while seconds as being just under a common timeout value of 2 seconds while
still giving the resolver a fair shot at resolving the name. still giving the resolver a fair shot at resolving the name.
The resolver then checks its cache for any unexpired data that The resolver then checks its cache for any unexpired data that
satisfies the request and of course returns them if available. If it satisfies the request and of course returns them if available. If it
finds no relevant unexpired data and the Recursion Desired flag is finds no relevant unexpired data and the Recursion Desired flag is
not set in the request, it SHOULD immediately return the response not set in the request, it SHOULD immediately return the response
without consulting the cache for expired records. without consulting the cache for expired records. Typically this
response would be a referral to authoritative nameservers covering
the zone, but the specifics are implementation dependent.
If iterative lookups will be, done then the resolution recheck timer If iterative lookups will be done, then the failure recheck timer is
is consulted. Attempts to refresh from the authorities are consulted. Attempts to refresh from non-responsive or otherwise
recommended to be done no more frequently than every 30 seconds. If failing authoritative nameservers are recommended to be done no more
this request was received within this period, the cache may be frequently than every 30 seconds. If this request was received
immediately consulted for stale data to satisfy the request. within this period, the cache may be immediately consulted for stale
data to satisfy the request.
Outside the period of the resolution recheck timer, the resolver Outside the period of the failure recheck timer, the resolver SHOULD
SHOULD start the query resolution timer and begin the iterative start the query resolution timer and begin the iterative resolution
resolution process. This timer bounds the work done by the resolver process. This timer bounds the work done by the resolver when
when contacting external authorities, and is commonly around 10 to 30 contacting external authorities, and is commonly around 10 to 30
seconds. seconds. If this timer expires on an attempted lookup that is still
being processed, the resolution effort is abandoned.
If the answer has not been completely determined by the time the If the answer has not been completely determined by the time the
client response timer has elapsed, the resolver SHOULD then check its client response timer has elapsed, the resolver SHOULD then check its
cache to see whether there is expired data that would satisfy the cache to see whether there is expired data that would satisfy the
request. If so, it adds that data to the response message; it MUST request. If so, it adds that data to the response message with a TTL
set the TTL of each expired record in the message greater than 0, greater than 0 per Section 4. The response is then sent to the
with 30 seconds recommended. The response is then sent to the client client while the resolver continues its attempt to refresh the data.
while the resolver continues its attempt to refresh the data.
When no authorities are able to be reached during a resolution When no authorities are able to be reached during a resolution
attempt, the resolver SHOULD attempt to refresh the delegation. attempt, the resolver SHOULD attempt to refresh the delegation and
restart the iterative lookup process with the remaining time on the
query resolution timer. This resumption should be done only once
during one resolution effort.
Outside the resolution process, the maximum stale timer is used for Outside the resolution process, the maximum stale timer is used for
cache management and is independent of the query resolution process. cache management and is independent of the query resolution process.
This timer is conceptually different from the maximum cache TTL that This timer is conceptually different from the maximum cache TTL that
exists in many resolvers, the latter being a clamp on the value of exists in many resolvers, the latter being a clamp on the value of
TTLs as received from authoritative servers and recommended to be 7 TTLs as received from authoritative servers and recommended to be 7
days in the TTL definition above. The maximum stale timer SHOULD be days in the TTL definition above. The maximum stale timer SHOULD be
configurable, and defines the length of time after a record expires configurable, and defines the length of time after a record expires
that it SHOULD be retained in the cache. The suggested value is 7 that it SHOULD be retained in the cache. The suggested value is 7
days, which gives time for monitoring to notice the resolution days, which gives time for monitoring to notice the resolution
problem and for human intervention to fix it. problem and for human intervention to fix it.
6. Implementation Caveats 6. Implementation Considerations
Answers from authoritative servers that have a DNS Response Code of
either 0 (NOERROR) or 3 (NXDOMAIN) MUST be considered to have
refreshed the data at the resolver. In particular, this means that
this method is not meant to protect against operator error at the
authoritative server that turns a name that is intended to be valid
into one that is non-existent, because there is no way for a resolver
to know intent.
Stale data is used only when refreshing has failed, in order to
adhere to the original intent of the design of the DNS and the
behaviour expected by operators. If stale data were to always be
used immediately and then a cache refresh attempted after the client
response has been sent, the resolver would frequently be sending data
that it would have had no trouble refreshing. As modern resolvers
use techniques like pre-fetching and request coalescing for
efficiency, it is not necessary that every client request needs to
trigger a new lookup flow in the presence of stale data, but rather
that a good-faith effort has been recently made to refresh the stale
data before it is delivered to any client. The recommended period
between attempting refreshes is 30 seconds.
It is important to continue the resolution attempt after the stale
response has been sent, until the query resolution timeout, because
some pathological resolutions can take many seconds to succeed as
they cope with unavailable servers, bad networks, and other problems.
Stopping the resolution attempt when the response with expired data
has been sent would mean that answers in these pathological cases
would never be refreshed.
Canonical Name (CNAME) records mingled in the expired cache with
other records at the same owner name can cause surprising results.
This was observed with an initial implementation in BIND when a
hostname changed from having an IPv4 Address (A) record to a CNAME.
The version of BIND being used did not evict other types in the cache
when a CNAME was received, which in normal operations is not a
significant issue. However, after both records expired and the
authorities became unavailable, the fallback to stale answers
returned the older A instead of the newer CNAME.
6.1. Implementation Considerations
This document mainly describes the issues behind serving stale data This document mainly describes the issues behind serving stale data
and intentionally does not provide a formal algorithm. The concept and intentionally does not provide a formal algorithm. The concept
is not overly complex, and the details are best left to resolver is not overly complex, and the details are best left to resolver
authors to implement in their codebases. The processing of serve- authors to implement in their codebases. The processing of serve-
stale is a local operation, and consistent variables between stale is a local operation, and consistent variables between
deployments are not needed for interoperability. However, we would deployments are not needed for interoperability. However, we would
like to highlight the impact of various variables. like to highlight the impact of various implementation choices,
starting with the timers involved.
The most obvious of these is the maximum stale timer. If this The most obvious of these is the maximum stale timer. If this
variable is too large it could cause excessive cache memory usage, variable is too large it could cause excessive cache memory usage,
but if it is too small, the serve-stale technique becomes less but if it is too small, the serve-stale technique becomes less
effective, as the record may not be in the cache to be used if effective, as the record may not be in the cache to be used if
needed. Memory consumption could be mitigated by prioritizing needed. Increased memory consumption could be mitigated by
removal of stale records over non-expired records during cache prioritizing removal of stale records over non-expired records during
exhaustion. Implementations may also wish to consider whether to cache exhaustion. Implementations may also wish to consider whether
track the names in requests for their last time of use or their to track the names in requests for their last time of use or their
popularity, using that as an additional factor when considering cache popularity, using that as an additional factor when considering cache
eviction. A feature to manually flush only stale records could also eviction. A feature to manually flush only stale records could also
be useful. be useful.
The client response timer is another variable which deserves The client response timer is another variable which deserves
consideration. If this value is too short, there exists the risk consideration. If this value is too short, there exists the risk
that stale answers may be used even when the authoritative server is that stale answers may be used even when the authoritative server is
actually reachable but slow; this may result in sub-optimal answers actually reachable but slow; this may result in sub-optimal answers
being returned. Conversely, waiting too long will negatively impact being returned. Conversely, waiting too long will negatively impact
user experience. user experience.
The balance for the resolution recheck timer is responsiveness in The balance for the failure recheck timer is responsiveness in
detecting the renewed availability of authorities versus the extra detecting the renewed availability of authorities versus the extra
resource use of resolution. If this variable is set too large, stale resource use for resolution. If this variable is set too large,
answers may continue to be returned even after the authoritative stale answers may continue to be returned even after the
server is reachable. If this variable is too small, authoritative authoritative server is reachable; per [RFC2308], Section 7, this
servers may be rapidly hit with a significant amount of traffic when should be no more than five minutes. If this variable is too small,
they become reachable again. authoritative servers may be rapidly hit with a significant amount of
traffic when they become reachable again.
Regarding the TTL to set on stale records in the response, Regarding the TTL to set on stale records in the response,
historically TTLs of zero seconds have been problematic for some historically TTLs of zero seconds have been problematic for some
implementations, and negative values can't effectively be implementations, and negative values can't effectively be
communicated to existing software. Other very short TTLs could lead communicated to existing software. Other very short TTLs could lead
to congestive collapse as TTL-respecting clients rapidly try to to congestive collapse as TTL-respecting clients rapidly try to
refresh. The recommended 30 seconds not only sidesteps those refresh. The recommended value of 30 seconds not only sidesteps
potential problems with no practical negative consequences, it also those potential problems with no practical negative consequences, it
rate limits further queries from any client that honors the TTL, such also rate limits further queries from any client that honors the TTL,
as a forwarding resolver. such as a forwarding resolver.
Apart from timers, one more implementation consideration is the use Another implementation consideration is the use of stale nameserver
of stale nameserver addresses for lookups. This is mentioned addresses for lookups. This is mentioned explicitly because, in some
explicitly because, in some resolvers, getting the addresses for resolvers, getting the addresses for nameservers is a separate path
nameservers is a separate path from a normal cache lookup. If from a normal cache lookup. If authoritative server addresses are
authoritative server addresses are not able to be refreshed, not able to be refreshed, resolution can possibly still be successful
resolution can possibly still be successful if the authoritative if the authoritative servers themselves are up. For instance,
servers themselves are up. For instance, consider an attack on a consider an attack on a top-level domain that takes its nameservers
toplevel domain that takes its nameservers offline; serve-stale offline; serve-stale resolvers that had expired glue addresses for
resolvers that had expired glue addresses for subdomains within that subdomains within that TLD would still be able to resolve names
TLD would still be able to resolve names within those subdomains, within those subdomains, even those it had not previously looked up.
even those it had not previously looked up.
7. Implementation Status The directive in Section 4 that only NoError and NXDomain responses
should invalidate any previously associated answer stems from the
fact that no other RCODEs which a resolver normally encounters makes
any assertions regarding the name in the question or any data
associated with it. This comports with existing resolver behavior
where a failed lookup (say, during pre-fetching) doesn't impact the
existing cache state. Some authoritative servers operators have said
that they would prefer stale answers to be used in the event that
their servers are responding with errors like ServFail instead of
giving true authoritative answers. Implementers MAY decide to return
stale answers in this situation.
Since the goal of serve-stale is to provide resiliency for all
obvious errors to refresh data, these other RCODEs are treated as
though they are equivalent to not getting an authoritative response.
Although NXDomain for a previously existing name might well be an
error, it is not handled that way because there is no effective way
to distinguish operator intent for legitimate cases versus error
cases.
During discussion in dnsop it was suggested that Refused from all
authorities should be treated, from a serve-stale perspective, as
though it were equivalent to NXDomain because it represents an
explicit signal to take down the zone from servers that still have
the zone's delegation pointed to them. Refused, however, is also
overloaded to mean multiple possible failures which could represent
transient configuration failures. Operational experience has shown
that purposely returning Refused is a poor way to achieve an explicit
takedown of a zone compared to either updating the delegation or
returning NXDomain with a suitable SOA for extended negative caching.
Implementers MAY nonetheless consider whether to treat all
authorities returning Refused as preempting the use of stale data.
7. Implementation Caveats
Stale data is used only when refreshing has failed in order to adhere
to the original intent of the design of the DNS and the behaviour
expected by operators. If stale data were to always be used
immediately and then a cache refresh attempted after the client
response has been sent, the resolver would frequently be sending data
that it would have had no trouble refreshing. As modern resolvers
use techniques like pre-fetching and request coalescing for
efficiency, it is not necessary that every client request needs to
trigger a new lookup flow in the presence of stale data, but rather
that a good-faith effort has been recently made to refresh the stale
data before it is delivered to any client.
It is important to continue the resolution attempt after the stale
response has been sent, until the query resolution timeout, because
some pathological resolutions can take many seconds to succeed as
they cope with unavailable servers, bad networks, and other problems.
Stopping the resolution attempt when the response with expired data
has been sent would mean that answers in these pathological cases
would never be refreshed.
The continuing prohibition against using data with a 0 second TTL
beyond the current transaction explicitly extends to it being
unusable even for stale fallback, as it is not to be cached at all.
Be aware that Canonical Name (CNAME) records mingled in the expired
cache with other records at the same owner name can cause surprising
results. This was observed with an initial implementation in BIND
when a hostname changed from having an IPv4 Address (A) record to a
CNAME. The version of BIND being used did not evict other types in
the cache when a CNAME was received, which in normal operations is
not a significant issue. However, after both records expired and the
authorities became unavailable, the fallback to stale answers
returned the older A instead of the newer CNAME.
8. Implementation Status
[RFC Editor: per RFC 6982 this section should be removed prior to [RFC Editor: per RFC 6982 this section should be removed prior to
publication.] publication.]
The algorithm described in the Section 5 section was originally The algorithm described in Section 5 was originally implemented as a
implemented as a patch to BIND 9.7.0. It has been in production on patch to BIND 9.7.0. It has been in production on Akamai's
Akamai's production network since 2011, and effectively smoothed over production network since 2011, and effectively smoothed over
transient failures and longer outages that would have resulted in transient failures and longer outages that would have resulted in
major incidents. The patch was contributed to Internet Systems major incidents. The patch was contributed to Internet Systems
Consortium and the functionality is now available in BIND 9.12 via Consortium and the functionality is now available in BIND 9.12 via
the options stale-answer-enable, stale-answer-ttl, and max-stale-ttl. the options stale-answer-enable, stale-answer-ttl, and max-stale-ttl.
Unbound has a similar feature for serving stale answers, but will Unbound has a similar feature for serving stale answers, but will
respond with stale data immediately if it has recently tried and respond with stale data immediately if it has recently tried and
failed to refresh the answer by pre-fetching. failed to refresh the answer by pre-fetching.
Knot Resolver has a demo module here: https://knot- Knot Resolver has a demo module here: https://knot-
skipping to change at page 8, line 47 skipping to change at page 9, line 40
Details of Apple's implementation are not currently known. Details of Apple's implementation are not currently known.
In the research paper "When the Dike Breaks: Dissecting DNS Defenses In the research paper "When the Dike Breaks: Dissecting DNS Defenses
During DDoS" [DikeBreaks], the authors detected some use of stale During DDoS" [DikeBreaks], the authors detected some use of stale
answers by resolvers when authorities came under attack. Their answers by resolvers when authorities came under attack. Their
research results suggest that more widespread adoption of the research results suggest that more widespread adoption of the
technique would significantly improve resiliency for the large number technique would significantly improve resiliency for the large number
of requests that fail or experience abnormally long resolution times of requests that fail or experience abnormally long resolution times
during an attack. during an attack.
8. EDNS Option 9. EDNS Option
During the discussion of serve-stale in the IETF dnsop working group, During the discussion of serve-stale in the IETF dnsop working group,
it was suggested that an EDNS option should be available to either it was suggested that an EDNS option should be available to either
explicitly opt-in to getting data that is possibly stale, or at least explicitly opt-in to getting data that is possibly stale, or at least
as a debugging tool to indicate when stale data has been used for a as a debugging tool to indicate when stale data has been used for a
response. response.
The opt-in use case was rejected as the technique was meant to be The opt-in use case was rejected as the technique was meant to be
immediately useful in improving DNS resiliency for all clients. immediately useful in improving DNS resiliency for all clients.
The reporting case was ultimately also rejected as working group The reporting case was ultimately also rejected as working group
participants determined that even the simpler version of a proposed participants determined that even the simpler version of a proposed
option was still too much bother to implement for too little option was still too much bother to implement for too little
perceived value. perceived value.
9. Security Considerations 10. Security Considerations
The most obvious security issue is the increased likelihood of DNSSEC The most obvious security issue is the increased likelihood of DNSSEC
validation failures when using stale data because signatures could be validation failures when using stale data because signatures could be
returned outside their validity period. This would only be an issue returned outside their validity period. This would only be an issue
if the authoritative servers are unreachable, the only time the if the authoritative servers are unreachable, the only time the
techniques in this document are used, and thus does not introduce a techniques in this document are used, and thus does not introduce a
new failure in place of what would have otherwise been success. new failure in place of what would have otherwise been success.
Additionally, bad actors have been known to use DNS caches to keep Additionally, bad actors have been known to use DNS caches to keep
records alive even after their authorities have gone away. This records alive even after their authorities have gone away. This
skipping to change at page 9, line 40 skipping to change at page 10, line 35
hostnames pointing to addresses that are no longer in use by the hostnames pointing to addresses that are no longer in use by the
owner of the name, can be used to co-opt security such as to get owner of the name, can be used to co-opt security such as to get
domain-validated certificates fraudulently issued to an attacker. domain-validated certificates fraudulently issued to an attacker.
While this RFC does not create a new vulnerability in this area, it While this RFC does not create a new vulnerability in this area, it
does potentially enlarge the window in which such an attack could be does potentially enlarge the window in which such an attack could be
made. An obvious mitigation is that not only should a certificate made. An obvious mitigation is that not only should a certificate
authority not use a resolver that has this feature enabled, it should authority not use a resolver that has this feature enabled, it should
probably not use a caching resolver at all and instead fully look up probably not use a caching resolver at all and instead fully look up
each name freshly from the root. each name freshly from the root.
10. Privacy Considerations 11. Privacy Considerations
This document does not add any practical new privacy issues. This document does not add any practical new privacy issues.
11. NAT Considerations 12. NAT Considerations
The method described here is not affected by the use of NAT devices. The method described here is not affected by the use of NAT devices.
12. IANA Considerations 13. IANA Considerations
There are no IANA considerations. There are no IANA considerations.
13. Acknowledgements 14. Acknowledgements
The authors wish to thank Robert Edmonds, Tony Finch, Bob Harold, The authors wish to thank Robert Edmonds, Tony Finch, Bob Harold,
Matti Klock, Jason Moreau, Giovane Moura, Jean Roy, Mukund Sivaraman, Tatuya Jinmei, Matti Klock, Jason Moreau, Giovane Moura, Jean Roy,
Davey Song, Paul Vixie, Ralf Weber and Paul Wouters for their review Mukund Sivaraman, Davey Song, Paul Vixie, Ralf Weber and Paul Wouters
and feedback. for their review and feedback.
14. References 15. References
14.1. Normative References 15.1. Normative References
[RFC1034] Mockapetris, P., "Domain names - concepts and facilities", [RFC1034] Mockapetris, P., "Domain names - concepts and facilities",
STD 13, RFC 1034, DOI 10.17487/RFC1034, November 1987, STD 13, RFC 1034, DOI 10.17487/RFC1034, November 1987,
<https://www.rfc-editor.org/info/rfc1034>. <https://www.rfc-editor.org/info/rfc1034>.
[RFC1035] Mockapetris, P., "Domain names - implementation and [RFC1035] Mockapetris, P., "Domain names - implementation and
specification", STD 13, RFC 1035, DOI 10.17487/RFC1035, specification", STD 13, RFC 1035, DOI 10.17487/RFC1035,
November 1987, <https://www.rfc-editor.org/info/rfc1035>. November 1987, <https://www.rfc-editor.org/info/rfc1035>.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997, DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>. <https://www.rfc-editor.org/info/rfc2119>.
[RFC2181] Elz, R. and R. Bush, "Clarifications to the DNS [RFC2181] Elz, R. and R. Bush, "Clarifications to the DNS
Specification", RFC 2181, DOI 10.17487/RFC2181, July 1997, Specification", RFC 2181, DOI 10.17487/RFC2181, July 1997,
<https://www.rfc-editor.org/info/rfc2181>. <https://www.rfc-editor.org/info/rfc2181>.
[RFC2308] Andrews, M., "Negative Caching of DNS Queries (DNS
NCACHE)", RFC 2308, DOI 10.17487/RFC2308, March 1998,
<https://www.rfc-editor.org/info/rfc2308>.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
May 2017, <https://www.rfc-editor.org/info/rfc8174>. May 2017, <https://www.rfc-editor.org/info/rfc8174>.
14.2. Informative References 15.2. Informative References
[CloudStrife] [CloudStrife]
Borgolte, K., Fiebig, T., Hao, S., Kruegel, C., and G. Borgolte, K., Fiebig, T., Hao, S., Kruegel, C., and G.
Vigna, "Cloud Strife: Mitigating the Security Risks of Vigna, "Cloud Strife: Mitigating the Security Risks of
Domain-Validated Certificates", ACM 2018 Applied Domain-Validated Certificates", ACM 2018 Applied
Networking Research Workshop, DOI 10.1145/3232755.3232859, Networking Research Workshop, DOI 10.1145/3232755.3232859,
July 2018, <https://www.ndss-symposium.org/wp- July 2018, <https://www.ndss-symposium.org/wp-
content/uploads/2018/02/ content/uploads/2018/02/
ndss2018_06A-4_Borgolte_paper.pdf>. ndss2018_06A-4_Borgolte_paper.pdf>.
 End of changes. 36 change blocks. 
138 lines changed or deleted 188 lines changed or added

This html diff was produced by rfcdiff 1.47. The latest version is available from http://tools.ietf.org/tools/rfcdiff/