draft-ietf-dnsop-serve-stale-07.txt   draft-ietf-dnsop-serve-stale-08.txt 
DNSOP Working Group D. Lawrence DNSOP Working Group D. Lawrence
Internet-Draft Oracle Internet-Draft Oracle
Updates: 1034, 1035, 2181 (if approved) W. Kumari Updates: 1034, 1035, 2181 (if approved) W. Kumari
Intended status: Standards Track P. Sood Intended status: Standards Track P. Sood
Expires: March 2, 2020 Google Expires: March 21, 2020 Google
August 30, 2019 September 18, 2019
Serving Stale Data to Improve DNS Resiliency Serving Stale Data to Improve DNS Resiliency
draft-ietf-dnsop-serve-stale-07 draft-ietf-dnsop-serve-stale-08
Abstract Abstract
This draft defines a method (serve-stale) for recursive resolvers to This draft defines a method (serve-stale) for recursive resolvers to
use stale DNS data to avoid outages when authoritative nameservers use stale DNS data to avoid outages when authoritative nameservers
cannot be reached to refresh expired data. One of the motivations cannot be reached to refresh expired data. One of the motivations
for serve-stale is to make the DNS more resilient to DoS attacks, and for serve-stale is to make the DNS more resilient to DoS attacks, and
thereby make them less attractive as an attack vector. This document thereby make them less attractive as an attack vector. This document
updates the definitions of TTL from RFC 1034 and RFC 1035 so that updates the definitions of TTL from RFC 1034 and RFC 1035 so that
data can be kept in the cache beyond the TTL expiry, and also updates data can be kept in the cache beyond the TTL expiry, updates RFC 2181
RFC 2181 by interpreting values with the high order bit set as being by interpreting values with the high order bit set as being positive,
positive, rather than 0, and also suggests a cap of 7 days. rather than 0, and suggests a cap of 7 days.
Status of This Memo Status of This Memo
This Internet-Draft is submitted in full conformance with the This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on March 2, 2020. This Internet-Draft will expire on March 21, 2020.
Copyright Notice Copyright Notice
Copyright (c) 2019 IETF Trust and the persons identified as the Copyright (c) 2019 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License. described in the Simplified BSD License.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3
3. Background . . . . . . . . . . . . . . . . . . . . . . . . . 3 3. Background . . . . . . . . . . . . . . . . . . . . . . . . . 3
4. Standards Action . . . . . . . . . . . . . . . . . . . . . . 4 4. Standards Action . . . . . . . . . . . . . . . . . . . . . . 4
5. Example Method . . . . . . . . . . . . . . . . . . . . . . . 4 5. Example Method . . . . . . . . . . . . . . . . . . . . . . . 4
6. Implementation Considerations . . . . . . . . . . . . . . . . 6 6. Implementation Considerations . . . . . . . . . . . . . . . . 6
7. Implementation Caveats . . . . . . . . . . . . . . . . . . . 8 7. Implementation Caveats . . . . . . . . . . . . . . . . . . . 8
8. Implementation Status . . . . . . . . . . . . . . . . . . . . 9 8. Implementation Status . . . . . . . . . . . . . . . . . . . . 9
9. EDNS Option . . . . . . . . . . . . . . . . . . . . . . . . . 9 9. EDNS Option . . . . . . . . . . . . . . . . . . . . . . . . . 10
10. Security Considerations . . . . . . . . . . . . . . . . . . . 10 10. Security Considerations . . . . . . . . . . . . . . . . . . . 10
11. Privacy Considerations . . . . . . . . . . . . . . . . . . . 10 11. Privacy Considerations . . . . . . . . . . . . . . . . . . . 11
12. NAT Considerations . . . . . . . . . . . . . . . . . . . . . 10 12. NAT Considerations . . . . . . . . . . . . . . . . . . . . . 11
13. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 10 13. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 11
14. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 10 14. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 11
15. References . . . . . . . . . . . . . . . . . . . . . . . . . 11 15. References . . . . . . . . . . . . . . . . . . . . . . . . . 11
15.1. Normative References . . . . . . . . . . . . . . . . . . 11 15.1. Normative References . . . . . . . . . . . . . . . . . . 11
15.2. Informative References . . . . . . . . . . . . . . . . . 11 15.2. Informative References . . . . . . . . . . . . . . . . . 12
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 12 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 12
1. Introduction 1. Introduction
Traditionally the Time To Live (TTL) of a DNS resource record has Traditionally the Time To Live (TTL) of a DNS resource record has
been understood to represent the maximum number of seconds that a been understood to represent the maximum number of seconds that a
record can be used before it must be discarded, based on its record can be used before it must be discarded, based on its
description and usage in [RFC1035] and clarifications in [RFC2181]. description and usage in [RFC1035] and clarifications in [RFC2181].
This document proposes that the definition of the TTL be explicitly This document expands the definition of the TTL to explicitly allow
expanded to allow for expired data to be used in the exceptional for expired data to be used in the exceptional circumstance that a
circumstance that a recursive resolver is unable to refresh the recursive resolver is unable to refresh the information. It is
information. It is predicated on the observation that authoritative predicated on the observation that authoritative answer
answer unavailability can cause outages even when the underlying data unavailability can cause outages even when the underlying data those
those servers would return is typically unchanged. servers would return is typically unchanged.
We describe a method below for this use of stale data, balancing the We describe a method below for this use of stale data, balancing the
competing needs of resiliency and freshness. competing needs of resiliency and freshness.
This document updates the definitions of TTL from [RFC1034] and This document updates the definitions of TTL from [RFC1034] and
[RFC1035] so that data can be kept in the cache beyond the TTL [RFC1035] so that data can be kept in the cache beyond the TTL
expiry, and also updates [RFC2181] by interpreting values with the expiry, and also updates [RFC2181] by interpreting values with the
high order bit set as being positive, rather than 0, and also high order bit set as being positive, rather than 0, and also
suggests a cap of 7 days. suggests a cap of 7 days.
skipping to change at page 3, line 44 skipping to change at page 3, line 44
clear enough that records past their TTL expiration must not be used. clear enough that records past their TTL expiration must not be used.
However, [RFC1035] predates the more rigorous terminology of However, [RFC1035] predates the more rigorous terminology of
[RFC2119] which softened the interpretation of "may" and "should". [RFC2119] which softened the interpretation of "may" and "should".
[RFC2181] aimed to provide "the precise definition of the Time to [RFC2181] aimed to provide "the precise definition of the Time to
Live", but in Section 8 was mostly concerned with the numeric range Live", but in Section 8 was mostly concerned with the numeric range
of values and the possibility that very large values should be of values and the possibility that very large values should be
capped. (It also has the curious suggestion that a value in the capped. (It also has the curious suggestion that a value in the
range 2147483648 to 4294967295 should be treated as zero.) It closes range 2147483648 to 4294967295 should be treated as zero.) It closes
that section by noting, "The TTL specifies a maximum time to live, that section by noting, "The TTL specifies a maximum time to live,
not a mandatory time to live." This is again not [RFC2119]-normative not a mandatory time to live." This wording again does not contain
language, but does convey the natural language connotation that data BCP 14 [RFC2119] key words, but does convey the natural language
becomes unusable past TTL expiry. connotation that data becomes unusable past TTL expiry.
Several recursive resolver operators currently use stale data for Several recursive resolver operators, including Akamai, currently use
answers in some way, including Akamai. A number of recursive stale data for answers in some way. A number of recursive resolver
resolver packages (including BIND, Know, OpenDNS, Unbound) provide packages (including BIND, Knot, OpenDNS, Unbound) provide options to
options to use stale data. Apple macOS can also use stale data as use stale data. Apple macOS can also use stale data as part of the
part of the Happy Eyeballs algorithms in mDNSResponder. The Happy Eyeballs algorithms in mDNSResponder. The collective
collective operational experience is that it provides significant operational experience is that using stale data can provide
benefit with minimal downside. significant benefit with minimal downside.
4. Standards Action 4. Standards Action
The definition of TTL in [RFC1035] Sections 3.2.1 and 4.1.3 is The definition of TTL in [RFC1035] Sections 3.2.1 and 4.1.3 is
amended to read: amended to read:
TTL a 32-bit unsigned integer number of seconds that specifies the TTL a 32-bit unsigned integer number of seconds that specifies the
duration that the resource record MAY be cached before the source duration that the resource record MAY be cached before the source
of the information MUST again be consulted. Zero values are of the information MUST again be consulted. Zero values are
interpreted to mean that the RR can only be used for the interpreted to mean that the RR can only be used for the
transaction in progress, and should not be cached. Values SHOULD transaction in progress, and should not be cached. Values SHOULD
be capped on the order of days to weeks, with a recommended cap be capped on the order of days to weeks, with a recommended cap
of 604,800 seconds (seven days). If the data is unable to be of 604,800 seconds (seven days). If the data is unable to be
authoritatively refreshed when the TTL expires, the record MAY be authoritatively refreshed when the TTL expires, the record MAY be
used as though it is unexpired. used as though it is unexpired. See Section 5 and Section 6
for details.
Interpreting values which have the high order bit set as being Interpreting values which have the high order bit set as being
positive, rather than 0, is a change from [RFC2181]. Suggesting a positive, rather than 0, is a change from [RFC2181]. Suggesting a
cap of seven days, rather than the 68 years allowed by [RFC2181], cap of seven days, rather than the 68 years allowed by [RFC2181],
reflects the current practice of major modern DNS resolvers. reflects the current practice of major modern DNS resolvers.
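The amended TTL handling can be illustrated with a minimal sketch, assuming a resolver that reads the 4-octet TTL field directly from the wire. The names ttl_from_wire, capped_ttl and TTL_CAP_SECONDS are hypothetical and do not come from the draft; only the unsigned interpretation and the 604,800 second cap do.

```python
import struct

# Recommended cap from the amended definition: 604,800 seconds (seven days).
TTL_CAP_SECONDS = 604_800


def ttl_from_wire(raw: bytes) -> int:
    """Decode the 4-octet TTL field as an unsigned 32-bit integer.

    Values with the high order bit set stay positive instead of being
    treated as zero.
    """
    if len(raw) != 4:
        raise ValueError("TTL field must be exactly 4 octets")
    return struct.unpack("!I", raw)[0]


def capped_ttl(ttl: int, cap: int = TTL_CAP_SECONDS) -> int:
    """Apply the configured cap to an unsigned 32-bit TTL value."""
    if not 0 <= ttl <= 0xFFFFFFFF:
        raise ValueError("TTL must fit in 32 bits")
    return min(ttl, cap)
```

With these helpers, a TTL of 2147483648 (high order bit set) is read as a large positive value and then reduced to the cap, while a zero TTL remains zero and the record is not cached.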
When returning a response containing stale records, a recursive When returning a response containing stale records, a recursive
resolver MUST set the TTL of each expired record in the message to a resolver MUST set the TTL of each expired record in the message to a
value greater than 0, with 30 seconds RECOMMENDED. value greater than 0, with a RECOMMENDED value of 30 seconds. See
Section 6 for explanation.
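As a rough sketch of the stale TTL requirement, assuming the cache stores an absolute expiry time in whole seconds, the TTL placed in a response could be chosen as follows. The function name response_ttl and its parameters are illustrative assumptions, not part of the draft.

```python
# RECOMMENDED TTL for expired records returned in a response.
STALE_ANSWER_TTL = 30


def response_ttl(expires_at: int, now: int,
                 stale_ttl: int = STALE_ANSWER_TTL) -> int:
    """Return the TTL to place on a record in an outgoing response.

    Unexpired records carry their remaining lifetime. Expired (stale)
    records carry a fixed value that is always greater than 0.
    """
    if stale_ttl <= 0:
        raise ValueError("stale TTL must be greater than 0")
    remaining = expires_at - now
    if remaining > 0:
        return remaining
    return stale_ttl
```

A record that expired ten seconds ago is therefore returned with a TTL of 30 rather than 0 or a negative value.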
Answers from authoritative servers that have a DNS Response Code of Answers from authoritative servers that have a DNS Response Code of
either 0 (NoError) or 3 (NXDomain) and the Authoritative Answer (AA) either 0 (NoError) or 3 (NXDomain) and the Authoritative Answer (AA)
bit set MUST be considered to have refreshed the data at the bit set MUST be considered to have refreshed the data at the
resolver. Answers from authoritative servers that have any other resolver. Answers from authoritative servers that have any other
response code SHOULD be considered a failure to refresh the data and response code SHOULD be considered a failure to refresh the data and
therefore leave any previous state intact. therefore leave any previous state intact. See Section 6 for a
discussion.
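The refresh rule above can be sketched as a single predicate. This is a minimal illustration, not a definitive implementation: the name refreshes_cache is hypothetical, and treating a NoError or NXDomain response without the AA bit as "not a refresh" is an assumption of this sketch, since the text only states the case where AA is set.

```python
# DNS Response Codes named in the text.
RCODE_NOERROR = 0
RCODE_NXDOMAIN = 3


def refreshes_cache(rcode: int, aa: bool) -> bool:
    """Decide whether an answer counts as having refreshed cached data.

    True only for NoError or NXDomain with the AA bit set. Any other
    response code is treated as a failure to refresh, so the caller
    should leave the previous cache state intact.
    Assumption: responses without the AA bit are not treated as a refresh.
    """
    return aa and rcode in (RCODE_NOERROR, RCODE_NXDOMAIN)
```

For example, a ServFail (RCODE 2) response returns False, so an existing stale record would remain available.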
5. Example Method 5. Example Method
There is more than one way a recursive resolver could responsibly There is more than one way a recursive resolver could responsibly
implement this resiliency feature while still respecting the intent implement this resiliency feature while still respecting the intent
of the TTL as a signal for when data is to be refreshed. of the TTL as a signal for when data is to be refreshed.
In this example method four notable timers drive considerations for In this example method four notable timers drive considerations for
the use of stale data: the use of stale data:
skipping to change at page 5, line 30 skipping to change at page 5, line 34
timeouts. It should be configurable, with a recommended value of 1.8 timeouts. It should be configurable, with a recommended value of 1.8
seconds as being just under a common timeout value of 2 seconds while seconds as being just under a common timeout value of 2 seconds while
still giving the resolver a fair shot at resolving the name. still giving the resolver a fair shot at resolving the name.
The resolver then checks its cache for any unexpired records that The resolver then checks its cache for any unexpired records that
satisfy the request and returns them if available. If it finds no satisfy the request and returns them if available. If it finds no
relevant unexpired data and the Recursion Desired flag is not set in relevant unexpired data and the Recursion Desired flag is not set in
the request, it should immediately return the response without the request, it should immediately return the response without
consulting the cache for expired records. Typically this response consulting the cache for expired records. Typically this response
would be a referral to authoritative nameservers covering the zone, would be a referral to authoritative nameservers covering the zone,
but the specifics are implementation dependent. but the specifics are implementation-dependent.
If iterative lookups will be done, then the failure recheck timer is If iterative lookups will be done, then the failure recheck timer is
consulted. Attempts to refresh from non-responsive or otherwise consulted. Attempts to refresh from non-responsive or otherwise
failing authoritative nameservers are recommended to be done no more failing authoritative nameservers are recommended to be done no more
frequently than every 30 seconds. If this request was received frequently than every 30 seconds. If this request was received
within this period, the cache may be immediately consulted for stale within this period, the cache may be immediately consulted for stale
data to satisfy the request. data to satisfy the request.
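One possible way to track the failure recheck timer is sketched below, assuming the resolver records the time of the last failed refresh per cache key. The class FailureRecheck, its method names, and the choice of key are hypothetical; only the 30 second interval comes from the text.

```python
# Recommended minimum interval between refresh attempts to failing servers.
FAILURE_RECHECK_SECONDS = 30


class FailureRecheck:
    """Remember recent refresh failures so retries are rate limited."""

    def __init__(self, interval: float = FAILURE_RECHECK_SECONDS) -> None:
        self._interval = interval
        self._last_failure = {}  # cache key -> time of last failed refresh

    def record_failure(self, key, now: float) -> None:
        """Note that a refresh attempt for this key failed at time `now`."""
        self._last_failure[key] = now

    def record_success(self, key) -> None:
        """Clear failure state once the data has been refreshed."""
        self._last_failure.pop(key, None)

    def may_use_stale_immediately(self, key, now: float) -> bool:
        """True if a refresh failed within the recheck period.

        In that case the cache may be consulted for stale data right
        away instead of starting another resolution attempt.
        """
        last = self._last_failure.get(key)
        return last is not None and (now - last) < self._interval
```

A request arriving 10 seconds after a failed refresh would skip the iterative lookup, while one arriving 30 seconds or more afterward would trigger a new attempt.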
Outside the period of the failure recheck timer, the resolver should Outside the period of the failure recheck timer, the resolver should
start the query resolution timer and begin the iterative resolution start the query resolution timer and begin the iterative resolution
skipping to change at page 7, line 41 skipping to change at page 7, line 46
from a normal cache lookup. If authoritative server addresses are from a normal cache lookup. If authoritative server addresses are
not able to be refreshed, resolution can possibly still be successful not able to be refreshed, resolution can possibly still be successful
if the authoritative servers themselves are up. For instance, if the authoritative servers themselves are up. For instance,
consider an attack on a top-level domain that takes its nameservers consider an attack on a top-level domain that takes its nameservers
offline; serve-stale resolvers that had expired glue addresses for offline; serve-stale resolvers that had expired glue addresses for
subdomains within that TLD would still be able to resolve names subdomains within that TLD would still be able to resolve names
within those subdomains, even those it had not previously looked up. within those subdomains, even those it had not previously looked up.
The directive in Section 4 that only NoError and NXDomain responses The directive in Section 4 that only NoError and NXDomain responses
should invalidate any previously associated answer stems from the should invalidate any previously associated answer stems from the
fact that no other RCODEs which a resolver normally encounters makes fact that no other RCODEs that a resolver normally encounters make
any assertions regarding the name in the question or any data any assertions regarding the name in the question or any data
associated with it. This comports with existing resolver behavior associated with it. This comports with existing resolver behavior
where a failed lookup (say, during pre-fetching) doesn't impact the where a failed lookup (say, during pre-fetching) doesn't impact the
existing cache state. Some authoritative servers operators have said existing cache state. Some authoritative server operators have said
that they would prefer stale answers to be used in the event that that they would prefer stale answers to be used in the event that
their servers are responding with errors like ServFail instead of their servers are responding with errors like ServFail instead of
giving true authoritative answers. Implementers MAY decide to return giving true authoritative answers. Implementers MAY decide to return
stale answers in this situation. stale answers in this situation.
Since the goal of serve-stale is to provide resiliency for all Since the goal of serve-stale is to provide resiliency for all
obvious errors to refresh data, these other RCODEs are treated as obvious errors to refresh data, these other RCODEs are treated as
though they are equivalent to not getting an authoritative response. though they are equivalent to not getting an authoritative response.
Although NXDomain for a previously existing name might well be an Although NXDomain for a previously existing name might well be an
error, it is not handled that way because there is no effective way error, it is not handled that way because there is no effective way
skipping to change at page 10, line 13 skipping to change at page 10, line 23
immediately useful in improving DNS resiliency for all clients. immediately useful in improving DNS resiliency for all clients.
The reporting case was ultimately also rejected because even the The reporting case was ultimately also rejected because even the
simpler version of a proposed option was still too much bother to simpler version of a proposed option was still too much bother to
implement for too little perceived value. implement for too little perceived value.
10. Security Considerations 10. Security Considerations
The most obvious security issue is the increased likelihood of DNSSEC The most obvious security issue is the increased likelihood of DNSSEC
validation failures when using stale data because signatures could be validation failures when using stale data because signatures could be
returned outside their validity period. This would only be an issue returned outside their validity period. Stale negative records can
if the authoritative servers are unreachable, the only time the increase the time window where newly published TLSA or DS RRs may not
techniques in this document are used, and thus does not introduce a be used due to cached NSEC or NSEC3 records. These scenarios would
new failure in place of what would have otherwise been success. only be an issue if the authoritative servers are unreachable, the
only time the techniques in this document are used, and thus do not
introduce a new failure in place of what would have otherwise been
success.
Additionally, bad actors have been known to use DNS caches to keep Additionally, bad actors have been known to use DNS caches to keep
records alive even after their authorities have gone away. This records alive even after their authorities have gone away. The serve-
potentially makes that easier, although without introducing a new stale feature potentially makes the attack easier, although without
risk. introducing a new risk. In addition, attackers could combine this
with a DDoS attack on authoritative servers with the explicit intent
of having stale information cached for longer. But if attackers have
this capacity, they probably could do much worse than prolonging the
life of old data.
In [CloudStrife], it was demonstrated how stale DNS data, namely In [CloudStrife], it was demonstrated how stale DNS data, namely
hostnames pointing to addresses that are no longer in use by the hostnames pointing to addresses that are no longer in use by the
owner of the name, can be used to co-opt security such as to get owner of the name, can be used to co-opt security such as to get
domain-validated certificates fraudulently issued to an attacker. domain-validated certificates fraudulently issued to an attacker.
While this document does not create a new vulnerability in this area, While this document does not create a new vulnerability in this area,
it does potentially enlarge the window in which such an attack could it does potentially enlarge the window in which such an attack could
be made. A proposed mitigation is that certificate authorities be made. A proposed mitigation is that certificate authorities
should fully look up each name starting at the DNS root for every should fully look up each name starting at the DNS root for every
name lookup. Alternatively, CAs should use a resolver that is not name lookup. Alternatively, CAs should use a resolver that is not
skipping to change at page 11, line 22 skipping to change at page 11, line 41
[RFC1034] Mockapetris, P., "Domain names - concepts and facilities", [RFC1034] Mockapetris, P., "Domain names - concepts and facilities",
STD 13, RFC 1034, DOI 10.17487/RFC1034, November 1987, STD 13, RFC 1034, DOI 10.17487/RFC1034, November 1987,
<https://www.rfc-editor.org/info/rfc1034>. <https://www.rfc-editor.org/info/rfc1034>.
[RFC1035] Mockapetris, P., "Domain names - implementation and [RFC1035] Mockapetris, P., "Domain names - implementation and
specification", STD 13, RFC 1035, DOI 10.17487/RFC1035, specification", STD 13, RFC 1035, DOI 10.17487/RFC1035,
November 1987, <https://www.rfc-editor.org/info/rfc1035>. November 1987, <https://www.rfc-editor.org/info/rfc1035>.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-
<https://www.rfc-editor.org/info/rfc2119>. editor.org/info/rfc2119>.
[RFC2181] Elz, R. and R. Bush, "Clarifications to the DNS [RFC2181] Elz, R. and R. Bush, "Clarifications to the DNS
Specification", RFC 2181, DOI 10.17487/RFC2181, July 1997, Specification", RFC 2181, DOI 10.17487/RFC2181, July 1997,
<https://www.rfc-editor.org/info/rfc2181>. <https://www.rfc-editor.org/info/rfc2181>.
[RFC2308] Andrews, M., "Negative Caching of DNS Queries (DNS [RFC2308] Andrews, M., "Negative Caching of DNS Queries (DNS
NCACHE)", RFC 2308, DOI 10.17487/RFC2308, March 1998, NCACHE)", RFC 2308, DOI 10.17487/RFC2308, March 1998,
<https://www.rfc-editor.org/info/rfc2308>. <https://www.rfc-editor.org/info/rfc2308>.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
skipping to change at page 11, line 45 skipping to change at page 12, line 17
May 2017, <https://www.rfc-editor.org/info/rfc8174>. May 2017, <https://www.rfc-editor.org/info/rfc8174>.
15.2. Informative References 15.2. Informative References
[CloudStrife] [CloudStrife]
Borgolte, K., Fiebig, T., Hao, S., Kruegel, C., and G. Borgolte, K., Fiebig, T., Hao, S., Kruegel, C., and G.
Vigna, "Cloud Strife: Mitigating the Security Risks of Vigna, "Cloud Strife: Mitigating the Security Risks of
Domain-Validated Certificates", ACM 2018 Applied Domain-Validated Certificates", ACM 2018 Applied
Networking Research Workshop, DOI 10.1145/3232755.3232859, Networking Research Workshop, DOI 10.1145/3232755.3232859,
July 2018, <https://www.ndss-symposium.org/wp- July 2018, <https://www.ndss-symposium.org/wp-
content/uploads/2018/02/ content/uploads/2018/02/ndss2018_06A-
ndss2018_06A-4_Borgolte_paper.pdf>. 4_Borgolte_paper.pdf>.
[DikeBreaks] [DikeBreaks]
Moura, G., Heidemann, J., Mueller, M., Schmidt, R., and M. Moura, G., Heidemann, J., Mueller, M., Schmidt, R., and M.
Davids, "When the Dike Breaks: Dissecting DNS Defenses Davids, "When the Dike Breaks: Dissecting DNS Defenses
During DDoS", ACM 2018 Internet Measurement Conference, During DDoS", ACM 2018 Internet Measurement Conference,
DOI 10.1145/3278532.3278534, October 2018, DOI 10.1145/3278532.3278534, October 2018,
<https://www.isi.edu/~johnh/PAPERS/Moura18b.pdf>. <https://www.isi.edu/~johnh/PAPERS/Moura18b.pdf>.
[RFC6672] Rose, S. and W. Wijngaards, "DNAME Redirection in the [RFC6672] Rose, S. and W. Wijngaards, "DNAME Redirection in the
DNS", RFC 6672, DOI 10.17487/RFC6672, June 2012, DNS", RFC 6672, DOI 10.17487/RFC6672, June 2012,
 End of changes. 22 change blocks. 
48 lines changed or deleted 58 lines changed or added
