draft-ietf-dnsop-serve-stale-04.txt   draft-ietf-dnsop-serve-stale-05.txt 
DNSOP Working Group D. Lawrence DNSOP Working Group D. Lawrence
Internet-Draft Oracle Internet-Draft Oracle
Updates: 1034, 1035 (if approved) W. Kumari Updates: 1034, 1035 (if approved) W. Kumari
Intended status: Standards Track P. Sood Intended status: Standards Track P. Sood
Expires: September 10, 2019 Google Expires: October 18, 2019 Google
March 09, 2019 April 16, 2019
Serving Stale Data to Improve DNS Resiliency Serving Stale Data to Improve DNS Resiliency
draft-ietf-dnsop-serve-stale-04 draft-ietf-dnsop-serve-stale-05
Abstract Abstract
This draft defines a method (serve-stale) for recursive resolvers to This draft defines a method (serve-stale) for recursive resolvers to
use stale DNS data to avoid outages when authoritative nameservers use stale DNS data to avoid outages when authoritative nameservers
cannot be reached to refresh expired data. It updates the definition cannot be reached to refresh expired data. It updates the definition
of TTL from [RFC1034], [RFC1035], and [RFC2181] to make it clear that of TTL from [RFC1034], [RFC1035], and [RFC2181] to make it clear that
data can be kept in the cache beyond the TTL expiry and used for data can be kept in the cache beyond the TTL expiry and used for
responses when a refreshed answer is not readily available. One of responses when a refreshed answer is not readily available. One of
the motivations for serve-stale is to make the DNS more resilient to the motivations for serve-stale is to make the DNS more resilient to
skipping to change at page 2, line 4 skipping to change at page 2, line 4
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/. Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on September 10, 2019. This Internet-Draft will expire on October 18, 2019.
Copyright Notice Copyright Notice
Copyright (c) 2019 IETF Trust and the persons identified as the Copyright (c) 2019 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of (https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
skipping to change at page 2, line 31 skipping to change at page 2, line 31
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3
3. Background . . . . . . . . . . . . . . . . . . . . . . . . . 3 3. Background . . . . . . . . . . . . . . . . . . . . . . . . . 3
4. Standards Action . . . . . . . . . . . . . . . . . . . . . . 4 4. Standards Action . . . . . . . . . . . . . . . . . . . . . . 4
5. Example Method . . . . . . . . . . . . . . . . . . . . . . . 4 5. Example Method . . . . . . . . . . . . . . . . . . . . . . . 4
6. Implementation Considerations . . . . . . . . . . . . . . . . 6 6. Implementation Considerations . . . . . . . . . . . . . . . . 6
7. Implementation Caveats . . . . . . . . . . . . . . . . . . . 8 7. Implementation Caveats . . . . . . . . . . . . . . . . . . . 8
8. Implementation Status . . . . . . . . . . . . . . . . . . . . 9 8. Implementation Status . . . . . . . . . . . . . . . . . . . . 9
9. EDNS Option . . . . . . . . . . . . . . . . . . . . . . . . . 9 9. EDNS Option . . . . . . . . . . . . . . . . . . . . . . . . . 10
10. Security Considerations . . . . . . . . . . . . . . . . . . . 10 10. Security Considerations . . . . . . . . . . . . . . . . . . . 10
11. Privacy Considerations . . . . . . . . . . . . . . . . . . . 10 11. Privacy Considerations . . . . . . . . . . . . . . . . . . . 10
12. NAT Considerations . . . . . . . . . . . . . . . . . . . . . 10 12. NAT Considerations . . . . . . . . . . . . . . . . . . . . . 11
13. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 10 13. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 11
14. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 10 14. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 11
15. References . . . . . . . . . . . . . . . . . . . . . . . . . 11 15. References . . . . . . . . . . . . . . . . . . . . . . . . . 11
15.1. Normative References . . . . . . . . . . . . . . . . . . 11 15.1. Normative References . . . . . . . . . . . . . . . . . . 11
15.2. Informative References . . . . . . . . . . . . . . . . . 11 15.2. Informative References . . . . . . . . . . . . . . . . . 12
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 12 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 12
1. Introduction 1. Introduction
Traditionally the Time To Live (TTL) of a DNS resource record has Traditionally the Time To Live (TTL) of a DNS resource record has
been understood to represent the maximum number of seconds that a been understood to represent the maximum number of seconds that a
record can be used before it must be discarded, based on its record can be used before it must be discarded, based on its
description and usage in [RFC1035] and clarifications in [RFC2181]. description and usage in [RFC1035] and clarifications in [RFC2181].
This document proposes that the definition of the TTL be explicitly This document proposes that the definition of the TTL be explicitly
skipping to change at page 5, line 22 skipping to change at page 5, line 22
o A failure recheck timer, which limits the frequency at which a o A failure recheck timer, which limits the frequency at which a
failed lookup will be attempted again. failed lookup will be attempted again.
o A maximum stale timer, which caps the amount of time that records o A maximum stale timer, which caps the amount of time that records
will be kept past their expiration. will be kept past their expiration.
Most recursive resolvers already have the query resolution timer, and Most recursive resolvers already have the query resolution timer, and
effectively some kind of failure recheck timer. The client response effectively some kind of failure recheck timer. The client response
timer and maximum stale timer are new concepts for this mechanism. timer and maximum stale timer are new concepts for this mechanism.
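As an illustration only, the four timers could be gathered into one configuration object. The sketch below is not part of either draft revision. The class name, field names and the `problems` helper are hypothetical, and the defaults are taken from the values this section goes on to suggest (1.8 seconds, roughly 10 to 30 seconds, 30 seconds, and the -05 figure of 1 to 3 days). The check that the client response timer is shorter than the query resolution timer is an assumption drawn from the described flow, not a stated requirement.

```python
from dataclasses import dataclass

DAY = 86400.0  # seconds


@dataclass(frozen=True)
class ServeStaleTimers:
    """Hypothetical container for the four timers; all values are in seconds."""

    client_response: float = 1.8     # just under a common 2 second client timeout
    query_resolution: float = 30.0   # "commonly around 10 to 30 seconds"
    failure_recheck: float = 30.0    # retry failing authorities no more often than this
    max_stale: float = 3 * DAY       # -05 suggests between 1 and 3 days

    def problems(self):
        """Return a list of human-readable configuration problems (empty if none)."""
        issues = []
        for name in ("client_response", "query_resolution",
                     "failure_recheck", "max_stale"):
            if getattr(self, name) <= 0:
                issues.append(f"{name} must be positive")
        # Assumption: a stale answer is only useful if it can be sent
        # before the resolution effort itself is abandoned.
        if self.client_response >= self.query_resolution:
            issues.append("client_response should be shorter than query_resolution")
        return issues
```

All four values are local policy, so a deployment could override any of them, for example `ServeStaleTimers(max_stale=1 * DAY)`.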
When a request is received by the recursive resolver, it SHOULD start When a request is received by the recursive resolver, it should start
the client response timer. This timer is used to avoid client the client response timer. This timer is used to avoid client
timeouts. It SHOULD be configurable, with a recommended value of 1.8 timeouts. It should be configurable, with a recommended value of 1.8
seconds as being just under a common timeout value of 2 seconds while seconds as being just under a common timeout value of 2 seconds while
still giving the resolver a fair shot at resolving the name. still giving the resolver a fair shot at resolving the name.
The resolver then checks its cache for any unexpired data that The resolver then checks its cache for any unexpired data that
satisfies the request and of course returns them if available. If it satisfies the request and of course returns them if available. If it
finds no relevant unexpired data and the Recursion Desired flag is finds no relevant unexpired data and the Recursion Desired flag is
not set in the request, it SHOULD immediately return the response not set in the request, it should immediately return the response
without consulting the cache for expired records. Typically this without consulting the cache for expired records. Typically this
response would be a referral to authoritative nameservers covering response would be a referral to authoritative nameservers covering
the zone, but the specifics are implementation dependent. the zone, but the specifics are implementation dependent.
If iterative lookups will be done, then the failure recheck timer is If iterative lookups will be done, then the failure recheck timer is
consulted. Attempts to refresh from non-responsive or otherwise consulted. Attempts to refresh from non-responsive or otherwise
failing authoritative nameservers are recommended to be done no more failing authoritative nameservers are recommended to be done no more
frequently than every 30 seconds. If this request was received frequently than every 30 seconds. If this request was received
within this period, the cache may be immediately consulted for stale within this period, the cache may be immediately consulted for stale
data to satisfy the request. data to satisfy the request.
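A minimal sketch of how the failure recheck timer might be tracked follows. It is illustrative only: the `FailureRecheck` class and its method names are hypothetical, and keying the state by query name is an assumption (an implementation could equally key by zone or by server).

```python
import time


class FailureRecheck:
    """Remembers when a refresh last failed so retries can be rate limited."""

    def __init__(self, interval=30.0, clock=time.monotonic):
        self.interval = interval   # seconds; 30 is the recommended minimum spacing
        self._clock = clock        # injectable so the logic can be tested
        self._last_failure = {}    # query name -> time of last failed refresh

    def record_failure(self, name):
        self._last_failure[name] = self._clock()

    def record_success(self, name):
        self._last_failure.pop(name, None)

    def should_skip_refresh(self, name):
        """True while a recent failure means stale data may be used immediately."""
        failed_at = self._last_failure.get(name)
        return failed_at is not None and self._clock() - failed_at < self.interval
```

When `should_skip_refresh` returns true the resolver can go straight to its stale cache; otherwise it proceeds to start the query resolution timer.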
Outside the period of the failure recheck timer, the resolver SHOULD Outside the period of the failure recheck timer, the resolver should
start the query resolution timer and begin the iterative resolution start the query resolution timer and begin the iterative resolution
process. This timer bounds the work done by the resolver when process. This timer bounds the work done by the resolver when
contacting external authorities, and is commonly around 10 to 30 contacting external authorities, and is commonly around 10 to 30
seconds. If this timer expires on an attempted lookup that is still seconds. If this timer expires on an attempted lookup that is still
being processed, the resolution effort is abandoned. being processed, the resolution effort is abandoned.
If the answer has not been completely determined by the time the If the answer has not been completely determined by the time the
client response timer has elapsed, the resolver SHOULD then check its client response timer has elapsed, the resolver should then check its
cache to see whether there is expired data that would satisfy the cache to see whether there is expired data that would satisfy the
request. If so, it adds that data to the response message with a TTL request. If so, it adds that data to the response message with a TTL
greater than 0 per Section 4. The response is then sent to the greater than 0 per Section 4. The response is then sent to the
client while the resolver continues its attempt to refresh the data. client while the resolver continues its attempt to refresh the data.
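The race between the client response timer and the ongoing refresh could be sketched as below. This is one possible shape and is not taken from the draft. The `respond` function, its parameters, and the `STALE_TTL` value are hypothetical; the text only requires that the TTL on stale data be greater than 0. Waiting for the refresh when no stale data exists is likewise an assumption.

```python
import asyncio

STALE_TTL = 30  # illustrative placeholder; any value greater than 0 fits the text


async def respond(resolve, stale_data, client_response=1.8):
    """Return (answer, ttl_override, refresh_task).

    resolve    -- coroutine function performing the iterative lookup
    stale_data -- expired cache data for this request, or None
    """
    refresh = asyncio.create_task(resolve())
    done, _pending = await asyncio.wait({refresh}, timeout=client_response)
    if done and refresh.exception() is None:
        # A fresh answer arrived before the client response timer elapsed.
        return refresh.result(), None, refresh
    if stale_data is not None:
        # Answer from expired data; the refresh task keeps running.
        return stale_data, STALE_TTL, refresh
    # Assumption: with nothing stale to offer, wait for the lookup itself.
    return await refresh, None, refresh
```

The caller keeps the returned task so that the refreshed data can be written back to the cache once the lookup completes.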
When no authorities are able to be reached during a resolution When no authorities are able to be reached during a resolution
attempt, the resolver SHOULD attempt to refresh the delegation and attempt, the resolver should attempt to refresh the delegation and
restart the iterative lookup process with the remaining time on the restart the iterative lookup process with the remaining time on the
query resolution timer. This resumption should be done only once query resolution timer. This resumption should be done only once
during one resolution effort. during one resolution effort.
Outside the resolution process, the maximum stale timer is used for Outside the resolution process, the maximum stale timer is used for
cache management and is independent of the query resolution process. cache management and is independent of the query resolution process.
This timer is conceptually different from the maximum cache TTL that This timer is conceptually different from the maximum cache TTL that
exists in many resolvers, the latter being a clamp on the value of exists in many resolvers, the latter being a clamp on the value of
TTLs as received from authoritative servers and recommended to be 7 TTLs as received from authoritative servers and recommended to be 7
days in the TTL definition above. The maximum stale timer SHOULD be days in the TTL definition above. The maximum stale timer should be
configurable, and defines the length of time after a record expires configurable, and defines the length of time after a record expires
that it SHOULD be retained in the cache. The suggested value is 7 that it should be retained in the cache. The suggested value is
days, which gives time for monitoring to notice the resolution between 1 and 3 days.
problem and for human intervention to fix it.
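The role of the maximum stale timer in cache management can be shown with a small classification function. This is a sketch under stated assumptions: the function name and the three state labels are hypothetical, the default uses the upper end of the 1 to 3 day range suggested in -05, and treating a record as stale at the exact instant of expiry is an arbitrary choice.

```python
DAY = 86400


def record_state(now, expires_at, max_stale=3 * DAY):
    """Classify a cached record as 'fresh', 'stale', or 'discard'.

    now, expires_at -- timestamps in seconds on the same clock
    max_stale       -- how long past expiry a record is retained
    """
    if now < expires_at:
        return "fresh"      # still within its TTL
    if now - expires_at <= max_stale:
        return "stale"      # expired, but retained for serve-stale
    return "discard"        # past the maximum stale timer
```

With the -04 value of 7 days, the same call would be `record_state(now, expires_at, max_stale=7 * DAY)`.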
6. Implementation Considerations 6. Implementation Considerations
This document mainly describes the issues behind serving stale data This document mainly describes the issues behind serving stale data
and intentionally does not provide a formal algorithm. The concept and intentionally does not provide a formal algorithm. The concept
is not overly complex, and the details are best left to resolver is not overly complex, and the details are best left to resolver
authors to implement in their codebases. The processing of serve- authors to implement in their codebases. The processing of serve-
stale is a local operation, and consistent variables between stale is a local operation, and consistent variables between
deployments are not needed for interoperability. However, we would deployments are not needed for interoperability. However, we would
like to highlight the impact of various implementation choices, like to highlight the impact of various implementation choices,
starting with the timers involved. starting with the timers involved.
The most obvious of these is the maximum stale timer. If this The most obvious of these is the maximum stale timer. If this
variable is too large it could cause excessive cache memory usage, variable is too large it could cause excessive cache memory usage,
but if it is too small, the serve-stale technique becomes less but if it is too small, the serve-stale technique becomes less
effective, as the record may not be in the cache to be used if effective, as the record may not be in the cache to be used if
needed. Increased memory consumption could be mitigated by needed. Shorter values, even less than a day, can effectively handle
prioritizing removal of stale records over non-expired records during the vast majority of outages. Longer values, as much as a week, give
cache exhaustion. Implementations may also wish to consider whether time for monitoring systems to notice a resolution problem and for
to track the names in requests for their last time of use or their human intervention to fix it; operational experience has been that
sometimes the right people can be hard to track down and
unfortunately slow to remedy the situation.
Increased memory consumption could be mitigated by prioritizing
removal of stale records over non-expired records during cache
exhaustion. Implementations may also wish to consider whether to
track the names in requests for their last time of use or their
popularity, using that as an additional factor when considering cache popularity, using that as an additional factor when considering cache
eviction. A feature to manually flush only stale records could also eviction. A feature to manually flush only stale records could also
be useful. be useful.
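One way to prioritize stale records during cache exhaustion is to sort eviction candidates so that expired entries come first, using last time of use as the secondary factor. The sketch below is illustrative only: the function name and the tuple layout are hypothetical, and least-recently-used ordering is just one of the factors the text mentions (popularity would be another).

```python
def eviction_order(entries, now):
    """Return record names ordered from first-to-evict to last-to-evict.

    entries -- iterable of (name, expires_at, last_used) tuples
    now     -- current timestamp on the same clock
    """
    def key(entry):
        _name, expires_at, last_used = entry
        is_fresh = expires_at > now
        # False sorts before True, so stale records are evicted first;
        # within each group the least recently used record goes first.
        return (is_fresh, last_used)

    return [name for name, _expires_at, _last_used in sorted(entries, key=key)]
```

A manual flush of only stale records would be the prefix of this list for which `expires_at <= now`.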
The client response timer is another variable which deserves The client response timer is another variable which deserves
consideration. If this value is too short, there exists the risk consideration. If this value is too short, there exists the risk
that stale answers may be used even when the authoritative server is that stale answers may be used even when the authoritative server is
actually reachable but slow; this may result in sub-optimal answers actually reachable but slow; this may result in sub-optimal answers
being returned. Conversely, waiting too long will negatively impact being returned. Conversely, waiting too long will negatively impact
user experience. user experience.
 End of changes. 15 change blocks. 
23 lines changed or deleted 29 lines changed or added
