DNSOP Working Group                                             G. Moura
Internet-Draft                                       SIDN Labs/TU Delft
Intended status: Informational                               W. Hardaker
Expires: September 12, 2019                                 J. Heidemann
                                      USC/Information Sciences Institute
                                                               M. Davids
                                                                SIDN Labs
                                                          March 11, 2019

        Recommendations for Authoritative Servers Operators
          draft-moura-dnsop-authoritative-recommendations-03
Abstract

This document summarizes recent research work exploring DNS
configurations and offers specific, tangible recommendations to
operators for configuring authoritative servers.

This document is not an Internet Standards Track specification; it is
published for informational purposes.
skipping to change at page 2, line 7
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF).  Note that other groups may also distribute
working documents as Internet-Drafts.  The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

This Internet-Draft will expire on September 12, 2019.
Copyright Notice

Copyright (c) 2019 IETF Trust and the persons identified as the
document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document.  Please review these documents
skipping to change at page 2, line 37
2.  R1: Use equally strong IP anycast in every authoritative
    server to achieve even load distribution . . . . . . . . . .   4
3.  R2: Routing Can Matter More Than Locations  . . . . . . . . .   6
4.  R3: Collecting Detailed Anycast Catchment Maps Ahead of
    Actual Deployment Can Improve Engineering Designs . . . . . .   6
5.  R4: When under stress, employ two strategies  . . . . . . . .   8
6.  R5: Consider longer time-to-live values whenever possible . .   9
7.  R6: Shared Infrastructure Risks Collateral Damage During
    Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . .  11
8.  Security considerations . . . . . . . . . . . . . . . . . . .  12
9.  Privacy Considerations  . . . . . . . . . . . . . . . . . . .  12
10. IANA considerations . . . . . . . . . . . . . . . . . . . . .  12
11. Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .  12
12. References  . . . . . . . . . . . . . . . . . . . . . . . . .  13
    12.1.  Normative References . . . . . . . . . . . . . . . . .  13
    12.2.  Informative References . . . . . . . . . . . . . . . .  14
Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  16
1.  Introduction
The domain name system (DNS) has two main types of DNS servers:
authoritative servers and recursive resolvers.  Figure 1 shows their
relationship.  An authoritative server (ATn in Figure 1) knows the
content of a DNS zone from local knowledge, and thus can answer
queries about that zone without needing to query other servers
[RFC2181].  A recursive resolver (Re_n) is a program that extracts
information from name servers in response to client requests
[RFC1034].  A client (stub in Figure 1) refers to a stub resolver
[RFC1034] that is typically located within the client software.
              +-----+   +-----+   +-----+   +-----+
              | AT1 |   | AT2 |   | AT3 |   | AT4 |
              +-----+   +-----+   +-----+   +-----+
                 ^         ^         ^         ^
                 |         |         |         |
                 |      +-----+      |         |
                 +------|Re_1 |------+         |
                 |      +-----+                |
                 |         ^                   |
                 |         |                   |
                 |      +-----+   +-----+      |
                 +------|Re_2 |   |Re_3 |------+
                        +-----+   +-----+
                           ^         ^
                           |         |
                           |+------+ |
                           +| stub |-+
                            +------+

     Figure 1: Relationship between recursive resolvers (Re_n) and
                   authoritative name servers (ATn)
DNS queries/responses contribute to users' perceived latency and
affect user experience [Sigla2014], and the DNS system has been
subject to repeated Denial of Service (DoS) attacks (for example, in
November 2015 [Moura16b]) that aim to degrade user experience.
To reduce latency and improve resiliency against DoS attacks, DNS
uses several types of server replication.  Replication at the
authoritative server level can be achieved with (i) the deployment of
multiple servers for the same zone [RFC1035] (AT1--AT4 in Figure 1),
(ii) the use of IP anycast [RFC1546][RFC4786][RFC7094], which allows
the same IP address to be announced from multiple locations (each of
them referred to as an anycast instance [RFC8499]), and (iii) the use
of load balancers to support multiple servers inside a single
(potentially anycasted) instance.  As a consequence, there are many
possible ways an authoritative DNS provider can engineer its
production authoritative server network, with multiple viable choices
and no single optimal design.
This document summarizes recent research work exploring DNS
configurations and offers specific, tangible recommendations to DNS
authoritative server operators (DNS operators hereafter).  The
recommendations (R1-R6) presented in this document are backed by
previous research work, which used wide-scale Internet measurements
upon which to draw their conclusions.  This document describes the
key engineering options, and points readers to the pertinent papers
for details.

These recommendations are designed for operators of "large"
authoritative servers for domains like TLDs.  "Large" authoritative
servers refers to those with a significant global user population.
These recommendations may not be appropriate for smaller domains,
such as those used by an organization with users in one city or
region, where goals such as uniform low latency are less strict.

It is likely that these recommendations might be useful in a wider
context, such as for any stateless/short-duration, anycasted service.
Because the conclusions of the studies do not verify this broader
applicability, the wording in this document discusses DNS
authoritative services only.
2.  R1: Use equally strong IP anycast in every authoritative server to
    achieve even load distribution
Authoritative DNS server operators announce their authoritative
servers as NS records [RFC1034].  Different authoritatives for a
given zone should return the same content, typically by staying
synchronized using DNS zone transfers (AXFR [RFC5936] and IXFR
[RFC1995]) to coordinate the authoritative zone data to return to
their clients.
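The synchronization requirement above can be spot-checked by comparing
the SOA serial number that each authoritative returns.  A minimal
offline sketch follows; the server names and serial values are
hypothetical, and a real check would query each NS over the network:

```python
# Toy consistency check: given the SOA serial observed at each
# authoritative server for a zone, report servers lagging behind the
# highest serial (i.e., not yet synchronized via AXFR/IXFR).
# The server names and serials below are hypothetical examples.

def lagging_servers(serials: dict) -> list:
    """Return names of servers whose SOA serial is behind the newest."""
    newest = max(serials.values())
    return sorted(name for name, s in serials.items() if s < newest)

observed = {
    "at1.example.net": 2019031101,
    "at2.example.net": 2019031101,
    "at3.example.net": 2019031001,  # one zone update behind
    "at4.example.net": 2019031101,
}

print(lagging_servers(observed))  # -> ['at3.example.net']
```

An empty result means all listed authoritatives are serving the same
zone version.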
DNS heavily relies upon replication to support high reliability,
capacity and to reduce latency [Moura16b].  DNS has two complementary
mechanisms to replicate the service.  First, the protocol itself
supports nameserver replication of DNS service for a DNS zone through
the use of multiple nameservers that each operate on different IP
addresses, listed by a zone's NS records.  Second, each of these
network addresses can run from multiple physical locations through
the use of IP anycast [RFC1546][RFC4786][RFC7094], by announcing the
same IP address from each instance and allowing Internet routing
skipping to change at page 5, line 14
servers, it is difficult to ensure that recursives will be served by
the closest authoritative server.  Server selection is up to the
recursive resolver's software implementation, and different software
vendors and releases employ different criteria for choosing the
authoritative servers with which to communicate.
Knowing how recursives choose authoritative servers is a key step to
better engineer the deployment of authoritative servers.
[Mueller17b] evaluates this with a measurement study in which they
deployed seven unicast authoritative name servers in different global
locations and queried these authoritative servers from more than
9,000 RIPE Atlas probes (vantage points) and their respective
recursive resolvers.
In the wild, [Mueller17b] found that recursives query all available
authoritative servers, regardless of the observed latency.  But the
distribution of queries tends to be skewed towards authoritatives
with lower latency: the lower the latency between a recursive
resolver and an authoritative server, the more often the recursive
will send queries to that authoritative.  These results were obtained
by aggregating results from all vantage points and are not specific
to any vendor/version.
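The skew reported by [Mueller17b] can be illustrated with a toy model
in which each authoritative's query share is made inversely
proportional to its RTT.  The RTT values below are hypothetical, and
real resolver implementations use their own (varied) selection
algorithms; this is only a sketch of the aggregate tendency:

```python
# Toy model of the aggregate behavior: recursives query every
# authoritative, but send more queries to those with lower latency.
# Query share here is inversely proportional to RTT (hypothetical
# values in milliseconds).

def query_share(rtts_ms: dict) -> dict:
    """Distribute query load inversely proportional to each RTT."""
    inv = {name: 1.0 / rtt for name, rtt in rtts_ms.items()}
    total = sum(inv.values())
    return {name: w / total for name, w in inv.items()}

shares = query_share({"AT1": 10, "AT2": 20, "AT3": 40, "AT4": 80})
for name, share in sorted(shares.items()):
    print(f"{name}: {share:.0%}")   # AT1 gets the largest share
```

In this model AT1 (10 ms away) receives roughly half the queries,
while AT4 (80 ms away) still receives some traffic, matching the
observation that no authoritative is ignored entirely.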
skipping to change at page 6, line 20
an anycast service is the number of anycast instances [RFC4786],
i.e., the number of global locations from which the same address is
announced with BGP.  Intuitively, one might think that more instances
will lead to shorter response times.
However, this is not necessarily true.  In fact, [Schmidt17a] found
that routing can matter more than the total number of locations.
They analyzed the relationship between the number of anycast
instances and the performance of a service (latency-wise, RTT) and
measured the overall performance of four DNS Root servers, namely C,
F, K and L, from more than 7.9k RIPE Atlas probes.
[Schmidt17a] found that C-Root, a smaller anycast deployment
consisting of only 8 instances (they refer to an anycast instance as
an anycast site), provided an overall performance very similar to
that of the much larger deployments of K and L, with 33 and 144
instances respectively.  The median RTT for C, K and L Root was
between 30 ms and 32 ms.
[Schmidt17a]'s recommendation for DNS operators when engineering
anycast services is to consider factors other than just the number of
instances (such as local routing connectivity) when designing for
performance.  They showed that 12 instances can provide reasonable
latency, provided they are globally distributed and have good local
interconnectivity.  However, more instances can be useful for other
reasons, such as when handling DDoS attacks [Moura16b].
4.  R3: Collecting Detailed Anycast Catchment Maps Ahead of Actual
skipping to change at page 7, line 25
carry out active measurements, using an open-source tool called
Verfploeter (available at [VerfSrc]).  Verfploeter maps a large
portion of the IPv4 address space, allowing DNS operators to predict
both query distribution and client catchment before deploying new
anycast instances.
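The prediction step can be sketched as combining a catchment map
(which anycast instance serves each /24 prefix) with a per-prefix
query load to estimate each instance's share of traffic.  The
prefixes, catchment assignments and query counts below are
hypothetical illustrations, not measured data:

```python
# Sketch of the estimation step: combine a catchment map
# (/24 prefix -> anycast instance, e.g., derived from where ICMP
# echo replies arrive) with a per-prefix query load (e.g., from
# passive traces) to predict each instance's traffic fraction.
# All prefixes, catchments, and loads below are hypothetical.

def predict_load(catchment: dict, load: dict) -> dict:
    """Return each anycast instance's predicted fraction of queries."""
    totals = {}
    for prefix, instance in catchment.items():
        totals[instance] = totals.get(instance, 0) + load.get(prefix, 0)
    grand = sum(totals.values())
    return {inst: q / grand for inst, q in totals.items()}

catchment = {"192.0.2.0/24": "LAX", "198.51.100.0/24": "LAX",
             "203.0.113.0/24": "MIA"}
load = {"192.0.2.0/24": 5000, "198.51.100.0/24": 3000,
        "203.0.113.0/24": 2000}

print(predict_load(catchment, load))  # -> {'LAX': 0.8, 'MIA': 0.2}
```

At Internet scale the same computation runs over millions of /24
prefixes, which is what makes the resulting estimates useful before
any routing change is deployed.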
[Vries17b] shows how this technique was used to predict both the
catchment and query load distribution for the new anycast service of
B-Root.  Using two anycast instances in Miami (MIA) and Los Angeles
(LAX) from the operational B-Root server, they sent ICMP echo packets
to IP addresses in each IPv4 /24 on the Internet using a source
address within the anycast prefix.  Then, they recorded which
instance the ICMP echo replies arrived at based on the Internet's BGP
routing.  This analysis resulted in an Internet-wide catchment map.
Weighting was then applied to the incoming traffic prefixes based on
1 day of B-Root traffic (2017-04-12, DITL datasets [Ditl17]).  The
combination of the created catchment mapping and the load per prefix
created an estimate predicting that 81.6% of the traffic would go to
the LAX instance.  The actual value was 81.4% of traffic going to
LAX, showing that the estimate was very close and the Verfploeter
technique is an excellent method of predicting traffic loads in
skipping to change at page 8, line 33
date reached 1.2 Tbps, by using IoT devices [Perlroth16].  Such
attacks call for an answer to the following question: how should a
DNS operator engineer its anycast authoritative DNS server to react
to the stress of a DDoS attack?  This question is investigated in
[Moura16b], in which empirical observations are grounded with the
following theoretical evaluation of options.
An authoritative DNS server deployed using anycast will have many
server instances distributed over many networks.  Ultimately, the
relationship between the DNS provider's network and a client's ISP
will determine which anycast instance will answer queries for a given
client.  As a consequence, when an anycast authoritative server is
under attack, the load that each anycast instance receives is likely
to be unevenly distributed (a function of the source of the attacks);
thus some instances may be more overloaded than others, which is what
was observed when analyzing the Root DNS events of Nov. 2015
[Moura16b].  Given the fact that different instances may have
different capacity (bandwidth, CPU, etc.), making a decision about
how to react to stress becomes even more difficult.
In practice, an anycast instance under stress, overloaded with
incoming traffic, has two options:

o  It can withdraw or prepend its route to some or all of its
   neighbors, perform other traffic-shifting tricks (such as reducing
   the propagation of its announcements using BGP communities
   [RFC1997], which shrinks portions of its catchment), or use
   FlowSpec [RFC5575] or other upstream communication mechanisms to
   deploy upstream filtering.  The goal of these techniques is to
   perform some combination of shifting both legitimate and attack
   traffic to other anycast instances (with hopefully greater
   capacity) or to block the traffic entirely.
o  Alternatively, it can become a degraded absorber, continuing to
   operate, but with overloaded ingress routers, dropping some
   incoming legitimate requests due to queue overflow.  However,
   continued operation will also absorb traffic from attackers in its
   catchment, protecting the other anycast instances.
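The difference between the two options can be illustrated with a toy
numerical model.  The per-instance capacities and attack loads
(queries per second) are hypothetical; "withdraw" moves the
overloaded instance's entire catchment evenly onto the others, while
"absorb" keeps answering and drops the local excess:

```python
# Toy illustration of the two reactions, with hypothetical loads and
# capacities in queries/second.  "Absorb": the overloaded instance
# drops its excess locally.  "Withdraw": its catchment shifts to the
# remaining instances, possibly overloading them in turn.

def absorb(load: dict, capacity: dict) -> dict:
    """Queries dropped per instance when each absorbs its own excess."""
    return {i: max(0, load[i] - capacity[i]) for i in load}

def withdraw(load: dict, capacity: dict, instance: str) -> dict:
    """Load per instance after one instance withdraws its route."""
    others = [i for i in load if i != instance]
    shifted = {i: load[i] + load[instance] / len(others) for i in others}
    shifted[instance] = 0
    return shifted

load = {"A": 900, "B": 100, "C": 100}   # attack concentrated on A
cap = {"A": 300, "B": 400, "C": 400}

print(absorb(load, cap))                # only A drops traffic (600 q/s)
print(withdraw(load, cap, "A"))         # B and C now carry 550 q/s each
```

Note that after withdrawal, B and C each carry 550 q/s against a
capacity of 400: the stress has been displaced rather than removed,
which is the "waterbed" effect described below.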
[Moura16b] saw both of these behaviors in practice in the Root DNS
events, observed through instance reachability and round-trip times
(RTTs).  These options represent different uses of an anycast
deployment.  The withdrawal strategy causes anycast to respond as a
waterbed, with stress displacing queries from one instance to others.
The absorption strategy behaves as a conventional mattress,
compressing under load, with some queries getting delayed or dropped.
Although described as strategies and policies, these outcomes are the
result of several factors: the combination of operator and host ISP
routing policies, routing implementations withdrawing under load, the
nature of the attack, and the locations of the instances and the
attackers.  Some policies are explicit, such as the choice of local-
only anycast instances, or operators removing an instance for
maintenance or modifying routing to manage load.  However, under
stress, the choices of withdrawal and absorption can also be results
that emerge from a mix of explicit choices and implementation
details, such as BGP timeout values.
[Moura16b] speculates that more careful, explicit, and automated
management of policies may provide stronger defenses to overload, an
area currently under study.  For DNS operators, that means that
besides traditional filtering, two other options are available
(withdraw/prepend/communities, or isolate instances), and the best
choice depends on the specifics of the attack.
Note that this recommendation refers to the operation of one anycast
service, i.e., one anycast NS record.  However, DNS zones with
multiple NS anycast services may expect load to spill from one
anycast server to another, as resolvers switch from authoritative to
authoritative when attempting to resolve a name [Mueller17b].
6.  R5: Consider longer time-to-live values whenever possible
In a DNS response, each resource record is accompanied by a time-to-
live (TTL) value, which "describes how long a RR can be cached before
it should be discarded" [RFC1034].  TTL values are set by zone owners
in their zone files, either specifically per record or by using
default values for the entire zone.  Sometimes the same resource
record may have different TTL values, one from the parent and one
from the child DNS server.  In this case, resolvers are expected to
prioritize the answer according to Section 5.4.1 of [RFC2181].
While set at authoritative servers (ATn in Figure 1), the TTL value
in fact influences the behavior of recursive resolvers (and their
operators - "Re_n" in the same figure), by setting an upper limit on
how long a record should be cached before being discarded.  In this
sense, caching can be seen as a sort of "ephemeral replication",
i.e., the contents of an authoritative server are placed at a
recursive resolver cache for a period of time up to the TTL value.
Caching improves response times by avoiding repeated queries between
recursive resolvers and authoritative servers.
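This "ephemeral replication" can be sketched as a minimal TTL-bounded
cache.  The class below is an illustrative simplification of resolver
caching, not any particular resolver's implementation.

```python
import time

class TTLCache:
    """Minimal sketch of a resolver cache: a record is kept for at
    most its TTL, after which it is discarded and must be re-fetched
    from the authoritative servers."""

    def __init__(self):
        self._store = {}  # name -> (rdata, expiry timestamp)

    def put(self, name, rdata, ttl):
        # The TTL set by the zone owner bounds the cache lifetime.
        self._store[name] = (rdata, time.monotonic() + ttl)

    def get(self, name):
        entry = self._store.get(name)
        if entry is None:
            return None
        rdata, expiry = entry
        if time.monotonic() >= expiry:  # TTL elapsed: discard
            del self._store[name]
            return None
        return rdata

cache = TTLCache()
cache.put("example.nl", "192.0.2.1", ttl=60)
assert cache.get("example.nl") == "192.0.2.1"  # served from cache
```

While an entry is fresh, the recursive resolver answers from cache
and the authoritative server sees no query at all, which is exactly
the property the DDoS experiments below rely on.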
Besides improving performance, it has been argued that caching plays
a significant role in protecting users during DDoS attacks against
authoritative servers.  To investigate that, [Moura18b] evaluates
the role of caching (and retries) in DNS resiliency to DDoS attacks.
Two authoritative servers were configured for a newly registered
domain and a series of experiments were carried out using various
TTL values (60, 1800, 3600, and 86400 seconds) for records.  Unique
DNS queries were sent from roughly 15,000 vantage points, using RIPE
Atlas.
[Moura18b] found that, under normal operations, caching works as
expected 70% of the time in the wild.  It is believed that complex
recursive infrastructure (such as anycast recursives with fragmented
caches), besides cache flushing and hierarchy, explains the other
30% of non-cached records.  The results from the experiments were
confirmed by analyzing authoritative traffic for the .nl TLD, which
showed similar figures.
[Moura18b] also emulated DDoS attacks on authoritative servers by
dropping all incoming packets for various TTL values.  For
experiments in which all authoritative servers were completely
unreachable, they found that the TTL value on the DNS records,
together with the status of the cache at attack time, determined how
long clients kept receiving responses.  Given that the TTL value
decreases as time passes at the cache, it protected clients for up
to its remaining value in cache.  Once the TTL expired, there was
some evidence of some recursives serving stale content
[I-D.ietf-dnsop-serve-stale].  Serving stale is the only viable
option when TTL values expire in recursive caches and authoritative
servers become completely unavailable.
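The serve-stale fallback can be sketched as follows.  This is a
simplified illustration of the idea behind
[I-D.ietf-dnsop-serve-stale], not its exact algorithm; the helper
query_authoritatives and its behavior are hypothetical.

```python
import time

cache = {}  # name -> (rdata, expiry timestamp)

def query_authoritatives(name):
    """Hypothetical upstream query: returns (rdata, ttl), or raises
    TimeoutError when all authoritative servers are unreachable.
    Here it always times out, emulating a total-failure DDoS."""
    raise TimeoutError

def resolve(name):
    entry = cache.get(name)
    now = time.monotonic()
    if entry and now < entry[1]:     # fresh entry: normal cache hit
        return entry[0]
    try:
        rdata, ttl = query_authoritatives(name)
    except TimeoutError:
        if entry:                    # expired, but data is still
            return entry[0]          # present: serve it stale
        raise                        # nothing cached: resolution fails
    cache[name] = (rdata, now + ttl)
    return rdata

# A record whose TTL already expired is still served during the outage:
cache["example.nl"] = ("192.0.2.1", time.monotonic() - 1)
assert resolve("example.nl") == "192.0.2.1"
```

Note that a client whose resolver never cached the record gets no
such protection, which is consistent with the observation above that
cache status at attack time matters as much as the TTL itself.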
They also emulated partial-failure DDoS attacks, i.e., attacks that
leave the authoritative servers able to answer only part of the
incoming queries (similar to the Dyn 2016 attack [Perlroth16]).
They emulated such scenarios by dropping incoming packets at rates
of 50-90%, for various TTL values.  They found that:
o  Caching was a key component in the success of queries.  For
   example, with a 50% packet drop rate at the authoritatives, most
   clients eventually got an answer.
o  Recursive retries were also a key part of resilience: when
   caching could not help (for a scenario with a TTL of 60s, and 10
   minutes between probes), recursive servers kept retrying queries
   to the authoritatives.  With 90% packet drop on both
   authoritatives (with a TTL of 60s), 27% of clients still got an
   answer.
skipping to change at page 12, line 26
many of their customers, including Airbnb, HBO, Netflix, and
Twitter, experienced issues with clients failing to resolve their
domains, since the servers partially shared the same infrastructure.
It is therefore recommended that, when choosing third-party DNS
providers, operators be aware of shared-infrastructure risks: by
sharing infrastructure, there is an increased attack surface.
8. Security considerations
This document suggests the use of [I-D.ietf-dnsop-serve-stale].  It
should be noted that the use of such methods may affect the data
integrity of DNS information.  This document describes methods of
mitigating the effects of denial-of-service threats against a DNS
service.
As this document discusses research, there are no further security
considerations, other than the ones mentioned in the normative
references.

9. Privacy Considerations

This document does not add any practical new privacy issues.

10. IANA considerations
This document has no IANA actions.
11. Acknowledgements
This document is a summary of the main recommendations of five
research works referred to in this document.  As such, it was only
possible thanks to the hard work of the authors of these research
works.
The authors of this document are also co-authors of these research
works.  However, not all thirteen authors of these research papers
are also authors of this document.  We would like to thank those not
included in this document's author list for their work: Ricardo de
O. Schmidt, Wouter B de Vries, Moritz Mueller, Lan Wei, Cristian
Hesselman, Jan Harm Kuipers, Pieter-Tjerk de Boer and Aiko Pras.
We would also like to thank the various reviewers of different
versions of this draft: Duane Wessels, Joe Abley, Toema Gavrichenkov,
John Levine, Michael StJohns, Kristof Tuyteleers, and Stefan Ubbink.
Besides those, we would like to thank those who have been
individually thanked in each research work, RIPE NCC and DNS OARC
for their tools and datasets used in this research, as well as the
funding agencies sponsoring the individual research works.
12. References

12.1. Normative References
[I-D.ietf-dnsop-serve-stale]
           Lawrence, D., Kumari, W., and P. Sood, "Serving Stale Data
           to Improve DNS Resiliency", draft-ietf-dnsop-serve-
           stale-03 (work in progress), February 2019.
[RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
           STD 13, RFC 1034, DOI 10.17487/RFC1034, November 1987,
           <https://www.rfc-editor.org/info/rfc1034>.
skipping to change at page 14, line 14
[RFC4271]  Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
           Border Gateway Protocol 4 (BGP-4)", RFC 4271,
           DOI 10.17487/RFC4271, January 2006,
           <https://www.rfc-editor.org/info/rfc4271>.
[RFC4786]  Abley, J. and K. Lindqvist, "Operation of Anycast
           Services", BCP 126, RFC 4786, DOI 10.17487/RFC4786,
           December 2006, <https://www.rfc-editor.org/info/rfc4786>.
[RFC5575] Marques, P., Sheth, N., Raszuk, R., Greene, B., Mauch, J.,
and D. McPherson, "Dissemination of Flow Specification
Rules", RFC 5575, DOI 10.17487/RFC5575, August 2009,
<https://www.rfc-editor.org/info/rfc5575>.
[RFC5936]  Lewis, E. and A. Hoenes, Ed., "DNS Zone Transfer Protocol
           (AXFR)", RFC 5936, DOI 10.17487/RFC5936, June 2010,
           <https://www.rfc-editor.org/info/rfc5936>.
[RFC7094]  McPherson, D., Oran, D., Thaler, D., and E. Osterweil,
           "Architectural Considerations of IP Anycast", RFC 7094,
           DOI 10.17487/RFC7094, January 2014,
           <https://www.rfc-editor.org/info/rfc7094>.
[RFC8499]  Hoffman, P., Sullivan, A., and K. Fujiwara, "DNS
           Terminology", BCP 219, RFC 8499, DOI 10.17487/RFC8499,
           January 2019, <https://www.rfc-editor.org/info/rfc8499>.
12.2. Informative References
[AnyTest]  Schmidt, R., "Anycast Testbed", December 2018,
           <http://www.anycast-testbed.com/>.
[Ditl17]   OARC, D., "2017 DITL data", October 2018,
           <https://www.dns-oarc.net/oarc/data/ditl/2017>.
[IcannHedge18]
           ICANN, ., "DNS-STATS - Hedgehog 2.4.1", October 2018,
           <http://stats.dns.icann.org/hedgehog/>.