draft-ietf-v6ops-pmtud-ecmp-problem-02.txt | draft-ietf-v6ops-pmtud-ecmp-problem-03.txt | |||
---|---|---|---|---|
v6ops M. Byerly | v6ops M. Byerly | |||
Internet-Draft Fastly | Internet-Draft Fastly | |||
Intended status: Informational M. Hite | Intended status: Informational M. Hite | |||
Expires: December 19, 2015 Evernote | Expires: December 30, 2015 Evernote | |||
J. Jaeggli | J. Jaeggli | |||
Fastly | Fastly | |||
June 17, 2015 | June 28, 2015 | |||
Close encounters of the ICMP type 2 kind (near misses with ICMPv6 PTB) | Close encounters of the ICMP type 2 kind (near misses with ICMPv6 PTB) | |||
draft-ietf-v6ops-pmtud-ecmp-problem-02 | draft-ietf-v6ops-pmtud-ecmp-problem-03 | |||
Abstract | Abstract | |||
This document calls attention to the problem of delivering ICMPv6 | This document calls attention to the problem of delivering ICMPv6 | |||
type 2 "Packet Too Big" (PTB) messages to the intended destination in | type 2 "Packet Too Big" (PTB) messages to the intended destination in | |||
ECMP load balanced or anycast network architectures. It discusses | ECMP load balanced or anycast network architectures. It discusses | |||
operational mitigations that can be employed to address this class of | operational mitigations that can be employed to address this class of | |||
failure. | failures. | |||
Status of This Memo | Status of This Memo | |||
This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
provisions of BCP 78 and BCP 79. | provisions of BCP 78 and BCP 79. | |||
Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at http://datatracker.ietf.org/drafts/current/. | |||
Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
This Internet-Draft will expire on December 19, 2015. | This Internet-Draft will expire on December 30, 2015. | |||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2015 IETF Trust and the persons identified as the | Copyright (c) 2015 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
(http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
publication of this document. Please review these documents | publication of this document. Please review these documents | |||
skipping to change at page 2, line 45 | skipping to change at page 2, line 45 | |||
2. Problem | 2. Problem | |||
A common application for stateless load balancing of TCP or UDP flows | A common application for stateless load balancing of TCP or UDP flows | |||
is to perform an initial subdivision of flows in front of a stateful | is to perform an initial subdivision of flows in front of a stateful | |||
load balancer tier or multiple servers so that the workload becomes | load balancer tier or multiple servers so that the workload becomes | |||
divided into manageable fractions of the total number of flows. The | divided into manageable fractions of the total number of flows. The | |||
flow division is performed using ECMP forwarding and a stateless but | flow division is performed using ECMP forwarding and a stateless but | |||
sticky algorithm for hashing across the available paths. This | sticky algorithm for hashing across the available paths. This | |||
nexthop selection for the purposes of flow distribution is a | nexthop selection for the purposes of flow distribution is a | |||
constrained form of anycast topology where all anycast destinations | constrained form of anycast topology, where all anycast destinations | |||
are equidistant from the upstream router responsible for making the | are equidistant from the upstream router responsible for making the | |||
last next-hop forwarding decision before the flow arrives on the | last next-hop forwarding decision before the flow arrives on the | |||
destination device. In this approach, the hash is performed across | destination device. In this approach, the hash is performed across | |||
some set of available protocol headers. Typically, these headers may | some set of available protocol headers. Typically, these headers may | |||
include all or a subset of (IPv6) Flow-Label, IP-source, IP- | include all or a subset of (IPv6) Flow-Label, IP-source, IP- | |||
destination, protocol, source-port, destination-port and potentially | destination, protocol, source-port, destination-port and potentially | |||
others such as ingress interface. | others such as ingress interface. | |||
A problem common to this approach of distribution through hashing is | A problem common to this approach of distribution through hashing is | |||
impact on path MTU discovery. An ICMPv6 type 2 PTB message generated | impact on path MTU discovery. An ICMPv6 type 2 PTB message generated | |||
on an intermediate device for a packet sent from a server that is | on an intermediate device for a packet sent from a server that is | |||
part of an ECMP load balanced service to a client will have the load | part of an ECMP load balanced service to a client will have the load | |||
balanced anycast address as the destination and hence will be | balanced anycast address as the destination and hence will be | |||
statelessly load balanced to one of the servers. While the ICMPv6 | statelessly load balanced to one of the servers. While the ICMPv6 | |||
PTB message contains as much of the packet that could not be | PTB message contains as much of the packet that could not be | |||
forwarded as possible, the payload headers are not considered in the | forwarded as possible, the payload headers are not considered in the | |||
forwarding decision and are ignored. Because the PTB message is not | forwarding decision and are ignored. Because the PTB message is not | |||
identifiable as part of the original flow by the IP or upper layer | identifiable as part of the original flow by the IP or upper layer | |||
packet headers, the results of the ICMPv6 ECMP hash are unlikely to | packet headers, the results of the ICMPv6 ECMP hash calculation are | |||
be hashed to the same nexthop as packets matching TCP or UDP ECMP | unlikely to be hashed to the same nexthop as packets matching the TCP | |||
hash. | or UDP ECMP hash of the flow. | |||
An example packet flow and topology follow. | An example packet flow and topology follow. | |||
ptb -> router ecmp -> nexthop L4/L7 load balancer -> destination | ptb -> router ecmp -> nexthop L4/L7 load balancer -> destination | |||
router --> load balancer 1 ---> | router --> load balancer 1 ---> | |||
\\--> load balancer 2 ---> load-balanced service | \\--> load balancer 2 ---> load-balanced service | |||
\--> load balancer N ---> | \--> load balancer N ---> | |||
Figure 1 | Figure 1 | |||
skipping to change at page 4, line 20 | skipping to change at page 4, line 20 | |||
independently set the TCP MSS for different address families on some | independently set the TCP MSS for different address families on some | |||
end-systems. On Linux platforms, advmss may be set on a per route | end-systems. On Linux platforms, advmss may be set on a per route | |||
basis for selected destinations in cases where discrimination by | basis for selected destinations in cases where discrimination by | |||
route is possible. | route is possible. | |||
The problem as described does also impact IPv4; however | The problem as described does also impact IPv4; however | |||
implementation of RFC 4821 [RFC4821] TCP MTU probing, the ability to | implementation of RFC 4821 [RFC4821] TCP MTU probing, the ability to | |||
fragment on wire at tunnel ingress points and the relative rarity of | fragment on wire at tunnel ingress points and the relative rarity of | |||
sub-1500 byte MTUs that are not coupled to changes in client behavior | sub-1500 byte MTUs that are not coupled to changes in client behavior | |||
(for example, endpoint VPN clients set the tunnel interface MTU | (for example, endpoint VPN clients set the tunnel interface MTU | |||
accordingly for performance reasons) makes the problem sufficiently | accordingly to avoid fragmentation for performance reasons) makes the | |||
rare that some existing deployments have chosen to ignore it. | problem sufficiently rare that some existing deployments have chosen | |||
to ignore it. | ||||
3. Mitigation | 3. Mitigation | |||
Mitigation of the potential for PTB messages to be mis-delivered | Mitigation of the potential for PTB messages to be mis-delivered | |||
involves ensuring that an ICMPv6 error message is distributed to the | involves ensuring that an ICMPv6 error message is distributed to the | |||
same anycast server responsible for the flow for which the error is | same anycast server responsible for the flow for which the error is | |||
generated. Ideally, mitigation could be done by the mechanism hosts | generated. Ideally, mitigation could be done by the mechanism hosts | |||
use to identify the flow, by looking into the payload of the ICMPv6 | use to identify the flow, by looking into the payload of the ICMPv6 | |||
message (to determine which TCP flow it was associated with) before | message (to determine which TCP flow it was associated with) before | |||
making a forwarding decision. Because the encapsulated IP header | making a forwarding decision. Because the encapsulated IP header | |||
skipping to change at page 4, line 44 | skipping to change at page 4, line 45 | |||
capability could parse that far into the payload. Employing a | capability could parse that far into the payload. Employing a | |||
mediation device that handles the parsing and distribution of PTB | mediation device that handles the parsing and distribution of PTB | |||
messages after policy routing or on each load-balancer/server is a | messages after policy routing or on each load-balancer/server is a | |||
possibility. | possibility. | |||
Another mitigation approach is predicated upon distributing the PTB | Another mitigation approach is predicated upon distributing the PTB | |||
message to all anycast servers under the assumption that the one for | message to all anycast servers under the assumption that the one for | |||
which the message was intended will be able to match it to the flow | which the message was intended will be able to match it to the flow | |||
and update the route cache with the new MTU and that devices not able | and update the route cache with the new MTU and that devices not able | |||
to match the flow will discard these packets. Such distribution has | to match the flow will discard these packets. Such distribution has | |||
potentially significant implications for resource consumption and the | potentially significant implications for resource consumption and for | |||
potential for self-inflicted denial-of-service if not carefully | self-inflicted denial-of-service if not carefully employed. | |||
employed. Fortunately, in real-world deployments we have observed | Fortunately, in real-world deployments we have observed that the | |||
that the number of flows for which this problem occurs is relatively | number of flows for which this problem occurs is relatively small | |||
small (example, 10 or fewer pps on 1Gb/s or more worth of https | (example, 10 or fewer pps on 1Gb/s or more worth of https traffic in | |||
traffic) and sensible ingress rate limiters which will discard | a real world deployment); sensible ingress rate limiters which will | |||
excessive message volume can be applied to protect even very large | discard excessive message volume can be applied to protect even very | |||
anycast server tiers with the potential for fallout only under | large anycast server tiers with the potential for fallout limited to | |||
circumstances of deliberate duress. | circumstances of deliberate duress. | |||
3.1. Alternatives | 3.1. Alternatives | |||
As an alternative, it may be appropriate to lower the TCP MSS to 1220 | As an alternative, it may be appropriate to lower the TCP MSS to 1220 | |||
in order to accommodate 1280 byte MTU. We consider this undesirable | in order to accommodate 1280 byte MTU. We consider this undesirable | |||
as hosts may not be able to independently set TCP MSS by address- | as hosts may not be able to independently set TCP MSS by address- | |||
family thereby impacting IPv4, or alternatively that middle-boxes | family thereby impacting IPv4, or alternatively that middle-boxes | |||
need to be employed to clamp the MSS independently from the end- | need to be employed to clamp the MSS independently from the end- | |||
systems. Potentialy, extension might further alter the lower bound | systems. Potentially, extension headers might further alter the | |||
that the mss would have to be set to making clamping still more | lower bound that the MSS would have to be set to, making clamping | |||
undesirable. | still more undesirable. | |||
3.2. Implementation | 3.2. Implementation | |||
1. Filter-based-forwarding matches next-header ICMPv6 type-2 and | 1. Filter-based-forwarding matches next-header ICMPv6 type-2 and | |||
matches a next-hop on a particular subnet directly attached to | matches a next-hop on a particular subnet directly attached to | |||
both border routers. The filter is policed to reasonable limits | both border routers. The filter is policed to reasonable limits | |||
(we chose 1000pps more conservative rates might be required in | (we chose 1000pps, more conservative rates might be required in | |||
other imlementations). | other implementations). | |||
2. Filter is applied on input side of all external interfaces | 2. Filter is applied on input side of all external interfaces | |||
3. A proxy located at the next-hop forwards ICMPv6 type-2 packets | 3. A proxy located at the next-hop forwards ICMPv6 type-2 packets | |||
received at the next-hop to an Ethernet broadcast address | received at the next-hop to an Ethernet broadcast address | |||
(example ff:ff:ff:ff:ff:ff) on all specified subnets. This was | (example ff:ff:ff:ff:ff:ff) on all specified subnets. This was | |||
necessitated by router inability (in IPv6) to forward the same | necessitated by router inability (in IPv6) to forward the same | |||
packet to multiple unicast next-hops. | packet to multiple unicast next-hops. | |||
4. Anycast servers receive the PTB error and process packet as | 4. Anycast servers receive the PTB error and process packet as | |||
skipping to change at page 6, line 31 | skipping to change at page 6, line 31 | |||
sniff(prn=icmp6_callback, filter="icmp6 \ | sniff(prn=icmp6_callback, filter="icmp6 \ | |||
and (ip6[40+0] == 2)", store=0) | and (ip6[40+0] == 2)", store=0) | |||
if __name__ == '__main__': | if __name__ == '__main__': | |||
main() | main() | |||
This example script listens on all interfaces for IPv6 PTB errors | This example script listens on all interfaces for IPv6 PTB errors | |||
being forwarded using filter-based-forwarding. It removes the | being forwarded using filter-based-forwarding. It removes the | |||
existing Ethernet source and rewrites a new Ethernet destination of | existing Ethernet source and rewrites a new Ethernet destination of | |||
the Ethernet broadcast address. It then sends the resulting frame | the Ethernet broadcast address. It then sends the resulting frame | |||
out the p2p1 and p2p2 interfaces where our anycast servers reside. | out the p2p1 and p2p2 interfaces which are attached to vlans where our | |||
anycast servers reside. | ||||
3.2.1. Alternatives | 3.2.1. Alternatives | |||
Alternatively, network designs in which a common layer 2 network | Alternatively, network designs in which a common layer 2 network | |||
exists on the ECMP hop could distribute the proxy onto the end | exists on the ECMP hop could distribute the proxy onto the end | |||
systems, eleminating the need for policy routing. They could then | systems, eliminating the need for policy routing. They could then | |||
rewrite the destination -- for example, using iptables before | rewrite the destination -- for example, using iptables before | |||
forwarding the packet back to the network containing all of the | forwarding the packet back to the network containing all of the | |||
server or load balancer interfaces. This implementation can be done | server or load balancer interfaces. This implementation can be done | |||
entirely within the Linux iptables firewall. Because of the | entirely within the Linux iptables firewall. Because of the | |||
distributed nature of the filter, more conservative rate limits are | distributed nature of the filter, more conservative rate limits are | |||
required than when a global rate limit can be employed. | required than when a global rate limit can be employed. | |||
An example ip6tables / nftables rule to match icmp6 traffic, not | An example ip6tables / nftables rule to match icmp6 traffic, not | |||
match broadcast traffic, impose a rate limit of 10 pps, and pass to a | match broadcast traffic, impose a rate limit of 10 pps, and pass to a | |||
target destination would resemble: | target destination would resemble: | |||
skipping to change at page 7, line 24 | skipping to change at page 7, line 24 | |||
1. Routers with sufficient capacity within the lookup process could | 1. Routers with sufficient capacity within the lookup process could | |||
parse all the way through the L3 or L4 header in the ICMPv6 | parse all the way through the L3 or L4 header in the ICMPv6 | |||
payload beginning at bit offset 32 of the ICMP header. By | payload beginning at bit offset 32 of the ICMP header. By | |||
reordering the elements of the hash to match the inward direction | reordering the elements of the hash to match the inward direction | |||
of the flow, the PTB error could be directed to the same next-hop | of the flow, the PTB error could be directed to the same next-hop | |||
as the incoming packets in the flow. | as the incoming packets in the flow. | |||
2. The FIB (Forwarding Information Base) on the router could be | 2. The FIB (Forwarding Information Base) on the router could be | |||
programmed with a multicast distribution tree that included all | programmed with a multicast distribution tree that included all | |||
of the necessary next-hops. | of the necessary next-hops, and ICMPv6 packets could be policy | |||
routed to this destination. | ||||
3. Ubiquitous implementation of RFC 4821 [RFC4821] Packetization | 3. Ubiquitous implementation of RFC 4821 [RFC4821] Packetization | |||
Layer Path MTU Discovery would probably go a long way towards | Layer Path MTU Discovery would probably go a long way towards | |||
reducing dependence on ICMPv6 PTB. | reducing dependence on ICMPv6 PTB by end systems. | |||
5. Acknowledgements | 5. Acknowledgements | |||
The authors would like to thank Marak Majkowsiki for contributing | The authors would like to thank Marak Majkowsiki for contributing | |||
text, examples, and a very close review. The authors would like to | text, examples, and a very close review. The authors would like to | |||
thank Mark Andrews, Brian Carpenter, Nick Hilliard and Ray Hunter, | thank Mark Andrews, Brian Carpenter, Nick Hilliard and Ray Hunter, | |||
for review. | for review. | |||
6. IANA Considerations | 6. IANA Considerations | |||
End of changes. 15 change blocks. | ||||
28 lines changed or deleted | 31 lines changed or added | |||
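The router-side improvement described in the draft (parsing the headers quoted in the ICMPv6 payload and reordering the hash elements to match the inward direction of the flow) can be illustrated with a minimal sketch. It is not the authors' implementation and not any vendor's hash. The names ecmp_pick, parse_ptb_inner_flow, ptb_aware_pick and the NEXTHOPS list are hypothetical, SHA-256 stands in for whatever hash a real router uses, and the quoted packet is assumed to carry TCP or UDP directly after the IPv6 header with no extension headers. The quoted packet is read from byte 8 of the ICMPv6 message, after the 4-byte type/code/checksum and the 4-byte MTU field of the RFC 4443 Packet Too Big format.

```python
import hashlib
import ipaddress
import struct

# Hypothetical next hops; a real router would hold forwarding adjacencies here.
NEXTHOPS = ["lb1", "lb2", "lb3", "lb4"]


def ecmp_pick(src, dst, proto, sport, dport, nexthops=NEXTHOPS):
    """Stateless but sticky next-hop choice: hash the flow tuple, take it modulo N."""
    # Normalise the address strings so equivalent spellings hash identically.
    src = str(ipaddress.IPv6Address(src))
    dst = str(ipaddress.IPv6Address(dst))
    key = f"{src}|{dst}|{proto}|{sport}|{dport}".encode()
    digest = hashlib.sha256(key).digest()
    return nexthops[int.from_bytes(digest[:4], "big") % len(nexthops)]


def parse_ptb_inner_flow(icmp6):
    """Return (src, dst, proto, sport, dport) of the packet quoted in a PTB message.

    Assumes the quoted packet has no IPv6 extension headers.
    """
    if icmp6[0] != 2:
        raise ValueError("not an ICMPv6 Packet Too Big message")
    inner = icmp6[8:]                      # skip type, code, checksum and MTU
    proto = inner[6]                       # IPv6 Next Header field
    src = str(ipaddress.IPv6Address(inner[8:24]))
    dst = str(ipaddress.IPv6Address(inner[24:40]))
    sport, dport = struct.unpack("!HH", inner[40:44])
    return src, dst, proto, sport, dport


def ptb_aware_pick(icmp6, nexthops=NEXTHOPS):
    """Hash a PTB on the quoted flow, reversed to the inbound direction."""
    src, dst, proto, sport, dport = parse_ptb_inner_flow(icmp6)
    # The quoted packet travelled server -> client; inbound packets of the
    # same flow travel client -> server, so swap addresses and ports.
    return ecmp_pick(dst, src, proto, dport, sport, nexthops)
```

With this ordering, ptb_aware_pick selects the same next hop that ecmp_pick selects for the client's own packets, which is the property that the outer ICMPv6 header alone cannot provide.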