--- 1/draft-ietf-p2psip-base-06.txt 2010-02-17 18:10:48.000000000 +0100 +++ 2/draft-ietf-p2psip-base-07.txt 2010-02-17 18:10:48.000000000 +0100 @@ -1,24 +1,24 @@ P2PSIP C. Jennings Internet-Draft Cisco Intended status: Standards Track B. Lowekamp, Ed. -Expires: May 13, 2010 MYMIC LLC +Expires: August 21, 2010 Skype E. Rescorla Network Resonance S. Baset H. Schulzrinne Columbia University - November 9, 2009 + February 17, 2010 REsource LOcation And Discovery (RELOAD) Base Protocol - draft-ietf-p2psip-base-06 + draft-ietf-p2psip-base-07 Abstract In this document the term BCP 78 and BCP 79 refer to RFC 3978 and RFC 3979 respectively. They refer only to those RFCs and not to any documents that update or supersede them. This document defines REsource LOcation And Discovery (RELOAD), a peer-to-peer (P2P) signaling protocol for use on the Internet. A P2P signaling protocol provides its clients with an abstract storage and @@ -57,25 +57,25 @@ and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. - This Internet-Draft will expire on May 13, 2010. + This Internet-Draft will expire on August 21, 2010. Copyright Notice - Copyright (c) 2009 IETF Trust and the persons identified as the + Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. 
Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as @@ -149,165 +149,161 @@ 5.4.1. Topology Plugin Requirements . . . . . . . . . . . . 52 5.4.2. Methods and types for use by topology plugins . . . 52 5.4.2.1. Join . . . . . . . . . . . . . . . . . . . . . . 52 5.4.2.2. Leave . . . . . . . . . . . . . . . . . . . . . . 53 5.4.2.3. Update . . . . . . . . . . . . . . . . . . . . . 54 5.4.2.4. Route_Query . . . . . . . . . . . . . . . . . . . 54 5.4.2.5. Probe . . . . . . . . . . . . . . . . . . . . . . 55 5.5. Forwarding and Link Management Layer . . . . . . . . . . 57 5.5.1. Attach . . . . . . . . . . . . . . . . . . . . . . . 58 5.5.1.1. Request Definition . . . . . . . . . . . . . . . 58 - 5.5.1.2. Response Definition . . . . . . . . . . . . . . . 60 + 5.5.1.2. Response Definition . . . . . . . . . . . . . . . 61 5.5.1.3. Using ICE With RELOAD . . . . . . . . . . . . . . 61 5.5.1.4. Collecting STUN Servers . . . . . . . . . . . . . 61 5.5.1.5. Gathering Candidates . . . . . . . . . . . . . . 62 - 5.5.1.6. Encoding the Attach Message . . . . . . . . . . . 62 - 5.5.1.7. Verifying ICE Support . . . . . . . . . . . . . . 63 - 5.5.1.8. Role Determination . . . . . . . . . . . . . . . 63 - 5.5.1.9. Connectivity Checks . . . . . . . . . . . . . . . 63 - 5.5.1.10. Concluding ICE . . . . . . . . . . . . . . . . . 63 - 5.5.1.11. Subsequent Offers and Answers . . . . . . . . . . 64 - 5.5.1.12. Media Keepalives . . . . . . . . . . . . . . . . 64 - 5.5.1.13. Sending Media . . . . . . . . . . . . . . . . . . 64 - 5.5.1.14. Receiving Media . . . . . . . . . . . . . . . . . 64 - 5.5.2. AttachLite . . . . . . . . . . . . . . . . . . . . . 64 - 5.5.2.1. Request Definition . . . . . . . . . . . . . . . 64 - 5.5.2.2. Response Definition . . . . . . . . . . . . . . . 65 - 5.5.2.3. Attach-Lite Connectivity Checks . . . . . . . . . 65 - 5.5.2.4. 
Implementation Notes for Attach-Lite . . . . . . 65 - 5.5.3. AppAttach . . . . . . . . . . . . . . . . . . . . . 66 - 5.5.3.1. Request Definition . . . . . . . . . . . . . . . 66 + 5.5.1.6. Prioritizing Candidates . . . . . . . . . . . . . 63 + 5.5.1.7. Encoding the Attach Message . . . . . . . . . . . 63 + 5.5.1.8. Verifying ICE Support . . . . . . . . . . . . . . 64 + 5.5.1.9. Role Determination . . . . . . . . . . . . . . . 64 + 5.5.1.10. Full ICE . . . . . . . . . . . . . . . . . . . . 64 + 5.5.1.11. No ICE . . . . . . . . . . . . . . . . . . . . . 65 + 5.5.1.12. Subsequent Offers and Answers . . . . . . . . . . 65 + 5.5.1.13. Sending Media . . . . . . . . . . . . . . . . . . 65 + 5.5.1.14. Receiving Media . . . . . . . . . . . . . . . . . 66 + 5.5.2. AppAttach . . . . . . . . . . . . . . . . . . . . . 66 + 5.5.2.1. Request Definition . . . . . . . . . . . . . . . 66 + 5.5.2.2. Response Definition . . . . . . . . . . . . . . . 67 + 5.5.3. Ping . . . . . . . . . . . . . . . . . . . . . . . . 67 + 5.5.3.1. Request Definition . . . . . . . . . . . . . . . 67 5.5.3.2. Response Definition . . . . . . . . . . . . . . . 67 - 5.5.4. AppAttachLite . . . . . . . . . . . . . . . . . . . 67 - 5.5.4.1. Request Definition . . . . . . . . . . . . . . . 67 - 5.5.4.2. Response Definition . . . . . . . . . . . . . . . 68 - 5.5.5. Ping . . . . . . . . . . . . . . . . . . . . . . . . 68 - 5.5.5.1. Request Definition . . . . . . . . . . . . . . . 68 - 5.5.5.2. Response Definition . . . . . . . . . . . . . . . 68 - 5.5.6. Config_Update . . . . . . . . . . . . . . . . . . . 69 - 5.5.6.1. Request Definition . . . . . . . . . . . . . . . 69 - 5.5.6.2. Response Definition . . . . . . . . . . . . . . . 70 + 5.5.4. Config_Update . . . . . . . . . . . . . . . . . . . 68 + 5.5.4.1. Request Definition . . . . . . . . . . . . . . . 68 + 5.5.4.2. Response Definition . . . . . . . . . . . . . . . 69 5.6. Overlay Link Layer . . . . . . . . . . . . . . . . . . . 70 - 5.6.1. 
Future Support for HIP . . . . . . . . . . . . . . . 71 - 5.6.2. Reliability for Unreliable Links . . . . . . . . . . 71 - 5.6.2.1. Framed Message Format . . . . . . . . . . . . . . 72 - 5.6.2.2. Retransmission and Flow Control . . . . . . . . . 73 - 5.6.3. Fragmentation and Reassembly . . . . . . . . . . . . 74 - 6. Data Storage Protocol . . . . . . . . . . . . . . . . . . . . 75 - 6.1. Data Signature Computation . . . . . . . . . . . . . . . 77 - 6.2. Data Models . . . . . . . . . . . . . . . . . . . . . . 77 - 6.2.1. Single Value . . . . . . . . . . . . . . . . . . . . 78 - 6.2.2. Array . . . . . . . . . . . . . . . . . . . . . . . 79 - 6.2.3. Dictionary . . . . . . . . . . . . . . . . . . . . . 79 - 6.3. Access Control Policies . . . . . . . . . . . . . . . . 80 - 6.3.1. USER-MATCH . . . . . . . . . . . . . . . . . . . . . 80 - 6.3.2. NODE-MATCH . . . . . . . . . . . . . . . . . . . . . 80 - 6.3.3. USER-NODE-MATCH . . . . . . . . . . . . . . . . . . 80 - 6.3.4. NODE-MULTIPLE . . . . . . . . . . . . . . . . . . . 80 - 6.4. Data Storage Methods . . . . . . . . . . . . . . . . . . 81 - 6.4.1. Store . . . . . . . . . . . . . . . . . . . . . . . 81 - 6.4.1.1. Request Definition . . . . . . . . . . . . . . . 81 - 6.4.1.2. Response Definition . . . . . . . . . . . . . . . 85 - 6.4.1.3. Removing Values . . . . . . . . . . . . . . . . . 86 - 6.4.2. Fetch . . . . . . . . . . . . . . . . . . . . . . . 87 - 6.4.2.1. Request Definition . . . . . . . . . . . . . . . 88 - 6.4.2.2. Response Definition . . . . . . . . . . . . . . . 90 - 6.4.3. Stat . . . . . . . . . . . . . . . . . . . . . . . . 90 - 6.4.3.1. Request Definition . . . . . . . . . . . . . . . 91 - 6.4.3.2. Response Definition . . . . . . . . . . . . . . . 91 - 6.4.4. Find . . . . . . . . . . . . . . . . . . . . . . . . 93 - 6.4.4.1. Request Definition . . . . . . . . . . . . . . . 93 - 6.4.4.2. Response Definition . . . . . . . . . . . . . . . 93 - 6.4.5. Defining New Kinds . . . . . . . . . . . . . . . . . 
94 - 7. Certificate Store Usage . . . . . . . . . . . . . . . . . . . 95 - 8. TURN Server Usage . . . . . . . . . . . . . . . . . . . . . . 96 - 9. Chord Algorithm . . . . . . . . . . . . . . . . . . . . . . . 97 - 9.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . 98 - 9.2. Routing . . . . . . . . . . . . . . . . . . . . . . . . 98 - 9.3. Redundancy . . . . . . . . . . . . . . . . . . . . . . . 99 - 9.4. Joining . . . . . . . . . . . . . . . . . . . . . . . . 99 - 9.5. Routing Attaches . . . . . . . . . . . . . . . . . . . . 100 - 9.6. Updates . . . . . . . . . . . . . . . . . . . . . . . . 100 - 9.6.1. Handling Neighbor Failures . . . . . . . . . . . . . 102 - 9.6.2. Handling Finger Table Entry Failure . . . . . . . . 103 - 9.6.3. Receiving Updates . . . . . . . . . . . . . . . . . 103 - 9.6.4. Stabilization . . . . . . . . . . . . . . . . . . . 104 - 9.6.4.1. Updating neighbor table . . . . . . . . . . . . . 104 - 9.6.4.2. Refreshing finger table . . . . . . . . . . . . . 104 - 9.6.4.3. Adjusting finger table size . . . . . . . . . . . 105 - 9.6.4.4. Detecting partitioning . . . . . . . . . . . . . 106 - 9.7. Route Query . . . . . . . . . . . . . . . . . . . . . . 106 - 9.8. Leaving . . . . . . . . . . . . . . . . . . . . . . . . 107 - - 10. Enrollment and Bootstrap . . . . . . . . . . . . . . . . . . 108 - 10.1. Overlay Configuration . . . . . . . . . . . . . . . . . 108 - 10.1.1. Relax NG Grammar . . . . . . . . . . . . . . . . . . 112 - 10.2. Discovery Through Enrollment Server . . . . . . . . . . 114 - 10.3. Credentials . . . . . . . . . . . . . . . . . . . . . . 115 - 10.3.1. Self-Generated Credentials . . . . . . . . . . . . . 116 - 10.4. Searching for a Bootstrap Node . . . . . . . . . . . . . 117 - 10.5. Contacting a Bootstrap Node . . . . . . . . . . . . . . 117 - 11. Message Flow Example . . . . . . . . . . . . . . . . . . . . 118 - 12. Security Considerations . . . . . . . . . . . . . . . . . . . 123 - 12.1. Overview . . . . . . . . . . . . 
. . . . . . . . . . . . 123 - 12.2. Attacks on P2P Overlays . . . . . . . . . . . . . . . . 124 - 12.3. Certificate-based Security . . . . . . . . . . . . . . . 124 - 12.4. Shared-Secret Security . . . . . . . . . . . . . . . . . 125 - 12.5. Storage Security . . . . . . . . . . . . . . . . . . . . 126 - 12.5.1. Authorization . . . . . . . . . . . . . . . . . . . 126 - 12.5.2. Distributed Quota . . . . . . . . . . . . . . . . . 127 - 12.5.3. Correctness . . . . . . . . . . . . . . . . . . . . 127 - 12.5.4. Residual Attacks . . . . . . . . . . . . . . . . . . 127 - 12.6. Routing Security . . . . . . . . . . . . . . . . . . . . 128 - 12.6.1. Background . . . . . . . . . . . . . . . . . . . . . 128 - 12.6.2. Admissions Control . . . . . . . . . . . . . . . . . 129 - 12.6.3. Peer Identification and Authentication . . . . . . . 129 - 12.6.4. Protecting the Signaling . . . . . . . . . . . . . . 130 - 12.6.5. Residual Attacks . . . . . . . . . . . . . . . . . . 130 - 13. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 131 - 13.1. Port Registrations . . . . . . . . . . . . . . . . . . . 131 - 13.2. Overlay Algorithm Types . . . . . . . . . . . . . . . . 131 - 13.3. Access Control Policies . . . . . . . . . . . . . . . . 131 - 13.4. Data Kind-ID . . . . . . . . . . . . . . . . . . . . . . 132 - 13.5. Data Model . . . . . . . . . . . . . . . . . . . . . . . 132 - 13.6. Message Codes . . . . . . . . . . . . . . . . . . . . . 133 - 13.7. Error Codes . . . . . . . . . . . . . . . . . . . . . . 134 - 13.8. Transport Types . . . . . . . . . . . . . . . . . . . . 134 - 13.9. Forwarding Options . . . . . . . . . . . . . . . . . . . 134 - 13.10. Probe Information Types . . . . . . . . . . . . . . . . 135 - 13.11. Message Extensions . . . . . . . . . . . . . . . . . . . 135 - 13.12. reload URI Scheme . . . . . . . . . . . . . . . . . . . 135 - 13.12.1. URI Registration . . . . . . . . . . . . . . . . . . 136 - 14. Acknowledgments . . . . . . . . . . . . . . . . . . . . . 
. . 137 - 15. References . . . . . . . . . . . . . . . . . . . . . . . . . 137 - 15.1. Normative References . . . . . . . . . . . . . . . . . . 137 - 15.2. Informative References . . . . . . . . . . . . . . . . . 138 - Appendix A. Change Log . . . . . . . . . . . . . . . . . . . . . 141 - A.1. Changes since draft-ietf-p2psip-reload-04 . . . . . . . 141 - A.2. Changes since draft-ietf-p2psip-reload-01 . . . . . . . 141 - A.3. Changes since draft-ietf-p2psip-reload-00 . . . . . . . 142 - A.4. Changes since draft-ietf-p2psip-base-00 . . . . . . . . 142 - A.5. Changes since draft-ietf-p2psip-base-01 . . . . . . . . 142 - A.6. Changes since draft-ietf-p2psip-base-01a . . . . . . . . 142 - A.7. Changes since draft-ietf-p2psip-base-02 . . . . . . . . 142 - Appendix B. AIMD Retransmission Scheme . . . . . . . . . . . . . 143 - Appendix C. TFRC Retransmission Scheme . . . . . . . . . . . . . 143 - Appendix D. Routing Alternatives . . . . . . . . . . . . . . . . 144 - D.1. Iterative vs Recursive . . . . . . . . . . . . . . . . . 144 - D.2. Symmetric vs Forward response . . . . . . . . . . . . . 144 - D.3. Direct Response . . . . . . . . . . . . . . . . . . . . 145 - D.4. Relay Peers . . . . . . . . . . . . . . . . . . . . . . 146 - D.5. Symmetric Route Stability . . . . . . . . . . . . . . . 146 - Appendix E. Why Clients? . . . . . . . . . . . . . . . . . . . . 147 - E.1. Why Not Only Peers? . . . . . . . . . . . . . . . . . . 147 - E.2. Clients as Application-Level Agents . . . . . . . . . . 148 - Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 148 + 5.6.1. Future Overlay Link Protocols . . . . . . . . . . . 71 + 5.6.1.1. HIP . . . . . . . . . . . . . . . . . . . . . . . 71 + 5.6.1.2. ICE-TCP . . . . . . . . . . . . . . . . . . . . . 71 + 5.6.1.3. Message-oriented Transports . . . . . . . . . . . 71 + 5.6.1.4. Tunneled Transports . . . . . . . . . . . . . . . 71 + 5.6.2. Framing Header . . . . . . . . . . . . . . . . . . . 72 + 5.6.3. Simple Reliability . . 
. . . . . . . . . . . . . . . 73 + 5.6.3.1. Retransmission and Flow Control . . . . . . . . . 74 + 5.6.4. DTLS/UDP with SR . . . . . . . . . . . . . . . . . . 75 + 5.6.5. TLS/TCP with FH, no ICE . . . . . . . . . . . . . . 75 + 5.6.6. DTLS/UDP with SR, no ICE . . . . . . . . . . . . . . 76 + 5.7. Fragmentation and Reassembly . . . . . . . . . . . . . . 76 + 6. Data Storage Protocol . . . . . . . . . . . . . . . . . . . . 77 + 6.1. Data Signature Computation . . . . . . . . . . . . . . . 78 + 6.2. Data Models . . . . . . . . . . . . . . . . . . . . . . 79 + 6.2.1. Single Value . . . . . . . . . . . . . . . . . . . . 80 + 6.2.2. Array . . . . . . . . . . . . . . . . . . . . . . . 81 + 6.2.3. Dictionary . . . . . . . . . . . . . . . . . . . . . 81 + 6.3. Access Control Policies . . . . . . . . . . . . . . . . 82 + 6.3.1. USER-MATCH . . . . . . . . . . . . . . . . . . . . . 82 + 6.3.2. NODE-MATCH . . . . . . . . . . . . . . . . . . . . . 82 + 6.3.3. USER-NODE-MATCH . . . . . . . . . . . . . . . . . . 82 + 6.3.4. NODE-MULTIPLE . . . . . . . . . . . . . . . . . . . 82 + 6.4. Data Storage Methods . . . . . . . . . . . . . . . . . . 83 + 6.4.1. Store . . . . . . . . . . . . . . . . . . . . . . . 83 + 6.4.1.1. Request Definition . . . . . . . . . . . . . . . 83 + 6.4.1.2. Response Definition . . . . . . . . . . . . . . . 87 + 6.4.1.3. Removing Values . . . . . . . . . . . . . . . . . 88 + 6.4.2. Fetch . . . . . . . . . . . . . . . . . . . . . . . 89 + 6.4.2.1. Request Definition . . . . . . . . . . . . . . . 90 + 6.4.2.2. Response Definition . . . . . . . . . . . . . . . 92 + 6.4.3. Stat . . . . . . . . . . . . . . . . . . . . . . . . 92 + 6.4.3.1. Request Definition . . . . . . . . . . . . . . . 93 + 6.4.3.2. Response Definition . . . . . . . . . . . . . . . 93 + 6.4.4. Find . . . . . . . . . . . . . . . . . . . . . . . . 95 + 6.4.4.1. Request Definition . . . . . . . . . . . . . . . 95 + 6.4.4.2. Response Definition . . . . . . . . . . . . . . . 95 + 6.4.5. 
Defining New Kinds . . . . . . . . . . . . . . . . . 96 + 7. Certificate Store Usage . . . . . . . . . . . . . . . . . . . 97 + 8. TURN Server Usage . . . . . . . . . . . . . . . . . . . . . . 98 + 9. Chord Algorithm . . . . . . . . . . . . . . . . . . . . . . . 99 + 9.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . 100 + 9.2. Routing . . . . . . . . . . . . . . . . . . . . . . . . 101 + 9.3. Redundancy . . . . . . . . . . . . . . . . . . . . . . . 101 + 9.4. Joining . . . . . . . . . . . . . . . . . . . . . . . . 102 + 9.5. Routing Attaches . . . . . . . . . . . . . . . . . . . . 102 + 9.6. Updates . . . . . . . . . . . . . . . . . . . . . . . . 103 + 9.6.1. Handling Neighbor Failures . . . . . . . . . . . . . 104 + 9.6.2. Handling Finger Table Entry Failure . . . . . . . . 105 + 9.6.3. Receiving Updates . . . . . . . . . . . . . . . . . 105 + 9.6.4. Stabilization . . . . . . . . . . . . . . . . . . . 106 + 9.6.4.1. Updating neighbor table . . . . . . . . . . . . . 106 + 9.6.4.2. Refreshing finger table . . . . . . . . . . . . . 106 + 9.6.4.3. Adjusting finger table size . . . . . . . . . . . 107 + 9.6.4.4. Detecting partitioning . . . . . . . . . . . . . 108 + 9.7. Route Query . . . . . . . . . . . . . . . . . . . . . . 108 + 9.8. Leaving . . . . . . . . . . . . . . . . . . . . . . . . 109 + 10. Enrollment and Bootstrap . . . . . . . . . . . . . . . . . . 110 + 10.1. Overlay Configuration . . . . . . . . . . . . . . . . . 110 + 10.1.1. Relax NG Grammar . . . . . . . . . . . . . . . . . . 115 + 10.2. Discovery Through Enrollment Server . . . . . . . . . . 117 + 10.3. Credentials . . . . . . . . . . . . . . . . . . . . . . 117 + 10.3.1. Self-Generated Credentials . . . . . . . . . . . . . 118 + 10.4. Searching for a Bootstrap Node . . . . . . . . . . . . . 119 + 10.5. Contacting a Bootstrap Node . . . . . . . . . . . . . . 119 + 11. Message Flow Example . . . . . . . . . . . . . . . . . . . . 120 + 12. Security Considerations . . . . . . . . . . . . . . 
. . . . . 125 + 12.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . 125 + 12.2. Attacks on P2P Overlays . . . . . . . . . . . . . . . . 126 + 12.3. Certificate-based Security . . . . . . . . . . . . . . . 126 + 12.4. Shared-Secret Security . . . . . . . . . . . . . . . . . 127 + 12.5. Storage Security . . . . . . . . . . . . . . . . . . . . 128 + 12.5.1. Authorization . . . . . . . . . . . . . . . . . . . 128 + 12.5.2. Distributed Quota . . . . . . . . . . . . . . . . . 129 + 12.5.3. Correctness . . . . . . . . . . . . . . . . . . . . 129 + 12.5.4. Residual Attacks . . . . . . . . . . . . . . . . . . 129 + 12.6. Routing Security . . . . . . . . . . . . . . . . . . . . 130 + 12.6.1. Background . . . . . . . . . . . . . . . . . . . . . 130 + 12.6.2. Admissions Control . . . . . . . . . . . . . . . . . 131 + 12.6.3. Peer Identification and Authentication . . . . . . . 131 + 12.6.4. Protecting the Signaling . . . . . . . . . . . . . . 132 + 12.6.5. Residual Attacks . . . . . . . . . . . . . . . . . . 132 + 13. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 133 + 13.1. Port Registrations . . . . . . . . . . . . . . . . . . . 133 + 13.2. Overlay Algorithm Types . . . . . . . . . . . . . . . . 133 + 13.3. Access Control Policies . . . . . . . . . . . . . . . . 133 + 13.4. Data Kind-ID . . . . . . . . . . . . . . . . . . . . . . 134 + 13.5. Data Model . . . . . . . . . . . . . . . . . . . . . . . 134 + 13.6. Message Codes . . . . . . . . . . . . . . . . . . . . . 135 + 13.7. Error Codes . . . . . . . . . . . . . . . . . . . . . . 136 + 13.8. Overlay Link Types . . . . . . . . . . . . . . . . . . . 136 + 13.9. Forwarding Options . . . . . . . . . . . . . . . . . . . 136 + 13.10. Probe Information Types . . . . . . . . . . . . . . . . 137 + 13.11. Message Extensions . . . . . . . . . . . . . . . . . . . 137 + 13.12. reload URI Scheme . . . . . . . . . . . . . . . . . . . 137 + 13.12.1. URI Registration . . . . . . . . . . . . . . . . . . 138 + 14. 
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 139 + 15. References . . . . . . . . . . . . . . . . . . . . . . . . . 139 + 15.1. Normative References . . . . . . . . . . . . . . . . . . 139 + 15.2. Informative References . . . . . . . . . . . . . . . . . 140 + Appendix A. Change Log . . . . . . . . . . . . . . . . . . . . . 144 + A.1. Changes since draft-ietf-p2psip-reload-04 . . . . . . . 144 + A.2. Changes since draft-ietf-p2psip-reload-01 . . . . . . . 144 + A.3. Changes since draft-ietf-p2psip-reload-00 . . . . . . . 144 + A.4. Changes since draft-ietf-p2psip-base-00 . . . . . . . . 144 + A.5. Changes since draft-ietf-p2psip-base-01 . . . . . . . . 144 + A.6. Changes since draft-ietf-p2psip-base-01a . . . . . . . . 145 + A.7. Changes since draft-ietf-p2psip-base-02 . . . . . . . . 145 + Appendix B. Routing Alternatives . . . . . . . . . . . . . . . . 145 + B.1. Iterative vs Recursive . . . . . . . . . . . . . . . . . 145 + B.2. Symmetric vs Forward response . . . . . . . . . . . . . 146 + B.3. Direct Response . . . . . . . . . . . . . . . . . . . . 146 + B.4. Relay Peers . . . . . . . . . . . . . . . . . . . . . . 147 + B.5. Symmetric Route Stability . . . . . . . . . . . . . . . 148 + Appendix C. Why Clients? . . . . . . . . . . . . . . . . . . . . 148 + C.1. Why Not Only Peers? . . . . . . . . . . . . . . . . . . 149 + C.2. Clients as Application-Level Agents . . . . . . . . . . 149 + Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 149 1. Introduction This document defines REsource LOcation And Discovery (RELOAD), a peer-to-peer (P2P) signaling protocol for use on the Internet. It provides a generic, self-organizing overlay network service, allowing nodes to efficiently route messages to other nodes and to efficiently store and retrieve data in the overlay. 
RELOAD provides several features that are critical for a successful P2P protocol for the Internet: @@ -487,21 +483,21 @@ Forwarding and Link Management Layer: Stores and implements the routing table by providing packet forwarding services between nodes. It also handles establishing new links between nodes, including setting up connections across NATs using ICE. Overlay Link Layer: TLS [RFC5246] and DTLS [RFC4347] are the "link layer" protocols used by RELOAD for hop-by-hop communication. Each such protocol includes the appropriate provisions for per-hop framing or hop-by-hop ACKs required by unreliable transports. - To further clarify the roles of the various layer, this figure + To further clarify the roles of the various layers, this figure parallels the architecture with each layer's role from an overlay perspective and implementation layer in the internet: | Internet Model | Real | Equivalent | Reload Internet | in Overlay | Architecture --------------+-----------------+------------------------------------ | | +-------+ +-------+ | Application | | SIP | | XMPP | ... | | | Usage | | Usage | @@ -585,21 +581,21 @@ 1.2.3. Storage One of the major functions of RELOAD is to allow nodes to store data in the overlay and to retrieve data stored by other nodes or by themselves. The Storage component is responsible for processing data storage and retrieval messages. For instance, the Storage component might receive a Store request for a given resource from the Message Transport. It would then query the appropriate usage before storing the data value(s) in its local data store and sending a response to - the Message Transport for delivery to the requesting peer. + the Message Transport for delivery to the requesting node. Typically, these messages will come from other nodes, but depending on the overlay topology, a node might be responsible for storing data for itself as well, especially if the overlay is small. 
A peer's Node-ID determines the set of resources that it will be responsible for storing. However, the exact mapping between these is determined by the overlay algorithm in use. The Storage component will only receive a Store request from the Message Transport if this peer is responsible for that Resource-ID. The Storage component is notified by the Topology Plugin when the Resource-IDs for which it is @@ -720,22 +716,20 @@ 2. Terminology The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119]. We use the terminology and definitions from the Concepts and Terminology for Peer to Peer SIP [I-D.ietf-p2psip-concepts] draft extensively in this document. Other terms used in this document are defined inline when used and are also defined below for reference. - Terms which are new to this document (and perhaps should be added to - the concepts document) are marked with a (*). DHT: A distributed hash table. A DHT is an abstract hash table service realized by storing the contents of the hash table across a set of peers. Overlay Algorithm: An overlay algorithm defines the rules for determining which peers in an overlay store a particular piece of data and for determining a topology of interconnections amongst peers in order to find a piece of data. @@ -746,43 +740,48 @@ Peer: A host that is participating in the overlay. Peers are responsible for holding some portion of the data that has been stored in the overlay and also route messages on behalf of other hosts as required by the Overlay Algorithm. Client: A host that is able to store data in and retrieve data from the overlay but which is not participating in routing or data storage for the overlay. + Kind: A Kind defines a particular type of data that can be stored in + the overlay. Applications define new Kinds to store the data they + use.
Each Kind is identified with a unique IANA-assigned integer + called a Kind-ID. + Node: We use the term "Node" to refer to a host that may be either a Peer or a Client. Because RELOAD uses the same protocol for both clients and peers, much of the text applies equally to both. Therefore we use "Node" when the text applies to both Clients and Peers and the more specific term (i.e. client or peer) when the text applies only to Clients or only to Peers. Node-ID: A 128-bit value that uniquely identifies a node. Node-IDs 0 and 2^128 - 1 are reserved and are invalid Node-IDs. A value of zero is not used in the wire protocol but can be used to indicate an invalid node in implementations and APIs. The Node-ID of - 2^128-1 is used on the wire protocol as a wildcard. (*) + 2^128-1 is used on the wire protocol as a wildcard. Resource: An object or group of objects associated with a string identifier. See "Resource Name" below. Resource Name: The potentially human readable name by which a resource is identified. In unstructured P2P networks, the resource name is sometimes used directly as a Resource-ID. In structured P2P networks the resource name is typically mapped into a Resource-ID by using the string as the input to hash function. A SIP resource, for example, is often identified by its AOR which - is an example of a Resource Name.(*) + is an example of a Resource Name. Resource-ID: A value that identifies some resources and which is used as a key for storing and retrieving the resource. Often this is not human friendly/readable. One way to generate a Resource-ID is by applying a mapping function to some other unique name (e.g., user name or service name) for the resource. The Resource-ID is used by the distributed database algorithm to determine the peer or peers that are responsible for storing the data for the overlay. In structured P2P networks, Resource-IDs are generally fixed length and are formed by hashing the resource name.
In @@ -792,30 +791,29 @@ Connection Table: The set of nodes to which a node is directly connected. This includes nodes with which Attach handshakes have been done but which have not sent any Updates. Routing Table: The set of peers which a node can use to route overlay messages. In general, these peers will all be on the connection table but not vice versa, because some peers will have Attached but not sent updates. Peers may send messages directly to peers that are in the connection table but may only route messages to other peers through peers that are in the routing - table. (*) + table. Destination List: A list of IDs through which a message is to be routed. A single Node-ID is a trivial form of destination list. - (*) Usage: A usage is an application that wishes to use the overlay for some purpose. Each application wishing to use the overlay defines a set of data kinds that it wishes to use. The SIP usage defines - the location data kind. (*) + the location data kind. The term "maximum request lifetime" is the maximum time a request will wait for a response; it defaults to 15 seconds. The term "successor replacement hold-down time" is the amount of time to wait before starting replication when a new successor is found; it defaults to 30 seconds. 3. Overlay Management Overview The most basic function of RELOAD is as a generic overlay network. @@ -916,21 +914,21 @@ concept. From the perspective of a peer, a client is simply a node which has not yet sent any Updates or Joins. It might never do so (if it's a client) or it might eventually do so (if it's just a node that's taking a long time to join). The routing and storage rules for RELOAD provide for correct behavior by peers regardless of whether other nodes attached to them are clients or peers. Of course, a client implementation must know that it intends to be a client, but this localizes complexity only to that node. For more discussion of the motivation for RELOAD's client support, - see Appendix E. 
+ see Appendix C. 3.2.1. Client Routing There are two routing options by which a client may be located in an overlay. o Establish a connection to the peer responsible for the client's Node-ID in the overlay. Then requests may be sent from/to the client using its Node-ID in the same manner as if it were a peer, because the responsible peer in the overlay will handle the final @@ -954,42 +952,42 @@ requests from other members of the overlay. 3.2.2. Minimum Functionality Requirements for Clients A node may act as a client simply because it does not have the resources or even an implementation of the topology plugin required to act as a peer in the overlay. In order to exchange RELOAD messages with a peer, a client must meet a minimum level of functionality. Such a client must: - o Implement RELOAD's connection-management connections that are used + o Implement RELOAD's connection-management operations that are used to establish the connection with the peer. o Implement RELOAD's data retrieval methods (with client functionality). o Be able to calculate Resource-IDs used by the overlay. o Possess security credentials required by the overlay it is implementing. A client speaks the same protocol as the peers, knows how to calculate Resource-IDs, and signs its requests in the same manner as peers. While a client does not necessarily require a full implementation of the overlay algorithm, calculating the Resource-ID requires an implementation of the appropriate algorithm for the overlay. 3.3. Routing This section will discuss the requirements RELOAD's routing capabilities must meet, then describe the routing features in the protocol, and then provide a brief overview of how they are used. - Appendix D discusses some alternative designs and the tradeoffs that + Appendix B discusses some alternative designs and the tradeoffs that would be necessary to support them. 
RELOAD's routing capabilities must meet the following requirements: NAT Traversal: RELOAD must support establishing and using connections between nodes separated by one or more NATs, including locating peers behind NATs for those overlays allowing/requiring it. Clients: RELOAD must support requests from and to clients that do not participate in overlay routing. @@ -1079,24 +1077,21 @@ <---------- Dest=B, A <---------- Dest=A RELOAD also supports a basic Iterative routing mode (where the intermediate peers merely return a response indicating the next hop, but do not actually forward the message to that next hop themselves). Iterative routing is implemented using the Route_Query method, which requests this behavior. Note that iterative routing is selected only - by the initiating node. RELOAD does not support an intermediate peer - returning a response that it will not recursively route a normal - request. The willingness to perform that operation is implicit in - its role as a peer in the overlay. + by the initiating node. 3.4. Connectivity Management In order to provide efficient routing, a peer needs to maintain a set of direct connections to other peers in the Overlay Instance. Due to the presence of NATs, these connections often cannot be formed directly. Instead, we use the Attach request to establish a connection. Attach uses ICE [I-D.ietf-mmusic-ice] to establish the connection. It is assumed that the reader is familiar with ICE. @@ -1141,21 +1136,21 @@ and Leave. However, the contents of those messages, when they are sent, and their precise semantics are specified by the actual overlay algorithm; RELOAD merely provides a framework of commonly-needed methods that provides uniformity of notation (and ease of debugging) for a variety of overlay algorithms. 3.5.2. Joining, Leaving, and Maintenance Overview When a new peer wishes to join the Overlay Instance, it must have a Node-ID that it is allowed to use. 
When an enrollment server is used - that Node-Id will be in the certificate the node received from the + that Node-ID will be in the certificate the node received from the enrollment server. The details of the joining procedure are defined by the overlay algorithm, but the general steps for joining an Overlay Instance are: o Forming connections to some other peers. o Acquiring the data values this peer is responsible for storing. o Informing the other peers which were previously responsible for that data that this peer has taken over responsibility. The first thing the peer needs to do is to form a connection to some @@ -1218,27 +1213,30 @@ and perhaps a username and password, and leverage that into having a working peer with minimal user intervention. This helps avoid the problems that have been experienced with conventional SIP clients where users are required to manually configure a large number of settings. 3.6.1. Initial Configuration In the first phase of the process, the user starts out with the name of the overlay and uses this to download an initial set of overlay - configuration parameters. The user does a DNS SRV lookup on the + configuration parameters. The node does a DNS SRV lookup on the overlay name to get the address of a configuration server. It can then connect to this server with HTTPS to download a configuration document which contains the basic overlay configuration parameters as well as a set of bootstrap nodes which can be used to join the overlay. + If a node already has the valid configuration document that it + received by some out of band method, this step can be skipped. + 3.6.2. Enrollment If the overlay is using centralized enrollment, then a user needs to acquire a certificate before joining the overlay. The certificate attests both to the user's name within the overlay and to the Node- IDs which they are permitted to operate. 
In that case, the configuration document will contain the address of an enrollment server which can be used to obtain such a certificate. The enrollment server may (and probably will) require some sort of username and password before issuing the certificate. The enrollment @@ -1365,20 +1363,22 @@ order they were stored in corresponds to the stored time values associated with (and carried in) their values. Because the stored time values are those associated with the peer which did the writing, clock skew is generally not an issue. If two nodes are on different partitions, write to the same location, and have clock skew, this can create merge conflicts. However because RELOAD deliberately segregates storage so that data from different users and peers is stored in different locations, and a single peer will typically only be in a single network partition, this case will generally not arise. + o Defines the types of connections that can be initiated using + AppAttach. The kinds defined by a usage may also be applied to other usages. However, a need for different parameters, such as different size limits, would imply the need to create a new kind. 4.1.3. Replication Replication in P2P overlays can be used to provide: persistence: if the responsible peer crashes and/or if the storing @@ -1565,21 +1565,21 @@ which is the default algorithm used by nodes to route messages through the overlay. All implementations MUST implement this routing algorithm. An overlay may be configured to use alternative routing algorithms, and alternative routing algorithms may be selected on a per-message basis. 5.2.1. Request Origination In order to originate a message to a given Node-ID or Resource-ID, a node constructs an appropriate destination list. The simplest such - destination list is a single entry containing the peer or + destination list is a single entry containing the Node-ID or Resource-ID. 
The resulting message will use the normal overlay routing mechanisms to forward the message to that destination. The node can also construct a more complicated destination list for source routing. Once the message is constructed, the node sends the message to some adjacent peer. If the first entry on the destination list is directly connected, then the message MUST be routed down that connection. Otherwise, the topology plugin MUST be consulted to determine the appropriate next hop. @@ -1594,38 +1594,38 @@ Because messages may be lost in transit through the overlay, RELOAD incorporates an end-to-end reliability mechanism. When an originating node transmits a request it MUST set a 3 second timer. If a response has not been received when the timer fires, the request is retransmitted with the same transaction identifier. The request MAY be retransmitted up to 4 times (for a total of 5 messages). After the timer for the fifth transmission fires, the message SHALL be considered to have failed. Note that this retransmission procedure is not followed by intermediate nodes. They follow the - hop-by-hop reliability procedure described in Section 5.6.2. + hop-by-hop reliability procedure described in Section 5.6.3. The above algorithm can result in multiple requests being delivered to a node. Receiving nodes MUST generate semantically equivalent responses to retransmissions of the same request (this can be determined by transaction id) if the request is received within the maximum request lifetime (15 seconds). For some requests (e.g., Fetch) this can be accomplished merely by processing the request again. For other requests, (e.g., Store) it may be necessary to maintain state for the duration of the request lifetime. 5.2.2. Response Origination - When a peer sends a response to a request, it MUST construct the - destination list by reversing the order of the entries on the via - list. 
This has the result that the response traverses the same peers - as the request traversed, except in reverse order (symmetric - routing). + When a peer sends a response to a request using this routing + algorithm, it MUST construct the destination list by reversing the + order of the entries on the via list. This has the result that the + response traverses the same peers as the request traversed, except in + reverse order (symmetric routing). 5.3. Message Structure RELOAD is a message-oriented request/response protocol. The messages are encoded using binary fields. All integers are represented in network byte order. The general philosophy behind the design was to use Type, Length, Value fields to allow for extensibility. However, for the parts of a structure that were required in all messages, we just define these in a fixed position, as adding a type and length for them is unnecessary and would simply increase bandwidth and @@ -1668,24 +1668,20 @@ o It is easy to write and familiar enough looking that most readers can grasp it quickly. o The ability to define nested structures allows a separation between high-level and low-level message structures. o It has a straightforward wire encoding that allows quick implementation, but the structures can be comprehended without knowing the encoding. o The ability to mechanically (compile) encoders and decoders. - This presentation is to some extent a placeholder. We consider it an - open question what the final protocol definition method and encodings - use. We expect this to be a question for the WG to decide. - Several idiosyncrasies of this language are worth noting. o All lengths are denoted in bytes, not objects. o Variable length values are denoted like arrays with angle brackets. o "select" is used to indicate variant structures. For instance, "uint16 array<0..2^8-2>;" represents up to 254 bytes but only up to 127 values of two bytes (16 bits) each. 
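As a non-normative illustration of the byte-counted length convention noted above, the following Python sketch encodes a value of type "uint16 array<0..2^8-2>". The helper name encode_uint16_vector is hypothetical. The single length byte is an assumption carried over from the TLS presentation language, where the length prefix is just wide enough to hold the maximum byte length.

```python
import struct

MAX_BYTES = 2**8 - 2  # ceiling from "uint16 array<0..2^8-2>": 254 bytes


def encode_uint16_vector(values):
    """Encode uint16 values as a variable-length vector.

    Assumption: one length byte precedes the body, because the
    maximum byte length (254) fits in a single byte.
    """
    # Each element is two bytes in network byte order.
    body = b"".join(struct.pack("!H", v) for v in values)
    if len(body) > MAX_BYTES:
        raise ValueError("vector exceeds %d bytes" % MAX_BYTES)
    # The prefix counts bytes, not elements.
    return bytes([len(body)]) + body


encoded = encode_uint16_vector([1, 2, 3])
print(encoded.hex())  # 06000100020003
```

Three elements produce a length byte of 6, not 3. A list of 127 elements fills the 254-byte ceiling, and a 128th element is rejected.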
@@ -1712,25 +1708,25 @@ A NodeId is a fixed-length 128-bit structure represented as a series of bytes, with the most significant byte first. Note: the use of "typedef" here is an extension to the TLS language, but its meaning should be relatively obvious. A ResourceId, shown below, represents a single Resource-ID. typedef opaque ResourceId<0..2^8-1>; - Like a NodeId, a Resource-ID is an opaque string of bytes, but unlike - Node-IDs, Resource-IDs are variable length, up to 255 bytes (2040 - bits) in length. On the wire, each ResourceId is preceded by a - single length byte (allowing lengths up to 255). Thus, the 3-byte - value "Foo" would be encoded as: 03 46 4f 4f. + Like a NodeId, a ResourceId is an opaque string of bytes, but unlike + NodeIds, ResourceIds are variable length, up to 255 bytes (2040 bits) + in length. On the wire, each ResourceId is preceded by a single + length byte (allowing lengths up to 255). Thus, the 3-byte value + "Foo" would be encoded as: 03 46 4f 4f. A more complicated example is IpAddressPort, which represents a network address and can be used to carry either an IPv6 or IPv4 address: enum {reserved_addr(0), ipv4_address (1), ipv6_address (2), (255)} AddressType; struct { uint32 addr; @@ -1819,39 +1815,42 @@ overlay: The 32 bit checksum/hash of the overlay being used. The variable length string representing the overlay name is hashed with SHA-1 and the low order 32 bits are used. The purpose of this field is to allow nodes to participate in multiple overlays and to detect accidental misconfiguration. This is not a security critical function. configuration_sequence: The sequence number of the configuration file. - version: The version of the RELOAD protocol being used. This - document describes version 0.1, with a value of 0x01. + version: The version of the RELOAD protocol being used. This is a + fixed point integer between 0.1 and 25.4. This document + describes version 0.1, with a value of 0x01.
[[ Note to RFC + Editor: Please update this to version 1.0 with value of 0x0a and + remove this note. ]] ttl: An 8 bit field indicating the number of iterations, or hops, a message can experience before it is discarded. The TTL value MUST be decremented by one at every hop along the route the message traverses. If the TTL is 0, the message MUST NOT be propagated further and MUST be discarded, and a "Error_TTL_Exceeded" error should be generated. The initial value of the TTL SHOULD be 100 unless defined otherwise by the overlay configuration. fragment: This field is used to handle fragmentation. The high order two bits are used to indicate the fragmentation status: If the high bit (0x80000000) is set, it indicates that the message is a fragment. If the next bit (0x40000000) is set, it indicates that this is the last fragment. The next six bits (0x20000000 to 0x01000000) are reserved and SHOULD be set to zero. The remainder of the field is used to indicate the fragment offset; see - Section 5.6.3 + Section 5.7 length: The count in bytes of the size of the message, including the header. transaction_id: A unique 64 bit number that identifies this transaction and also allows receivers to disambiguate transactions which are otherwise identical. Responses use the same Transaction ID as the request they correspond to. Transaction IDs are also used for fragment reassembly. @@ -2057,21 +2055,21 @@ enum { (2^16-1) } MessageExtensionType; struct { MessageExtensionType type; Boolean critical; opaque extension_contents<0..2^32-1>; } MessageExtension; struct { - MessageCode message_code; + uint16 message_code; opaque message_body<0..2^32-1>; MessageExtensions extensions<0..2^32-1>; } MessageContents; The contents of this structure are as follows: message_code This indicates the message that is being sent. The code space is broken up as follows. 
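The version and fragment header fields described above can be sketched as follows. This is a non-normative Python illustration and all function names are hypothetical. The mapping "wire value = ten times the version number" is an assumption inferred from the two data points in the text (0.1 as 0x01 and 1.0 as 0x0a); the bit masks follow the fragment field description directly.

```python
FRAGMENT_FLAG = 0x80000000       # high bit: the message is a fragment
LAST_FRAGMENT_FLAG = 0x40000000  # next bit: this is the last fragment
OFFSET_MASK = 0x00FFFFFF         # low 24 bits: fragment offset
# The six bits in between (0x3F000000) are reserved and left as zero.


def encode_version(major, minor):
    """Map version major.minor to one byte (assumed: major*10 + minor)."""
    value = major * 10 + minor
    if not (0 <= minor <= 9 and 1 <= value <= 254):
        raise ValueError("version must lie between 0.1 and 25.4")
    return value


def pack_fragment(offset, is_fragment, is_last):
    """Build the 32-bit fragment field from its parts."""
    if offset & ~OFFSET_MASK:
        raise ValueError("offset does not fit in 24 bits")
    field = offset
    if is_fragment:
        field |= FRAGMENT_FLAG
    if is_last:
        field |= LAST_FRAGMENT_FLAG
    return field


def unpack_fragment(field):
    """Split the fragment field into (offset, is_fragment, is_last)."""
    return (field & OFFSET_MASK,
            bool(field & FRAGMENT_FLAG),
            bool(field & LAST_FRAGMENT_FLAG))


print(hex(encode_version(0, 1)), hex(encode_version(1, 0)))  # 0x1 0xa
print(hex(pack_fragment(0, True, True)))                     # 0xc0000000
```

Under this assumed mapping the upper bound of 25.4 corresponds to the byte value 254, which is consistent with the stated range.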
@@ -2131,34 +2129,34 @@ opaque error_info<0..2^16-1>; } ErrorResponse; The contents of this structure are as follows: error_code A numeric error code indicating the error that occurred. error_info An optional arbitrary byte string. Unless otherwise specified, - this will be a text string providing further information about - what went wrong. + this will be a UTF-8 text string providing further information + about what went wrong. The following error code values are defined. The numeric values for these are defined in Section 13.7. - Error_Forbidden: The requesting peer does not have permission to + Error_Forbidden: The requesting node does not have permission to make this request. Error_Not_Found: The resource or peer cannot be found or does not exist. Error_Request_Timeout: A response to the request has not been - received in a suitable amount of time. The requesting peer MAY + received in a suitable amount of time. The requesting node MAY resend the request at a later time. Error_Data_Too_Old: A store cannot be completed because the storage_time precedes the existing value. Error_Generation_Counter_Too_Low: A store cannot be completed because the generation counter precedes the existing value. Error_Incompatible_with_Overlay: A peer receiving the request is using a different overlay, overlay algorithm, or hash algorithm. @@ -2296,24 +2294,24 @@ The input to signatures over data values is different, and is described in Section 6.1. All RELOAD messages MUST be signed. Upon receipt, the receiving node MUST verify the signature and the authorizing certificate. This check provides a minimal level of assurance that the sending node is a valid part of the overlay as well as cryptographic authentication of the sending node. In addition, responses MUST be checked as follows: - 1. The response to a message sent to a specific Node-Id MUST have - been sent by that Node-Id. + 1. The response to a message sent to a specific Node-ID MUST have + been sent by that Node-ID. 2.
The response to a message sent to a Resource-Id MUST have been - sent by a Node-Id which is as close to or closer to the target + sent by a Node-ID which is as close to or closer to the target Resource-Id than any node in the requesting node's neighbor table. The second condition serves as a primitive check for responses from wildly wrong nodes but is not a complete check. Note that in periods of churn, it is possible for the requesting node to obtain a closer neighbor while the request is outstanding. This will cause the response to be rejected and the request to be retransmitted. In addition, some methods (especially Store) have additional @@ -2332,31 +2330,32 @@ 5.4.1. Topology Plugin Requirements When specifying a new overlay algorithm, at least the following need to be described: o Joining procedures, including the contents of the Join message. o Stabilization procedures, including the contents of the Update message, the frequency of topology probes and keepalives, and the mechanism used to detect when peers have disconnected. o Exit procedures, including the contents of the Leave message. - o The length of the Resource-IDs and Node-IDs. For DHTs, the hash - algorithm to compute the hash of an identifier. + o The length of the Resource-IDs. For DHTs, the hash algorithm to + compute the hash of an identifier. o The procedures that peers use to route messages. o The replication strategy used to ensure data redundancy. All overlay algorithms MUST specify maintenance procedures that send - Updates to all members of the Connection Table whenever the range of - IDs for which the peer is responsible changes. This Update allows - clients and peers that have established connections to the peer - responsible for a particular ID to update that connection as - appropriate. + Updates to clients and peers that have established connections to the + peer responsible for a particular ID when the responsibility for that + ID changes. 
Because tracking this information is difficult, overlay + algorithms MAY simply specify that an Update is sent to all members + of the Connection Table whenever the range of IDs for which the peer + is responsible changes. 5.4.2. Methods and types for use by topology plugins This section describes the methods that topology plugins use to join, leave, and maintain the overlay. 5.4.2.1. Join A new peer (but one that already has credentials) uses the JoinReq message to join the overlay. The JoinReq is sent to the responsible @@ -2411,21 +2410,21 @@ Upon receiving a Leave request, a peer MUST update its own routing table, and send the appropriate Store/Update sequences to re-stabilize the overlay. 5.4.2.3. Update Update is the primary overlay-specific maintenance message. It is used by the sender to notify the recipient of the sender's view of the current state of the overlay (its routing state), and it is up to the recipient to take whatever actions are appropriate to deal with - the state change. In general, peers MUST send Update messages to all + the state change. In general, peers send Update messages to all their adjacencies whenever they detect a topology shift. When a peer detects through an Update that it is no longer responsible for any data value it is storing, it MUST attempt to Store a copy to the correct node unless it knows the newly responsible node already has a copy of the data. This prevents data loss during large-scale topology shifts such as the merging of partitioned overlays. The contents of the UpdateReq message are completely overlay- @@ -2448,22 +2447,22 @@ responds with information about the peers to which the request would be routed. The sending peer MAY then use the Attach method to attach to that peer(s), and repeat the RouteQuery. Eventually, the sender gets a response from a peer that is closest to the identifier in the destination_object as determined by the topology plugin.
At that point, the sender can send messages directly to that peer. 5.4.2.4.1. Request Definition A RouteQueryReq message indicates the peer or resource that the - requesting peer is interested in. It also contains a "send_update" - option allowing the requesting peer to request a full copy of the + requesting node is interested in. It also contains a "send_update" + option allowing the requesting node to request a full copy of the other peer's routing table. struct { Boolean send_update; Destination destination; opaque overlay_specific_data<0..2^16-1>; } RouteQueryReq; The contents of the RouteQueryReq message are as follows: @@ -2501,21 +2500,21 @@ pieces of status information that the requester would like the responder to provide. enum { responsible_set(1), num_resources(2), uptime(3), (255)} ProbeInformationType; struct { ProbeInformationType requested_info<0..2^8-1>; } ProbeReq - The two currently defined values for ProbeInformation are: + The currently defined values for ProbeInformation are: responsible_set indicates that the peer should Respond with the fraction of the overlay for which the responding peer is responsible. num_resources indicates that the peer should Respond with the number of resources currently being stored by the peer. uptime @@ -2558,41 +2557,38 @@ types. Each of the current possible Probe information types is a 32-bit unsigned integer. For type "responsible_ppb", it is the fraction of the overlay for which the peer is responsible in parts per billion. For type "num_resources", it is the number of resources the peer is storing. For the type "uptime" it is the number of seconds the peer has been up. The responding peer SHOULD include any values that the requesting - peer requested and that it recognizes. They SHOULD be returned in + node requested and that it recognizes. They SHOULD be returned in the requested order. Any other values MUST NOT be returned. 5.5. 
Forwarding and Link Management Layer Each node maintains connections to a set of other nodes defined by the topology plugin. This section defines the methods RELOAD uses to form and maintain connections between nodes in the overlay. Three methods are defined: Attach: used to form RELOAD connections between nodes. When node A wants to connect to node B, it sends an Attach message to node B through the overlay. The Attach contains A's ICE parameters. B responds with its ICE parameters and the two nodes perform ICE to - form connection. - AttachLite: like attach, it is used to form connections between - nodes but instead of using full ICE, it only uses a subset known - as ICE-Lite. + form connection. Attach also allows two nodes to connect via No- + ICE instead of full ICE. AppAttach: used to form application layer connections between nodes. - AppAttachLite: like AppAttach but uses ICE-Lite. Ping: is a simple request/response which is used to verify connectivity of the target peer. 5.5.1. Attach A node sends an Attach request when it wishes to establish a direct TCP or UDP connection to another node for the purpose of sending RELOAD messages. As described in Section 5.1, an Attach may be routed to either a @@ -2609,36 +2605,38 @@ it MAY route messages which are directly addressed to B through that channel but MUST NOT route messages through B to other peers via that channel. The process of Attaching is separate from the process of becoming a peer (using Join and Update), to prevent half-open states where a node has started to form connections but is not really ready to act as a peer. Thus, clients (unlike peers) can simply Attach without sending Join or Update. 5.5.1.1. Request Definition - An AttachReq message contains the requesting peer's ICE connection + An Attach request message contains the requesting node ICE connection parameters formatted into a binary structure. 
- enum { reserved(0), UDP(1), TCP(2), (255) } Transport; + enum { reserved(0), DTLS-UDP-SR(1), + DLTS-UDP-SR-NO-ICE(3), TLS-TCP-FH-NO-ICE(4), + (255) } Overlay Link; enum { reserved(0), host(1), srflx(2), prflx(3), relay(4), (255) } CandType; struct { opaque name<2^16-1>; opaque value<2^16-1>; } IceExtension; struct { IpAddressPort addr_port; - Transport transport; + Overlay Link overlay_link; opaque foundation<0..255>; uint32 priority; CandType type; select (type){ case host: ; /* Nothing */ case srflx: case prflx: case relay: IpAddressPort rel_addr_port; @@ -2646,47 +2644,58 @@ IceExtension extensions<0..2^16-1>; } IceCandidate; struct { opaque ufrag<0..2^8-1>; opaque password<0..2^8-1>; opaque role<0..2^8-1>; IceCandidate candidates<0..2^16-1>; } AttachReqAns; - The values contained in AttachReq and AttachAns are: + The values contained in AttachReqAns are: ufrag The username fragment (from ICE). password The ICE password. role - An active/passive/actpass attribute from RFC 4145 [RFC4145]. + An active/passive/actpass attribute from RFC 4145 [RFC4145]. This + value MUST be 'passive' for the offerer (the peer sending the + Attach request) and 'active' for the answerer (the peer sending + the Attach response). candidates One or more ICE candidate values, as described below. Each ICE candidate is represented as an IceCandidate structure, which is a direct translation of the information from the ICE string structures, with the exception of the component ID. Since there is - only one component, it is always 1. The remaining values are - specified as follows: + only one component, it is always 1, and thus left out of the PDU. + The remaining values are specified as follows: addr_port corresponds to the connection-address and port productions. - transport - corresponds to the transport production. New transports such as - SCTP or [I-D.baset-tsvwg-tcp-over-udp] can be added be defining - new Transport values in the IANA registry in Section 13.8. 
+ overlay_link + corresponds to the Overlay Link production. Overlay Link protocols + used with No ICE MUST specify "no ICE" in their description. + Future overlay link values can be added by defining new Overlay + Link values in the IANA registry in Section 13.8. Future + extensions to the encapsulation or framing that provide for + backward compatibility with that specified by a previously defined + Overlay Link value MUST use that previous value. Overlay Link + protocols are defined in Section 5.6. + A single AttachReqAns MUST NOT include both candidates whose + Overlay Link protocols use ICE (the default) and candidates that + specify "no ICE". foundation corresponds to the foundation production. priority corresponds to the priority production. type corresponds to the cand-type production. @@ -2704,41 +2713,39 @@ 5.5.1.2. Response Definition If a peer receives an Attach request, it SHOULD process the request and generate its own response with an AttachReqAns. It should then begin ICE checks. When a peer receives an Attach response, it SHOULD parse the response and begin its own ICE checks. 5.5.1.3. Using ICE With RELOAD This section describes the profile of ICE that is used with RELOAD. - RELOAD implementations MUST implement full ICE. Because RELOAD - always tries to use TCP and then UDP as a fallback, there will be - multiple candidates of the same IP version, which requires full ICE. + RELOAD implementations MUST implement full ICE. In ICE as defined by [I-D.ietf-mmusic-ice], SDP is used to carry the ICE parameters. In RELOAD, this function is performed by a binary encoding in the Attach method. This encoding is more restricted than the SDP encoding because the RELOAD environment is simpler: o Only a single media stream is supported. o In this case, the "stream" refers not to RTP or other types of media, but rather to a connection for RELOAD itself or for SIP signaling. o RELOAD only allows for a single offer/answer exchange.
Unlike the usage of ICE within SIP, there is never a need to send a subsequent offer to update the default candidates to match the ones selected by ICE. An agent follows the ICE specification as described in - [I-D.ietf-mmusic-ice] and [I-D.ietf-mmusic-ice-tcp] with the changes - and additional procedures described in the subsections below. + [I-D.ietf-mmusic-ice] with the changes and additional procedures + described in the subsections below. 5.5.1.4. Collecting STUN Servers ICE relies on the node having one or more STUN servers to use. In conventional ICE, it is assumed that nodes are configured with one or more STUN servers through some out-of-band mechanism. This is still possible in RELOAD but RELOAD also learns STUN servers as it connects to other peers. Because all RELOAD peers implement ICE and use STUN keepalives, every peer is a STUN server [RFC5389]. Accordingly, any peer a node knows will be willing to be a STUN server -- though of @@ -2762,113 +2769,170 @@ connections. Only peers to which the peer currently has connections may be used. If the connection to that host is lost, it MUST be removed from the list of stun servers and a new server from the same group SHOULD be selected. 5.5.1.5. Gathering Candidates When a node wishes to establish a connection for the purposes of - RELOAD signaling or SIP signaling (or any other application protocol - for that matter), it follows the process of gathering candidates as - described in Section 4 of ICE [I-D.ietf-mmusic-ice]. RELOAD utilizes - a single component, as does SIP. Consequently, gathering for these - "streams" requires a single component. - - An agent MUST implement ICE-tcp [I-D.ietf-mmusic-ice], and MUST - gather at least one UDP and one TCP host candidate for RELOAD and for - SIP. + RELOAD signaling or application signaling, it follows the process of + gathering candidates as described in Section 4 of ICE + [I-D.ietf-mmusic-ice]. RELOAD utilizes a single component. 
+ Consequently, gathering for these "streams" requires a single + component. In the case where a node has not yet found a TURN server, + the agent would not include a relayed candidate. The ICE specification assumes that an ICE agent is configured with, or somehow knows of, TURN and STUN servers. RELOAD provides a way for an agent to learn these by querying the overlay, as described in Section 5.5.1.4 and Section 8. - The agent SHOULD prioritize its TCP-based candidates over its UDP- - based candidates in the prioritization described in Section 4.1.2 of - ICE [I-D.ietf-mmusic-ice]. - - The default candidate selection described in Section 4.1.3 of ICE is + The default candidate selection described in Section 4.1.4 of ICE is ignored; defaults are not signaled or utilized by RELOAD. -5.5.1.6. Encoding the Attach Message + An alternative to using the full ICE supported by the Attach request + is to use No-ICE mechanism by providing candidates with "no ICE" + Overlay Link protocols. Configuration for the overlay indicates + whether or not these Overlay Link protocols can be used. A node MUST + only use ICE or No-ICE candidates within one AttachReqAns. No-ICE + will not work in all of the scenarios where ICE would work, but in + some cases, particularly those with no NATs or firewalls, it will + work. It is RECOMMENDED that full ICE be used even for a node that + has a public, unfiltered IP address, to take advantage of STUN + connectivity checks, etc. + +5.5.1.6. Prioritizing Candidates + + At the time of writing, UDP is the only transport protocol specified + to work with ICE. However, standardization of additional protocols + for use with ICE is expected, including TCP and datagram-oriented + protocols such as SCTP and DCCP. In particular, UDP encapsulations + for SCTP and DCCP are expected to be standardized in the near future, + greatly expanding the available Overlay Link protocols available for + RELOAD. 
When additional protocols are available, the following + prioritization is RECOMMENDED: + + o Highest priority is assigned to message-oriented protocols that + offer well-understood congestion and flow control without head-of- + line blocking. For example, SCTP without message ordering, DCCP, + or those protocols encapsulated using UDP. + o Second highest priority is assigned to stream-oriented protocols, + e.g. TCP. + o Lowest priority is assigned to protocols encapsulated over UDP + that do not implement well-established congestion control + algorithms. For example, the DTLS/UDP with SR overlay link + protocol. + +5.5.1.7. Encoding the Attach Message Section 4.3 of ICE describes procedures for encoding the SDP for - conveying RELOAD or SIP ICE candidates. Instead of actually encoding - an SDP, the candidate information (IP address and port and transport - protocol, priority, foundation, component ID, type and related - address) is carried within the attributes of the Attach request or - its response. Similarly, the username fragment and password are - carried in the Attach message or its response. Section 5.5.1 - describes the detailed attribute encoding for Attach. The Attach - request and its response do not contain any default candidates or the - ice-lite attribute, as these features of ICE are not used by RELOAD. + conveying RELOAD candidates. Instead of actually encoding an SDP, + the candidate information (IP address and port and transport + protocol, priority, foundation, type and related address) is carried + within the attributes of the Attach request or its response. + Similarly, the username fragment and password are carried in the + Attach message or its response. Section 5.5.1 describes the detailed + attribute encoding for Attach. The Attach request and its response + do not contain any default candidates or the ice-lite attribute, as + these features of ICE are not used by RELOAD. 
Since the Attach request contains the candidate information and short term credentials, it is considered as an offer for a single media stream that happens to be encoded in a format different than SDP, but is otherwise considered a valid offer for the purposes of following the ICE specification. Similarly, the Attach response is considered a valid answer for the purposes of following the ICE specification. -5.5.1.7. Verifying ICE Support +5.5.1.8. Verifying ICE Support An agent MUST skip the verification procedures in Section 5.1 and 6.1 of ICE. Since RELOAD requires full ICE from all agents, this check is not required. -5.5.1.8. Role Determination +5.5.1.9. Role Determination The roles of controlling and controlled as described in Section 5.2 of ICE are still utilized with RELOAD. However, the offerer (the entity sending the Attach request) will always be controlling, and the answerer (the entity sending the Attach response) will always be controlled. The connectivity checks MUST still contain the ICE-CONTROLLED and ICE-CONTROLLING attributes, however, even though the role reversal capability for which they are defined will never be needed with RELOAD. This is to allow for a common codebase between ICE for RELOAD and ICE for SDP. -5.5.1.9. Connectivity Checks +5.5.1.10. Full ICE + + When neither side has provided a No-ICE candidate, connectivity + checks and nominations are used as in regular ICE. + +5.5.1.10.1. Connectivity Checks The processes of forming check lists in Section 5.7 of ICE, scheduling checks in Section 5.8, and checking connectivity checks in Section 7 are used with RELOAD without change. -5.5.1.10. Concluding ICE +5.5.1.10.2. Concluding ICE The controlling agent MUST utilize regular nomination. This is to ensure consistent state on the final selected pairs without the need for an updated offer, as RELOAD does not generate additional offer/answer exchanges.
The procedures in Section 8 of ICE are followed to conclude ICE, with the following exceptions: o The controlling agent MUST NOT attempt to send an updated offer once the state of its single media stream reaches Completed. o Once the state of ICE reaches Completed, the agent can immediately free all unused candidates. This is because RELOAD does not have the concept of forking, and thus the three second delay in Section 8.3 of ICE does not apply. -5.5.1.11. Subsequent Offers and Answers - - An agent MUST NOT send a subsequent offer or answer. Thus, the - procedures in Section 9 of ICE MUST be ignored. - -5.5.1.12. Media Keepalives +5.5.1.10.3. Media Keepalives STUN MUST be utilized for the keepalives described in Section 10 of ICE. +5.5.1.11. No ICE + + No-ICE is selected when either side has provided "no ICE" Overlay + Link candidates. STUN is not used for connectivity checks when doing + No-ICE; instead the DTLS or TLS handshake (or similar security layer + of future overlay link protocols) forms the connectivity check. The + certificate exchanged during the (D)TLS handshake must match the node + that sent the AttachReqAns and if it does not, the connection MUST be + closed. + +5.5.1.11.1. Implementation Notes for No-ICE + + This is a non-normative section to help implementors. + + At times ICE can seem a bit daunting to get one's head around. For a + simple IPv4 only peer, a simple implementation of No-ICE could be + done be doing the following: + o When sending an AttachReqAns, form one candidate with a priority + value of (2^24)*(126)+(2^8)*(65535)+(2^0)*(256-1) that specifies + the UDP port being listened to and another one with the TCP port. + o Check the certificate received in the TLS handshake has the same + Node-ID as the node that has sent the AttachReqAns. If multiple + connections succeed, close all but the one with highest priority. + o Do normal TLS and DTLS with no need for any special framing or + STUN processing. + +5.5.1.12. 
Subsequent Offers and Answers + + An agent MUST NOT send a subsequent offer or answer. Thus, the + procedures in Section 9 of ICE MUST be ignored. + 5.5.1.13. Sending Media The procedures of Section 11 apply to RELOAD as well. However, in this case, the "media" takes the form of application layer protocols (RELOAD or SIP for example) over TLS or DTLS. Consequently, once ICE processing completes, the agent will begin TLS or DTLS procedures to establish a secure connection. The node which sent the Attach request MUST be the TLS server. The other node MUST be the TLS client. The server MUST request TLS client authentication. The nodes MUST verify that the certificate presented in the handshake @@ -2877,123 +2941,47 @@ protocol is free to use the connection. The concept of a previous selected pair for a component does not apply to RELOAD, since ICE restarts are not possible with RELOAD. 5.5.1.14. Receiving Media An agent MUST be prepared to receive packets for the application protocol (TLS or DTLS carrying RELOAD, SIP or anything else) at any time. The jitter and RTP considerations in Section 11 of ICE do not - apply to RELOAD or SIP. - -5.5.2. AttachLite - - An alternative to using the full ICE supported by the Attach request - is to use ICE-Lite with the AttachLite request. This will not work - in all of the scenarios where ICE would work, but in some cases, - particularly those with no NATs or firewalls, it will work. - Configuration for the overlay indicates whether or not this can be - used. - -5.5.2.1. Request Definition - - An AttachLiteReqAns message contains the requesting peer's ICE-Lite - connection parameters formatted into a binary structure. When using - the AttachLite request, both sides act as ICE-Lite hosts. 
- - struct { - IpAddressPort addr_port; - Transport transport; - uint32 priority; - } IceLiteCandidate; - - struct { - IceLiteCandidate candidates<0..2^16-1>; - } AttachLiteReqAns; - - The values contained in AttachLiteReqAns are: - - candidates - One or more ICE candidate values. Each one contains an IP address - and family, transport protocol, and port to connect to as well as - a priority. - - These values should be generated using the procedures described in - Section 5.5.1.3. - -5.5.2.2. Response Definition - - If a peer receives an AttachLite request, it SHOULD process the - request and generate its own response with an AttachLiteReqAns. It - should then initiate connections as described below. When a peer - receives an AttachLite response, it SHOULD parse the response and - handle any received connections. - -5.5.2.3. Attach-Lite Connectivity Checks - - STUN is not used for connectivity checks when doing ICE-Lite; instead - the DTLS or TLS handshake forms the connectivity check. The host - that received the AttachLiteReqAns MUST initiate TLS or DTLS - connections to candidates provided in the request. When a connection - forms, the node MUST check that the certificate is for the node that - sent AttachLiteReqAns and if is not, MUST close the connection. - - Since TLS provides the connectivity check, there is no need for the - RFC 4571 [RFC4571] style framing shim for STUN when using TLS. - -5.5.2.4. Implementation Notes for Attach-Lite - - This is a non-normative section to help implementors. - - At times ICE can seem a bit daunting to get one's head around. For a - simple IPv4 only peer, a simple implementation of Attach-Lite could - be done be doing the following: - - o When sending an AttachLiteReqAns, form one candidate with a - priority value of (2^24)*(126)+(2^8)*(65535)+(2^0)*(256-1) that - specifies the UDP port being listened to and another one with the - TCP port. 
- o When receiving an AttachLiteReqAns, try to form a connection to - each candidate in the request. Check the certificate receive in - the TLS handshake has the same Node-ID as the node that has sent - the AttchLiteReq. If multiple connections succeed, close all but - the one with highest priority. - o Do normal TLS and DTLS with no need for any special framing or - STUN processing. + apply to RELOAD. -5.5.3. AppAttach +5.5.2. AppAttach A node sends an AppAttach request when it wishes to establish a - direct TCP or UDP connection to another node for the purposes of - sending application layer messages. AppAttach is basically like - Attach, except for the purpose of the connection. A separate request - is used to avoid implementor confusion between the two methods (this - was found to be a real problem with initial implementations). The + direct connection to another node for the purposes of sending + application layer messages. AppAttach is basically like Attach, + except for the purpose of the connection. A separate request is used + to avoid implementor confusion between the two methods (this was + found to be a real problem with initial implementations). The AppAttach request and its response contain an application attribute, - with a value of SIP or RELOAD, which indicates what protocol is to be - run over the connection. + which indicates what protocol is to be run over the connection. -5.5.3.1. Request Definition +5.5.2.1. Request Definition - An AttachReq message contains the requesting peer's ICE connection - parameters formatted into a binary structure. + An AppAttachReqAns message contains the requesting node's ICE + connection parameters formatted into a binary structure. 
struct { opaque ufrag<0..2^8-1>; opaque password<0..2^8-1>; uint16 application; opaque role<0..2^8-1>; IceCandidate candidates<0..2^16-1>; } AppAttachReqAns; - The values contained in AppAttachReq and AppAttachAns are: + The values contained in AppAttachReqAns are: ufrag The username fragment (from ICE) password The ICE password. application A 16-bit port number. This port number represents the IANA registered port of the protocol that is going to be sent on this @@ -3001,107 +2989,70 @@ registered port, we avoid the need for an additional registry and allow RELOAD to be used to set up connections for any existing or future application protocols. role An active/passive/actpass attribute from RFC 4145 [RFC4145]. candidates One or more ICE candidate values -5.5.3.2. Response Definition +5.5.2.2. Response Definition If a peer receives an AppAttach request, it SHOULD process the request and generate its own response with a AppAttachReqAns. It should then begin ICE checks. When a peer receives an AppAttach response, it SHOULD parse the response and begin its own ICE checks. -5.5.4. AppAttachLite - - Similar to the AttachLite method, RELOAD provides an AppAttachLite to - allow application connections when full ICE is not needed. - -5.5.4.1. Request Definition - - An AppAttachLiteReqAns message contains the requesting peer's ICE- - Lite connection parameters formatted into a binary structure. When - using the AppAttachLite request, both sides act as ICE-Lite hosts. - - struct { - uint16 application; - IceLiteCandidate candidates<0..2^16-1>; - } AppAttachLiteReqAns; - - The values contained in AppAttachLiteReqAns are: - - application - A 16-bit port number used in the same way as in the AppAttach - request. This port number represents the IANA registered port of - the protocol that is going to be sent on this connection. - - candidates - One or more ICE candidate values. 
Each one contains an IP address - and family, transport protocol, and port to connect to as well as - a priority. - - These values should be generated using the procedures described in - Section 5.5.1.3. - -5.5.4.2. Response Definition - - If a peer receives an AppAttachLite request, it SHOULD process the - request and generate its own response with an AppAttachLiteReqAns as - described in the AttachLite section. - -5.5.5. Ping +5.5.3. Ping Ping is used to test connectivity along a path. A ping can be addressed to a specific Node-ID, to the peer controlling a given location (by using a resource ID), or to the broadcast Node-ID (2^128-1). -5.5.5.1. Request Definition +5.5.3.1. Request Definition struct { } PingReq -5.5.5.2. Response Definition +5.5.3.2. Response Definition A successful PingAns response contains the information elements requested by the peer. struct { uint64 response_id; uint64 time; } PingAns; A PingAns message contains the following elements: response_id A randomly generated 64-bit response ID. This is used to distinguish Ping responses. time The time when the ping responses was created in absolute time, represented in milliseconds since midnight Jan 1, 1970 which is the UNIX epoch. -5.5.6. Config_Update +5.5.4. Config_Update The Config_Update method is used to push updated configuration data across the overlay. Whenever a node detects that another node has old configuration data, it MUST generate a Config_Update request. The Config_Update request allows updating of two kinds of data: the configuration data (Section 5.3.2.1) and kind information (Section 6.4.1.1). -5.5.6.1. Request Definition +5.5.4.1. Request Definition enum { reserved(0), config(1), kind(2), (255) } Config_UpdateType; typedef opaque KindDescription<2^16-1>; struct { Config_UpdateType type; uint32 length; @@ -3126,109 +3076,158 @@ The length of the remainder of the message. 
This is included to preserve backward compatibility and is 32 bits instead of 24 to facilitate easy conversion between network and host byte order. config_data (type==config) The contents of the configuration document. kinds (type==kind) One or more XML kind-block productions (see Section 10.1). These MUST be encoded with UTF-8 and assume a default namespace of "urn:ietf:params:xml:ns:p2p:config-base". -5.5.6.2. Response Definition +5.5.4.2. Response Definition struct { } Config_UpdateReq If the Config_UpdateReq is of type "config" it MUST only be processed if all the following are true: o The sequence number in the document is greater than the current configuration sequence number. o The configuration document is correctly digitally signed (see Section 10 for details on signatures). Otherwise appropriate errors MUST be generated. If the Config_UpdateReq is of type "kind" it MUST only be processed - if it is correctly digitally signed by an acceptable kind signer. In - addition, if the kind update conflicts with an existing known kind - (i.e., it is signed by a different signer), then it should be - rejected with "Error_Forbidden". This should not happen in correctly - functioning overlays. + if it is correctly digitally signed by an acceptable kind signer as + specified in the configuration file. Details on the kind-signer field in + the configuration file are described in Section 10.1. In addition, if + the kind update conflicts with an existing known kind (i.e., it is + signed by a different signer), then it should be rejected with + "Error_Forbidden". This should not happen in correctly functioning + overlays. If the update is acceptable, then the node MUST reconfigure itself to match the new information. This may include adding permissions for new kinds, deleting old kinds, or even, in extreme circumstances, exiting and reentering the overlay, if, for instance, the DHT algorithm has changed. The response for Config_Update is empty. 5.6.
Overlay Link Layer RELOAD can use multiple Overlay Link protocols to send its messages. Because ICE is used to establish connections (see Section 5.5.1.3), RELOAD nodes are able to detect which Overlay Link protocols are offered by other nodes and establish connections between them. Any link protocol needs to be able to establish a secure, authenticated connection and to provide data origin authentication and message integrity for individual data elements. RELOAD currently supports - two Overlay Link protocols: + three Overlay Link protocols: - o TLS [RFC5246] over TCP - o DTLS [RFC4347] over UDP + o DTLS [RFC4347] over UDP with Simple Reliability (SR) + o TLS [RFC5246] over TCP with Framing Header, no ICE + o DTLS [RFC4347] over UDP with SR, no ICE Note that although UDP does not properly have "connections", both TLS and DTLS have a handshake which establishes a similar, stateful association, and we simply refer to these as "connections" for the purposes of this document. If a peer receives a message that is larger than value of max- message-size defined in the overlay configuration, the peer SHOULD send an Error_Message_Too_Large error and then close the TLS or DTLS session from which the message was received. Note that this error can be sent and the session closed before receiving the complete message. If the forwarding header is larger than the max-message- size, the receiver SHOULD close the TLS or DTLS session without sending an error. -5.6.1. Future Support for HIP + The Framing Header (FH) is used to frame messages and provide timing + when used on a reliable stream-based transport protocol. Simple + Reliability (SR) makes use of the FH to provide congestion control + and semi-reliability when using unreliable message-oriented transport + protocols. We will first define each of these algorithms, then + define overlay link protocols that use them. 
+ + Note: We expect future Overlay Link protocols to define replacements + for all components of these protocols, including the framing header. + These protocols have been chosen for simplicity of implementation and + reasonable performance. + + Note to implementers: There are inherent tradeoffs in utilizing + short timeouts to determine when a link has failed. To balance the + tradeoffs, an implementation should be able to quickly act to remove + entries from the routing table when there is reason to suspect the + link has failed. For example, in a Chord-derived overlay algorithm, + a closer finger table entry could be substituted for an entry in the + finger table that has experienced a timeout. That entry can be + restored if it proves to resume functioning, or replaced at some + point in the future if necessary. End-to-end retransmissions will + handle any lost messages, but only if the failing entries do not + remain in the finger table for subsequent retransmissions. + +5.6.1. Future Overlay Link Protocols + +5.6.1.1. HIP The P2PSIP Working Group has expressed interest in supporting a HIP- based link protocol [RFC5201]. Such support would require specifying such details as: o How to issue certificates which provided identities meaningful to the HIP base exchange. We anticipate that this would require a mapping between ORCHIDs and NodeIds. o How to carry the HIP I1 and I2 messages. We anticipate that this would require defining a HIP Tunnel usage. o How to carry RELOAD messages over HIP. - We leave this work as a topic for another draft. +5.6.1.2. ICE-TCP -5.6.2. Reliability for Unreliable Links + The ICE-TCP draft [I-D.ietf-mmusic-ice-tcp] should allow TCP to be + supported as an Overlay Link protocol that can be added using ICE. + However, as of the time of this writing, the draft is not making + significant progress toward approval. 
- When RELOAD is carried over DTLS or another unreliable link protocol, - it needs to be used with a reliability and congestion control - mechanism, which is provided on a hop-by-hop basis. The basic - principle is that each message, regardless of whether or not it - carries a request or response, will get an ACK and be reliably - retransmitted. The receiver's job is very simple, limited to just - sending ACKs. All the complexity is at the sender side. This allows - the sending implementation to trade off performance versus - implementation complexity without affecting the wire protocol. +5.6.1.3. Message-oriented Transports - In order to support unreliable links, each message is wrapped in a - very simple framing layer (FramedMessage) which is only used for each - hop. This layer contains a sequence number which can then be used - for ACKs. + Modern message-oriented transports offer high performance, good + congestion control, and avoid head-of-line blocking in case of lost + data. These characteristics make them preferable as underlying + transport protocols for RELOAD links. SCTP without message ordering + and DCCP are two examples of such protocols. However, currently they + are not well-supported by commonly available NATs, and specifications + for ICE session establishment are not available. -5.6.2.1. Framed Message Format +5.6.1.4. Tunneled Transports + + As of the time of this writing, there is significant interest in the + IETF community in tunneling other transports over UDP, motivated by + the situation that UDP is well-supported by modern NAT hardware, and + similar performance can be achieved to native implementation. + Currently SCTP, DCCP, and a generic tunneling extension are being + proposed for message-oriented protocols. Baset et al. have proposed + tunneling TCP over UDP for similar reasons + [I-D.baset-tsvwg-tcp-over-udp]. 
Once ICE traversal has been + specified for these tunneled protocols, they should be easily + supported as an overlay link protocol. + +5.6.2. Framing Header + + In order to support unreliable links and to allow for quick detection + of link failures when using reliable end-to-end transports, each + message is wrapped in a very simple framing layer (FramedMessage) + which is only used for each hop. This layer contains a sequence + number which can then be used for ACKs. The same header is used for + both reliable and unreliable transports for simplicity of + implementation---not all aspects of the header apply to both types of + transports. The definition of FramedMessage is: enum {data (128), ack (129), (255)} FramedMessageType; struct { FramedMessageType type; select (type) { case data: @@ -3282,33 +3281,44 @@ means only that it is unknown whether or not the packet has been received, because it might have been received before the 32 most recently received packets. The received field bits in the ACK provide a very high degree of redundancy so that the sender can figure out which packets the receiver has received and can then estimate packet loss rates. If the sender also keeps track of the time at which recent sequence numbers have been sent, the RTT can be estimated. -5.6.2.2. Retransmission and Flow Control +5.6.3. Simple Reliability + + When RELOAD is carried over DTLS or another unreliable link protocol, + it needs to be used with a reliability and congestion control + mechanism, which is provided on a hop-by-hop basis. The basic + principle is that each message, regardless of whether or not it + carries a request or response, will get an ACK and be reliably + retransmitted. The receiver's job is very simple, limited to just + sending ACKs. All the complexity is at the sender side. This allows + the sending implementation to trade off performance versus + implementation complexity without affecting the wire protocol. + +5.6.3.1.
Retransmission and Flow Control Because the receiver's role is limited to providing packet acknowledgements, a wide variety of congestion control algorithms can be implemented on the sender side while using the same basic wire protocol. Senders MUST implement a retransmission and congestion control scheme no more aggressive than TFRC [RFC5348]. One way to do that is for senders to implement the scheme in the following section. - Another would be to implement the scheme described in Appendix B. Another alternative would be to implement TFRC-SP [RFC4828] and use the received bitmask to allow the sender to compute packet loss event rates. -5.6.2.2.1. Trivial Retransmission +5.6.3.1.1. Trivial Retransmission A peer SHOULD retransmit a message if it has not received an ACK after an interval of RTO ("Retransmission TimeOut"). The peer MUST double the time to wait after each retransmission. In each retransmission, the sequence number is incremented. The RTO is an estimate of the round-trip time (RTT). Implementations can use a static value for RTO or a dynamic estimate which will result in better performance. For implementations that use a static value, the default value for RTO is 500 ms. Nodes MAY use smaller @@ -3336,27 +3346,81 @@ Retransmissions continue until a response is received, or until a total of 5 requests have been sent or there has been a hard ICMP error [RFC1122]. The sender knows a response was received when it receives an ACK with a sequence number that indicates it is a response to one of the transmissions of this message. For example, assuming an RTO of 500 ms, requests would be sent at times 0 ms, 500 ms, 1500 ms, 3500 ms, and 7500 ms. If all retransmissions for a message fail, then the sending node SHOULD close the connection routing the message.
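The doubling schedule in the Trivial Retransmission example can be reproduced with a short sketch. This is only an illustration under stated assumptions: the function name retransmission_times and its parameters are hypothetical, the RTO is static, and no ACK or hard ICMP error arrives to stop the sequence early.

```python
def retransmission_times(rto_ms=500, max_transmissions=5):
    """Return the send times (in ms) of a message and its retransmissions.

    Hypothetical helper: assumes a static RTO and that no ACK or hard
    ICMP error is received, so all transmissions are attempted.
    """
    times = []
    send_at = 0        # the first transmission happens at time 0
    wait = rto_ms      # wait one RTO before the first retransmission
    for _ in range(max_transmissions):
        times.append(send_at)
        send_at += wait
        wait *= 2      # the wait is doubled after each retransmission
    return times


print(retransmission_times())  # [0, 500, 1500, 3500, 7500]
```

With the default static RTO of 500 ms and a total of 5 transmissions, the output matches the times listed in the text.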
+ To determine when a link may be failing without waiting for the final + timeout, observe when no ACKs have been received for an entire RTO + interval, and then wait for three retransmissions to occur beyond + that point. If no ACKs have been received by the time the third + retransmission occurs, it is RECOMMENDED that the link be removed + from the routing table. The link MAY be restored to the routing + table if ACKs resume before the connection is closed, as described + above. + Once an ACK has been received for a message, the next message can be - sent but the peer SHOULD ensure that there is at least 10 ms between + sent, but the peer SHOULD ensure that there is at least 10 ms between sending any two messages. The only time a value less than 10 ms can be used is when it is known that all nodes are on a network that can support retransmissions faster than 10 ms with no congestion issues. -5.6.3. Fragmentation and Reassembly +5.6.4. DTLS/UDP with SR + + This overlay link protocol consists of DTLS over UDP while + implementing the Simple Reliability protocol. STUN connectivity + checks and keepalives are used. + +5.6.5. TLS/TCP with FH, no ICE + + This overlay link protocol consists of TLS over TCP with the framing + header. Because ICE is not used, STUN connectivity checks are not + used upon establishing the TCP connection, nor are they used for + keepalives. + + Because the TCP layer's application-level timeout is too slow to be + useful for overlay routing, the Overlay Link implementation MUST + use the framing header to measure the RTT of the connection and + calculate an RTO as specified in Section 2 of [RFC2988]. The + resulting RTO is not used for retransmissions, but as a timeout to + indicate when the link SHOULD be removed from the routing table. It + is RECOMMENDED that such a connection be retained for 30s to + determine if the failure was transient before concluding the link has + failed permanently.
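The RTO calculation referenced for TLS/TCP with FH can be sketched as follows. This is a minimal illustration of the estimator in Section 2 of RFC 2988, not text from the draft. The class name RtoEstimator and its parameter names are hypothetical. The 1-second lower bound and 3-second initial value are the RFC 2988 defaults; whether an overlay implementation keeps that lower bound is an assumption left to the caller through min_rto.

```python
class RtoEstimator:
    """Hypothetical sketch of the RFC 2988 Section 2 RTO computation.

    All times are in seconds. RTT samples would come from framing
    header ACKs; how they are measured is outside this sketch.
    """

    ALPHA = 1 / 8   # smoothing gain for SRTT
    BETA = 1 / 4    # smoothing gain for RTTVAR
    K = 4           # variance multiplier

    def __init__(self, clock_granularity=0.0, min_rto=1.0, initial_rto=3.0):
        self.g = clock_granularity
        self.min_rto = min_rto
        self.srtt = None
        self.rttvar = None
        self.rto = initial_rto   # used until the first RTT sample arrives

    def update(self, rtt):
        """Feed one RTT sample and return the new RTO."""
        if self.srtt is None:
            # First measurement: SRTT <- R, RTTVAR <- R/2
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the SRTT from before this sample
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        rto = self.srtt + max(self.g, self.K * self.rttvar)
        self.rto = max(self.min_rto, rto)   # apply the lower bound
        return self.rto


est = RtoEstimator(min_rto=0.0)
print(est.update(0.100))
print(est.update(0.100))
```

In this usage the resulting value would serve as the timeout after which the link is a candidate for removal from the routing table, rather than as a retransmission timer.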
+ + When sending candidates for TLS/TCP with FH, no ICE, a passive + candidate MUST be provided. The following table shows which side of + the exchange initiates the connection depending on whether they + provided ICE or No-ICE candidates. Note that the active TCP role + does not alter the TLS server/client determination. + + +----------------------+----------+-----------------+ + | Offeror | Answerer | TCP Active Role | + +----------------------+----------+-----------------+ + | ICE | No-ICE | Offeror | + | No-ICE | ICE | Answerer | + | No-ICE | No-ICE | Offeror | + +----------------------+----------+-----------------+ + + Table 1: Determining Active Role for No-ICE + +5.6.6. DTLS/UDP with SR, no ICE + + This overlay link protocol consists of DTLS over UDP while + implementing the Simple Reliability protocol. Because ICE is not + used, no STUN connectivity checks or keepalives are used. + +5.7. Fragmentation and Reassembly In order to allow transmission over datagram protocols such as DTLS, RELOAD messages may be fragmented. Any node along the path can fragment the message but only the final destination reassembles the fragments. When a node takes a packet and fragments it, each fragment has a full copy of the Forwarding Header but the data after the Forwarding Header is broken up in appropriate sized chunks. The size of the payload chunks needs to take into account space to allow the via and destination lists to @@ -3951,21 +4015,21 @@ kind The Kind-ID of the data being fetched. Implementations SHOULD reject requests corresponding to unknown kinds unless specifically configured otherwise. model The data model of the data. This must be checked against the Kind-ID. generation - The last generation counter that the requesting peer saw. This + The last generation counter that the requesting node saw. This may be used to avoid unnecessary fetches or it may be set to zero. length The length of the rest of the structure, thus allowing extensibility. 
model_specifier A reference to the data value being requested within the data model specified for the kind. For instance, if the data model is "array", it might specify some subset of the values. @@ -3984,22 +4048,22 @@ of the dictionary keys being requested. If no keys are specified, than this is a wildcard fetch and all key-value pairs are returned. The generation-counter is used to indicate the requester's expected state of the storing peer. If the generation-counter in the request matches the stored counter, then the storing peer returns a response with no StoredData values. Note that because the certificate for a user is typically stored at - the same location as any data stored for that user, a requesting peer - which does not already have the user's certificate should request the + the same location as any data stored for that user, a requesting node + that does not already have the user's certificate should request the certificate in the Fetch as an optimization. 6.4.2.2. Response Definition The response to a successful Fetch request is a FetchAns message containing the data requested by the requester. struct { KindId kind; uint64 generation; @@ -4150,42 +4214,42 @@ (if any) of the resource of kind T known to the target peer which is closest to R. This method can be used to walk the Overlay Instance by interactively fetching R_n+1=nearest(1 + R_n). 6.4.4.1. Request Definition The FindReq message contains a series of Resource-IDs and Kind-IDs identifying the resource the peer is interested in. struct { - ResourceID resource; + ResourceId resource; KindId kinds<0..2^8-1>; } FindReq; The request contains a list of Kind-IDs which the Find is for, as indicated below: resource The desired Resource-ID kinds The desired Kind-IDs. Each value MUST only appear once. 6.4.4.2. Response Definition A response to a successful Find request is a FindAns message containing the closest Resource-ID on the peer for each kind specified in the request. 
struct { KindId kind; - ResourceID closest; + ResourceId closest; } FindKindData; struct { FindKindData results<0..2^16-1>; } FindAns; If the processing peer is not responsible for the specified Resource-ID, it SHOULD return a 404 error. For each Kind-ID in the request the response MUST contain a @@ -4253,27 +4317,33 @@ peer's Node-ID but rather at a hash of the peer's Node-ID. The intention here (as is common throughout RELOAD) is to avoid making a peer responsible for its own data. A peer MUST ensure that the user's certificates are stored in the Overlay Instance. New certificates are stored at the end of the list. This structure allows users to store an old and a new certificate that both have the same Node-ID, which allows for migration of certificates when they are renewed. - This usage defines the following kind: + This usage defines the following kinds: - Name: CERTIFICATE - Data Model: The data model for CERTIFICATE data is array. + Name: CERTIFICATE_BY_NODE + Data Model: The data model for CERTIFICATE_BY_NODE data is array. Access Control: NODE-MATCH. + Name: CERTIFICATE_BY_USER + + Data Model: The data model for CERTIFICATE_BY_USER data is array. + + Access Control: USER-MATCH. + 8. TURN Server Usage The TURN server usage allows a RELOAD peer to advertise that it is prepared to be a TURN server as defined in [I-D.ietf-behave-turn]. When a node starts up, it joins the overlay network and forms several connections in the process. If the ICE stage in any of these connections returns a reflexive address that is not the same as the peer's perceived address, then the peer is behind a NAT and not a candidate for a TURN server. Additionally, if the peer's IP address is in the private address space range, then it is also not a @@ -4361,43 +4431,50 @@ algorithm are merged into a single periodic process. 
Stabilization is implemented slightly differently because of the larger neighborhood, and fix_fingers is not as aggressive to reduce load, nor does it search for optimal matches of the finger table entries. o RELOAD uses a 128 bit hash instead of a 160 bit hash, as RELOAD is not designed to be used in networks with close to or more than 2^128 nodes. o RELOAD uses randomized finger entries as described in Section 9.6.4.2. + o This algorithm allows the use of either reactive or periodic + recovery. The original Chord paper used periodic recovery. + Reactive recovery provides better performance in small overlays, + but is believed to be unstable in large (>1000) overlays with high + levels of churn [handling-churn-usenix04]. The overlay + configuration file specifies a "chord-reload-reactive" element + that indicates whether reactive recovery should be used. 9.1. Overview The algorithm described here is a modified version of the Chord algorithm. Each peer keeps track of a finger table and a neighbor - table. The neighbor table typically contains the three peers before - this peer and the three peers after it in the DHT ring. There may - not be three entreis in all cases such as small rings or while the - ring topology is changing. The first entry in the finger table - contains the peer half-way around the ring from this peer; the second - entry contains the peer that is 1/4 of the way around; the third - entry contains the peer that is 1/8th of the way around, and so on. - Fundamentally, the chord data structure can be thought of a doubly- - linked list formed by knowing the successors and predecessor peers in - the neighbor table, sorted by the Node-ID. As long as the successor - peers are correct, the DHT will return the correct result. The - pointers to the prior peers are kept to enable the insertion of new - peers into the list structure. 
Keeping multiple predecessor and - successor pointers makes it possible to maintain the integrity of the - data structure even when consecutive peers simultaneously fail. The - finger table forms a skip list, so that entries in the linked list - can be found in O(log(N)) time instead of the typical O(N) time that - a linked list would provide. + table. The neighbor table contains at least the three peers before + and after this peer in the DHT ring. There may not be three entries + in all cases such as small rings or while the ring topology is + changing. The first entry in the finger table contains the peer + half-way around the ring from this peer; the second entry contains + the peer that is 1/4 of the way around; the third entry contains the + peer that is 1/8th of the way around, and so on. Fundamentally, the + chord data structure can be thought of a doubly-linked list formed by + knowing the successors and predecessor peers in the neighbor table, + sorted by the Node-ID. As long as the successor peers are correct, + the DHT will return the correct result. The pointers to the prior + peers are kept to enable the insertion of new peers into the list + structure. Keeping multiple predecessor and successor pointers makes + it possible to maintain the integrity of the data structure even when + consecutive peers simultaneously fail. The finger table forms a skip + list, so that entries in the linked list can be found in O(log(N)) + time instead of the typical O(N) time that a linked list would + provide. A peer, n, is responsible for a particular Resource-ID k if k is less than or equal to n and k is greater than p, where p is the peer id of the previous peer in the neighbor table. Care must be taken when computing to note that all math is modulo 2^128. 9.2. Routing The routing table is the union of the neighbor table and the finger table. @@ -4418,86 +4495,84 @@ the neighbor table and to that peer's successor. 
Note that these Store requests are addressed to those specific peers, even though the Resource-ID they are being asked to store is outside the range that they are responsible for. The peers receiving these check they came from an appropriate predecessor in their neighbor table and that they are in a range that this predecessor is responsible for, and then they store the data. They do not themselves perform further Stores because they can determine that they are not responsible for the Resource-ID. + Managing replicas as the overlay changes is described in + Section 9.6.3. + The sequential replicas used in this overlay algorithm protect against peer failure but not against malicious peers. Additional replication from the Usage is required to protect resources from such attacks, as discussed in Section 12.5.4. 9.4. Joining The join process for a joining party (JP) with Node-ID n is as follows. 1. JP MUST connect to its chosen bootstrap node. 2. JP SHOULD use a series of Pings to populate its routing table. 3. JP SHOULD send Attach requests to initiate connections to each of - the peers in the neighbor table as well as to the desired 16 - finger table entries. Note that this does not populate their - routing tables, but only their connection tables, so JP will not - get messages that it is expected to route to other nodes. + the peers in the neighbor table as well as to the desired finger + table entries. Note that this does not populate their routing + tables, but only their connection tables, so JP will not get + messages that it is expected to route to other nodes. 4. JP MUST enter all the peers it has contacted into its routing table. 5. JP SHOULD send a Join to its immediate successor, the admitting peer (AP) for Node-ID n. The AP sends the response to the Join. 6. AP MUST do a series of Store requests to JP to store the data that JP will be responsible for. 7. AP MUST send JP an Update explicitly labeling JP as its predecessor. 
At this point, JP is part of the ring and responsible for a section of the overlay. AP can now forget any data which is assigned to JP and not AP. - - 8. AP MUST send an Update to all of its neighbors with the new + 8. The AP MUST send an Update to all of its neighbors with the new values of its neighbor set (including JP). - 9. JP SHOULD send Updates to all the peers in its routing table. + 9. The JP MUST send Updates to all the peers in its neighbor table. In order to populate its neighbor table, JP sends a Ping via the bootstrap node directed at Resource-ID n+1 (directly after its own Resource-ID). This allows it to discover its own successor. Call that node p0. It then sends a ping to p0+1 to discover its successor (p1). This process can be repeated to discover as many successors as desired. The values for the two peers before p will be found at a later stage when n receives an Update. - In order to set up its neighbor table entry for peer i, JP simply - sends an Attach to peer (n+2^(numBitsInNodeId-i). This will be - routed to a peer in approximately the right location around the ring. + In order to set up its finger table entry for peer i, JP simply sends + an Attach to peer (n+2^(128-i). This will be routed to a peer in + approximately the right location around the ring. The joining peer MUST NOT send any Update message placing itself in the overlay until it has successfully completed an Attach with each peer that should be in its neighbor table. 9.5. Routing Attaches When a peer needs to Attach to a new peer in its neighbor table, it MUST source-route the Attach request through the peer from which it learned the new peer's Node-ID. Source-routing these requests allows the overlay to recover from instability. All other Attach requests, such as those for new finger table entries, are routed conventionally through the overlay. 
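As an illustration of the finger table targets used in the join procedure above, the following Python sketch computes n+2^(128-i) with the arithmetic reduced modulo 2^128. It is a minimal sketch, not part of the specification: the names RING_BITS, RING_SIZE, and finger_target are hypothetical, and numbering entries from i=1 (so that the first entry lies half-way around the ring) is an assumption drawn from the overview text.

```python
# Hypothetical helper names; only the formula n+2^(128-i) and the
# modulo-2^128 arithmetic come from the draft text.
RING_BITS = 128
RING_SIZE = 1 << RING_BITS


def finger_target(n: int, i: int) -> int:
    """Return the ID targeted by the Attach for finger table entry i.

    Assumption: entries are numbered from i=1, so entry 1 is half-way
    around the ring, entry 2 is a quarter of the way, and so on.
    """
    if not 1 <= i <= RING_BITS:
        raise ValueError("finger index must be in 1..128")
    # All math is modulo 2^128, so targets wrap past the top of the ring.
    return (n + (1 << (RING_BITS - i))) % RING_SIZE
```

For example, under these assumptions a peer whose Node-ID is 0 would aim its first Attach at 2^127 and its second at 2^126; the overlay then routes each request to whichever peer is responsible for that ID.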
- If a peer is unable to successfully Attach with a peer that should be - in its neighborhood, it MUST locate either a TURN server or another - peer in the overlay, but not in its neighborhood, through which it - can exchange messages with its neighbor peer. - 9.6. Updates A chord Update is defined as + enum { reserved (0), peer_ready(1), neighbors(2), full(3), (255) } ChordUpdateType; struct { uint32 uptime; ChordUpdateType type; select(type){ case peer_ready: /* Empty */ ; @@ -4547,111 +4622,107 @@ successors The successor set of the Updating peer. fingers The finger table if the Updating peer, in numerically ascending order. A peer MUST maintain an association (via Attach) to every member of its neighbor set. A peer MUST attempt to maintain at least three predecessors and three successors, even though this will not be - possible if the ring is very small. However, it MUST send its entire - set in any Update message sent to neighbors. + possible if the ring is very small. It is RECOMMENDED that O(log(N)) + predecessors and successors be maintained in the neighbor set. 9.6.1. Handling Neighbor Failures Every time a connection to a peer in the neighbor table is lost (as determined by connectivity pings or the failure of some request), the peer MUST remove the entry from its neighbor table and replace it with the best match it has from the other peers in its routing table. If using reactive recovery, it then sends an immediate Update to all - nodes in its Connection Table. The update will contain all the Node- + nodes in its Neighbor Table. The update will contain all the Node- IDs of the current entries of the table (after the failed one has been removed). Note that when replacing a successor the peer SHOULD delay the creation of new replicas for successor replacement hold- down time (30 seconds) after removing the failed entry from its neighbor table in order to allow a triggered update to inform it of a better match for its neighbor table. 
+ If the neighbor failure affects the peer's range of responsible IDs, + then the Update MUST be sent to all nodes in its Connection Table. + A peer MAY attempt to reestablish connectivity with a lost neighbor either by waiting additional time to see if connectivity returns or by actively routing a new ATTACH to the lost peer. Details for these procedures are beyond the scope of this document. In no event does an attempt to reestablish connectivity with a lost neighbor allow the peer to remain in the neighbor table. Such a peer is returned to the neighbor table once connectivity is reestablished. - If connectivity is lost to all three of the peers that follow this - peer in the ring, then this peer should behave as if it is joining - the network and use Pings to find a peer and send it a Join. If - connectivity is lost to all the peers in the finger table, this peer - should assume that it has been disconnected from the rest of the - network, and it should periodically try to join the DHT. + If connectivity is lost to all successor peers in the neighbor table, + then this peer should behave as if it is joining the network and use + Pings to find a peer and send it a Join. If connectivity is lost to + all the peers in the finger table, this peer should assume that it + has been disconnected from the rest of the network, and it should + periodically try to join the DHT. 9.6.2. Handling Finger Table Entry Failure If a finger table entry is found to have failed, all references to the failed peer are removed from the finger table and replaced with the closest preceding peer from the finger table or neighbor table. If using reactive recovery, the peer initiates a search for a new finger table entry as described below. 9.6.3. Receiving Updates When a peer, N, receives an Update request, it examines the Node-IDs in the UpdateReq and at its neighbor table and decides if this UpdateReq would change its neighbor table.
This is done by taking the set of peers currently in the neighbor table and comparing them - to the peers in the update request. There are three major cases: + to the peers in the update request. There are two major cases: - o The UpdateReq contains peers that would not change the neighbor - set because they match the neighbor table. - o The UpdateReq contains peers closer to N than those in its + o The UpdateReq contains peers that match N's neighbor table, so no + change is needed to the neighbor set. + o The UpdateReq contains peers N does not know about that should be + in N's neighbor table, i.e. they are closer than entries in the neighbor table. - o The UpdateReq defines peers that indicate a neighbor table further - away from N than some of its neighbor table. Note that merely - receiving peers further away does not demonstrate this, since the - update could be from a node far away from N. Rather, the peers - would need to bracket N. In the first case, no change is needed. In the second case, N MUST attempt to Attach to the new peers and if it is successful it MUST adjust its neighbor set accordingly. Note that it can maintain the now inferior peers as neighbors, but it MUST remember the closer ones. - The third case implies that a neighbor has disappeared, most likely - because it has simply been disconnected but perhaps because of - overlay instability. N MUST Ping the questionable peers to discover - if they are indeed missing and if so, remove them from its neighbor - table. - After any Pings and Attaches are done, if the neighbor table changes and the peer is using reactive recovery, the peer sends an Update request to each member of its Connection Table. These Update requests are what ends up filling in the predecessor/successor tables of peers that this peer is a neighbor to. A peer MUST NOT enter itself in its successor or predecessor table and instead should leave the entries empty. 
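The comparison of an UpdateReq against the neighbor table can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the normative procedure: the function names and the limit of three entries per side are illustrative, and the Attach that N MUST complete before adopting a new peer is omitted. A peer never lists itself, matching the rule above.

```python
# Hypothetical helper names; the ring size follows the draft's 128-bit IDs.
RING_SIZE = 1 << 128


def clockwise_distance(a: int, b: int) -> int:
    """Distance travelled going clockwise around the ring from a to b."""
    return (b - a) % RING_SIZE


def merge_successors(n, current, advertised, k=3):
    """Return the k peers closest after n, drawn from both peer sets."""
    # A peer must not enter itself in its own successor table.
    candidates = {p for p in (*current, *advertised) if p != n}
    return sorted(candidates, key=lambda p: clockwise_distance(n, p))[:k]


def merge_predecessors(n, current, advertised, k=3):
    """Return the k peers closest before n, drawn from both peer sets."""
    candidates = {p for p in (*current, *advertised) if p != n}
    return sorted(candidates, key=lambda p: clockwise_distance(p, n))[:k]
```

If the merged list equals the current table, the UpdateReq falls into the first case and nothing changes. Otherwise the peers that are new in the result are the ones N would try to Attach to before adjusting its neighbor set.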
- If peer N which is responsible for a Resource-ID R discovers that the - replica set for R (the next two nodes in its successor set) has + If peer N is responsible for a Resource-ID R, and N discovers that + the replica set for R (the next two nodes in its successor set) has changed, it MUST send a Store for any data associated with R to any new node in the replica set. It SHOULD NOT delete data from peers which have left the replica set. When a peer N detects that it is no longer in the replica set for a resource R (i.e., there are three predecessors between N and R), it SHOULD delete all data associated with R from its local store. + When a peer discovers that its range of responsible IDs has changed, + it MUST send an UPDATE to all entries in its connection table. + 9.6.4. Stabilization There are four components to stabilization: 1. exchange Updates with all peers in its neighbor table to exchange state. 2. search for better peers to place in its finger table. 3. search to determine if the current finger table size is sufficiently large. 4. search to determine if the overlay has partitioned and needs to recover. @@ -4683,30 +4754,26 @@ entries are presented: Alternative 1: A peer selects one entry in the finger table from among the invalid entries. It pings for a new peer for that finger table entry. The selection SHOULD be exponentially weighted to attempt to replace earlier (lower i) entries in the finger table. A simple way to implement this selection is to search through the finger table entries from i=0 and each time an invalid entry is encountered, send a Ping to replace that entry with probability 0.5. - Alternative 2: Every "chord-reload-ping-interval" seconds, the peer - scans through its finger table and for each invalid finger table - entry i, sends a RouteQuery request for the ID n+2^(128-i) to the - closest preceding peer to that ID in the routing table.
The - responses to these route queries are used to identify the set of - entries for which a new Ping is likely to result in a valid entry: - the responses that contain a peer not currently in the finger table - indicate a Ping may result in a new valid entry for the finger table. - The peer then selects from among those candidates using an - exponentially weighted probability as above. + Alternative 2: A peer monitors the Update messages received from its + connections to observe when an Update indicates a peer that would be + used to replace an invalid finger table entry, i, and flags that + entry in the finger table. Every "chord-reload-ping-interval" + seconds, the peer selects from among those flagged candidates using + an exponentially weighted probability as above. When searching for a better entry, the peer SHOULD send the Ping to a Node-ID selected randomly from that range. Random selection is preferred over a search for strictly spaced entries to minimize the effect of churn on overlay routing [minimizing-churn-sigcomm06]. An implementation or subsequent specification MAY choose a method for selecting finger table entries other than choosing randomly within the range. Any such alternate methods SHOULD be employed only on finger table stabilization and not for the selection of initial finger table entries unless the alternative method is faster and @@ -4758,21 +4825,21 @@ determining when to repeat the discovery process. 9.7. Route Query For this topology plugin, the RouteQueryReq contains no additional information. The RouteQueryAns contains the single node ID of the next peer to which the responding peer would have routed the request message in recursive routing: struct { - NodeId next_id; + NodeId next_peer; } ChordRouteQueryAns; The contents of this structure are as follows: next_peer The peer to which the responding peer would route the message in order to deliver it to the destination listed in the request.
If the requester has set the send_update flag, the responder SHOULD initiate an Update immediately after sending the RouteQueryAns. @@ -4831,28 +4898,28 @@ p2p-overlay+xml" for an MIME entity that contains overlay information. An example document is shown below. - false + false 192.0.0.1:5678 192.0.2.2:6789 30 false - 10 + 10 < 4000 - https://example.org + https://example.org foo 300 400 false asecret chord DATA GOES HERE @@ -4907,27 +4974,27 @@ sequence: a monotonically increasing sequence number between 1 and 2^32 Inside each overlay element, the following elements can occur: topology-plugin This element has an attribute called algorithm-name that describes the overlay algorithm being used. root-cert This element contains a PEM encoded X.509v3 certificate that is a root trust anchor used to sign all certificates in this overlay. There can be more than one root-cert element. - credential-server This element contains the URL at which the - credential server can be reached in a "url" element. This URL - MUST be of type "https:". More than one credential-server element + enrollment-server This element contains the URL at which the + enrollment server can be reached in a "url" element. This URL + MUST be of type "https:". More than one enrollment-server element may be present. self-signed-permitted This element indicates whether self-signed certificates are permitted. If it is set to "true", then self- - signed certificates are allowed, in which case the credential- + signed certificates are allowed, in which case the enrollment- server and root-cert elements may be absent. Otherwise, it SHOULD be absent, but MAY be set to "false". This element also contains an attribute "digest" which indicates the digest to be used to compute the Node-ID. Valid values for this parameter are "SHA-1" and "SHA-256". Implementations MUST support both of these algorithms. bootstrap-node This element represents the address of one of the bootstrap nodes. 
It has an attribute called "address" that represents the IP address (either IPv4 or IPv6, since they can be distinguished) and an attribute called "port" that represents the @@ -4941,37 +5008,39 @@ multicast-bootstrap This element represents the address of a multicast, boradcast, or anycast address and port that may be used for bootstrap. Nodes SHOULD listen on the address. It has an attributed called "address" that represents the IP address and an attribute called "port" that represents the port. More than one "multicast-bootstrap" element may be present. clients-permitted This element represents whether clients are permitted or whether all nodes must be peers. If it is set to "TRUE" or absent, this indicates that clients are permitted. If it is set to "FALSE" then nodes MUST join as peers. - attach-lite-permitted This element represents whether nodes are - allowed to use the AttachLite and AppAttachLite request in this + ice-lite-permitted This element represents whether nodes are + allowed to use the "no-ICE" Overlay Link protocols. in this overlay. If it is absent, it is treated as if it were set to "FALSE". chord-update-interval The update frequency for the Chord-reload topology plugin (see Section 9). chord-ping-interval The ping frequency for the Chord-reload topology plugin (see Section 9). + chord-reload-reactive Whether reactive recovery should be used for + this overlay. (see Section 9). shared-secret If shared secret mode is used, this contains the shared secret. max-message-size Maximum size in bytes of any message in the overlay. If this value is not present, the default is 5000. initial-ttl Initial default TTL (time to live, see Section 5.3.2) for messages. If this value is not present, the default is 100. - kind-signer This contains a single Node-Id in hexadecimal and + kind-signer This contains a single Node-ID in hexadecimal and indicates that the certificate with this Node-ID is allowed to - sign kinds. 
Identifying kind-signer by Node-Id instead of + sign kinds. Identifying kind-signer by Node-ID instead of certificate allows the use of short lived certificates without constantly having to provide an updated configuration file. Inside each overlay element, the required-kinds elements can also occur. This element indicates the kinds that members must support and contains multiple kind-block elements that each define a single kind that MUST be supported by nodes in the overlay. Each kind-block consists of a single kind element and a kind-signature. The kind element defines the kind. The kind-signature is the signature computed over the kind element. @@ -5017,21 +5086,27 @@ When a node receives a new configuration file, it MUST change its configuration to meet the new requirements. This may require the node to exit the DHT and re-join. If a node is not capable of supporting the new requirements, it MUST exit the overlay. If some information about a particular kind changes from what the node previously knew about the kind (for example the max size), the new information in the configuration files overrides any previously learned information. If any kind data was signed by a node that is no longer allowed to sign kinds, that kind MUST be discarded along - with any stored information of that kind. + with any stored information of that kind. Note that forcing an + avalanche restart of the overlay with a configuration change that + requires re-joining the overlay may result in serious performance + problems, including total collapse of the network if configuration + parameters are not properly considered. Such an event may be + necessary in case of a compromised CA or similar problem, but for + large overlays should be avoided in almost all circumstances. 10.1.1. 
Relax NG Grammar The grammar for the configuration data is: namespace chord = "urn:ietf:params:xml:ns:p2p:config-chord" namespace local = "" default namespace p2pcf="urn:ietf:params:xml:ns:p2p:config-base" namespace rng = "http://relaxng.org/ns/structure/1.0" @@ -5054,23 +5129,23 @@ xsd:base64Binary }? } signature-algorithm-type |= "rsa-sha1" parameter &= element topology-plugin { topology-plugin-type } parameter &= element max-message-size { xsd:int }? parameter &= element initial-ttl { xsd:int }? parameter &= element root-cert { text }? parameter &= element required-kinds { kind-block* } - parameter &= element credential-server { xsd:anyURI }? + parameter &= element enrollment-server { xsd:anyURI }? parameter &= element kind-signer { text }* - parameter &= element attach-lite-permitted { xsd:boolean }? + parameter &= element ice-lite-permitted { xsd:boolean }? parameter &= element shared-secret { xsd:string }? parameter &= element clients-permitted { xsd:boolean }? parameter &= element turn-density { xsd:int }? parameter &= foreign-elements* parameter &= element self-signed-permitted { attribute digest { self-signed-digest-type }, xsd:boolean }? self-signed-digest-type |= "sha1" @@ -5145,23 +5219,23 @@ peer performs a GET to the URL. The result of the HTTP GET is an XML configuration file described above, which replaces any previously learned configuration file for this overlay. For overlays that do not use an enrollment server, nodes obtain the configuration information needed to join the overlay through some out of band approach such as an XML configuration file sent over email. 10.3. Credentials - If the configuration document contains a credential-server element, + If the configuration document contains an enrollment-server element, credentials are required to join the Overlay Instance. A peer which - does not yet have credentials MUST contact the credential server to + does not yet have credentials MUST contact the enrollment server to acquire them.
RELOAD defines its own trivial certificate request protocol. We would have liked to have used an existing protocol but were concerned about the implementation burden of even the simplest of those protocols, such as [RFC5272]) and [RFC5273]. Our objective was to have a protocol which could be easily implemented in a Web server which the operator did not control (e.g., in a hosted service) and was compatible with the existing certificate handling tooling as used with the Web certificate infrastructure. This means accepting bare @@ -5173,28 +5248,28 @@ request is an HTTP POST with the following properties: o If authentication is required, there is a URL parameter of "password" and "username" containing the user's name and password in the clear (hence the need for HTTPS) o The body is of content type "application/pkcs10", as defined in [RFC2311]. o The Accept header contains the type "application/pkix-cert", indicating the type that is expected in the response. - The credential server MUST authenticate the request using the + The enrollment server MUST authenticate the request using the provided user name and password. If the authentication succeeds and the requested user name is acceptable, the server generates and returns a certificate. The SubjectAltName field in the certificate contains the following values: o One or more Node-IDs which MUST be cryptographically random - [RFC4086]. Each MUST be chosen by the credential server in such a + [RFC4086]. Each MUST be chosen by the enrollment server in such a way that they are unpredictable to the requesting user. Each is placed in the subjectAltName using the uniformResourceIdentifier type and MUST contain RELOAD URIs as described in Section 13.12 and MUST contain a Destination list with a single entry of type "node_id". o A single name this user is allowed to use in the overlay, using type rfc822Name. The certificate is returned as type "application/pkix-cert", with an HTTP status code of 200 OK. 
Certificate processing errors should be @@ -5511,21 +5586,22 @@ | | | | | | | | | | | | 12. Security Considerations 12.1. Overview RELOAD provides a generic storage service, albeit one designed to be useful for P2PSIP. In this section we discuss security issues that - are likely to be relevant to any usage of RELOAD. + are likely to be relevant to any usage of RELOAD. More background + information can be found in [I-D.irtf-p2prg-rtc-security]. In any Overlay Instance, any given user depends on a number of peers with which they have no well-defined relationship except that they are fellow members of the Overlay Instance. In practice, these other nodes may be friendly, lazy, curious, or outright malicious. No security system can provide complete protection in an environment where most nodes are malicious. The goal of security in RELOAD is to provide strong security guarantees of some properties even in the face of a large number of malicious nodes and to allow the overlay to function correctly in the face of a modest number of malicious nodes. @@ -5568,28 +5644,28 @@ this data as well as securing, as well as possible, the routing in the overlay. Both types of security are based on requiring that every entity in the system (whether user or peer) authenticate cryptographically using an asymmetric key pair tied to a certificate. When a user enrolls in the Overlay Instance, they request or are assigned a unique name, such as "alice@dht.example.net". These names are unique and are meant to be chosen and used by humans much like a SIP Address of Record (AOR) or an email address. The user is also assigned one or more Node-IDs by the central enrollment authority. - Both the name and the peer ID are placed in the certificate, along + Both the name and the Peer-ID are placed in the certificate, along with the user's public key. 
Each certificate enables an entity to act in two sorts of roles: o As a user, storing data at specific Resource-IDs in the Overlay Instance corresponding to the user name. - o As a overlay peer with the peer ID(s) listed in the certificate. + o As an overlay peer with the Peer-ID(s) listed in the certificate. Note that since only users of this Overlay Instance need to validate a certificate, this usage does not require a global PKI. Instead, certificates are signed by a central enrollment authority which acts as the certificate authority for the Overlay Instance. This authority signs each peer's certificate. Because each peer possesses the CA's certificate (which they receive on enrollment) they can verify the certificates of the other entities in the overlay without further communication. Because the certificates contain the user/peer's public key, communications from the user/peer can be @@ -5762,48 +5838,48 @@ In general, attacks on DHT routing are mounted by the attacker arranging to route traffic through one or two nodes it controls. In the Eclipse attack [Eclipse] the attacker tampers with messages to and from nodes for which it is on-path with respect to a given victim node. This allows it to pretend to be all the nodes that are reachable through it. In the Sybil attack [Sybil], the attacker registers a large number of nodes and is therefore able to capture a large amount of the traffic through the DHT. Both the Eclipse and Sybil attacks require the attacker to be able to - exercise control over her peer IDs. The Sybil attack requires the + exercise control over her Peer-IDs. The Sybil attack requires the creation of a large number of peers. The Eclipse attack requires that the attacker be able to impersonate specific peers. In both cases, these attacks are limited by the use of centralized, certificate-based admission control. 12.6.2.
Admissions Control Admission to a RELOAD Overlay Instance is controlled by requiring - that each peer have a certificate containing its peer ID. The + that each peer have a certificate containing its Peer-ID. The requirement to have a certificate is enforced by using certificate- based mutual authentication on each connection. (Note: the following only applies when self-signed certificates are not used.) Whenever a peer connects to another peer, each side automatically - checks that the other has a suitable certificate. These peer IDs are + checks that the other has a suitable certificate. These Peer-IDs are randomly assigned by the central enrollment server. This has two benefits: o It allows the enrollment server to limit the number of peer IDs issued to any individual user. - o It prevents the attacker from choosing specific peer IDs. + o It prevents the attacker from choosing specific Peer-IDs. The first property allows protection against Sybil attacks (provided the enrollment server uses strict rate limiting policies). The second property deters but does not completely prevent Eclipse attacks. Because an Eclipse attacker must impersonate peers on the other side of the attacker, he must have a certificate for suitable - peer IDs, which requires him to repeatedly query the enrollment + Peer-IDs, which requires him to repeatedly query the enrollment server for new certificates, which will match only by chance. From the attacker's perspective, the difficulty is that if he only has a small number of certificates, the region of the Overlay Instance he is impersonating appears to be very sparsely populated by comparison to the victim's local region. 12.6.3. Peer Identification and Authentication In general, whenever a peer engages in overlay activity that might affect the routing table it must establish its identity. This @@ -5926,29 +6002,30 @@ IANA SHALL create a "RELOAD Data Kind-ID" Registry. 
Entries in this registry are 32-bit integers denoting data kinds, as described in Section 4.1.2. Code points in the range 0x00000001 to 0x7fffffff SHALL be registered via RFC 5226 Standards Action. Code points in the range 0x8000000 to 0xf0000000 SHALL be registered via RFC 5226 Expert Review. Code points in the range 0xf0000001 to 0xffffffff are reserved for private use via the kind description mechanism described in Section 10. The initial contents of this registry are: - +--------------+------------+----------+ + +---------------------+------------+----------+ | Kind | Kind-ID | RFC | - +--------------+------------+----------+ + +---------------------+------------+----------+ | INVALID | 0 | RFC-AAAA | | TURN_SERVICE | 2 | RFC-AAAA | - | CERTIFICATE | 3 | RFC-AAAA | + | CERTIFICATE_BY_NODE | 3 | RFC-AAAA | + | CERTIFICATE_BY_USER | 16 | RFC-AAAA | | Reserved | 0x7fffffff | RFC-AAAA | | Reserved | 0xffffffff | RFC-AAAA | - +--------------+------------+----------+ + +---------------------+------------+----------+ 13.5. Data Model IANA SHALL create a "RELOAD Data Model" Registry. Entries in this registry are 8-bit integers denoting data models, as described in Section 6.2. Code points in this registry SHALL be registered via RFC 5226 Standards Action. The initial contents of this registry are: +--------------+------+----------+ @@ -6028,34 +6105,35 @@ | Error_Unsupported_Forwarding_Option | 7 | RFC-AAAA | | Error_Data_Too_Large | 8 | RFC-AAAA | | Error_Data_Too_Old | 9 | RFC-AAAA | | Error_TTL_Exceeded | 10 | RFC-AAAA | | Error_Message_Too_Large | 11 | RFC-AAAA | | Error_Unknown_Kind | 12 | RFC-AAAA | | Error_Unknown_Extension | 13 | RFC-AAAA | | reserved | 0x8000..0xfffe | RFC-AAAA | +-------------------------------------+----------------+----------+ -13.8. Transport Types +13.8. Overlay Link Types - IANA shall create a "RELOAD Transport." New entries SHALL be defined - via RFC 5226 Standards Action. 
This registry SHALL be initially - populated with the following values: + IANA shall create a "RELOAD Overlay Link." New entries SHALL be + defined via RFC 5226 Standards Action. This registry SHALL be + initially populated with the following values: - +---------------------+------+---------------+ + +--------------------+------+---------------+ | Protocol | Code | Specification | - +---------------------+------+---------------+ + +--------------------+------+---------------+ | reserved | 0 | RFC-AAAA | - | UDP (DTLS over UDP) | 1 | RFC-AAAA | - | TCP (TLS over TCP) | 2 | RFC-AAAA | + | DTLS-UDP-SR | 1 | RFC-AAAA | + | DTLS-UDP-SR-NO-ICE | 3 | RFC-AAAA | + | TLS-TCP-FH-NO-ICE | 4 | RFC-AAAA | | reserved | 255 | RFC-AAAA | - +---------------------+------+---------------+ + +--------------------+------+---------------+ 13.9. Forwarding Options IANA shall create a "Forwarding Option Registry". Entries in this registry between 1 and 127 SHALL be defined via RFC 5226 Standards Action. Entries in this registry between 128 and 254 SHALL be defined via RFC 5226 Specification Required. This registry SHALL be initially populated with the following values: +-------------------+------+---------------+ @@ -6154,27 +6233,28 @@ (RELOAD)" draft by David A. Bryan, Marcia Zangrilli and Bruce B. Lowekamp, the "Address Settlement by Peer to Peer" draft by Cullen Jennings, Jonathan Rosenberg, and Eric Rescorla, the "Security Extensions for RELOAD" draft by Bruce B. Lowekamp and James Deverick, the "A Chord-based DHT for Resource Lookup in P2PSIP" by Marcia Zangrilli and David A. Bryan, and the Peer-to-Peer Protocol (P2PP) draft by Salman A. Baset, Henning Schulzrinne, and Marcin Matuszewski. Thanks to the authors of RFC 5389 for text included from that. Vidya Narayanan provided many comments and improvements. - The ideas for the Chord specific extension data to the Leave - mechanisms and text provided by J. Maenpaa, G. Camarillo, and J.
+ The ideas and text for the Chord specific extension data to the Leave + mechanisms were provided by J. Maenpaa, G. Camarillo, and J. Hautakorpi. Thanks to the many people who contributed including Ted Hardie, Michael Chen, Dan York, Das Saumitra, Lyndsay Campbell, Brian Rosen, - David Bryan, Michael Chen, Dave Craig, and Julian Cain. + David Bryan, Dave Craig, and Julian Cain. Extensive working last + call comments were provided by: TODO 15. References 15.1. Normative References [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. [I-D.ietf-mmusic-ice] Rosenberg, J., "Interactive Connectivity Establishment @@ -6195,26 +6275,20 @@ [RFC5273] Schaad, J. and M. Myers, "Certificate Management over CMS (CMC): Transport Protocols", RFC 5273, June 2008. [RFC5272] Schaad, J. and M. Myers, "Certificate Management over CMS (CMC)", RFC 5272, June 2008. [RFC4279] Eronen, P. and H. Tschofenig, "Pre-Shared Key Ciphersuites for Transport Layer Security (TLS)", RFC 4279, December 2005. - [I-D.ietf-mmusic-ice-tcp] - Rosenberg, J., "TCP Candidates with Interactive - Connectivity Establishment (ICE)", - draft-ietf-mmusic-ice-tcp-07 (work in progress), - July 2008. - [RFC5246] Dierks, T. and E. Rescorla, "The Transport Layer Security (TLS) Protocol Version 1.2", RFC 5246, August 2008. [RFC4347] Rescorla, E. and N. Modadugu, "Datagram Transport Layer Security", RFC 4347, April 2006. [RFC5348] Floyd, S., Handley, M., Padhye, J., and J. Widmer, "TCP Friendly Rate Control (TFRC): Protocol Specification", RFC 5348, September 2008. @@ -6227,26 +6301,32 @@ [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform Resource Identifier (URI): Generic Syntax", STD 66, RFC 3986, January 2005. [RFC4395] Hansen, T., Hardie, T., and L. Masinter, "Guidelines and Registration Procedures for New URI Schemes", BCP 35, RFC 4395, February 2006. 15.2.
Informative References + [I-D.ietf-mmusic-ice-tcp] + Rosenberg, J., "TCP Candidates with Interactive + Connectivity Establishment (ICE)", + draft-ietf-mmusic-ice-tcp-07 (work in progress), + July 2008. + [I-D.maenpaa-p2psip-self-tuning] Maenpaa, J., Camarillo, G., and J. Hautakorpi, "A Self- tuning Distributed Hash Table (DHT) for REsource LOcation And Discovery (RELOAD)", - draft-maenpaa-p2psip-self-tuning-00 (work in progress), - February 2009. + draft-maenpaa-p2psip-self-tuning-01 (work in progress), + October 2009. [I-D.baset-tsvwg-tcp-over-udp] Baset, S. and H. Schulzrinne, "TCP-over-UDP", draft-baset-tsvwg-tcp-over-udp-01 (work in progress), June 2009. [RFC5201] Moskowitz, R., Nikander, P., Jokela, P., and T. Henderson, "Host Identity Protocol", RFC 5201, April 2008. [RFC4828] Floyd, S. and E. Kohler, "TCP Friendly Rate Control @@ -6358,20 +6438,30 @@ Kaashoek, M., Dabek, F., and H. Balakrishnan, "Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications", IEEE/ACM Transactions on Networking Volume 11, Issue 1, 17-32, Feb 2003. [vulnerabilities-acsac04] Srivatsa, M. and L. Liu, "Vulnerabilities and Security Threats in Structured Peer-to-Peer Systems: A Quantitative Analysis", ACSAC 2004. + [I-D.irtf-p2prg-rtc-security] + Schulzrinne, H., Marocco, E., and E. Ivov, "Security + Issues and Solutions in Peer-to-peer Systems for Realtime + Communications", draft-irtf-p2prg-rtc-security-05 (work in + progress), September 2009. + + [handling-churn-usenix04] + Rhea, S., Geels, D., Roscoe, T., and J. Kubiatowicz, + "Handling Churn in a DHT", USENIX 2004. + [minimizing-churn-sigcomm06] Godfrey, P., Shenker, S., and I. Stoica, "Minimizing Churn in Distributed Systems", SIGCOMM 2006. Appendix A. Change Log A.1. Changes since draft-ietf-p2psip-reload-04 o Renamed the XML element in configuration files from to . @@ -6438,86 +6527,44 @@ fragments, indirect attack. o Updates to trivial sender/receiver text. o Updates to data model based on list discussion. 
o Updates to chord overlay algorithm section. o Added AppAttach and removed port number from Attach. o Changed via-list to use shorter structure. o Rewrote fragmentation. o Moved AIMD and TFRC congestion control algorithms to appendix until further WG effort decides direction there. -Appendix B. AIMD Retransmission Scheme - - This Appendix specifies an optional sender retransmission algorithm - with better performance that senders MAY implement and is based on - the AIMD algorithm in TCP. The algorithm here is only the AIMD - portion of TCP. All other features are restricted to simplify the - implementation, i.e. no slow start (initial window is 1), no fast - retransmission, and no fast recovery. - - AIMD extends stop and wait defined in Section 5.6.2.2 by allowing - multiple fragments to be pending at the same time. The sender allows - w unacknowledged fragments to be outstanding at any given time. w is - initially set to one. In each RTO interval in which no - retransmissions occur, w is increased by one. When a loss occurs, w - is halved. After halving w, if there are more than w fragments for - which an ACK is pending, no further retransmissions of the most - recently initiated fragments are performed until they fit in the - window w, at which point they begin the retransmission algorithm - again. The value w is held fixed for one RTO. After that point, if - additional retransmissions occur, it will be halved again; otherwise - it may be incremented after an additional RTO without loss. - - If w drops to one and the one pending fragment is not ACKed by the - other side after 5 requests are sent, the link is considered to have - failed. Otherwise, unACKed fragments are simply dropped. - -Appendix C. TFRC Retransmission Scheme - - This Appendix specifies an optional TFRC (RFC5348) based scheme that - can be implemented in the Sender-Based Variant format with the same - receiver algorithm.
This implementation requires the sender to - maintain precise timestamps in ms of the transmission time of each - sequence number as well as the segment sizes. That, combined with - the ACKs, allows the sender to calculate the performance required by - TFRC, including information calculated by the receiver in the - conventional form of TFRC. - - TFRC is used for congestion control. For reliability, an individual - fragment is retransmitted up to twice at RTO intervals, pending the - availability of room in the congestion window. If it is not ACKed - after another RTO following its last retransmission, it is dropped. - -Appendix D. Routing Alternatives +Appendix B. Routing Alternatives Significant discussion has been focused on the selection of a routing algorithm for P2PSIP. This section discusses the motivations for selecting symmetric recursive routing for RELOAD and describes the extensions that would be required to support additional routing algorithms. -D.1. Iterative vs Recursive +B.1. Iterative vs Recursive Iterative routing has a number of advantages. It is easier to debug, consumes fewer resources on intermediate peers, and allows the querying peer to identify and route around misbehaving peers [non-transitive-dhts-worlds05]. However, in the presence of NATs, iterative routing is intolerably expensive because a new connection must be established for each hop (using ICE) [bryan-design-hotp2p08]. Iterative routing is supported through the Route_Query mechanism and is primarily intended for debugging. It also allows the querying peer to evaluate the routing decisions made by the peers at each hop, consider alternatives, and perhaps detect at what point the forwarding path fails. -D.2. Symmetric vs Forward response +B.2. 
Symmetric vs Forward response An alternative to the symmetric recursive routing method used by RELOAD is Forward-Only routing, where the response is routed to the requester as if it were a new message initiated by the responder (in the previous example, Z sends the response to A as if it were sending a request). Forward-only routing requires no state in either the message or intermediate peers. The drawback of forward-only routing is that it does not work when the overlay is unstable. For example, if A is in the process of @@ -6532,21 +6579,21 @@ path is more likely to have a failed peer than is the request path (which was just tested to route the request) [non-transitive-dhts-worlds05]. An extension to RELOAD that supports forward-only routing but relies on symmetric responses as a fallback would be possible, but due to the complexities of determining when to use forward-only and when to fallback to symmetric, we have chosen not to include it as an option at this point. -D.3. Direct Response +B.3. Direct Response Another routing option is Direct Response routing, in which the response is returned directly to the querying node. In the previous example, if A encodes its IP address in the request, then Z can simply deliver the response directly to A. In the absence of NATs or other connectivity issues, this is the optimal routing technique. The challenge of implementing direct response is the presence of NATs. There are a number of complexities that must be addressed. In this discussion, we will continue our assumption that A issued the @@ -6582,21 +6628,21 @@ [RFC4787], and no clear recommendation is available, the prevalence of this feature in future devices remains unclear. 
An extension to RELOAD that supports direct response routing but relies on symmetric responses as a fallback would be possible, but due to the complexities of determining when to use direct response and when to fallback to symmetric, and the reduced performance for responses to peers behind restrictive NATs, we have chosen not to include it as an option at this point. -D.4. Relay Peers +B.4. Relay Peers SEP [I-D.jiang-p2psip-sep] has proposed implementing a form of direct response by having A identify a peer, Q, that will be directly reachable by any other peer. A uses Attach to establish a connection with Q and advertises Q's IP address in the request sent to Z. Z sends the response to Q, which relays it to A. This then reduces the latency to two hops, plus Z negotiating a secure connection to Q. This technique relies on the relative population of nodes such as A that require relay peers and peers such as Q that are capable of @@ -6608,21 +6654,21 @@ An extension to RELOAD that supports relay peers is possible, but due to the complexities of implementing such an alternative, we have not added such a feature to RELOAD at this point. A concept similar to relay peers, essentially choosing a relay peer at random, has previously been suggested to solve problems of pairwise non-transitivity [non-transitive-dhts-worlds05], but deterministic filtering provided by NATs makes random relay peers no more likely to work than the responding peer. -D.5. Symmetric Route Stability +B.5. Symmetric Route Stability A common concern about symmetric recursive routing has been that one or more peers along the request path may fail before the response is received. The significance of this problem essentially depends on the response latency of the overlay. An overlay that produces slow responses will be vulnerable to churn, whereas responses that are delivered very quickly are vulnerable only to failures that occur over that small interval. 
The other aspect of this issue is whether the request itself can be @@ -6637,28 +6683,28 @@ An overlay that is unstable enough to suffer this type of failure frequently is unlikely to be able to support reliable functionality regardless of the routing mechanism. However, regardless of the stability of the return path, studies show that in the event of high churn, iterative routing is a better solution to ensure request completion [lookups-churn-p2p06] [non-transitive-dhts-worlds05]. Finally, because RELOAD retries the end-to-end request, that retry will address the issues of churn that remain. -Appendix E. Why Clients? +Appendix C. Why Clients? There are a wide variety of reasons a node may act as a client rather than as a peer [I-D.pascual-p2psip-clients]. This section outlines some of those scenarios and how the client's behavior changes based on its capabilities. -E.1. Why Not Only Peers? +C.1. Why Not Only Peers? For a number of reasons, a particular node may be forced to act as a client even though it is willing to act as a peer. These include: o The node does not have appropriate network connectivity, typically because it has a low-bandwidth network connection. o The node may not have sufficient resources, such as computing power, storage space, or battery power. o The overlay algorithm may dictate specific requirements for peer selection. These may include participating in the overlay to @@ -6670,21 +6716,21 @@ the overlay algorithm and specific deployment. A node acting as a client that has a full implementation of RELOAD and the appropriate overlay algorithm is capable of locating its responsible peer in the overlay and using Attach to establish a direct connection to that peer. In that way, it may elect to be reachable under either of the routing approaches listed above.
Particularly for overlay algorithms that elect nodes to serve as peers based on trustworthiness or population, the overlay algorithm may require such a client to locate itself at a particular place in the overlay. -E.2. Clients as Application-Level Agents +C.2. Clients as Application-Level Agents SIP defines an extensive protocol for registration and security between a client and its registrar/proxy server(s). Any SIP device can act as a client of a RELOAD-based P2PSIP overlay if it contacts a peer that implements the server-side functionality required by the SIP protocol. In this case, the peer would be acting as if it were the user's peer, and would need the appropriate credentials for that user. Application-level support for clients is defined by a usage. A usage @@ -6698,26 +6744,26 @@ Cisco 170 West Tasman Drive MS: SJC-21/2 San Jose, CA 95134 USA Phone: +1 408 421-9990 Email: fluffy@cisco.com Bruce B. Lowekamp (editor) - MYMIC LLC - 1040 University Blvd., Suite 100 - Portsmouth, VA 23703 + Skype + Palo Alto, CA USA Email: bbl@lowekamp.net + Eric Rescorla Network Resonance 2064 Edgewood Drive Palo Alto, CA 94303 USA Phone: +1 650 320-8549 Email: ekr@networkresonance.com Salman A. Baset