--- 1/draft-ietf-ippm-metrictest-00.txt 2010-10-24 17:15:37.000000000 +0200 +++ 2/draft-ietf-ippm-metrictest-01.txt 2010-10-24 17:15:37.000000000 +0200 @@ -1,23 +1,23 @@ Internet Engineering Task Force R. Geib, Ed. Internet-Draft Deutsche Telekom Intended status: Standards Track A. Morton -Expires: January 3, 2011 AT&T Labs +Expires: April 27, 2011 AT&T Labs R. Fardid Cariden Technologies A. Steinmitz HS Fulda - July 2, 2010 + October 24, 2010 IPPM standard advancement testing - draft-ietf-ippm-metrictest-00 + draft-ietf-ippm-metrictest-01 Abstract This document specifies tests to determine if multiple independent instantiations of a performance metric RFC have implemented the specifications in the same way. This is the performance metric equivalent of interoperability, required to advance RFCs along the standards track. Results from different implementations of metric RFCs will be collected under the same underlying network conditions and compared using state of the art statistical methods. The goal is @@ -33,21 +33,21 @@ Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." - This Internet-Draft will expire on January 3, 2011. + This Internet-Draft will expire on April 27, 2011. Copyright Notice Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents @@ -57,41 +57,45 @@ the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 6 2. Basic idea . . . . . . . . . . . . . . . . . . . . . . . . . . 6 3. Verification of conformance to a metric specification . . . . 8 3.1. Tests of an individual implementation against a metric - specification . . . . . . . . . . . . . . . . . . . . . . 8 + specification . . . . . . . . . . . . . . . . . . . . . . 9 3.2. Test setup resulting in identical live network testing - conditions . . . . . . . . . . . . . . . . . . . . . . . . 9 + conditions . . . . . . . . . . . . . . . . . . . . . . . . 11 3.3. Tests of two or more different implementations against - a metric specification . . . . . . . . . . . . . . . . . . 14 - 3.4. Clock synchronisation . . . . . . . . . . . . . . . . . . 14 - 3.5. Recommended Metric Verification Measurement Process . . . 15 - 3.6. Miscellaneous . . . . . . . . . . . . . . . . . . . . . . 19 - 4. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 19 - 5. Contributors . . . . . . . . . . . . . . . . . . . . . . . . . 19 - 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 20 - 7. Security Considerations . . . . . . . . . . . . . . . . . . . 20 - 8. References . . . . . . . . . . . . . . . . . . . . . . . . . . 20 - 8.1. Normative References . . . . . . . . . . . . . . . . . . . 20 - 8.2. 
Informative References . . . . . . . . . . . . . . . . . .  21
-  Appendix A.  An example on a One-way Delay metric validation . . .  22
-    A.1.  Compliance to Metric specification requirements  . . . . .  22
-    A.2.  Examples related to statistical tests for One-way Delay  .  24
-  Appendix B.  Anderson-Darling 2 sample C++ code  . . . . . . . . .  25
-  Appendix C.  Glossary . . . . . . . . . . . . . . . . . . . . . .  34
-  Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . .  35
+          a metric specification . . . . . . . . . . . . . . . . . .  15
+    3.4.  Clock synchronisation . . . . . . . . . . . . . . . . . .  16
+    3.5.  Recommended Metric Verification Measurement Process . . .  17
+    3.6.  Miscellaneous . . . . . . . . . . . . . . . . . . . . . .  20
+    3.7.  Proposal to determine an "equivalence" threshold for
+          each metric evaluated . . . . . . . . . . . . . . . . . .  21
+  4.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . .  22
+  5.  Contributors . . . . . . . . . . . . . . . . . . . . . . . . .  22
+  6.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . .  22
+  7.  Security Considerations  . . . . . . . . . . . . . . . . . . .  22
+  8.  References . . . . . . . . . . . . . . . . . . . . . . . . . .  23
+    8.1.  Normative References . . . . . . . . . . . . . . . . . . .  23
+    8.2.  Informative References . . . . . . . . . . . . . . . . . .  24
+  Appendix A.  An example on a One-way Delay metric validation . . .  25
+    A.1.  Compliance to Metric specification requirements  . . . . .  25
+    A.2.  Examples related to statistical tests for One-way Delay  .  26
+  Appendix B.  Anderson-Darling 2 sample C++ code  . . . . . . . . .  28
+  Appendix C.  A tunneling set-up for remote metric
+               implementation testing . . . . . . . . . . . . . . .  36
+  Appendix D.  Glossary . . . . . . . . . . . . . . . . . . . . . .  38
+  Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . .  38

1.  Introduction

   The Internet Standards Process RFC2026 [RFC2026] requires that for
   an IETF specification to advance beyond the Proposed Standard level,
   at least two genetically unrelated implementations must be shown to
   interoperate correctly with all features and options.  This
   requirement can be met by supplying:

   o  evidence that (at least a sub-set of) the specification has been

@@ -185,20 +189,37 @@

   The metric RFC advancement process begins with a request for
   protocol action accompanied by a memo that documents the supporting
   tests and results.  The procedures of [RFC2026] are expanded in
   [RFC5657], including sample implementation and interoperability
   reports.  Section 3 of [morton-advance-metrics-01] can serve as a
   template for a metric RFC report which accompanies the protocol
   action request to the Area Director, including description of the
   test set-up, procedures, results for each implementation and
   conclusions.

+  Changes from WG -00 to WG -01 draft
+
+  o  Discussion of the merits and requirements of a distributed lab
+     test using only local load generators.
+
+  o  Proposal of metrics suitable for tests using the proposed
+     measurement configuration.
+
+  o  Note on the delay caused by software-based L2TPv3
+     implementations.
+
+  o  Added an appendix with a test configuration allowing remote tests
+     comparing different implementations across the network.
+
+  o  Proposal for a maximum error of "equivalence", based on a
+     performance comparison of identical implementations.  This may be
+     useful for both ADK and non-ADK comparisons.
+
   Changes from prior ID -02 to WG -00 draft

   o  Incorporation of aspects of reporting to support the protocol
      action request in the Introduction and section 3.5

   o  Overhaul of section 3.2 regarding tunneling: Added generic
      tunneling requirements and L2TPv3 as an example tunneling
      mechanism fulfilling the tunneling requirements.  Removed and
      adapted some of the prior references to other tunneling protocols

@@ -336,20 +356,62 @@

3.  Verification of conformance to a metric specification

   This section specifies how to verify compliance of two or more IPPM
   implementations against a metric specification.  This document only
   proposes a general methodology.  Compliance criteria to a specific
   metric implementation need to be defined for each individual metric
   specification.  The only exception is the statistical test comparing
   two metric implementations which are simultaneously tested.  This
   test is applicable without metric specific decision criteria.

+  Several testing options exist to compare two or more
+  implementations:
+
+  o  Use a single test lab to compare the implementations and emulate
+     the Internet with an impairment generator.
+
+  o  Use a single test lab to compare the implementations and measure
+     across the Internet.
+
+  o  Use remotely separated test labs to compare the implementations
+     and emulate the Internet with two "identically" configured
+     impairment generators.
+
+  o  Use remotely separated test labs to compare the implementations
+     and measure across the Internet.
+
+  o  Use remotely separated test labs to compare the implementations,
+     measure across the Internet and include a single impairment
+     generator to impact all measurement flows in a non-discriminatory
+     way.
+
+  The first two approaches work, but are more expensive than the other
+  ones (due to travel and/or shipping and installation).  For the
+  third option, ensuring two identically configured impairment
+  generators requires well-defined test cases and possibly identical
+  hardware and software.  >>>Comment: for some specific tests,
+  impairment generator accuracy requirements are less demanding than
+  others, and in such cases there is more flexibility in impairment
+  generator configuration. <<<
+
+  It is a fair question whether the last two options can result in an
+  applicable test set-up at all.  While an experimental approach is
+  given in Appendix C, the tradeoff that measurement packets of
+  different sites pass the same path segments, but always in a
+  different order, probably can't be avoided.
+
+  The question of which option above results in identical networking
+  conditions and is broadly accepted can't be answered without more
+  practical experience in comparing implementations.  The last
+  proposal has the advantage that, while the measurement equipment is
+  remotely distributed, a single network impairment generator and the
+  Internet can be used in combination to impact all measurement flows.
+
3.1.  Tests of an individual implementation against a metric
      specification

   A metric implementation MUST support the requirements classified as
   "MUST" and "REQUIRED" of the related metric specification to be
   compliant with the latter.

   Further, supported options of a metric implementation SHOULD be
   documented in sufficient detail.  The documentation of chosen
   options is RECOMMENDED to minimise (and recognise) differences in
   the test

@@ -539,27 +601,33 @@

   length of 4 Bytes.  By the time of writing, between 1 and 4 labels
   seems to be a fair estimate of what to expect.
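   [Editor's illustration, not draft text]  The encapsulation budget
   implied by the label-stack discussion above can be sketched as
   follows.  The header sizes, the outer MTU, the L2TPv3 cookie length
   and the label stack depth below are assumptions chosen for the
   example only; the exact overhead depends on the negotiated tunnel
   options.

   // Illustrative only: estimate the overhead of an Ethernet-over-
   // L2TPv3 tunnel plus an assumed MPLS label stack, and derive the
   // largest measurement packet that avoids fragmentation.
   #include <iostream>

   int main() {
       const int outer_mtu      = 1500;   // assumed IP MTU on the path
       const int ipv4_header    = 20;     // outer IPv4 header, no options
       const int l2tpv3_session = 4;      // L2TPv3-over-IP session header
       const int l2tpv3_cookie  = 8;      // optional cookie, worst case
       const int ethernet_hdr   = 14 + 4; // tunneled Ethernet + 802.1Q tag
       const int mpls_labels    = 4;      // assumed label stack depth (1..4)
       const int mpls_overhead  = mpls_labels * 4;  // 4 bytes per label

       const int overhead = ipv4_header + l2tpv3_session + l2tpv3_cookie
                            + ethernet_hdr + mpls_overhead;

       std::cout << "worst-case tunnel overhead: " << overhead
                 << " bytes\n"
                 << "largest unfragmented measurement packet: "
                 << outer_mtu - overhead << " bytes\n";
       return 0;
   }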
   The applicability of one or more of the following tunneling
   protocols may be investigated by interested parties if Ethernet over
   L2TPv3 is felt to be not suitable: IP in IP [RFC2003] or Generic
   Routing Encapsulation (GRE) [RFC2784].  RFC 4928 [RFC4928] proposes
   measures to avoid ECMP treatment in MPLS networks.

   L2TP is a commodity tunneling protocol [RFC2661].  By the time of
-  writing, L2TPv3 [RFC3931]is the latest version of L2TP.
+  writing, L2TPv3 [RFC3931] is the latest version of L2TP.  If L2TPv3
+  is applied, software-based implementations of this protocol are not
+  suitable for the test set-up, as such implementations may cause
+  unpredictable delay shifts.

   Ethernet Pseudo Wires may also be set up on MPLS networks [RFC4448].
   While there's no technical issue with this solution, MPLS interfaces
   are mostly found in the network provider domain.  Hence not all of
   the above tunneling criteria are met.

+  Appendix C provides an experimental tunneling set-up for metric
+  implementation testing between two (or more) remote sites.
+
   Each test is repeated several times.  WAN conditions may change over
   time.  Sequential testing is desirable, but may not be a useful
   metric test option.  It is RECOMMENDED that tests be carried out by
   establishing N different parallel measurement flows.  Two or three
   linecards per implementation serving to send or receive measurement
   flows should be sufficient to create 5 or more parallel measurement
   flows.  If three linecards are used, each card sends and receives 2
   flows.  Other options are to separate flows by DiffServ marks
   (without deploying any QoS in the inner or outer tunnel) or using a
   single CBR flow and evaluating every n-th singleton to belong to a

@@ -688,21 +756,22 @@

   In order to meet their obligations under the IETF Standards Process,
   the IESG must be convinced that each metric specification advanced
   to Draft Standard or Internet Standard status is clearly written,
   that there are the required multiple verifiably equivalent
   implementations, and that all options have been implemented.

   In the context of this document, metrics are designed to measure
   some characteristic of a data network.  An aim of any metric
   definition should be that it is specified in a way that can reliably
-  measure the specific characteristic in a repeatable way.
+  measure the specific characteristic in a repeatable way across
+  multiple independent implementations.

   Each metric, statistic or option of those to be validated MUST be
   compared against a reference measurement or another implementation
   by at least 5 different basic data sets, each one with sufficient
   size to reach the specified level of confidence, as specified by
   this document.

   Finally, the metric definitions, embodied in the text of the RFCs,
   are the objects that require evaluation and possible revision in
   order to advance to the next step on the standards track.

@@ -715,49 +784,49 @@

   THEN the details of each implementation should be audited along with
   the exact definition text, to determine if there is a lack of
   clarity that has caused the implementations to vary in a way that
   affects the correspondence of the results.

   IF there was a lack of clarity or multiple legitimate
   interpretations of the definition text,

   THEN the text should be modified and the resulting memo proposed for
-  consensus and advancement along the standards track.
+  consensus and (possible) advancement along the standards track.
Finally, all the findings MUST be documented in a report that can support advancement on the standards track, similar to those described in [RFC5657]. The list of measurement devices used in testing satisfies the implementation requirement, while the test results provide information on the quality of each specification in the metric RFC (the surrogate for feature interoperability). The complete process of advancing a metric specification to a standard as defined by this document is illustrated in Figure 3. ,---. / \ ( Start ) \ / Implementations `-+-' +-------+ | /| 1 `. +---+----+ / +-------+ `.-----------+ ,-------. | RFC | / |Check for | ,' was RFC `. YES - | | / |Equivalence.... clause x --------+ + | | / |Equivalence.... clause x ------+ | |/ +-------+ |under | `. clear? ,' | - | Metric \.....| 2 ....relevant | `---+---' +----+---+ + | Metric \.....| 2 ....relevant | `---+---' +----+-----+ | Metric |\ +-------+ |identical | No | |Report | | Metric | \ |network | +--+----+ |results+| | ... | \ |conditions | |Modify | |Advance | - | | \ +-------+ | | |Spec +----+RFC | - +--------+ \| n |.'+-----------+ +-------+ |request | - +-------+ +--------+ + | | \ +-------+ | | |Spec +--+RFC | + +--------+ \| n |.'+-----------+ +-------+ |request(?)| + +-------+ +----------+ Illustration of the metric standardisation process Figure 3 Any recommendation for the advancement of a metric specification MUST be accompanied by an implementation report, as is the case with all requests for the advancement of IETF specifications. The implementation report needs to include the tests performed, the applied test setup, the specific metrics in the RFC and reports of @@ -833,52 +902,111 @@ o Different IP options. o Different DSCP. o If the N measurements are captured using sequential measurements instead of simultaneous ones, then the following factors come into play: Time varying paths and load conditions. 3.6. Miscellaneous - In the case that a metric validation requires capturing rare events, - an impairment generator may have to be added to the test set up. + A minimum amount of singletons per metric is required if results are + to be compared. To avoid accidental singletons from impacting a + metric comparison, a minimum number of 5 singletons per compared + interval was proposed above. Commercial Internet service is not + operated to reliably create enough rare events of singletons to + characterize bad measurement engineering or bad implementations. In + the case that a metric validation requires capturing rare events, an + impairment generator may have to be added to the test set up. Inclusion of an impairment generator and the parameterisation of the - impairments generated MUST be documented. Rare events could be - packet duplications, packet loss rates above one digit percentages, - loss patterns or packet re-ordering and so on. + impairments generated MUST be documented. + + A metric characterising a common impairment condition would be one, + which by expectation creates a singleton result for each measured + packet. Delay or Delay Variation are examples of this type, and in + such cases, the Internet may be used to compare metric + implementations. + + Rare events are those, where by expectation no or a rather low number + of "event is present" singletons are captured during a measurement + interval. Packet duplications, packet loss rates above one digit + percentages, loss patterns and packet reordering are examples. 
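   [Editor's illustration, not draft text]  The five-singleton minimum
   proposed above can be checked mechanically before a comparison is
   attempted.  The per-interval event counts and the flagging logic in
   the following sketch are assumptions of the example, not part of the
   methodology text.

   // Illustrative sketch: apply the "at least 5 singletons per
   // compared interval" rule of Section 3.6 to made-up example data
   // before running a comparison between two implementations.
   #include <cstddef>
   #include <iostream>
   #include <vector>

   int main() {
       const std::size_t kMinSingletons = 5;  // minimum proposed above
       // "Event is present" singletons (e.g. lost packets) observed
       // per compared interval, per implementation (invented numbers).
       std::vector<std::size_t> events_impl1 = {7, 0, 12, 5, 3};
       std::vector<std::size_t> events_impl2 = {6, 1, 10, 5, 2};

       for (std::size_t i = 0; i < events_impl1.size(); ++i) {
           bool usable = events_impl1[i] >= kMinSingletons &&
                         events_impl2[i] >= kMinSingletons;
           std::cout << "interval " << i
                     << (usable
                         ? ": enough singletons, usable for comparison\n"
                         : ": too few rare events, consider an impairment"
                           " generator\n");
       }
       return 0;
   }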
+  Note especially that a packet reordering or loss pattern metric
+  implementation comparison may require a more sophisticated test
+  set-up than described here.  Spatial and temporal effects combine in
+  the case of packet re-ordering, and measurements with different
+  packet rates may always lead to different results.

   As specified above, 5 singletons are the recommended basis to
   minimise interference of random events with the statistical test
   proposed by this document.  In the case of ratio measurements (like
   packet loss), the underlying sum of basic events, against which the
   metric's monitored singletons are "rated", determines the resolution
   of the test.  A packet loss statistic with a resolution of 1%
   requires one packet loss statistic-datapoint to consist of 500 delay
   singletons (of which at least 5 were lost).  To compare EDFs on
   packet loss requires one hundred such statistics per flow.  That
   means, all in all, at least 50 000 delay singletons are required per
   single measurement flow.  Live network packet loss is assumed to be
   present during main traffic hours only.  Let this interval be 5
   hours.  The required minimum rate of a single measurement flow in
   that case is 2.8 packets/sec (assuming a loss of 1% during 5 hours).
   If this measurement is too demanding under live network conditions,
   an impairment generator should be used.

+3.7.  Proposal to determine an "equivalence" threshold for each metric
+      evaluated
+
+  This section describes a proposal for the maximum error of
+  "equivalence", based on a performance comparison of identical
+  implementations.  This comparison may be useful for both ADK and
+  non-ADK comparisons.
+
+  Each metric is tested by two or more implementations (cross-
+  implementation testing).
+
+  Each metric is also tested twice simultaneously by the *same*
+  implementation, using different Src/Dst Address pairs and other
+  differences such that the connectivity differences of the cross-
+  implementation tests are also experienced and measured by the same
+  implementation.
+
+  Comparative results for the same implementation represent a bound on
+  cross-implementation equivalence.  This should be particularly
+  useful when the metric does *not* produce a continuous distribution
+  of singleton values, such as with a loss metric or a duplication
+  metric.  Appendix A indicates how the ADK will work for One-way
+  Delay, and should be likewise applicable to distributions of delay
+  variation.
+
+  Proposal: the implementation with the largest difference in
+  homogeneous comparison results is the lower bound on the equivalence
+  threshold, noting that there may be other systematic errors to
+  account for when comparing between implementations.
+
+  Thus, when evaluating equivalence in cross-implementation results:
+
+     Maximum_Error = Same_Implementation_Error + Systematic_Error
+
+  and only the systematic error need be decided beforehand.
+
+  In the case of ADK comparison, the largest same-implementation
+  resolution of distribution equivalence can be used as a limit on
+  cross-implementation resolutions (at the same confidence level).
+
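   [Editor's illustration, not draft text]  A minimal numerical sketch
   of the proposal above, with made-up error values (the draft does not
   specify any): the largest same-implementation difference is taken as
   the lower bound, an agreed systematic error allowance is added, and
   the sum is used as the equivalence threshold for cross-
   implementation results.

   // Illustrative sketch of the Section 3.7 proposal; all numbers are
   // invented for the example.
   #include <algorithm>
   #include <iostream>
   #include <vector>

   int main() {
       // Differences observed when each implementation is compared
       // against itself (two simultaneous flows of the same
       // implementation), e.g. in microseconds of median one-way delay.
       std::vector<double> same_impl_error = {120.0, 180.0, 95.0};

       // Systematic error allowance agreed beforehand (assumption).
       const double systematic_error = 50.0;

       // Largest homogeneous difference is the lower bound.
       double same_error = *std::max_element(same_impl_error.begin(),
                                             same_impl_error.end());
       double max_error = same_error + systematic_error;

       double cross_impl_difference = 210.0;  // example result
       std::cout << "equivalence threshold: " << max_error << " us\n"
                 << (cross_impl_difference <= max_error
                         ? "implementations may be considered equivalent\n"
                         : "difference exceeds the equivalence threshold\n");
       return 0;
   }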
4.  Acknowledgements

   Gerhard Hasslinger commented a first version of this document,
   suggested statistical tests and the evaluation of time series
-  information.
+  information.  Henk Uijterwaal and Lars Eggert have encouraged and
+  helped to organize this work.  Mike Hamilton, Scott Bradner, David
+  Mcdysan and Emile Stephan commented on this draft.  Carol Davids
+  reviewed the 01 version of the ID before it was promoted to WG
+  draft.

5.  Contributors

   Scott Bradner, Vern Paxson and Allison Mankin drafted
   bradner-metrictest [bradner-metrictest], and major parts of it are
   included in this document.

6.  IANA Considerations

   This memo includes no request to IANA.

@@ -1132,21 +1259,27 @@

   as shown in the first two columns of table 1 clearly fails an ADK
   test with 95% confidence.

   The results of Implemnt_2 are now reduced by the difference of the
   averages of column 2 (rounded to 6581 us) and column 1 (rounded to
   5029 us), which is 1552 us.  The result may be found in column 3 of
   table 1.  Comparing column 1 and column 3 of the table by an ADK
   test shows that the data contained in these columns passes an ADK
   test with 95% confidence.

+  >>> Comment: Extensive averaging was used in this example because of
+  the vastly different sampling frequencies.  As a result, the
+  distributions compared do not exactly align with a metric in
+  [RFC2679], but illustrate the ADK process adequately.
+
Appendix B.  Anderson-Darling 2 sample C++ code
+
   /* Routines for computing the Anderson-Darling 2 sample
    * test statistic.
    *
    * Implemented based on the description in
    * "Anderson-Darling K Sample Test" Heckert, Alan and
    * Filliben, James, editors, Dataplot Reference Manual,
    * Chapter 15 Auxiliary, NIST, 2004.
    * Official Reference by 2010
    * Heckert, N. A. (2001). Dataplot website at the
    * National Institute of Standards and Technology:

@@ -1534,21 +1668,64 @@

       * n_total * (k - 1)) *
      (sum_adk_samp1 / n_sample1 + sum_adk_samp2 / n_sample2);

   /* if(adk_result <= adk_criterium)
    * adk_2_sample test is passed
    */

                                 Figure 4

-Appendix C.  Glossary
+Appendix C.  A tunneling set-up for remote metric implementation
+             testing
+
+  For parties interested in testing metric compliance, it is most
+  convenient if all involved parties can stay in their local test
+  laboratories.  Figure 5 shows a test configuration which may enable
+  remote metric compliance testing.
+
+   +----+ +----+                              +----+ +----+
+   |LC10| |LC11|         ,---.                |LC20| |LC21|
+   +----+ +----+        /     \     +-------+ +----+ +----+
+     | V10 | V11       /       \    | Tunnel|   | V20 | V21
+     |     |          (         )   | Head  |   |     |
+   +--------+  +------+|         |  | Router|__+----------+
+   |Ethernet|  |Tunnel||Internet |  +---B---+  |Ethernet  |
+   |Switch  |--|Head  ||         |             |Switch    |
+   +-+--+---+  |Router||         |  +---+---+  +--+--+----+
+     |__|      +--A---+(         )--|Option. |    |__|
+                        \       /   |Impair. |
+      Bridge             \     /    |Gener.  |   Bridge
+      V20 to V21          `-+-'     +--------+   V10 to V11
+
+                                 Figure 5
+
+  LC10 and the other LCxx identify measurement clients/line cards.
+  V10 and the others denote VLANs.  All VLANs use the same tunnel from
+  A to B and in the reverse direction.  The remote site VLANs are
+  U-bridged at the local site Ethernet switch.  The measurement
+  packets of site 1 travel tunnel A->B first, are U-bridged at site 2
+  and travel tunnel B->A second.  Measurement packets of site 2 travel
+  tunnel B->A first, are U-bridged at site 1 and travel tunnel A->B
+  second.  So all measurement packets pass the same tunnel segments,
+  but in a different segment order.  An experiment to prove or reject
+  the test set-up shown in Figure 5 has been agreed but not yet
+  scheduled between Deutsche Telekom and RIPE.
+
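   [Editor's illustration, not draft text]  The following toy model
   illustrates the claim that all measurement packets pass the same
   tunnel segments, only in a different order: with constant segment
   delays both directions accumulate the same delay, while time-varying
   delays introduce the residual difference already noted as a tradeoff
   in Section 3.  All delay values are invented for the example.

   // Toy model of the Appendix C set-up; numbers are made up.
   #include <iostream>

   // Delay of tunnel segment A->B and B->A at a given time (toy model).
   double delay_ab(double t) { return 20.0 + 0.1 * t; }
   double delay_ba(double t) { return 22.0 + 0.1 * t; }

   int main() {
       const double bridge = 0.5;  // assumed U-bridging delay (ms)

       // Site 1 probe: A->B first, then B->A after being bridged.
       double site1 = delay_ab(0.0) + bridge
                      + delay_ba(delay_ab(0.0) + bridge);
       // Site 2 probe: B->A first, then A->B after being bridged.
       double site2 = delay_ba(0.0) + bridge
                      + delay_ab(delay_ba(0.0) + bridge);

       std::cout << "site 1 delay over both segments: " << site1 << " ms\n"
                 << "site 2 delay over both segments: " << site2 << " ms\n"
                 << "difference caused by segment order: "
                 << site1 - site2 << " ms\n";
       return 0;
   }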
+  Figure 5 includes an optional impairment generator.  If this
+  impairment generator is inserted in the IP path between the tunnel
+  head-end routers, it equally impacts all measurement packets and
+  flows.  This avoids the difficulty of ensuring an identical test
+  set-up by configuring two separate impairment generators identically
+  (which was another proposal allowing remote metric compliance
+  testing).
+
+Appendix D.  Glossary

   +-------------+-----------------------------------------------------+
   | ADK         | Anderson-Darling K-Sample test, a test used to      |
   |             | check whether two samples have the same statistical |
   |             | distribution.                                       |
   | ECMP        | Equal Cost Multipath, a load balancing mechanism    |
   |             | evaluating MPLS label stacks, IP addresses and      |
   |             | ports.                                              |
   | EDF         | The "Empirical Distribution Function" of a set of   |
   |             | scalar measurements is a function F(x) which for    |