--- 1/draft-ietf-ippm-metrictest-02.txt 2011-06-29 21:16:04.000000000 +0200 +++ 2/draft-ietf-ippm-metrictest-03.txt 2011-06-29 21:16:04.000000000 +0200 @@ -1,23 +1,23 @@ Internet Engineering Task Force R. Geib, Ed. Internet-Draft Deutsche Telekom Intended status: Standards Track A. Morton -Expires: September 15, 2011 AT&T Labs +Expires: December 31, 2011 AT&T Labs R. Fardid Cariden Technologies A. Steinmitz - HS Fulda - March 14, 2011 + Deutsche Telekom + June 29, 2011 IPPM standard advancement testing - draft-ietf-ippm-metrictest-02 + draft-ietf-ippm-metrictest-03 Abstract This document specifies tests to determine if multiple independent instantiations of a performance metric RFC have implemented the specifications in the same way. This is the performance metric equivalent of interoperability, required to advance RFCs along the standards track. Results from different implementations of metric RFCs will be collected under the same underlying network conditions and compared using state of the art statistical methods. The goal is @@ -33,68 +33,65 @@ Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." - This Internet-Draft will expire on September 15, 2011. + This Internet-Draft will expire on December 31, 2011. Copyright Notice Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 - 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 6 - 2. Basic idea . . . . . . . . . . . . . . . . . . . . . . . . . . 6 + 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 7 + 2. Basic idea . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3. Verification of conformance to a metric specification . . . . 8 3.1. Tests of an individual implementation against a metric specification . . . . . . . . . . . . . . . . . . . . . . 9 3.2. Test setup resulting in identical live network testing conditions . . . . . . . . . . . . . . . . . . . . . . . . 11 3.3. Tests of two or more different implementations against - a metric specification . . . . . . . . . . . . . . . . . . 15 - 3.4. Clock synchronisation . . . . . . . . . . . . . . . . . . 16 - 3.5. Recommended Metric Verification Measurement Process . . . 17 - 3.6. Miscellaneous . . . . . . . . . . . . . . . . . . . . . . 20 - 3.7. Proposal to determine an "equivalence" threshold for + a metric specification . . . . . . . . . . . . . . . . . . 16 + 3.4. Clock synchronisation . . . . . . . . . . . . . . . . . . 17 + 3.5. 
Recommended Metric Verification Measurement Process . . . 18 + 3.6. Proposal to determine an "equivalence" threshold for each metric evaluated . . . . . . . . . . . . . . . . . . 21 4. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 22 5. Contributors . . . . . . . . . . . . . . . . . . . . . . . . . 22 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 22 - 7. Security Considerations . . . . . . . . . . . . . . . . . . . 22 + 7. Security Considerations . . . . . . . . . . . . . . . . . . . 23 8. References . . . . . . . . . . . . . . . . . . . . . . . . . . 23 8.1. Normative References . . . . . . . . . . . . . . . . . . . 23 8.2. Informative References . . . . . . . . . . . . . . . . . . 24 Appendix A. An example on a One-way Delay metric validation . . . 25 A.1. Compliance to Metric specification requirements . . . . . 25 - A.2. Examples related to statistical tests for One-way Delay . 26 - Appendix B. Anderson-Darling 2 sample C++ code . . . . . . . . . 28 - Appendix C. A tunneling set up for remote metric - implementation testing . . . . . . . . . . . . . . . 36 - Appendix D. Glossary . . . . . . . . . . . . . . . . . . . . . . 38 + A.2. Examples related to statistical tests for One-way Delay . 27 + Appendix B. Anderson-Darling 2 sample C++ code . . . . . . . . . 29 + Appendix C. Glossary . . . . . . . . . . . . . . . . . . . . . . 37 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 38 1. Introduction The Internet Standards Process RFC2026 [RFC2026] requires that for a IETF specification to advance beyond the Proposed Standard level, at least two genetically unrelated implementations must be shown to interoperate correctly with all features and options. This requirement can be met by supplying: @@ -189,20 +186,28 @@ The metric RFC advancement process begins with a request for protocol action accompanied by a memo that documents the supporting tests and results. The procedures of [RFC2026] are expanded in[RFC5657], including sample implementation and interoperability reports. Section 3 of [morton-advance-metrics-01] can serve as a template for a metric RFC report which accompanies the protocol action request to the Area Director, including description of the test set-up, procedures, results for each implementation and conclusions. + Changes from WG-02 to WG-03: + + o Changes stemming from experiments that implemented this plan, in + general. + + o Adoption of the VLAN loopback figure in the main body of the memo + (section 3.2). + Changes from WG-01 to WG-02: o Clarification of the number of test streams recommended in section 3.2. o Clarifications on testing details in sections 3.3 and 3.4. o Spelling corrections throughout. Changes from WG -00 to WG -01 draft @@ -320,36 +325,35 @@ requires careful test design: o The measurement test setup must be self-consistent to the largest possible extent. To minimize the influence of the test and measurement setup on the result, network conditions and paths MUST be identical for the compared implementations to the largest possible degree. This includes both the stability and non- ambiguity of routes taken by the measurement packets. See RFC 2330 for a discussion on self-consistency. + o To minimize the influence of implementation options on the result, + metric implementations SHOULD use identical options and parameters + for the metric under evaluation. + o The error induced by the sample size must be small enough to minimize its influence on the test result. 
This deserves particular attention if two implementations measure with
   different average probing rates.

-   o Every comparison must be repeated several times based on different
-     measurement data to avoid random indications of compatibility (or
-     the lack of it).
-
-   o To minimize the influence of implementation options on the result,
-     metric implementations SHOULD use identical options and parameters
-     for the metric under evaluation.
-
   o The implementation with the lowest probing frequency determines the
     smallest temporal interval for which samples can be compared.

+   o Repeat comparisons with several independent metric samples to
+     avoid random indications of compatibility (or the lack of it).
+
   The metric specifications themselves are the primary focus of
   evaluation, rather than the implementations of metrics.

   The documentation produced by the advancement process should identify
   which metric definitions and supporting material were found to be
   clearly worded and unambiguous, OR, it should identify ways in which
   the metric specification text should be revised to achieve clarity
   and unified interpretation.

   The process should also permit identification of options that were
   not implemented, so that they can be removed from the advancing
@@ -390,37 +394,28 @@

   o Use remotely separated test labs to compare the implementations and
     measure across the Internet and include a single impairment
     generator to impact all measurement flows in a non-discriminatory
     way.

   The first two approaches work, but cause higher expenses than the
   other ones (due to travel and/or shipping+installation).  For the
   third option, ensuring two identically configured impairment
   generators requires well-defined test cases and possibly identical
-   hard- and software. >>>Comment: for some specific tests, impairment
-   generator accuracy requirements are less-demanding than others, and
-   in such cases there is more flexibility in impairment generator
-   configuration. <<<
-
-   It is a fair question, whether the last two options can result in any
-   applicable test set up at all. While an experimental approach is
-   given in Appendix C, the trade off that measurement packets of
-   different sites pass the path segments but always in a different
-   order of segments probably can't be avoided.
+   hard- and software.

-   The question of which option above results in identical networking
-   conditions and is broadly accepted can't be answered without more
-   practical experience in comparing implementations. The last proposal
-   has the advantage that, while the measurement equipment is remotely
-   distributed, a single network impairment generator and the Internet
-   can be used in combination to impact all measurement flows.
+   As documented in a test report [morton-testplan-rfc2679], the last
+   option was required to prove compatibility of two delay metric
+   implementations.  An impairment generator is probably required when
+   testing compatibility of most other metrics, and it is therefore
+   RECOMMENDED to include an impairment generator in metric test
+   set-ups.

3.1.  Tests of an individual implementation against a metric
      specification

   A metric implementation MUST support the requirements classified as
   "MUST" and "REQUIRED" of the related metric specification to be
   compliant with the latter.

   Further, supported options of a metric implementation SHOULD be
   documented in sufficient detail.  The documentation of chosen options
@@ -455,34 +450,44 @@
   implementations.
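   A purely illustrative sketch of such documentation, in the C++ used
   by Appendix B, is given below.  All field names are assumptions of
   this sketch; the normative list of conditions to document is given in
   the surrounding text.

      #include <string>

      /* Illustrative only: one possible record of the documented options
       * and test conditions of a single metric implementation under test.
       * Field names are examples, not requirements of any metric RFC. */
      struct MetricTestConditions {
          std::string metric;              /* e.g. Type-P-One-way-Delay of RFC 2679   */
          std::string typeP;               /* packet type, size and DSCP used          */
          double      timestampResolution; /* smallest reported time unit, in seconds  */
          double      probingRate;         /* singletons per second                    */
          double      lossThreshold;       /* delay after which a packet counts as lost */
          std::string impairmentSettings;  /* impairment generator parameters, if any  */
          std::string furtherOptions;      /* any further implementation options       */
      };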
A single IPPM conformant implementation MUST, under otherwise
   identical network conditions, produce precise results for repeated
   measurements of the same metric.  RFC 2330 prefers the "empirical
   distribution function" EDF to describe collections of measurements.
   RFC 2330 states that "unless otherwise stated, IPPM goodness-of-fit
   tests are done using 5% significance."  The goodness of fit test
   determines the precision with which two or more samples of a metric
   implementation can be said to belong to the same underlying
   distribution (of measured network performance
-   events). The goodness of fit test to be applied is the Anderson-
-   Darling K sample test (ADK sample test, K stands for the number of
-   samples to be compared) [ADK]. Please note that RFC 2330 and RFC
-   2679 apply an Anderson Darling goodness of fit test too.
+   events).  The goodness of fit test suggested for the metric test is
+   the Anderson-Darling K sample test (ADK sample test, K stands for the
+   number of samples to be compared) [ADK].  Please note that RFC 2330
+   and RFC 2679 apply an Anderson-Darling goodness-of-fit test too.

   The results of a repeated test with a single implementation MUST pass
-   an ADK sample test with confidence level of 95%. The resolution for
+   an ADK sample test with a confidence level of 95%.  The conditions for
   which the ADK test has been passed with the specified confidence
   level MUST be documented.  To formulate this differently: The
-   requirement is to document the smallest resolution, at which the
-   results of the tested metric implementation pass an ADK test with a
-   confidence level of 95%. The minimum resolution available in the
-   reported results from each implementation MUST be taken into account
-   in the ADK test.
+   requirement is to document the set of parameters with the smallest
+   deviation at which the results of the tested metric implementation
+   pass an ADK test with a confidence level of 95%.  The minimum
+   resolution available in the reported results from each implementation
+   MUST be taken into account in the ADK test.
+
+   The test conditions which MUST be documented for a passed metric test
+   include:
+
+   o  The metric resolution at which a test was passed (e.g., the
+      resolution of timestamps).
+
+   o  The parameters modified by an impairment generator.
+
+   o  The impairment generator parameter settings.

3.2.  Test setup resulting in identical live network testing conditions

   Two major issues complicate tests for metric compliance across live
   networks under identical testing conditions.  One is the general
   point that metric definition implementations cannot be conveniently
   examined in field measurement scenarios.  The other one is more
   broadly described as "parallelism in devices and networks", including
   mechanisms like those that achieve load balancing (see [RFC4928]).
@@ -521,87 +526,117 @@

   o  A low operational overhead may enable a broader audience to set up
      a metric test with the desired properties.

   o  The tunneling protocol should be reliable and stable in set up and
      operation to avoid disturbances or influence on the test results.

   o  The tunneling protocol should not incur any extra cost for those
      interested in setting up a metric test.

-   An illustration of a test setup with two tunnels and two flows
-   between two linecards of one implementation is given in Figure 1.
+   An illustration of a test setup with two layer 2 tunnels and two
+   flows between two linecards of one implementation is given in
+   Figure 1.

   Implementation ,---.
+--------+ +~~~~~~~~~~~/ \~~~~~~| Remote | +------->-----F2->-| / \ |->---+ | | +---------+ | Tunnel 1( ) | | | | | transmit|-F1->-| ( ) |->+ | | | | LC1 | +~~~~~~~~~| |~~~~| | | | | | receive |-<--+ ( ) | F1 F2 | | +---------+ | |Internet | | | | | *-------<-----+ F2 | | | | | | +---------+ | | +~~~~~~~~~| |~~~~| | | | | transmit|-* *-| | | |--+<-* | | LC2 | | Tunnel 2( ) | | | | receive |-<-F1-| \ / |<-* | +---------+ +~~~~~~~~~~~\ /~~~~~~| Router | `-+-' +--------+ - Illustration of a test setup with two tunnels. For simplicity, only - two linecards of one implementation and two flows F between them are - shown. + Illustration of a test setup with two layer 2 tunnels. For + simplicity, only two linecards of one implementation and two flows F + between them are shown. Figure 1 - Figure 2 shows the network elements required to set up GRE tunnels or - as shown by figure 1. + Figure 2 shows the network elements required to set up layer 2 + tunnels as shown by figure 1. Implementation +-----+ ,---. | LC1 | / \ +-----+ / \ +------+ | +-------+ ( ) +-------+ |Remote| +--------+ | | | | | | | | |Ethernet| | Tunnel| |Internet | | Tunnel| | | |Switch |--| Head |--| |--| Head |--| | +--------+ | Router| | | | Router| | | | | | ( ) | | |Router| +-----+ +-------+ \ / +-------+ +------+ | LC2 | \ / +-----+ `-+-' Illustration of a hardware setup to realise the test setup - illustrated by figure 1 with GRE tunnels or Pseudowires. + illustrated by figure 1 with layer 2 tunnels or Pseudowires. Figure 2 + The test set up successfully used during a delay metric test + [morton-testplan-rfc2679] is given as an example in figure 3. Note + that the shown set up allows a metric test between two remote sites. + + +----+ +----+ +----+ +----+ + |LC10| |LC11| ,---. |LC20| |LC21| + +----+ +----+ / \ +-------+ +----+ +----+ + | V10 | V11 / \ | Tunnel| | V20 | V21 + | | ( ) | Head | | | + +--------+ +------+ | | | Router|__+----------+ + |Ethernet| |Tunnel| |Internet | +---B---+ |Ethernet | + |Switch |--|Head |-| | | |Switch | + +-+--+---+ |Router| | | +---+---+ +--+--+----+ + |__| +--A---+ ( )--|Option.| |__| + \ / |Impair.| + Bridge \ / |Gener. | Bridge + V20 to V21 `-+-? +-------+ V10 to V11 + + Figure 3 + + In figure 3, LC10 identify measurement clients /line cards. V10 and + the others denote VLANs. All VLANs are using the same tunnel from A + to B and in the reverse direction. The remote site VLANs are + U-bridged at the local site Ethernet switch. The measurement packets + of site 1 travel tunnel A->B first, are U-bridged at site 2 and + travel tunnel B->A second. Measurement packets of site 2 travel + tunnel B->A first, are U-bridged at site 1 and travel tunnel A->B + second. So all measurement packets pass the same tunnel segments, + but in different segment order. + If tunneling is applied, two tunnels MUST carry all test traffic in between the test site and the remote site. For example, if 802.1Q Ethernet Virtual LANs (VLAN) are applied and the measurement streams are carried in different VLANs, the IP tunnel or Pseudo Wires respectively MUST be set up in physical port mode to avoid set up of Pseudo Wires per VLAN (which may see different paths due to ECMP routing), see RFC 4448. The remote router and the Ethernet switch - shown in figure 2 must support 802.1Q in this set up. + shown in figure 3 has to support 802.1Q in this set up. The IP packet size of the metric implementation SHOULD be chosen small enough to avoid fragmentation due to the added Ethernet and tunnel headers. 
Otherwise, the impact of tunnel overhead on fragmentation and
   interface MTU size MUST be understood and taken into account (see
   [RFC4459]).

   An Ethernet port mode IP tunnel carrying several 802.1Q VLANs each
-   containing measurement traffic of a single measurement system was set
-   up as a proof of concept using RFC4719 [RFC4719], Transport of
-   Ethernet Frames over L2TPv3. Ethernet over L2TPv3 seems to fulfill
-   most of the desired tunneling protocol criteria mentioned above.
+   containing measurement traffic of a single measurement system was
+   successfully applied when testing compatibility of two metric
+   implementations [morton-testplan-rfc2679].

   The following headers may have to be accounted for when calculating
   total packet length, if VLANs and Ethernet over L2TPv3 tunnels are
   applied:

   o  Ethernet 802.1Q: 22 Byte.

   o  L2TPv3 Header: 4-16 Byte for L2TPv3 data messages over IP; 16-28
      Byte for L2TPv3 data messages over UDP.
@@ -619,93 +654,110 @@

   L2TP is a commodity tunneling protocol [RFC2661].  At the time of
   writing, L2TPv3 [RFC3931] is the latest version of L2TP.  If L2TPv3
   is applied, software-based implementations of this protocol are not
   suitable for the test set up, as such implementations may cause
   incalculable delay shifts.

   Ethernet Pseudo Wires may also be set up on MPLS networks [RFC4448].
   While there is no technical issue with this solution, MPLS interfaces
   are mostly found in the network provider domain.  Hence not all of
-   the above tunneling criteria are met.
+   the above criteria to select a tunneling protocol are met.

-   Appendix C provides an experimental tunneling set up for metric
-   implementation testing between two (or more) remote sites.
+   Note that setting up a metric test environment is not a plug-and-play
+   exercise.  Skilled networking engineers should be consulted and
+   involved if a set-up between remote sites is preferred.

-   Each test SHOULD be conducted multiple times. Sequential testing is
-   possible, but may not be a useful metric test option because WAN
-   conditions are likely to change over time. It is RECOMMENDED that
-   tests be carried out by establishing at least 2 different parallel
-   measurement flows. Two linecards per implementation that send and
-   receive measurement flows should be sufficient to create 4 parallel
-   measurement flows (when each card sends and receives 2 flows). Other
+   Passing or failing an ADK test with 2 samples could be a random
+   result (note that [RFC2330] defines a sample as a set of singleton
+   metric values produced by a measurement stream, and we continue to
+   use this terminology here).  The error margin of a statistical test
+   is higher if the number of samples it is based on is low (the number
+   of samples taken influences the so-called "degrees of freedom" of a
+   statistical test, and a higher degree of freedom produces more
+   reliable results).  To pass ADK with higher probability, the number
+   of samples collected per implementation under identical networking
+   conditions SHOULD be greater than 2.  Hardware and load constraints
+   may enforce an upper limit on the number of simultaneous measurement
+   streams.  The ADK test allows one to combine different samples (see
+   section 9 of [ADK]) and then to run a two-sample test between the
+   combined samples (a sketch of this combination step is given below).
+   Capturing at least 4 samples per implementation under identical
+   networking conditions is RECOMMENDED when comparing different metric
+   implementations by a statistical test.
+
+   It is RECOMMENDED that tests be carried out by establishing N
+   different parallel measurement flows.
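   A minimal C++ sketch of the sample-combination step referred to above
   is given below; the function and variable names are illustrative, and
   the combined samples of two implementations would subsequently be fed
   to a two-sample Anderson-Darling routine such as the one listed in
   Appendix B.

      #include <algorithm>
      #include <cstddef>
      #include <vector>

      /* Pool the samples collected by one implementation under identical
       * networking conditions into a single combined sample, as permitted
       * by section 9 of [ADK].  The combined samples of two implementations
       * can then be compared with a two-sample Anderson-Darling test. */
      std::vector<double> combineSamples(
          const std::vector< std::vector<double> >& samples)
      {
          std::vector<double> combined;
          for (std::size_t i = 0; i < samples.size(); ++i)
              combined.insert(combined.end(),
                              samples[i].begin(), samples[i].end());
          /* Sorting is not strictly required here, but keeps the combined
           * sample ready for rank-based processing. */
          std::sort(combined.begin(), combined.end());
          return combined;
      }

   For example, the 4 delay samples recommended above for one
   implementation would be passed to this function, and the two
   resulting combined samples (one per implementation) would take the
   place of the two input samples of the Appendix B code.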
+   Two or three linecards per
+   implementation serving to send or receive measurement flows should be
+   sufficient to create 4 or more parallel measurement flows.  Other
   options are to separate flows by DiffServ marks (without deploying
   any QoS in the inner or outer tunnel) or using a single CBR flow and
   evaluating every n-th singleton to belong to a specific measurement
-   flow.
+   flow.  Note that a practical test indeed showed that ADK was passed
+   with 4 samples even when a 2 sample test failed
+   [morton-testplan-rfc2679].

-   Some additional rules to calculate and compare samples have to be
-   respected to perform a metric test:
+   Some additional guidelines on calculating and comparing samples for a
+   metric test are:

   o  Comparing different probes of a common underlying distribution in
      terms of metrics characterising a communication network requires
      respecting the temporal nature for which the assumption of a
      common underlying distribution may hold.  Any singletons or
      samples to be compared MUST be captured within the same time
      interval.

-   o Whenever statistical events like singletons or rates are used to
-     characterise measured metrics of a time-interval, at least 5
-     singletons of a relevant metric SHOULD be present to ensure a
-     minimum confidence into the reported value (see Wikipedia on
-     confidence [Rule of thumb]). Note that this criterion also is to
-     be respected e.g. when comparing packet loss metrics. Any packet
-     loss measurement interval to be compared with the results of
-     another implementation SHOULD contain at least five lost packets
-     to have a minimum confidence that the observed loss rate wasn't
-     caused by a small number of random packet drops.
+   o If statistical events like rates are used to characterise measured
+     metrics of a time-interval, it is RECOMMENDED to pick a minimum of
+     5 singletons of a relevant metric to ensure a minimum confidence in
+     the reported value.  The error margin of the determined rate
+     depends on the number of singletons (refer to statistical textbooks
+     on Student's t-test).  As an example, any packet loss measurement
+     interval to be compared with the results of another implementation
+     should contain at least five lost packets to give some confidence
+     that the observed loss rate wasn't caused by a small number of
+     random packet drops.

   o  The minimum number of singletons or samples to be compared by an
      Anderson-Darling test SHOULD be 100 per tested metric
      implementation.  Note that the Anderson-Darling test detects small
      differences in distributions fairly well and will fail for a high
      number of compared results (RFC2330 mentions an example with 8192
      measurements where an Anderson-Darling test always failed).

   o  Generally, the Anderson-Darling test is sensitive to differences
      in the accuracy or bias associated with varying implementations or
      test conditions.  These dissimilarities may result in differing
      averages of samples to be compared.  An example may be different
      packet sizes, resulting in a constant delay difference between
-     compared samples. Therefore samples to be compared by an
-     Anderson-Darling test MAY be calibrated by the difference of the
-     average values of the samples. Any calibration of this kind MUST
-     be documented in the test result.
+     compared samples.  Therefore, samples to be compared by an
+     Anderson-Darling test MAY be calibrated by the difference of the
+     average values of the samples.  Any calibration of this kind MUST
+     be documented in the test result.

3.3.
Tests of two or more different implementations against a metric
      specification

   RFC2330 expects "a methodology for a given metric [to] exhibit
   continuity if, for small variations in conditions, it results in
   small variations in the resulting measurements.  Slightly more
   precisely, for every positive epsilon, there exists a positive delta,
   such that if two sets of conditions are within delta of each other,
   then the resulting measurements will be within epsilon of each
   other."

   A small variation in conditions in the context of the metric test
   proposed here can be seen as different implementations measuring the
   same metric along the same path.

   IPPM metric specifications, however, allow for implementor options to
   the largest possible degree.  It cannot be expected that two
-   implementors pick identical value ranges in options for the
-   implementations. Implementors SHOULD to the highest degree possible
-   pick the same configurations for their systems when comparing their
-   implementations by a metric test.
+   implementors allow 100% identical options in their implementations.
+   Testers SHOULD, to the highest degree possible, pick the same
+   configurations for their systems when comparing their implementations
+   by a metric test.

   In some cases, a goodness of fit test may not be possible or show
   disappointing results.  To clarify the difficulties arising from
   different implementation options, the individual options picked for
   every compared implementation SHOULD be documented in sufficient
   detail.  Based on this documentation, the underlying metric
   specification should be improved before it is promoted to a standard.

   The same statistical test as applicable to quantify precision of a
   single metric implementation MUST be used to compare metric result
@@ -749,50 +801,48 @@

   Examination of the second condition requires RTT measurement for
   reference, e.g., based on TWAMP (RFC 5357 [RFC5357]), in conjunction
   with one-way delay measurement.

   Specification of X% to strike a balance between identification of
   unreliable one-way delay samples and misidentification of reliable
   samples under a wide range of Internet path RTTs probably requires
   further study.

-   An implementation of an RFC that requires synchronized clocks is
-   expected to provide precise measurement results in order to claim
-   that the metric measured is compliant.
+   An IPPM-compliant metric implementation of an RFC that requires
+   synchronized clocks is expected to provide precise measurement
+   results.

   IF an implementation publishes a specification of its precision, such
   as "a precision of 1 ms (+/- 500 us) with a confidence of 95%", then
   the specification SHOULD be met over a useful measurement duration.
   For example, if the metric is measured along an Internet path which
   is stable and not congested, then the precision specification SHOULD
   be met over durations of an hour or more.

3.5.  Recommended Metric Verification Measurement Process

   In order to meet their obligations under the IETF Standards Process,
   the IESG must be convinced that each metric specification advanced to
   Draft Standard or Internet Standard status is clearly written, that
-   there are the a sufficient number of verified equivalent
-   implementations, and that all options have been implemented.
+   there are a sufficient number of verified equivalent implementations,
+   and that options that have been implemented are documented.

   In the context of this document, metrics are designed to measure some
   characteristic of a data network.
An aim of any metric definition should be that it should be specified in a way that can reliably measure the specific characteristic in a repeatable way across multiple independent implementations. Each metric, statistic or option of those to be validated MUST be compared against a reference measurement or another implementation by - at least 5 different basic data sets, each one with sufficient size - to reach the specified level of confidence, as specified by this - document. + as specified by this document. Finally, the metric definitions, embodied in the text of the RFCs, are the objects that require evaluation and possible revision in order to advance to the next step on the standards track. IF two (or more) implementations do not measure an equivalent metric as specified by this document, AND sources of measurement error do not adequately explain the lack of agreement, @@ -808,21 +858,21 @@ consensus and (possible) advancement along the standards track. Finally, all the findings MUST be documented in a report that can support advancement on the standards track, similar to those described in [RFC5657]. The list of measurement devices used in testing satisfies the implementation requirement, while the test results provide information on the quality of each specification in the metric RFC (the surrogate for feature interoperability). The complete process of advancing a metric specification to a - standard as defined by this document is illustrated in Figure 3. + standard as defined by this document is illustrated in Figure 4. ,---. / \ ( Start ) \ / Implementations `-+-' +-------+ | /| 1 `. +---+----+ / +-------+ `.-----------+ ,-------. | RFC | / |Check for | ,' was RFC `. YES | | / |Equivalence.... clause x ------+ @@ -830,21 +880,21 @@ | Metric \.....| 2 ....relevant | `---+---' +----+-----+ | Metric |\ +-------+ |identical | No | |Report | | Metric | \ |network | +--+----+ |results + | | ... | \ |conditions | |Modify | |Advance | | | \ +-------+ | | |Spec +--+RFC | +--------+ \| n |.'+-----------+ +-------+ |request(?)| +-------+ +----------+ Illustration of the metric standardisation process - Figure 3 + Figure 4 Any recommendation for the advancement of a metric specification MUST be accompanied by an implementation report, as is the case with all requests for the advancement of IETF specifications. The implementation report needs to include the tests performed, the applied test setup, the specific metrics in the RFC and reports of the tests performed with two or more implementations. The test plan needs to specify the precision reached for each measured metric and thus define the meaning of "statistically equivalent" for the specific metrics being tested. @@ -914,67 +964,21 @@ by using a layer 2 tunnel. o Different IP options. o Different DSCP. o If the N measurements are captured using sequential measurements instead of simultaneous ones, then the following factors come into play: Time varying paths and load conditions. -3.6. Miscellaneous - - A minimum amount of singletons per metric is required if results are - to be compared. To avoid accidental singletons from impacting a - metric comparison, a minimum number of 5 singletons per compared - interval was proposed above. Commercial Internet service is not - operated to reliably create enough rare events of singletons to - characterize bad measurement engineering or bad implementations. In - the case that a metric validation requires capturing rare events, an - impairment generator may have to be added to the test set up. 
- Inclusion of an impairment generator and the parameterisation of the - impairments generated MUST be documented. - - A metric characterising a common impairment condition would be one, - which by expectation creates a singleton result for each measured - packet. Delay or Delay Variation are examples of this type, and in - such cases, the Internet may be used to compare metric - implementations. - - Rare events are those, where by expectation no or a rather low number - of "event is present" singletons are captured during a measurement - interval. Packet duplications, packet loss rates above one digit - percentages, loss patterns and packet reordering are examples. Note - especially that a packet reordering or loss pattern metric - implementation comparison may require a more sophisticated test set - up than described here. Spatial and temporal effects combine in the - case of packet re-ordering and measurements with different packet - rates may always lead to different results. - - As specified above, 5 singletons are the recommended basis to - minimise interference of random events with the statistical test - proposed by this document. In the case of ratio measurements (like - packet loss), the underlying sum of basic events, against the which - the metric's monitored singletons are "rated", determines the - resolution of the test. A packet loss statistic with a resolution of - 1% requires one packet loss statistic-data point to consist of 500 - delay singletons (of which at least 5 were lost). To compare EDFs on - packet loss requires one hundred such statistics per flow. That - means, all in all at least 50 000 delay singletons are required per - single measurement flow. Live network packet loss is assumed to be - present during main traffic hours only. Let this interval be 5 - hours. The required minimum rate of a single measurement flow in - that case is 2.8 packets/sec (assuming a loss of 1% during 5 hours). - If this measurement is too demanding under live network conditions, - an impairment generator should be used. - -3.7. Proposal to determine an "equivalence" threshold for each metric +3.6. Proposal to determine an "equivalence" threshold for each metric evaluated This section describes a proposal for maximum error of "equivalence", based on performance comparison of identical implementations. This comparison may be useful for both ADK and non-ADK comparisons. Each metric tested by two or more implementations (cross- implementation testing). Each metric is also tested twice simultaneously by the *same* @@ -1003,24 +1007,26 @@ and only the systematic error need be decided beforehand. In the case of ADK comparison, the largest same-implementation resolution of distribution equivalence can be used as a limit on cross-implementation resolutions (at the same confidence level). 4. Acknowledgements Gerhard Hasslinger commented a first version of this document, suggested statistical tests and the evaluation of time series - information. Henk Uijterwaal and Lars Eggert have encouraged and - helped to orgainize this work. Mike Hamilton, Scott Bradner, David - Mcdysan and Emile Stephan commented on this draft. Carol Davids - reviewed the 01 version of the ID before it was promoted to WG draft. + information. Matthias Wieser's thesis on a metric test resulted in + new input for this draft. Henk Uijterwaal and Lars Eggert have + encouraged and helped to orgainize this work. Mike Hamilton, Scott + Bradner, David Mcdysan and Emile Stephan commented on this draft. 
+ Carol Davids reviewed the 01 version of the ID before it was promoted + to WG draft. 5. Contributors Scott Bradner, Vern Paxson and Allison Mankin drafted bradner- metrictest [bradner-metrictest], and major parts of it are included in this document. 6. IANA Considerations This memo includes no request to IANA. @@ -1118,20 +1124,26 @@ [morton-advance-metrics] Morton, A., "Problems and Possible Solutions for Advancing Metrics on the Standards Track", draft -morton-ippm- advance-metrics-00, (work in progress), July 2009. [morton-advance-metrics-01] Morton, A., "Lab Test Results for Advancing Metrics on the Standards Track", draft -morton-ippm-advance-metrics-01, (work in progress), June 2010. + [morton-testplan-rfc2679] + Ciavattone, L., Geib, R., Morton, A., and M. Wieser, "Test + Plan and Results for Advancing RFC 2679 on the Standards + Track", draft -morton-ippm-testplan-rfc2679-01, (work in + progress), June 2011. + Appendix A. An example on a One-way Delay metric validation The text of this appendix is not binding. It is an example how parts of a One-way Delay metric test could look like. http://xml.resource.org/public/rfc/bibxml/ A.1. Compliance to Metric specification requirements One-way Delay, Loss threshold, RFC 2679 @@ -1681,66 +1694,23 @@ adk_result = (double) (n_total - 1) / (n_total * n_total * (k - 1)) * (sum_adk_samp1 / n_sample1 + sum_adk_samp2 / n_sample2); /* if(adk_result <= adk_criterium) * adk_2_sample test is passed */ - Figure 4 - -Appendix C. A tunneling set up for remote metric implementation testing - - Parties interested in testing metric compliance is most convenient if - all involved parties can stay in their local test laboratories. - Figure 4 shows a test configuration which may enable remote metric - compliance testing. - - +----+ +----+ +----+ +----+ - |LC10| |LC11| ,---. |LC20| |LC21| - +----+ +----+ / \ +-------+ +----+ +----+ - | V10 | V11 / \ | Tunnel| | V20 | V21 - | | ( ) | Head | | | - +--------+ +------+ | | | Router|__+----------+ - |Ethernet| |Tunnel| |Internet | +---B---+ |Ethernet | - |Switch |--|Head |-| | | |Switch | - +-+--+---+ |Router| | | +---+---+ +--+--+----+ - |__| +--A---+ ( )--|Option.| |__| - \ / |Impair.| - Bridge \ / |Gener. | Bridge - V20 to V21 `-+-? +-------+ V10 to V11 - Figure 5 - LC10 identify measurement clients /line cards. V10 and the others - denote VLANs. All VLANs are using the same tunnel from A to B and in - the reverse direction. The remote site VLANs are U-bridged at the - local site Ethernet switch. The measurement packets of site 1 travel - tunnel A->B first, are U-bridged at site 2 and travel tunnel B->A - second. Measurement packets of site 2 travel tunnel B->A first, are - U-bridged at site 1 and travel tunnel A->B second. So all - measurement packets pass the same tunnel segments, but in different - segment order. An experiment to prove or reject the above test set - up shown in figure 4 has been agreed but not yet scheduled between - Deutsche Telekom and RIPE. - - Figure 4 includes an optional impairment generator. If this - impairment generator is inserted in the IP path between the tunnel - head end routers, it equally impacts all measurement packets and - flows. Thus trouble with ensuring identical test set up by - configuring two separated impairment generators identically is - avoided (which was another proposal allowing remote metric compliance - testing). - -Appendix D. Glossary +Appendix C. 
Glossary +-------------+-----------------------------------------------------+ | ADK | Anderson-Darling K-Sample test, a test used to | | | check whether two samples have the same statistical | | | distribution. | | ECMP | Equal Cost Multipath, a load balancing mechanism | | | evaluating MPLS labels stacks, IP addresses and | | | ports. | | EDF | The "Empirical Distribution Function" of a set of | | | scalar measurements is a function F(x) which for | @@ -1773,21 +1743,21 @@ Table 2 Authors' Addresses Ruediger Geib (editor) Deutsche Telekom Heinrich Hertz Str. 3-7 Darmstadt, 64295 Germany - Phone: +49 6151 628 2747 + Phone: +49 6151 58 12747 Email: Ruediger.Geib@telekom.de Al Morton AT&T Labs 200 Laurel Avenue South Middletown, NJ 07748 USA Phone: +1 732 420 1571 Fax: +1 732 368 1192 @@ -1797,17 +1766,17 @@ Reza Fardid Cariden Technologies 888 Villa Street, Suite 500 Mountain View, CA 94041 USA Phone: Email: rfardid@cariden.com Alexander Steinmitz - HS Fulda - Marquardstr. 35 - Fulda, 36039 + Deutsche Telekom + Memmelsdorfer Str. 209b + Bamberg, 96052 Germany Phone: - Email: steinionline@gmx.de + Email: Alexander.Steinmitz@telekom.de