Network Working Group                                        Matt Mathis
INTERNET-DRAFT                          Pittsburgh Supercomputing Center
Expiration Date: Jan 1998                                      July 1997

                    Empirical Bulk Transfer Capacity

               < draft-ietf-bmwg-ippm-treno-btc-01.txt >

Status of this Document

This document is an Internet-Draft.  Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas, and
its working groups.  Note that other groups may also distribute working
documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

To learn the current status of any Internet-Draft, please check the
"1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
ftp.isi.edu (US West Coast).

Abstract:

Bulk Transport Capacity (BTC) is a measure of a network's ability to
transfer significant quantities of data with a single congestion-aware
transport connection (e.g. state-of-the-art TCP).  For many
applications the BTC of the underlying network dominates the overall
elapsed time for the application, and thus dominates the performance as
perceived by a user.

The BTC is a property of an IP cloud (links, routers, switches, etc.)
between a pair of hosts.  It does not include the hosts themselves (or
their transport-layer software).  However, congestion control is
crucial to the BTC metric because the Internet depends on the end
systems to fairly divide the available bandwidth on the basis of common
congestion behavior.
The BTC metric is based on the performance of a reference congestion
control algorithm that has particularly uniform and stable behavior.

This Internet-Draft is likely to become one section of some future,
larger document covering several different metrics.

Introduction:

Bulk Transport Capacity (BTC) is a measure of a network's ability to
transfer significant quantities of data with a single congestion-aware
transport connection (e.g. state-of-the-art TCP).  For many
applications the BTC of the underlying network dominates the overall
elapsed time for the application, and thus dominates the performance as
perceived by a user.  Examples of such applications include FTP and
other network copy utilities.

The BTC is a property of an IP cloud (links, routers, switches, etc.)
between a pair of hosts.  It does not include the hosts themselves (or
their transport-layer software).  However, congestion control is
crucial to the BTC metric because the Internet depends on the end
systems to fairly divide the available bandwidth on the basis of common
congestion behavior.

Four standard congestion control algorithms are described in RFC2001:
Slow-start, Congestion Avoidance, Fast Retransmit and Fast Recovery.
Of these algorithms, Congestion Avoidance drives the steady-state bulk
transfer behavior of TCP.  It calls for opening the congestion window
by 1 segment size on each round trip time, and closing it by 1/2 on
congestion, as signaled by lost segments.  Slow-start is part of TCP's
transient behavior.  It is used to quickly bring new or recently timed
out connections up to an appropriate congestion window.
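As an illustration of the window dynamics just described (a sketch
only, not part of the metric definition; cwnd is in segments):

```python
def congestion_avoidance_update(cwnd, congestion_signal):
    """One round trip of the Congestion Avoidance algorithm.

    Illustrative sketch of the RFC2001 behavior described above:
    without a congestion signal the window opens by one segment per
    round trip time; a congestion signal (a lost segment) closes the
    window to half its value.
    """
    if congestion_signal:
        return max(cwnd // 2, 1)  # multiplicative decrease, never below 1 segment
    return cwnd + 1               # additive increase: 1 segment per RTT
```

Iterating this update produces the familiar sawtooth whose long-run
average the BTC metric reports.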
In Reno TCP, Fast Retransmit and Fast Recovery are used to support the
Congestion Avoidance algorithm during recovery from lost segments.
During the recovery interval the data receiver sends duplicate
acknowledgements, which the data sender must use to identify missing
segments as well as to estimate the quantity of outstanding data in the
network.  The research community has observed unpredictable or unstable
TCP performance caused by errors and uncertainties in the estimation of
outstanding data [Lakshman94, Floyd95, Hoe95].  Simulations of
reference TCP implementations have uncovered situations where
incidental changes in other parts of the network have a large effect on
performance [Mathis96].  Other simulations have shown that under some
conditions, slightly better networks (higher bandwidth or lower delay)
yield lower throughput [This is easy to construct, but has it been
published?].  As a consequence, even reference TCP implementations do
not make good metrics.  Furthermore, many TCP implementations in common
use in the Internet today have outright bugs which can have arbitrary
and unpredictable effects on performance [Comer94, Brakmo95, Paxson97a,
Paxson97b].
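For illustration, the duplicate-acknowledgement trigger described
above can be sketched as follows.  The three-duplicate-ACK threshold
is the conventional Fast Retransmit trigger from RFC2001; the function
is a simplification for exposition, not a complete Reno implementation:

```python
DUPACK_THRESHOLD = 3  # conventional Fast Retransmit trigger (RFC2001)

def should_fast_retransmit(ack_history):
    """Return True when the tail of ack_history (cumulative ACK
    sequence numbers, oldest first) ends with three acknowledgements
    that duplicate the one before them, signaling a missing segment."""
    if len(ack_history) < DUPACK_THRESHOLD + 1:
        return False
    tail = ack_history[-(DUPACK_THRESHOLD + 1):]
    return all(a == tail[0] for a in tail)
```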
The difficulties with using TCP implementations for measurement can be
overcome by using the Congestion Avoidance algorithm by itself, in
isolation from the other algorithms.  In [Mathis97] it is shown that
the performance of the Congestion Avoidance algorithm can be predicted
by a simple analytical model.  The model was derived in [Ott96a,
Ott96b].  The model predicts the performance of the Congestion
Avoidance algorithm as a function of the round trip time, the TCP
segment size and the probability of receiving a congestion signal (i.e.
packet loss).  The paper shows that the model accurately predicts the
performance of TCP using the SACK option [RFC2018] under a wide range
of conditions.  If losses are isolated (no more than one per round
trip) then Fast Recovery successfully estimates the actual congestion
window during recovery, and Reno TCP also fits the model.

This version of the BTC metric is based on the TReno ("tree-no")
diagnostic, which implements a protocol-independent version of the
Congestion Avoidance algorithm.  TReno's internal protocol is designed
to accurately implement the Congestion Avoidance algorithm under a very
wide range of conditions, and to diagnose timeouts when they interrupt
Congestion Avoidance.  In [Mathis97] it is observed that TReno fits the
same performance model as SACK and Reno TCPs.  [Although the paper was
written using an older version of TReno, which has less accurate
internal measurements.]  Implementing the Congestion Avoidance
algorithm within a diagnostic tool eliminates calibration problems
associated with the non-uniformity of current TCP implementations.
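For reference, the model discussed above can be written as BW =
(MSS/RTT) * C/sqrt(p), where p is the probability of a congestion
signal and C is a constant near 1 (sqrt(3/2) under the simplest
assumptions; the exact value depends on details such as delayed ACKs).
A sketch of the calculation and its inversion (the function names are
illustrative, not from TReno):

```python
from math import sqrt

C = sqrt(3.0 / 2.0)  # model constant under the simplest assumptions

def model_btc(mss_bytes, rtt_sec, loss_prob):
    """Predicted Congestion Avoidance throughput in bytes per second:
    BW = (MSS / RTT) * C / sqrt(p), per [Mathis97]."""
    return (mss_bytes / rtt_sec) * C / sqrt(loss_prob)

def required_loss_prob(target_bps, mss_bytes, rtt_sec):
    """Invert the model: the loss probability needed to sustain
    target_bps.  Note the MSS**2 dependence: halving the segment size
    requires one quarter the loss rate for the same throughput."""
    return (mss_bytes * C / (rtt_sec * target_bps)) ** 2
```

For example, model_btc(1460, 0.070, 0.01), i.e. 1460-byte segments, a
70 ms round trip and 1% loss, predicts roughly 255 kbytes per second.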
However, like all empirical metrics it introduces new problems, most
notably the need to certify the correctness of the implementation and
to verify that there are no systematic errors due to limitations of the
tester.  Many of the calibration checks can be included in the
measurement process itself.  The TReno program includes error and
warning messages for many conditions that indicate either problems with
the infrastructure or, in some cases, problems with the measurement
process.  Other checks need to be performed manually.

Metric Name:

TReno-Type-P-Bulk-Transfer-Capacity (e.g. TReno-UDP-BTC)

Metric Parameters:

A pair of IP addresses, Src (aka "tester") and Dst (aka "target"), a
start time T and an initial MTU.

Definition:

The average data rate attained by the Congestion Avoidance algorithm,
while using type-P packets to probe the forward (Src to Dst) path.  In
the case of ICMP ping, these messages also probe the return path.
Metric Units: bits per second

Ancillary results:

 * Statistics over the entire test (data transferred, duration and
   average rate)
 * Statistics over the Congestion Avoidance portion of the test (data
   transferred, duration and average rate)
 * Path property statistics (MTU, Min RTT, max cwnd during Congestion
   Avoidance and max cwnd during Slow-start)
 * Direct measures of the analytic model parameters (number of
   congestion signals, average RTT)
 * Indications of which TCP algorithms must be present to attain the
   same performance
 * The estimated load/BW/buffering used on the return path
 * Warnings about data transmission abnormalities (e.g. packets
   out-of-order, events that cause timeouts)
 * Warnings about conditions which may affect metric accuracy (e.g.
   insufficient tester buffering)
 * Alarms about serious data transmission abnormalities (e.g. data
   duplicated in the network)
 * Alarms about internal inconsistencies of the tester and events which
   might invalidate the results
 * IP address/name of the responding target
 * TReno version

Method:

Run the TReno program on the tester with the chosen packet type
addressed to the target.  Record both the BTC and the ancillary
results.

Manual calibration checks (see detailed explanations below):

 * Verify that the tester and target have sufficient raw bandwidth to
   sustain the test.
 * Verify that the tester and target have sufficient buffering to
   support the window needed by the test.
 * Verify that there is no other system activity on the tester or
   target.
 * Verify that the return path is not a bottleneck at the load needed
   to sustain the test.
 * Verify that the IP address reported in the replies is an appropriate
   interface of the selected target.

Version control:

 * Record the precise TReno version (-V switch).
 * Record the precise tester OS version, CPU version and speed, and
   interface type and version.

Discussion:

Note that the BTC metric is defined specifically to be the average data
rate between the source and destination hosts.  The ancillary results
are designed to detect possible measurement problems, and to help
diagnose the network.  The ancillary results should not be used as
metrics in their own right.

The current version of TReno does not include an accurate model for TCP
timeouts or their effect on average throughput.  TReno takes the view
that timeouts reflect an abnormality in the network, and should be
diagnosed as such.

There are many possible reasons why a TReno measurement might not agree
with the performance obtained by a TCP-based application.  Some key
ones include: older TCPs missing key algorithms such as MTU discovery,
support for large windows or SACK, and mis-tuning of either the data
source or sink.  Some network conditions which require the newer TCP
algorithms are detected by TReno and reported in the ancillary results.
Other documents will cover methods to diagnose the difference between
TReno and TCP performance.
It would raise the accuracy of TReno's traceroute mode if the ICMP "TTL
exceeded" messages were generated at the target and transmitted along
the return path with elevated priority (reduced losses and queuing
delays).

People using the TReno metric as part of procurement documents should
be aware that in many circumstances MTU has an intrinsic and large
impact on overall path performance.  Under some conditions the
difficulty in meeting a given performance specification is inversely
proportional to the square of the path MTU.  (E.g. halving the
specified MTU makes meeting the bandwidth specification 4 times
harder.)

When used as an end-to-end metric, TReno presents exactly the same load
to the network as a properly tuned state-of-the-art bulk TCP stream
between the same pair of hosts.  Although the connection is not
transferring useful data, it is no more wasteful than fetching an
unwanted web page with the same transfer time.

Calibration checks:

The following discussion assumes that the TReno diagnostic is
implemented as a user mode program running under a standard operating
system.  Other implementations, such as those in dedicated measurement
instruments, can have stronger built-in calibration checks.

The raw performance (bandwidth) limitations of both the tester and
target should be measured by running TReno in a controlled environment
(e.g. a bench test).
Ideally the observed performance limits should be validated by
diagnosing the nature of the bottleneck and verifying that it agrees
with other benchmarks of the tester and target (e.g. that TReno
performance agrees with direct measures of backplane or memory
bandwidth or other bottlenecks as appropriate).  These raw performance
limitations may be obtained in advance and recorded for later
reference.  Currently no routers are reliable targets, although under
some conditions they can be used for meaningful measurements.  When
testing between a pair of modern computer systems at a few megabits per
second or less, the tester and target are unlikely to be the
bottleneck.

TReno may not be accurate, and should not be used as a formal metric,
at rates above half of the known tester or target limits.  This is
because during the initial Slow-start TReno needs to be able to send
bursts at twice the average data rate.  Likewise, if the link to the
first hop is not more than twice as fast as the limit of the entire
path, some of the path properties, such as max cwnd during Slow-start,
may reflect the tester's link interface and not the path itself.

There are several difficulties in verifying the tester and target
buffer capacity.  First, there are no good tests of the target's buffer
capacity at all.  Second, all validation of the tester's buffering
depends in some way on the accuracy of reports by the tester's own
operating system.  Third, there is the confusing result that in many
circumstances (particularly when the tester's raw performance is far
greater than needed) insufficient buffering in the tester does not
adversely impact measured performance.
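The headroom rule above can be reduced to a trivial helper
(illustrative only; the name and interface are not part of TReno):

```python
def max_formal_btc(tester_limit_bps, target_limit_bps):
    """Highest rate at which a TReno result should still be used as a
    formal metric: half of the weaker of the two measured raw limits,
    because Slow-start must be able to send bursts at twice the
    average data rate."""
    return min(tester_limit_bps, target_limit_bps) / 2.0
```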
TReno reports (as calibration alarms) any events in which transmit
packets were refused due to insufficient buffer space.  It reports a
warning if the maximum measured congestion window is larger than the
reported buffer space.  Although these checks are likely to be
sufficient in most cases, they are probably not sufficient in all
cases, and will be the subject of future research.

Note that on a timesharing or multi-tasking system, other activity on
the tester introduces burstiness due to operating system scheduler
latency.  Since some queuing disciplines discriminate against bursty
sources, it is important that there be no other system activity during
a test.  This should be confirmed with other operating system specific
tools.

In traceroute mode, TReno computes and reports the load it contributes
to the return path.  Unlike real TCP, TReno can not distinguish between
losses on the forward and return paths, so ideally we want the return
path to introduce as little loss as possible.  A good way to test
whether the return path has a large effect on a measurement is to
reduce the forward path messages to ACK size (40 bytes), and verify
that the measured packet rate improves by at least a factor of two.
[More research is needed.]

In ICMP mode TReno measures the net effect of both the forward and
return paths on a single data stream.  Bottlenecks and packet losses in
the forward and return paths are treated equally.
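The return-path check described above (rerun with 40-byte forward
messages and compare packet rates) can be expressed as a simple
heuristic; this is an illustration of the procedure in the text, not
TReno code:

```python
def return_path_suspect(full_size_pps, small_message_pps):
    """Compare the packet rate of a normal run against a rerun that
    uses ACK-sized (40-byte) forward messages.  If shrinking the
    forward messages does not at least double the packet rate, the
    return path may be limiting the measurement."""
    return small_message_pps < 2 * full_size_pps
```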
References

[Brakmo95] Brakmo, L., Peterson, L., "Performance problems in BSD4.4
TCP", Computer Communication Review, 25(5), October 1995.

[Comer94] Comer, D., Lin, J., "Probing TCP Implementations", USENIX
Summer 1994, June 1994.

[Floyd95] Floyd, S., "TCP and successive fast retransmits", February
1995.  Obtain via ftp://ftp.ee.lbl.gov/papers/fastretrans.ps

[Hoe95] Hoe, J., "Startup dynamics of TCP's congestion control and
avoidance schemes", Master's thesis, Massachusetts Institute of
Technology, June 1995.

[Jacobson88] Jacobson, V., "Congestion Avoidance and Control",
Proceedings of ACM SIGCOMM '88, Stanford, CA, August 1988.

[Mathis96] Mathis, M., Mahdavi, J., "Forward acknowledgment: Refining
TCP congestion control", Proceedings of ACM SIGCOMM '96, Stanford, CA,
August 1996.

[Mathis97] Mathis, M., Semke, J., Mahdavi, J., Ott, T., "The
Macroscopic Behavior of the TCP Congestion Avoidance Algorithm",
Computer Communications Review, 27(3), July 1997.

[Ott96a] Ott, T., Kemperman, J., Mathis, M., "The Stationary Behavior
of Ideal TCP Congestion Avoidance", In progress, August 1996.  Obtain
via pub/tjo/TCPwindow.ps using anonymous ftp to ftp.bellcore.com

[Ott96b] Ott, T., Kemperman, J., Mathis, M., "Window Size Behavior in
TCP/IP with Constant Loss Probability", DIMACS Special Year on
Networks, Workshop on Performance of Real-Time Applications on the
Internet, Nov 1996.

[Paxson97a] Paxson, V., "Automated Packet Trace Analysis of TCP
Implementations", Proceedings of ACM SIGCOMM '97, August 1997.

[Paxson97b] Paxson, V., editor, "Known TCP Implementation Problems",
Work in progress: http://reality.sgi.com/sca/tcp-impl/prob-01.txt

[RFC2001] Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast
Retransmit, and Fast Recovery Algorithms", January 1997.  Obtain via
ftp://ds.internic.net/rfc/rfc2001.txt

[RFC2018] Mathis, M., Mahdavi, J., Floyd, S., Romanow, A., "TCP
Selective Acknowledgment Options", October 1996.  Obtain via
ftp://ds.internic.net/rfc/rfc2018.txt

[Stevens94] Stevens, W., "TCP/IP Illustrated, Volume 1: The Protocols",
Addison-Wesley, 1994.

Author's Address

Matt Mathis
email: email@example.com
Pittsburgh Supercomputing Center
4400 Fifth Ave.
Pittsburgh PA 15213

----------------------------------------------------------------

Appendix A:

Currently the best existing description of the algorithm is in the
"FACK technical note" at http://www.psc.edu/networking/tcp.html.
Within TReno, all invocations of "bounding parameters" will be
reported as warnings.  The FACK technical note will be revised for
TReno, supplemented by a code fragment, and included here.