At the BaBar/SCS meeting of 1/29/01, concerns were raised about file throughput between
Tersk03 and Kangadata02. According to BaBar (Charlie Young and others), they are
seeing something less than 10MBytes/sec.
Configuration
Both machines have 1000Mbit/sec interfaces to the same Cisco Catalyst 6509, called
FARMCORE1, and both are on VLAN 124. Tersk03 is on blade 5, port 14;
Kangadata02 is on blade 6, port 9. Tersk03 is running Solaris 5.7
and Kangadata02 is running Solaris 5.6.
Both machines have the default TCP send and receive buffer sizes of 8kBytes. See
Enabling High Performance Data Transfers on Hosts for how to tune these
parameters. The current values can be discovered using the Unix commands:
ndd /dev/tcp tcp_xmit_hiwat
ndd /dev/tcp tcp_recv_hiwat
The Unix command uname -a reports the operating system and release.
Information on the host hardware (number of cpus, model, memory, MHz, etc.)
can be found in /afs/slac/g/scs/systems/report/hostrpt
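The ndd commands above report the Solaris kernel defaults; an application can also inspect the default buffer sizes it actually gets, on whatever host it runs on, via getsockopt. A minimal sketch (the values reported depend on the local kernel's tuning, not on the Solaris machines discussed here):

```python
import socket

# Create a TCP socket and read the OS-default send/receive buffer sizes.
# On an untuned Solaris 5.6/5.7 box these were 8kBytes; modern systems
# default much higher.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sndbuf = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
rcvbuf = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
s.close()

print(f"default SO_SNDBUF = {sndbuf} bytes")
print(f"default SO_RCVBUF = {rcvbuf} bytes")
```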
Iperf performance
I made measurements of iperf throughput from Tersk03 to Kangadata02 using
multiple flows (between 1 and 40) and multiple window sizes (between 8kB
and 1024kB), with 10 seconds for each iperf measurement. While these
loaded-by-me measurements were made, I simultaneously measured the
response time using ping. Each 10 second loaded measurement was
followed by a break of 10 seconds, during which I measured the unloaded-by-me
state of the network using ping. For more on the
methodology, see:
Bulk thruput measurements.
The iperf throughput results are shown in
the graph below. The average throughput was 425Mbits/sec or 53.3MBytes/sec.
I assume cp uses 1 stream/flow with the default Solaris window size
of 8kBytes. I measured a throughput of 64.8Mbits/s (or 8.1MBytes/s) for
that configuration. With a window size of 32kBytes I measure
170Mbits/s (or 21.2MBytes/s), and for 64kByte windows I get
220Mbits/s (or 27.5MBytes/s).
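A single TCP stream can keep at most one window of data in flight per round trip, so its throughput is bounded above by window/RTT, capped by the link rate. A small sketch of this bound; the 0.4 msec LAN RTT used here is an assumption, typical of the host-to-host RTTs measured elsewhere in this note:

```python
def tcp_window_bound_mbps(window_bytes, rtt_sec, link_mbps=1000.0):
    """Upper bound on single-stream TCP throughput: one window per RTT,
    capped by the link rate. Real throughput is usually lower (CPU and
    TCP stack overheads)."""
    return min(link_mbps, window_bytes * 8 / rtt_sec / 1e6)

rtt = 0.4e-3  # assumed LAN RTT in seconds
for window in (8 * 1024, 32 * 1024, 64 * 1024):
    print(f"{window // 1024:3d}kB window: <= "
          f"{tcp_window_bound_mbps(window, rtt):6.1f} Mbits/s")
```

The measured 8kB, 32kB and 64kB throughputs all fall below these bounds, as expected for a bound that ignores host overheads.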
The maximum throughput observed (just under 600Mbits/sec)
is a bit less than the 700Mbits/sec from a Solaris
5.6 machine to a Pentium III running Linux, presented in
High-performance Tuning/Measurements.
The upper limit is probably gated by the CPUs and operating systems/TCP stacks.
Link loading
In the graphs, the fluctuations from smooth curves are probably due to
varying competing loads (i.e. not-my-iperf load, aka cross-traffic)
on the link. The loading on the switch ports attached
to Tersk03 and Kangadata02 is shown in the plots below. The
iperf measurements ran from 14:28 through 15:21 PST on 1/29/01.
The spike of a bit under 250Mbits/s seen in both graphs
(outbound/blue for Tersk03, inbound/red for Kangadata02)
for this period shows the impact of the iperf measurements
on utilization.
The 250Mbits/s is consistent with half the
maximum performance (~500Mbits/s), since iperf ran
for 10 seconds (loaded) followed by
10 seconds without iperf (unloaded), a 50% duty cycle.
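The factor of 2 follows directly from the measurement duty cycle; a brief check of the arithmetic, using the approximate peak rate from the text above:

```python
# iperf alternated 10 s loaded and 10 s unloaded, a 50% duty cycle,
# so the switch's averaged utilization shows about half the peak rate.
peak_mbps = 500.0              # approximate maximum iperf rate
loaded, unloaded = 10.0, 10.0  # seconds on / seconds off
avg_mbps = peak_mbps * loaded / (loaded + unloaded)
print(f"expected averaged utilization: {avg_mbps:.0f} Mbits/s")
```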
The
high utilization outbound/blue from 12 noon to about 17:00
on Kangadata02 indicates that there was competing/cross traffic.
We also looked at various other metrics reported by the switch to
see if there was other evidence of congestion. These metrics included
backplane utilization (< 3% measured over 15 minute interval),
memory utilization, errors and discards, and we compared this switch
for this time period with other time periods and
with other switches at SLAC. No evidence of congestion was noted.
Throughput to other machines
We also made measurements from Tersk03 to other Sun Solaris
hosts on VLAN 124 with
1000Mbits/s interfaces, with the idea of
looking at the performance of various cpus and the impact of going from
switch to switch. The results are shown below. The switches are all
Cisco 6509s. The name of the switch (e.g. farmcore1) is followed
by a colon (:) and then the blade and port separated by a slash (/).
The column labelled
Kangadata02 - 2 is a repeat, made at a later time,
of the Kangadata02 measurement.
| Metric | Kangadata01 | Kangadata02 | Kangadata02 - 2 | Kangadata02 - 3 | Tersk07 | Tersk07 - 2 | Kangadata03 | Kangadata03 - 2 |
|---|---|---|---|---|---|---|---|---|
| Cpu | Sun 250R | Sun 420 | Sun 420 | Sun 420 | Sun Netra t 1400/1405 | Sun Netra t 1400/1405 | Sun Netra t 1400/1405 | Sun Netra t 1400/1405 |
| MHz | 2cpus @ 400MHz | 4cpus @ 450MHz | 4cpus @ 450MHz | 4cpus @ 450MHz | 4cpus @ 440MHz | 4cpus @ 440MHz | 4cpus @ 440MHz | 4cpus @ 440MHz |
| Memory | 0.5GBytes | 4GBytes | 4GBytes | 4GBytes | 4GBytes | 4GBytes | 1GBytes | 1GBytes |
| OS | Solaris 5.6 | Solaris 5.7 | Solaris 5.7 | Solaris 5.7 | Solaris 5.7 | Solaris 5.7 | Solaris 5.7 | Solaris 5.7 |
| Switch port | Farm3: 4/1 | Farmcore1: 5/14 | Farmcore1: 5/14 | Farmcore1: 5/14 | Farmcore2: 4/13 | Farmcore2: 4/13 | Farmcore1: 6/16 | Farmcore1: 6/16 |
| Average throughput | 265Mbits/s | 426Mbits/s | 432Mbits/s | 514Mbits/s | 476Mbits/s | 552Mbits/s | 316Mbits/s | 335Mbits/s |
| Average throughput | 33MBytes/s | 53MBytes/s | 54MBytes/s | 64MBytes/s | 59MBytes/s | 69MBytes/s | 39MBytes/s | 42MBytes/s |
| Median throughput | 272Mbits/s | 458Mbits/s | 458Mbits/s | 556Mbits/s | 496Mbits/s | 578Mbits/s | 310Mbits/s | 345Mbits/s |
| Max throughput | 388Mbits/s | 585Mbits/s | 655Mbits/s | 664Mbits/s | 807Mbits/s | 766Mbits/s | 487Mbits/s | 502Mbits/s |
| IQR | 124Mbits/s | 147Mbits/s | 127Mbits/s | 161Mbits/s | 292Mbits/s | 121Mbits/s | 155Mbits/s | 175Mbits/s |
| Min loaded ping RTT | 0msec | 0msec | 0msec | 0.18msec | 0msec | 0msec | 0msec | 0msec |
| Avg loaded ping RTT | 11msec | 8msec | 19msec | 7msec | 22msec | 14msec | 23msec | 8.5msec |
| Max loaded ping RTT | 542msec | 99msec | 880msec | 603msec | 790msec | 2272msec | 811msec | 299msec |
| Loaded loss | 2/1360 | 1/1360 | 2/1360 | 3/1360 | 1/1360 | 0/1360 | 4/1360 | 0/1360 |
| Start measurement | 12:26, 1/30/01 | 14:28, 1/29/01 | 8:54, 1/30/01 | 12:28, 2/6/01 | 15:58, 1/30/01 | 7:53, 2/1/01 | 22:17, 1/30/01 | 18:38, 1/31/01 |
| End measurement | 13:20, 1/30/01 | 15:21, 1/29/01 | 9:48, 1/30/01 | 13:24, 2/6/01 | 16:53, 1/30/01 | 8:46, 2/1/01 | 23:11, 1/30/01 | 19:32, 1/31/01 |
Datamove8 to datamove7
On 3/15/01, Andy Hanushevsky reported being able to get only 200Mbits/s
between datamove7 and datamove8.
Both hosts were lightly loaded.
These machines are Sun 450s with 4 * 400MHz
cpus and 1GByte of memory.
Both machines had
a 1Gbps Ethernet interface and are on the same subnet, the same VLAN (124),
on the same Cisco Catalyst 6500 switch (SWH-FARMCORE1) but different
interface cards (datamove7 was on card 6 port 2, datamove8 was on card 5
port 10).
Upon investigation, both
machines had the default Solaris 5.7 maximum window size of 8kBytes. We used
the NIKHEF ping to measure the RTT between the machines (the Solaris
ping only reports to 1 msec accuracy). The min/avg/max RTT for a 64Byte ping
was 0.203/0.434/2.073 msec for 1360 pings. The ping RTT for a 64Byte ping
to the localhost
loopback address (127.0.0.1) was 0.105/0.149/0.239 msec.
For a 1Gbps bandwidth this gives
an RTT*bandwidth product of about 32kBytes. With a window size of 8kBytes,
we were able to get about 176Mbits/s. Increasing the maximum window size
to 32kBytes, we were able to get ~400Mbits/sec throughput as measured by iperf.
Further increases of the window size, up to 1MByte, did not improve
single-stream performance any further. We were able to achieve a throughput
of over 700Mbits/s with a window size of 512kBytes and 15 parallel streams.
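The ~32kByte figure is just the bandwidth*delay product; a quick sketch of the calculation. The 0.25 msec effective RTT used below is an assumption lying between the measured min (0.203 msec) and avg (0.434 msec) host-to-host RTTs quoted above:

```python
def bdp_bytes(bandwidth_bps, rtt_sec):
    """Bandwidth*delay product: the window needed to keep a path full."""
    return bandwidth_bps * rtt_sec / 8

# 1 Gbit/s path, ~0.25 msec assumed effective RTT.
print(f"BDP ~ {bdp_bytes(1e9, 0.25e-3) / 1024:.0f} kBytes")
```

A window smaller than the BDP (8kBytes here) caps a single stream well below the link rate, which is consistent with the 176Mbits/s observed.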
The plot below shows the variation of iperf TCP throughput from
datamove8 to datamove7 as a function of maximum window size and
number of parallel streams.
Datamove38 to datamove39
Both hosts were lightly loaded.
These machines are Sun 450s with 4 * 450MHz
cpus and 1GByte of memory.
Both machines had
a 1Gbps Ethernet interface and are on the same subnet, the same VLAN (124),
on the same Cisco Catalyst 6500 switch (SWH-FARMCORE2) but different
interface cards (datamove38 was on card 7 port 5, datamove39 was on card 7
port 6). Both machines were running Solaris 5.8.
On both machines I set:
ndd -set /dev/tcp tcp_max_buf 4194304
ndd -set /dev/tcp tcp_cwnd_max 2097152
We used
the NIKHEF ping to measure the RTT between the machines (the Solaris
ping only reports to 1 msec accuracy). The min/avg/max RTT for a 64Byte ping
was 0.171/0.252/0.674 msec for 136 pings. The ping RTT for a 64Byte ping
to the localhost
loopback address (127.0.0.1) was 0.105/0.149/0.239 msec.
For a 1Gbps bandwidth this gives
an RTT*bandwidth product of about 32kBytes. With one stream for 10 seconds
and a window size of 8kBytes,
we were able to get about 160Mbits/s. Increasing the maximum window size
to 32kBytes, still with one
stream for 10 seconds,
we were able to get ~350Mbits/sec throughput as measured by iperf.
Further increases of the window size, up to 4MBytes,
did not increase single-stream performance further.
We were able to achieve a throughput of over 650Mbits/s with
a window size of 512kBytes and 4 parallel streams.
The plot to the right shows the variation of iperf TCP throughput from
datamove38 to datamove39 as a function of maximum window size and
number of parallel streams. Comparing the results measured above from
datamove8 to datamove7 (E450s with 4*400MHz cpus running Solaris 5.7)
with those from
datamove38 to datamove39 (E450s with 4*450MHz cpus
running Solaris 5.8) indicates that the
newer operating system, even with over 10% more MHz per cpu,
does not improve performance in this case.
Recommendation
Try increasing the window sizes on the two machines from 8kBytes
to 64kBytes. This is not out of line with other operating systems, and
from the graph above appears to give good performance (within about
a factor of 2 of the maximum achievable) for 1 stream.
An evaluation of the
impact on memory usage will be needed (how much memory the machines have,
how full it is, etc.). Also consider doing this for other
Solaris machines which have high TCP network throughput requirements.
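Besides raising the system-wide defaults with ndd, an application can request larger buffers for itself; this is essentially what window-size-aware transfer tools do. A hedged sketch in Python (a modern stand-in for illustration, not the interface the tools of the time used):

```python
import socket

def make_tuned_socket(bufsize=64 * 1024):
    """Create a TCP socket requesting 64kByte send/receive buffers.
    This must be done before connect/listen so the buffer size can
    influence the advertised window; the kernel may round or cap
    the requested value."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bufsize)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bufsize)
    return s

s = make_tuned_socket()
print("SO_SNDBUF now:", s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
s.close()
```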
For copying data between machines, even including cases where both machines
are at the same site, BaBar should look to using applications that provide
support for varying the window sizes, and using multiple parallel flows.
Such applications include
bbftp and
sfcp.
Page owner: Les Cottrell