
Throughput between Tersk03 and Kangadata

Les Cottrell. Page created: January 29, 2001.


Introduction

At the BaBar/SCS meeting of 1/29/01 concerns were raised about file throughput between Tersk03 and Kangadata02. According to BaBar (Charlie Young and others), they are seeing something less than 10MBytes/sec.

Configuration

Both machines have 1000Mbit/sec interfaces to the same Cisco Catalyst 6509, called FARMCORE1, and both are on VLAN 124. Tersk03 is on blade 5, port 14; Kangadata02 is on blade 6, port 9. Tersk03 is running Solaris 5.7 and Kangadata02 is running Solaris 5.6. Both machines have the default TCP send and receive buffer sizes of 8kBytes. See Enabling High Performance Data Transfers on Hosts for how to tune these parameters. The buffer sizes are discovered using the Unix commands ndd /dev/tcp tcp_xmit_hiwat and ndd /dev/tcp tcp_recv_hiwat, and the Unix command uname -a reports the operating system and release. Information on the host hardware (number of cpus, model, memory, MHz etc.) can be found in /afs/slac/g/scs/systems/report/hostrpt.
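For convenience, the checks just described can be run on each host as follows (these are the same commands named above, collected here for reference; the parenthetical notes are brief reminders of what each reports):
ndd /dev/tcp tcp_xmit_hiwat    (default TCP send buffer/window, in bytes)
ndd /dev/tcp tcp_recv_hiwat    (default TCP receive buffer/window, in bytes)
uname -a                       (operating system name and release)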

Iperf performance

I made measurements of iperf throughput from Tersk03 to Kangadata02 using multiple flows (between 1 and 40) and multiple window sizes (between 8kBytes and 1024kBytes), with 10 seconds for each iperf measurement. While these loaded-by-me measurements were made I simultaneously measured the response time using ping. Each 10 second loaded measurement was followed by a break of 10 seconds during which I measured the unloaded-by-me state of the network using ping. For more on the methodology, see: Bulk thruput measurements. The iperf throughput results are shown in the graph below. The average throughput was 425Mbits/sec or 53.3MBytes/sec. I assume cp is using 1 stream/flow with the default Solaris window size of 8kBytes. I measured throughput of 64.8Mbits/s (or 8.14MBytes/s) for that configuration. If the window size was 32kBytes then I measured 170Mbits/s (or 21.2MBytes/s), and for 64kByte windows I got 220Mbits/s or 27.5MBytes/s.
Iperf throughput from Tersk03 to Kangadata02
The maximum throughput observed (just under 600Mbits/sec) is a bit less than the 700Mbits/sec from a Solaris 5.6 machine to a Pentium III running Linux, presented in High-performance Tuning/Measurements. Probably the upper limit is gated by the CPUs and operating systems/TCP stacks.
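The exact iperf command lines are not recorded on this page; a typical pair of invocations of the kind used for such measurements would look like the following (the window size and stream count shown are illustrative, not the full set of values scanned):
On kangadata02 (server):  iperf -s -w 64k
On tersk03 (client):      iperf -c kangadata02 -w 64k -P 4 -t 10
Here -w sets the requested TCP window (socket buffer) size, -P the number of parallel streams, and -t the duration of the measurement in seconds.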

Link loading

In the graphs, the fluctuations from smooth curves are probably due to varying competing loads on the link (i.e. cross-traffic not generated by my iperf tests). The loading on the switch ports attached to Tersk03 and Kangadata02 is shown in the plots below. The iperf measurements ran from 14:28 through 15:21 1/29/01 PST. The spike seen in both graphs (outbound/blue for Tersk03, and inbound/red for Kangadata02) of a bit under 250Mbits/s for this period shows the impact of the iperf measurements on utilization. The 250Mbits/s is consistent with a factor of 2 under the maximum performance (of ~500Mbits/s), since iperf ran for only 10 seconds out of every 20 (10 seconds loaded followed by 10 seconds unloaded), so the average port utilization is roughly half the rate achieved while iperf was running. The high outbound/blue utilization from 12 noon to about 17:00 on Kangadata02 indicates that there was competing/cross traffic.
Utilization of Tersk03 port on switch
Utilization of Kangadata02 port on switch
We also looked at various other metrics reported by the switch to see if there was other evidence of congestion. These metrics included backplane utilization (< 3% measured over 15 minute interval), memory utilization, errors and discards, and we compared this switch for this time period with other time periods and with other switches at SLAC. No evidence of congestion was noted.

Throughput to Other machines

We also made measurements from Tersk03 to other Sun Solaris hosts on VLAN 124 with 1000Mbits/s interfaces, with the idea of looking at the performance of various cpus and the impact of going from switch to switch. The results are shown below. The switches are all Cisco 6509s. The name of the switch (e.g. farmcore1) is followed by a colon (:) and then the blade and port separated by a slash (/). The column labelled Kangadata02 - 2 is a repeat, made at a later time, of the Kangadata02 measurement.
Metric Kangadata01 Kangadata02 Kangadata02 - 2 Kangadata02 - 3 Tersk07 Tersk07 - 2 Kangadata03 Kangadata03 - 2
Cpu Sun 250R Sun 420 Sun 420 Sun 420 Sun Netra t 1400/1405 Sun Netra t 1400/1405 Sun Netra t 1400/1405 Sun Netra t 1400/1405
CPUs 2cpus @ 400MHz 4cpus @ 450MHz 4cpus @ 450MHz 4cpus @ 450MHz 4cpus @ 440MHz 4cpus @ 440MHz 4cpus @ 440MHz 4cpus @ 440MHz
Memory 0.5GBytes 4GBytes 4GBytes 4GBytes 4GBytes 4GBytes 1GBytes 1GBytes
OS Solaris 5.6 Solaris 5.7 Solaris 5.7 Solaris 5.7 Solaris 5.7 Solaris 5.7 Solaris 5.7 Solaris 5.7
Switch port Farm3: 4/1 Farmcore1: 5/14 Farmcore1: 5/14 Farmcore1: 5/14 Farmcore2: 4/13 Farmcore2: 4/13 Farmcore1: 6/16 Farmcore1: 6/16
Average throughput 265Mbits/s 426Mbits/s 432Mbits/s 514Mbits/s 476Mbits/s 552Mbits/s 316Mbits/s 335Mbits/s
Average throughput 33MBytes/s 53MBytes/s 54MBytes/s 64MBytes/s 59MBytes/s 69MBytes/s 39MBytes/s 42MBytes/s
Median throughput 272Mbits/s 458Mbits/s 458Mbits/s 556Mbits/s 496Mbits/s 578Mbits/s 310Mbits/s 345Mbits/s
Max throughput 388Mbits/s 585Mbits/s 655Mbits/s 664Mbits/s 807Mbits/s 766Mbits/s 487Mbits/s 502Mbits/s
IQR of throughput 124Mbits/s 147Mbits/s 127Mbits/s 161Mbits/s 292Mbits/s 121Mbits/s 155Mbits/s 175Mbits/s
Min loaded ping RTT 0msec 0msec 0msec 0.18msec 0msec 0msec 0msec 0msec
Avg loaded ping RTT 11msec 8msec 19msec 7msec 22msec 14msec 23msec 8.5msec
Max loaded ping RTT 542msec 99msec 880msec 603msec 790msec 2272msec 811msec 299msec
Loaded loss 2/1360 1/1360 2/1360 3/1360 1/1360 0/1360 4/1360 0/1360
Start measurement 12:26, 1/30/01 14:28, 1/29/01 8:54, 1/30/01 12:28, 2/6/01 15:58, 1/30/01 7:53, 2/1/01 22:17, 1/30/01 18:38, 1/31/01
End measurement 13:20, 1/30/01 15:21, 1/29/01 9:48, 1/30/01 13:24, 2/6/01 16:53, 1/30/01 8:46, 2/1/01 23:11, 1/30/01 19:32, 1/31/01

Datamove8 to datamove7

On 3/15/01, Andy Hanushevsky reported only being able to get 200Mbits/s between datamove7 and datamove8. Both hosts were lightly loaded. These machines are Sun 450s with 4 * 400MHz cpus and 1GByte of memory. Both machines had a 1Gbps Ethernet interface and are on the same subnet, the same VLAN (124), and the same Cisco Catalyst 6500 switch (SWH-FARMCORE1), but on different interface cards (datamove7 was on card 6 port 2, datamove8 was on card 5 port 10). Upon investigation both machines had the default Solaris 5.7 maximum window size of 8kBytes. We used the NIKHEF ping to measure the RTT between the machines (the Solaris ping only reports to 1 msec accuracy). The min/avg/max RTT for a 64Byte ping was 0.203/0.434/2.073 msec for 1360 pings. The ping RTT for a 64Byte ping to the localhost loopback address (127.0.0.1) was 0.105/0.149/0.239 msec. For a 1Gbps bandwidth this gives an RTT*Bandwidth product of about 32kBytes. With a window size of 8kBytes, we were able to get about 176Mbits/s. Increasing the maximum window size to 32kBytes we were able to get ~400Mbits/sec throughput measured by iperf. Further increases of the window size up to 1MByte did not improve performance any further. We were able to achieve a throughput of over 700Mbits/s with a window size of 512kBytes and 15 parallel streams. The plot below shows the variation of iperf TCP throughput from datamove8 to datamove7 as a function of maximum window size and number of parallel streams.
Iperf throughput from datamove8 to datamove7
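As a rough cross-check of the RTT*Bandwidth product quoted above (this is my reconstruction; the effective RTT of ~0.25 msec is an assumption, roughly the measured average between the hosts less the loopback overhead):
1 Gbit/s x 0.25 msec = 10^9 bits/s x 2.5x10^-4 s = 250,000 bits ≈ 31 kBytes
This is close to the ~32 kBytes quoted, and is consistent with the observation that raising the window from the 8kByte default towards and beyond the bandwidth-delay product recovers much of the single-stream throughput.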

Datamove28 to datamove29

Iperf throughput from datamove28 to datamove29
Both hosts were lightly loaded. These machines are Sun 450s with 4 * 450MHz cpus and 1GByte of memory. Both machines had a 1Gbps Ethernet interface and are on the same subnet, the same VLAN (124), and the same Cisco Catalyst 6500 switch (SWH-FARMCORE2), but on different interface cards (datamove28 was on card 7 port 5, datamove29 was on card 7 port 6). Both machines were running Solaris 5.8. I set both machines to have
ndd -set /dev/tcp tcp_max_buf 4194304
ndd -set /dev/tcp tcp_cwnd_max 2097152
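(As a note on these Solaris tunables, to the best of my understanding: tcp_max_buf is the largest socket buffer an application may request via setsockopt, and tcp_cwnd_max caps the TCP congestion window. Raising them permits, but does not by itself cause, larger windows to be used; the larger window still has to be requested, e.g. via iperf's window option.)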
We used the NIKHEF ping to measure the RTT between the machines (the Solaris ping only reports to 1 msec accuracy). The min/avg/max RTT for a 64Byte ping was 0.171/0.252/0.674 msec for 136 pings. The ping RTT for a 64Byte ping to the localhost loopback address (127.0.0.1) was 0.105/0.149/0.239 msec. For a 1Gbps bandwidth this gives an RTT*Bandwidth product of about 32kBytes. With one stream for 10 seconds and a window size of 8kBytes, we were able to get about 160Mbits/s. Increasing the maximum window size to 32kBytes, still with one stream for 10 seconds, we were able to get ~350Mbits/sec throughput measured by iperf. Further increases of the window size up to 4MBytes did not increase the performance further with only one stream. We were able to achieve a throughput of over 650Mbits/s with a window size of 512kBytes and 4 parallel streams. The plot above shows the variation of iperf TCP throughput from datamove28 to datamove29 as a function of maximum window size and number of parallel streams. Comparing the results measured above from datamove8 to datamove7 (E450s with 4*400MHz cpus running Solaris 5.7) with the results from datamove28 to datamove29 (E450s with 4*450MHz cpus running Solaris 5.8) indicates that the newer operating system, with over 10% more MHz per cpu, does not improve performance in this case.

Recommendation

Try increasing the window sizes on the two machines from 8kBytes to 64kBytes. This is not out of line with other operating systems, and from the graph above it appears to give good performance (within about a factor of 2 of the maximum achievable) for 1 stream. Evaluation of the impact on memory usage will be needed (how much memory do the machines have, how full is it, etc.). Also consider doing this for other Solaris machines which have high TCP network throughput requirements. For copying data between machines, even including cases where both machines are at the same site, BaBar should look at using applications that provide support for varying the window sizes and using multiple parallel flows. Such applications include bbftp and sfcp.
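If this recommendation is adopted, one way to make the change on Solaris is sketched below (not verified on these particular hosts; the new defaults apply only to connections opened after the change and revert at reboot unless added to a startup script):
ndd -set /dev/tcp tcp_xmit_hiwat 65536
ndd -set /dev/tcp tcp_recv_hiwat 65536
Alternatively, applications that call setsockopt themselves (as iperf does with its window option) can request larger buffers per connection, up to tcp_max_buf, without changing the system-wide defaults.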
Page owner: Les Cottrell