At the BaBar/SCS meeting of 1/29/01, concerns were raised about file throughput between
Tersk03 and Kangadata02. According to BaBar (Charlie Young and others), they are
seeing something less than 10MBytes/sec.
Configuration
Both machines have 1000Mbit/sec interfaces to the same Cisco Catalyst 6509, called
FARMCORE1, and both are on VLAN 124. Tersk03 is on blade 5, port 14;
Kangadata02 is on blade 6, port 9. Tersk03 is running Solaris 5.7
and Kangadata02 is running Solaris 5.6.
Both machines have the default TCP send and receive buffer sizes of 8kBytes. See
Enabling High Performance Data Transfers on Hosts for how to tune these
parameters. The current values can be discovered using the Unix commands:
ndd /dev/tcp tcp_xmit_hiwat
ndd /dev/tcp tcp_recv_hiwat
The Unix command uname -a reports the operating system and release.
Information on the host hardware (number of cpus, model, memory, MHz, etc.)
can be found in /afs/slac/g/scs/systems/report/hostrpt
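The ndd commands above report the Solaris kernel defaults; an application can also inspect the default buffer sizes it actually gets, on whatever host it runs on, via getsockopt. A minimal sketch (the values reported depend on the local kernel's tuning, not on the Solaris machines discussed here):

```python
import socket

# Create a TCP socket and read the OS-default send/receive buffer sizes.
# On an untuned Solaris 5.6/5.7 box these were 8kBytes; modern systems
# default much higher.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sndbuf = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
rcvbuf = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
s.close()

print(f"default SO_SNDBUF = {sndbuf} bytes")
print(f"default SO_RCVBUF = {rcvbuf} bytes")
```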
Iperf performance
I made measurements of iperf throughput from Tersk03 to Kangadata02 using
multiple flows (between 1 and 40) and multiple window sizes (between 8kB
and 1024kB), with 10 seconds for each iperf measurement. While these
loaded-by-me measurements were made, I simultaneously measured the
response time using ping. Each 10 second loaded measurement was
followed by a break of 10 seconds, during which I measured the unloaded-by-me
state of the network using ping. For more on the
methodology, see:
Bulk thruput measurements.
The iperf throughput results are shown in
the graph below. The average throughput was 425Mbits/sec or 53.3MBytes/sec.
I assume cp uses 1 stream/flow with the default Solaris window size
of 8kBytes. I measured a throughput of 64.8Mbits/s (or 8.1MBytes/s) for
that configuration. With a window size of 32kBytes I measure
170Mbits/s (or 21.2MBytes/s), and for 64kByte windows I get
220Mbits/s (or 27.5MBytes/s).
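A single TCP stream can keep at most one window of data in flight per round trip, so its throughput is bounded above by window/RTT, capped by the link rate. A small sketch of this bound; the 0.4 msec LAN RTT used here is an assumption, typical of the host-to-host RTTs measured elsewhere in this note:

```python
def tcp_window_bound_mbps(window_bytes, rtt_sec, link_mbps=1000.0):
    """Upper bound on single-stream TCP throughput: one window per RTT,
    capped by the link rate. Real throughput is usually lower (CPU and
    TCP stack overheads)."""
    return min(link_mbps, window_bytes * 8 / rtt_sec / 1e6)

rtt = 0.4e-3  # assumed LAN RTT in seconds
for window in (8 * 1024, 32 * 1024, 64 * 1024):
    print(f"{window // 1024:3d}kB window: <= "
          f"{tcp_window_bound_mbps(window, rtt):6.1f} Mbits/s")
```

The measured 8kB, 32kB and 64kB throughputs all fall below these bounds, as expected for a bound that ignores host overheads.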
The maximum throughput observed (just under 600Mbits/sec)
is a bit less than the 700Mbits/sec from a Solaris
5.6 machine to a Pentium III running Linux, presented in
High-performance Tuning/Measurements.
The upper limit is probably gated by the CPUs and operating systems/TCP stacks.
Link loading
In the graphs, the fluctuations from smooth curves are probably due to
varying competing loads (i.e. not-my-iperf load, aka cross-traffic)
on the link. The loading on the switch ports attached
to Tersk03 and Kangadata02 is shown in the plots below. The
iperf measurements ran from 14:28 through 15:21 PST on 1/29/01.
The spike of a bit under 250Mbits/s seen in both graphs
(outbound/blue for Tersk03, inbound/red for Kangadata02)
for this period shows the impact of the iperf measurements
on utilization.
The 250Mbits/s is consistent with half the
maximum performance (~500Mbits/s), since iperf ran
for 10 seconds (loaded) followed by
10 seconds without iperf (unloaded), a 50% duty cycle.
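The factor of 2 follows directly from the measurement duty cycle; a brief check of the arithmetic, using the approximate peak rate from the text above:

```python
# iperf alternated 10 s loaded and 10 s unloaded, a 50% duty cycle,
# so the switch's averaged utilization shows about half the peak rate.
peak_mbps = 500.0              # approximate maximum iperf rate
loaded, unloaded = 10.0, 10.0  # seconds on / seconds off
avg_mbps = peak_mbps * loaded / (loaded + unloaded)
print(f"expected averaged utilization: {avg_mbps:.0f} Mbits/s")
```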
The
high utilization outbound/blue from 12 noon to about 17:00
on Kangadata02 indicates that there was competing/cross traffic.
We also looked at various other metrics reported by the switch to
see if there was other evidence of congestion. These metrics included
backplane utilization (< 3% measured over 15 minute interval),
memory utilization, errors and discards, and we compared this switch
for this time period with other time periods and
with other switches at SLAC. No evidence of congestion was noted.
Throughput to other machines
We also made measurements from Tersk03 to other Sun Solaris
hosts on VLAN 124 with
1000Mbits/s interfaces, with the idea of
looking at the performance of various cpus and the impact of going from
switch to switch. The results are shown below. The switches are all
Cisco 6509s. The name of the switch (e.g. farmcore1) is followed
by a colon (:) and then the blade and port separated by a slash (/).
The column labelled
Kangadata02 - 2 is a repeat, made at a later time,
of the Kangadata02 measurement.
| Metric | Kangadata01 | Kangadata02 | Kangadata02 - 2 | Kangadata02 - 3 | Tersk07 | Tersk07 - 2 | Kangadata03 | Kangadata03 - 2 |
|---|---|---|---|---|---|---|---|---|
| Cpu | Sun 250R | Sun 420 | Sun 420 | Sun 420 | Sun Netra t 1400/1405 | Sun Netra t 1400/1405 | Sun Netra t 1400/1405 | Sun Netra t 1400/1405 |
| MHz | 2cpus @ 400MHz | 4cpus @ 450MHz | 4cpus @ 450MHz | 4cpus @ 450MHz | 4cpus @ 440MHz | 4cpus @ 440MHz | 4cpus @ 440MHz | 4cpus @ 440MHz |
| Memory | 0.5GBytes | 4GBytes | 4GBytes | 4GBytes | 4GBytes | 4GBytes | 1GBytes | 1GBytes |
| OS | Solaris 5.6 | Solaris 5.7 | Solaris 5.7 | Solaris 5.7 | Solaris 5.7 | Solaris 5.7 | Solaris 5.7 | Solaris 5.7 |
| Switch port | Farm3: 4/1 | Farmcore1: 5/14 | Farmcore1: 5/14 | Farmcore1: 5/14 | Farmcore2: 4/13 | Farmcore2: 4/13 | Farmcore1: 6/16 | Farmcore1: 6/16 |
| Average throughput | 265Mbits/s | 426Mbits/s | 432Mbits/s | 514Mbits/s | 476Mbits/s | 552Mbits/s | 316Mbits/s | 335Mbits/s |
| Average throughput | 33MBytes/s | 53MBytes/s | 54MBytes/s | 64MBytes/s | 59MBytes/s | 69MBytes/s | 39MBytes/s | 42MBytes/s |
| Median throughput | 272Mbits/s | 458Mbits/s | 458Mbits/s | 556Mbits/s | 496Mbits/s | 578Mbits/s | 310Mbits/s | 345Mbits/s |
| Max throughput | 388Mbits/s | 585Mbits/s | 655Mbits/s | 664Mbits/s | 807Mbits/s | 766Mbits/s | 487Mbits/s | 502Mbits/s |
| IQR | 124Mbits/s | 147Mbits/s | 127Mbits/s | 161Mbits/s | 292Mbits/s | 121Mbits/s | 155Mbits/s | 175Mbits/s |
| Min loaded ping RTT | 0msec | 0msec | 0msec | 0.18msec | 0msec | 0msec | 0msec | 0msec |
| Avg loaded ping RTT | 11msec | 8msec | 19msec | 7msec | 22msec | 14msec | 23msec | 8.5msec |
| Max loaded ping RTT | 542msec | 99msec | 880msec | 603msec | 790msec | 2272msec | 811msec | 299msec |
| Loaded loss | 2/1360 | 1/1360 | 2/1360 | 3/1360 | 1/1360 | 0/1360 | 4/1360 | 0/1360 |
| Start measurement | 12:26, 1/30/01 | 14:28, 1/29/01 | 8:54, 1/30/01 | 12:28, 2/6/01 | 15:58, 1/30/01 | 7:53, 2/1/01 | 22:17, 1/30/01 | 18:38, 1/31/01 |
| End measurement | 13:20, 1/30/01 | 15:21, 1/29/01 | 9:48, 1/30/01 | 13:24, 2/6/01 | 16:53, 1/30/01 | 8:46, 2/1/01 | 23:11, 1/30/01 | 19:32, 1/31/01 |
Datamove8 to datamove7
On 3/15/01, Andy Hanushevsky reported being able to get only 200Mbits/s
between datamove7 and datamove8.
Both hosts were lightly loaded.
These machines are Sun 450s with 4 * 400MHz
cpus and 1GByte of memory.
Both machines had
a 1Gbps Ethernet interface and are on the same subnet, the same VLAN (124),
on the same Cisco Catalyst 6500 switch (SWH-FARMCORE1) but different
interface cards (datamove7 was on card 6 port 2, datamove8 was on card 5
port 10).
Upon investigation, both
machines had the default Solaris 5.7 maximum window size of 8kBytes. We used
the NIKHEF ping to measure the RTT between the machines (the Solaris
ping only reports to 1 msec accuracy). The min/avg/max RTT for a 64Byte ping
was 0.203/0.434/2.073 msec for 1360 pings. The ping RTT for a 64Byte ping
to the localhost
loopback address (127.0.0.1) was 0.105/0.149/0.239 msec.
For a 1Gbps bandwidth this gives
an RTT*bandwidth product of about 32kBytes. With a window size of 8kBytes,
we were able to get about 176Mbits/s. Increasing the maximum window size
to 32kBytes, we were able to get ~400Mbits/sec throughput as measured by iperf.
Further increases of the window size, up to 1MByte, did not improve
single-stream performance any further. We were able to achieve a throughput
of over 700Mbits/s with a window size of 512kBytes and 15 parallel streams.
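The ~32kByte figure is just the bandwidth*delay product; a quick sketch of the calculation. The 0.25 msec effective RTT used below is an assumption lying between the measured min (0.203 msec) and avg (0.434 msec) host-to-host RTTs quoted above:

```python
def bdp_bytes(bandwidth_bps, rtt_sec):
    """Bandwidth*delay product: the window needed to keep a path full."""
    return bandwidth_bps * rtt_sec / 8

# 1 Gbit/s path, ~0.25 msec assumed effective RTT.
print(f"BDP ~ {bdp_bytes(1e9, 0.25e-3) / 1024:.0f} kBytes")
```

A window smaller than the BDP (8kBytes here) caps a single stream well below the link rate, which is consistent with the 176Mbits/s observed.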
The plot below shows the variation of iperf TCP throughput from
datamove8 to datamove7 as a function of maximum window size and
number of parallel streams.
Datamove38 to datamove39
Both hosts were lightly loaded.
These machines are Sun 450s with 4 * 450MHz
cpus and 1GByte of memory.
Both machines had
a 1Gbps Ethernet interface and are on the same subnet, the same VLAN (124),
on the same Cisco Catalyst 6500 switch (SWH-FARMCORE2) but different
interface cards (datamove38 was on card 7 port 5, datamove39 was on card 7
port 6). Both machines were running Solaris 5.8.
On both machines I set:
ndd -set /dev/tcp tcp_max_buf 4194304
ndd -set /dev/tcp tcp_cwnd_max 2097152
We used
the NIKHEF ping to measure the RTT between the machines (the Solaris
ping only reports to 1 msec accuracy). The min/avg/max RTT for a 64Byte ping
was 0.171/0.252/0.674 msec for 136 pings. The ping RTT for a 64Byte ping
to the localhost
loopback address (127.0.0.1) was 0.105/0.149/0.239 msec.
For a 1Gbps bandwidth this gives
an RTT*bandwidth product of about 32kBytes. With one stream for 10 seconds
and a window size of 8kBytes,
we were able to get about 160Mbits/s. Increasing the maximum window size
to 32kBytes, still with one
stream for 10 seconds,
we were able to get ~350Mbits/sec throughput as measured by iperf.
Further increases of the window size, up to 4MBytes,
did not increase single-stream performance further.
We were able to achieve a throughput of over 650Mbits/s with
a window size of 512kBytes and 4 parallel streams.
The plot to the right shows the variation of iperf TCP throughput from
datamove38 to datamove39 as a function of maximum window size and
number of parallel streams. Comparing the results measured above from
datamove8 to datamove7 (E450s with 4*400MHz cpus running Solaris 5.7)
with those from
datamove38 to datamove39 (E450s with 4*450MHz cpus
running Solaris 5.8) indicates that the
newer operating system, even with over 10% more MHz per cpu,
does not improve performance in this case.
Recommendation
Try increasing the window sizes on the two machines from 8kBytes
to 64kBytes. This is not out of line with other operating systems, and
from the graph above appears to give good performance (within about
a factor of 2 of the maximum achievable) for 1 stream.
An evaluation of the
impact on memory usage will be needed (how much memory the machines have,
how full it is, etc.). Also consider doing this for other
Solaris machines which have high TCP network throughput requirements.
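Besides raising the system-wide defaults with ndd, an application can request larger buffers for itself; this is essentially what window-size-aware transfer tools do. A hedged sketch in Python (a modern stand-in for illustration, not the interface the tools of the time used):

```python
import socket

def make_tuned_socket(bufsize=64 * 1024):
    """Create a TCP socket requesting 64kByte send/receive buffers.
    This must be done before connect/listen so the buffer size can
    influence the advertised window; the kernel may round or cap
    the requested value."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bufsize)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bufsize)
    return s

s = make_tuned_socket()
print("SO_SNDBUF now:", s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
s.close()
```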
For copying data between machines, even including cases where both machines
are at the same site, BaBar should look to using applications that provide
support for varying the window sizes, and using multiple parallel flows.
Such applications include
bbftp and
sfcp.
Page owner: Les Cottrell