Created: May 14, 1999; last updated by: Les Cottrell on February 27, 2000
Unless otherwise noted, the pings were sent at one second intervals with a timeout of 20 seconds and a payload (including the 8 ICMP protocol bytes) of 100 bytes.
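For reference, roughly equivalent settings with a modern Linux (iputils) ping would be as shown below; the measurements themselves used the NIKHEF ping, whose options differ:

```
>ping -i 1 -W 20 -s 92 some.host.example   # 92 data bytes + 8 ICMP header bytes = 100 bytes
```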
Server name | Server hardware | Server OS | Server interface speed | Network connection devices & speeds
---|---|---|---|---
mercury | Sun Ultra 5 | Solaris 5.6 | 10 Mbps HDX shared | Same shared 10 Mbps hub
charon | Sun Ultra 1 | Solaris 5.6 | 10 Mbps HDX shared | 10 Mbps to edge switch (cgb3), 10 Mbps to doris
bronco001 | Sun Ultra 5 | Solaris 5.6 | 100 Mbps FDX switched | 100 Mbps to farm switch, 1 Gbps to core switch, 1 Gbps to core router, 1 Gbps to core switch, 100 Mbps to edge switch, 10 Mbps to doris
mailbox | Sun Ultra 5 | Solaris 5.6 | 100 Mbps FDX switched | 100 Mbps to server switch, 1 Gbps to core switch, 1 Gbps to core router, 1 Gbps to core switch, 100 Mbps to edge switch, 10 Mbps to doris
grouse | Sun Sparc 1+ | SunOS 4.1.3.1 | 10 Mbps HDX shared | 10 Mbps to edge switch, 100 Mbps to core switch, 1 Gbps to core router, 1 Gbps to core switch, 100 Mbps to edge switch, 10 Mbps to doris
A simple model to understand the median or minimum ping response times for an unloaded local area network and lightly loaded hosts is as follows:

- Ignore the hubs (a hub inserts about 1 bit-time of delay) and the cable lengths (for a site with cable runs of < 10,000 feet this should introduce an error of < 20 usec.).
- Assume the latency of each switch and router is about 15 usec. (this comes from Cisco specification sheets).
- Calculate the times to clock the 100 byte ping packet into each device at the interface speed.
- Measure the ping client host time by comparing the time reported by the ping client with the wire time immediately in front of the host (~210 usec.).
- Measure the server host time to echo the ping by measuring the wire times (going in and coming out) in front of the server (~100 usec. for the Ultra 5 (330 MHz) hosts, ~125 usec. for a Sun Ultra 1 (167 MHz), ~550 usec. for the Sun Sparc 1+ (25 MHz), and ~170 usec. for a Sun Sparcstation 5 (110 MHz)).

Putting this all together, for the hosts in the table the agreement between measured and predicted ping RTTs is within 60 usec.
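As a rough illustration, here is a minimal Python sketch of this model; the example path and the choice of which links count as store-and-forward serialization stages are assumptions of the sketch, not taken from the original analysis:

```python
# A minimal sketch of the RTT model above, using the figures quoted in the text.
PKT_BITS = 100 * 8        # 100-byte ping packet (link-layer framing ignored)
DEV_LATENCY = 15e-6       # per switch/router latency (Cisco spec sheets)
CLIENT_TIME = 210e-6      # ping client host time (measured on the wire)

def predicted_rtt(link_speeds_bps, server_time):
    """Serialize the packet onto each link, add per-device latency,
    then double for the round trip and add the two host times."""
    serialize = sum(PKT_BITS / s for s in link_speeds_bps)
    latency = DEV_LATENCY * len(link_speeds_bps)
    return 2 * (serialize + latency) + CLIENT_TIME + server_time

# Example: a path like grouse's (10M edge, 100M/1G core, back down to 10M),
# with ~550 usec. server time for the Sun Sparc 1+:
path = [10e6, 100e6, 1e9, 1e9, 100e6, 10e6]
print("predicted RTT: %.0f usec." % (predicted_rtt(path, 550e-6) * 1e6))
```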
In the subsections below we show some examples of the ping RTT history and also the frequency distributions. We do not attempt to explain the frequency distributions in any detail, but simply note that in all cases there is a large peak near the low end of the measured RTTs, followed by a long tail with some observable structure. The artificial regularity of every n-th bin having a higher or lower frequency above log10(RTT) = 1 is a binning effect of the logarithmic bin sizes interacting with the measurement granularity of the NIKHEF ping (1 usec. for RTT < 1 msec., 10 usec. for 1 msec. <= RTT < 10 msec., 100 usec. for 10 msec. <= RTT < 100 msec., and 1 msec. for RTT >= 100 msec.), and does not show up when using equally spaced linear RTT bins. The double peak in the frequency distribution for the two hosts on the same subnet is also a binning effect, and does not show up when using linear bin widths. On the other hand, by measuring the wire-time difference between packets entering and leaving the server, the double peak seen in the low-RTT "peak" of the distribution for the two hosts on the same shared hub is found to be caused by the ping server. For another example of a pathological RTT distribution caused by a ping server, see Pinger Measurement Pathologies.
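To see the binning artifact concretely, here is a minimal Python sketch using synthetic quantized RTTs; none of these numbers come from the measurements, they are chosen only to make the effect visible:

```python
import numpy as np

# Synthetic demonstration of the binning artifact: RTTs quantized to
# 100 usec. steps, histogrammed in logarithmic bins whose widths are
# comparable to the quantization step, alternate between bins that
# capture one quantized value and bins that capture two.
rtt = 10.0 + np.random.exponential(5.0, 200_000)  # synthetic RTTs, msec.
rtt = np.round(rtt, 1)                            # 100 usec. granularity
log_counts, _ = np.histogram(rtt, bins=np.logspace(1, 2, 200))
lin_counts, _ = np.histogram(rtt, bins=np.arange(10.0, 100.0, 0.5))
print(log_counts[:10])  # alternating high/low counts (the artifact)
print(lin_counts[:10])  # smooth with equal linear bin widths
```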
The distribution has a sharp peak with a median at 1.35 msec. and an Inter-Quartile Range (IQR) of 0.2 msec. There is also a high-RTT tail.
The third plot in this subsection shows the time variation of the ping RTT for
306,000 pings between the Linux host
and the SLAC Surveyor host.
The final plot in this subsection shows the frequency distribution of the ping RTTs between the Linux host and the SLAC Surveyor host. The blue line shows the cumulative distribution function (CDF). The data is binned into 3 different bin widths. The black dots are for bins with a width of 0.1 msec. and are for RTT < 1 msec. The magenta dots are for bin widths of 1 msec. and are for RTTs < 10 msec. The green dots have bin widths of 10 msec. and cover the entire range of data. The binned data is normalized by dividing the counts in the 1 msec. bins by 10 and the counts in the 10 msec. bins by 100. The black line is a simple power-series fit to the data between 2.3 msec. and 61 msec. inclusive.
The distribution exhibits a sharp peak with a median at 0.9 msec., an IQR of 0.06 msec., and a high-RTT tail. There are also secondary peaks at 2.4 msec. and 10 msec.
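A minimal sketch of the bin-width normalization used in such a plot, assuming synthetic data and bin boundaries like those described above:

```python
import numpy as np

# Dividing counts by the width ratio puts all three sets of points on a
# common per-0.1 msec. vertical scale (data here is synthetic).
rtt = 0.9 + np.abs(np.random.standard_cauchy(100_000))  # synthetic RTTs, msec.
fine, _ = np.histogram(rtt, bins=np.arange(0.0, 1.1, 0.1))    # 0.1 msec. bins
mid,  _ = np.histogram(rtt, bins=np.arange(1.0, 11.0, 1.0))   # 1 msec. bins
wide, _ = np.histogram(rtt, bins=np.arange(0.0, 210.0, 10.0)) # 10 msec. bins
mid  = mid / 10.0    # 1 msec. bins are 10x wider than the 0.1 msec. bins
wide = wide / 100.0  # 10 msec. bins are 100x wider
cdf = np.cumsum(np.histogram(rtt, bins=np.arange(0.0, 200.1, 0.1))[0]) / rtt.size
```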
The ping distribution for an extensive (500K samples) measurement between a host at SLAC (minos.slac.stanford.edu) and a host at ESnet at LBNL (hershey.es.net) is seen below, starting at 9:01am on April 23, 1999 and ending at 3:59am on April 29, 1999. The pings were separated by 1 second and the timeout was 20 seconds.
It can be seen that there is a narrow (IQR = 1 msec.) peak at 4 msec. with a very long tail extending out beyond 750 msec. The black line is a fit to a power series with the parameters shown.
If one plots this data on a log-log plot (see below), it can be seen that there are two time scales (4-18 msec. and 18-1000 msec.) with quite different behaviors. The bulk of the data (99.8%) falls in the 4-18 msec. region. In the 4-18 msec. region (the magenta points) the data falls off as y ~ A * RTT^-6.6, whereas beyond 18 msec. (the blue points) it falls off as y ~ B * RTT^-1.7. The parameters of the fits are shown in the chart. Note that in the 4-18 msec. region the data are histogrammed in 1 msec. bins, whereas beyond that they are histogrammed in 10 msec. bins, and the two y scales are adjusted appropriately (the one for the wider bins beyond 18 msec. is a factor of 10 greater than the other). The green points are not used in the fits; they are the data histogrammed in 1 msec. bins for the range 19 msec. to 55 msec.
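A minimal sketch of this kind of power-law fit, using synthetic heavy-tailed RTTs; fitting a straight line to log10(count) vs. log10(RTT) is an assumed method here, and the parameters quoted above are those from the original chart:

```python
import numpy as np

# The slope of a straight-line fit in log-log space is the power-law exponent.
def power_law_exponent(rtt_msec, lo, hi, bin_width):
    edges = np.arange(lo, hi + bin_width, bin_width)
    counts, _ = np.histogram(rtt_msec, bins=edges)
    centers = 0.5 * (edges[:-1] + edges[1:])
    keep = counts > 0                      # log10 of zero counts is undefined
    slope, _ = np.polyfit(np.log10(centers[keep]), np.log10(counts[keep]), 1)
    return slope                           # the fitted power-law exponent

rtt = 4.0 + np.random.pareto(1.7, 500_000)          # synthetic heavy-tailed RTTs
print(power_law_exponent(rtt, 18.0, 1000.0, 10.0))  # 10 msec. bins in the tail
```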
The power-law exponent in the region 4-18 msec. is that exhibited by very chaotic processes such as fully developed turbulence or the stock market, whereas the data beyond 18 msec. is more characteristic of heavy-tailed or long-range self-similar behavior. A guess is that the transition at about 20 msec. reflects a change from delays caused by simple queueing to delays caused by router processing; this needs more work to substantiate.
The autocorrelation function for the first 64,000 RTTs (there was no packet loss in this period) is shown below. It can be seen that in general there is a very weak (< 0.01) positive correlation for lags of less than 300 seconds. This weak correlation is present even for pings separated by only 1 second. The red horizontal lines are plotted at +-2/sqrt(64000) and indicate twice the standard error under the hypothesis of zero autocorrelation (if the true autocorrelation is zero, 95% of the autocorrelation values will lie within +-2/sqrt(64000)).
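For reference, a minimal Python sketch of the autocorrelation computation and the +-2/sqrt(N) band; the RTT series here is a synthetic stand-in for the 64,000 measured RTTs:

```python
import numpy as np

# Sample autocorrelation at lags 1..max_lag for a mean-removed series.
def autocorr(x, max_lag):
    x = np.asarray(x, dtype=float) - np.mean(x)
    var = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / var for k in range(1, max_lag + 1)])

rtt = np.random.lognormal(1.5, 0.3, 64_000)  # synthetic RTT series, msec.
ac = autocorr(rtt, 300)                      # lags of 1..300 (seconds, at 1 ping/s)
band = 2.0 / np.sqrt(rtt.size)               # the red horizontal lines
print((np.abs(ac) < band).mean())            # fraction within the 95% band
```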
The following quote is from "Nonlinear Time Series Analysis" by Kantz and
Schreiber.
Stochastic processes have decaying autocorrelations, but the rate of decay depends on the properties of the process. Autocorrelations of signals from deterministic chaotic systems decay exponentially with increasing lag. Autocorrelations are not characteristic enough to distinguish a random signal from a deterministic chaotic one.
The route from SLAC to hershey.es.net, as measured by traceroute, is:

```
traceroute to hershey.es.net (198.128.1.11), 30 hops max, 40 byte packets
 1  RTR-CORE1.SLAC.Stanford.EDU (134.79.199.2)  1 ms  1 ms  1 ms
 2  RTR-CGB6.SLAC.Stanford.EDU (134.79.135.6)  2 ms  1 ms  2 ms
 3  RTR-DMZ.SLAC.Stanford.EDU (134.79.111.4)  2 ms  2 ms  2 ms
 4  ESNET-A-GATEWAY.SLAC.Stanford.EDU (192.68.191.18)  2 ms  175 ms  212 ms
 5  lbl1-atms.es.net (134.55.24.11)  4 ms  4 ms  4 ms
 6  esnet-lbl.es.net (134.55.23.66)  4 ms  4 ms  4 ms
 7  hershey.es.net (198.128.1.11)  5 ms  4 ms  5 ms
```
The pathchar behavior between a host on the same subnet as minos (minos is an AIX host and pathchar does not run on it) and hershey is shown below:
```
>pathchar -q 64 hershey.es.net
pathchar to hershey.es.net (198.128.1.11)
 mtu limitted to 8192 bytes at local host
 doing 64 probes at each of 64 to 8192 by 260
 0 FLORA03.SLAC.Stanford.EDU (134.79.16.55)
 |    77 Mb/s,   462 us (1.77 ms)
 1 RTR-CGB5.SLAC.Stanford.EDU (134.79.19.3)
 |   294 Mb/s,   218 us (2.43 ms)
 2 RTR-CGB6.SLAC.Stanford.EDU (134.79.135.6)
 |    18 Mb/s,   276 us (6.53 ms)
 3 RTR-DMZ.SLAC.Stanford.EDU (134.79.111.4)
 |    ?? b/s,   -85 us (2.44 ms)
 4 ESNET-A-GATEWAY.SLAC.Stanford.EDU (192.68.191.18) -> 192.68.191.18 (1)
 |    ?? b/s,  1.42 ms (5.13 ms)
 5?lbl1-atms.es.net (134.55.24.11)
 |   245 Mb/s,    71 us (5.54 ms)
 6 esnet-lbl.es.net (134.55.23.66)
 |   9.7 Mb/s,    95 us (12.5 ms)
 7 hershey.es.net (198.128.1.11)
7 hops, rtt 4.91 ms (12.5 ms), bottleneck 9.7 Mb/s, pipe 42418 bytes
```
To better understand the behavior of ping Round Trip Time (RTT) in the WAN, we pinged CERN (ping.cern.ch) from SLAC (minos.slac.stanford.edu) every second with a timeout of 20 seconds, for 260K pings between 8:36am Sunday May 9 and 10:35am Wednesday May 12, 1999 (PDT). The packet loss for these measurements was about 0.053%.
The distribution of the RTT is seen in the chart below. The distribution shows a lot of structure. First there is a sharp peak at about 224 msec. whose width is 9.5 msec. (90% of the peak is contained within this width). On the high-RTT side of the peak several smaller peaks are seen, together with a long tail. If we look at the individual RTTs in the high-RTT tail beyond 260 msec., we get the chart shown below:
The clusters of points for Tuesday, May 11 also show up in the Surveyor data, as shown in the graphs below:
Of particular interest is the cluster around 18:00 hours on Tuesday, May 11. The ping RTT and loss data for this cluster are shown in the chart below. The loss is calculated by looking for missing ping sequence numbers. The routes are obtained from Surveyor measurements, which use traceroute to measure the routes about every 15 minutes.
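A minimal sketch of this loss calculation; the sequence numbers here are illustrative:

```python
# Losses show up as gaps in the echoed ICMP sequence numbers.
def count_losses(seq_numbers):
    lost, runs = 0, []
    for prev, cur in zip(seq_numbers, seq_numbers[1:]):
        gap = cur - prev - 1          # how many sequence numbers are missing
        if gap > 0:
            lost += gap
            runs.append(gap)          # consecutive-loss runs (e.g. 169, 36)
    return lost, runs

print(count_losses([1, 2, 3, 7, 8]))  # -> (3, [3]): sequences 4-6 were lost
```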
There is a clear change in behavior starting at about 18:10 hours and ending at about 19:20 hours. At the start of this period there is a loss of 169 consecutive ping packets (a break in connectivity of 169 seconds, since the pings are sent at one second intervals, while the network routing converges to a new route), and at the end a further loss of 36 consecutive ping packets.
Apart from this period the route (as measured by traceroute) to CERN runs from SLAC to ESnet to the New York Sprint NAP, then to West Orange in New Jersey, and thence back to Chicago to the STAR TAP and on to CERN. During the period from 18:10 hours to 19:20 hours, the route runs from SLAC to ESnet to BBN, going via New York and London to Geneva; this route is more congested (hence the increase in packet loss), but it avoids the trip back from New Jersey to Chicago and so saves about 30 msec. in the round trip. The complete routes can be seen below:
The ping RTT data for the cluster around 1:00am on May 11, 1999 can be seen in more detail in the chart below. In the chart it can be seen that there is a complete loss of connectivity (i.e. no pings responded) of about 14 minutes, lasting from about 1:07am until about 1:21am. After this, performance looks fairly normal. Prior to the loss of connectivity, there are periods of longer RTT (almost double) followed by shorter losses of connectivity. For CERN to SLAC, Surveyor shows a change from the normal route at 1:00am and 1:15am, returning to the normal route at 1:35am. For SLAC to CERN, Surveyor shows a change in route at 0:56am, returning to the normal route at the next measurement at 1:23am. The alternate routes differ only within the SLAC site. This cluster is coincident with problems occurring as a result of making changes to a core switch at SLAC.
The cluster around 7:15am on May 11, 1999, shown in more detail below, is actually 3 sudden changes in RTT from about 220 msec. to about 525 msec. and back after 1 to 2 minutes, with top-hat-shaped RTT peaks at about 7:14am to 7:16am, 7:19am to 7:20am, and 7:23am to 7:24am. Surveyor traceroute samples did not coincide with any of these peaks and saw no route changes. Only one packet was lost in the period shown below. The black line is a moving average over 10 seconds, inserted to help the eye discern the top-hat peaks.
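For reference, a minimal sketch of such a moving average; with one ping per second, a 10-sample window spans 10 seconds:

```python
import numpy as np

# 10-second moving average of a once-per-second RTT series.
def moving_average(rtt, window=10):
    kernel = np.ones(window) / window
    return np.convolve(rtt, kernel, mode="valid")
```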
Surveyor also does not indicate any route changes for the clusters around 14:00 hours on May 11, 1999, or around 15:00 hours or 18:30 hours on May 10, 1999.
The pathchar information for the normal path from SLAC to CERN is
shown below:
```
>pathchar -q 64 ping.cern.ch
pathchar to dxcoms.cern.ch (137.138.28.176)
 mtu limitted to 8192 bytes at local host
 doing 64 probes at each of 64 to 8192 by 260
 0 FLORA03.SLAC.Stanford.EDU (134.79.16.55)
 |   162 Mb/s,   369 us (1.14 ms)
 1 RTR-CORE1.SLAC.Stanford.EDU (134.79.19.2)
 |   115 Mb/s,   281 us (2.28 ms)
 2 RTR-CGB6.SLAC.Stanford.EDU (134.79.159.12)
 |    19 Mb/s,   242 us (6.29 ms)
 3 RTR-DMZ.SLAC.Stanford.EDU (134.79.111.4)
 |    ?? b/s,  -100 us (2.29 ms)
 4 ESNET-A-GATEWAY.SLAC.Stanford.EDU (192.68.191.18) -> 192.68.191.18 (1)
 |    ?? b/s,  31.1 ms (64.4 ms)
 5?nynap1-atms.es.net (134.55.24.9)
 |   914 Mb/s,   118 us (64.7 ms)
 6 1-sprint-nap.cw.net (192.157.69.11) -> 192.157.69.11 (1)
 |  1997 Mb/s,  1.72 ms (68.2 ms)
 7?core4-hssi6-0-0.WestOrange.cw.net (204.70.10.225)
 |   591 Mb/s,  9.52 ms (87.4 ms)
 8 bordercore4.WillowSprings.cw.net (166.48.34.1) -> 166.48.34.1 (2)
 |    86 Mb/s,  1.13 ms (90.4 ms)
 9?cern-cwe.WillowSprings.cw.net (166.48.34.6) -> 166.48.34.6 (3)
 |   130 Mb/s,  59.9 ms (211 ms)
10?cernh9-ar1-chicago.cern.ch (192.65.184.166) -> 192.65.184.166 (2)
 |    ?? b/s,   356 us (211 ms)
11?cgate2.cern.ch (192.65.185.1)
 |  2634 Mb/s,   135 us (211 ms)
12 cgate1-dmz.cern.ch (192.65.184.65) -> 192.65.184.65 (3)
 |   551 Mb/s,   327 us (212 ms)
13?r513-c-rci47-15-gb0.cern.ch (128.141.211.41) -> 128.141.211.41 (1)
 |    15 Mb/s,  -225 us (216 ms)
14?dxcoms.cern.ch (137.138.28.176)
14 hops, rtt 210 ms (216 ms), bottleneck 15 Mb/s, pipe 425545 bytes
```