Network connectivity problems reporting via BWE

Jiri Navratil. Page created: September 21 2002.

Introduction

This case study shows how network problems can be discovered through continuous monitoring of available bandwidth (ABW). Our experimental monitor runs 24 hours a day and uses packet-pair dispersion techniques to estimate ABW on selected paths. Experiments show that such a monitor needs very few packets to assess the situation: each of our measurements uses only 20 packet-pair probes, so it generates little traffic and is effectively non-intrusive. The monitor is currently set to make one measurement to each host every 60 seconds. Examples from the latest version (with better filtering and a floating average) are in the document monitoring-results.html.
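As a rough illustration of the packet-pair idea, the sketch below applies only the textbook relation (rate = packet size / time gap between the two packets of a pair) and takes the median over the probes. The function name, the 1500-byte packet size and the use of a median are assumptions made for this example; the monitor's actual ABW algorithm and filtering are not described on this page.

```python
from statistics import median


def packet_pair_estimate(dispersions_s, packet_size_bytes=1500):
    """Return a bandwidth estimate in Mbps from packet-pair dispersions.

    dispersions_s: receiver-side time gaps (seconds) between the two
    packets of each probe pair. Non-positive gaps are ignored.
    packet_size_bytes: assumed probe size (hypothetical default).
    """
    rates_mbps = [
        packet_size_bytes * 8 / gap / 1e6
        for gap in dispersions_s
        if gap > 0
    ]
    if not rates_mbps:
        raise ValueError("no usable packet pairs")
    # The median is one simple way to suppress outliers caused by
    # cross traffic or timer noise; it is an assumption of this sketch.
    return median(rates_mbps)


# Twenty pairs, each arriving 190 microseconds apart (illustrative values).
estimate = packet_pair_estimate([190e-6] * 20)
```

With 1500-byte packets, a 190 microsecond gap corresponds to roughly 63 Mbps, which is the order of magnitude seen on the SLAC to Daresbury path in normal conditions.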

KPNQwest case - July 24 2002

The following paragraphs describe the situation on July 24, 2002, when KPNQwest stopped its network operations in the UK. Our monitor detected very dramatic ABW changes. Because the drop in ABW was so unusual, we immediately analysed the traceroutes to our UK sites and discovered the routing changes. Several hours later we received information from the GEANT staff about the changes in the UK. The whole situation is documented below.

Normally, the average ABW between SLAC and Daresbury Laboratory was about 65 Mbps. When KPNQwest stopped operating, routing in GEANT automatically switched to a backup path (a commercial ISP, ALTER.NET), but the backup links did not have the capacity of the original path, and at that moment our ABW dropped to about 10 Mbps. Later, during the afternoon and the following night, we saw several other changes that were less dramatic. It was nevertheless interesting to see that our monitoring system can also distinguish very small changes such as "flipping interfaces" or "level-balancing", which is what the second case looks like.

The original path (the "normal situation") from SLAC to Daresbury went via ESnet and the GEANT node in the UK. The following traceroute was taken just a few minutes before the changes happened:

Wed Jul 24  8:25:14 2002 (stamp 1027524314)

traceroute to rtlin1.dl.ac.uk
[1] RTR-CORE1A.SLAC.Stanford.EDU,134.79.143.2,0,0
[2] RTR-DMZ1-GER.SLAC.Stanford.EDU,134.79.135.15,0,0
[3] 192.68.191.146,192.68.191.146,0,0
[4] snv-pos-slac.es.net,134.55.209.1,0,14
[5] chi-s-snv.es.net,134.55.205.102,0,59
[6] nyc-s-chi.es.net,134.55.205.105,0,73
[7] 62.40.126.5,62.40.126.5,0,67
[8] 62.40.126.14,62.40.126.14,0,138
[9] janet-gw.uk1.uk.geant.net,62.40.103.150,0,149
[10] 146.97.37.81,146.97.37.81,0,137
[11] po6-0.read-scr.ja.net,146.97.35.133,0,138
[12] po3-0.warr-scr.ja.net,146.97.33.54,0,142
[13] po0-0.manchester-bar.ja.net,146.97.35.46,0,143
[14] 146.97.40.178,146.97.40.178,0,143
[15] 194.66.25.30,194.66.25.30,0,144
[16] gw-fw.dl.ac.uk,193.63.74.233,0,143
[17] rtlin1.dl.ac.uk,193.62.119.20,0,144
[18] rtlin1.dl.ac.uk,193.62.119.20,0,145

After the changes the path was as follows:

Wed Jul 24  8:45:13 2002 (stamp 1027525513)

traceroute to rtlin1.dl.ac.uk
[1] RTR-CORE1A.SLAC.Stanford.EDU,134.79.143.2,0,0
[2] RTR-DMZ1-GER.SLAC.Stanford.EDU,134.79.135.15,0,0
[3] 192.68.191.146,192.68.191.146,0,0
[4] snv-pos-slac.es.net,134.55.209.1,0,22
[5] orn-s-snv.es.net,134.55.205.121,0,65
[6] dchub-orn.es.net,134.55.209.18,0,85
[7] 198.124.192.21,198.124.192.21,100,85
[8] 0.so-3-1-0.XL1.DCA6.ALTER.NET,152.63.38.118,0,130
[9] 0.so-0-0-0.TL1.DCA6.ALTER.NET,152.63.38.69,0,130
[10] 0.so-7-0-0.IL1.DCA6.ALTER.NET,152.63.9.193,0,106
[11] so-0-0-0.IR1.DCA4.Alter.Net,146.188.13.34,0,202
[12] so-6-1-0.TR2.LND9.Alter.Net,146.188.4.82,0,257
[13] so-6-0-0.XR1.LND9.Alter.Net,146.188.15.42,0,291
[14] pos1-0.gw1.lnd9.alter.net,158.43.150.142,0,257
[15] ukerna-gw.pipex.net,158.43.37.202,0,256
[16] po15-0.lond-scr.ja.net,146.97.35.137,0,256
[17] po4-0.read-scr.ja.net,146.97.33.74,0,244
[18] po3-0.warr-scr.ja.net,146.97.33.54,0,258
[19] po0-0.manchester-bar.ja.net,146.97.35.46,0,243
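Comparing two such traceroutes by hand is error-prone, so a small script can point to the first hop where they differ. The sketch below parses lines of the form shown above; the function names are hypothetical, the fourth comma-separated field is assumed to be a round-trip time in ms, and the third field is kept without interpreting it.

```python
def parse_hops(text):
    """Parse lines like '[4] snv-pos-slac.es.net,134.55.209.1,0,14'."""
    hops = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line.startswith("["):
            continue  # skip headers, blank lines and "..." markers
        index, rest = line[1:].split("]", 1)
        name, ip, third_field, rtt = rest.strip().split(",")
        # third_field is not interpreted; rtt is assumed to be in ms.
        hops.append((int(index), name, ip, third_field, int(rtt)))
    return hops


def first_divergence(before, after):
    """Return (hop number, old name, new name) of the first differing hop."""
    for hop_a, hop_b in zip(before, after):
        if hop_a[2] != hop_b[2]:  # compare IP addresses
            return hop_a[0], hop_a[1], hop_b[1]
    return None


# Hops 4 to 6 of the two traceroutes above.
BEFORE = """
[4] snv-pos-slac.es.net,134.55.209.1,0,14
[5] chi-s-snv.es.net,134.55.205.102,0,59
[6] nyc-s-chi.es.net,134.55.205.105,0,73
"""
AFTER = """
[4] snv-pos-slac.es.net,134.55.209.1,0,22
[5] orn-s-snv.es.net,134.55.205.121,0,65
[6] dchub-orn.es.net,134.55.209.18,0,85
"""

divergence = first_divergence(parse_hops(BEFORE), parse_hops(AFTER))
```

For these two traces the paths separate at hop 5, where chi-s-snv.es.net is replaced by orn-s-snv.es.net.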

The picture shows the moment when the path to the UK changed its characteristics dramatically (violet in the picture). The log shows more detailed data:

timestamp   abw      
...
1027524541  62.280 
1027524631  63.158 
1027524721  62.490 
1027524810 109.474 
1027524900  61.962 
1027524990   6.344 
1027525080  10.493 
1027525170  10.770 
1027525259  10.740 
1027525349  10.639 
...
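A drop like this one can also be flagged automatically. The following is a minimal sketch, not the monitor's own logic: it compares each sample with the median of the few samples before it and reports the first one that falls below half of that baseline. The function name, the window of 3 samples and the ratio of 0.5 are assumptions chosen for this example.

```python
from statistics import median


def first_drop(samples, window=3, ratio=0.5):
    """Return (timestamp, baseline, value) of the first sudden drop.

    samples: list of (timestamp, abw_mbps) in time order.
    A drop is a value below ratio * median of the previous `window`
    values. Returns None if no such sample exists.
    """
    for i in range(window, len(samples)):
        baseline = median(value for _, value in samples[i - window:i])
        timestamp, value = samples[i]
        if value < ratio * baseline:
            return timestamp, baseline, value
    return None


# The log excerpt shown above.
LOG = [
    (1027524541, 62.280),
    (1027524631, 63.158),
    (1027524721, 62.490),
    (1027524810, 109.474),
    (1027524900, 61.962),
    (1027524990, 6.344),
    (1027525080, 10.493),
    (1027525170, 10.770),
    (1027525259, 10.740),
    (1027525349, 10.639),
]

drop = first_drop(LOG)
```

Using a median for the baseline means the isolated 109.474 reading does not distort it, and the sample at timestamp 1027524990 is the first one reported.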

The second case shows the rather unstable situation during the afternoon and the night, probably a consequence of the work of the network engineers who were trying to fix the problem. One segment of the path (router LND9.Alter.Net) probably used two different interfaces to connect to its neighbours (in our traceroutes reported once as 146.188.13.34 and once as so-6-1-0.TR2.LND9.Alter.Net).

path1 
....
[8] 152.63.38.118,152.63.38.118,0,86 
[9] 0.so-0-0-0.TL1.DCA6.ALTER.NET,152.63.38.69,0,80 
[10] 0.so-7-0-0.IL1.DCA6.ALTER.NET,152.63.9.193,0,82 
[11] 146.188.13.34,146.188.13.34,0,87 
[12] so-6-1-0.TR2.LND9.Alter.Net,146.188.4.82,0,169 
[13] so-6-0-0.XR1.LND9.Alter.Net,146.188.15.42,0,165 
[14] pos1-0.gw1.lnd9.alter.net,158.43.150.142,0,168 
...
or path2
...
[8] 152.63.38.118,152.63.38.118,0,80 
[9] 0.so-0-0-0.TL1.DCA6.ALTER.NET,152.63.38.69,0,85 
[10] 0.so-7-0-0.IL1.DCA6.ALTER.NET,152.63.9.193,0,81 
[11] 146.188.13.34,146.188.13.34,0,90 
[12] so-6-1-0.TR2.LND9.Alter.Net,146.188.4.82,0,166 
[13] 146.188.15.42,146.188.15.42,0,258 
[14] pos1-0.gw1.lnd9.alter.net,158.43.150.142,0,275 

In practice, this meant that ABW alternated between levels of about 6, 10 and 15 Mbps, each held for a clearly visible period of time.

timestamp   abw      
...
1027582547  10.752 
1027582640  10.541
1027582731  10.778 
1027582824  10.493
1027583006   6.293
1027583097   6.000
1027583188   6.482
1027583279   6.370
...

The picture shows this situation between 15:00 and 16:00.
The whole situation stabilized on Thu Jul 25 11:00:30 US/Pacific 2002.
 
...
1027615543  10.690 8 25  10.726  10.991 rtlin1
1027615635  10.558 7 25  10.717  10.984 rtlin1
1027615727  10.535 10 25  10.707  10.977 rtlin1
1027615838  10.592 10 25  10.701  10.971 rtlin1
1027615930  10.561 11 25  10.694  10.964 rtlin1
1027620030  65.561 13 25  27.154  11.859 rtlin1
1027620123  61.307 16 25  37.400  12.670 rtlin1
1027620215  62.866 16 25  45.040  13.493 rtlin1
1027620308  64.923 14 25  51.005  14.336 rtlin1
1027620400  62.293 17 25  54.391  15.122 rtlin1
1027620491  60.719 18 25  56.289  15.869 rtlin1
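The first column of the log is a Unix timestamp, so it can be checked against the wall-clock times quoted in the text. The sketch below is one way to do the conversion; it assumes a fixed UTC-7 offset (Pacific Daylight Time, which applied in July and September 2002), and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset for Pacific Daylight Time; an assumption that is valid
# for the summer 2002 dates on this page, not for winter dates.
PDT = timezone(timedelta(hours=-7), "PDT")


def stamp_to_pacific(stamp):
    """Format a Unix timestamp as Pacific Daylight Time."""
    return datetime.fromtimestamp(stamp, tz=PDT).strftime("%a %b %d %H:%M:%S %Y")


# First log line after the recovery in the excerpt above.
recovery_time = stamp_to_pacific(1027620030)
```

For timestamp 1027620030 this gives Thu Jul 25 11:00:30 2002, which agrees with the stabilization time stated above.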

A new route was set up within the GEANT network:

[1] RTR-CORE1A.SLAC.Stanford.EDU,134.79.143.2,0,0
[2] RTR-DMZ1-GER.SLAC.Stanford.EDU,134.79.135.15,0,0
[3] 192.68.191.146,192.68.191.146,0,0
[4] snv-pos-slac.es.net,134.55.209.1,0,11
[5] chi-s-snv.es.net,134.55.205.102,0,60
[6] nyc-s-chi.es.net,134.55.205.105,0,88
[7] abilene-nyc.es.net,198.124.216.106,0,66
[8] abilene-gtren.de2.de.geant.net,62.40.103.253,0,155
[9] de2-1.de1.de.geant.net,62.40.96.129,0,146
[10] de.fr1.fr.geant.net,62.40.96.50,0,157
[11] fr.uk1.uk.geant.net,62.40.96.90,0,162
[12] janet-gw.uk1.uk.geant.net,62.40.103.150,0,160
[13] 146.97.37.81,146.97.37.81,0,162
[14] po6-0.read-scr.ja.net,146.97.35.133,0,164
[15] po3-0.warr-scr.ja.net,146.97.33.54,0,167
[16] po0-0.manchester-bar.ja.net,146.97.35.46,0,167
[17] 146.97.40.178,146.97.40.178,0,168
[18] 194.66.25.30,194.66.25.30,0,169
[19] gw-fw.dl.ac.uk,193.63.74.233,0,169
[20] rtlin1.dl.ac.uk,193.62.119.20,0,170
[21] rtlin1.dl.ac.uk,193.62.119.20,0,170

The path is much longer (see hops [9], [10] and [11]), but since then the bandwidth has again been stable at about 61-64 Mbps.

ANL case - September 17 2002

Situations similar to the one described above happen quite often. The last time I saw one was on September 17th, when ANL made changes to its infrastructure. The situation is shown in the following picture. The change happens suddenly at 15:59, and the cascaded drop in the picture is caused by the floating average (as is the cascaded change at the end of the problem at 18:00). There is also a display limit for data over 90 Mbps, which is why the data between 16:00 and 18:00 are not visible on the graph.
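To see why a floating average turns a sudden change into a cascade of steps, consider the sketch below. It uses a simple exponentially weighted average with a weight of 0.3; both the choice of this filter and the weight are assumptions for illustration, since the monitor's actual filter is not specified on this page. The input values are the first six ABW readings of the ANL log.

```python
def floating_average(values, alpha=0.3):
    """Exponentially weighted average; alpha is an assumed weight."""
    smoothed = []
    current = None
    for value in values:
        if current is None:
            current = value  # start from the first sample
        else:
            current = current + alpha * (value - current)
        smoothed.append(current)
    return smoothed


# ABW readings (Mbps) around the start of the ANL problem.
ABW = [211.321, 379.661, 342.857, 23.133, 24.050, 22.958]

smoothed = floating_average(ABW)
```

Although the raw readings fall to about 23 Mbps within a single sample, the smoothed series approaches that level only gradually, one step per measurement, which is the cascaded shape seen in the graph.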

The real data from the monitor show the following values:

timestamp  abw (Mbps)
1032302992 211.321
1032303107 379.661
1032303223 342.857
1032303339  23.133
1032303455  24.050
1032303571  22.958
...
1032308205  23.688
1032308321  23.440
1032308437 274.286
1032308553 231.818
1032308668 194.595
1032308784 248.780
...

During the problem period the traceroute was as follows:

traceroute to wiggum.mcs.anl.gov (140.221.11.99), 30 hops max, 38 byte packets
 1  rtr-core1-pub6 (134.79.27.2)  52.475 ms  159.393 ms  144.713 ms
 2  rtr-dmz1-ger (134.79.135.15)  126.598 ms  128.322 ms  66.988 ms
 3  slac-rt4.es.net (192.68.191.146)  63.108 ms  133.659 ms  149.581 ms
 4  snv-pos-slac.es.net (134.55.209.1)  171.181 ms  140.351 ms  58.375 ms
 5  chi-s-snv.es.net (134.55.205.102)  134.145 ms  167.378 ms  172.571 ms
 6  198.125.140.162 (198.125.140.162)  113.928 ms  215.703 ms  269.653 ms
 7  140.221.20.124 (140.221.20.124)  188.569 ms  80.969 ms  59.247 ms
 8  wiggum.mcs.anl.gov (140.221.11.99)  50.143 ms  54.176 ms  134.031 ms

After the changes were finished and normal traffic was seen again, the traceroute returned to the original:

traceroute to wiggum.mcs.anl.gov (140.221.11.99), 30 hops max, 38 byte packets
 1  rtr-core1-pub6 (134.79.27.2)  79.794 ms  42.669 ms  47.118 ms
 2  rtr-dmz1-ger (134.79.135.15)  35.323 ms  0.421 ms  0.399 ms
 3  slac-rt4.es.net (192.68.191.146)  0.499 ms  0.406 ms  0.398 ms
 4  snv-pos-slac.es.net (134.55.209.1)  0.785 ms  0.762 ms  0.759 ms
 5  chi-s-snv.es.net (134.55.205.102)  48.599 ms  48.587 ms  50.995 ms
 6  anl-chi.es.net (134.55.208.42)  62.907 ms  65.672 ms  200.640 ms
 7  kiwi-esnet.anchor.anl.gov (192.5.170.77)  131.487 ms  168.004 ms 139.712 ms
 8  stardust-guava.anchor.anl.gov (130.202.222.73)  110.705 ms  209.990 ms 146.158 ms
 9  wiggum.mcs.anl.gov (140.221.11.99)  158.669 ms  96.055 ms  127.274 ms

So the changes were only between chi-s-snv.es.net and the destination node.

Page owner: Les Cottrell