DOE Office of Science Notice 01-06

Collaboratory Pilot: High Performance Networks

 

Title of Proposed Project:

Active Measurement Infrastructure for ESnet (AMIE)

 

Principal Investigator:

Les Cottrell, (650)926-2523, FAX (650)926-3329, <cottrell@slac.stanford.edu>

Stanford Linear Accelerator Center (SLAC), MS97, 2575 Sand Hill Rd., Menlo Park, California 94025
Key members of team:

Warren Matthews + student, SLAC

Vern Paxson, LBNL/ACIRI

Rich Wolski, University of Tennessee at Knoxville

Linda Winkler, Bill Nickless, Argonne National Laboratory

Andrew Adams, PSC

 

Submitted to:

High Performance Networks Research Program, Mathematical, Information and Computational Sciences Division,

Office of Advanced Scientific Research, U. S. Department of Energy, 19901 Germantown Rd, Germantown, MD 20874-1207.

 


Summary description of proposed research

 

We propose to deploy and extend a second instance of the National Internet Measurement Infrastructure (NIMI) to enable active end-to-end performance measurements of paths between ESnet and ESnet collaborator sites with high performance network links, including extremely high performance links such as those provided by the National Transparent Optical Network (NTON). The project will procure, install, and deploy the NIMI probes; gather, archive and analyze the data; and make the results publicly available via the web. The analysis and reporting will leverage and extend the tools developed for the PingER and Beacon projects, add new tools, and also provide selected data to the Network Weather Service to assist with predicting future performance. We also plan to extend the measurements to IPv6 networks, and to evaluate the impacts of Quality of Service (QoS).

 

A comprehensive network monitoring architecture and its implementation are essential to collaboratory and Grid systems. This comprehensive system includes passive monitoring, several types of active monitoring, and a consistent data management approach to catalogue and make available the results of the monitoring. This proposal addresses the active monitoring. For the final submission, it will be combined with the proposals for the procurement and deployment of an operational passive monitoring facility for ESnet and improvements to the IP Multicast infrastructure submitted by LBNL, ORNL, UTK and ANL.

 

 

1.      Statement of importance - identification of problem or opportunity, or situation being addressed

 

The extraordinary network challenges presented by scientists, researchers and in particular high energy nuclear and particle physics (HENP) experiments have created a critical need for network monitoring to understand present performance, set expectations, troubleshoot, and allocate resources to optimize/improve performance. While an infrastructure to provide advanced IP based data transport services is largely in place in the North American, Western European and Japanese research and education communities, there is currently no well defined, always on, systematic and automated approach to characterizing the QoS parameters of all the components involved in data transport services from a source to a destination. There are several existing Internet active end-to-end measurement projects in place today, including AMP, NIMI, PingER, RIPE, skitter, Surveyor, Beacon and the Network Weather Service (NWS). A comparison of some of the projects indicates that most restrict themselves to measuring delays (or round trip times (RTT)), losses and routes. NIMI and the NWS, on the other hand, are mainly envisioned as infrastructures that enable monitoring of both delays and throughput. In addition, the NWS is capable of generating real-time forecasts of future performance levels. The need for throughput measurements is increasing dramatically to support the emerging needs for high throughput by data intensive science applications such as the Particle Physics Data Grid (PPDG), or data replication, remote backup and archiving, and content distribution such as video streaming. New services such as QoS, and new applications such as interactive voice over IP (VoIP), experiment control, multimedia and multicast applications, are increasing the need for new types of measurements, including the effects of applying QoS and metrics such as jitter, continuous availability and multicast performance.

 

The existing DoE/MICS sponsored PingER project provides RTT, loss, and reachability information for over 3000 pairs of hosts in over 70 countries with data going back more than 5 years. The current proposal should be regarded as complementary to the PingER project in that this proposal focuses on ESnet and ESnet collaborator sites with high performance connectivity. For such sites, more intensive monitoring than the low impact PingER monitoring provides is both possible (due to the high performance links available) and needed (to characterize today’s high performance applications).

 

The existing NLANR Beacon project provides real-time visualization of multicast reachability, loss, delay, and jitter, but it does not provide any history mechanism for seeing how these variables evolve over time.

 

2.      Explanation of methodology and equipment needs

 

Our proposal has three main goals:

a.        the procurement and deployment of an operational NIMI-based monitoring facility for ESnet,

b.       the enhancement of NIMI monitoring capabilities to add new measurements, or extend existing ones, of vital network performance characteristics such as jitter, bottleneck bandwidth estimation, and multicast performance, and

c.        the provision of forecasting capabilities within the monitoring framework.

The resulting infrastructure will serve both as an invaluable resource enabling the development of new HENP applications and as a network research tool of unmatched scope and capability.

 

We propose to deploy an active Internet measurement infrastructure based on the National Laboratory for Applied Network Research (NLANR) NIMI. NIMI is based on a collection of measurement probes that cooperatively measure the properties of Internet paths and clouds by exchanging test traffic amongst themselves. It provides: decentralized control of measurements; strong authentication and security; mechanisms both for maintaining tight administrative control over who can perform what measurements using which probes and for delegating some forms of measurements; and simple configuration and maintenance of probes. The ESnet NIMI infrastructure will be administered separately from the existing DARPA-funded NIMI infrastructure. It should, however, be possible to link the two infrastructures, as one of NIMI’s basic design goals was to support administratively heterogeneous infrastructures. Vern Paxson, a key architect of the DARPA NIMI project, is collaborating on and advising the current project, and Andrew Adams is a major NIMI developer, so we expect on-going, fruitful interactions between the two projects. The NIMI probes will be deployed at ESnet and major ESnet collaborator sites. The probes will be centrally purchased and configured and, as far as possible, will be replicas of each other.

 

The probes will start with existing NIMI tools to measure round-trip times, loss, reordering, unicast and multicast, traceroutes, TCP throughput and bulk transfer capacity, ftp transfer and fetching web pages. They will also contain a packet filter that will look at only the probe’s traffic. We will look at extending/enhancing these tools and also adding tools to measure inter packet delay variability and/or jitter, reordering and pathologies such as duplicate packets. We expect to integrate the Beacon functionality into the NIMI framework, and extend the deployment to include some IPv6 paths and paths that support QoS. We will also work with ESnet staff to investigate the use of NIMI to look at link utilization at site border routers by accessing the SNMP MIBs in read only mode. On-demand measurements of bulk throughput capability will also be added to help understand and troubleshoot bulk throughput applications, and to validate simpler, more lightweight bulk-throughput estimators (e.g. simulation). The architecture allows new measurement suites to be added without modification of the NIMI daemon (nimid).
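
As a minimal sketch of the kind of per-stream analysis mentioned above, the following Python fragment counts duplicate and reordered packets and computes inter packet delay variation from a list of probe arrivals. The function name, the input format (sequence number plus one-way delay in milliseconds) and the specific definitions used (a packet counts as reordered if it arrives after a higher sequence number; delay variation is the delay difference between consecutively sent packets) are assumptions chosen for illustration, not the behavior of any existing NIMI tool.

```python
from statistics import mean


def summarize_probe_stream(arrivals):
    """Summarize one probe stream.

    arrivals: list of (sequence_number, one_way_delay_ms) tuples in the
    order the packets arrived at the receiver (hypothetical input format).
    """
    seen = set()
    duplicates = 0
    reordered = 0
    highest_seq = -1
    delay_by_seq = {}

    for seq, delay in arrivals:
        if seq in seen:
            # A second copy of a packet we already received.
            duplicates += 1
            continue
        seen.add(seq)
        if seq < highest_seq:
            # Arrived after a packet that was sent later.
            reordered += 1
        else:
            highest_seq = seq
        delay_by_seq[seq] = delay

    # Delay variation between consecutively *sent* packets; pairs with a
    # lost member are skipped.
    ipdv = [
        delay_by_seq[s + 1] - delay_by_seq[s]
        for s in sorted(delay_by_seq)
        if s + 1 in delay_by_seq
    ]
    return {
        "duplicates": duplicates,
        "reordered": reordered,
        "ipdv_ms": ipdv,
        "mean_abs_ipdv_ms": mean(abs(d) for d in ipdv) if ipdv else 0.0,
    }
```

For example, a stream arriving as sequence numbers 0, 1, 3, 2, 2, 4 would be reported as containing one reordered packet and one duplicate. Other definitions of reordering and jitter exist, and a production tool would need to choose among them deliberately.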

 

There will be a central archive machine that will retrieve and store the measurements in a file system. The archive machine can also host some default data analysis clients/applications; however, in the interests of scalability and simplicity we may separate this task out to other hosts. Reports from the applications will be made available via the web, so there will also be a web site host with powerful tools to enable a user to navigate to the information of interest.

 

Web accessible reports will include tabular time series similar to those provided by PingER. They will allow user selection of metric (e.g. delay, loss, jitter, throughput, reachability), time scales (both the time separation of adjacent points and the window in time being reported on), and paths (both the source and destination, and grouping by affinities such as collaboration, geographical region, or Internet Service Provider (ISP)). The user will also be able to sort the data by simply clicking on column headings. In addition, the reports will provide user drill down to display time series plots and frequency histogram details for individual paths and groups of paths. The tabular data will also be exportable to applications such as Excel to enable customized analysis and reports to be generated by interested users. Traceroute history information will also be made available.
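
As a rough illustration of the tabular reports described above, the following Python sketch buckets measurements into fixed time intervals and groups paths by an affinity label. The function name, the sample tuple layout and the choice of the median as the per-cell summary are assumptions made for illustration; they are not taken from the PingER or NIMI software.

```python
from collections import defaultdict
from statistics import median


def tabulate(samples, group_of, bucket_seconds):
    """Build a {group: {bucket_start: median_value}} table.

    samples: iterable of (unix_time, source, destination, value) tuples
             (hypothetical layout).
    group_of: function mapping (source, destination) to a group label,
              e.g. a collaboration, region or ISP.
    bucket_seconds: time separation between adjacent table columns.
    """
    cells = defaultdict(lambda: defaultdict(list))
    for t, src, dst, value in samples:
        # Round the timestamp down to the start of its bucket.
        bucket = int(t // bucket_seconds) * bucket_seconds
        cells[group_of(src, dst)][bucket].append(value)

    return {
        group: {b: median(values) for b, values in sorted(buckets.items())}
        for group, buckets in cells.items()
    }
```

Calling `tabulate(samples, group_of, 3600)` with a `group_of` function that returns, say, the destination's country would give one row per country and one column per hour. Changing `bucket_seconds` changes the time separation of adjacent points, and filtering `samples` by timestamp before the call selects the reporting window.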

 

We plan to provide selected data to the Network Weather Service for predicting future performance. While network monitoring alone is critical to network capacity-planning and diagnostic activities, if HENP applications are to use the resulting data for scheduling, a forecast of future performance levels is required.  Performance can change rapidly with time, so the performance system must be able to develop predictions of future performance levels.  As such, we plan to provide an interface between NIMI performance monitoring facilities and the NWS.  We will use this interface both to study the problem of real-time throughput forecasting from network-level performance measurements, and to support HENP application scheduling.
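
To make the idea of forecasting from a measurement series concrete, here is a minimal Python sketch that runs a few simple predictors over a throughput history and reports the forecast of whichever one has accumulated the smallest absolute error so far. The predictor set, the window size and all names are illustrative assumptions; this is not a description of the NWS internals or of the interface we will build.

```python
from statistics import mean, median


def last_value(history):
    return history[-1]


def running_mean(history):
    return mean(history)


def window_median(history, n=5):
    return median(history[-n:])


# Hypothetical predictor set; a real system could use many more.
PREDICTORS = {
    "last_value": last_value,
    "running_mean": running_mean,
    "window_median": window_median,
}


def forecast(series):
    """Return (best_predictor_name, forecast_of_next_value)."""
    if not series:
        raise ValueError("need at least one measurement")

    # Replay the series: at each step every predictor forecasts the next
    # measurement from the history seen so far, and its error is recorded.
    errors = {name: 0.0 for name in PREDICTORS}
    for i in range(1, len(series)):
        history, actual = series[:i], series[i]
        for name, predict in PREDICTORS.items():
            errors[name] += abs(predict(history) - actual)

    best = min(errors, key=errors.get)
    return best, PREDICTORS[best](series)
```

On a series that is steady apart from one spike, the windowed median tends to win because it ignores the outlier; on a steadily rising series, the last-value predictor tends to win. Selecting among predictors by their track record avoids committing to a single model of network behavior in advance.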

 

We also plan to extend the current project to instrument the National Transparent Optical Network (NTON) testbed with our tools, to ensure they scale to next generation high speed networks and to assist with understanding NTON performance.

 

3. Anticipated results

 

The current proposal extends the highly successful PingER project, in particular by providing more detailed information (more frequent measurements as well as more metrics) between critical sites with high performance network connections, and by an increased focus on ESnet sites with high performance networks and sites with strong collaborative requirements with ESnet sites. It will also provide an alternative measurement technique for paths where ICMP may be restricted and/or a way to validate whether ICMP rate limiting is being used on a path. At the same time it will leverage the analysis and reporting facilities of PingER.

 

The proposal also builds on the NIMI infrastructure architecture that is now successfully deployed at about 45 sites (including two of the proposer sites: SLAC & LBNL). It will add more frequent and new measurements, in particular of bulk throughput performance, inter packet delay variability, and border link utilization. A major contribution of the current proposal to NIMI will be to extend the NIMI analysis and reporting by leveraging and extending the tools in use in PingER to provide public web access to tabular time series, with selection of metrics, time scales, path groupings, and drill down to more details and graphs. We will also work closely with the NIMI developers to follow their developments, to provide feedback on existing or required features, and to provide new measurement suites and analyses.

 

Results will be available publicly via the web. They will: provide information that can be passed between disciplines (e.g. from user to site network engineer to ISP NOC); enable network engineers to make realistic promises, and network and application users to have realistic expectations of performance for existing and new applications; provide troubleshooting assistance by identifying when changes occurred, what the changes were, and what the impact of the changes may have been; assist in verification that expected levels of service are being met; provide input to setting and verifying service level agreements (SLAs); help decide which path to use when more than one is available; assist in deciding where to locate a remote computing/replication facility; and help with planning.

 

In addition the integration with the Network Weather Service will provide a base for ESnet application developers to instrument and improve their applications to take advantage of dynamic forecasts of performance characteristics. A key member of the current proposal’s team (Rich Wolski) is also the chief architect of the Network Weather Service.  The close contact of key people in this proposal with PPDG application developers will assist in extending mature applications such as parallel FTP to take advantage of the NWS. In addition, we anticipate that the integration of NWS and NIMI capabilities will generate new network performance analysis and forecasting research results.  This proposal will enable PPDG developers to leverage those results immediately.

 

Another valuable contribution will be IPv6 monitoring. As IPv6 is deployed it will be useful to monitor the performance and peering arrangements. We propose to investigate porting NIMI to IPv6 and to deploy a small number of IPv6 aware NIMI boxes. One of the key collaborators, Warren Matthews, is also a leader in monitoring IPv6 paths, and SLAC is connected to the ESnet IPv6 testbed.

 

We will enable the NIMI measurement tools to mark packets and make measurements over paths that support QoS. SLAC is connected to the ESnet QoS testbed; SLAC also has a joint (SLAC’s end is not funded) proposal with Daresbury Lab in England to investigate the effectiveness of QoS techniques, which will provide access to a transatlantic QoS controlled bottleneck.

 

Instrumenting the NTON will enable us to understand whether and how the probes, the measurements and the analysis can be scaled to an extremely high performance (OC48) network, and will also assist in providing a better understanding of the end-to-end performance of the NTON. SLAC is an NTON site with an OC48 and demonstrated more than 900 Mbits/sec throughput from Dallas to SLAC at SC2K in November 2000. In addition, several major SLAC/BaBar collaborators are connected to NTON, so there is significant interest in ensuring it works well for their major applications.

 

We will also work closely with the LBNL passive measurement infrastructure team to understand how the two sets of information complement one another and to validate our data.

 

4. Project schedule

 

1.        The first phase (roughly months 1 to 6) of the work will be to understand NIMI in detail, select an operating system and hardware and replicate the existing NIMI architecture and tools on it. This prototype will then be replicated and put into production use at SLAC to gain experience in a production environment. At the same time we will be working with two or three other friendly sites (including ANL and University of Tennessee at Knoxville) to agree upon and arrange for early deployment of NIMI probes at their sites. The ANL team will start to look at how to utilize the existing NIMI multicast measurements and to evaluate how to add Beacon functionality. The University of Tennessee team will develop an initial interface between the NIMI performance gathering infrastructure and the NWS forecasting subsystem. The PSC team will provide NIMI support for the ESnet environment.

2.        The next phase (roughly months 7 to 12) will be to procure the first 8 to 10 NIMI probes, to document and package the software and hardware for distribution to the first remote sites, and to deploy them. During this phase we will evaluate and understand the existing NIMI analysis and reporting tools and also investigate if and how to integrate the NIMI collected data with the PingER analysis and reporting tools. The ANL team will start to integrate Beacon into its NIMI and provide analysis and reporting. The University of Tennessee team will deploy an enhanced NWS capable of serving NIMI-gathered data, and will begin the study of new forecasting capabilities. During this phase we will investigate various other existing tools (e.g. traceping or Surveyor) for analyzing and reporting on the traceroute information we have gathered, choose one and use it to make our traceroute information available.

3.        In the second year, we will focus on broadening the AMIE community. We will feed our experience back into the documentation and procedures, decide on the strategic sites at which to position the next set of probes, and extend the deployment to a further 15 to 20 sites. User support will be expanded. We will include some IPv6 and non-U.S. sites in this deployment. We will also extend the measurement / analysis / reporting suites to add inter packet delay variation, multicast, IPv6 and QoS measurements. We will investigate other possible measurement suites such as bottleneck bandwidth estimation, e.g. via pipechar. During this period we will evaluate if and how to integrate our measurements with those from Surveyor or other measurement projects. We will investigate the cost and effectiveness of placing multiple NIMIs at strategic points inside a site to assist in addressing network performance problems local to a site and hence get closer to the user end of things.

4.        In the third year, we will extend the deployment to a further 20 sites. We will carefully validate and compare the measurements against other mechanisms, and add new measurement suites based on our experiences. We will also investigate making the measurements and analyses accessible to troubleshooting tools, to make problem isolation easier for non-expert users. We will start to deploy multiple NIMI probes at a few sites. We will evaluate how to provide a smooth transition from a project to a production service.

 

Budget

 

Total budget for this project is $475K/year for three years including equipment. The personnel cost breakdown is $45K each to ANL, PSC and the University of Tennessee Knoxville, and $260K to SLAC. Vern Paxson (LBNL/ACIRI) will be funded from elsewhere. Equipment required in the first year includes an archive machine with several hundred Gbytes of storage and a few (say 4) NIMI probes. In the second year we may need to add more storage to the archive machine, and there will be an additional 15-20 NIMI probes. The probes are estimated to cost about $2K each. The site-local personnel effort for maintaining the NIMIs should be quite low, because the NIMIs are designed to be remotely maintained and managed. If sites wish to undertake site-specific measurements and analyses, then that will require additional effort; the analysis generally requires significantly more time than the measurement. Accordingly, as sites gain experience with the value of the AMIE data, there may be a need to fund their analysis activities. This is not included in the present proposal.