CHEP95 Distributed Computing Environment Monitoring and User Expectations [2]

Les Cottrell & Connie Logg
SLAC, Sept 1995
Outline of Talk:

· Why Monitor

· What should we monitor

· What is the Current State

· How do we Monitor

¯ Collection

¯ Analysis

¯ Reduction

¯ Notification

· Results

· Costs

· Future

Why Monitor


To provide:

1. Performance Tuning - improve service: proactively identify & reduce bottlenecks, tune and optimize systems, improve QOS; optimize investments: identify under/over-utilized resources, balance workloads

2. Trouble Shooting - get out of crisis mode: identify problems & start diagnosis/fixing before the end user notices; increase reliability/availability; allow users to accomplish work more effectively and maximize productivity

3. Planning - understand performance trends for planning

4. Expectations - set expectations for the Distributed System (from network thru applications) and see how well they are met

5. Security

6. Accounting

ESnet Sites Survey on "Why Monitor"


Results of a May 1995 survey of 9 ESnet sites [3], representing > 50K nodes

What's Changed that Makes Monitoring so Crucial now


1. Distributed environment (client/server)

· critically relies on the network to function

· very different from the central environment, yet users expect service as good or better

Table 1: Comparison of Old Environment to Distributed Environment

  Mainframe/Workstation                  | Distributed Environment
  ---------------------------------------+------------------------------------------------------------------
  One OS                                 | Many OSs & dist. sys. services
  One local file system                  | Multiple distributed file systems
  In "Glass House"                       | All over site, mods by people with varying skills & responsibilities
  Mature diagnostics with vendor call-in | Roll your own diagnostics & reports


What's Changed that Makes Monitoring so Crucial now


2. Network growth:

· Extent/coverage of network increasing

· Number of devices increasing exponentially (30-50% / year is typical)

· Traffic doubling typically every 18 months

· Technology to manage network is not growing as fast as network technology

What's Changed that Makes Monitoring so Crucial now


3. Complexity:

· a typical ESnet site has:

¯ products from about ten vendors, suppliers, carriers

¯ ~ a dozen different configurable equipment types (routers, bridges, hubs, switches ...)

¯ ~ half dozen network management applications (NMS, trouble ticket, probe management ...)

¯ ~ 9 different vendor MIBs

¯ 5 protocol suites (TCP/IP, DECnet, AppleTalk, NetWare, ...), typically routing 4 protocols, bridging 3 and tunnelling 2

¯ 9 server platforms (VMS, MacOS, AIX, SunOS, Windows NT, ...)

¯ ~ 30 networked applications

· this results in:

¯ decreased support effectiveness

¯ decreased QOS

¯ inability to support existing & new applications

¯ increased downtime, lost opportunity, users' time wasted & security exposures

What's Changed that Makes Monitoring so Crucial now


4. Reduced Resources:

· budgets increasingly constrained

· few experienced personnel available, hard to retain after training

So we need simple-to-use, well-integrated tools to automate network management and improve the productivity of existing personnel

What Should we Monitor


The ultimate measures of performance are the users' perceptions of the performance of their networked applications (e.g. WWW, email, a distributed RDBMS, a spreadsheet accessing a distributed file system, etc.)

This performance is affected by the performance of the complete Distributed System, which includes:

· physical network plant

· communications devices (e.g. routers, switches), computers and peripherals attached to the network plant

· host resource utilization

· software from device interfaces, thru operating systems to applications running on computers and devices

To set and meet user expectations for distributed system performance, we must monitor all of the above

What is the Current State


Companies are finding it difficult to manage network performance [4]:

· Only 24% adequately manage network performance

· Only 16% have network performance service level agreements

· 55% indicate they are understaffed for managing network performance

· 56% have a project in the works or plan to improve network performance

· 65% have a project in the works or plan to improve network management

· 95% would like to report on network utilization, but only 55% do

· 91% would like to report on network availability, but only 25% do

ESnet Sites: Practices vs. Desires for Monitoring


Largest Increase is in Security and Applications

What is the Current State of Tools


Table 2: Summary of Existing Tools

  Tool                                       | Target                                | Key Issues
  -------------------------------------------+---------------------------------------+-----------------------------
  SNMP agents & managers                     | Device management                     | Node focus, physical layer
  Umbrella Management Systems                | Integration platform, sys admin tools | Centralized polling, costly
  SNMP with RMON applications, LAN analyzers | Trouble shooting                      | Cost, not proactive

· expensive, hard to learn

· mainly aimed at real-time trouble shooting

· generate massive amounts of data

¯ needs to be squeezed into digestible reports

¯ needs to automatically identify baselines and exceptions

¯ needs automated expertise to correlate apparently multiple error sources and find the root cause



Components of Network Monitoring

Network Data Collection at SLAC


Collect data via SNMP from:

· bridges, routers, ethermeters, hubs and switches

Data collected includes (a polling sketch follows this list):

· # good packets, # kilobytes, packet size distribution

· # errors (broken down by error type)

· # packets dropped/discarded, buffer/controller overflows

· top-10 talkers & protocol distributions
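To make this concrete, here is a minimal sketch of such an SNMP counter poll, in Python rather than the SAS/Perl/REXX actually used at SLAC; it assumes the pysnmp library (v4 synchronous hlapi), and the host name, interface index and community string are placeholders:

```python
# Minimal SNMP counter poll, assuming the pysnmp (v4.x hlapi) library.
# Host name, interface index and community string are placeholders.
from pysnmp.hlapi import (getCmd, SnmpEngine, CommunityData,
                          UdpTransportTarget, ContextData,
                          ObjectType, ObjectIdentity)

IF_IN_OCTETS = '1.3.6.1.2.1.2.2.1.10'   # MIB-II ifInOctets
IF_IN_ERRORS = '1.3.6.1.2.1.2.2.1.14'   # MIB-II ifInErrors

def snmp_get(host, oid, community='public'):
    """Fetch a single integer-valued OID; return None on failure."""
    err_ind, err_stat, _, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData(community, mpModel=0),      # SNMPv1
        UdpTransportTarget((host, 161), timeout=2, retries=1),
        ContextData(),
        ObjectType(ObjectIdentity(oid))))
    if err_ind or err_stat:
        return None
    return int(var_binds[0][1])

if __name__ == '__main__':
    host, if_index = 'router.example.edu', 1      # placeholders
    octets = snmp_get(host, '%s.%d' % (IF_IN_OCTETS, if_index))
    errors = snmp_get(host, '%s.%d' % (IF_IN_ERRORS, if_index))
    print(host, 'ifInOctets =', octets, 'ifInErrors =', errors)
```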

Collect data via ping - for response time, packet loss, connectivity (see the sketch after this list) - from:

· critical servers, router interfaces, ethermeters

· off-site collaborators' nodes
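A sketch of the ping collection side, assuming a Unix ping with Linux-style summary output; the node names are placeholders:

```python
# Ping a node and extract packet loss and average round-trip time.
# Assumes a Linux-style ping summary; adjust the regexes for other Unixes.
import re
import subprocess

def ping_stats(host, count=10):
    """Return (loss_percent, avg_rtt_ms); (100.0, None) if unreachable."""
    out = subprocess.run(['ping', '-c', str(count), '-q', host],
                         capture_output=True, text=True).stdout
    loss = re.search(r'(\d+(?:\.\d+)?)% packet loss', out)
    rtt = re.search(r'= [\d.]+/([\d.]+)/', out)   # min/AVG/max(/mdev)
    return (float(loss.group(1)) if loss else 100.0,
            float(rtt.group(1)) if rtt else None)

for node in ['www.slac.stanford.edu', 'router1.example.edu']:  # placeholders
    loss, avg = ping_stats(node)
    print('%-28s loss=%5.1f%%  avg_rtt=%s ms' % (node, loss, avg))
```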

Other Sources:

· Poll critical Unix network daemons & services (e.g. mail, WWW, name, font, NFS ...)

· ARP caches

· appearance of new unregistered nodes

Data Analysis at SLAC


Once a day (in the early morning), via batch jobs:

· The previous day's data is analyzed and summarized into ASCII files (usually tabular) and graphs (a toy summarizer is sketched after this list)

· Long term graphs (fortnightly, monthly, 180 days) are updated

Ongoing analysis during the day consists of:

· Generating files of hourly graphs and other displays of the data collected so far today:

¯ bridge, router and ethermeter interface stats

¯ top-10 talkers and subnet protocol usage
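For illustration only (the production analysis ran as SAS batch jobs), a toy end-of-day summarizer; the CSV file name and its column names are hypothetical:

```python
# Toy daily summarizer: roll hourly per-interface samples up into an
# ASCII table. Assumes a CSV with hypothetical columns:
#   interface,hour,kilobytes,errors,packets
import csv
from collections import defaultdict

totals = defaultdict(lambda: {'kilobytes': 0, 'errors': 0, 'packets': 0})
with open('samples-19950831.csv') as f:          # placeholder file name
    for row in csv.DictReader(f):
        t = totals[row['interface']]
        for k in t:
            t[k] += int(row[k])

print('%-16s %12s %10s %12s' % ('Interface', 'kBytes', 'Errors', 'Err/10k pkt'))
for iface, t in sorted(totals.items()):
    rate = 1e4 * t['errors'] / t['packets'] if t['packets'] else 0.0
    print('%-16s %12d %10d %12.2f' % (iface, t['kilobytes'], t['errors'], rate))
```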

Data Reduction at SLAC


Analysis generates thousands of reports, most of which are uninteresting

Reduction examines the analysis reports and extracts the exceptions (a minimal threshold pass is sketched after this list), e.g.:

· Duplicate IP addresses

· Appearance of new unregistered nodes

· Loss of connectivity

· Data values exceeding thresholds, e.g.

¯ CRC & alignment errors > 1 in 10000 packets

¯ total utilization on a subnet of > 10% for the day

¯ broadcast rate > 150 pkts/sec

¯ (shorts+collisions)/good_packets > 10%

¯ packet loss from onsite pings > 1% in a day

¯ bridge/router overflows and queue drops

Reduction creates exception reports (for display via WWW) with hypertext links to tables and plots giving more information
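A minimal sketch of the threshold pass, applying the thresholds listed above; the daily-statistics dictionary and its field names are hypothetical:

```python
# Extract exceptions from one subnet's daily statistics using the
# thresholds listed above. The stats dict and field names are hypothetical.
def exceptions(subnet, s):
    out = []
    if s['crc_errors'] + s['alignment_errors'] > s['packets'] / 10000.0:
        out.append('CRC/alignment errors > 1 in 10000 packets')
    if s['utilization'] > 0.10:
        out.append('daily utilization %.0f%% > 10%%' % (100 * s['utilization']))
    if s['broadcast_pps'] > 150:
        out.append('broadcast rate %.0f pkts/sec > 150' % s['broadcast_pps'])
    if (s['shorts'] + s['collisions']) > 0.10 * s['good_packets']:
        out.append('(shorts+collisions)/good_packets > 10%')
    if s['ping_loss'] > 0.01:
        out.append('on-site ping loss %.1f%% > 1%%' % (100 * s['ping_loss']))
    return ['%s: %s' % (subnet, msg) for msg in out]

day = {'packets': 2000000, 'crc_errors': 350, 'alignment_errors': 20,
       'utilization': 0.13, 'broadcast_pps': 40, 'shorts': 100,
       'collisions': 5000, 'good_packets': 1990000, 'ping_loss': 0.002}
for line in exceptions('net-134', day):      # subnet name is made up
    print(line)
```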

Alert Notification


The daily WWW-visible exception reports are manually reviewed each working morning and used as input to the morning H.O.T. meeting:

· a 5-15 min open meeting of network ops & development, system admins, help desk and other interested people

· covers: scheduled outages and installations, newly identified problems, outstanding/unresolved problems

In addition:

NMS maps show when a managed critical interface becomes inactive (goes red)

SNMP and ping-polling of critical interfaces results in:

· X-window pop-ups being issued

· phone pages being issued

· e-mail messages being sent (a dispatch sketch appears below)

Security intrusions result in:

· phone pages being issued by the pager system
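As an illustration of how such notifications might be dispatched (not SLAC's actual mechanism), a sketch assuming a local SMTP relay and an email-to-pager gateway address, both placeholders:

```python
# Route an alert to e-mail, and to a pager for critical severities.
# Assumes a local SMTP relay and an email-to-pager gateway (placeholders).
import smtplib
from email.message import EmailMessage

PAGER_GATEWAY = 'netops-pager@example.edu'   # hypothetical gateway
OPS_LIST = 'netops@example.edu'              # hypothetical list

def send_alert(subject, body, critical=False):
    msg = EmailMessage()
    msg['From'] = 'monitor@example.edu'      # placeholder sender
    msg['To'] = PAGER_GATEWAY if critical else OPS_LIST
    msg['Subject'] = subject
    msg.set_content(body)
    with smtplib.SMTP('localhost') as relay:  # assumes a local relay
        relay.send_message(msg)

send_alert('ping: router1 unreachable', '100% packet loss for 5 min',
           critical=True)
```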

Results


Service Level Expectations:

· Examples

¯ Ping response time for on-site network layer < 10msec for 95% of samples

¯ Network reachability of critical nodes of >= 99%

¯ Sub-second response for trivial network services (name, font, network daemons (smtp, nfsrpc) ...)

¯ 95% of trivial mail delivered on site in 10 minutes

¯ 95% of requests for SLAC WWW home page served in < 0.1 secs.

Table 3: Response time for WWW, Aug 31, 1995

  Node | Avg    | 50%tile (thresh) | 95%tile (thresh)
  -----+--------+------------------+-----------------
  WWW  | 0.036s | 0.05s (< 0.04s)  | 0.055s (< 0.1s)

The expectations are used in conjunction with thresholds to flag exceptions; a minimal percentile check is sketched below.
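The sample data here is made up, and the nearest-rank percentile method is one of several reasonable choices:

```python
# Compare a day's response-time samples against 50th/95th percentile
# thresholds, as in the WWW expectation above. Sample data is made up.
def percentile(samples, p):
    """Nearest-rank percentile of a list of numbers."""
    s = sorted(samples)
    idx = max(0, min(len(s) - 1, int(round(p / 100.0 * len(s))) - 1))
    return s[idx]

samples = [0.03, 0.032, 0.05, 0.041, 0.038, 0.055, 0.029, 0.047]  # seconds
for p, thresh in ((50, 0.04), (95, 0.1)):
    v = percentile(samples, p)
    flag = 'OK' if v < thresh else 'EXCEPTION'
    print('%d%%tile = %.3fs (threshold < %.2fs): %s' % (p, v, thresh, flag))
```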

Wide Area Monitoring


WAN monitoring for an end site has different requirements from LAN monitoring

· it is often outsourced, so the site has limited control

· there is much greater variability in results

However, we (users and networkers) still want reasonable expectations for planning and problem identification

The main tools used today are:

· ping response time, packet loss and connectivity

· traceroute to discover the route to another node (see the sketch after this list)

· FTP transfer rates

· traffic measurements from firewall routers
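For example, a small wrapper around traceroute to record the route and hop count; it assumes a Unix traceroute on the PATH, and the target host is a placeholder:

```python
# Record the route and hop count to a remote node, assuming a Unix
# traceroute on the PATH; the target host is a placeholder.
import subprocess

def route_to(host):
    """Return the traceroute output lines (one per hop, after the header)."""
    out = subprocess.run(['traceroute', host],
                         capture_output=True, text=True).stdout
    return out.splitlines()[1:]          # drop the 'traceroute to ...' header

hops = route_to('www.in2p3.fr')          # placeholder target
print('%d hops' % len(hops))
for line in hops:
    print(line)
```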

Wide Area Monitoring


Several HEP sites (e.g. IN2P3, Padova, RAL, SLAC) have automated the pinging and produce reports

· Significant changes in ping response time may indicate routing problems

¯ use traceroute to look at routes

¯ need to look from both ends (ideally without an account, e.g. via WWW)

[Figure: Ping measurements between UC Davis and SLAC]

WAN Monitoring


· Significant increase in packet loss may indicate overloaded routers/links

· Try to separate down nodes/links from packet loss

· Since quantitative values vary widely from node to node and even month to month, we need automated "expert" help in setting dynamic thresholds (a simple rolling-baseline sketch appears after this list)

· FTP rates depend on many inter-related factors including:

¯ packet loss

¯ end-to-end network response time

¯ number of hops and route used

¯ utilization of links

¯ speed of links

¯ capability of end nodes

¯ Measured between 10am and 5pm PST on Labor Day, >= 3 transfers/site

¯ Lots of variability in rates (factors of 5 or more, even minute to minute for the same node)

¯ Average transfer rate drops by ~30% on workdays
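One simple form such "expert" help could take is a rolling baseline: flag today's value only if it is well outside recent history. This sketch is illustrative; the window and multiplier are arbitrary choices, not the talk's settings:

```python
# Simple dynamic threshold: flag today's value if it exceeds the rolling
# baseline mean by more than k standard deviations. Window and k are
# arbitrary illustrative choices.
from statistics import mean, stdev

def is_exception(history, today, window=30, k=3.0):
    base = history[-window:]
    if len(base) < 5:                 # not enough history to judge
        return False
    return today > mean(base) + k * stdev(base)

loss_history = [0.2, 0.3, 0.1, 0.4, 0.2, 0.3, 0.2, 0.5, 0.1, 0.3]  # % loss
print(is_exception(loss_history, today=2.5))   # True: well above baseline
```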

Host Monitoring


The focus so far has been on monitoring the lower layers of the network

Managing host systems connected to the network has become increasingly important

Need consistent management of host system resources and mission-critical applications

The IETF has made great strides in specifying MIBs, in particular the Host Resources MIB

Much of today's practice is still roll-your-own, e.g. at SLAC:

· extensive accounting/monitoring on Unix servers by Unix group

¯ disk space utilization

¯ cpu utilization

¯ memory usage

¯ paging activity and I/O

¯ security intrusions (e.g. md5 mismatch -> page, Ethernet promiscuous mode)

· tabular reports and graphs created daily and WWW accessible

· generates alerts, e.g. if a user's application on a server uses > 10% of cpu for > 15 mins (a minimal watcher is sketched below)
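A minimal sketch of such a CPU watcher, assuming a Unix ps; the 10%/15-minute thresholds come from the text above:

```python
# Alert if any process stays above 10% CPU for more than 15 minutes.
# Assumes a Unix ps supporting '-eo pid,pcpu,comm'; thresholds from the text.
import subprocess
import time

CPU_LIMIT, HOLD_SECS, POLL_SECS = 10.0, 15 * 60, 60
first_seen = {}                        # pid -> time it first exceeded limit

while True:
    out = subprocess.run(['ps', '-eo', 'pid,pcpu,comm'],
                         capture_output=True, text=True).stdout
    now, hot = time.time(), set()
    for line in out.splitlines()[1:]:  # skip the ps header line
        pid, pcpu, comm = line.split(None, 2)
        if float(pcpu) > CPU_LIMIT:
            hot.add(pid)
            start = first_seen.setdefault(pid, now)
            if now - start > HOLD_SECS:
                print('ALERT: %s (pid %s) > %.0f%% cpu for > 15 min'
                      % (comm, pid, CPU_LIMIT))
    for pid in list(first_seen):       # forget processes that cooled down
        if pid not in hot:
            del first_seen[pid]
    time.sleep(POLL_SECS)
```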

Costs - Disk & People


Disk Space

· ~ 4Mbytes/day of raw data collected

· ~ 750 Mbytes of plots (roughly proportional to number of devices)

People Resources

· Coding is the easy part

¯ About 16,000 lines of code, mainly SAS, Perl, REXX

¯ 15% for collection, 77% analysis, 8% reduction

¯ Some has been packaged for distribution

¯ Exported to ORNL, PNL and Stanford, with interest from LBL, FNAL, Cisco, NAT and others

· Hard part is defining and understanding the details

· Probably 2-3 FTE-years invested in the last 3 years by 5 people

Costs - Bandwidth


Polling:

· Uses bandwidth [1], e.g. polling 20 stations/interfaces/agents every 5 secs can use all of a 64 kbps line

· Beware: polling routers etc. may cause device cpu and buffer utilization problems (e.g. for ARP caches)

Strategies:

· Select only essential variables

· Match polling volume to node capability

¯ # remote nodes < PollingInterval / AvgDelayPerPoll (see the worked sketch after this list)

¯ Remote sites: minimal

¯ Core sites: normal polling

· Consider data "relevance"

¯ Accounting: hours or even daily

¯ Interface status: minutes

¯ Real-time troubleshooting: seconds
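A worked sketch of the sizing rule and the bandwidth example above; the per-poll delay and per-poll traffic volume are illustrative assumptions (the latter chosen to match the 64 kbps example):

```python
# Worked example of the sizing rules above. The per-poll delay and
# per-poll traffic volume are illustrative assumptions.
poll_interval = 300.0      # seconds between polls of one node
avg_delay_per_poll = 0.5   # seconds per poll, incl. timeouts (assumption)
print('max remote nodes per poller: %d'
      % (poll_interval / avg_delay_per_poll))            # 600

# Bandwidth of the example in the text: 20 agents every 5 seconds,
# assuming ~2 kbytes of request+response traffic per poll (many variables).
n, t, per_poll_bytes = 20, 5, 2000
print('polling load: %.0f kbps' % (n * per_poll_bytes * 8 / t / 1000))  # 64
```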

Future - Switched Network


Shared media => switched network

Need a probe on every switch/hub port

· not acceptable to drag a protocol analyzer to remote places

· with so many ports, it is difficult to have enough probes

· SNMP MIBs do not provide network-level utilization

Switch and router vendors are incorporating RMON into devices

· some RMON functions are expensive, e.g. matrix & packet capture, so compromises are needed

· statistics, alarm and event groups are cheap, should be on all ports

Future - RMON2


Provides monitoring for the full 7 layers of the OSI model (RMON 1 handles only layers 1 and 2)

RMON 2 will enable trouble-shooting tools to

· show what clients are talking to what servers through what protocols (i.e. can determine how much applications such as WWW are using the network)

· help with traffic flow

· detect duplicate IP addresses

· improve packet capture selection

RMON2 also enhances configuration

The IETF RMON version 2 draft was published as a proposed standard RFC in July 1995

Many RMON vendors are already providing proprietary protocol analysis above OSI layer 2

Future - ATM


Challenges:

· speed

¯ buffers overflow in milliseconds

¯ hundreds of thousands of cells can be lost in a second

¯ traffic bursts come & go before tools can detect them

· cell processing all in hardware

¯ common chips throw away bad cells - the very ones the network operator would like to see

· volume of data

¯ a large ATM network will handle hundreds of millions of cells, and thousands of virtual calls, per second

· complexity

¯ in the ATM domain circuits are virtual, with multiple QOS requirements (e.g. response-time needs for voice vs. data are very different)

¯ LAN emulation and IP over ATM, i.e., lots of places to look for problems

Future - AMON


Some ATM MIBs exist for lower layers

No RMON for ATM

· a consortium (AMON) was formed in July 1995 with 15 equipment manufacturers representing switches, remote monitoring and test equipment

· its work includes establishing circuit steering to direct traffic to monitoring equipment

Summary


Still no out-of-the-box integrated solution is available

Require distributed, easy-to-use, heterogeneous "system" management to enable focus to shift to service management

Need to make information digestible

Developing tools is costly and still has to be done in-house

· Yet good monitoring tools can:

¯ effectively leverage scarce people resources

¯ provide realistic input to Service Level Expectations

Encourage vendors to provide tools:

· Demand SNMP agents in all network devices

· Encourage RMON in network connectivity devices (in addition to stand-alone probes)

· Tell vendors network monitoring is an important consideration in selecting equipment

The WWW is a wonderful user interface and distribution system, but there is concern over inadequate control of the extent of visibility

Lots of challenges to come: switched networks, RMON2, ATM, increased expectations (job security!)


References

[1] SLAC has roughly 300 devices (routers, bridges, Ethermeters, servers), or 1000 interfaces, that are polled, and uses < 0.3% of its Ethernet bandwidth.

[2] Talk presented at the CHEP95 Conference, Rio de Janeiro, September 1995.

[3] CEBAF, FNAL, GA, LANL, LBL, ORNL, PNL, PPPL, SLAC.

[4] Results from a survey by International Network Services, June 1995.


cottrell@slac.stanford.edu
Copyright © 1996, SLAC and Stanford University. All rights reserved.