CHEP95 Distributed Computing Environment Monitoring and User
Expectations
Les Cottrell & Connie Logg
SLAC, Sept 1995
Outline of Talk:
· Why Monitor
· What should we monitor
· What is the Current State
· How do we Monitor
¯ Collection
¯ Analysis
¯ Reduction
¯ Notification
· Results
· Costs
· Future
Why Monitor
To provide:
1. Performance Tuning - Improve service - proactively id & reduce bottlenecks, tune and optimize systems, improve QOS, optimize investments - id under/over utilized resources, balance workloads
2. Trouble Shooting - Get out of crisis mode, id probs & start diagnosis/fixing before end user notices, increase reliability/availability, allow user to accomplish work more effectively and maximize productivity.
3. Planning - understand performance trends for planning
4. Expectations - set expectations for the Distributed System (from network thru applications) and see how well they are met
5. Security
6. Accounting
ESnet Sites Survey on "Why Monitor"
Results of May-95 Survey from 9 ESnet sites [3] representing > 50K nodes
What's Changed that Makes Monitoring so Crucial now
1. Distributed environment (client/server)
· critically relies on network to function.
· very different from central environment, yet users expect as good or better
2. Network growth:
· Extent/coverage of network increasing
· Number of devices increasing exponentially (30-50% / year is typical)
· Traffic doubling typically every 18 months
· Technology to manage network is not growing as fast as network technology
3. Complexity:
· a typical ESnet site has:
¯ products from about ten vendors, suppliers, carriers
¯ ~ a dozen different configurable equipment types (routers, bridges, hubs, switches ...)
¯ ~ half dozen network management applications (NMS, trouble ticket, probe management ...)
¯ ~ 9 different vendor MIBs
¯ ~ 5 protocol suites (TCP/IP, DECnet, AppleTalk, NetWare ...), typically routing 4 protocols, bridging 3 and tunnelling 2
¯ 9 server platforms (VMS, MacOS, AIX, SunOS, WNT ...)
¯ ~ 30 networked applications
· this results in:
¯ decreased support effectiveness
¯ decreased QOS
¯ inability to support existing & new applications
¯ increased downtime, lost opportunity, user's time wasted & security exposures
4. Reduced Resources:
· budgets increasingly constrained
· few experienced personnel available, hard to retain after training
So we need simple-to-use, well-integrated tools to automate network management and improve the productivity of existing personnel
What Should we Monitor
The ultimate measures of performance are the users' perceptions of the performance of their networked applications (e.g. WWW, email, a distributed RDBMS, a spreadsheet accessing a distributed file system etc.)
This performance is affected by the performance of the complete Distributed System, which includes:
· physical network plant
· communications devices (e.g. routers, switches), computers and peripherals attached to the network plant
· host resource utilization
· software from device interfaces, thru operating systems to applications running on computers and devices
To set and meet user expectations for distributed system performance, we must monitor all of the above
What is the Current State
Companies are finding it difficult to manage network performance [4]:
· Only 24% adequately manage network performance
· only 16% have network performance service level agreements
· 55% indicate they are understaffed for managing network performance
· 56% have a project in works or plan to improve network performance
· 65% have a project in work or plan to improve network management
· 95% would like to report on network utilization, but only 55% do
· 91% would like to report on network availability, but only 25% do
ESnet Sites: Practices vs. Desires for Monitoring
Largest Increase is in Security and Applications
What is the Current State of Tools
· expensive, hard to learn
· mainly aimed at real-time trouble shooting
· generate massive amounts of data
¯ needs to be squeezed into digestible reports
¯ needs to automatically identify baselines and exceptions
¯ needs automated expertise to correlate apparent multiple error sources and find root cause
Components of Network Monitoring
Network Data Collection at SLAC
Collect data via SNMP from:
· Bridges, routers, ethermeters, hubs and switches
· Data collected includes:
¯ # good packets, # kilobytes, pkt size distribution
¯ # errors (broken down by error type)
¯ # pkts dropped, discarded, buffer/controller overflows
¯ top-10 talkers & protocol distributions
Collect data via Ping - for response, pkt loss, connectivity from:
· critical servers, router interfaces, ethermeters
· off-site collaborators' nodes
Other Sources:
· Poll critical Unix network daemons & services (e.g. mail, WWW, name, font, NFS ...)
· ARP caches
· appearance of new unregistered nodes
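The SNMP counters collected above (good packets, kilobytes, errors, drops) are cumulative, so a collector must difference successive polls, allowing for counter wrap. A minimal sketch (function names are illustrative, assuming 32-bit Counter32 objects):

```python
def counter_delta(prev, curr, max_count=2**32):
    """Delta between two cumulative SNMP Counter32 samples,
    allowing for at most one wrap of the 32-bit counter."""
    if curr >= prev:
        return curr - prev
    return max_count - prev + curr

def rates(prev_sample, curr_sample, interval_secs):
    """Per-second rates for each polled counter (e.g. ifInOctets, ifInErrors)."""
    return {name: counter_delta(prev_sample[name], curr_sample[name]) / interval_secs
            for name in curr_sample}

# Example: two polls of one interface 300 s apart; the octet counter wrapped
prev = {"ifInOctets": 4294000000, "ifInErrors": 17}
curr = {"ifInOctets": 1000000, "ifInErrors": 17}
r = rates(prev, curr, 300)
```

At 10 Mbps Ethernet a busy interface can wrap a 32-bit octet counter in under an hour, which is why the polling interval matters for counter-based statistics.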
Data Analysis at SLAC
Once a day (in the early morning), via batch jobs:
· The previous day's data is analyzed and summarized into ASCII files (usually tabular) and graphs
· Long term graphs (fortnightly, monthly, 180 days) are updated
Ongoing analysis during the day consists of:
· Generating files of hourly graphs and other displays of data collected so far today
· Bridge, router and ethermeter interface stats
· Top10 talkers and subnet protocol usage
Data Reduction at SLAC
Analysis generates thousands of reports, most of which are uninteresting
Reduction examines the analysis reports and extracts the exceptions, e.g.:
· Duplicate IP addresses
· Appearance of new unregistered nodes
· Loss of connectivity
· Data values exceeding thresholds, e.g.
¯ CRC & alignment errors > 1 in 10000 packets
¯ total utilization on a subnet of > 10% for the day
¯ broadcast rate > 150 pkts/sec
¯ (shorts+collisions)/good_packets > 10%
¯ packet loss from onsite pings > 1% in a day
¯ bridge/router overflows and queue drops
· Reduction creates exception reports (for display via WWW) with hypertext links to tables and plots with more information
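The reduction step above is essentially a table of thresholds applied to each subnet's daily summary; a minimal sketch using the thresholds listed (the data layout and field names are assumptions for illustration):

```python
# Daily exception thresholds from the talk
THRESHOLDS = {
    "crc_align_per_pkt": 1 / 10000,     # CRC & alignment errors per packet
    "utilization": 0.10,                # total subnet utilization for the day
    "broadcast_pkts_per_sec": 150,
    "short_collision_ratio": 0.10,      # (shorts+collisions)/good_packets
    "ping_loss": 0.01,                  # on-site ping loss for the day
}

def exceptions(daily_stats):
    """Keep only the (subnet, metric, value) triples that exceed a threshold."""
    out = []
    for subnet, stats in daily_stats.items():
        for metric, limit in THRESHOLDS.items():
            if stats.get(metric, 0) > limit:
                out.append((subnet, metric, stats[metric]))
    return out

# Two hypothetical subnets: one with an error-rate exception, one overloaded
day = {"bldg-50": {"crc_align_per_pkt": 3e-4, "utilization": 0.04},
       "core":    {"utilization": 0.22, "broadcast_pkts_per_sec": 12}}
```

Everything below threshold is silently dropped, which is how thousands of analysis reports shrink to a short exception list.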
Alert Notification
The daily WWW visible exception reports are manually reviewed each working morning and used as input to the morning H. O. T. meeting
· 5-15 min open meeting of network ops & development, systems admins, help desk and other interested people
· covers: scheduled outages and installations, newly identified problems, outstanding/unresolved problems
In addition:
NMS maps show when a managed critical interface becomes inactive (goes red)
SNMP and ping-polling of critical interfaces results in:
· issuing of X-window pop-up windows
· phone pages being issued
· e-mail messages
Security intrusions result in:
· phone pages being issued by the pager system
Results
Service Level Expectations:
· Examples
¯ Ping response time for on-site network layer < 10msec for 95% of samples
¯ Network reachability of critical nodes of >= 99%
¯ Sub-second response for trivial network services (name, font, network daemons (smtp, nfsrpc) ...)
¯ 95% of trivial mail delivered on site in 10 minutes
¯ 95% of requests for SLAC WWW home page served in < 0.1 secs.
The expectations are used in conjunction with thresholds to flag exceptions in the daily reports
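Each "X% of samples" expectation above reduces to a simple fraction test over the day's samples; a minimal sketch (function name is illustrative):

```python
def meets_sle(samples_ms, limit_ms=10.0, fraction=0.95):
    """True when at least `fraction` of the samples beat the limit,
    e.g. on-site ping response < 10 ms for 95% of samples."""
    under = sum(1 for s in samples_ms if s < limit_ms)
    return under / len(samples_ms) >= fraction
```

The same shape covers the mail-delivery and WWW-response expectations by changing the limit and units.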
Wide Area Monitoring
WAN monitoring for an end site has different requirements from LAN monitoring:
· often outsourced, so the site has limited control
· there is much greater variability in results
However we (users and networkers) still want to have reasonable expectations for planning and problem identification
The main tools used today are:
· ping response time, packet loss and connectivity
· traceroute to discover the route to another node
· FTP transfer rates
· traffic measurements from firewall routers
Several HEP sites (e.g. IN2P3, Padova, RAL, SLAC) have automated the pinging and produce reports
· Significant changes in ping response time may indicate routing problems
¯ use traceroute to look at routes
¯ need to look from both ends (ideally without an account, e.g. via WWW)
Ping Measurements between UC Davis and SLAC
WAN Monitoring
· Significant increase in packet loss may indicate overloaded routers/links
· Try to separate down nodes/links from packet loss
· Since quantitative values vary widely from node to node and even month to month, need automated "expert" help in setting dynamic thresholds
· FTP rates depend on many inter-related factors including:
¯ packet loss
¯ network response time for end to end
¯ number of hops and route used
¯ utilization of links
¯ speed of links
¯ capability of end nodes
¯ Measured between 10am and 5pm PST Labor Day, >= 3 transfers/site
¯ Lot of variability in rates (can be factors of 5 or more even from minute to minute for the same node)
¯ Average transfer rate drops by ~30% for workdays
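The dynamic-threshold point above, since quantitative values vary widely by node and month, can be sketched as a per-node rolling baseline: flag today's packet loss only when it exceeds the node's own recent mean by several standard deviations (the k=3 choice and 1% floor are assumptions for illustration):

```python
import statistics

def dynamic_threshold(history, k=3.0, floor=0.01):
    """Exception threshold for one node's daily packet-loss fraction:
    mean + k*stdev of its recent history, never below a fixed floor
    (1% here, matching the on-site ping-loss threshold)."""
    mean = statistics.fmean(history)
    sd = statistics.stdev(history) if len(history) > 1 else 0.0
    return max(floor, mean + k * sd)

def is_exception(today_loss, history):
    return today_loss > dynamic_threshold(history)
```

A node that normally loses 0.2% of pings is not flagged at 0.5%, but a jump to 5% is, without any hand-set per-node threshold.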
Host Monitoring
Focus so far has been on monitoring the lower layers of the network
Managing host systems connected to the network has become increasingly important
Need consistent management of host system resources and mission-critical applications
The IETF has made great strides toward specifying MIBs, in particular the Host Resources MIB
Much of today's practice is still roll-your-own, e.g. at SLAC:
· extensive accounting/monitoring on Unix servers by Unix group
¯ disk space utilization
¯ cpu utilization
¯ memory usage
¯ paging activity and I/O
¯ security intrusions (e.g. md5 -> page, Enet promiscuous)
· tabular reports and graphs created daily and WWW accessible
· generates alerts, e.g. if a user's application on a server uses > 10% of cpu for > 15 mins
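The "> 10% cpu for > 15 mins" alert above needs per-process state across polls; a minimal sketch that counts consecutive over-threshold polls (the sample format, e.g. parsed `ps` output, is an assumption):

```python
from collections import defaultdict

ALERT_CPU = 10.0      # percent of cpu
ALERT_MINUTES = 15    # sustained duration before alerting
POLL_MINUTES = 1      # polling interval

over = defaultdict(int)   # consecutive over-threshold minutes per (user, proc)

def check(sample):
    """sample: list of (user, proc, cpu_percent) tuples from one poll.
    Returns the (user, proc) keys that have stayed over ALERT_CPU
    for at least ALERT_MINUTES."""
    alerts, seen = [], set()
    for user, proc, cpu in sample:
        key = (user, proc)
        seen.add(key)
        if cpu > ALERT_CPU:
            over[key] += POLL_MINUTES
            if over[key] >= ALERT_MINUTES:
                alerts.append(key)
        else:
            over[key] = 0          # any dip resets the clock
    for key in list(over):         # forget processes that have exited
        if key not in seen:
            del over[key]
    return alerts
```

Requiring the condition to be sustained avoids paging on short cpu bursts that are normal for batch work.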
Costs - Disk & People
Disk Space
· ~ 4Mbytes/day of raw data collected
· ~ 750 Mbytes of plots (roughly proportional to number of devices)
People Resources
· Coding is the easy part
¯ About 16,000 lines of code, mainly SAS, Perl, REXX
¯ 15% for collection, 77% analysis, 8% reduction
¯ Some has been packaged for distribution
¯ Exported to ORNL, PNL, Stanford and interest from LBL, FNAL, Cisco, NAT and others
· Hard part is defining and understanding the details
· Probably 2-3 FTE years invested in last 3 years by 5 people
Costs - Bandwidth
Polling:
· Uses bandwidth [1], e.g. polling 20 stations/interfaces/agents every 5 secs can use all of a 64 kbps line
· Beware: polling routers etc. may cause device cpu and buffer utilization problems (e.g. for ARP caches)
Strategies:
· Select only essential variables
· Match polling volume to node capability
¯ # remote nodes < PollingInterval/AvgDelayPerPoll
¯ Remote sites: minimal
¯ Core sites: normal polling
· Consider data "relevance"
¯ Accounting: hours or even daily
¯ Interface status: minutes
¯ Real-time troubleshooting: seconds
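The capacity bound and the 64 kbps example above can be put in code; the per-poll message sizes (10 variables at ~200 bytes of request plus response each) are assumptions chosen only to reproduce the slide's figure:

```python
def max_remote_nodes(polling_interval_s, avg_delay_per_poll_s):
    """Upper bound on nodes one serial poller can cover:
    # remote nodes < PollingInterval / AvgDelayPerPoll."""
    return int(polling_interval_s / avg_delay_per_poll_s)

def polling_bandwidth_bps(nodes, vars_per_poll, bytes_per_var, interval_s):
    """Rough SNMP polling load (request + response) on the line, in bits/sec."""
    return nodes * vars_per_poll * bytes_per_var * 8 / interval_s

# 20 agents, 10 variables of ~200 bytes each, every 5 s -> 64 kbps
load = polling_bandwidth_bps(20, 10, 200, 5)
```

Stretching the interval from 5 s to 5 min for the same agents cuts the load by a factor of 60, which is why accounting-grade data can be polled hourly or daily.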
Future - Switched Network
Shared media => switched network
Need a probe on every switch/hub port
· not acceptable to drag protocol analyzer to remote places
· with so many ports it is difficult to have enough probes
· SNMP MIBs do not provide network level utilization
Switch and router vendors are incorporating RMON into devices
· some RMON functions are expensive, e.g. matrix & packet capture, so compromises are needed
· statistics, alarm and event groups are cheap, should be on all ports
Future - RMON2
RMON itself handles only layers 1 and 2 of the OSI model
RMON2 provides monitoring for the full 7 layers
RMON 2 will enable trouble-shooting tools to
· show what clients are talking to what servers through what protocols (i.e. can determine how much applications such as WWW are using the network)
· help with traffic flow
· detect duplicate IP addresses
· improve packet capture selection
RMON2 also enhances configuration
IETF draft RMON version 2 published as proposed standard and RFC July 1995
Many RMON vendors are already providing proprietary protocol analysis above the OSI layer 2
Future - ATM
Challenges:
· speed
¯ buffers overflow in milliseconds
¯ hundreds of thousands of cells lost in a second
¯ traffic bursts come & go before tools can detect them
· cell processing all in hardware
¯ common chips throw away bad cells - the very ones the network operator would like to see
· volume of data
¯ a large ATM network will handle hundreds of millions of cells and thousands of virtual calls per second
· complexity
¯ ATM domain - circuits are virtual, multiple QOS requirements (e.g. response time for voice vs. data very different)
¯ LAN emulation and IP over ATM, i.e., lots of places to look for problems
Future - AMON
Some ATM MIBs exist for lower layers
No RMON for ATM
· a consortium (AMON) formed July 1995 with 15 equipment manufacturers representing switches, remote monitoring and test equipment
· includes establishing circuit steering to direct traffic to monitoring equipment
Summary
Still no out-of-the-box integrated solution available
Require distributed, easy-to-use, heterogeneous "system" management to enable focus to shift to service management
Need to make information digestible
Developing tools is costly and still has to be done in-house
· Yet good monitoring tools can:
¯ effectively leverage scarce people resources
¯ provide realistic input to Service Level Expectations
Encourage vendors to provide tools:
· Demand SNMP agents in all network devices
· Encourage RMON in network connectivity devices (in addition to in stand alone probes)
· Tell vendors network monitoring is an important consideration in selecting equipment
WWW is a wonderful user interface and distribution system, but there is concern over inadequate control of the extent of visibility
Lots of challenges to come, switched network, RMON2, ATM, increased expectations (job security!)
References
[1] SLAC has roughly 300 devices (routers, bridges, Ethermeters, servers), or 1000 interfaces that are polled, and uses < 0.3% of its Ethernet bandwidth
[2] Talk presented at the CHEP95 Conference, Rio de Janeiro, September 1995
[3] CEBAF, FNAL, GA, LANL, LBL, ORNL, PNL, PPPL, SLAC
[4] Results from Survey by International Network Services, June 1995
cottrell@slac.stanford.edu
Copyright © 1996, SLAC and Stanford University. All rights
reserved.