Last update: Apr 1, 2000
Monitoring and Management of Large Distributed Computing Systems
Eileen Berman,
Lauri Loebel Carpenter,
Jim Fromm,
Krzysztof Genser,
Lisa Giacchetti,
Terry Jones,
Tanya Levshina,
Igor Mandrichenko,
Stan Naymola,
Don Petravick,
Rick Thies,
Rich Thompson
Fermi National Accelerator Laboratory
Presented by:
Tanya Levshina
Future projects at Fermilab will require thousands of computers,
which will add to the complexity of operating and managing a large
computing facility. The need for a Distributed Management System
(DMS) that can run Fermilab's computing systems efficiently has been
recognized, and investigations into such systems have begun.
The DMS should be based on industry standards (e.g. SNMP) and provide
self-management of different operating systems and mission-critical
applications. It should be scalable, extensible and secure, and it
should be able to detect hardware, network, system and application
problems. The system may perform "healing" actions, generate alarms
using configurable algorithms, and provide log file monitoring and
statistics reporting. It should support monitoring via a Web browser
as well as via a GUI or command line interface. The GUI should be
able to dynamically configure monitoring systems, alarm severity and
notification methods.
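As an illustration of what configurable alarms, severities and notification methods could look like, here is a minimal sketch. It does not describe any evaluated product or a Fermilab design. The names `AlarmRule` and `evaluate`, the metric names, and the simple threshold rule are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AlarmRule:
    """One configurable alarm (hypothetical structure, for illustration)."""
    metric: str       # assumed metric name, e.g. "disk_used_pct"
    threshold: float  # alarm fires when the sampled value exceeds this
    severity: str     # configurable severity label
    notify: str       # key selecting a notification method


def evaluate(rules: List[AlarmRule],
             sample: Dict[str, float],
             notifiers: Dict[str, Callable[[str], None]]) -> List[str]:
    """Check one sample against every rule and send a notification per alarm."""
    raised = []
    for rule in rules:
        value = sample.get(rule.metric)
        # Metrics missing from the sample are skipped rather than alarmed on.
        if value is not None and value > rule.threshold:
            message = (f"[{rule.severity}] {rule.metric}={value} "
                       f"exceeds {rule.threshold}")
            notifiers[rule.notify](message)
            raised.append(message)
    return raised


# Usage: a list stands in for a real notification channel (e-mail, pager).
sent: List[str] = []
notifiers = {"log": sent.append}
rules = [
    AlarmRule("disk_used_pct", 90.0, "critical", "log"),
    AlarmRule("load_avg", 8.0, "warning", "log"),
]
alarms = evaluate(rules, {"disk_used_pct": 95.5, "load_avg": 2.1}, notifiers)
print(alarms)
```

Because the rules and notifiers are plain data, they can be changed at run time, which is the kind of dynamic configuration the requirements call for. A real system would obtain samples from a standard source such as SNMP rather than from a dictionary.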
A number of free and commercially available products will be evaluated
against the proposed requirements. The decision whether to use an
existing product or to develop a DMS will be made based on that
evaluation.