Monitoring and Management of Large Distributed Computing Systems

Eileen Berman, Lauri Loebel Carpenter, Jim Fromm, Krzysztof Genser, Lisa Giacchetti, Terry Jones, Tanya Levshina, Igor Mandrichenko, Stan Naymola, Don Petravick, Rick Thies, Rich Thompson
 Fermi National Accelerator Laboratory

Presented by: Tanya Levshina

  The computing needs of future projects at Fermilab will require thousands of computers, adding to the complexity of operating and managing a large computing facility. The need for a Distributed Management System (DMS) able to run Fermilab's computing systems efficiently has been recognized, and investigation of such systems has begun. The DMS should be based on industry standards (e.g. SNMP) and provide self-management of different operating systems and mission-critical applications. It should be scalable, extensible and secure, and it should detect hardware, network, system and application problems. The system may perform "healing" actions, generate alarms using configurable algorithms, and provide log file monitoring and statistics reporting. Monitoring should be available via a Web browser as well as via a GUI or command-line interface. The DMS GUI should allow dynamic configuration of system monitoring, alarm severity and notification methods. A number of free and commercially available products will be evaluated against the proposed requirements. The decision whether to use an existing product or to develop a DMS will be based on this evaluation.
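
  As an illustration of what "configurable alarms" with optional "healing" actions could look like, the following is a minimal sketch in Python. It is not part of any evaluated product or of a Fermilab design: the names AlarmRule and evaluate, the metric names, the thresholds and the severity labels are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class AlarmRule:
    """One configurable rule (hypothetical structure, for illustration only)."""
    metric: str                                   # name of the monitored value
    threshold: float                              # alarm fires when value exceeds this
    severity: str                                 # e.g. "warning" or "critical"
    heal: Optional[Callable[[str], None]] = None  # optional healing action


def evaluate(rules: List[AlarmRule], node: str,
             sample: Dict[str, float]) -> List[Tuple[str, str, str]]:
    """Check one node's sample against every rule.

    Returns a list of (node, metric, severity) alarms and runs the
    healing action, if one is configured, for each rule that fires.
    Metrics missing from the sample are skipped.
    """
    alarms = []
    for rule in rules:
        value = sample.get(rule.metric)
        if value is None or value <= rule.threshold:
            continue
        alarms.append((node, rule.metric, rule.severity))
        if rule.heal is not None:
            rule.heal(node)
    return alarms
```

  In this sketch, changing a threshold, a severity or a healing action only means editing the list of rules, which is the kind of dynamic configuration the requirements call for; notification methods and transport (e.g. SNMP) are deliberately left out.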

