Apr 1, 2000
Visualization Tools for Monitoring and Evaluation of Distributed
- Laboratory for Nuclear Science, M.I.T.
- Lab de l'Accelerateur Lineaire, Orsay
(for the BaBar Prompt Reconstruction and Computing groups)
We describe several tools used to evaluate the operation of distributed
computing systems. Included are tools developed for visual presentation of
data accumulated from several sources. Examples are taken from the BaBar
Prompt Reconstruction system, which consists of more than 200 individual
nodes and transmits 500 gigabytes/day to an object-oriented data store. Each
node records its actions in a log file; together with other performance logs,
these files supply the data the tools require. One tool, a log analyzer and browser based on
Perl/PerlTk, was developed to spot failures in the log files. It was built
primarily to narrow the search for synchronous events ("hiccups") across the
nodes to a few useful lines per node, instead of full log files of several
megabytes each. It is also used to navigate through these log files and other
failure reports, and as a display for monitoring the whole system.
Another tool presents each node's activities in parallel to help
detect situations where resource demands by one node affect the activities on
others. These tools have contributed to the understanding of several problems
encountered during this system's development.
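
To illustrate the narrowing step performed by the log analyzer, the following is a minimal sketch written in Python rather than in the Perl/PerlTk of the actual tool. It is not the BaBar implementation. The log line layout (`YYYY-MM-DD HH:MM:SS LEVEL message`), the level names `ERROR` and `FATAL`, and the names `parse_line` and `find_synchronous_failures` are all assumptions made for the example. The sketch groups failure lines into fixed time windows and keeps only the windows in which several nodes reported a failure.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed (hypothetical) log line layout: "2000-02-14 03:12:07 ERROR message text".
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
FAILURE_LEVELS = {"ERROR", "FATAL"}  # assumed level names
EPOCH = datetime(1970, 1, 1)


def parse_line(line):
    """Return (timestamp, level), or None if the line does not fit the assumed layout."""
    parts = line.split(None, 3)
    if len(parts) < 3:
        return None
    try:
        stamp = datetime.strptime(parts[0] + " " + parts[1], TIME_FORMAT)
    except ValueError:
        return None
    return stamp, parts[2]


def find_synchronous_failures(logs, window_seconds=60, min_nodes=2):
    """Find time windows in which at least min_nodes nodes logged a failure.

    logs maps a node name to its list of log lines. The result is a list of
    (window_start, {node: [failure lines]}) tuples sorted by window_start.
    """
    # window index -> node name -> failure lines seen in that window
    buckets = defaultdict(lambda: defaultdict(list))
    for node, lines in logs.items():
        for line in lines:
            parsed = parse_line(line)
            if parsed is None:
                continue
            stamp, level = parsed
            if level not in FAILURE_LEVELS:
                continue
            index = int((stamp - EPOCH).total_seconds()) // window_seconds
            buckets[index][node].append(line)

    hits = []
    for index in sorted(buckets):
        per_node = buckets[index]
        if len(per_node) >= min_nodes:
            start = EPOCH + timedelta(seconds=index * window_seconds)
            hits.append((start, dict(per_node)))
    return hits
```

Fixed windows keep the sketch short, but two failures that straddle a window boundary land in different windows and are not reported together. A sliding window would avoid this at the cost of a little more bookkeeping.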