
Last update: Apr 1, 2000



E060

Report on the D0 Linux Production Farms

X X
 D0 Data Handling Group

Speaker: Heidi Schellman

  The D0 proton-antiproton experiment at Fermilab will begin taking data in early 2001. Events of 250-500 kB will be written at a rate of 20-50 Hz over a period of several years. These events will be written to a robotic data store and then transferred to a 'farm' of commodity PCs running Linux for reconstruction. The final system is expected to deliver 8000-10000 SpecInt95 and to handle peak input bandwidths of up to 25 MB/sec.
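  The quoted peak bandwidth follows from the event size and rate given above. A minimal sketch of that arithmetic, assuming decimal units (1 MB = 1000 kB); the function name is illustrative and not part of any D0 software:

```python
def input_bandwidth_mb_per_s(event_size_kb: float, rate_hz: float) -> float:
    """Return the raw input bandwidth in MB/s for a given event size and rate.

    Assumes decimal units: 1 MB = 1000 kB.
    """
    return event_size_kb * rate_hz / 1000.0


# Smallest events at the lowest rate, and largest events at the highest rate,
# using the ranges quoted in the abstract.
low = input_bandwidth_mb_per_s(250, 20)
peak = input_bandwidth_mb_per_s(500, 50)

print(f"low:  {low} MB/s")   # 5.0 MB/s
print(f"peak: {peak} MB/s")  # 25.0 MB/s
```

  The upper corner of the range (500 kB events at 50 Hz) reproduces the 25 MB/sec peak figure.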
  We report on a full system test at 1/2 bandwidth and 1/4 CPU utilization using a prototype system of 5-10 tape drives, 50 dual-processor PCs for reconstruction, and a single 4-processor SMP machine for output merging. Monte Carlo simulated data are stored in the tape library and transferred to the processing nodes and back via the Enstore data access system. The Fermilab Sequential Access via Metadata (SAM) system is used for file delivery and tracking. The Fermilab Batch System (FBS) is used for job control.
  In the D0 farm system, a 'project', or set of files, is defined in advance. The project is submitted to the batch system and at run time is split among N CPUs. Each CPU functions as an essentially independent entity and bootstraps the D0 code configuration and Fermilab products. It then requests the next available file in the 'project' set from the SAM/Enstore system, processes this file, and returns the processing status to the SAM file tracking. Because the processing CPUs pull both the configuration and the individual files across at run time, the system is robust against the loss of any individual CPU. If a process or machine dies, the files currently being processed on that machine are lost, but all subsequent files are correctly requested by and delivered to the remaining CPUs. When the CPU recovers, it becomes available to the batch system again; if home areas and basic Fermilab products are available, it can rebootstrap the appropriate D0-specific environment and resume processing. Automatic retries of files skipped due to processing problems will be added in a later version of the SAM system.
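  The pull model described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: threads stand in for farm CPUs, an in-memory queue stands in for the SAM/Enstore file delivery, and a dictionary stands in for SAM file tracking. The names `run_project` and `crash_on` are hypothetical and do not correspond to any SAM or FBS interface.

```python
import queue
import threading


def run_project(files, n_workers, crash_on=None):
    """Simulate n_workers independent workers pulling files from one project.

    crash_on (hypothetical parameter): name of a file whose processing
    kills the worker handling it. Returns a dict mapping each file name
    to "delivered" (handed out but never finished) or "done".
    """
    pending = queue.Queue()
    for name in files:
        pending.put(name)

    status = {}              # stand-in for the file tracking records
    lock = threading.Lock()  # protects status across threads

    def worker():
        while True:
            try:
                # Each worker asks for the next available file itself.
                name = pending.get_nowait()
            except queue.Empty:
                return       # project exhausted
            with lock:
                status[name] = "delivered"
            if name == crash_on:
                return       # simulated crash: the file is never reported done
            with lock:
                status[name] = "done"

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return status


project = [f"file{i}" for i in range(10)]
result = run_project(project, n_workers=4, crash_on="file3")
lost = sorted(name for name, state in result.items() if state != "done")
print("lost files:", lost)
```

  Because no file is assigned to a worker in advance, the loss of one worker costs only the file it was holding, and the surviving workers drain the rest of the project. In this sketch the lost file is simply left in the "delivered" state; retrying it would be a separate step, as the abstract notes for a later SAM version.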



