Last update: Apr 1, 2000
D0 Data Handling Group
Speaker: Heidi Schellman
The D0 proton-antiproton experiment at Fermilab will begin taking data in early 2001. Events of size 250-500 kB will be written at a rate of 20-50 Hz over a period of several years. These events will be written into a robotic data store and then transferred to a 'farm' of commodity PCs running Linux for reconstruction. The final system is expected to deliver 8000-10000 SpecInt95 and to handle peak input bandwidths of up to 25 MB/sec.
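As a rough consistency check on the figures above, the quoted event sizes and rates can be multiplied out. The short sketch below assumes decimal units (1 MB = 1000 kB); the function name is illustrative and not part of any D0 software.

```python
def data_rate_mb_per_s(event_size_kb, event_rate_hz):
    """Return the raw data rate in MB/sec, assuming 1 MB = 1000 kB."""
    return event_size_kb * event_rate_hz / 1000.0

# Smallest events at the lowest rate, largest events at the highest rate.
low_rate = data_rate_mb_per_s(250, 20)
high_rate = data_rate_mb_per_s(500, 50)

print(f"raw event data rate: {low_rate} to {high_rate} MB/sec")
```

Under this assumption the upper end of the range matches the quoted peak input bandwidth of 25 MB/sec.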
We report on a full system test at 1/2 bandwidth and 1/4 CPU utilization using a prototype system of 5-10 tape drives, 50 dual-processor PCs for reconstruction and a single four-processor SMP machine for output merging. Monte Carlo simulated data is stored in the tape library and transferred to the processing nodes and back via the Enstore data access system. The Fermilab Sequential Access (SAM) system is used for file delivery and tracking. The Fermilab Batch System (FBS) is used for job control.
In the D0 farm system, a 'project', or set of files, is defined in advance. The project is submitted to the batch system and at run time is split among N CPUs. Each CPU functions as an essentially independent entity and bootstraps the D0 code configuration and Fermilab products. It then requests the next available file in the 'project' set from the SAM/Enstore system, processes this file, and returns the processing status to the SAM file tracking. Because the processing CPUs pull both the configuration and the individual files across at run time, the system is robust against the loss of any individual CPU. If a process or machine dies, the files currently being processed on that machine are lost, but all subsequent files are correctly requested by and delivered to the remaining CPUs. When the CPU recovers, it becomes available to the batch system again; if home areas and basic Fermilab products are available, it can rebootstrap the appropriate D0-specific environment and resume processing. Automatic retries of files skipped due to processing problems will be added in a later version of the SAM system.
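The pull model described above can be illustrated with a small, deterministic simulation. This is only a sketch under stated assumptions: `run_project`, the worker names and the file names are hypothetical and do not correspond to the SAM, Enstore or FBS interfaces. Workers take turns requesting the next available file, and one worker can be made to die while processing a given file.

```python
from collections import deque

def run_project(files, workers, crash_on=None):
    """Simulate pull-based file delivery for one 'project'.

    files    -- list of file names making up the project
    workers  -- list of worker (CPU) names
    crash_on -- optional (worker, file) pair: that worker dies
                while processing that file (hypothetical failure)
    Returns a dict mapping each delivered file to (worker, status).
    """
    pending = deque(files)   # files not yet delivered to any worker
    alive = list(workers)    # workers still able to request files
    status = {}
    while pending and alive:
        for worker in list(alive):
            if not pending:
                break
            name = pending.popleft()  # worker requests the next available file
            if crash_on == (worker, name):
                # The file in flight on the dead worker is lost; no retry here.
                status[name] = (worker, "lost")
                alive.remove(worker)
            else:
                status[name] = (worker, "done")
    return status

result = run_project(
    ["f1", "f2", "f3", "f4", "f5"],
    ["cpu0", "cpu1"],
    crash_on=("cpu1", "f2"),
)
for name, (worker, state) in result.items():
    print(name, worker, state)
```

In this run only the file in flight on the failed worker is marked lost, and every later file is delivered to the surviving worker. Worker recovery and automatic retries are deliberately left out of the sketch.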