from JLab Scientific Computing
-- BEGIN included message
- To: jlab-scicomp-briefs@jlab.org
- Subject: Batch Compute Farm Outage and Changes 7/14/09
- From: Sandy Philpott <Sandy.Philpott@jlab.org>
- Date: Mon, 13 Jul 2009 07:07:30 -0400
- Reply-To: Sandy.Philpott@jlab.org
- User-Agent: Thunderbird 2.0.0.22 (Windows/20090605)
With the batch farm scheduled for shutdown tomorrow morning, only jobs submitted today that request less than the default 24 hours of walltime will be able to run.

-------

The experimental physics batch compute farm will be unavailable on Tuesday, July 14, 2009, for hardware and software upgrades, including support for new CentOS 5.3 64-bit systems. All jobs in the system not completed in time for the upgrade will be drained and will require resubmission.

On Tuesday 7/14, a new PBS server system, FARMPBS, will be installed, and changes to Auger will improve the throughput and capacity of the farm:

1. Just-in-time file caching -- input files will be copied into the cache area closer to the time the job actually runs, instead of immediately when the job is submitted.
2. jkill will now cancel Auger and JASMine requests at the same time.
3. The cache filesystem will be cleaned in a more timely manner (no more unnecessarily pinned files).
4. The batch server will no longer send email itself. All email is handled (or not) by Auger through the jsub script.
5. Another 80 CPUs will be added to the Fedora 8 farm.
6. The way memory allocation and enforcement is done on the farm will change.
7. An issue will be resolved where a job requesting already-cached data had to go through the JASMine accounting system unnecessarily, even though the JASMine request was in fact already fulfilled.

Additionally, a simulation project has bought 10 nodes that will be integrated into the farm, so that other farm users will have access to them when they are not in use. The systems are dual 2.8 GHz Nehalem quad-core with hyperthreading, for 16 concurrent processes or threads per box, with 24 GBytes of memory and a 500 GByte disk. These systems are 10x or more faster than the oldest farm nodes. They are running CentOS 5.3 64-bit as a testbed for the evolution of the farm.
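For reference, the walltime limit above is a standard PBS resource request. The sketch below shows a hypothetical PBS batch script asking for less than the default 24 hours; real farm submissions go through Auger's jsub wrapper, so the exact options on the farm may differ, and the script name and contents here are illustrative only.

```shell
# Hypothetical sketch: a PBS job script requesting 12 hours of
# walltime (under the 24-hour default) via standard PBS directives.
cat > myjob.pbs <<'EOF'
#!/bin/sh
#PBS -l walltime=12:00:00
#PBS -l nodes=1:ppn=1
echo "starting analysis"
EOF
# A job like this would be submitted with: qsub myjob.pbs
```

Jobs that omit the walltime request fall back to the 24-hour default and therefore cannot start once the shutdown is less than 24 hours away.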
The oldest farm nodes are past end of life, and 83 of them will soon be turned off, including six-year-old former LQCD nodes that are memory-lean. Additional nodes will be added to the 64-bit OS batch queue as load in that queue goes up.

Early tester volunteers can contact their Computing Coordinator for access to the CentOS nodes:

- Hall A - Bob Michaels, rom@jlab.org
- Hall B - Dennis Weygand, weygand@jlab.org
- Hall C - Steve Wood, saw@jlab.org
- Hall D - Mark Ito, marki@jlab.org
- Phy - Graham Heyes, heyes@jlab.org
- CASA - Yves Roblin, roblin@jlab.org

Please contact Sandy.Philpott@jlab.org, 757-269-7152, with questions, suggestions, or comments.
-- END included message