from JLab Scientific Computing
-- BEGIN included message
- To: jlab-scicomp-briefs@jlab.org
- Subject: Batch Compute Farm Outage and Changes 7/14/09
- From: Sandy Philpott <Sandy.Philpott@jlab.org>
- Date: Mon, 13 Jul 2009 07:07:30 -0400
- Reply-To: Sandy.Philpott@jlab.org
- User-Agent: Thunderbird 2.0.0.22 (Windows/20090605)
With the batch farm scheduled for shutdown tomorrow morning, only jobs submitted today that request less than the default 24 hours of walltime will be able to run.

-------

The experimental physics batch compute farm will be unavailable on Tuesday, July 14, 2009, for hardware and software upgrades, including support for new CentOS 5.3 64-bit systems. All jobs in the system not completed in time for the upgrade will be drained and will require resubmission.

On Tuesday 7/14, a new PBS server system, FARMPBS, will be installed, and changes to Auger will improve the throughput and capacity of the farm:

1. Just-in-time file caching -- input files will be copied into the cache area closer to the time the job actually runs, instead of immediately when the job is submitted.
2. jkill will now cancel Auger and JASMine requests at the same time.
3. The cache filesystem will be cleaned in a more timely manner (no more unnecessarily pinned files).
4. The batch server will no longer send email itself. All email is handled (or not) by Auger through the jsub script.
5. Another 80 CPUs will be added to the Fedora 8 farm.
6. The way memory allocation and enforcement is done on the farm will change.
7. An issue will be resolved where a job requesting already-cached data had to go through the JASMine accounting system unnecessarily, even though the JASMine request was in fact already fulfilled.

Additionally, a simulation project has bought 10 nodes that will be integrated into the farm, so that other farm users will have access to them when they are not in use. The systems are dual 2.8 GHz Nehalem quad-core with hyperthreading, for 16 concurrent processes or threads per box, with 24 GBytes of memory and a 500 GByte disk. These systems are 10x or more faster than the oldest farm nodes. They are running CentOS 5.3 64-bit as a testbed for the evolution of the farm.
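For reference, the walltime limit above is a standard PBS resource request. The sketch below shows a hypothetical PBS batch script asking for less than the default 24 hours; real farm submissions go through Auger's jsub wrapper, so the exact options on the farm may differ, and the script name and contents here are illustrative only.

```shell
# Hypothetical sketch: a PBS job script requesting 12 hours of
# walltime (under the 24-hour default) via standard PBS directives.
cat > myjob.pbs <<'EOF'
#!/bin/sh
#PBS -l walltime=12:00:00
#PBS -l nodes=1:ppn=1
echo "starting analysis"
EOF
# A job like this would be submitted with: qsub myjob.pbs
```

Jobs that omit the walltime request fall back to the 24-hour default and therefore cannot start once the shutdown is less than 24 hours away.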
The oldest farm nodes are past end of life, and 83 of them will soon be turned off, including six-year-old former LQCD nodes that are memory-lean. Additional nodes will be added to the 64-bit OS batch queue as load in that queue goes up.

Early tester volunteers can contact their Computing Coordinator for access to the CentOS nodes:

- Hall A - Bob Michaels, rom@jlab.org
- Hall B - Dennis Weygand, weygand@jlab.org
- Hall C - Steve Wood, saw@jlab.org
- Hall D - Mark Ito, marki@jlab.org
- Phy - Graham Heyes, heyes@jlab.org
- CASA - Yves Roblin, roblin@jlab.org

Please contact Sandy.Philpott@jlab.org, 757-269-7152, with questions, suggestions, or comments.
-- END included message