Beowulf Batch Processors and Job Schedulers
Edward L. Haletky and Patrick Lampert
Beowulf and grid technology provide an attractive mechanism to build powerful
compute clusters with inexpensive off-the-shelf components. Yet this technology
also introduces new and complex scheduling problems. How do you distribute your
work over the cluster? Can you schedule your work for times of lower activity?
These issues are addressed by a variety of job-scheduling and load-balancing
software tools that we will examine in this article. We review seven different
queuing engines that can manage your resources, schedule jobs, and even interlock
runs based on execution dependencies. These seven systems range from the simplest
cron-related tools to grid engines.
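To make the idea of interlocking runs on execution dependencies concrete, here is a minimal sketch in Python. It is not the mechanism used by any of the systems reviewed here; the job names and the dispatch_order helper are hypothetical and serve only to show how a scheduler can derive a valid run order from a table of prerequisites.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job names (assumptions for illustration only).
# Each job maps to the set of jobs that must finish before it may start.
dependencies = {
    "preprocess": set(),
    "simulate": {"preprocess"},
    "analyze": {"simulate"},
    "report": {"analyze"},
}


def dispatch_order(deps):
    """Return a run order in which every job follows its prerequisites."""
    return list(TopologicalSorter(deps).static_order())


order = dispatch_order(dependencies)
print(order)
```

A real queuing system would also have to track job exit status and hold dependent jobs when a prerequisite fails, which this sketch leaves out.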
For each system, we examine ease of installation and configuration, the
creation of a single queue, the steps required to submit an extremely simple
job, and the steps needed to dispatch jobs by time of day without relying on
cron to enable and disable queues. We also comment on multi-node capability,
support, and security, and then present a simple chart to help you pick your
job-scheduling and queuing software.
The systems discussed are: at(1)/bbq(1), Clusterware/Load Sharing Facility
(LSF), Condor, Generic Network Queuing System (GNQS), GNU Queue, Open Portable
Batch System (OpenPBS), and the Sun Grid Engine (SGE). Although two of these
systems (Condor and SGE) are primarily grid engines, all seven provide a way
to queue jobs for execution as resources allow.
Each system was installed on a Scyld Beowulf cluster with six slave nodes
built from off-the-shelf spare parts. The queuing agents were installed on
the master node, leaving the six slave nodes as computational nodes.