LoadLeveler on Cheetah


Introduction

Because Cheetah only has four login CPUs, resource-intensive sequential jobs and all parallel jobs must be submitted through LoadLeveler. (If the login node gets overloaded with user processes, we may be forced to halt processes that use more than their fair share of the login node.)

LoadLeveler is the batch-job scheduler for Cheetah. It also allocates nodes for interactive parallel jobs. This document provides information for getting started with the batch facilities of LoadLeveler.


Classes

In LoadLeveler parlance, the term "class" is analogous to the term "queue" in other batch systems. Different users may have access to different classes, and different classes may have different job limits or may target different nodes.

Use the "llclass" command to see the current list of classes.

$ llclass
Name                 MaxJobCPU     MaxProcCPU  Free   Max  Description
                    d+hh:mm:ss     d+hh:mm:ss Slots Slots
--------------- -------------- -------------- ----- -----  ---------------------
hsmq                 undefined      undefined     4     4  HSMQ jobs
interactive          undefined      undefined    90   592  Interactive POE jobs
batch                undefined      undefined    74   576  Batch jobs
climate_prod         undefined      undefined   139   448  Production climate runs
No_Class             undefined      undefined     0     0  
sys                  undefined      undefined    74   576  System administration
special              undefined      undefined    71   544  Benchmarking, testing, and other special cases.
climate_dev          undefined      undefined    74   576  Development climate runs
gyro                 undefined      undefined    74   576  Plasma Microturbulence Project
recon                undefined      undefined    74   576  Magnetic Reconnection - expires June 4, 2003
--------------------------------------------------------------------------------
"Maximum Slots" value of the class "No_Class" is constrained by the MAX_STARTERS limit(s).
"Free Slots" values of the classes "No_Class", "interactive", "batch", "climate_prod",
"sys", "climate_dev", "gyro", "recon", "special" are constrained by the MAX_STARTERS limit(s).
The "batch" class is the default class for jobs submitted as LoadLeveler scripts: if you do not specify a class, your job goes to "batch". However, if you set the class keyword to an empty value, your job will be put in an "empty" class, which is defined nowhere, and the job will not run. The "interactive" class is the default class for interactive-shell "poe" jobs. Other classes are for specific sets of users, such as system administrators. You can set the class with the following:
#@ class = batch

Each "Slots" number represents the number of "job instances" that may be started in the given class. For MPI jobs, this is the number of MPI processes that may run under the given class. It is typically equivalent to the number of processors that allow the class. "Max Slots" represents the total number of slots configured on the system, and "Free Slots" represents the number of slots that are currently not occupied.

This number is misleading, however. A 32-processor node may have 32 slots each for five classes, for example. Because nodes are typically dedicated to a single job, only 32 of the node's 160 slots can be allocated at a time. The rest appear to be "free", although they are not usable.
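
The arithmetic behind this example can be sketched in a few lines of shell. The numbers are the illustrative ones from the paragraph above (one 32-processor node serving five classes); they are not read from LoadLeveler.

```shell
# Illustrative numbers from the text: one 32-processor node, five classes.
procs_per_node=32
classes_on_node=5

# Every class advertises a full set of slots on the node.
advertised_slots=$((procs_per_node * classes_on_node))

# A dedicated node can run at most one slot per processor.
usable_slots=$procs_per_node
phantom_slots=$((advertised_slots - usable_slots))

echo "advertised=$advertised_slots usable=$usable_slots phantom=$phantom_slots"
```

The "phantom" slots are the ones that "llclass" reports as free even though no job can use them.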

"MaxJobCPU" and "MaxProcCPU" indicate the per-job and per-process aggregate CPU time limits. None of the classes listed here have CPU time limits; this is not particularly useful information because the classes do have wall-clock time limits.

You can get more information on a class, such as its wall-clock time limit, using "llclass -l".

$ llclass -l batch
=============== Class batch ===============
                Name: batch
            Priority: 0
       Exclude_Users: 
       Include_Users: 
      Exclude_Groups: 
      Include_Groups: 
               Admin: 
           NQS_class: F
          NQS_submit: 
           NQS_query: 
      Max_processors: -1
             Maxjobs: -1
Resource_requirement: ConsumableCpus(1) ConsumableMemory(256.000 mb)
       Class_comment: Batch jobs
      Class_ckpt_dir: 
          Ckpt_limit: undefined, undefined
    Wall_clock_limit: 12:00:00, undefined (43200 seconds, undefined)
       Job_cpu_limit: undefined, undefined
           Cpu_limit: undefined, undefined
          Data_limit: undefined, undefined
          Core_limit: undefined, undefined
          File_limit: undefined, undefined
         Stack_limit: undefined, undefined
           Rss_limit: undefined, undefined
                Nice: 0
          Free_slots: 80
       Maximum_slots: 80
    Execution_factor: 1
     Max_total_tasks: -1
       Preempt_class: 
         Start_class: 
The most useful information here is the "Wall_clock_limit", which is set to twelve hours. This is a hard upper limit for any job submitted to the "batch" class. The "undefined" indicates there is no soft limit. You may wish to "grep" for useful nuggets like this from the full listing, as in the following example.
$ llclass -l | egrep "Name|Wall_clock_limit"
                Name: hsmq
    Wall_clock_limit: 04:00:00, undefined (14400 seconds, undefined)
                Name: interactive
    Wall_clock_limit: 02:05:00, undefined (7500 seconds, undefined)
                Name: batch
    Wall_clock_limit: 12:00:00, undefined (43200 seconds, undefined)
                Name: climate_prod
    Wall_clock_limit: 1+00:00:00, undefined (86400 seconds, undefined)
                Name: No_Class
    Wall_clock_limit: undefined, undefined
                Name: sys
    Wall_clock_limit: 1+00:00:00, undefined (86400 seconds, undefined)
                Name: special
    Wall_clock_limit: 1+00:00:00, undefined (86400 seconds, undefined)
                Name: climate_dev
    Wall_clock_limit: 12:00:00, undefined (43200 seconds, undefined)
                Name: gyro
    Wall_clock_limit: 1+00:00:00, undefined (86400 seconds, undefined)
                Name: recon
    Wall_clock_limit: 1+00:00:00, undefined (86400 seconds, undefined)
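
If you prefer one line per class, the two-line listing can be joined with "awk". The following is a minimal sketch; the function name "pair_limits" is made up for this example, and the sample input is copied from the listing above so that the sketch runs without LoadLeveler.

```shell
# Join each class name with its hard wall-clock limit on one line.
# Input: lines in the format printed by
#   llclass -l | egrep "Name|Wall_clock_limit"
pair_limits() {
    awk '
        $1 == "Name:"             { name = $2 }
        $1 == "Wall_clock_limit:" { sub(/,$/, "", $2); print name, $2 }
    '
}

# Sample input copied from the listing above.
pair_limits <<'EOF'
                Name: batch
    Wall_clock_limit: 12:00:00, undefined (43200 seconds, undefined)
                Name: climate_prod
    Wall_clock_limit: 1+00:00:00, undefined (86400 seconds, undefined)
                Name: No_Class
    Wall_clock_limit: undefined, undefined
EOF
```

On Cheetah the function would presumably be used as the last stage of the pipeline shown above, for example by appending "| pair_limits" to the "egrep" command.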

System status

Through the "Free Slots" entries, the "llclass" command can give some information about the status of the system and what your chances are for running jobs immediately. As mentioned above, however, this information is misleading. For more accurate information about the load on the system, use the "llstatus" command.
$ llstatus
Name                      Schedd  InQ Act Startd Run LdAvg Idle Arch      OpSys
cheetah01.ccs.ornl.gov  Avail     0   0 Busy    32 34.40 9999 RS6000    AIX51    
cheetah02.ccs.ornl.gov  Avail     1   1 Run      8 20.77 9999 RS6000    AIX51    
cheetah03.ccs.ornl.gov  Avail   204  21 Idle     0 2.75     1 RS6000    AIX51    
cheetah04.ccs.ornl.gov  Avail     0   0 Busy    32 34.02 9999 RS6000    AIX51    
...
cheetah27.ccs.ornl.gov  Avail     0   0 Busy    32 32.10 9999 RS6000    AIX51    
cheetah41.ccs.ornl.gov  Avail     0   0 Run      0  0.00 9999 RS6000    AIX51    
cheetah42.ccs.ornl.gov  Avail     0   0 Run      0  0.00 9999 RS6000    AIX51    
...
cheetah48.ccs.ornl.gov  Avail     0   0 Run      0  0.00 9999 RS6000    AIX51    
RS6000/AIX51               35 machines    208  jobs    566  running
Total Machines             35 machines    208  jobs    566  running

The Central Manager is defined on cheetah15.ccs.ornl.gov

The BACKFILL scheduler is in use

All machines on the machine_list are present.
The "Schedd" column indicates whether the node is able to schedule LoadLeveler jobs; "Avail" means it can. "InQ" gives the number of current jobs submitted from (not running on) the given node, and "Act" gives the number of those jobs that are actually running (on other nodes). "Startd" indicates whether any jobs are running on the given node, and "Run" indicates the number of job instances that are running.

For most Cheetah nodes, the "Run" number will be equal to or less than the number of processors in that node. The same job can have more than one instance running on a given node; for example, a 32-processor node may have 32 MPI processes from the same job.

"LdAvg" is the Berkeley one-minute load average, and "Idle" is the time in seconds since the last keyboard or mouse activity on the node. For Cheetah nodes, "Idle" is often "9999".

The lines at the bottom of the output indicate that 35 machines (including the control workstation) are currently under the control of LoadLeveler. On these nodes, 208 jobs are running, and those jobs consume 566 slots. Because one slot can represent a single-thread or multiple-thread process, slots are equivalent to neither processors nor nodes.

A more effective way to determine what resources Cheetah has, along with which of those resources are available, is to use the "-R" option. This causes "llstatus" to display "consumable resources", usable processors and memory.

$ llstatus -R
Machine                        Consumable Resource(Available, Total)
------------------------------ -------------------------------------------------
cheetah04c.ccs.ornl.gov        ConsumableCpus(32,32) ConsumableMemory(32.000 gb,32.000 gb) ConsumableScratch(160,160)
cheetah06c.ccs.ornl.gov        ConsumableCpus(0,32) ConsumableMemory(0.000 mb,32.000 gb) ConsumableScratch(160,160)
cheetah07c.ccs.ornl.gov        ConsumableCpus(0,32) ConsumableMemory(0.000 mb,32.000 gb) ConsumableScratch(160,160)
cheetah12c.ccs.ornl.gov        ConsumableCpus(32,32) ConsumableMemory(32.000 gb,32.000 gb) ConsumableScratch(160,160)
cheetah13c.ccs.ornl.gov        ConsumableCpus(32,32) ConsumableMemory(32.000 gb,32.000 gb) ConsumableScratch(160,160)
cheetah14c.ccs.ornl.gov        ConsumableCpus(32,32) ConsumableMemory(32.000 gb,32.000 gb) ConsumableScratch(160,160)
cheetah15c.ccs.ornl.gov        ConsumableCpus(16,32) ConsumableMemory(112.000 gb,128.000 gb)
cheetah16c.ccs.ornl.gov        ConsumableCpus(32,32) ConsumableMemory(128.000 gb,128.000 gb)
cheetah20c.ccs.ornl.gov        ConsumableCpus(24,32) ConsumableMemory(24.000 gb,32.000 gb)
cheetah21c.ccs.ornl.gov        ConsumableCpus(24,32) ConsumableMemory(24.000 gb,32.000 gb)
cheetah26c.ccs.ornl.gov        ConsumableCpus(24,32) ConsumableMemory(24.000 gb,32.000 gb)
cheetah27c.ccs.ornl.gov        ConsumableCpus(24,32) ConsumableMemory(24.000 gb,32.000 gb)
cheetah41c.ccs.ornl.gov        ConsumableCpus(3,3) ConsumableMemory(3.000 gb,3.000 gb)
cheetah42c.ccs.ornl.gov        ConsumableCpus(3,3) ConsumableMemory(3.000 gb,3.000 gb)
cheetah43c.ccs.ornl.gov        ConsumableCpus(3,3) ConsumableMemory(3.000 gb,3.000 gb)
cheetah44c.ccs.ornl.gov        ConsumableCpus(3,3) ConsumableMemory(3.000 gb,3.000 gb)
cheetah45c.ccs.ornl.gov        ConsumableCpus(3,3) ConsumableMemory(3.000 gb,3.000 gb)
cheetah46c.ccs.ornl.gov        
cheetah47c.ccs.ornl.gov        ConsumableCpus(3,3) ConsumableMemory(3.000 gb,3.000 gb)
cheetah48c.ccs.ornl.gov 
LoadL_startd daemons of machines with "#" appended to their names are down.
This command displays the number of processors and the amount of memory on each node as (available, total) pairs. For example, node 06 has 32 processors and 32 GB of memory configured for LoadLeveler jobs. All 32 processors are in use, all 32 GB of memory is taken, and all of the local scratch space is still available. Some of the nodes have 160 GB of local scratch space ($NODE_JOBDIR); some do not.

All of the p690 nodes are configured as 32-way SMP nodes. We have some p655s configured into LoadLeveler for interactive use; these are numbered 41-48 and also serve as I/O server nodes.

Notice that nodes with "#" appended to their names are not available to LoadLeveler.
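
To pick out the nodes that still have free processors, the "llstatus -R" listing can be filtered with "sed" and "awk". This is a sketch only: the function name "free_cpus" is made up for this example, and the sample lines are copied from the listing above so that the sketch runs without LoadLeveler.

```shell
# Print "machine available total" for every node with free processors.
# Input: lines in the format printed by "llstatus -R".
free_cpus() {
    sed -n 's/^\([^ ]*\) .*ConsumableCpus(\([0-9]*\),\([0-9]*\)).*/\1 \2 \3/p' |
        awk '$2 > 0'
}

# Sample input copied from the listing above.
free_cpus <<'EOF'
cheetah04c.ccs.ornl.gov        ConsumableCpus(32,32) ConsumableMemory(32.000 gb,32.000 gb) ConsumableScratch(160,160)
cheetah06c.ccs.ornl.gov        ConsumableCpus(0,32) ConsumableMemory(0.000 mb,32.000 gb) ConsumableScratch(160,160)
cheetah15c.ccs.ornl.gov        ConsumableCpus(16,32) ConsumableMemory(112.000 gb,128.000 gb)
cheetah46c.ccs.ornl.gov
EOF
```

Lines without a "ConsumableCpus" entry, such as the header and nodes that report no resources, are skipped.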

Some of the columns of default "llstatus" output are not particularly useful, and "llstatus" is capable of displaying useful information that is not shown by default. To remedy this, you can configure the output generated by "llstatus" on the command line. Here is an example configuration.

$ llstatus -f %n %mt %r %l %v %scs %sts
Name                    MaxT  Run     LdAvg   FreeVMemory  Schedd   Startd  
cheetah01.ccs.ornl.gov  32    32      32.05   66048868     Avail    Busy    
cheetah02.ccs.ornl.gov  32    8       16.92   66052976     Avail    Run     
cheetah03.ccs.ornl.gov  32    0       2.78    65774208     Avail    Idle    
cheetah04.ccs.ornl.gov  32    32      33.13   66034208     Avail    Busy    
...
cheetah27.ccs.ornl.gov  32    8       19.23   66053208     Avail    Run     
cheetah41.ccs.ornl.gov  3     0       8.29    8904108      Avail    Idle    
cheetah42.ccs.ornl.gov  3     0       8.54    8902292      Avail    Idle    
...
cheetah45.ccs.ornl.gov  3     0       41.56   4706172      Avail    Idle    
cheetah47.ccs.ornl.gov  3     0       41.56   4706172      Avail    Idle    
...

This example prunes out some of the default information and adds "MaxT" and "FreeVMemory". "MaxT" gives the maximum number of job instances (regardless of class) that may run on the given host at a time, and "FreeVMemory" gives the available swap space, in kilobytes. See "man llstatus" for more information on configuring output. You may want to create an "alias" for the "llstatus" configuration you prefer.
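
As a sketch of such an alias, the following line could go in your ".profile" if you use "ksh". The alias name "lls" is arbitrary; the format string is the one from the example above.

```shell
# Hypothetical alias name "lls"; the format string is taken from the example above.
# For csh, the equivalent would be:  alias lls 'llstatus -f %n %mt %r %l %v %scs %sts'
alias lls='llstatus -f %n %mt %r %l %v %scs %sts'
```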


Job command files

To run a batch job under LoadLeveler, you first need to write a job command file. LoadLeveler command files have two components: LoadLeveler keyword statements and shell commands. The LoadLeveler keyword statements are preceded by "#@", making them appear as comments to a shell. The shell commands follow the "#@ queue" keyword statement and represent the executable content of the batch job.

A nice feature LoadLeveler provides is the ability to define job prolog and epilog scripts. If you have steps that should take place at the beginning or end of all your jobs, you can define a prolog and/or epilog script and code these steps once for all your jobs.

Before starting your job, LoadLeveler looks for an environment variable named $MY_CCS_LOADL_PROLOG. If it contains the path of an executable file, that file is run before starting your job script. If $MY_CCS_LOADL_PROLOG is not defined and $HOME/llprolog exists and is executable, it will be run.

Similarly, after your job completes, if $MY_CCS_LOADL_EPILOG is defined and contains the path of an executable file, that file will be run. If $MY_CCS_LOADL_EPILOG is not defined, but $HOME/llepilog exists and is executable, it will be run.

Generally, defining and exporting these environment variables in your .profile (or setting them with setenv in your .login if you use a csh variant) is sufficient to define them to LoadLeveler.
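
The lookup rule for the prolog can be sketched as a small shell function. This illustrates the rule as described above; it is not the code that LoadLeveler or the site configuration actually runs. The function name "pick_prolog" and the path "$HOME/bin/my_prolog" are hypothetical, and the sketch treats an empty variable as undefined.

```shell
# Print the path of the prolog that would run, or nothing if there is none.
pick_prolog() {
    if [ -n "${MY_CCS_LOADL_PROLOG:-}" ]; then
        # Variable is defined: use it only if it names an executable file.
        if [ -f "$MY_CCS_LOADL_PROLOG" ] && [ -x "$MY_CCS_LOADL_PROLOG" ]; then
            echo "$MY_CCS_LOADL_PROLOG"
        fi
    elif [ -f "$HOME/llprolog" ] && [ -x "$HOME/llprolog" ]; then
        # Variable is not defined: fall back to $HOME/llprolog.
        echo "$HOME/llprolog"
    fi
}

# Example setup in ksh syntax, as it might appear in .profile.
MY_CCS_LOADL_PROLOG=$HOME/bin/my_prolog   # hypothetical path
export MY_CCS_LOADL_PROLOG
pick_prolog
```

The same logic applies to the epilog, with $MY_CCS_LOADL_EPILOG and $HOME/llepilog in place of the prolog names.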

Below you will find examples of various command files, specifying different parallel paradigms and resource requirements.


What not to do

When porting job command files from other systems, such as Eagle or Seaborg, there are a few LoadLeveler statements you should not use on Cheetah.

#@ node_usage = not_shared

This statement requests that each node associated with your job is not shared with any other jobs. Cheetah has large, powerful nodes. If your job does not use all the resources on a node, you need to make the remaining resources available to other users. If your job does need all the resources on a node, there are more appropriate ways to request those resources than with "not_shared". See the examples below for details.

#@ network.MPI = csss,not_shared,US

The "not_shared" in this statement requests that the SP Switch2 interconnect on each node be reserved for exclusive use by the requesting job. If your job does not use all the resources on a node, you need to make the interconnect available to other jobs. If your job does use all the resources on a node, requesting "not_shared" is unnecessary since no other jobs will be allowed on the node anyway. Please use "shared".

No ending newline ("csh" only)

Make sure that your command file ends with a newline. If it does not, LoadLeveler will not execute the last command in your file. You can use the following command to check your file.

tail command_file
You need to add a newline (using "return" or "enter" in an editor) if the next command prompt appears on the same line as the last line of the file. Here is an example of such a case.
cheetah48% tail csh.ll
#@ error = $(host).$(jobid).err
#@ wall_clock_limit = 30:00
#@ tasks_per_node = 32
#@ node = 1
#@ queue
pwd
echo $LOADL_PROCESSOR_LIST
setenv MP_SHARED_MEMORY yes
poe a.outcheetah48%
If a newline is not added to this file, the command "poe a.out" will not be executed when the job runs!
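
The check can also be automated. The following sketch relies on "tail -c 1" to print the last byte of the file; the function name "ends_with_newline" is made up for this example, and the two temporary files stand in for real command files.

```shell
# Return success if the file is non-empty and ends with a newline.
# "tail -c 1" prints the last byte; command substitution strips a trailing
# newline, so an empty result means the last byte was a newline.
ends_with_newline() {
    [ -s "$1" ] && [ -z "$(tail -c 1 "$1")" ]
}

# Demonstration on two temporary files.
good=$(mktemp); bad=$(mktemp)
printf 'poe a.out\n' > "$good"
printf 'poe a.out'   > "$bad"

for f in "$good" "$bad"; do
    if ends_with_newline "$f"; then
        echo "ok: file ends with a newline"
    else
        echo "fix: add a newline at the end of the file"
    fi
done
rm -f "$good" "$bad"
```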

Multiple "resources" lines

As described below, you can use the "resources" keyword to define how many processors per task and how much memory per task your job needs.

Always make all your resource requests on one line, because multiple requests are not additive. Each "resources" line logically overwrites all previous lines. For example, the following lines result in the default number of processors per task (1, not 8) and no scratch space reserved!

  #@ resources = ConsumableCpus(8)
  #@ resources = ConsumableScratch(4)
  #@ resources = ConsumableMemory(1 gb)
Use a single line instead.
  #@ resources = ConsumableCpus(8) ConsumableMemory(1 gb) ConsumableScratch(4)

NOTE: Requested consumable resources are per task (that is, per process), not per job.
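
A quick way to catch this mistake before submitting is to count the "resources" statements in the command file. This is a minimal sketch that assumes a command file with a single job step; the function name "count_resources_lines" is made up for this example.

```shell
# Count "#@ resources = ..." statements in a job command file.
count_resources_lines() {
    grep -c '^[[:space:]]*#@[[:space:]]*resources[[:space:]]*=' "$1" || true
}

# Demonstration on a temporary file with the faulty lines from above.
job=$(mktemp)
cat > "$job" <<'EOF'
#@ resources = ConsumableCpus(8)
#@ resources = ConsumableScratch(4)
#@ resources = ConsumableMemory(1 gb)
#@ queue
EOF

n=$(count_resources_lines "$job")
if [ "$n" -gt 1 ]; then
    echo "warning: $n resources lines; each one overwrites the previous ones"
fi
rm -f "$job"
```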


MPI jobs

Here is an example command file for a parallel MPI job.
#@ shell = /bin/ksh
#@ job_type = parallel
#@ network.MPI = csss,shared,US
#@ output = $(host).$(jobid).out
#@ error = $(host).$(jobid).err
#@ wall_clock_limit = 30:00
#@ tasks_per_node = 32
#@ node = 2
#@ queue
pwd
echo $LOADL_PROCESSOR_LIST
export MP_SHARED_MEMORY=yes
poe a.out

Here is a description of each line. The script has no line specifying "class", so the default class, "batch", will be used.

#@ shell = /bin/ksh

Use the Korn shell, "ksh", to interpret the command file. By default, LoadLeveler interprets the command file using your login shell. The sample script is written in "ksh" syntax, so the explicit request of "ksh" allows it to work regardless of your login shell. If you prefer to use C-shell syntax, make the following changes to the sample command file.
Korn shell                       C shell
#@ shell = /bin/ksh              #@ shell = /bin/csh
export MP_SHARED_MEMORY=yes      setenv MP_SHARED_MEMORY yes

#@ job_type = parallel

Use multiple nodes for parallel commands. This keyword is required for parallel jobs. The keywords "tasks_per_node", "node", etc. won't work without it.

#@ network.MPI = csss,shared,US

For MPI communication, use the SP Switch2 with the User Space protocol. This line requests that parallel MPI programs use the fastest form of internode communication available on the SP, User Space (US) protocol over the SP Switch2 using both switch interfaces on each node (device "csss").

A separate "network" keyword is allowed for IBM's Low-Level Application Programming Interface, "network.LAPI".

#@ output = $(host).$(jobid).out

Send standard output to the file "$(host).$(jobid).out". "$(host)" is a LoadLeveler variable that represents the host where the job was submitted. It is not necessarily related to where the job runs. "$(jobid)" is the numeric ID of the job. Each "$(jobid)" is unique among jobs submitted from a particular host but not necessarily unique across LoadLeveler; two jobs submitted from two different hosts can have the same value for "$(jobid)". The combination "$(host).$(jobid)" is unique, however. For example, two such jobs would both write to "131.out" if only "$(jobid)" were used, but they write to "cheetah0001.131.out" and "cheetah0017.131.out" when "$(host)" is included. Another useful variable is "$(executable)", which represents the name of the LoadLeveler command file.

Unless you specify a full path, the output file is stored in the directory from which you submitted the job. If you don't specify the "output" keyword, the standard output is not saved.

#@ error = $(host).$(jobid).err

Send standard error output to the file "$(host).$(jobid).err". See the information above for the "output" keyword. You can send standard output and standard error to the same file.

#@ wall_clock_limit = 30:00

Limit the job to 30 minutes of real time. If you do not specify a "wall_clock_limit", your job will get the default limit of two hours, regardless of class. For jobs longer than two hours, you must specify a longer limit. For shorter jobs, specifying a shorter time limit may allow the scheduler to fit your job in earlier.

#@ tasks_per_node = 32

Use 32 tasks per node for parallel jobs. A task is equivalent to a process, and a single task may have multiple threads. This line specifies that 32 tasks, 32 MPI processes in this case, should be started on each node. Note that there are only 32-way nodes on Cheetah.

#@ node = 2

Allocate 2 nodes for parallel commands. Yes, the keyword is "node", not "nodes".

#@ queue

Queue the job! This keyword is critical. Without it, no job is created. Each "queue" keyword uses the environment specified by the keywords listed before it, so make sure to put it after the other relevant keywords.

The remaining lines of the file specify the shell commands to be executed by the batch job. All sequential commands, such as the first three commands in this example, run on only the first node allocated to the job. Parallel commands start multiple processes spread across all allocated nodes.

pwd

Display the name of the current working directory. The job starts in the directory where the job was submitted. This behavior is different from some other batch systems, which always start jobs in the user's home directory.

echo $LOADL_PROCESSOR_LIST

Display the nodes allocated to this job. LoadLeveler automatically sets the value of the environment variable "LOADL_PROCESSOR_LIST" to a list of the nodes allocated for the given job. Printing this list in each job can help diagnose system problems. If you have more than 128 tasks, however, do not print this variable. LoadLeveler has trouble printing this for more than 128 tasks; it may cause your job to fail.

export MP_SHARED_MEMORY=yes

Use shared memory for MPI. IBM's MPI can implement communication within a node using shared memory. This implementation greatly improves the bandwidth and latency of on-node communication without affecting communication between nodes. This is used by default so you don't need to set it in your batch script, but be aware that it uses extra memory. If you wish to turn it off, in "ksh" use "export MP_SHARED_MEMORY=no". For "csh", use "setenv MP_SHARED_MEMORY no" instead.

To take advantage of this shared-memory optimization, an MPI code must be compiled with the thread-safe version of the MPI library, i.e. using "mpxlf_r" or "mpcc_r".

poe a.out

Run 64 copies of "a.out" across 2 nodes. If "a.out" is not a parallel program, this command will run 64 identical copies on 2 different nodes. If "a.out" is parallel (compiled with "mpxlf", "mpcc", etc.), it will run as a single 64-process application across 2 nodes. Specifying "poe" is optional for programs compiled to be parallel.

Note that POE options specified through LoadLeveler keyword commands ("node", "tasks_per_node", "network", etc.) override options on the "poe" command line.
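
The number of processes that "poe" starts follows from the "node" and "tasks_per_node" keywords. The following sketch reads both values from a command file and multiplies them; the helper name "ll_keyword" is made up for this example, and the temporary file stands in for the sample command file above.

```shell
# Print the value of a "#@ keyword = value" statement in a command file.
# If the keyword appears more than once, the last value is printed.
ll_keyword() {
    sed -n "s/^[[:space:]]*#@[[:space:]]*${1}[[:space:]]*=[[:space:]]*//p" "$2" | tail -n 1
}

# Temporary copy of the relevant lines from the sample command file.
job=$(mktemp)
cat > "$job" <<'EOF'
#@ job_type = parallel
#@ tasks_per_node = 32
#@ node = 2
#@ queue
poe a.out
EOF

nodes=$(ll_keyword node "$job")
tpn=$(ll_keyword tasks_per_node "$job")
total=$((nodes * tpn))
echo "poe will start $total processes on $nodes nodes"
rm -f "$job"
```

The same total tells you whether the job exceeds 128 tasks, in which case printing $LOADL_PROCESSOR_LIST should be avoided, as noted above.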


OpenMP jobs

Here is an example command file for a threaded OpenMP job.
#@ shell = /bin/ksh
#@ job_type = serial
#@ output = $(host).$(jobid).out
#@ error = $(host).$(jobid).err
#@ wall_clock_limit = 30:00
#@ resources = ConsumableCpus(8)
#@ queue
pwd
echo $LOADL_PROCESSOR_LIST
export OMP_NUM_THREADS=8
a.out

Here is a description of each line that differs from the MPI example. See above for details on the other statements.

#@ job_type = serial

Use a single node for the job. Each statement in the script should use only one process, though each process may have multiple threads.

#@ resources = ConsumableCpus(8)

Reserve 8 processors. This statement is critical for OpenMP jobs! Though the job is "serial" in terms of processes, it uses multiple threads per process. This statement reserves 8 processors, but it does not set the number of OpenMP threads!

export OMP_NUM_THREADS=8

Use 8 OpenMP threads per process. This number is typically the same as the number of "ConsumableCpus" set above. For "csh", use the following instead.

setenv OMP_NUM_THREADS 8

a.out

Run a single copy of "a.out" using 8 threads and 8 processors. Note that each "8" is set separately.
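
Because the two values are set separately, they can drift apart when a command file is edited. The following sketch compares them; it assumes "ksh" syntax for the "export" line, and the function name "check_threads" is made up for this example.

```shell
# Compare ConsumableCpus(N) with "export OMP_NUM_THREADS=M" in a command file.
check_threads() {
    cpus=$(sed -n 's/^#@.*ConsumableCpus(\([0-9]*\)).*/\1/p' "$1" | tail -n 1)
    threads=$(sed -n 's/^export OMP_NUM_THREADS=\([0-9]*\).*/\1/p' "$1" | tail -n 1)
    cpus=${cpus:-1}   # the text gives 1 as the default processors per task
    if [ -z "$threads" ]; then
        echo "OMP_NUM_THREADS is not set in $1"
    elif [ "$cpus" -eq "$threads" ]; then
        echo "match: $cpus processors, $threads threads"
    else
        echo "mismatch: $cpus processors, $threads threads"
    fi
}

# Demonstration: 8 processors reserved, but only 4 threads requested.
job=$(mktemp)
cat > "$job" <<'EOF'
#@ job_type = serial
#@ resources = ConsumableCpus(8)
#@ queue
export OMP_NUM_THREADS=4
a.out
EOF
check_threads "$job"
rm -f "$job"
```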


Hybrid MPI-OpenMP jobs

Here is an example command file for a hybrid MPI-OpenMP job. Each MPI process uses multiple OpenMP threads.
#@ shell = /bin/ksh
#@ job_type = parallel
#@ network.MPI = csss,shared,US
#@ output = $(host).$(jobid).out
#@ error = $(host).$(jobid).err
#@ wall_clock_limit = 30:00
#@ tasks_per_node = 4
#@ node = 2
#@ resources = ConsumableCpus(8)
#@ queue
pwd
echo $LOADL_PROCESSOR_LIST
export MP_SHARED_MEMORY=yes
export OMP_NUM_THREADS=8
poe a.out

Here is a description of each line that differs from the MPI example. See above for details on the other statements.

#@ resources = ConsumableCpus(8)

Reserve 8 processors for each MPI task. This statement is critical for OpenMP jobs! This statement reserves 8 processors per MPI task, but it does not set the number of OpenMP threads per task!

export OMP_NUM_THREADS=8

Use 8 OpenMP threads per MPI task. This number is typically the same as the number of "ConsumableCpus" set above. For "csh", use the following instead.

setenv OMP_NUM_THREADS 8

poe a.out

Run 8 copies of "a.out" across two nodes, where each copy uses 8 threads on 8 processors. This job uses a total of 64 processors. Because "ConsumableCpus" and "OMP_NUM_THREADS" are set the same, each thread will have a full processor to use.
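
The totals for this example can be checked with a little shell arithmetic. The values are those of the sample command file above; the variable names are chosen for this sketch only.

```shell
# Values from the sample hybrid command file.
nodes=2
tasks_per_node=4
cpus_per_task=8
procs_per_node=32   # the p690 nodes are 32-way

cpus_per_node=$((tasks_per_node * cpus_per_task))
total_tasks=$((nodes * tasks_per_node))
total_cpus=$((nodes * cpus_per_node))

echo "$total_tasks MPI tasks, $total_cpus processors, $cpus_per_node per node"
if [ "$cpus_per_node" -gt "$procs_per_node" ]; then
    echo "request does not fit on a ${procs_per_node}-way node"
fi
```

With 4 tasks per node and 8 processors per task, each node is exactly filled, so the second message is not printed.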


Memory requirements

If you do not specify a memory requirement, each process gets the default, which may be as little as 256MB per process.

Most Cheetah nodes have roughly 1GB per processor, but a few have more. You can see what memory resources are available using "llstatus -R", as described above.

Use the "ConsumableMemory" resource to specify memory requirements, as in the following example, which requests 2GB per task.

#@ resources = ConsumableMemory(2 gb)
You can specify memory in other units, including MB ("mb") and kB ("kb"). This resource is a per-task resource; it is not the total amount you want to use.

Make sure to include all resource requests on a single "resources" line. The following example requests 32 processors and 64GB per task, such as for a large OpenMP job.

#@ resources = ConsumableCpus(32) ConsumableMemory(64 gb)
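
Because the request is per task, the memory needed on each node is the per-task amount multiplied by the number of tasks per node. The following sketch works this out for a hypothetical job with 32 tasks per node and 2 GB per task on a 32 GB node. It assumes binary units (1 gb = 1024 mb), and the function name "to_mb" is made up for this example.

```shell
# Convert a ConsumableMemory value such as "2 gb" or "512 mb" to megabytes.
# Assumption: binary units, 1 gb = 1024 mb and 1 mb = 1024 kb.
to_mb() {
    echo "$1 $2" | awk '{
        unit = tolower($2)
        if (unit == "gb")      print $1 * 1024
        else if (unit == "mb") print $1
        else if (unit == "kb") print $1 / 1024
    }'
}

per_task_mb=$(to_mb 2 gb)   # "#@ resources = ConsumableMemory(2 gb)"
tasks_per_node=32           # hypothetical "#@ tasks_per_node = 32"
node_mb=$(to_mb 32 gb)      # a node with 32 GB, as listed by "llstatus -R"

needed_mb=$((per_task_mb * tasks_per_node))
if [ "$needed_mb" -gt "$node_mb" ]; then
    echo "needs $needed_mb MB per node, but the node has $node_mb MB"
fi
```

In this hypothetical case the request would need either fewer tasks per node or one of the nodes with more memory.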


Scratch disk requirements

If you do not specify a scratch space requirement, each process gets the default, which is none.

Some of the 32-way nodes have 160 GB of local scratch space. There are no LPARs. Use "llstatus -R" to check the current configuration.

Use the "ConsumableScratch" resource to specify scratch disk requirements, as in the following example, which requests 10 GB of local disk space for a job with one task. As with all consumable resources, this is a per-task request.

#@ resources = ConsumableScratch(10)

You cannot specify the units; the value is always in gigabytes.

Make sure to include all resource requests on a single "resources" line. The following example requests 32 processors and 64GB per task, such as for a large OpenMP job, and 10 GB of local scratch space.

#@ resources = ConsumableCpus(32) ConsumableMemory(64 gb) ConsumableScratch(10)

Submitting jobs

Use "llsubmit" to submit a job command file for batch execution.
$ llsubmit command_file
llsubmit: Processed command file through Submit Filter: "/opt/bin/llsubmitfilter".
llsubmit: The job "cheetah48.ccs.ornl.gov.12765" has been submitted.
The job shell will inherit the working directory from where you submitted the job. Also, unless you use full path names, the standard output and standard error files will be saved in this same directory.
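
If you submit jobs from a script, you may want to capture the job name from the "llsubmit" message. The following sketch extracts it with "sed"; the function name "submitted_job" is made up for this example, and the sample message is copied from the output above. Whether "llsubmit" writes this message to standard output or standard error is not covered here, so treat the pipeline usage as an assumption.

```shell
# Extract the job name from the "has been submitted" message of llsubmit.
submitted_job() {
    sed -n 's/^llsubmit: The job "\(.*\)" has been submitted\.$/\1/p'
}

# Sample message copied from the output above.
echo 'llsubmit: The job "cheetah48.ccs.ornl.gov.12765" has been submitted.' | submitted_job
```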

If you forget to supply a "wall_clock_limit", your job will get the default limit, regardless of class.

$ llsubmit command_file
/opt/bin/llsubmitfilter: WARNING:  wall_clock_limit is set to "2:05:00, 2:00:00"
llsubmit: Processed command file through Submit Filter: "/opt/bin/llsubmitfilter".
llsubmit: The job "cheetah48.ccs.ornl.gov.12766" has been submitted.
Some classes have limits on the number of nodes a single job can request (though "batch" and "interactive" currently do not). Unfortunately, "llclass" does not reveal such limits. You may first discover the limit at submit time.
$ llsubmit command_file
llsubmit: Processed command file through Submit Filter: "/opt/bin/llsubmitfilter".
llsubmit: 2512-135 For the "node" keyword, maximum number of nodes requested is greater than allowed for this "class".
llsubmit: 2512-051 This job has not been submitted to LoadLeveler.
Unfortunately, "llsubmit" does not report what the limit actually is. See above to see how to list such limits.

Job status

Use "llq" to check the status of submitted jobs.
$ llq
Id                    Owner      Submitted   ST PRI Class      Running On 
---------------------- ---------- ----------- -- --- ------------ -----------
cheetah48.12813.0      ernie      11/27 04:20 R  50  batch        cheetah04
cheetah48.12816.0      ernie      11/27 04:50 R  50  batch        cheetah06
cheetah48.12820.0      ernie      11/27 08:10 R  50  batch        cheetah27
cheetah48.12814.0      grover     11/27 04:29 R  50  batch        cheetah26
cheetah48.12815.0      grover     11/27 04:30 I  50  batch   
cheetah01.218.0        zoe        11/27 08:41 I  1   batch  
cheetah48.12846.0      bert       11/27 09:40 I  50  batch                   
cheetah48.12848.0      elmo       11/27 09:42 I  50  batch                   
cheetah48.12850.0      bert       11/27 09:46 I  50  batch                   
cheetah48.12851.0      bert       11/27 09:50 I  50  batch                   
cheetah48.12852.0      bert       11/27 09:52 I  50  batch                   
cheetah48.12853.0      bert       11/27 09:54 I  50  batch                   
cheetah48.12854.0      bert       11/27 09:56 I  50  batch                   
cheetah48.12856.0      bert       11/27 09:58 I  50  batch                   
cheetah48.12860.0      bert       11/27 10:03 I  50  batch                   
cheetah48.12861.0      herry      11/27 10:08 I  50  batch                   
cheetah48.12862.0      oscar      11/27 10:08 I  50  batch                   
cheetah48.12863.0      cookie     11/27 10:22 I  50  batch                   
cheetah48.12865.0      kermit     11/27 11:07 I  50  batch                   

19 job steps in queue, 15 waiting, 0 pending, 4 running, 0 held
The first column is the name of each job step, the second column is the owner of the job, and the third column is the time when the job was first submitted to LoadLeveler. The "ST" column gives the status of each job. Here are some common status values.
R Running
ST STarting
I Idle, waiting for resources
H Held by the user
S held by the System
RP Remove Pending, being removed
The "PRI" column gives the user priority of the job, though this priority is not currently used in making scheduling decisions. The "Class" column gives the class specified in the job command file ("batch" is the default). The final column, "Running On", gives the first node assigned to each running job. Only this first node appears, even for parallel jobs running on multiple nodes.
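
For a long queue, a per-status count can be easier to read than the full listing. The following sketch tallies the "ST" column of default "llq" output with "awk"; the function name "count_by_status" is made up for this example, and the sample lines are copied from the listing above so that the sketch runs without LoadLeveler.

```shell
# Count job steps per status from default "llq" output.
# Job lines are recognized by an Id of the form host.number.number;
# the status is the fifth field because "Submitted" spans two fields.
count_by_status() {
    awk '$1 ~ /^[^ ]+\.[0-9]+\.[0-9]+$/ { n[$5]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Sample input copied from the listing above.
count_by_status <<'EOF'
Id                    Owner      Submitted   ST PRI Class      Running On
---------------------- ---------- ----------- -- --- ------------ -----------
cheetah48.12813.0      ernie      11/27 04:20 R  50  batch        cheetah04
cheetah48.12814.0      grover     11/27 04:29 R  50  batch        cheetah26
cheetah48.12815.0      grover     11/27 04:30 I  50  batch
cheetah01.218.0        zoe        11/27 08:41 I  1   batch
EOF
```

The header, separator, and summary lines do not match the Id pattern and are ignored.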

Some of the columns of default "llq" output are not particularly useful, and "llq" is capable of displaying useful information that is not shown by default. To remedy this, you can configure the output generated by "llq" on the command line. Here is an example configuration.

$ llq -f %o %id %nh %st %dd %dq
Owner     Step Id              NM   ST Disp. Date  Queue Date  Running On
----------- ---------------------- ---- -- ----------- ----------- --------------
kermit      cheetah48.12865.0      0    I              11/27 11:07         
cookie      cheetah48.12863.0      0    I              11/27 10:22        
oscar       cheetah48.12862.0      0    I              11/27 10:08       
herry       cheetah48.12861.0      0    I              11/27 10:08      
bert        cheetah48.12860.0      0    I              11/27 10:03     
bert        cheetah48.12856.0      0    I              11/27 09:58    
bert        cheetah48.12854.0      0    I              11/27 09:56   
bert        cheetah48.12853.0      0    I              11/27 09:54         
bert        cheetah48.12852.0      0    I              11/27 09:52        
bert        cheetah48.12851.0      0    I              11/27 09:50       
bert        cheetah48.12850.0      0    I              11/27 09:46      
elmo        cheetah48.12848.0      0    I              11/27 09:42     
bert        cheetah48.12846.0      0    I              11/27 09:40    
ernie       cheetah48.12866.0      0    I              11/27 11:20   
grover      cheetah48.12815.0      0    I  11/27 04:30 11/27 04:30              
ernie       cheetah48.12816.0      8    R  11/27 04:50 11/27 04:50 cheetah04 
zoe         cheetah01.218.0        0    I  11/27 08:41 11/27 08:41  
ernie       cheetah48.12813.0      8    R  11/27 04:20 11/27 04:20 cheetah06    
ernie       cheetah48.12820.0      16   R  11/27 08:10 11/27 08:10 cheetah27    
grover      cheetah48.12814.0      32   R  11/27 04:29 11/27 04:29 cheetah26
In addition to the owner, job name, and status, this format gives "NM", the number of nodes used by the job, "Disp. Date", the time the job was started, and "Queue Date", the time the job was queued. See "man llq" for more information on configuring output. You may want to create an alias for the "llq" configuration you prefer.
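
One possible way to set up such an alias is sketched here. The alias name "llqf" is made up for this example, and the format codes are the ones used in the example above; substitute whichever codes you prefer.

```shell
# Hypothetical alias name "llqf"; the format codes come from the example above.
alias llqf='llq -f %o %id %nh %st %dd %dq'
```

To make the alias available in every session, you could place this line in the startup file for your login shell. Which file that is depends on the shell you use.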

As an alternative to "llq", we provide the local utility "llqn", which lists a different set of job characteristics. To list all the characteristics available from "llqn", use the "-a" option.

$ llqn -a
Job Id                        Owner    Class        SysPrio  S Date             Node
----------------------------- -------- ------------ -------- - ---------------- ----
cheetah48.ccs.ornl.gov.12815.0 grover   batch        -4837086 R  Nov 28 04:30      32
cheetah48.ccs.ornl.gov.12813.0 ernie    batch        -4740231 R  Nov 28 04:20       8
cheetah48.ccs.ornl.gov.12820.0 ernie    batch        -4718494 R  Nov 28 08:10      16
cheetah48.ccs.ornl.gov.12814.0 grover   batch        -4718498 R  Nov 27 12:29      32

cheetah48.ccs.ornl.gov.12816.0 ernie    batch        -4842014 I  Nov 28 04:50       8
cheetah01.ccs.ornl.gov.218.0   zoe      batch        -4842039 I  Nov 28 08:41      32
cheetah48.ccs.ornl.gov.12865.0 kermit   batch        -4747045 I  Nov 27 11:07     144 
cheetah48.ccs.ornl.gov.12863.0 cookie   batch        -4749527 I  Nov 27 10:22      80
cheetah48.ccs.ornl.gov.12862.0 oscar    batch        -4836911 I  Nov 27 10:08      16
...
Unlike "PRI" with "llq", "SysPrio" is an accurate representation of the scheduling priority; the job with the largest (least negative) priority is scheduled next. Jobs with lower priority can skip ahead if they can fit in holes in the scheduled job mix. This is called backfilling.
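
To see which waiting jobs are closest to the front of the line, a sketch along the following lines may help. It assumes the "llqn" column layout shown above (job ID, owner, class, SysPrio, status). The function name "next_in_line" is made up for this example.

```shell
# List waiting ("I") job steps from "llqn" output on stdin, highest SysPrio first.
# Assumes the column layout shown above: Job Id, Owner, Class, SysPrio, S, ...
# "next_in_line" is a hypothetical helper name, not a local utility.
next_in_line() {
    awk '$5 == "I" { print $4, $1, $2 }' | sort -k1,1nr
}

# Sample rows copied from the listing above; on Cheetah, pipe "llqn -a" instead.
next_in_line <<'EOF'
cheetah48.ccs.ornl.gov.12814.0 grover   batch        -4718498 R  Nov 27 12:29      32
cheetah48.ccs.ornl.gov.12816.0 ernie    batch        -4842014 I  Nov 28 04:50       8
cheetah48.ccs.ornl.gov.12865.0 kermit   batch        -4747045 I  Nov 27 11:07     144
cheetah48.ccs.ornl.gov.12863.0 cookie   batch        -4749527 I  Nov 27 10:22      80
EOF
```

The first line printed is the waiting job with the least negative priority. Because of backfilling, this ordering is only a guide to the actual start order.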

"Date" means different things for running and waiting ("I") jobs. For waiting jobs, "Date" is the queue time. For running jobs, "Date" is the latest time the job will finish, based on the start time and the wall-clock limit.

See "man llqn" for more details.


Why isn't my job running?

You can verify that your job is not running by checking the "ST" column of "llq" output. You can then use "llq -s" with the job name to find out why it isn't running. The output created by "llq -s" is long, so you may want to pick out the useful lines using "sed". The following example demonstrates how to display lines of "llq -s" output between the line "SUMMARY" and the line "ANALYSIS".

$ llq -s cheetah48.12865.0
...
(pages of information)
...

$ llq -s cheetah48.12865.0 | sed -n '/SUMMARY/,/ANALYSIS/p'
SUMMARY

This LoadLeveler cluster does not have sufficient resources at the present time
to run this job step.

ANALYSIS
The LoadLeveler cluster may not have sufficient resources for a variety of reasons. Nodes may be busy with other jobs, for example. Unfortunately, LoadLeveler cannot distinguish between a temporary reduction of resources and permanent system limitations. Therefore, if a job requests more nodes than the system has, the job will wait, and "llq -s" will return the message above, despite the fact that the job will never be able to run.
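
If you want only the explanation itself, without the "SUMMARY" and "ANALYSIS" marker lines, a variation on the "sed" command above might look like the following sketch. The function name "summary_only" is made up for this example, and the sample input is an abbreviated stand-in for real "llq -s" output.

```shell
# Print only the text between the SUMMARY and ANALYSIS markers,
# dropping the marker lines themselves and any blank lines.
# "summary_only" is a hypothetical helper name.
summary_only() {
    sed -n '/SUMMARY/,/ANALYSIS/{
        /SUMMARY/d
        /ANALYSIS/d
        /^$/d
        p
    }'
}

# Abbreviated stand-in for "llq -s" output; on Cheetah, pipe "llq -s" instead.
summary_only <<'EOF'
(pages of information)
SUMMARY

This LoadLeveler cluster does not have sufficient resources at the present time
to run this job step.

ANALYSIS
(more information)
EOF
```

On the system itself, the equivalent would be "llq -s cheetah48.12865.0 | summary_only".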

In addition to "I", waiting jobs may appear with the "H" (hold) status. This status often means something has gone wrong. Here are some common reasons that jobs are held.

  • NFS is down on one or more nodes. This can be verified using the following command.
    dsh 'ls -d ~/public'
  • You have exceeded your NFS quota. This can be checked by the following command.
    lsquota
  • One or more of the LoadLeveler options is set to a resource that does not exist.
If you have a job in the "H" state and cannot determine how it got there, feel free to contact "consult@ccs.ornl.gov".


What nodes is my job using?

You can use "llq -l" to display detailed information about LoadLeveler jobs, including a list of the nodes allocated for each job. You can use "grep" to isolate this node list, as in the following example.

$ llq -l cheetah48.12813.0 | grep "gov::"
   Allocated Hosts : cheetah04.ccs.ornl.gov::en3(-1,MPI,IP,0M),en3(-1,MPI,IP,0M),en3(-1,MPI,IP,0M),
                   + en3(-1,MPI,IP,0M),en3(-1,MPI,IP,0M),en3(-1,MPI,IP,0M),en3(-1,MPI,IP,0M),en3(-1,MPI,IP,0M)
Notice that this example has 8 "en3" entries, indicating that 8 MPI processes are running on node cheetah04.
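
Counting the entries by eye is error-prone for larger jobs, so a small sketch like the following can do it instead. It assumes the adapter is named "en3", as in the output above. The function name "count_tasks" is made up for this example.

```shell
# Count "en3(" adapter entries in "Allocated Hosts" lines read on stdin.
# Assumes the adapter name "en3" from the example output above.
# "count_tasks" is a hypothetical helper name.
count_tasks() {
    awk '{ n += gsub(/en3\(/, "") } END { print n + 0 }'
}

# Sample lines copied from the output above; on Cheetah, pipe "llq -l" instead.
count_tasks <<'EOF'
   Allocated Hosts : cheetah04.ccs.ornl.gov::en3(-1,MPI,IP,0M),en3(-1,MPI,IP,0M),en3(-1,MPI,IP,0M),
                   + en3(-1,MPI,IP,0M),en3(-1,MPI,IP,0M),en3(-1,MPI,IP,0M),en3(-1,MPI,IP,0M),en3(-1,MPI,IP,0M)
EOF
```

For a job that spans several nodes, this gives the total across all nodes rather than a per-node count.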


Stopping jobs

You can use "llcancel" with a list of job names to cancel those jobs. The command removes waiting jobs and aborts running jobs.
$ llcancel cheetah48.12816.0
llcancel: Cancel command has been sent to the central manager.
You can also keep a job from running without removing it from LoadLeveler using "llhold" with a list of job names. You can then use "llhold -r" to release held jobs and allow them to run.
$ llhold cheetah48.12817.0 
llhold: Hold command has been sent to the central manager.
$ ...
...
$ llhold -r cheetah48.12817.0 
llhold: Hold command has been sent to the central manager.
The "llhold" command has no effect on running jobs.
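
Because "llcancel" and "llhold" accept a list of job names, it can be handy to build that list from "llq" output. The following sketch prints the names of one user's waiting jobs. It assumes the default "llq" column layout (step name, owner, submit date, submit time, status), and the function name "idle_steps_of" is made up for this example.

```shell
# Print the step names of one user's waiting ("I") jobs from default "llq"
# output on stdin. Assumes columns: Id, Owner, date, time, ST, ...
# "idle_steps_of" is a hypothetical helper name.
idle_steps_of() {
    awk -v owner="$1" '$2 == owner && $5 == "I" { print $1 }'
}

# Sample rows copied from the "llq" listing; on Cheetah, pipe "llq" instead.
idle_steps_of bert <<'EOF'
cheetah48.12860.0      bert       11/27 10:03 I  50  batch
cheetah48.12861.0      herry      11/27 10:08 I  50  batch
cheetah48.12862.0      oscar      11/27 10:08 I  50  batch
EOF
```

After checking that the printed list is what you expect, you could pass it to "llcancel" or "llhold", for example with "llhold $(llq | idle_steps_of "$USER")".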


Documentation

Cheetah has "man" pages for each of the LoadLeveler commands. Full HTML and PDF documentation is also available from IBM's website. Note that Cheetah currently runs LoadLeveler Version 3 Release 1.
http://www-1.ibm.com/servers/eserver/pseries/library/sp_books/loadleveler.html
The document entitled "Using and Administering" is particularly useful.

For more information on "poe" options, see the "man" page or the online documentation for IBM's Parallel Environment (PE). Cheetah currently runs PE Version 3 Release 2.

http://www-1.ibm.com/servers/eserver/pseries/library/sp_books/pe.html

URL http://www.ccs.ornl.gov/Cheetah/LL.html
Updated: Monday, 14-Feb-2005 13:26:06 EST
consult@ccs.ornl.gov