
Running on the ORNL AlphaServer SC Systems




Running parallel programs

The ORNL Compaq AlphaServer SC systems currently have no batch system. All parallel jobs run through the Resource Management System (RMS), which allocates processors for interactive jobs. Each job has a maximum wall-clock time limit of 12 hours. The number of Falcon processors that a single user can use is not limited, but users are asked to contact consult@ccs.ornl.gov before running a full-configuration job. On Colt, a single user is limited to 8 nodes (32 processors). Larger jobs will be considered upon request.

Use the "prun" command to start a parallel job using RMS. Here are some examples.

The following command runs the MPI job "a.out" with 16 processes. By default, the processes will be spread across 4 nodes (4 processes per node).
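As a minimal sketch of such a launch: the "-n" flag (process count) and the "./a.out" path are assumptions not confirmed by this page, so check "man prun" before relying on them. The arithmetic shows why 16 processes occupy 4 nodes at 4 processes per node. The launch itself only happens where "prun" is installed.

```shell
# Sketch only: "-n" (number of processes) is an assumed flag; see "man prun".
NPROCS=16          # MPI processes requested
CPUS_PER_NODE=4    # 4 processes per node, per the text above

# Nodes needed when each node is filled with 4 processes (rounded up)
NODES=$(( (NPROCS + CPUS_PER_NODE - 1) / CPUS_PER_NODE ))
echo "$NPROCS processes on $NODES nodes"

# Launch only where RMS is actually installed
if command -v prun >/dev/null 2>&1; then
    prun -n "$NPROCS" ./a.out
fi
```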

The following commands run an OpenMP code on a dedicated node using 4 OpenMP threads. This example assumes a Korn shell ("ksh").

Note that "prun" exports the environment of the host shell, including the "OMP_NUM_THREADS" environment variable. Here is the same example for a C shell ("csh").

The following commands run a hybrid MPI/OpenMP code using 8 MPI processes, one per node, with 4 OpenMP threads per MPI process. This example assumes a Korn shell ("ksh").

Here is the same example for a C shell ("csh").
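The hybrid case can be sketched as follows for a Korn-style shell. The flags "-N" (nodes), "-n" (processes) and "-c" (processors per process) are assumptions, as is the "./a.out" path, so confirm them against "man prun". The thread count is set in the environment because "prun" exports the host shell's environment to the job.

```shell
# Sketch only: "-N", "-n" and "-c" are assumed flags; see "man prun".
export OMP_NUM_THREADS=4      # passed on to the job by prun, per the text above
NODES=8
MPI_PROCS=8                   # one MPI process per node

# Total processors the job occupies: processes times threads per process
TOTAL_THREADS=$(( MPI_PROCS * OMP_NUM_THREADS ))
echo "$MPI_PROCS processes x $OMP_NUM_THREADS threads = $TOTAL_THREADS processors"

# Launch only where RMS is actually installed
if command -v prun >/dev/null 2>&1; then
    prun -N "$NODES" -n "$MPI_PROCS" -c "$OMP_NUM_THREADS" ./a.out
fi
```

In a C shell, the "export" line would become "setenv OMP_NUM_THREADS 4"; the "prun" line would be unchanged.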

Dealing with contention

If you attempt to request more nodes than are available, your "prun" command will block until it can run. RMS has a simple scheduler; pending requests are served first-in-first-out (FIFO). If you don't want your request to block, you can use the "-I" option to cause "prun" to fail immediately if resources are not available.

You can use the "rinfo" command to see the current status of the system.

This example shows that RMS is running a single partition, "parallel", that includes nodes 1 through 63. The user "ernie" has reserved 128 processors, nodes 32 through 63. The same user is currently using 96 of those processors, nodes 40 through 63, for an actual parallel job. The user "bert" is waiting for 248 processors to become available.
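A non-blocking launch with "-I" might be wrapped as in the sketch below. The helper name "try_now" is hypothetical, and the "-n" flag and "./a.out" path are assumptions. The wrapper cannot tell why "prun" failed, so its message is deliberately cautious.

```shell
# try_now (hypothetical helper): attempt an immediate launch instead of queueing.
# "-I" comes from the text above; any other flags passed in are the caller's assumption.
try_now() {
    if ! command -v prun >/dev/null 2>&1; then
        echo "prun not found"
        return 2
    fi
    if ! prun -I "$@"; then
        echo "prun -I failed (resources may be unavailable)"
        return 1
    fi
}

# Example call: report rather than wait if the job cannot start now
try_now -n 16 ./a.out || echo "job not started"
```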

This example raises a question. How did "ernie" reserve 128 processors but use only 96? The "allocate" command provides the answer; it reserves a group of nodes and starts a new shell. The "prun" commands issued from that shell run within the allocated group of nodes. The nodes are released when you exit the shell.
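The processor counts quoted for "ernie" follow from the node ranges, assuming 4 processors per node as elsewhere on this page. The helper name "procs_in_range" is hypothetical and exists only for this check.

```shell
# procs_in_range FIRST LAST: processors in an inclusive node range,
# assuming 4 processors per node.
procs_in_range() {
    echo $(( ($2 - $1 + 1) * 4 ))
}

RESERVED=$(procs_in_range 32 63)   # nodes held by the allocation
IN_USE=$(procs_in_range 40 63)     # nodes running the parallel job
echo "reserved=$RESERVED in_use=$IN_USE idle=$(( RESERVED - IN_USE ))"
```

The 32 idle processors are the 8 nodes that the allocation holds but the running job does not use.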

This capability is useful when you need to run a number of parallel jobs in a row on the same set of nodes. A good example of such a need is a scalability study of parallel performance. The following example reserves 128 processors and runs an executable with increasing numbers of processors.
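A scalability run of this kind could be scripted along the following lines, from inside the shell that "allocate" starts. The starting size of 16, the doubling step, the "-n" flag and the "./a.out" path are all assumptions made for illustration; check "man allocate" and "man prun" for the options that reserve and use the 128 processors.

```shell
# Run inside the shell created by "allocate" (reservation step not shown here).
# Sketch only: "-n" is an assumed flag; the sizes are illustrative.
MAX_PROCS=128
n=16
SIZES=""
while [ "$n" -le "$MAX_PROCS" ]; do
    SIZES="$SIZES $n"
    # Launch only where RMS is actually installed
    if command -v prun >/dev/null 2>&1; then
        prun -n "$n" ./a.out
    fi
    n=$(( n * 2 ))     # double the processor count for the next run
done
echo "sizes:$SIZES"
```

Because every run stays within the same reservation, the timings are not interleaved with waits for nodes to become free.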

The prefixes "(old)" and "(new)" have been added for clarity. For actual sessions, the most visible sign that "allocate" creates a new shell is that you must issue an "exit" for the allocation to disappear from "rinfo".

Dealing with credentials

The CCS systems use the Distributed Computing Environment (DCE) for user authentication. With DCE, all user information is centralized in the "registry", so each domain of Falcon does not require a separate copy of each user's login information ("/etc/passwd").

DCE credentials are needed to access native DFS and to call HSI without a password. The "prun" command automatically distributes DCE credentials to the nodes of each parallel job. To do this, "prun" must request your DCE password the first time you run, whenever you change your password, and whenever the system software is upgraded.

To confirm that "prun" can distribute your DCE credentials, call "prun" with no arguments. If it needs your password, it will request it at this point.

Subsequent calls should produce the "usage" information without requesting a password.

Documentation

For more information on RMS commands, consult the following "man" pages on Falcon and Colt.

Manuals for RMS are available in PDF and PostScript format in the following directory on Falcon and Colt.


URL http://www.ccs.ornl.gov/falcon/rms.html
Updated: 8/27/2002