Compiling and Linking Fortran Programs
To enable the use of OpenMP Fortran API
compiler directives in your program, you must include the -omp compiler
option on your f90 command:
% f90 -omp prog.f -o prog
Compiling and Linking C Programs
To enable the use of OpenMP C API compiler
directives in your program, you must include the -omp compiler
option on your cc command:
% cc -omp prog.c -o prog
Compiler Options
o -omp
Causes the compiler to recognize only
OpenMP manual decomposition pragmas and to ignore old-style manual
decomposition directives. (Note that the -mp and -omp switches
are the same except for their treatment of old-style manual decomposition
directives; -mp recognizes the old-style directives and -omp does
not.)
o -granularity size
Controls the size of shared data in memory that can
be safely accessed from different threads. Valid values for size are byte, longword, and quadword:
byte
Ensures that all data of one byte or greater can be
accessed safely from different threads sharing data in memory. This option
slows run-time performance.
longword
Ensures that naturally aligned data of four bytes or
greater can be accessed safely from different threads sharing access to that
data in memory. Accessing data items of three bytes or less and unaligned data
may result in data items written from multiple threads being inconsistently
updated.
quadword
Ensures that naturally aligned data of eight bytes
can be accessed safely from different threads sharing data in memory. Accessing
data items of seven bytes or less and unaligned data may result in data items
written from multiple threads being inconsistently updated. This is the
default.
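The effect of the default quadword granularity on data layout can be sketched in C. This is a minimal illustration and not part of the original material: the type names, MAX_THREADS, and shares_quadword are hypothetical. It contrasts one-byte per-thread flags, which land in the same eight-byte word, with flags that each occupy their own naturally aligned eight-byte slot.

```c
#include <stddef.h>
#include <stdint.h>

enum { MAX_THREADS = 4 };  /* hypothetical team size, for illustration only */

/* One-byte flags: several threads' flags fall into the same 8-byte word,
   so writes from different threads may be inconsistently updated under
   the default quadword granularity. */
typedef struct {
    unsigned char done[MAX_THREADS];
} packed_flags;

/* One naturally aligned 8-byte slot per thread. */
typedef struct {
    _Alignas(8) uint64_t done[MAX_THREADS];
} quadword_flags;

/* Returns 1 if two byte offsets fall inside the same aligned 8-byte word. */
int shares_quadword(size_t offset_a, size_t offset_b)
{
    return offset_a / 8 == offset_b / 8;
}
```

With this layout, shares_quadword(0, 1) is true for the packed flags, while consecutive slots of quadword_flags start at offsets 0 and 8 and never share a word. The alternative is to keep the compact layout and compile with -granularity byte, at the cost of slower run-time performance.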
o -check_omp
Enables run-time checking of certain OpenMP
constructs. This includes run-time detection of invalid nesting and other
invalid OpenMP cases. When invalid nesting is discovered at run time and this
switch is set, the executable will fail with a Trace/BPT trap. If this switch
is not set and invalid nesting is discovered, the behavior is indeterminate
(for example, an executable may hang).
The compiler detects the following invalid nesting
conditions:
o Entering a for, single, or sections directive if already in a
work-sharing construct, a critical section, or a master section
o Executing a barrier directive if already in a work-sharing
construct, a critical section, or a master section
o Executing a master directive if already in a work-sharing construct
o Executing an ordered directive if already in a critical section
o Executing an ordered directive unless already in an ordered for
By default, run-time checking is disabled.
Adjusting the Run-Time Environment
o The OpenMP API and the Compaq parallel compiler directive
sets also provide environment variables that adjust the run-time environment in
unusual situations.
o Regardless of whether you used the -omp or the -mp compiler
option, when the compiler needs information supplied by an environment
variable, the compiler first looks for an OpenMP API environment variable and
then for a Compaq parallel compiler environment variable.
o If neither one is found, the compiler uses a default.
o The compiler looks for environment variable information in
the following situations:
o When entering a parallel region, it looks for the number of threads
(OMP_NUM_THREADS or MP_THREAD_COUNT), the spin count (MP_SPIN_COUNT),
the yield count (MP_YIELD_COUNT), and the stack size (MP_STACK_SIZE).
o When entering a DO or PARALLEL DO directive that has RUNTIME
specified, it looks at the schedule type (OMP_SCHEDULE).
o When entering a worksharing directive, it looks at the chunk size
(MP_CHUNK_SIZE).
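The lookup order described above (OpenMP API variable first, then the Compaq variable, then a default) can be sketched in C for the thread count. This is an illustrative sketch, not the run-time library's actual code: the function names are hypothetical, and treating an unparsable value as "not set" and capping the value at 65536 are assumptions made here.

```c
#include <stdlib.h>

/* Parses a positive integer; returns 0 if the text is missing or invalid.
   Treating invalid text as "not set" is an assumption of this sketch. */
static int parse_positive(const char *text)
{
    if (text == NULL || *text == '\0')
        return 0;
    char *end = NULL;
    long value = strtol(text, &end, 10);
    if (*end != '\0' || value <= 0 || value > 65536)
        return 0;
    return (int)value;
}

/* Applies the documented precedence to already-fetched values:
   the OpenMP variable first, then the Compaq variable, then the default. */
int pick_thread_count(const char *omp_value, const char *mp_value, int fallback)
{
    int n = parse_positive(omp_value);
    if (n == 0)
        n = parse_positive(mp_value);
    return n != 0 ? n : fallback;
}

/* Reads the two environment variables named in the text. */
int resolve_thread_count(int fallback)
{
    return pick_thread_count(getenv("OMP_NUM_THREADS"),
                             getenv("MP_THREAD_COUNT"), fallback);
}
```

A caller would pass the number of processors as the fallback, for example resolve_thread_count(4) on a four-processor system. Separating pick_thread_count from the getenv calls keeps the precedence rule easy to check in isolation.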
OpenMP API Environment Variables
The OpenMP API environment variables are listed below:
OMP_SCHEDULE
This variable applies only to DO and PARALLEL DO directives
that have the schedule type of RUNTIME. You can set the schedule type and an
optional chunk size for these loops at run time. The schedule types are
STATIC, DYNAMIC, GUIDED, and RUNTIME. For
directives that have a schedule type other than RUNTIME, this variable is
ignored. The compiler default schedule type is STATIC. If the optional chunk
size is not set, a chunk size of one is assumed, except for the STATIC
schedule type. For this schedule type, the default chunk size is set to the
loop iteration space divided by the number of threads applied to the loop.
OMP_NUM_THREADS
Use this environment variable to set the number of threads
to use during execution. This number applies unless you explicitly change it by
calling the OMP_SET_NUM_THREADS run-time library routine. When
you have enabled dynamic thread adjustment, the value assigned to this
environment variable represents the maximum number of threads that can be
used. The default value is the number of processors in the current system.
For more information about dynamic thread adjustment, see the online release
notes.
OMP_DYNAMIC
Use this environment variable to enable or disable dynamic
thread adjustment for the execution of parallel regions. When set to TRUE,
the number of threads used can be adjusted by the run-time environment to
best utilize system resources. When set to FALSE, dynamic adjustment is
disabled. The default is FALSE. For more information about dynamic thread
adjustment, see the online release notes.
OMP_NESTED
Use this environment variable to enable or disable nested
parallelism. When set to TRUE, nested parallelism is enabled. When set to
FALSE, it is disabled. The default is FALSE. For more information about
nested parallelism, see the online release notes.
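The OMP_SCHEDULE defaults described above can be illustrated with a small C parser. This is a sketch under stated assumptions, not the run-time library's implementation: the type and function names are hypothetical, only STATIC, DYNAMIC, and GUIDED are accepted as values, and the static default chunk is rounded up so that every iteration is covered (the text says only "divided by").

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

typedef enum { SCHED_STATIC, SCHED_DYNAMIC, SCHED_GUIDED, SCHED_INVALID } sched_type;

typedef struct {
    sched_type type;
    long chunk;
} schedule;

/* Case-insensitive comparison of the first n characters of s with word. */
static int equals_nocase(const char *s, size_t n, const char *word)
{
    if (strlen(word) != n)
        return 0;
    for (size_t i = 0; i < n; i++)
        if (tolower((unsigned char)s[i]) != tolower((unsigned char)word[i]))
            return 0;
    return 1;
}

/* Parses a value of the form "type[,chunk]" and fills in the defaults:
   STATIC when the variable is unset; chunk 1 when omitted, except for
   STATIC, where the chunk is iterations / threads (rounded up here). */
schedule parse_schedule(const char *value, long iterations, long threads)
{
    schedule s = { SCHED_STATIC, 0 };
    const char *comma = NULL;

    if (threads < 1)
        threads = 1;
    if (value != NULL && *value != '\0') {
        comma = strchr(value, ',');
        size_t n = comma ? (size_t)(comma - value) : strlen(value);
        if (equals_nocase(value, n, "static"))
            s.type = SCHED_STATIC;
        else if (equals_nocase(value, n, "dynamic"))
            s.type = SCHED_DYNAMIC;
        else if (equals_nocase(value, n, "guided"))
            s.type = SCHED_GUIDED;
        else
            s.type = SCHED_INVALID;
    }
    if (comma != NULL)
        s.chunk = strtol(comma + 1, NULL, 10);
    if (s.chunk <= 0)
        s.chunk = (s.type == SCHED_STATIC)
                      ? (iterations + threads - 1) / threads
                      : 1;
    return s;
}
```

For a 100-iteration loop run by four threads, parse_schedule("dynamic,4", 100, 4) yields a dynamic schedule with chunk 4, while an unset variable (NULL) yields a static schedule with chunk 25.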
Compaq Environment Variables
The Compaq parallel compiler environment variables are listed below:
MP_THREAD_COUNT
Specifies the number of threads the run-time system is to
create. The default is the number of processors available to your process.
MP_CHUNK_SIZE
Specifies the chunk size the run-time system uses when
dispatching loop iterations to threads if the program specified the RUNTIME
schedule type or specified another schedule type requiring a chunk size, but
omitted the chunk size. The default chunk size is 1.
MP_STACK_SIZE
Specifies how many bytes of stack space the run-time system
allocates for each thread when creating it. If you specify zero, the run-time
system uses the default, which is very small. Therefore, if a program
declares any large arrays to be PRIVATE, specify a value large enough to
allocate them. If you do not use this environment variable at all, the
run-time system allocates 5 MB.
MP_SPIN_COUNT
Specifies how many times the run-time system spins while waiting
for a condition to become true. The default is 16,000,000, which is
approximately one second of CPU time. When one of the
threads needs to wait for an event caused by some other thread, a three-level
process begins. First, the thread spins for a number of iterations waiting for
the event to occur. Second, it yields the processor to other threads a number
of times, checking each time whether the event has occurred. Finally, it posts a
request to be awakened and goes to sleep. When another thread causes the event
to occur, it awakens the sleeping thread. If your application is running
stand-alone, the default settings of MP_SPIN_COUNT and MP_YIELD_COUNT will
give good performance. But if your application needs to share the processors
with others, it is probably appropriate to reduce MP_SPIN_COUNT. This will
make the threads waste less time spinning and give up the processor sooner;
the cost is extra time to put a thread to sleep and re-awaken it. In such a
shared environment, an MP_SPIN_COUNT of about 1000 might be a good choice. Usually
MP_YIELD_COUNT does not need to be adjusted.
MP_YIELD_COUNT
Specifies how many times the run-time system alternates
between calling sched_yield and testing the condition before going to sleep by
waiting for a thread condition variable. The default is 10.
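The spin, yield, then sleep sequence governed by MP_SPIN_COUNT and MP_YIELD_COUNT can be modeled in C. This is a simplified, single-threaded sketch and not the Compaq run-time code: all names are hypothetical, the event is represented by a polling callback, and the final sleep on a condition variable is only reported to the caller rather than performed.

```c
#include <sched.h>

typedef enum { WAIT_SPIN, WAIT_YIELD, WAIT_SLEEP } wait_phase;
typedef int (*ready_fn)(void *arg);

/* Returns the phase in which the event was observed, or WAIT_SLEEP if
   both budgets ran out and the caller should now block until awakened. */
wait_phase wait_for_event(ready_fn ready, void *arg,
                          long spin_count, long yield_count)
{
    /* Level 1: spin, testing the condition up to spin_count times. */
    for (long i = 0; i < spin_count; i++)
        if (ready(arg))
            return WAIT_SPIN;

    /* Level 2: alternate between sched_yield and testing the condition. */
    for (long i = 0; i < yield_count; i++) {
        sched_yield();
        if (ready(arg))
            return WAIT_YIELD;
    }

    /* Level 3: give up the processor and wait to be awakened. */
    return WAIT_SLEEP;
}

/* Hypothetical stand-in for an event: becomes true on the Nth poll. */
typedef struct {
    long polls;
    long ready_at;
} poll_counter;

int counter_ready(void *arg)
{
    poll_counter *c = arg;
    return ++c->polls >= c->ready_at;
}
```

Lowering spin_count in this model moves a slow event from the spin phase into the yield or sleep phase sooner, which mirrors the trade-off described for shared systems: less time wasted spinning, at the cost of putting the thread to sleep and re-awakening it.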
Schedule Type and Chunksize Settings
o The choice of settings for the schedule type and the
chunksize can affect the performance of the resulting parallelized application,
either positively or negatively.
o Choosing inappropriate settings for the schedule type and
the chunksize can degrade the performance of a parallelized application to the
point where it performs as badly as, or worse than, it would if it were serialized.
o The general guidelines are as follows:
o Smaller chunksize values generally perform faster than
larger ones. The chunksize should be less than or equal to the value
derived by dividing the number of iterations by the number of available threads.
o The dynamic and guided schedule types are better suited to target machines
that also run workloads other than the parallelized application. These
types assign iterations to threads as the threads become available; if a processor (or
processors) becomes tied up with other applications, the available threads
pick up the next iterations.
o Although the runtime schedule type does facilitate tuning of the schedule type
at run time, it incurs a minor performance penalty in run-time overhead.
o An effective means of determining appropriate settings for
schedule and chunksize can be to set the schedule to runtime and
experiment with various schedule and chunksize pairs through the OMP_SCHEDULE
environment variable. After the exercise, explicitly set the schedule and
chunksize to the values that yielded the best performance.
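The tuning approach in the last guideline can be sketched in C. This is an illustrative example rather than code from the original material: the function names and the loop body are hypothetical. The loop uses schedule(runtime) so that the schedule and chunksize come from OMP_SCHEDULE, and the helper computes the upper bound on chunksize given by the first guideline.

```c
/* Upper bound from the guideline: iterations / threads, but at least 1. */
long chunk_upper_bound(long iterations, long threads)
{
    if (threads < 1)
        threads = 1;
    long bound = iterations / threads;
    return bound > 0 ? bound : 1;
}

/* Sums i*i for i = 1..n. The schedule is taken from OMP_SCHEDULE at run
   time, so different settings can be timed without recompiling. */
long long sum_of_squares(long n)
{
    long long total = 0;
    long i;

    #pragma omp parallel for schedule(runtime) reduction(+:total)
    for (i = 1; i <= n; i++)
        total += (long long)i * i;

    return total;
}
```

A tuning session would time the program with OMP_SCHEDULE set to values such as dynamic,4 or guided,16, keeping each chunksize at or below chunk_upper_bound for the loop. Once the best pair is known, the schedule(runtime) clause would be replaced with that explicit schedule to avoid the run-time overhead.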
Additional Performance Considerations
o Note that the schedule and chunksize settings are only two of
the many factors that can affect the performance of your application. Some of
the other areas that can affect performance include:
o Availability of system resources: CPUs on the target machine
that spend time processing other applications are not available to the
parallelized application.
o Structure of parallelized code: Threads of a parallelized
region may perform disproportionate amounts of work.
o Use of implicit and explicit barriers: Parallelized regions
that force synchronization of all threads at these explicit or implicit points
may cause the application to suspend while waiting for a thread (or threads).
o Use of critical sections versus atomic statements: Using critical sections
incurs more overhead than using atomic statements.
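The last point can be illustrated with two C versions of the same shared-counter update. This is a minimal sketch with hypothetical function names: both functions return the same result, but the first protects the increment with an atomic statement and the second with a critical section, which is the higher-overhead choice for a simple update like this one.

```c
/* Counts even values, protecting the shared counter with atomic. */
long count_even_atomic(const int *values, long n)
{
    long count = 0;
    long i;

    #pragma omp parallel for shared(count)
    for (i = 0; i < n; i++) {
        if (values[i] % 2 == 0) {
            #pragma omp atomic
            count++;
        }
    }
    return count;
}

/* Same computation, protecting the counter with a critical section. */
long count_even_critical(const int *values, long n)
{
    long count = 0;
    long i;

    #pragma omp parallel for shared(count)
    for (i = 0; i < n; i++) {
        if (values[i] % 2 == 0) {
            #pragma omp critical
            {
                count++;
            }
        }
    }
    return count;
}
```

An atomic statement applies only to a single simple update of one memory location, so a critical section remains necessary when the protected region contains several statements.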
Implementation-Specific Behavior
o The OpenMP specification identifies several features and
default values as implementation-specific. This implementation behaves as follows:
o Support for nested parallel regions - Whenever a nested parallel region is encountered,
a team consisting of one thread is created to execute that region.
o Default value for OMP_SCHEDULE - The default value
is dynamic,1. If an application uses the run-time schedule but OMP_SCHEDULE
is not defined, then this value is used.
o Default value for OMP_NUM_THREADS - The default value
is equal to the number of processors on the machine.
o Default value for OMP_DYNAMIC - The default value
is 0. Note that this implementation does not support dynamic adjustments to the
thread count. Attempts to set a nonzero value with omp_set_dynamic have no effect
on the run-time environment.
o Default schedule - When a for or parallel for loop does not contain a
schedule clause, a dynamic schedule type is used with the chunksize set to 1.
o Flush directive - The flush directive, when encountered, flushes all
variables, even if one or more variables are specified in the directive.
References and Revisions
© Copyright 2000, Compaq, All Rights Reserved
For more information on Compaq’s OpenMP, contact Frank.Pietryka@Compaq.com
011100 – Pietryka – Created from Compaq Fortran User Manual for Tru64 UNIX and Linux Systems
and the Tru64 UNIX Programmer's Guide for C (Version 5.0, Dec 99)