OpenMP on Compaq Systems

 

 

 

Based on the Compaq Fortran User Manual for Tru64 UNIX and Linux Systems and the Tru64 UNIX Programmer's Guide (C)

 

 

Compiling and Linking Fortran Programs

To enable the use of OpenMP Fortran API compiler directives in your program, you must include the -omp compiler option on your f90 command:

 

% f90 -omp prog.f -o prog

 

 

Compiling and Linking C Programs

To enable the use of OpenMP C API compiler directives in your program, you must include the -omp compiler option on your cc command:

 

% cc -omp prog.c -o prog
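As an illustration, a source file compiled with the cc command above might contain a function like the following. This is a minimal sketch, not taken from the Compaq manuals; the name sum_to is hypothetical. If the file is compiled without -omp, the pragma is ignored and the loop runs serially with the same result.

```c
/* prog.c: hypothetical example; sum_to is an illustrative name. */
long sum_to(int n)
{
    long sum = 0;
    int i;

    /* Each thread accumulates a private partial sum; the partial sums
       are combined into sum when the loop ends. */
#pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= n; i++)
        sum += i;

    return sum;
}
```

Calling sum_to(100) adds the integers from 1 through 100, whether or not the loop is parallelized.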

 

 

Compiler Options

 

 

o      -omp

 

Causes the compiler to recognize only OpenMP manual decomposition pragmas and to ignore old-style manual decomposition directives. (Note that the -mp and -omp switches are the same except for their treatment of old-style manual decomposition directives; -mp recognizes the old-style directives and -omp does not.)

 

o      -granularity  size

Controls the size of shared data in memory that can be safely accessed from different threads. Valid values for size are: byte, longword, and quadword:

byte

Ensures that all data of one byte or greater can be accessed safely from different threads sharing data in memory. This option will slow run-time performance.

longword

Ensures that naturally aligned data of four bytes or greater can be accessed safely from different threads sharing access to that data in memory. Accessing data items of three bytes or less and unaligned data may result in data items written from multiple threads being inconsistently updated.

quadword

Ensures that naturally aligned data of eight bytes can be accessed safely from different threads sharing data in memory. Accessing data items of seven bytes or less and unaligned data may result in data items written from multiple threads being inconsistently updated. This is the default.
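The following sketch shows one way to reason about which shared data is affected by the granularity setting. It is only an illustration: the helper same_quadword and the two arrays are hypothetical names, and the comments assume a system that aligns uint64_t on eight-byte boundaries.

```c
#include <stdint.h>

/* Returns 1 when both addresses fall inside the same naturally aligned
   8-byte unit (a quadword), 0 otherwise. */
int same_quadword(const void *a, const void *b)
{
    return (uintptr_t)a / 8 == (uintptr_t)b / 8;
}

/* Hypothetical per-thread data. Neighbouring elements of byte_flags can
   share a quadword, so updating them from different threads calls for
   -granularity byte. */
unsigned char byte_flags[8];

/* Each element of wide_flags is eight bytes long. Assuming eight-byte
   alignment, every element occupies its own quadword, which matches the
   default quadword granularity. */
uint64_t wide_flags[8];
```

Giving each thread its own quadword-sized element is one way to keep the default granularity; the alternative is to compile with a finer granularity and accept the run-time cost.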

o      -check_omp

Enables run-time checking of certain OpenMP constructs. This includes run-time detection of invalid nesting and other invalid OpenMP cases. When invalid nesting is discovered at run time and this switch is set, the executable will fail with a Trace/BPT trap. If this switch is not set and invalid nesting is discovered, the behavior is indeterminate (for example, an executable may hang).

The compiler detects the following invalid nesting conditions:

o      Entering a for, single, or sections directive if already in a work-sharing construct, critical section, or a master

 

o      Executing a barrier directive if already in a work-sharing construct, a critical section, or a master

 

o      Executing a master directive if already in a work-sharing construct

 

o      Executing an ordered directive if already in a critical section

 

o      Executing an ordered directive unless already in an ordered for

The default is disabled run-time checking.
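As an illustration of the nesting rules listed above, the sketch below keeps the barrier outside the critical section. The function name count_arrivals is hypothetical, and the code is an assumption about typical usage rather than an example from the Compaq manuals.

```c
/* Counts the threads of the team, then lets one thread record the total.
   count_arrivals is an illustrative name. */
int count_arrivals(void)
{
    int arrivals = 0;
    int total = 0;

#pragma omp parallel
    {
#pragma omp critical
        {
            arrivals++;
            /* A barrier directive placed here, inside the critical
               section, would be one of the invalid nesting cases. */
        }

        /* Valid: the barrier is outside the critical section. */
#pragma omp barrier

#pragma omp master
        total = arrivals;
    }

    return total;
}
```

Without -omp the pragmas are ignored and the function returns 1; with -omp it returns the size of the team.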

 

 

Adjusting the Run-Time Environment

 

o      The OpenMP API and the Compaq parallel compiler directive sets also provide environment variables that adjust the run-time environment in unusual situations.

 

o      Regardless of whether you used the -omp or the -mp compiler option, when the compiler needs information supplied by an environment variable, the compiler first looks for an OpenMP API environment variable and then for a Compaq parallel compiler environment variable.

 

o      If neither one is found, the compiler uses a default.

 

o      The compiler looks for environment variable information in the following situations:

o      When entering a parallel region, it looks for the number of threads (OMP_NUM_THREADS or MP_THREAD_COUNT), the spin count (MP_SPIN_COUNT), the yield count (MP_YIELD_COUNT), and the stack size (MP_STACK_SIZE).

 

o      When entering a DO or PARALLEL DO directive that has RUNTIME specified, it looks at schedule type (OMP_SCHEDULE).

 

o      When entering a worksharing directive, it looks at chunk size (MP_CHUNK_SIZE).
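The lookup order described above (OpenMP variable first, then the Compaq variable, then a default) can be sketched for the thread count as follows. This is not the run-time library's code; the function names are hypothetical, and treating an unparsable value as "not set" is an assumption made for the sketch.

```c
#include <limits.h>
#include <stdlib.h>

/* Returns the positive integer in text, or 0 if text is NULL or does
   not start with a positive integer that fits in an int. */
static int parse_positive(const char *text)
{
    char *end;
    long n;

    if (text == NULL)
        return 0;
    n = strtol(text, &end, 10);
    if (end == text || n <= 0 || n > INT_MAX)
        return 0;
    return (int)n;
}

/* Applies the precedence: OpenMP value, then Compaq value, then default. */
int resolve_thread_count(const char *omp_value, const char *mp_value,
                         int default_threads)
{
    int n = parse_positive(omp_value);

    if (n == 0)
        n = parse_positive(mp_value);
    return n != 0 ? n : default_threads;
}

/* Reads the two variables from the process environment. */
int thread_count_from_environment(int default_threads)
{
    return resolve_thread_count(getenv("OMP_NUM_THREADS"),
                                getenv("MP_THREAD_COUNT"),
                                default_threads);
}
```

With OMP_NUM_THREADS set to 3 and MP_THREAD_COUNT set to 7, the sketch selects 3; with only MP_THREAD_COUNT set, it selects 7.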

 

 

OpenMP API Environment Variables

The OpenMP API environment variables and their interpretations are listed below:

OMP_SCHEDULE

 

This variable applies only to DO and PARALLEL DO directives that have the schedule type of RUNTIME. You can set the schedule type and an optional chunk size for these loops at run time. The schedule types are STATIC, DYNAMIC, GUIDED, and RUNTIME.

For directives that have a schedule type other than RUNTIME, this variable is ignored. The compiler default schedule type is STATIC. If the optional chunk size is not set, a chunk size of one is assumed, except for the STATIC schedule type. For this schedule type, the default chunk size is set to the loop iteration space divided by the number of threads applied to the loop.

OMP_NUM_THREADS

 

Use this environment variable to set the number of threads to use during execution. This number applies unless you explicitly change it by calling the OMP_SET_NUM_THREADS run-time library routine.

When you have enabled dynamic thread adjustment, the value assigned to this environment variable represents the maximum number of threads that can be used. The default value is the number of processors in the current system. For more information about dynamic thread adjustment, see the online release notes.

OMP_DYNAMIC

 

Use this environment variable to enable or disable dynamic thread adjustment for the execution of parallel regions. When set to TRUE, the number of threads used can be adjusted by the run-time environment to best utilize system resources. When set to FALSE, dynamic adjustment is disabled. The default is FALSE. For more information about dynamic thread adjustment, see the online release notes.

OMP_NESTED

 

Use this environment variable to enable or disable nested parallelism. When set to TRUE, nested parallelism is enabled. When set to FALSE, it is disabled. The default is FALSE. For more information about nested parallelism, see the online release notes.
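The OMP_NUM_THREADS entry above notes that a call to the run-time library routine overrides the environment variable. A minimal C sketch of that override follows; the function name team_size_with is hypothetical, and the _OPENMP guards are an assumption that lets the same file compile without OpenMP support.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Requests a team size and reports the size actually used by the next
   parallel region. Returns 1 when compiled without OpenMP support. */
int team_size_with(int requested)
{
    int size = 1;

#ifdef _OPENMP
    /* Overrides any value taken from OMP_NUM_THREADS. */
    omp_set_num_threads(requested);
#else
    (void)requested;
#endif

#pragma omp parallel
    {
#pragma omp master
        {
#ifdef _OPENMP
            size = omp_get_num_threads();
#endif
        }
    }

    return size;
}
```

The value returned is never larger than the requested value, but it can be smaller if the system cannot supply that many threads.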

 

 

Compaq Environment Variables

The Compaq parallel compiler environment variables and their interpretations are listed below:

MP_THREAD_COUNT

 

Specifies the number of threads the run-time system is to create. The default is the number of processors available to your process.

MP_CHUNK_SIZE

 

Specifies the chunk size the run-time system uses when dispatching loop iterations to threads if the program specified the RUNTIME schedule type or specified another schedule type requiring a chunk size, but omitted the chunk size. The default chunk size is 1.

MP_STACK_SIZE

 

Specifies how many bytes of stack space the runtime system allocates for each thread when creating it. If you specify zero, the runtime system uses the default, which is very small. Therefore, if a program declares any large arrays to be PRIVATE, specify a value large enough to allocate them. If you do not use this environment variable at all, the runtime system allocates 5 MB.

MP_SPIN_COUNT

 

Specifies how many times the runtime system spins while waiting for a condition to become true. The default is 16,000,000, which is approximately one second of CPU time.

 

When one of the threads needs to wait for an event caused by some other thread, a three-level process begins. First, the thread spins for a number of iterations, waiting for the event to occur. Second, it yields the processor to other threads a number of times, checking each time whether the event has occurred. Finally, it posts a request to be awakened and goes to sleep; when another thread causes the event to occur, that thread awakens the sleeping thread. If your application is running stand-alone, the default settings of MP_SPIN_COUNT and MP_YIELD_COUNT will give good performance. If your application needs to share the processors with others, it is probably appropriate to reduce MP_SPIN_COUNT. This makes the threads waste less time spinning and give up the processor sooner; the cost is the extra time needed to put a thread to sleep and reawaken it. In such a shared environment, an MP_SPIN_COUNT of about 1000 might be a good choice. Usually MP_YIELD_COUNT does not need to be adjusted.

MP_YIELD_COUNT

 

Specifies how many times the runtime system alternates between calling sched_yield and testing the condition before going to sleep by waiting for a thread condition variable. The default is 10.
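The three-level wait described under MP_SPIN_COUNT (spin, then yield, then sleep) can be sketched with POSIX threads and C11 atomics. This is an illustration under stated assumptions, not the Compaq run-time library's implementation; struct event, wait_for_event, and signal_event are hypothetical names.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical event object shared by the waiting and signaling threads. */
struct event {
    atomic_int      occurred;
    pthread_mutex_t lock;
    pthread_cond_t  wakeup;
};

/* Waits for the event. Returns the level at which the event was seen:
   1 = while spinning, 2 = while yielding, 3 = in the sleep path. */
int wait_for_event(struct event *ev, long spin_count, long yield_count)
{
    long i;

    /* Level 1: spin, testing the condition (compare MP_SPIN_COUNT). */
    for (i = 0; i < spin_count; i++)
        if (atomic_load(&ev->occurred))
            return 1;

    /* Level 2: alternate between sched_yield and testing the
       condition (compare MP_YIELD_COUNT). */
    for (i = 0; i < yield_count; i++) {
        sched_yield();
        if (atomic_load(&ev->occurred))
            return 2;
    }

    /* Level 3: sleep on a condition variable until awakened. */
    pthread_mutex_lock(&ev->lock);
    while (!atomic_load(&ev->occurred))
        pthread_cond_wait(&ev->wakeup, &ev->lock);
    pthread_mutex_unlock(&ev->lock);
    return 3;
}

/* Marks the event as having occurred and awakens any sleeping threads. */
void signal_event(struct event *ev)
{
    pthread_mutex_lock(&ev->lock);
    atomic_store(&ev->occurred, 1);
    pthread_cond_broadcast(&ev->wakeup);
    pthread_mutex_unlock(&ev->lock);
}
```

A smaller spin_count moves a waiting thread to the yield and sleep levels sooner, which is the trade-off to consider when the processors are shared with other applications.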

 

 

Schedule Type and Chunksize Settings

o      The choice of settings for the schedule type and the chunksize can affect the ultimate performance of the resulting parallelized application, either positively or negatively.

 

o      Choosing inappropriate settings for the schedule type and the chunksize can degrade the performance of a parallelized application to the point where it performs as badly as, or worse than, it would if it were serialized.

 

o      The general guidelines are as follows:

o      Smaller chunksize values generally perform faster than larger. The values for the chunksize should be less than or equal to the values derived by dividing the number of iterations by the number of available threads.

 

o      The behavior of the dynamic and guided schedule types makes them better suited for target machines that run a variety of workloads in addition to the parallelized application. These types assign iterations to threads as the threads become available; if a processor (or processors) becomes tied up with other applications, the available threads will pick up the next iterations.

 

o      Although the runtime schedule type does facilitate tuning of the schedule type at run time, it results in a minor performance penalty in run-time overhead.

 

o      An effective means of determining appropriate settings for schedule and chunksize can be to set the schedule to runtime and experiment with various schedule and chunksize pairs through the OMP_SCHEDULE environment variable. After the exercise, explicitly set the schedule and chunksize to the values that yielded the best performance.
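The tuning approach in the last guideline can be sketched in C as follows. The function scale_array and the schedule values in the comments are hypothetical; they only illustrate the workflow.

```c
/* Hypothetical workload: multiplies every element of a by factor. */
void scale_array(double *a, int n, double factor)
{
    int i;

    /* While tuning, the schedule type and chunksize are taken from
       OMP_SCHEDULE at run time, for example "dynamic,4" or "guided,16",
       so different pairs can be timed without recompiling. */
#pragma omp parallel for schedule(runtime)
    for (i = 0; i < n; i++)
        a[i] *= factor;

    /* After tuning, replace schedule(runtime) with the best pair found,
       for example schedule(guided, 16), to avoid the run-time lookup. */
}
```

Each timing run sets OMP_SCHEDULE to a different pair before starting the program; the results are the same for every pair, and only the elapsed time changes.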

 

Additional Performance Considerations

o      Note that the schedule and chunksize settings are only two of the many factors that can affect the performance of your application. Some of the other areas that can affect performance include:

o      Availability of system resources: CPUs on the target machine spending time processing other applications are not available to the parallelized application.

 

o      Structure of parallelized code: Threads of a parallelized region may perform disproportionate amounts of work.

 

o      Use of implicit and explicit barriers: Parallelized regions that force synchronization of all threads at these explicit or implicit points may cause the application to suspend while waiting for a thread (or threads).

 

o      Use of critical sections versus atomic statements: Using critical sections incurs more overhead than using atomic statements.
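The difference between critical sections and atomic statements can be seen in a simple shared counter. The two functions below are a sketch with hypothetical names; both produce the same count, and the atomic form is the lighter one for a single update like this.

```c
/* Counts the even values in v, protecting the shared counter with an
   atomic statement. */
int count_even_atomic(const int *v, int n)
{
    int count = 0;
    int i;

#pragma omp parallel for
    for (i = 0; i < n; i++) {
        if (v[i] % 2 == 0) {
#pragma omp atomic
            count++;
        }
    }
    return count;
}

/* Same computation, protecting the counter with a critical section.
   A critical section can guard any block of statements, at the cost of
   more overhead. */
int count_even_critical(const int *v, int n)
{
    int count = 0;
    int i;

#pragma omp parallel for
    for (i = 0; i < n; i++) {
        if (v[i] % 2 == 0) {
#pragma omp critical
            {
                count++;
            }
        }
    }
    return count;
}
```

A critical section remains necessary when the protected update spans several statements or does not fit one of the forms that atomic accepts.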

 

Implementation Specific Behavior

 

o      The OpenMP specification identifies several features and default values as implementation-specific.

 

 

o      Support for nested parallel regions  - Whenever a nested parallel region is encountered, a team consisting of one thread is created to execute that region.

 

 

o      Default value for OMP_SCHEDULE  - The default value is dynamic,1. If an application uses the run-time schedule but OMP_SCHEDULE is not defined, then this value is used.

 

o      Default value for OMP_NUM_THREADS  - The default value is equal to the number of processors on the machine.

 

o      Default value for OMP_DYNAMIC  - The default value is 0. Note that this implementation does not support dynamic adjustments to the thread count. Calling omp_set_dynamic with a nonzero value has no effect on the run-time environment.

 

o      Default schedule  - When a for or parallel for loop does not contain a schedule clause, a dynamic schedule type is used with the chunksize set to 1.

 

 

o      Flush directive  - The flush directive, when encountered, will flush all variables, even if one or more variables are specified in the directive.
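The nested-region behavior listed above can be observed with a small test such as the following sketch. The function name inner_team_size is hypothetical, and the _OPENMP guards are an assumption that lets the file compile without OpenMP support.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Starts a parallel region inside another one and reports the size of
   the inner team. Returns 1 when compiled without OpenMP support. */
int inner_team_size(void)
{
    int inner = 1;

#pragma omp parallel
    {
        /* Only the master thread of the outer team starts the nested
           region, so inner is written by a single thread. */
#pragma omp master
        {
#pragma omp parallel
            {
#pragma omp master
                {
#ifdef _OPENMP
                    inner = omp_get_num_threads();
#endif
                }
            }
        }
    }

    return inner;
}
```

On the implementation described here, the function would be expected to return 1; other implementations may return a larger value when nested parallelism is enabled.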

 

 

References and Revisions

 

 

© Copyright 2000, Compaq, All Rights Reserved

 

For more information on Compaq’s OpenMP, contact Frank.Pietryka@Compaq.com

 

011100 – Pietryka – Created from Compaq Fortran User Manual for Tru64 UNIX and Linux Systems and the Tru64 Unix Programmers Guide for C (Version 5.0 Dec 99)