The XLSMPOPTS environment variable allows you to specify options that affect SMP execution. You can declare XLSMPOPTS by using the following bash command format:
XLSMPOPTS=["]runtime_option_name=option_setting[:runtime_option_name=option_setting]...["]
You can specify option names and settings in uppercase or lowercase.
You can add blanks before and after the colons and equal signs to
improve readability. However, if the XLSMPOPTS option
string contains embedded blanks, you must enclose the entire option
string in double quotation marks (").
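For example, the following setting contains embedded blanks, so the entire option string must be quoted (the option values shown are only illustrative):
XLSMPOPTS="schedule=dynamic=10 : parthds=4"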
You can specify the following runtime options with the
XLSMPOPTS environment
variable:
- Scheduling options
- When the SMP run time is used
to divide the iteration space of a loop, either through auto-parallelization
or OpenMP FOR/DO loops,
a scheduling algorithm is used to assign iterations to the threads
in the parallel region. Each thread receives and executes a contiguous
range of iterations, which is called a block or a chunk. Threads might
finish their blocks of work at different speeds. After completing
the assigned work, threads can be assigned more work or go to sleep.
The chunk size can be controlled in some algorithms; doing so is a
trade-off between overhead and load balancing.
- schedule=static[=n]
- The iteration space is divided into blocks of n contiguous
iterations. The final block might have fewer than n iterations.
If n is unspecified, its default value is FLOOR(number_of_iterations / number_of_threads). The
first REMAINDER(number_of_iterations/number_of_threads) chunks
have one more iteration. Each thread is assigned a separate chunk.
The
blocks are assigned in a round-robin fashion to threads in the parallel
region until there are no remaining blocks. A thread that completes
all its blocks goes to sleep. This is also known as block-cyclic scheduling,
or cyclic scheduling when n has the value 1.
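For example, with 103 iterations and 4 threads, the default chunk size is FLOOR(103/4) = 25, and the first REMAINDER(103/4) = 3 chunks receive one extra iteration, so the four threads are assigned chunks of 26, 26, 26, and 25 iterations.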
- schedule=dynamic[=n]
- The iteration space is divided into chunks that contain n contiguous
iterations each. The final chunk might contain fewer than n iterations.
If n is not specified, the chunk contains one iteration.
Each
thread is initially assigned one chunk. After threads complete their
assigned chunks, they are assigned remaining chunks on a "first-come,
first-do" basis.
- schedule=affinity[=n]
- The iteration space is divided into number-of-threads-in-parallel-region partitions.
Each partition has CEILING(number-of-iterations / number-of-threads-in-parallel-region) contiguous
iterations. The final partition might have fewer iterations. The partitions
are further divided into blocks, each with n iterations.
If n is unspecified, its default value is CEILING(
number-of-iterations-in-partition / 2 ); that is, each partition
is divided into two blocks.
Each thread is assigned a partition.
Each thread completes blocks within its local partition until no blocks
remain in its partition. If blocks remain in other partitions, but
a thread completes all blocks in its local partition, the thread might
complete blocks in another thread's partition. A thread goes to sleep
if it completes its blocks and no blocks remain.
Note: This
option has been deprecated and might be removed in a future release.
You can use the guided option for a similar
functionality.
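For example, with 16 iterations and 4 threads, schedule=affinity creates 4 partitions of CEILING(16/4) = 4 contiguous iterations each; because n is unspecified, each partition is divided into two blocks of 2 iterations.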
- schedule=guided[=n]
- The iteration space is divided into blocks of successively smaller
size. Each block is sized to the larger of n and CEILING(
number-of-iterations-remaining / number-of-threads-in-parallel-region). The final chunk might contain fewer than n iterations. If n is
unspecified, its default value is 1.
Each thread
is initially assigned one block. As threads complete their work, they
are assigned remaining blocks on a "first-come, first-served" basis.
A thread goes to sleep if it completes its blocks and no blocks remain.
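For example, with 100 iterations, 4 threads, and the default n of 1, the first block has CEILING(100/4) = 25 iterations; 75 iterations then remain, so the next block has CEILING(75/4) = 19 iterations, the one after that CEILING(56/4) = 14, and so on, down to blocks of a single iteration.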
- schedule=auto
- The compiler and runtime might select any algorithm to assign
work to threads. A different algorithm might be selected for different
loops. In addition, a different algorithm might be selected if the
run time is updated.
The OMP_SCHEDULE environment variable affects only the constructs that have the schedule(runtime) clause specified.
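For example, the loop in the following sketch (a hypothetical subroutine; the array arguments are assumed to be valid) specifies the schedule(runtime) clause, so its scheduling algorithm is taken from OMP_SCHEDULE rather than from XLSMPOPTS:
      SUBROUTINE ADDVEC(A, B, C, N)
      INTEGER N, I
      REAL A(N), B(N), C(N)
! The SCHEDULE(RUNTIME) clause defers the scheduling choice to OMP_SCHEDULE.
!$OMP PARALLEL DO SCHEDULE(RUNTIME)
      DO I = 1, N
         A(I) = B(I) + C(I)
      END DO
!$OMP END PARALLEL DO
      END SUBROUTINE
Setting, for example, OMP_SCHEDULE=dynamic,10 in the environment makes this loop use dynamic scheduling with a chunk size of 10.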
- Parallel execution options
-
- parthds=num
- Specifies the number of threads (num) to be used for parallel execution of code that you compiled with the -qsmp option. This option gives you full control over the number of execution threads. If you did not specify -qsmp, the default value for num is 1; otherwise, it is the number of online processors on the machine. Some applications cannot use more than some maximum number of processors, and some applications can achieve performance gains if they use more threads than there are processors. For more information, see the NUM_PARTHDS intrinsic function.
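For example, the following setting runs parallelized code with four threads, regardless of the number of online processors:
XLSMPOPTS=parthds=4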
- usrthds=num
- Specifies the maximum number of threads (num) that you expect your code to create explicitly, if the code does explicit thread creation. The default value for num is 0. For more information, see the NUM_PARTHDS intrinsic function in the XL Fortran Language Reference.
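For example, if you expect your code to explicitly create up to four threads of its own, you might specify:
XLSMPOPTS=usrthds=4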
- stack=num
- Specifies the largest amount of space in bytes (num) that a thread's stack needs. The default value for num is 4194304.
Set stack=num so that it is within the acceptable upper limit. num can be up to the limit imposed by system resources or the stack size ulimit, whichever is smaller. An application that exceeds the upper limit might cause a segmentation fault.
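For example, the following setting reserves 8 MB (8388608 bytes) of stack space for each thread:
XLSMPOPTS=stack=8388608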
- stackcheck[=num]
- Enables stack overflow checking for worker threads at run time. num is the size in bytes that you specify, and it must be a positive number. When the remaining stack size is less than num, a runtime warning message is issued. If you do not specify a value for num, the default value is 4096 bytes. This option takes effect only when -qsmp=stackcheck has also been specified at compile time. See -qsmp for more information.
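For example, after compiling with -qsmp=stackcheck, the following setting issues a runtime warning when the remaining stack of a worker thread drops below 16384 bytes:
XLSMPOPTS=stackcheck=16384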
- startproc=cpu_id
- Enables thread binding and specifies the cpu_id to
which the first thread binds. If the value provided is outside the
range of available processors, the SMP run time issues a warning message
and no threads are bound.
- procs=cpu_id[,cpu_id,...]
- Enables thread binding and specifies a list of cpu_ids to which the threads are bound.
- stride=num
- Specifies the increment used to determine the cpu_id to
which subsequent threads bind. num must
be greater than or equal to 1. If the value provided causes a thread
to bind to a CPU outside the range of available processors, a warning
message is issued and no threads are bound.
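For example, on a machine with at least eight online processors, either of the following settings binds four threads to CPUs 0, 2, 4, and 6:
XLSMPOPTS="parthds=4 : procs=0,2,4,6"
XLSMPOPTS="parthds=4 : startproc=0 : stride=2"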
- Performance tuning options
- When a thread
completes its work and there is no new work to do, it can go into
either a "busy-wait" state or a "sleep" state. In "busy-wait", the
thread keeps executing in a tight loop looking for additional new
work. This state is highly responsive but harms the overall utilization
of the system. When a thread sleeps, it completely suspends execution
until another thread signals it that there is work to do. This state
provides better utilization of the system but introduces extra overhead
for the application.
The xlsmp runtime
library routines use both "busy-wait" and "sleep" states in their
approach to waiting for work. You can control these states with the spins, yields,
and delays options.
During the busy-wait search for
work, the thread repeatedly scans the work queue up to num times,
where num is the value that you specified
for the option spins. If a thread cannot
find work during a given scan, it intentionally wastes cycles in a
delay loop that executes num times, where num is
the value that you specified for the option delays.
This delay loop consists of a single meaningless iteration. The length
of actual time this takes will vary among processors. If the value of spins is
exceeded and the thread still cannot find work, the thread will yield
the current time slice (time allocated by the processor to that thread)
to the other threads. The thread will yield its time slice up to num times,
where num is the number that you specified
for the option yields. If this value is
exceeded, the thread will go to sleep.
In summary, the ordered
approach to looking for work consists of the following steps:
- Scan the work queue for up to spins number
of times. If no work is found in a scan, then loop delays number
of times before starting a new scan.
- If work has not been found, then yield the current time slice.
- Repeat the above steps up to yields number
of times.
- If work has still not been found, then go to sleep.
The syntax for specifying these options is as follows:
- spins[=num]
- where num is the number of spins before
a yield. The default value for spins is 100.
- yields[=num]
- where num is the number of yields before
a sleep. The default value for yields is 10.
- delays[=num]
- where num is the number of delays while
busy-waiting. The default value for delays is 500.
Zero is a special value for spins and yields,
as it can be used to force complete busy-waiting. Normally, in a benchmark
test on a dedicated system, you would set both options to zero. However,
you can set them individually to achieve other effects.
For instance, on a dedicated 8-way SMP, setting these options as follows:
parthds=8 : schedule=dynamic=10 : spins=0 : yields=0
results in one thread per CPU, with each thread assigned chunks consisting of 10 iterations each, with busy-waiting when there is no immediate work to do.
- Options to enable and control dynamic profiling
- You can use dynamic profiling to reevaluate the compiler's decision
to parallelize loops in a program. The three options that you can use to
do this are parthreshold, seqthreshold,
and profilefreq.
- parthreshold=num
- Specifies the
time, in milliseconds, below which each loop must execute serially.
If you set parthreshold to 0, every loop
that has been parallelized by the compiler will execute in parallel.
The default setting is 0.2 milliseconds, meaning that if a loop requires
less than 0.2 milliseconds to execute in parallel, it should be serialized.
Typically, parthreshold is set to be
equal to the parallelization overhead. If the computation in a parallelized
loop is very small and the time taken to execute these loops is spent
primarily in the setting up of parallelization, these loops should
be executed sequentially for better performance.
- seqthreshold=num
- Specifies the
time, in milliseconds, beyond which a loop that was previously serialized
by the dynamic profiler should revert to being a parallel loop. The
default setting is 5 milliseconds, meaning that if a loop requires
more than 5 milliseconds to execute serially, it should be parallelized.
seqthreshold acts as the reverse of parthreshold.
- profilefreq=num
- Specifies the
frequency with which a loop should be revisited by the dynamic profiler
to determine its appropriateness for parallel or serial execution.
Loops in a program can be data dependent. The loop that was chosen
to execute serially with a pass of dynamic profiling might benefit from
parallelization in subsequent executions of the loop, due to different
data input. Therefore, you need to examine these loops periodically
to reevaluate the decision to serialize a parallel loop at run time.
The allowed values for this option are the numbers from 0 to 32.
If you set
profilefreq to one of these values,
the following results will occur.
- If profilefreq is 0, all profiling is
turned off, regardless of other settings. The overheads that occur
because of profiling will not be present.
- If profilefreq is 1, loops parallelized
automatically by the compiler will be monitored every time they are
executed.
- If profilefreq is 2, loops parallelized
automatically by the compiler will be monitored every other time they
are executed.
- If profilefreq is a value n from 2 through 32, each loop will be monitored
once every nth time it is executed.
- If profilefreq is greater than 32, then
32 is assumed.
It is important to note that dynamic profiling is not
applicable to user-specified parallel loops (for example, loops for
which you specified the PARALLEL DO directive).
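For example, the following setting serializes parallelized loops that take less than 0.5 milliseconds to execute, reverts serialized loops that take more than 10 milliseconds back to parallel execution, and monitors each candidate loop once every fourth execution (the threshold values shown are only illustrative):
XLSMPOPTS="parthreshold=0.5 : seqthreshold=10 : profilefreq=4"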