XLSMPOPTS

The XLSMPOPTS environment variable allows you to specify options that affect SMP execution. You can declare XLSMPOPTS by using the following ksh command format:
                      .-:-------------------------------------------.          
                      V                                             |          
>>-XLSMPOPTS=--+---+----runtime_option_name-- =----option_setting---+--+---+-><
               '-"-'                                                   '-"-'   

You can specify option names and settings in uppercase or lowercase. You can add blanks before and after the colons and equal signs to improve readability. However, if the XLSMPOPTS option string contains embedded blanks, you must enclose the entire option string in double quotation marks (").
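
For example, the following ksh command sets two runtime options, adding blanks around the separators for readability. Because the option string contains blanks, it must be enclosed in double quotation marks:
export XLSMPOPTS="parthds = 4 : schedule = dynamic = 10"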

You can specify the following runtime options with the XLSMPOPTS environment variable:
Scheduling options
When the SMP run time is used to divide the iteration space of a loop, either through auto-parallelization or OpenMP FOR/DO loops, a scheduling algorithm is used to assign iterations to the threads in the parallel region. Each thread receives and executes a contiguous range of iterations, which is called a block or a chunk. Threads might finish their blocks of work at different speeds. After completing the assigned work, threads can be assigned more work or go to sleep. The chunk size can be controlled in some algorithms; doing so is a trade-off between overhead and load balancing.
schedule=static[=n]
The iteration space is divided into blocks of n contiguous iterations. The final block might have fewer than n iterations. If n is unspecified, its default value is FLOOR(number_of_iterations / number_of_threads); the first REMAINDER(number_of_iterations, number_of_threads) chunks have one extra iteration, and each thread is assigned exactly one chunk.

The blocks are assigned in a round-robin fashion to threads in the parallel region until there are no remaining blocks. A thread that completes all its blocks goes to sleep. This is also known as block-cyclic scheduling, or cyclic scheduling when n has the value 1.
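
For example, with a loop of 1000 iterations, four threads, and the setting
XLSMPOPTS=schedule=static=100
the iteration space is divided into ten blocks of 100 iterations, assigned round-robin: thread 0 receives the first, fifth, and ninth blocks; thread 1 the second, sixth, and tenth; thread 2 the third and seventh; and thread 3 the fourth and eighth. If n is left unspecified, each thread instead receives a single block of 250 contiguous iterations.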

schedule=dynamic[=n]
The iteration space is divided into chunks that contain n contiguous iterations each. The final chunk might contain fewer than n iterations. If n is not specified, the chunk contains one iteration.

Each thread is initially assigned one chunk. After threads complete their assigned chunks, they are assigned remaining chunks on a "first-come, first-served" basis.
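
For example, with the following setting, a loop of 1000 iterations is divided into 100 chunks of 10 iterations each; each thread takes the next unassigned chunk as soon as it finishes its current one:
XLSMPOPTS=schedule=dynamic=10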

schedule=affinity[=n]
The iteration space is divided into number-of-threads-in-parallel-region partitions. Each partition has CEILING(number-of-iterations / number-of-threads-in-parallel-region) contiguous iterations. The final partition might have fewer iterations. The partitions are further divided into blocks, each with n iterations. If n is unspecified, its default value is CEILING(number-of-iterations-in-partition / 2); that is, each partition is divided into two blocks.

Each thread is assigned a partition and completes the blocks in its local partition until none remain. If a thread completes all the blocks in its local partition while blocks remain in other partitions, it might complete blocks in another thread's partition. A thread goes to sleep after it completes its blocks and no blocks remain.

Note: This option is deprecated and might be removed in a future release. You can use the guided option for similar functionality.
schedule=guided[=n]
The iteration space is divided into blocks of successively smaller size. Each block is sized to the larger of n and CEILING(number-of-iterations-remaining / number-of-threads-in-parallel-region). The final block might contain fewer than n iterations. If n is unspecified, its default value is 1.

Each thread is initially assigned one block. As threads complete their work, they are assigned remaining blocks on a "first-come, first-served" basis. A thread goes to sleep if it completes its blocks and no blocks remain.
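
For example, with four threads, 100 iterations, and n unspecified (so n is 1), the first block has CEILING(100/4) = 25 iterations, the next CEILING(75/4) = 19, then CEILING(56/4) = 14, and so on, with the block size shrinking toward single iterations as the remaining work decreases.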

schedule=auto
The compiler and runtime might select any algorithm to assign work to threads. A different algorithm might be selected for different loops. In addition, a different algorithm might be selected if the run time is updated.

The OMP_SCHEDULE environment variable affects only the constructs with a schedule (runtime) clause specified.
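
As a minimal Fortran sketch (the subroutine and array names are illustrative), the loop below specifies schedule (runtime), so it takes its schedule from OMP_SCHEDULE; a loop with no schedule clause would instead be scheduled according to the scheduling options described above:
      subroutine sweep(a, b, n)
        integer :: n, i
        real :: a(n), b(n)
!$OMP PARALLEL DO SCHEDULE(RUNTIME)
        do i = 1, n
          a(i) = a(i) + b(i)   ! schedule chosen at run time from OMP_SCHEDULE
        end do
      end subroutine sweep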

Parallel execution options
parthds=num
Specifies the number of threads (num) to be used for parallel execution of code that you compiled with the -qsmp option. By default, this is equal to the number of online processors. Some applications cannot use more than a certain maximum number of processors; others can achieve performance gains by using more threads than there are processors.

This option allows you full control over the number of execution threads. The default value for num is 1 if you did not specify -qsmp. Otherwise, it is the number of online processors on the machine. For more information, see the NUM_PARTHDS intrinsic function.
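
For example, the following ksh command requests four threads for parallel execution, regardless of the number of online processors:
export XLSMPOPTS=parthds=4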

usrthds=num
Specifies the maximum number of threads (num) that you expect your code will explicitly create if the code does explicit thread creation. The default value for num is 0. For more information, see the NUM_PARTHDS intrinsic function in the XL Fortran Language Reference.
stack=num
Specifies the largest amount of space in bytes (num) that a thread's stack will need. The default value for num is 4194304 (4 MB).

Set stack=num so it is within the acceptable upper limit. num can be up to 256 MB for 32-bit mode, or up to the limit imposed by system resources for 64-bit mode. An application that exceeds the upper limit may cause a segmentation fault.
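
For example, the following setting gives each thread a 16 MB stack:
XLSMPOPTS=stack=16777216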

stackcheck[=num]
Enables stack overflow checking for worker threads at run time. num is a size in bytes that you specify, and it must be a positive number. When the remaining stack size is less than num, a runtime warning message is issued. If you do not specify a value for num, the default value is 4096 bytes. This option takes effect only when -qsmp=stackcheck has also been specified at compile time. See -qsmp for more information.
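
For example, assuming a source file named myprog.f (the file name is illustrative), you might enable the checking code at compile time and then request a warning when less than 8192 bytes of stack remain:
xlf_r -qsmp=stackcheck myprog.f
export XLSMPOPTS=stackcheck=8192
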
startproc=cpu_id
Enables thread binding and specifies the cpu_id to which the first thread binds. If the value provided is outside the range of available processors, the SMP run time issues a warning message and no threads are bound.
procs=cpu_id[,cpu_id,...]
Enables thread binding and specifies a list of cpu_ids to which the threads are bound.
stride=num
Specifies the increment used to determine the cpu_id to which subsequent threads bind. num must be greater than or equal to 1. If the value provided causes a thread to bind to a CPU outside the range of available processors, a warning message is issued and no threads are bound.
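
For example, the following setting binds the first thread to CPU 4 and subsequent threads to CPUs 6, 8, 10, and so on:
XLSMPOPTS="startproc=4 : stride=2"
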
bind=SDL=n1,n2,n3
Specifies different system detail levels to bind threads by using the Resource Set API. This suboption supports binding a thread to multiple logical processors.

SDL stands for System Detail Level and must be one of MCM, L2CACHE, PROC_CORE, or PROC. If the SDL value is not specified, or an incorrect SDL value is specified, the SMP runtime issues an error message.

The list of three integers n1,n2,n3 determines how to divide threads among resources (one of SDLs). n1 is the starting resource_id, n2 is the number of requested resources, and n3 is the stride, which specifies the increment used to determine the next resource_id to bind. n1,n2,n3 must all be specified; otherwise, the default binding rules apply.

When the number of resources specified in bind is greater than the number of threads, the extra resources are ignored.

When the number of threads t is greater than the number of resources x, the t threads are divided among the x resources as follows:

ceil(t/x) threads are bound to each of the first (t mod x) resources, and floor(t/x) threads are bound to each of the remaining resources.
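
For example, with t = 10 threads and x = 4 resources, ceil(10/4) = 3 threads are bound to each of the first 10 mod 4 = 2 resources, and floor(10/4) = 2 threads are bound to each of the remaining two resources, accounting for 3 + 3 + 2 + 2 = 10 threads.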

With the XLSMPOPTS environment variable set as in the following example, a program that runs with 16 threads binds its threads to PROC 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, and 30.
XLSMPOPTS="bind=PROC=0,16,2"
Notes:
  • The bind suboption takes precedence over the startproc/stride and procs suboptions. However, bindlist takes precedence over bind.
  • Resource Set can only be used by a user account with the CAP_NUMA_ATTACH and CAP_PROPAGATE capabilities. These capabilities are set on a per-user basis by using the chuser command as follows:
    chuser "capabilities=CAP_PROPAGATE,CAP_NUMA_ATTACH" username
  • If the resource_id specified in bind is outside the range of 0 to 2147483647, the default binding rules apply.
  • The SMP runtime verifies that the resource_id exists. If the resource_id does not exist, the thread is left unbound.
  • If you change the number of threads inside the program, for example, through omp_set_num_threads() or the num_threads clause, the following situation occurs:
    • If the number of threads in the application is increased, rebinding takes place based on the environment variable settings.
    • If the number of threads is reduced after binding, the original binding remains.
bindlist=SDL=i1,i2,...ix
Specifies different system detail levels to bind threads by using the Resource Set API. This suboption supports binding a thread to multiple logical processors.

SDL stands for System Detail Level and must be one of MCM, L2CACHE, PROC_CORE, or PROC. If the SDL value is not specified, or an incorrect SDL value is specified, the SMP runtime issues an error message.

The list of x integers i1,i2...ix enumerates the resources (one of SDLs) to be used during binding. When the number of integers in the list is greater than or equal to the number of threads, the position in the list determines the thread ID that will be bound to the resource.

When the number of resources specified in bindlist is greater than the number of threads, the extra resources are ignored.

When the number of threads t is greater than the number of resources x, the t threads are divided among the x resources as follows:

ceil(t/x) threads are bound to each of the first (t mod x) resources, and floor(t/x) threads are bound to each of the remaining resources.

For example:
XLSMPOPTS="bindlist=MCM=0,1,2,3"
This example code shows that threads are bound to MCM 0,1,2,3. When the program runs with four threads, thread 0 is bound to MCM 0, thread 1 is bound to MCM 1, thread 2 is bound to MCM 2, and thread 3 is bound to MCM 3. When the program runs with six threads, threads 0 and 1 are bound to MCM 0, threads 2 and 3 are bound to MCM 1, thread 4 is bound to MCM 2, and thread 5 is bound to MCM 3.
With the XLSMPOPTS environment variable set as in the following example, a program that runs with eight (or fewer) threads binds all even-numbered threads to L2CACHE 0 and all odd-numbered threads to L2CACHE 1.
XLSMPOPTS="bindlist=L2CACHE=0,1,0,1,0,1,0,1"
Notes:
  • The bindlist suboption takes precedence over the startproc/stride, procs, and bind suboptions.
  • Resource Set can only be used by a user account with the CAP_NUMA_ATTACH and CAP_PROPAGATE capabilities. These capabilities are set on a per-user basis by using the chuser command as follows:
    chuser "capabilities=CAP_PROPAGATE,CAP_NUMA_ATTACH" username
  • The SMP runtime verifies that each resource_id specified is not less than 0 and not greater than the number of available resources. If a resource_id is invalid, the corresponding thread is left unbound.
  • If you change the number of threads inside the program, for example, through omp_set_num_threads() or the num_threads clause, the following situation occurs:
    • If the number of threads in the application is increased, rebinding takes place based on the environment variable settings.
    • If the number of threads is reduced after binding, the original binding remains.
Performance tuning options
When a thread completes its work and there is no new work to do, it can go into either a "busy-wait" state or a "sleep" state. In "busy-wait", the thread keeps executing in a tight loop looking for additional new work. This state is highly responsive but harms the overall utilization of the system. When a thread sleeps, it completely suspends execution until another thread signals it that there is work to do. This state provides better utilization of the system but introduces extra overhead for the application.

The xlsmp runtime library routines use both "busy-wait" and "sleep" states in their approach to waiting for work. You can control these states with the spins, yields, and delays options.

During the busy-wait search for work, the thread repeatedly scans the work queue up to num times, where num is the value that you specified for the spins option. If a thread cannot find work during a given scan, it intentionally wastes cycles in a delay loop that executes num times, where num is the value that you specified for the delays option. This delay loop consists of a single meaningless iteration; the actual time it takes varies among processors. If the value of spins is exceeded and the thread still cannot find work, the thread yields the current time slice (the time allocated by the processor to that thread) to the other threads. The thread yields its time slice up to num times, where num is the value that you specified for the yields option. If this value is exceeded, the thread goes to sleep.

In summary, the ordered approach to looking for work consists of the following steps:
  1. Scan the work queue for up to spins number of times. If no work is found in a scan, then loop delays number of times before starting a new scan.
  2. If work has not been found, then yield the current time slice.
  3. Repeat the above steps up to yields number of times.
  4. If work has still not been found, then go to sleep.
The syntax for specifying these options is as follows:
spins[=num]
where num is the number of spins before a yield. The default value for spins is 100.
yields[=num]
where num is the number of yields before a sleep. The default value for yields is 10.
delays[=num]
where num is the number of delays while busy-waiting. The default value for delays is 500.

Zero is a special value for spins and yields, as it can be used to force complete busy-waiting. Normally, in a benchmark test on a dedicated system, you would set both options to zero. However, you can set them individually to achieve other effects.

For instance, on a dedicated 8-way SMP, setting these options to the following:
parthds=8 : schedule=dynamic=10 : spins=0 : yields=0
results in one thread per CPU, with each thread assigned chunks consisting of 10 iterations each, with busy-waiting when there is no immediate work to do.

You can also use the SPINLOOPTIME and YIELDLOOPTIME environment variables to tune performance. Refer to AIX® Performance Management for more information about these variables.

Options to enable and control dynamic profiling
You can use dynamic profiling to reevaluate the compiler's decision to parallelize loops in a program. The three options that you can use to do this are parthreshold, seqthreshold, and profilefreq.
parthreshold=num
Specifies the time, in milliseconds, below which each loop must execute serially. If you set parthreshold to 0, every loop that has been parallelized by the compiler will execute in parallel. The default setting is 0.2 milliseconds, meaning that if a loop requires fewer than 0.2 milliseconds to execute in parallel, it should be serialized.

Typically, parthreshold is set to be equal to the parallelization overhead. If the computation in a parallelized loop is very small and the time taken to execute these loops is spent primarily in the setting up of parallelization, these loops should be executed sequentially for better performance.

seqthreshold=num
Specifies the time, in milliseconds, beyond which a loop that was previously serialized by the dynamic profiler should revert to being a parallel loop. The default setting is 5 milliseconds, meaning that if a loop requires more than 5 milliseconds to execute serially, it should be parallelized.

seqthreshold acts as the reverse of parthreshold.
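
For example, the following setting serializes any parallelized loop that takes less than 0.5 milliseconds to run in parallel, and reverts a serialized loop to parallel execution once it takes more than 10 milliseconds to run serially:
XLSMPOPTS="parthreshold=0.5 : seqthreshold=10"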

profilefreq=num
Specifies the frequency with which a loop should be revisited by the dynamic profiler to determine its appropriateness for parallel or serial execution. Loops in a program can be data dependent: a loop that was chosen to execute serially during one pass of dynamic profiling might benefit from parallelization in subsequent executions because of different data input. Therefore, such loops need to be examined periodically to reevaluate the decision to serialize a parallel loop at run time.
The allowed values for this option are the numbers from 0 to 32. Setting profilefreq to one of these values has the following results:
  • If profilefreq is 0, all profiling is turned off, regardless of other settings. The overheads that occur because of profiling will not be present.
  • If profilefreq is 1, loops parallelized automatically by the compiler will be monitored every time they are executed.
  • If profilefreq is 2, loops parallelized automatically by the compiler will be monitored every other time they are executed.
  • If profilefreq is greater than 2 but less than or equal to 32, each loop will be monitored once in every num times it is executed, where num is the value of profilefreq.
  • If profilefreq is greater than 32, then 32 is assumed.
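
For example, with the following setting, each automatically parallelized loop is monitored once in every eight times it is executed:
XLSMPOPTS=profilefreq=8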

Note that dynamic profiling is not applicable to user-specified parallel loops (for example, loops for which you specified the PARALLEL DO directive).