XLSMPOPTS=["]runtime_option_name=option_setting[ : runtime_option_name=option_setting]...["]
You can specify option names and settings in uppercase or lowercase. You can add blanks before and after the colons and equal signs to improve readability. However, if the XLSMPOPTS option string contains embedded blanks, you must enclose the entire option string in double quotation marks (").
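For example, in a POSIX shell the whole option string can be exported in one step (the option values here are illustrative):

```shell
# The option string contains embedded blanks, so it is enclosed
# in double quotation marks as described above.
export XLSMPOPTS="parthds = 4 : spins = 0"
echo "$XLSMPOPTS"
```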
The blocks are assigned in a round-robin fashion to threads in the parallel region until there are no remaining blocks. A thread that completes all its blocks goes to sleep. This is also known as block-cyclic scheduling, or cyclic scheduling when n has the value 1.
Each thread is initially assigned one chunk. After threads complete their assigned chunks, they are assigned remaining chunks on a "first-come, first-served" basis.
Each thread is assigned a partition. Each thread completes blocks within its local partition until no blocks remain in its partition. If blocks remain in other partitions, but a thread completes all blocks in its local partition, the thread might complete blocks in another thread's partition. A thread goes to sleep if it completes its blocks and no blocks remain.
Each thread is initially assigned one block. As threads complete their work, they are assigned remaining blocks on a "first-come, first-served" basis. A thread goes to sleep if it completes its blocks and no blocks remain.
The OMP_SCHEDULE environment variable affects only the constructs with a schedule (runtime) clause specified.
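A scheduling option such as the ones described above is selected through the same option string; as a sketch, dynamic scheduling with a chunk size of 10 iterations can be requested with:

```shell
# Sketch: schedule=sched_type=n, here dynamic scheduling with
# chunks of 10 iterations handed out per assignment.
export XLSMPOPTS="schedule=dynamic=10"
echo "$XLSMPOPTS"
```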
This option gives you full control over the number of execution threads. The default value for num is 1 if you did not specify -qsmp; otherwise, it is the number of online processors on the machine. For more information, see the NUM_PARTHDS intrinsic function.
Set stack=num so it is within the acceptable upper limit. num can be up to 256 MB for 32-bit mode, or up to the limit imposed by system resources for 64-bit mode. An application that exceeds the upper limit may cause a segmentation fault.
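As a sketch, a 16 MB stack per thread could be requested as follows (the value is illustrative and must stay within the limits described above):

```shell
# Sketch: 16 MB (16777216 bytes) of stack space per SMP thread.
# Exceeding the platform's upper limit may cause a segmentation fault.
export XLSMPOPTS="stack=16777216"
echo "$XLSMPOPTS"
```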
SDL stands for System Detail Level and must be one of MCM, L2CACHE, PROC_CORE, or PROC. If the SDL value is not specified, or an incorrect SDL value is specified, the SMP runtime issues an error message.
The list of three integers n1,n2,n3 determines how threads are divided among resources (instances of the specified SDL). n1 is the starting resource_id, n2 is the number of requested resources, and n3 is the stride, that is, the increment used to determine the next resource_id to bind. All of n1, n2, and n3 must be specified; otherwise, the default binding rules apply.
When the number of resources specified in bind is greater than the number of threads, the extra resources are ignored.
When the number of threads t is greater than the number of resources x, the t threads are divided among the x resources as follows: ceil(t/x) threads are bound to each of the first (t mod x) resources, and floor(t/x) threads are bound to each of the remaining resources.
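The formula can be checked with a quick worked example, here for t = 6 threads over x = 4 resources:

```shell
# Worked example of the thread-distribution formula: t=6, x=4.
t=6
x=4
hi=$(( (t + x - 1) / x ))  # ceil(t/x): threads on each of the first resources
rem=$(( t % x ))           # t mod x: how many resources get the larger share
lo=$(( t / x ))            # floor(t/x): threads on each remaining resource
echo "$hi $rem $lo"
```

So the first 2 resources receive 2 threads each and the remaining 2 resources receive 1 thread each, which matches the six-thread bindlist example later in this section.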
XLSMPOPTS="bind=PROC=0,16,2"
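Given the n1,n2,n3 interpretation above, this setting starts at resource_id 0 and selects 16 PROC resources with a stride of 2; the selected IDs can be listed with:

```shell
# PROC resource IDs selected by bind=PROC=0,16,2:
# start at 0, stride 2, 16 resources in total (0, 2, 4, ..., 30).
seq 0 2 30
```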
chuser "capabilities=CAP_PROPAGATE,CAP_NUMA_ATTACH" username
SDL stands for System Detail Level and must be one of MCM, L2CACHE, PROC_CORE, or PROC. If the SDL value is not specified, or an incorrect SDL value is specified, the SMP runtime issues an error message.
The list of x integers i1,i2,...,ix enumerates the resources (instances of the specified SDL) to be used during binding. When the number of integers in the list is greater than or equal to the number of threads, the position in the list determines the thread ID that is bound to the resource.
When the number of resources specified in bindlist is greater than the number of threads, the extra resources are ignored.
When the number of threads t is greater than the number of resources x, the t threads are divided among the x resources as follows: ceil(t/x) threads are bound to each of the first (t mod x) resources, and floor(t/x) threads are bound to each of the remaining resources.
XLSMPOPTS="bindlist=MCM=0,1,2,3"

This example shows that threads are bound to MCMs 0, 1, 2, and 3. When the program runs with four threads, thread 0 is bound to MCM 0, thread 1 is bound to MCM 1, thread 2 is bound to MCM 2, and thread 3 is bound to MCM 3. When the program runs with six threads, threads 0 and 1 are bound to MCM 0, threads 2 and 3 are bound to MCM 1, thread 4 is bound to MCM 2, and thread 5 is bound to MCM 3.

XLSMPOPTS="bindlist=L2CACHE=0,1,0,1,0,1,0,1"
chuser "capabilities=CAP_PROPAGATE,CAP_NUMA_ATTACH" username
The xlsmp runtime library routines use both "busy-wait" and "sleep" states in their approach to waiting for work. You can control these states with the spins, yields, and delays options.
During the busy-wait search for work, the thread repeatedly scans the work queue up to num times, where num is the value that you specified for the spins option. If the thread does not find work during a given scan, it intentionally wastes cycles in a delay loop that executes num times, where num is the value that you specified for the delays option. This delay loop consists of a single meaningless iteration; the actual time it takes varies among processors. If the spins limit is exceeded and the thread still cannot find work, the thread yields its current time slice (the time allocated by the processor to that thread) to the other threads. The thread yields its time slice up to num times, where num is the value that you specified for the yields option. If this limit is also exceeded, the thread goes to sleep.
Zero is a special value for spins and yields, as it can be used to force complete busy-waiting. Normally, in a benchmark test on a dedicated system, you would set both options to zero. However, you can set them individually to achieve other effects.
parthds=8 : schedule=dynamic=10 : spins=0 : yields=0

This results in one thread per CPU, with each thread assigned chunks of 10 iterations each, and busy-waiting when there is no immediate work to do.

You can also use the environment variables SPINLOOPTIME and YIELDLOOPTIME to tune performance. Refer to AIX® Performance Management for more information on these variables.
Typically, parthreshold is set equal to the parallelization overhead. If the computation in a parallelized loop is very small and the time taken to execute the loop is spent primarily in setting up parallelization, the loop should be executed sequentially for better performance.
seqthreshold acts as the reverse of parthreshold: it determines when a loop that was previously serialized by dynamic profiling should revert to running in parallel.
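As a sketch, both thresholds can be set in the same option string; the values below are illustrative only, and the applicable units and defaults depend on your platform:

```shell
# Illustrative threshold values only; consult your compiler's
# documentation for the units and defaults on your platform.
export XLSMPOPTS="parthreshold=0.5 : seqthreshold=5"
echo "$XLSMPOPTS"
```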
Note that dynamic profiling is not applicable to user-specified parallel loops (for example, loops for which you specified the PARALLEL DO directive).