-qprefetch

Pragma equivalent

None.

Purpose

Inserts prefetch instructions automatically where there are opportunities to improve code performance.

When -qprefetch is in effect, the compiler may insert prefetch instructions in compiled code. When -qnoprefetch is in effect, prefetch instructions are not inserted in compiled code.

Syntax

                    .-:-----------------------------------.     
                    V                                     |     
        .-prefetch----+---------------------------------+-+-.   
        |             |    .-noassistthread-----------. |   |   
        |             +-=--+-assistthread--=--+-SMT-+-+-+   |   
        |             |                       '-CMP-'   |   |   
        |             |    .-noaggressive-.             |   |   
        |             +-=--+-aggressive---+-------------+   |   
        |             '-=--dscr--=--value---------------'   |   
>>- -q--+-noprefetch----------------------------------------+--><

Defaults

-qprefetch=noassistthread:noaggressive:dscr=0

Parameters

assistthread | noassistthread
When you work with applications that generate a high cache-miss rate, you can use -qprefetch=assistthread to exploit assist threads for data prefetching. This suboption guides the compiler to exploit assist threads at optimization level -O3 -qhot or higher. If you do not specify -qprefetch=assistthread, -qprefetch=noassistthread is implied.
CMP
For systems based on the chip multi-processor architecture (CMP), you can use -qprefetch=assistthread=cmp.
SMT
For systems based on the simultaneous multi-threading architecture (SMT), you can use -qprefetch=assistthread=smt.
Note: If you do not specify either CMP or SMT, the compiler uses the default setting based on your system architecture.
aggressive | noaggressive
This suboption guides the compiler to generate aggressive data prefetching at optimization level -O3 or higher. If you do not specify aggressive, -qprefetch=noaggressive is implied.
dscr
You can specify a value for the dscr suboption to improve the runtime performance of your applications. The compiler sets the Data Stream Control Register (DSCR) to the specified dscr value to control the hardware prefetch engine. The value is valid only when -mcpu=pwr8 is in effect and the optimization level is -O2 or greater. The default value of dscr is 0.
value

The value that you specify for dscr must be 0 or greater and representable as a 64-bit unsigned integer. Otherwise, the compiler issues a warning message and sets dscr to 0. The compiler accepts both decimal and hexadecimal numbers; hexadecimal numbers must be prefixed with 0x. The valid value range depends on your system architecture. See the product information about the POWER® Architecture for details. If you specify multiple dscr values, the last one takes effect.
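The acceptance rule for value can be sketched in C as follows. This is only an illustration of the rule as described above, not the compiler's actual implementation: the function name parse_dscr_value is hypothetical, the warning text is invented, and the architecture-dependent range check is deliberately left out.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper that mirrors the documented rule for dscr=value.
   Assumes unsigned long long is 64 bits wide, so ERANGE from strtoull
   signals a value that does not fit in a 64-bit unsigned integer. */
uint64_t parse_dscr_value(const char *text)
{
    /* Hexadecimal input must carry the 0x prefix; anything else is decimal. */
    int is_hex = (text[0] == '0' && (text[1] == 'x' || text[1] == 'X'));
    unsigned char first = (unsigned char)(is_hex ? text[2] : text[0]);
    char *end = NULL;
    unsigned long long v = 0;

    /* Check the first digit ourselves: strtoull would otherwise accept
       leading whitespace and a sign such as "-1". */
    int ok = is_hex ? isxdigit(first) : isdigit(first);
    if (ok) {
        errno = 0;
        v = strtoull(text, &end, is_hex ? 16 : 10);
        /* Reject overflow and trailing garbage such as "12abc". */
        ok = (errno != ERANGE && *end == '\0');
    }
    if (!ok) {
        fprintf(stderr, "warning: invalid dscr value \"%s\"; using 0\n", text);
        return 0;
    }
    return (uint64_t)v;
}
```

Under these assumptions, parse_dscr_value("0x10") yields 16, whereas inputs such as "-1" or a number larger than 64 bits yield 0 together with a warning.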

Usage

The -qnoprefetch option does not prevent built-in functions such as __prefetch_by_stream from generating prefetch instructions.
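
As a rough illustration, the following sketch issues an explicit prefetch hint from source code; a hint of this kind is requested by the programmer, so it is expected to remain in the compiled code even under -qnoprefetch. Several details are assumptions rather than documented facts: the __ibmxl__ macro is assumed to identify the XL compiler, __dcbt is used as the prefetch built-in and is assumed to be available without an extra header, and the fallback to __builtin_prefetch, the PREFETCH_HINT macro, the function name, and the prefetch distance are purely illustrative.

```c
#include <stddef.h>

/* Assumption: __ibmxl__ identifies the XL compiler. Elsewhere, fall back to
   __builtin_prefetch (GCC/Clang) or a no-op so the sketch stays portable. */
#if defined(__ibmxl__)
#  define PREFETCH_HINT(p) __dcbt((void *)(p))
#elif defined(__GNUC__)
#  define PREFETCH_HINT(p) __builtin_prefetch((p))
#else
#  define PREFETCH_HINT(p) ((void)(p))
#endif

/* Illustrative distance, in elements, between the current access and the
   element being prefetched; a real value would need tuning. */
enum { PREFETCH_DISTANCE = 16 };

/* Sums an array while explicitly hinting at elements that will be
   needed a few iterations later. */
long sum_with_explicit_prefetch(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            PREFETCH_HINT(&data[i + PREFETCH_DISTANCE]);
        total += data[i];
    }
    return total;
}
```

The hint only affects performance; the function returns the same sum whether or not the prefetch is honored.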

When you specify -qprefetch=assistthread, the compiler uses delinquent load information to perform its analysis and generate prefetching assist threads. This information can either be provided through the built-in function __mem_delay(const void *delinquent_load_address, const unsigned int delay_cycles), or gathered from dynamic profiling using -qpdf1=level=2.

When you use profile-directed feedback (PDF) together with -qprefetch=assistthread, you must use the traditional two-step PDF invocation:
  1. Compile with -qpdf1=level=2
  2. Compile with -qpdf2 -qprefetch=assistthread
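
The two-step sequence might look like the sketch below for a small kernel with an indirect load. The invocation name xlc, the file name gather.c, the function gather_sum, and the training run between the two compilations are assumptions made for illustration; only the -qpdf1=level=2, -qpdf2, and -qprefetch=assistthread options and the -O3 -qhot requirement come from the text above.

```c
/*
 * Assumed build sequence (gather.c is assumed to also contain a main
 * function that drives gather_sum with representative data):
 *
 *   xlc -O3 -qhot -qpdf1=level=2 gather.c -o gather
 *   ./gather          (training run that collects the profile data)
 *   xlc -O3 -qhot -qpdf2 -qprefetch=assistthread gather.c -o gather
 */
#include <stddef.h>

/* Indirect loads through idx[] tend to miss the cache when the indices
   are irregular, which makes table[idx[i]] a plausible delinquent load
   for the profiling step to report. */
long gather_sum(const int *table, const size_t *idx, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += table[idx[i]];
    return total;
}
```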

Examples

The following example shows how to generate code that uses assist threads with __mem_delay:

Initial code:
int y[64], x[1089], w[1024];

void foo(void) {
  int i, j;
  for (i = 0; i < 64; i++) {
    for (j = 0; j < 1024; j++) {
      /* what to prefetch? y[i]; inserted by the user */
      __mem_delay(&y[i], 10);
      y[i] = y[i] + x[i + j] * w[j];
      x[i + j + 1] = y[i] * 2;
    }
  }
}

Assist thread generated code:
void foo@clone(unsigned thread_id, unsigned version)
{
  if (!1) goto lab_1;

  /* version control to synchronize assist and main thread */
  if (version == @2version0) goto lab_5;
  goto lab_1;

lab_5:
  @CIV1 = 0;
  do { /* id=1 guarded */ /* ~2 */
    if (!1) goto lab_3;
    @CIV0 = 0;
    do { /* id=2 guarded */ /* ~4 */
      /* region = 0 */
      /* __dcbt call generated to prefetch y[i] access */
      __dcbt(((char *)&y + (4)*(@CIV1)));
      @CIV0 = @CIV0 + 1;
    } while ((unsigned) @CIV0 < 1024u); /* ~4 */
lab_3:
    @CIV1 = @CIV1 + 1;
  } while ((unsigned) @CIV1 < 64u); /* ~2 */

lab_1:
  return;
}

