Memory Stall

Store_Forwarding_Blocked is a warning.

The instruction for which Store_Forwarding_Blocked is issued reads data after a previous instruction wrote data to an overlapping memory space. The stall occurs if either of the following is true:

Size and Alignment Restrictions in Store Forwarding

For more information, see the Intel(R) Pentium(R) 4 processor manuals on the web.

Note

For Intel(R) Pentium(R) 4 processors with Streaming SIMD Extensions 3 (SSE3), only cases b) and c) above are relevant. That is, the penalty occurs only when a write of one or more small data elements is followed by a read of a larger data element from the same address.
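As an illustration of the pattern the note describes, the following C sketch contrasts two narrow writes followed by a wide read of the same address with combining the values in a register instead. The function names are hypothetical, the byte order assumes a little-endian IA-32 target, and an optimizing compiler may already remove the memory round trip in the first function on its own.

```c
#include <stdint.h>
#include <string.h>

/* Stall-prone shape: two 2-byte stores, then a 4-byte load that
   spans both of them. */
uint32_t pack_via_memory(uint16_t first, uint16_t second)
{
    uint16_t halves[2];
    uint32_t wide;

    halves[0] = first;                    /* 2-byte store */
    halves[1] = second;                   /* 2-byte store */
    memcpy(&wide, halves, sizeof wide);   /* 4-byte load over both stores */
    return wide;
}

/* Alternative sketch: build the 32-bit value with shifts, so no wide
   load has to wait on the narrow stores. Assumes little-endian layout,
   where the first element occupies the low 16 bits. */
uint32_t pack_in_register(uint16_t first, uint16_t second)
{
    return (uint32_t)first | ((uint32_t)second << 16);
}
```

On a little-endian target both functions return the same value; only the second avoids reading back data that was just written in smaller pieces.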

Advice:

Read data that will be manipulated by MMX(TM) technology instructions using one of the following:

Write 64-bit quadwords using the MMX technology instruction that writes a 64-bit operand (for example, MOVQ m64, MM0).

Example, preventing memory size stalls:

This example prevents stalls by putting the reads in a separate loop, far away from the writes.

 

Original:

    unsigned short array[1000];
    for (i = 0; i < 1000; i += 2) {
        array[i] = i;                              /* 2-byte write */
        array[i+1] = i + 1;                        /* 2-byte write */
        /* 4-byte read (unsigned long is 4 bytes on IA-32) */
        rst1[i/2] = *(unsigned long *) &array[i];
    }

In each pass through the loop, this code writes two bytes to the address of array[i], writes two more bytes to the next element, array[i+1], and then reads four bytes from the address of array[i]. Each read causes a stall.

Optimized:

    unsigned short array[1000];
    /* First loop: 2-byte writes only */
    for (i = 0; i < 1000; i += 2) {
        array[i] = i;
        array[i+1] = i + 1;
    }
    /* Second loop: 4-byte reads only */
    for (i = 0; i < 500; i++) {
        rst1[i] = *(unsigned long *) &array[i * 2];
    }

This code prevents the stalls by writing all the data to array in the first loop. Then, it executes all the reads in a separate loop.
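The split-loop approach of the optimized size example could also be packaged as two helper functions. This is a minimal sketch, not the original code: the function names and parameters are hypothetical, fixed-width types replace unsigned short and unsigned long so that the read is four bytes on any target, and memcpy stands in for the pointer cast to keep the wide read well defined in standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Pass 1: narrow (2-byte) writes only. */
void fill_halfwords(uint16_t *array, size_t count)
{
    for (size_t i = 0; i < count; i++)
        array[i] = (uint16_t)i;
}

/* Pass 2: wide (4-byte) reads only, run after all writes are done.
   rst1 must hold count / 2 elements. */
void read_pairs(const uint16_t *array, size_t count, uint32_t *rst1)
{
    assert(count % 2 == 0);   /* each read covers two elements */
    for (size_t i = 0; i < count / 2; i++)
        memcpy(&rst1[i], &array[2 * i], sizeof rst1[i]);
}
```

Calling fill_halfwords(array, 1000) and then read_pairs(array, 1000, rst1) reproduces the two loops of the optimized example, with every read separated from the writes that produced its data.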

Example, preventing memory alignment stalls:

This example prevents stalls by putting reads far away from the writes that come before them.

 

Original:

    for (i = 0; i < 1000; i += 2) {
        array[i] = i;                              /* aligned 2-byte write */
        /* 2-byte read that starts one byte past array[i] */
        rst2[i] = *(unsigned short *) ((unsigned char *)&array[i] + 1);
    }

In each pass through the loop, this code writes two bytes to the address of array[i], and then reads two bytes starting one byte past that address, that is, from (unsigned char *)&array[i] + 1. Each read causes a stall.

Optimized:

    /* First loop: aligned 2-byte writes only */
    for (i = 0; i < 1000; i += 2) {
        array[i] = i;
    }
    /* Second loop: offset 2-byte reads only */
    for (i = 0; i < 1000; i += 2) {
        rst2[i] = *(unsigned short *) ((unsigned char *)&array[i] + 1);
    }

This code prevents the stalls by first writing all the data in one loop to the address of array[i]. Then, it executes all the reads in a separate loop, from the one-byte offset from that address.
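The same separation can be sketched as two helper functions for the alignment case. The names and parameters below are hypothetical, and the sketch departs from the example in two labeled ways: it writes every element, not only the even ones, so that the offset reads never touch uninitialized bytes, and it uses memcpy for the misaligned read so the access is well defined in standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Pass 1: aligned 2-byte writes only. */
void fill_array(uint16_t *array, size_t count)
{
    for (size_t i = 0; i < count; i++)
        array[i] = (uint16_t)i;
}

/* Pass 2: misaligned 2-byte reads, each starting one byte past an
   even-indexed element. Only even-indexed entries of rst2 are written. */
void read_offset_by_one(const uint16_t *array, size_t count, uint16_t *rst2)
{
    assert(count % 2 == 0);   /* the read at i also touches array[i + 1] */
    for (size_t i = 0; i < count; i += 2)
        memcpy(&rst2[i], (const unsigned char *)&array[i] + 1, sizeof rst2[i]);
}
```

Running fill_array before read_offset_by_one keeps each offset read far from the write to the bytes it overlaps, which is the point of the optimized example.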
