Misaligned Memory References

On Intel(R) Pentium(R) M processors, a misaligned access that crosses a cache line boundary incurs a penalty. A memory access that crosses a 64-byte line boundary is called a Data Cache Unit (DCU) split. Unaligned accesses may cause DCU splits and stall Pentium M processors.
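To make the definition concrete, the following is a minimal sketch of a check for whether a given access would straddle a 64-byte line. The function name crosses_cache_line and the CACHE_LINE_SIZE macro are hypothetical names introduced here for illustration; only the 64-byte line size comes from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u  /* line size stated in the text; name is illustrative */

/* Returns 1 if an access of `size` bytes starting at `addr` touches two
 * different 64-byte lines (a DCU split as defined above), 0 otherwise.
 * Hypothetical helper, not part of any tool or library. */
static int crosses_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0 && size <= CACHE_LINE_SIZE);
    uintptr_t offset = addr & (CACHE_LINE_SIZE - 1u);  /* position inside the line */
    return offset + size > CACHE_LINE_SIZE;
}
```

For example, an 8-byte access at address 62 covers bytes 62 through 69 and therefore crosses the boundary between 63 and 64, while an 8-byte access at address 56 stays inside one line.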

Note

Some algorithms (such as the Motion Estimation and Motion Compensation algorithms, found in video codecs) by nature tend to cause frequent misaligned references. In some cases alternative coding strategies that reduce misaligned references may actually require more total clock cycles.

This insight is relevant when:
The quotient of the counters (Misaligned Data Memory Reference/Data Memory References (all)) is poor. A value of 0.00 should be considered good, and a value of 0.002 should be considered poor.
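The quotient can be computed directly from the two counter values. The sketch below assumes the counters have already been read into integers; the names misaligned_ratio, ratio_is_poor, and POOR_RATIO are hypothetical, and treating every value at or above 0.002 as poor is one interpretation of the threshold quoted above.

```c
#include <assert.h>
#include <stdint.h>

#define POOR_RATIO 0.002  /* "poor" value quoted in the text; macro name is illustrative */

/* Misaligned Data Memory Reference / Data Memory References (all).
 * Returns 0.0 when no references were counted. */
static double misaligned_ratio(uint64_t misaligned_refs, uint64_t all_refs)
{
    assert(misaligned_refs <= all_refs);
    if (all_refs == 0)
        return 0.0;
    return (double)misaligned_refs / (double)all_refs;
}

/* Assumption: any ratio at or above the quoted 0.002 counts as poor. */
static int ratio_is_poor(uint64_t misaligned_refs, uint64_t all_refs)
{
    return misaligned_ratio(misaligned_refs, all_refs) >= POOR_RATIO;
}
```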

Advice:

Align data in memory if possible.
Make sure data is properly aligned. If the algorithm has inherent misaligned data references and accesses the data many times, copy misaligned data to an aligned location and operate on the aligned data. Also, avoid storing 80-bit (floating point) values to unaligned locations. (Note that with some compilers, storing 64-bit data on the stack may cause misaligned references.)
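The copy-then-operate advice above could look roughly like the following sketch. It assumes a C11 library that provides aligned_alloc; the helper name copy_to_aligned and the choice of 64-byte alignment (one full line) are assumptions made here for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Copies `n` bytes from a possibly misaligned source into a newly allocated
 * 64-byte-aligned buffer. The caller releases the buffer with free().
 * Returns NULL if the allocation fails. Hypothetical helper. */
static void *copy_to_aligned(const void *src, size_t n)
{
    assert(src != NULL && n > 0);
    /* C11 aligned_alloc expects the size to be a multiple of the alignment,
     * so round the request up to the next multiple of 64. */
    size_t padded = (n + 63u) & ~(size_t)63u;
    void *dst = aligned_alloc(64, padded);
    if (dst != NULL)
        memcpy(dst, src, n);  /* one pass over the misaligned data */
    return dst;
}
```

The copy itself still touches the misaligned source once, so this is only likely to pay off when the data is then accessed many times, as the advice above notes.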
If data cannot be aligned in memory, consider reading data in multiple steps to force alignment.
If you cannot align data in memory, consider reading the data in multiple steps, using SHIFT and OR operations to avoid DCU line split penalties. For example, assume we need to read addresses [62-69] (crossing a DCU line boundary between addresses 63 and 64). Let's call the values {v7 v6 v5 v4 v3 v2 v1 v0}. Instead of "naively" reading these 8 bytes and paying a DCU line split penalty, we could read two aligned 8-byte blocks: r0=[56-63] and r1=[64-71]. Now the values are: r1 = {s1 s0 v7 v6 v5 v4 v3 v2}; r0 = {v1 v0 t5 t4 t3 t2 t1 t0}. Now we can shift r0 right by 8*6 bits, so r0={0 0 0 0 0 0 v1 v0}, and shift r1 left by 8*2 bits, so r1={v7 v6 v5 v4 v3 v2 0 0}. Now we can OR r0 and r1 to get {v7 v6 v5 v4 v3 v2 v1 v0}. We read these 8 bytes without paying a misalignment penalty (but note that avoiding the misalignment cost us 2 MOVs, 2 SHIFTs, and 1 OR).
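The shift-and-OR read described above can be sketched in C as follows. The function name load_u64_split is hypothetical. The sketch assumes a little-endian machine (as on Pentium M) and that the two aligned 8-byte blocks surrounding the requested bytes lie entirely inside the same object, since it reads up to 7 bytes before and after the requested range.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reads the 8 bytes starting at `p` using only 8-byte-aligned loads,
 * then combines them with SHIFT and OR. Hypothetical helper.
 * Assumptions: little-endian byte order; the aligned blocks containing
 * p[0] and p[7] are both readable parts of the same object. */
static uint64_t load_u64_split(const unsigned char *p)
{
    assert(p != NULL);
    uintptr_t misalign = (uintptr_t)p & 7u;       /* bytes past the aligned block start */
    const unsigned char *base = p - misalign;     /* 8-byte-aligned block containing p[0] */
    uint64_t r0, r1;

    memcpy(&r0, base, 8);                         /* aligned load of the low block */
    if (misalign == 0)
        return r0;                                /* already aligned; also avoids a 64-bit shift */
    memcpy(&r1, base + 8, 8);                     /* aligned load of the high block */

    unsigned shift = (unsigned)misalign * 8u;
    return (r0 >> shift) | (r1 << (64u - shift)); /* e.g. >> 8*6 and << 8*2 for address 62 */
}
```

For the example in the text (address 62 relative to a 64-byte-aligned buffer), misalign is 6, so r0 is shifted right by 48 bits and r1 left by 16 bits before the OR.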
Make sure code is properly aligned.

If you are writing in assembly:

Loop entry labels should be 16-byte-aligned when less than eight bytes away from a 16-byte boundary.
Labels that follow a conditional branch need not be aligned.
Labels that follow an unconditional branch or function call should be 16-byte-aligned when less than eight bytes away from a 16-byte boundary.

If you are writing in a higher-level language:

Use a compiler that will ensure the above rules are met for the generated code.
On Pentium M processors, avoid loops that execute in fewer than two cycles. The target of a tight loop should be aligned on a 16-byte boundary to maximize the use of the instructions that are fetched. On Pentium M processors, a misaligned loop target can limit the number of instructions available for execution, which in turn limits the number of instructions retired every cycle. It is recommended that critical loop entries be located on a cache line boundary. Additionally, loops that execute in fewer than two cycles should be unrolled.
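As an illustration of the unrolling advice, the sketch below sums an array with the loop body unrolled four times, so each iteration does more work per branch. The function name sum_unrolled4 and the unroll factor of four are assumptions chosen for illustration; whether a particular loop actually runs in fewer than two cycles, and whether the compiler already unrolls it, depends on the generated code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sums `n` 32-bit values with the loop body unrolled by four.
 * Uses four independent accumulators; unsigned arithmetic wraps on overflow.
 * Hypothetical example, not taken from the original text. */
static uint32_t sum_unrolled4(const uint32_t *a, size_t n)
{
    assert(a != NULL || n == 0);
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    /* Main unrolled loop: four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    /* Remainder loop: the last 0 to 3 elements. */
    for (; i < n; i++)
        s0 += a[i];

    return s0 + s1 + s2 + s3;
}
```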