The slow length decoder is activated when one of the following scenarios occurs:
Processing an instruction with a Length Changing Prefix (LCP)
Processing an instruction with a false LCP
Processing an instruction with a modr/m byte
The following sections explain these scenarios and provide examples of alternative assembly code that does not require slow decoder activation.
Instructions with LCP change their length according to the two prefixes:
operand-size prefix (0x66)
address-size prefix (0x67)
For example, the instruction encoded as 35 FF FF 00 00 is:
xor eax,0xffff
The same instruction encoded as 66 35 FF FF has a different size, because the immediate shrinks from four bytes to two (the instruction size calculation does not include the 66 prefix):
xor ax,0xffff
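The length change in the two encodings above can be sketched in C. This is a minimal illustration that recognizes only those two encodings of opcode 0x35 in 32-bit code. The function name `xor_acc_imm_length` is hypothetical, and the sketch is not a general x86 length decoder.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: total length in bytes of "xor eax/ax, imm"
 * (opcode 0x35) in 32-bit code. Only the two encodings shown above
 * are handled; anything else, or a truncated buffer, returns 0. */
size_t xor_acc_imm_length(const uint8_t *code, size_t n)
{
    if (n >= 1 && code[0] == 0x35)
        return n >= 5 ? 5 : 0;   /* opcode + 32-bit immediate */

    if (n >= 2 && code[0] == 0x66 && code[1] == 0x35)
        return n >= 4 ? 4 : 0;   /* 0x66 prefix + opcode + 16-bit immediate */

    return 0;
}
```

The point of the sketch is that the decoder cannot know the immediate size from the opcode byte alone: the same opcode 0x35 is followed by four immediate bytes without the prefix and by two with it.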
The instruction length decoder of the Intel(R) Core(TM) Solo and Intel(R) Core(TM) Duo processors cannot decode the length of an LCP instruction in one cycle. It therefore initiates slow decoding, which takes five extra cycles to complete.
Avoid using instructions with immediate values that require a length-changing prefix. The most common case is a 16-bit immediate in 32-bit code.
You can use the VTune(TM) Performance Analyzer to count the number of slow decoder activations by using the LCP stall event.
The following C code stores a constant 0x5000 to a short variable:
short a;
void foo()
{
a = 0x5000;
}
The following table provides an example of assembly with an LCP stall and alternative code without the stall.
| Assembly Alternative 1: with LCP Instruction, 2 Instructions | Assembly Alternative 2: No LCP, 3 Instructions |
|---|---|
| mov word ptr a,0x5000 | mov eax,0x5000 |
| ret | mov word ptr a,ax |
| | ret |
| | Performance relative to Alternative 1: 400% |
The slow length decoder is activated when processing instructions with a false LCP. This happens in the following cases:
When a 0xF7 instruction that follows an operand-size prefix (0x66) is processed. The 0xF7 instructions are: neg, not, mul, imul, div, and idiv of 16-bit values.
When the first opcode byte of an instruction with an operand-size prefix and a modr/m byte is at offset 14. The offset is relative to the start of the fetch line, which is aligned on a 16-byte boundary. With the first opcode byte at offset 14, the instruction itself starts at offset 13, assuming that the operand-size prefix is the only prefix.
You can use the VTune(TM) Performance Analyzer to count the LCP stall events.
Avoid using 0xF7 instructions with the 0x66 prefix.
If an instruction with an operand-size prefix and modr/m has its first opcode byte at offset 14, add a NOP before the instruction or reschedule the instructions to change the alignment of the 0x66-prefixed instruction.
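The alignment condition can be checked with simple address arithmetic. The sketch below assumes the reading given above: the fetch line is 16 bytes, the operand-size prefix is the only prefix, and the first opcode byte is at offset 14 (so the prefix is at offset 13). The names `fetch_line_offset` and `false_lcp_by_alignment` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define FETCH_LINE_BYTES 16u

/* Offset of an address within its 16-byte-aligned fetch line (0..15). */
unsigned fetch_line_offset(uintptr_t addr)
{
    return (unsigned)(addr & (FETCH_LINE_BYTES - 1u));
}

/* insn_start is the address of the 0x66 prefix byte, assumed to be the
 * only prefix. Returns true when the first opcode byte, one byte later,
 * lands at offset 14 of the fetch line. The caller is assumed to know
 * that the instruction also has a modr/m byte. */
bool false_lcp_by_alignment(uintptr_t insn_start)
{
    return fetch_line_offset(insn_start + 1u) == 14u;
}
```

For instance, an instruction whose prefix sits at address 0x100D meets the condition, while inserting a one-byte NOP before it moves the opcode to offset 15 and the check no longer matches this particular case.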
The following C code negates a 16-bit value.
short a;
void foo()
{
a = -a;
}
The following table provides an example of assembly with a false LCP stall and alternative code without the stall:
| Assembly Alternative 1: False LCP, 2 Instructions | Assembly Alternative 2: No LCP, 4 Instructions |
|---|---|
| neg word ptr a | movsx eax,word ptr a |
| ret | neg eax |
| | mov word ptr a,ax |
| | ret |
| | Performance relative to Alternative 1: 181% |
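Alternative 2 gives the same result as the 16-bit neg because the low 16 bits of a 32-bit negation equal the 16-bit negation. The C sketch below mirrors the three instructions one by one to make that visible. The function name `neg16_via_32` is hypothetical, and a compiler is free to emit either form for this code, so it illustrates the equivalence rather than a way to force a particular encoding.

```c
#include <stdint.h>

/* Mirrors Assembly Alternative 2 step by step. */
int16_t neg16_via_32(int16_t value)
{
    int32_t wide = value;   /* movsx eax, word ptr a : sign-extend to 32 bits */
    wide = -wide;           /* neg eax               : 32-bit negate, no 0x66 prefix */

    /* mov word ptr a, ax : keep only the low 16 bits.
     * For value == INT16_MIN the final conversion to int16_t is
     * implementation-defined in C; common compilers wrap to INT16_MIN,
     * which matches what the 16-bit neg instruction produces. */
    return (int16_t)(uint16_t)(uint32_t)wide;
}
```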
Avoid using 16-bit variants of 0xF7 instructions.
The slow length decoder is activated twice when:
The instruction has a modr/m byte in its encoding and invokes the slow decoder either from true LCP or false LCP, as described above.
The instruction starts at offset 14 or 15 of a fetch line (the location is specified for the first opcode byte; the operand-size prefix appears one byte before).
Double activation of the slow decoder creates an 11-cycle decode bubble instead of the five cycles caused by a single slow decoding operation.
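The penalties described in this section can be summarized in a small C sketch. It assumes that both conditions listed above must hold for double activation and uses the cycle counts stated in the text (5 and 11). The function and constant names are hypothetical, and the result is a rough estimate, not a model of the actual decoder.

```c
#include <stdbool.h>

enum {
    SLOW_DECODE_SINGLE_CYCLES = 5,   /* one slow-decoder activation */
    SLOW_DECODE_DOUBLE_CYCLES = 11   /* double activation */
};

/* triggers_slow_decoder: the instruction has a true or false LCP.
 * has_modrm:             the encoding contains a modr/m byte.
 * opcode_offset:         offset (0..15) of the first opcode byte
 *                        within the 16-byte fetch line. */
unsigned decode_bubble_cycles(bool triggers_slow_decoder,
                              bool has_modrm,
                              unsigned opcode_offset)
{
    if (!triggers_slow_decoder)
        return 0;

    if (has_modrm && (opcode_offset == 14u || opcode_offset == 15u))
        return SLOW_DECODE_DOUBLE_CYCLES;

    return SLOW_DECODE_SINGLE_CYCLES;
}
```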