L2 Lines Allocated (Excluding Hardware-Prefetched) / Instructions Retired
This ratio estimates the average number of L2 cache lines allocated as a result of L1 misses, per instruction retired. The calculation excludes misses that result from prefetch operations. If data accesses are organized properly, the prefetchers bring in most of the data needed to satisfy L1 misses ahead of time, and the latency impact on performance is low. The demand fetches counted by this ratio, in contrast, usually stall execution until the data arrives from memory, which has a much higher impact on program performance.
If you want to parallelize portions of the code that have a moderate or high value of this ratio, examine the data sets accessed by that code. If the parallelized code portions use the same data set, the number of L2 misses decreases, the ratio value drops, and performance improves. On the other hand, if the sum of the thread data sets does not fit in the L2, the number of L2 misses may increase, since the threads now compete for L2 cache space. The prefetchers may also be less effective in this case, leading to more L2 misses due to L1 misses. This increases the ratio value and reduces performance.
Limits: good < 0.001, bad > 0.003
Make data accesses sequential to exploit the prefetchers
Improve data locality
Use smaller data buffers that fit in the L2 cache
If the examined code contains long repeat (REP-prefixed) string instructions, this ratio may show a high value that nevertheless does not represent a performance issue. To detect whether there are long repeat instructions, use the uOps per Instructions Retired ratio.