The benefits of unrolling loops are:
Unrolling amortizes the branch overhead, since it eliminates branches and some of the code to manage induction variables.
Unrolling allows you to aggressively schedule (or pipeline) the loop to hide latencies. This is useful if you have enough free registers to keep variables live as you stretch out the dependence chain to expose the critical path.
The Intel(R) Pentium(R) 4 processor can correctly predict the exit branch for an inner loop that has 16 or fewer iterations, if that number of iterations is predictable and there are no conditional branches in the loop. Therefore, if the loop body size is not excessive, and the probable number of iterations is known, unroll inner loops until they have a maximum of 16 iterations. With Pentium III or Pentium II processors, do not unroll loops more than 4 iterations.
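The first two benefits can be illustrated with a minimal C sketch. The function names and the unroll factor of 4 are assumptions chosen for illustration, not values taken from this guide, and an optimizing compiler may perform a similar transformation on its own.

```c
#include <stddef.h>

/* Rolled loop: one compare-and-branch and one index update per element. */
long sum_rolled(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Unrolled by 4 (illustrative factor): one compare-and-branch and one
   index update per four elements. The four independent accumulators
   break the single add dependence chain into four shorter ones, at the
   cost of three extra live registers. */
long sum_unrolled4(const int *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    /* Remainder loop: handles the last n % 4 elements. */
    for (; i < n; i++)
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

Both functions return the same sum. Whether the unrolled form is actually faster depends on register pressure and on how the loop body is scheduled.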
The potential costs of unrolling loops are:
Excessive unrolling, or unrolling of very large loops, can lead to increased code size. This can be harmful if the unrolled loop no longer fits in the trace cache (TC).
Unrolling loops whose bodies contain branches increases the demands on the BTB capacity. If the number of iterations of the unrolled loop is 16 or fewer, the branch predictor should be able to correctly predict branches in the loop body that alternate direction.
Assembly/Compiler Coding Rule 12. (H impact, M generality)
Unroll small loops until the overhead of the branch and the induction variable accounts, generally, for less than about 10% of the execution time of the loop.
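As a rough worked example of the 10% threshold, the sketch below assumes a deliberately simplified cost model in which the branch and induction-variable overhead and the loop body contribute additive, fixed cycle counts. The function name and the model are hypothetical. On an out-of-order processor these costs overlap, so measured timings should take precedence over this estimate.

```c
#include <assert.h>

/* Hypothetical cost model: each iteration of the unrolled loop costs
   overhead_cycles (branch plus induction-variable update) plus
   body_cycles for each of the u copies of the original body.
   Returns the smallest u for which the overhead is below 10%. */
int min_unroll_for_overhead(int body_cycles, int overhead_cycles)
{
    assert(body_cycles > 0 && overhead_cycles >= 0);
    int u = 1;
    /* overhead / (overhead + u * body) < 1/10
       is equivalent to 9 * overhead < u * body. */
    while (9 * overhead_cycles >= u * body_cycles)
        u++;
    return u;
}
```

Under this model, a 3-cycle body with 2 cycles of overhead needs an unroll factor of 7: the overhead is then 2 of 23 cycles (about 8.7%), whereas a factor of 6 leaves it at exactly 10%.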
Assembly/Compiler Coding Rule 13. (H impact, M generality)
Avoid unrolling loops excessively, as this may thrash the TC.
Assembly/Compiler Coding Rule 14. (M impact, M generality)
Unroll loops that are frequently executed and that have a predictable number of iterations to reduce the number of iterations to 16 or fewer, unless this increases code size so that the working set no longer fits in the trace cache. If the loop body contains more than one conditional branch, then unroll so that the number of iterations is 16/(# conditional branches).
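The arithmetic in Rule 14 can be sketched in C as follows. The helper names are hypothetical, and rounding 16/(# conditional branches) down to an integer is an assumption, since the rule does not say how to round.

```c
#include <assert.h>

/* Iteration-count target from Rule 14: 16, or 16 divided by the number
   of conditional branches when the body has more than one.
   Rounding down and clamping to at least 1 are assumptions. */
int target_iterations(int cond_branches)
{
    assert(cond_branches >= 0);
    int t = cond_branches > 1 ? 16 / cond_branches : 16;
    return t < 1 ? 1 : t;
}

/* Smallest unroll factor that brings trip_count iterations down to
   the target or fewer (ceiling division). */
int unroll_factor(int trip_count, int cond_branches)
{
    assert(trip_count > 0);
    int target = target_iterations(cond_branches);
    return (trip_count + target - 1) / target;
}
```

For a 100-iteration loop with no conditional branches in the body, this gives a factor of 7, which leaves 14 full iterations plus a 2-iteration remainder. With two conditional branches the target drops to 8 iterations and the factor rises to 13. The code-size condition in the rule still applies and is not modeled here.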
The following loop unrolling example shows how unrolling enables other optimizations:
Before unrolling:
do i = 1, 100
  if (mod(i, 2) == 0) then
    a(i) = x
  else
    a(i) = y
  end if
end do
After unrolling:
do i = 1, 100, 2
  a(i) = y
  a(i+1) = x
end do
In this example, a loop that executes 100 times assigns x to every even-numbered element and y to every odd-numbered element. By unrolling the loop by a factor of two, you can make both assignments in each iteration. This removes the conditional branch from the loop body and halves the number of loop iterations from 100 to 50.
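The same transformation can be written in C. This is a minimal sketch: the function names are hypothetical, and the arrays hold N + 1 elements so that indices 1 through 100 match the example above (element 0 is unused).

```c
#define N 100  /* trip count from the example; must be even for the unrolled form */

/* Before unrolling: the body tests the parity of i on every iteration. */
void fill_rolled(int a[N + 1], int x, int y)
{
    for (int i = 1; i <= N; i++) {
        if (i % 2 == 0)
            a[i] = x;
        else
            a[i] = y;
    }
}

/* After unrolling by 2: i is always odd, so a[i] gets y and a[i + 1]
   gets x without any parity test. */
void fill_unrolled(int a[N + 1], int x, int y)
{
    for (int i = 1; i <= N; i += 2) {
        a[i] = y;
        a[i + 1] = x;
    }
}
```

The unrolled version relies on the trip count being even. An odd trip count would need one extra assignment after the loop.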