Branch optimizations have some of the greatest impact on performance.
Understanding the flow of branches and improving the predictability of branches can increase the speed of your code significantly.
The basic kinds of optimizations that help branch prediction are:
Keep code and data on separate pages (very important; see the Memory Accesses topic for details).
Eliminate branches.
Arrange code to be consistent with the static branch prediction algorithm.
If the code cannot be arranged this way, use branch direction hints where appropriate.
Use the pause instruction in spin-wait loops.
Inline functions and pair up calls and returns.
Unroll loops as necessary so that repeatedly executed loops have sixteen or fewer iterations, unless unrolling causes an excessive increase in code size.
Separate branches so that they occur no more frequently than every three µops where possible.
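As a rough illustration of the "eliminate branches" item above, the following C sketch counts array elements above a threshold in two ways. The function names and the counting task are hypothetical, chosen only for this example. Whether the second version actually compiles to branch-free machine code depends on the compiler and its options, so treat it as a starting point to verify in the generated assembly rather than a guarantee.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy version: one data-dependent conditional per element.
   With unpredictable data, this branch is hard to predict. */
size_t count_above_branchy(const int32_t *v, size_t n, int32_t threshold)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (v[i] > threshold)
            ++count;
    }
    return count;
}

/* Branch-free version at the source level: the comparison evaluates
   to 0 or 1, and that value is added directly to the counter. */
size_t count_above_branchless(const int32_t *v, size_t n, int32_t threshold)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i)
        count += (size_t)(v[i] > threshold);
    return count;
}
```

Both functions return the same result for any input. The only remaining branch in the second version is the loop-closing branch, which is a backward branch that is usually taken.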
Improve branch predictability and optimize instruction prefetching by arranging code to be consistent with the static branch prediction assumptions: backward taken and forward not taken.
Avoid mixing near and far calls and returns.
Do not implement a call by pushing the return address and jumping to the target. The hardware pairs call and return instructions to predict return addresses, and a hand-built call defeats that pairing.
Use the "pause" instruction in spin-wait loops.
Inline functions according to coding recommendations.
Avoid indirect calls.
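The pause recommendation above might look like the following in C. This is a minimal sketch under several assumptions: it uses the `_mm_pause()` intrinsic from `<immintrin.h>` to emit the pause instruction on x86 targets, it falls back to a no-op elsewhere, and the names `cpu_relax` and `spin_wait` as well as the `max_spins` bound are hypothetical additions for illustration, not part of any recommendation in this topic.

```c
#include <stdatomic.h>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define cpu_relax() _mm_pause()   /* emits the pause instruction */
#else
#define cpu_relax() ((void)0)     /* assumption: no-op on non-x86 targets */
#endif

/* Spins until *flag becomes nonzero or max_spins iterations have run.
   Returns the number of pause iterations that were executed. */
long spin_wait(atomic_int *flag, long max_spins)
{
    long spins = 0;
    while (atomic_load_explicit(flag, memory_order_acquire) == 0
           && spins < max_spins) {
        cpu_relax();   /* hint to the processor that this is a spin-wait loop */
        ++spins;
    }
    return spins;
}
```

In real code another thread would set the flag with a release store. The bound is there only so the sketch cannot spin forever when nothing sets the flag.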
For more details, see the latest Optimization Manual.