About the Overall Tuning Methodology

The general tuning methodology starts at the system level and works down to the microarchitecture level. Regardless of your specific tuning goals, conduct the analysis one level at a time, from the highest level to the lowest:

First, make sure you have no system-level bottlenecks. Once the processor is highly utilized, focus on application bottlenecks, and then on microarchitecture bottlenecks.

As a general rule, the same investment of time yields a greater speedup at a higher level than at a lower level.

Follow these steps in order to achieve the best speedup in the shortest amount of time.

For example, suppose you start with microarchitecture tuning while system-level bottlenecks keep the processor busy only 10% of the time. Even if you cut the processor-bound time by 50%, the workload as a whole runs only about 5% faster, because the processor accounts for only 10% of the total workload run time.
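The arithmetic behind this example can be sketched as follows. This is a minimal illustration, not part of any tool: the function names and parameters are hypothetical, and the sketch assumes that "a 50% speedup" means the processor-bound time is halved while the rest of the run time is unchanged.

```python
def workload_time_reduction(busy_fraction, local_time_reduction):
    """Fraction of total run time saved when only the processor-bound part gets faster.

    busy_fraction: share of the run time the processor is actually busy (0.0 to 1.0).
    local_time_reduction: share of that busy time removed by tuning (0.0 up to, not including, 1.0).
    """
    if not 0.0 <= busy_fraction <= 1.0:
        raise ValueError("busy_fraction must be between 0 and 1")
    if not 0.0 <= local_time_reduction < 1.0:
        raise ValueError("local_time_reduction must be in [0, 1)")
    return busy_fraction * local_time_reduction


def workload_speedup(busy_fraction, local_time_reduction):
    """Overall speedup factor (old time / new time) for the whole workload."""
    new_time = 1.0 - workload_time_reduction(busy_fraction, local_time_reduction)
    return 1.0 / new_time


# Processor busy 10% of the time, processor-bound time cut in half:
print(workload_time_reduction(0.10, 0.50))     # 5% of the total run time is saved
print(round(workload_speedup(0.10, 0.50), 3))  # about 1.053x overall

# The same tuning effort once the processor is busy 90% of the time:
print(round(workload_speedup(0.90, 0.50), 3))  # about 1.818x overall
```

The second call shows why the order matters: the same low-level improvement pays off far more once the higher-level bottlenecks are gone.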

Before you begin tuning, and again after each major change, check your application’s processor utilization to determine whether the application is processor-intensive, I/O-intensive, or somewhere in between.
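One rough way to check this from inside a program is to compare the processor time a workload consumes with the wall-clock time it takes. The sketch below is an assumption-laden illustration, not the analyzer's method: `measure_utilization`, `io_like`, and `compute_like` are hypothetical names, and the sleep merely stands in for waiting on disk or network.

```python
import time


def measure_utilization(workload):
    """Run workload() and return (cpu_seconds, wall_seconds, cpu/wall ratio).

    time.process_time() counts processor time used by this process (all threads),
    so the ratio can exceed 1.0 for a multithreaded workload.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    workload()
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    ratio = cpu_seconds / wall_seconds if wall_seconds > 0 else 0.0
    return cpu_seconds, wall_seconds, ratio


def io_like():
    """Hypothetical I/O-intensive workload: mostly waiting."""
    time.sleep(0.2)


def compute_like():
    """Hypothetical processor-intensive workload: mostly arithmetic."""
    total = 0
    for i in range(1_000_000):
        total += i * i
    return total


for name, workload in (("io_like", io_like), ("compute_like", compute_like)):
    cpu, wall, ratio = measure_utilization(workload)
    print(f"{name}: cpu={cpu:.3f}s wall={wall:.3f}s utilization={ratio:.0%}")
```

A ratio near zero suggests the workload spends its time waiting; a ratio near one suggests it is processor-intensive. A system-wide monitor is still needed to see per-processor utilization.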

If processor utilization is:

Low on a uniprocessor system (without Hyper-Threading Technology). Your application has many system bottlenecks (network, disk, memory usage, etc.).
Low on one or more processors on a multiprocessor system (or on a uniprocessor system with Hyper-Threading Technology). Your application has many system bottlenecks (network, disk, memory usage, etc.), and/or one of the following is true:

Your application is single-threaded.
Your application is multithreaded, but the threading model does not make effective use of all available processors.

High on all processors. Your application is processor-intensive.
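The cases above can be written down as a small decision rule. This is only a sketch: the function `diagnose` is hypothetical, and the `low` and `high` thresholds are arbitrary assumptions, since the text does not define numeric cutoffs for "low" and "high".

```python
def diagnose(per_cpu_utilization, hyper_threading=False, low=0.3, high=0.8):
    """Map per-processor utilization values (0.0 to 1.0) to the cases listed above.

    low and high are assumed cutoffs chosen for illustration only.
    """
    if not per_cpu_utilization:
        raise ValueError("need at least one utilization value")
    if all(u >= high for u in per_cpu_utilization):
        return "processor-intensive"
    if any(u < low for u in per_cpu_utilization):
        # A multiprocessor system, or one processor with Hyper-Threading Technology.
        if len(per_cpu_utilization) > 1 or hyper_threading:
            return "system bottlenecks and/or ineffective threading"
        return "system bottlenecks"
    return "in between"


print(diagnose([0.10]))                    # system bottlenecks
print(diagnose([0.95, 0.10, 0.10, 0.10]))  # system bottlenecks and/or ineffective threading
print(diagnose([0.90, 0.95]))              # processor-intensive
```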

There are three main strategies for improving application performance. Each strategy has an effect on processor utilization.

Balancing I/O and computation: When processor utilization is low because processors are waiting for I/O to complete, balancing I/O and computation can speed up an application. When the two are balanced, they can proceed simultaneously, so the I/O time is masked by the computation time. Balancing I/O and computation is usually done during system-level and application-level tuning.
Improving the threading model: Adding multithreading to a single-threaded application, or improving the threading model of a multithreaded application, is an application-level tuning technique that can speed up your application by making more effective use of all available processor resources. This usually raises processor utilization.
Improving the efficiency of computation: Doing less computation, or doing the same computation more efficiently, can also speed up your application. If the amount of I/O remains the same and the I/O time is not masked by computation time, processor utilization will decrease, since a higher fraction of the total workload run time is spent waiting for I/O. These types of changes are made during application-level and microarchitecture-level tuning.
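As an illustration of the first strategy, the sketch below starts the next read while the current chunk is being processed, so that waiting and computing overlap. It is one possible approach under stated assumptions, not a prescribed implementation: `read_chunk` and `process` are hypothetical stand-ins, and the sleep simulates a blocking disk or network read.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_chunk(index):
    """Hypothetical I/O step; the sleep stands in for a blocking read."""
    time.sleep(0.05)
    return list(range(index * 1000, (index + 1) * 1000))


def process(chunk):
    """Hypothetical computation step."""
    return sum(x * x for x in chunk)


def run_sequential(n_chunks):
    """Read, then compute, one chunk at a time: the processor idles during each read."""
    return [process(read_chunk(i)) for i in range(n_chunks)]


def run_overlapped(n_chunks):
    """Keep one read in flight on a helper thread while the main thread computes."""
    results = []
    if n_chunks <= 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(read_chunk, 0)
        for i in range(n_chunks):
            chunk = pending.result()                      # wait for the read in flight
            if i + 1 < n_chunks:
                pending = pool.submit(read_chunk, i + 1)  # start the next read
            results.append(process(chunk))                # compute while it proceeds
    return results


print(run_overlapped(3) == run_sequential(3))  # True: same results, different schedule
```

Both functions produce the same results; only the schedule differs. Blocking calls such as `time.sleep` release CPython's global interpreter lock, as real blocking I/O does, which is what lets the helper thread wait while the main thread computes.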

After some amount of system-level or application-level tuning, processor utilization may increase and you may find your application is ready for microarchitecture-level tuning. Conversely, after some amount of application-level or microarchitecture-level tuning, processor utilization may decrease and you may find you are ready for more system-level tuning.
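A small worked example may make this back-and-forth concrete. The model below is deliberately simple and its numbers are invented for illustration: it assumes a single processor, no overlap between I/O waits and computation, and a hypothetical helper named `utilization`.

```python
def utilization(cpu_seconds, io_wait_seconds):
    """Processor utilization when I/O waits and computation do not overlap."""
    total = cpu_seconds + io_wait_seconds
    if total <= 0:
        raise ValueError("total run time must be positive")
    return cpu_seconds / total


baseline = utilization(6.0, 4.0)        # 6 s computing, 4 s waiting
faster_compute = utilization(3.0, 4.0)  # computation made twice as efficient
faster_io = utilization(6.0, 1.0)       # I/O wait reduced instead

print(f"baseline:       {baseline:.1%}")        # 60.0%
print(f"faster compute: {faster_compute:.1%}")  # 42.9%
print(f"faster I/O:     {faster_io:.1%}")       # 85.7%
```

In this model, making the computation more efficient shortens the run but lowers utilization, which points back to system-level tuning, while reducing the I/O wait raises utilization and makes lower-level tuning more worthwhile.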

Experience acquired by performance analysts at Intel indicates that speedups in the 3x range are very common for system-level tuning, 2x for application-level tuning, and 1.1x to 1.5x for microarchitecture-level tuning.

You can benefit from several system-level optimizations that almost always positively impact the overall application performance.

For example:

Removing partial writes on the system bus that cause a system bus bottleneck, which in turn starves the processor.
Optimizing your application code so that it reads memory in double words, quad words, and so on, rather than in bytes.
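The idea of reading wider units can be illustrated with a small sketch. Treat it as an analogy only: Python gives no control over actual bus transactions, and the function names are hypothetical. It computes the same XOR checksum over a buffer once byte by byte and once in 8-byte (quad word) units, which takes roughly one eighth as many element reads.

```python
def xor_bytewise(data):
    """XOR of every byte, reading one byte at a time."""
    acc = 0
    for b in data:
        acc ^= b
    return acc


def xor_wordwise(data):
    """Same checksum, reading 8 bytes at a time where possible."""
    view = memoryview(data)
    n_words = len(view) // 8
    acc = 0
    if n_words:  # a zero-length view cannot be cast, so skip the cast when empty
        for word in view[: n_words * 8].cast("Q"):
            acc ^= word
    folded = 0
    for b in acc.to_bytes(8, "little"):  # fold the 64-bit result down to one byte
        folded ^= b
    for b in view[n_words * 8 :]:        # leftover bytes that do not fill a word
        folded ^= b
    return folded


sample = bytes(range(256)) * 3 + b"\x01\x02\x03"
print(xor_bytewise(sample) == xor_wordwise(sample))  # True
```

In a compiled language the equivalent change is typically to load through a wider integer type instead of a byte type, subject to that language's alignment and aliasing rules.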

Tuning Goals and Areas to Investigate

Order	Tuning Level	Goals	Key Areas to Investigate	Estimated Speedup
1	High: System-level	Speed up the application by improving how the application interacts with the system	Network Problems; Disk Performance; Memory Usage	3X
2	Medium: Application-level	Speed up the application by improving the application's algorithms	Locks; Heap contention; Threading Algorithm; APIs Usage	2X
3	Low: Microarchitecture-level	Speed up the application by improving how the application runs on the specific processors	Architecture Coding Pitfalls; Data/Code Locality (Cache); Data Alignment	1.1-1.5X

Use the VTune(TM) Performance Analyzer to implement these tuning methodologies to achieve the most performance gain with the least effort.