Chapter 6. Performance Tuning

After analyzing your code to determine where performance bottlenecks are occurring, you can turn your attention to making your programs run their fastest. One way to do this is to use multiple CPUs in parallel processing mode. However, this should be the last step. The first step is to make your program run as efficiently as possible on a single processor system and then consider ways to use parallel processing.

This chapter describes the process of tuning your application for a single processor system and then tuning it for parallel processing.

It also describes how to improve the performance of floating-point programs and MPI applications on SGI Altix UV series systems.

Intel provides detailed application tuning information, including information specific to the Intel Xeon processor 5500 series, at http://developer.intel.com/Assets/PDF/manual/248966.pdf and a tuning tutorial for Nehalem (Intel Xeon 5500) at http://software.intel.com/sites/webinar/tuning-your-application-for-nehalem/ .

Single Processor Code Tuning

Several basic steps, described in the following sections, are used to tune the performance of single-processor code:

Getting the Correct Results

One of the first steps in performance tuning is to verify that the correct answers are being obtained. Once the correct answers are obtained, tuning can be done. You can verify answers by initially disabling specific optimizations and limiting default optimizations. This can be accomplished by using specific compiler options and by using debugging tools.

The following compiler options emphasize tracing and porting over performance:

  • -O: the -O0 option disables all optimization. The default is -O2.

  • -g: the -g option preserves symbols for debugging. In the past, using -g automatically put down the optimization level. In Intel compiler today, you can use -O3 with -g.

  • -fp-model: the -fp-model option lets you specify the compiler rules for:

    • Value safety

    • Floating-point (FP) expression evaluation

    • FPU environment access

    • Precise FP exceptions

    • FP contractions

    The default is -fp-model fast=1. Note that the older -mp option has been replaced by -fp-model.

  • -r8, -i8: the -r8 and -i8 options set the default real, integer, and logical sizes to 8 bytes, which is useful for porting codes from Cray, Inc. systems; explicitly declare intrinsic and external library functions when using these options.

Some debugging tools can also be used to verify that correct answers are being obtained. See “Debugging Tools” in Chapter 3 for more details.

Managing Heap Corruption Problems

You can use environment variables to check for heap corruption problems in programs that use glibc malloc/free dynamic memory management routines.

Set the MALLOC_CHECK_ environment variable to 1 to print diagnostic messages or to 2 to abort immediately when heap corruption is detected.

Overruns and underruns occur when an access to an array falls outside the declared bounds of the array. Underruns and overruns cannot be detected simultaneously; the default behavior is to place inaccessible pages immediately after allocated memory.
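As a hedged illustration (the file name and buffer sizes are arbitrary), the following small C program deliberately overruns a malloc'd buffer. Running it with MALLOC_CHECK_ set as described above lets glibc report or abort on the corruption when the damaged allocation is next examined:

/* heapbug.c - deliberately overruns a malloc'd buffer (illustrative only) */
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *buf = malloc(16);
    if (buf == NULL)
        return 1;
    memset(buf, 'x', 32);   /* writes past the end of the allocation,
                               clobbering the allocator's bookkeeping */
    free(buf);              /* with MALLOC_CHECK_ set, glibc can report
                               or abort on the corruption detected here */
    return 0;
}

With the csh or tcsh shell, for example, issue % setenv MALLOC_CHECK_ 2 before running the program; glibc then typically aborts when free() examines the damaged allocation instead of failing unpredictably later.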

Using Tuned Code

Where possible, use code that has already been tuned for optimum hardware performance.

The following mathematical functions should be used where possible to help obtain best results:

  • MKL: Intel's Math Kernel Library. This library includes BLAS, LAPACK, and FFT routines.

  • VML: the Vector Math Library, available as part of the MKL package (libmkl_vml_itp.so).

  • Standard Math library

    Standard math library functions are provided with the Intel compiler's libimf.a file. If the -lm option is specified, glibc libm routines are linked in first.

Documentation for MKL and VML is available at http://intel.com/software/products/perflib/index.htm?iid=ipp_home+software_libraries& .
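As an illustration of using tuned library code, the following is a minimal C sketch that calls the BLAS routine DGEMM through MKL's C interface instead of a hand-coded triple loop. The matrix size and the build line are assumptions; consult the MKL documentation or link-line advisor for your installation:

/* dgemm_example.c - C = A*B using the tuned MKL BLAS instead of hand-coded loops */
#include <stdio.h>
#include <stdlib.h>
#include <mkl_cblas.h>

#define N 512                       /* example size; an assumption */

int main(void)
{
    double *a = malloc(N * N * sizeof(double));
    double *b = malloc(N * N * sizeof(double));
    double *c = malloc(N * N * sizeof(double));
    for (int i = 0; i < N * N; i++) {
        a[i] = 1.0;
        b[i] = 2.0;
        c[i] = 0.0;
    }
    /* C = 1.0*A*B + 0.0*C, row-major storage */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0, a, N, b, N, 0.0, c, N);
    printf("c[0] = %g\n", c[0]);    /* expect 1024 for these inputs */
    free(a); free(b); free(c);
    return 0;
}

One possible build line (an assumption; paths and flags vary by MKL version) is icc -mkl dgemm_example.c.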

Determining Tuning Needs

Use the following tools to determine what points in your code might benefit from tuning:

  • time: Use this command to obtain an overview of user, system, and elapsed time.

  • gprof: Use this tool to obtain an execution profile of your program (a pcsamp profile). Use the -p compiler option to enable gprof use.

  • VTune: This is an Intel performance monitoring tool. You can run it directly on your SGI Altix UV system. The Linux server/Windows client is useful when you are working on a remote system.

  • psrun is a PerfSuite (see http://perfsuite.ncsa.uiuc.edu/ ) command-line utility that allows you to take performance measurements of unmodified executables. psrun takes as input a configuration XML document that describes the desired measurement.

For information about other performance analysis tools, see Chapter 3, “Performance Analysis and Debugging”.

Using Compiler Options Where Possible

Several compiler options can be used to optimize performance. For a short summary of ifort or icc options, use the -help option on the compiler command line. Use the -dryrun option to show the driver tool commands that ifort or icc generate. This option does not actually compile.

Use the following options to help tune performance:

  • -ftz: Flushes underflow to zero to avoid kernel traps. Enabled by default
    at -O3 optimization.

  • -fno-alias: Assumes no pointer aliasing. Pointer aliasing creates uncertainty about whether two unrelated names might refer to the same memory; because of this uncertainty, the compiler must assume that any two pointers can point to the same location in memory. This assumption can remove optimization opportunities, particularly for loops. (A C sketch illustrating this appears at the end of this section.)

    Other aliasing options include -ansi_alias and -fno_fnalias. Note that incorrect alias assertions may generate incorrect code.

  • -ip: Generates single file, interprocedural optimization; -ipo generates multifile, interprocedural optimization.

    Most compiler optimizations work within a single procedure (like a function or a subroutine) at a time. This intra-procedural focus restricts optimization possibilities because a compiler is forced to make worst-case assumptions about the possible effects of a procedure. By using inter-procedural analysis, more than a single procedure is analyzed at once and code is optimized. It performs two passes through the code and requires more compile time.

  • -O3: Enables -O2 optimizations plus more aggressive optimizations, including loop transformation and prefetching. Loop transformations are listed in a transformation file created by the compiler; you can examine this file to see what suggested changes have been made to loops. Prefetch instructions allow data to be moved into the cache before their use. A prefetch instruction is similar to a load instruction.

    Note that Level 3 optimization may not improve performance for all programs.

  • -opt_report: Generates an optimization report and places it in the file specified by -opt_report_file.

  • -override_limits: This is an undocumented option that sometimes allows the compiler to continue optimizing when it has hit an internal limit.

  • -prof_gen and -prof_use: Generates and uses profiling information. These options require a three-step compilation process:

    1. Compile with proper instrumentation using -prof_gen.

    2. Run the program on one or more training datasets.

    3. Compile with -prof_use, which uses the profile information from the training run.

  • -S: Compiles and generates an assembly listing in the .s files and does not link. The assembly listing can be used in conjunction with the output generated by the -opt_report option to try to determine how well the compiler is optimizing loops.

  • -vec-report: Provides information specific to the vectorizer. Intel Xeon 7500 series processors can perform short vector operations, which provide a powerful performance boost.

  • -fast: Equivalent to writing: -ipo -O3 -no-prec-div -static -xHost

  • -xHost: Can generate instructions for the highest instruction set and processor available on the compilation host.

  • Specific processor architecture to compile for: use -xSSE4.2 for Nehalem EP/EX. This is useful when compiling on a system other than an Altix UV.

  • -xSSE4.2: Can generate Intel® SSE4 Efficient Accelerated String and Text Processing instructions supported by Intel® Core i7 processors. Can generate Intel® SSE4 Vectorizing Compiler and Media Accelerator, Intel® SSSE3, SSE3, SSE2, and SSE instructions and it can optimize for the Intel® Core™ processor family.

Another important feature of new Intel compilers is the source checker, which is enabled using the -diag-enable flag and its options. The source checker is a compiler feature that provides advanced diagnostics based on detailed analysis of source code. It performs static global analysis to find errors in software that go undetected by the compiler itself. It is a general source code analysis tool that provides an additional diagnostic capability to help you debug your programs. You can use the source code analysis options to detect potential errors in your compiled code, including the following:

  • Incorrect usage of OpenMP directives

  • Inconsistent object declarations in different program units

  • Boundary violations

  • Uninitialized memory

  • Memory corruptions

  • Memory Leaks

  • Incorrect usage of pointers and allocatable arrays

  • Dead code and redundant executions

  • Typographical errors or uninitialized variables

  • Dangerous usage of unchecked input

Source checker analysis performs a general overview check of a program for all possible values simultaneously. This is in contrast to run-time checking tools, which execute a program with a fixed set of values for input variables and therefore cannot easily check all edge effects. By not using a fixed set of input values, source checker analysis can check for all possible corner cases. In fact, you do not need to run the program at all: the analysis is performed at compilation time, and the only requirement is a successful compilation. An important caveat is that, because the source checker does not perform full interpretation of analyzed programs, it can generate so-called false-positive messages. This is a fundamental difference between compiler-generated and source checker-generated errors; in the case of the source checker, you decide whether a reported error is legitimate and needs to be fixed.
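To illustrate the aliasing issue behind the -fno-alias option discussed earlier in this section, the following hedged C sketch (compiled as C99, for example with icc -std=c99) shows a loop the compiler must treat conservatively when pointers may overlap, and the C99 restrict qualifier as a source-level way to make the same no-aliasing promise for a single routine rather than for the whole compilation:

/* axpy.c - pointer aliasing limits optimization of this loop */
#include <stddef.h>

/* Without further information the compiler must assume y and x might
   overlap, which inhibits vectorization and other reordering. */
void axpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = y[i] + a * x[i];
}

/* The restrict qualifiers assert that x and y never alias, giving the
   compiler the same freedom as -fno-alias, but only for this routine. */
void axpy_restrict(size_t n, double a,
                   const double * restrict x, double * restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = y[i] + a * x[i];
}

As with -fno-alias, an incorrect no-aliasing assertion (for example, calling axpy_restrict with overlapping arrays) can produce incorrect results.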

Tuning the Cache Performance

The processor cache stores recently-used information in a place where it can be accessed extremely fast. For more information, see “Cache Terminology”.

There are several actions you can take to help tune cache performance:

  • Avoid large power-of-2 (and multiples thereof) strides and dimensions that cause cache thrashing. Cache thrashing occurs when multiple memory accesses require use of the same cache line. This can lead to an unnecessary number of cache misses.

    To prevent cache thrashing, redimension your vectors so that the size is not a power of two. Space the vectors out in memory so that concurrently accessed elements map to different locations in the cache. When working with two-dimensional arrays, make the leading dimension an odd number; for multidimensional arrays, change two or more dimensions to an odd number.

    Consider the following example: a cache in the hierarchy has a size of 256 KB (or 65,536 4-byte words). A Fortran program contains the following loop:

    real data(655360,24)
    ...
    do i=1,23
       do j=1,655360
          diff=diff+data(j,i)-data(j,i+1)
       enddo
    enddo

    The two accesses to data are separated in memory by 655360*4 bytes, which is a simple multiple of the cache size; they consequently load to the same location in the cache. Because both data items cannot simultaneously coexist in that cache location, a pattern of replace on reload occurs that considerably reduces performance.

  • Use a memory stride of 1 wherever possible. A loop over an array should access array elements from adjacent memory addresses. When the loop iterates through memory by consecutive word addresses, it uses every word of every cache line in sequence and does not return to a cache line after finishing it.

    If memory strides other than 1 are used, cache lines could be loaded multiple times if an array is too large to be held in cache at one time.

  • Cache bank conflicts can occur if there are two accesses to the same 16-byte-wide bank at the same time.

    A maximum of four performance monitoring events can be counted simultaneously.

  • Group together data that is used at the same time and do not use vectors in your code, if possible. If the elements used in one loop iteration are contiguous in memory, traffic to the cache is reduced and fewer cache lines are fetched for each iteration of the loop.

  • Try to avoid the use of temporary arrays and minimize data copies.

Managing Memory

Nonuniform memory access (NUMA) systems use hardware in which memory and peripherals are distributed among many CPUs. This allows a shared memory system to scale, but a side effect is that the time it takes a CPU to access a memory location depends on where that memory resides. Because memory access times are nonuniform, program optimization is not always straightforward.

Codes that frequently allocate and deallocate memory through glibc malloc/free calls may accrue significant system time due to memory management overhead. By default, glibc strives for system-wide memory efficiency at the expense of performance.

In compilers up to and including version 7.1.x, to enable the higher-performance memory management mode, set the following environment variables:

% setenv MALLOC_TRIM_THRESHOLD_ -1
% setenv MALLOC_MMAP_MAX_ 0

Because allocations in ifort using the malloc intrinsic use the glibc malloc internally, these environment variables are also applicable in Fortran codes using, for example, Cray pointers with malloc/free. They do not work, however, for Fortran 90 allocatable arrays, which are managed directly through Fortran library calls and placed on the stack instead of the heap. The example above applies only to the csh and tcsh shells.

Memory Use Strategies

This section describes some general memory use strategies, as follows:

  • Register reuse: do a lot of work on the same data before working on new data

  • Cache reuse: the program is much more efficient if all of the data and instructions fit in cache; if not, try to use what is in cache a lot before using anything that is not in cache.

  • Data locality: try to access data that are close together in memory before data that are far apart (see the C sketch after this list).

  • I/O efficiency: do a bunch of I/O all at once rather than a little bit at a time; do not mix calculations and I/O.
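The following is a minimal C sketch (the array dimensions are arbitrary) illustrating the data-locality point above. C stores arrays in row-major order, so iterating with the rightmost index innermost walks memory with stride 1, while the reversed loop order jumps a full row per access:

/* locality.c - loop order determines memory stride for a C array */
#define NROWS 1024
#define NCOLS 1024

static double grid[NROWS][NCOLS];

double sum_good_locality(void)
{
    double sum = 0.0;
    /* Row-major order: consecutive j values touch adjacent addresses,
       so every word of each cache line is used before moving on. */
    for (int i = 0; i < NROWS; i++)
        for (int j = 0; j < NCOLS; j++)
            sum += grid[i][j];
    return sum;
}

double sum_poor_locality(void)
{
    double sum = 0.0;
    /* Column-major traversal of a row-major array: each access jumps
       NCOLS*sizeof(double) bytes, so cache lines are reloaded repeatedly. */
    for (int j = 0; j < NCOLS; j++)
        for (int i = 0; i < NROWS; i++)
            sum += grid[i][j];
    return sum;
}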

Cache Terminology

The following terms describe cache behavior:

  • Cache line: the minimum unit of transfer from the next-higher cache (or memory) into this one.

  • Cache hit: a reference to a cache line that is present in the cache.

  • Cache miss: a reference to a cache line that is not present in this cache level and must be retrieved from a higher cache (or from memory or swap space).

  • Hit time: the time to access the upper level of the memory hierarchy, including the time needed to determine whether the access is a hit or a miss.

  • Miss penalty: the time to replace a block in the upper level with the corresponding block from the lower level, plus the time to deliver this block to the processor. The time to access the next level in the hierarchy is the major component of the miss penalty.

Memory Hierarchy Latencies

Programmers tend to think of memory as a flat, random-access storage device, but it is critical to understand that memory is a hierarchy in order to get good performance. Memory latency differs within the hierarchy, and performance is affected by where the data resides. Approximate latencies are:

  • Registers: 0 cycles (cycle = 1/freq)

  • L1 cache: 1 cycle

  • L2 cache: 5-6 cycles

  • L3 cache: 12-17 cycles

  • Main memory: 130-1000+ cycles

CPUs that are waiting for memory are not doing useful work. Software should be "hierarchy-aware" to achieve best performance:

  • Perform as many operations as possible on data in registers

  • Perform as many operations as possible on data in the cache(s)

  • Keep data uses spatially and temporally local

  • Consider temporal locality and spatial locality

    Memory hierarchies take advantage of temporal locality by keeping more recently accessed data items closer to the processor. Memory hierarchies take advantage of spatial locality by moving contiguous words in memory to upper levels of the hierarchy.
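One common way to exploit both kinds of locality is loop blocking (tiling). The following is a minimal C sketch, with an assumed block size, of a blocked matrix transpose; each tile is small enough to stay in cache while it is both read and written:

/* transpose_blocked.c - loop blocking to improve temporal and spatial locality */
#define N  2048
#define BS 64                 /* block size; an assumption, tune for your cache */

static double a[N][N], b[N][N];

void transpose_blocked(void)
{
    for (int ii = 0; ii < N; ii += BS)
        for (int jj = 0; jj < N; jj += BS)
            /* Work on one BS x BS tile at a time so the tile of 'a' being
               read and the tile of 'b' being written both remain cached. */
            for (int i = ii; i < ii + BS; i++)
                for (int j = jj; j < jj + BS; j++)
                    b[j][i] = a[i][j];
}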

Multiprocessor Code Tuning

Before beginning any multiprocessor tuning, first perform single processor tuning. Single processor tuning often improves the performance of multiprocessor codes as well. For details, see “Single Processor Code Tuning”.

Multiprocessor tuning consists of the following major steps:

Data Decomposition

In order to efficiently use multiple processors on a system, tasks have to be found that can be performed at the same time. There are two basic methods of defining these tasks:

  • Functional parallelism

    Functional parallelism is achieved when different processors perform different functions. This is a familiar approach for programmers trained in modular programming. Disadvantages of this approach include the difficulty of defining functions as the number of processors grows and of finding functions that use equivalent amounts of CPU power. This approach may also require large amounts of synchronization and data movement.

  • Data parallelism

    Data parallelism is achieved when different processors perform the same function on different parts of the data. This approach takes advantage of the large cumulative memory. One requirement of this approach, though, is that the problem domain be decomposed. There are two steps in data parallelism:

    1. Data decomposition

      Data decomposition is breaking up the data and mapping data to processors. Data can be broken up explicitly by the programmer by using message passing (with MPI) and data passing (using the SHMEM library routines) or can be done implicitly using compiler-based MP directives to find parallelism in implicitly decomposed data.

      There are advantages and disadvantages to implicit and explicit data decomposition:

      • Implicit decomposition advantages: No data resizing is needed; all synchronization is handled by the compiler; the source code is easier to develop and is portable to other systems with OpenMP or High Performance Fortran (HPF) support.

      • Implicit decomposition disadvantages: The data communication is hidden from the user.

      • Explicit decomposition advantages: The programmer has full control over insertion of communication and synchronization calls; the source code is portable to other systems; code performance can be better than implicitly parallelized codes.

      • Explicit decomposition disadvantages: Harder to program; the source code is harder to read and the code is longer (typically 40% more).

    2. The final step is to divide the work among processors.

Parallelizing Your Code

The first step in multiprocessor performance tuning is to choose the parallelization methodology that you want to use for tuning. This section discusses those options in more detail.

You should first determine the amount of code that is parallelized. Use the following formula to calculate the amount of code that is parallelized:

p = N * (T(1) - T(N)) / (T(1) * (N - 1))

In this equation, T(1) is the time the code runs on a single CPU and T(N) is the time it runs on N CPUs. Speedup is defined as T(1)/T(N).

If Speedup(N)/N is less than 50% (that is, N > (2-p)/(1-p)), stop adding CPUs and tune for better scalability.
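A minimal C sketch of that calculation (the timing values are placeholders) is shown below; it computes p and Speedup(N) from measured T(1) and T(N) and applies the 50% efficiency test:

/* pfrac.c - parallel fraction and scalability check from two timings */
#include <stdio.h>

int main(void)
{
    double t1 = 188.0;   /* measured time on 1 CPU; placeholder value  */
    double tn = 104.0;   /* measured time on N CPUs; placeholder value */
    int    n  = 2;

    double p       = n * (t1 - tn) / (t1 * (n - 1));
    double speedup = t1 / tn;

    printf("parallel fraction p = %.3f, Speedup(%d) = %.2f\n", p, n, speedup);
    if (speedup / n < 0.5)                    /* i.e. N > (2-p)/(1-p) */
        printf("efficiency below 50%%: tune for scalability before adding CPUs\n");
    return 0;
}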

CPU activity can be displayed with the top or vmstat commands, or by using the Performance Co-Pilot tools (for example, pmval kernel.percpu.cpu.user) or the Performance Co-Pilot visualization tool pmchart.

Next you should focus on a parallelization methodology, as discussed in the following subsections.

Use MPT

You can use the Message Passing Interface (MPI) from the SGI Message Passing Toolkit (MPT). MPI is optimized and more scalable for SGI Altix series systems than generic MPI libraries. It takes advantage of the SGI Altix architecture and SGI Linux NUMA features. MPT is included with the SGI MPI, part of the SGI Performance Suite software.

Use the -lmpi compiler option to use MPI. For a list of environment variables that are supported, see the mpi man page.
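The following is a minimal MPI sketch in C. It illustrates only the basic send/receive structure; the build and launch commands shown after it are the usual MPT conventions and may differ on your system:

/* ring.c - each rank sends its rank number to rank 0 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        for (int src = 1; src < size; src++) {
            int val;
            MPI_Recv(&val, 1, MPI_INT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 0 received %d from rank %d\n", val, src);
        }
    } else {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

Under MPT, such a program is typically built with icc -o ring ring.c -lmpi and launched with mpirun -np 4 ./ring; see the mpirun(1) man page.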

MPIO_DIRECT_READ and MPIO_DIRECT_WRITE are supported under Linux for local XFS filesystems in SGI MPT version 1.6.1 and beyond.

MPI provides the MPI-2 standard MPI I/O functions that provide file read and write capabilities. A number of environment variables are available to tune MPI I/O performance. See the mpi_io(3) man page for a description of these environment variables.

Performance tuning for MPI applications is described in more detail in Chapter 6 of the Message Passing Toolkit (MPT) User's Guide.

Use OpenMP

OpenMP is a shared memory multiprocessing API, which standardizes existing practice. It is scalable for fine or coarse grain parallelism with an emphasis on performance. It exploits the strengths of shared memory and is directive-based. The OpenMP implementation also contains library calls and environment variables.

To use OpenMP directives with C, C++, or Fortran codes, you can use the following compiler options:

  • ifort -openmp or icc -openmp: These options use the OpenMP front-end that is built into the Intel compilers. The latest Intel compiler OpenMP runtime library is named libiomp5.so. The latest Intel compiler also supports the GNU OpenMP library as an either/or option (not to be mixed and matched with the Intel version).

For details about OpenMP usage see the OpenMP standard, available at http://www.openmp.org/specs .
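A minimal OpenMP sketch in C follows. Compiled with icc -openmp as described above, the loop iterations are divided among the threads set by OMP_NUM_THREADS (the array size is arbitrary):

/* omp_saxpy.c - compile with: icc -openmp omp_saxpy.c */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static float x[N], y[N];
    float a = 2.0f;

    for (int i = 0; i < N; i++) {   /* serial initialization */
        x[i] = 1.0f;
        y[i] = 0.0f;
    }

    /* Iterations are divided among the OpenMP threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = y[i] + a * x[i];

    printf("threads available: %d, y[0] = %g\n", omp_get_max_threads(), y[0]);
    return 0;
}

With the csh shell, for example, % setenv OMP_NUM_THREADS 8 before running the program selects eight threads.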

OpenMP Nested Parallelism

This section describes OpenMP nested parallelism. For additional information, see the dplace(1) man page.

Here is a simple example for OpenMP nested parallelism with 2 "top" threads and 4 "bottom" threads that are called master/nested below:

% cat place_nested
firsttask cpu=0
thread name=a.out oncpu=0 cpu=4 noplace=1 exact onetime
thread name=a.out oncpu=0 cpu=1-3 exact
thread name=a.out oncpu=4 cpu=5-7 exact

% dplace -p place_nested a.out
Master thread 0 running on cpu 0
Master thread 1 running on cpu 4
Nested thread 0 of master 0 gets task 0 on cpu 0
Nested thread 1 of master 0 gets task 1 on cpu 1
Nested thread 2 of master 0 gets task 2 on cpu 2
Nested thread 3 of master 0 gets task 3 on cpu 3
Nested thread 0 of master 1 gets task 0 on cpu 4
Nested thread 1 of master 1 gets task 1 on cpu 5
Nested thread 2 of master 1 gets task 2 on cpu 6
Nested thread 3 of master 1 gets task 3 on cpu 7

Use Compiler Options

Use the compiler to invoke automatic parallelization. Use the -parallel and -par_report options with the ifort or icc compiler. These options show which loops were parallelized and the reasons why some loops were not parallelized. If a source file contains many loops, it might be necessary to add the -override_limits flag to enable automatic parallelization. The code generated by -parallel is based on the OpenMP API; the standard OpenMP environment variables and Intel extensions apply.

There are some limitations to automatic parallelization:

  • For Fortran codes, only DO loops are analyzed

  • For C/C++ codes, only for loops using explicit array notation or pointer-increment notation are analyzed; for loops using pointer-arithmetic notation are not analyzed, nor are while or do/while loops. The compiler also does not check for blocks of code that can be run in parallel. (A C sketch contrasting these loop forms follows this list.)
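The following hedged C sketch (the loop bodies are illustrative only) contrasts the two loop forms described above: the first is a counted for loop with explicit array notation that the automatic parallelizer can analyze, while the second is a while loop using pointer arithmetic, which falls outside the analysis:

/* autopar.c - candidate and non-candidate loops for -parallel */
#define N 100000

void scale_candidate(double *a, double s)
{
    /* A counted for loop with explicit array notation: analyzable. */
    for (int i = 0; i < N; i++)
        a[i] = a[i] * s;
}

void scale_not_analyzed(double *a, double *end, double s)
{
    /* A while loop using pointer arithmetic: not analyzed by the
       automatic parallelizer. */
    while (a < end) {
        *a = *a * s;
        a++;
    }
}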

Identifying Parallel Opportunities in Existing Code

Another parallelization optimization technique is to identify loops that have a potential for parallelism, such as the following:

  • Loops without data dependencies; a data dependency conflict occurs when a loop has results from one loop pass that are needed in future passes of the same loop.

  • Loops with data dependencies because of temporary variables, reductions, nested loops, or function calls or subroutines.

Loops that do not have a potential for parallelism are those with premature exits, too few iterations, or those where the programming effort to avoid data dependencies is too great.
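The following hedged C sketch contrasts the two cases: the first loop has no loop-carried dependency and is a parallelization candidate, while the second carries a result from one iteration into the next and cannot be parallelized as written:

/* deps.c - independent iterations versus a loop-carried dependency */
#define N 100000

void independent(double *a, const double *b, const double *c)
{
    /* Each iteration touches only its own elements: parallelizable. */
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];
}

void carried(double *a)
{
    /* a[i] depends on a[i-1] computed in the previous pass:
       a data dependency conflict, not parallelizable as written. */
    for (int i = 1; i < N; i++)
        a[i] = a[i - 1] + 1.0;
}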

Fixing False Sharing

If the parallel version of your program is slower than the serial version, false sharing might be occurring. False sharing occurs when two or more data items that appear not to be shared between threads nonetheless fall in the same cache line in the processor data caches. If two threads executing on different CPUs modify the same cache line, the cache line cannot remain resident and correct in both CPUs, and the hardware must move the cache line through the memory subsystem to retain coherency. This causes performance degradation and reduces the scalability of the application. If the data items are only read, not written, the cache line remains in a shared state on all of the CPUs concerned.

False sharing typically occurs when different threads modify adjacent elements in a shared array, for example, when an array is decomposed into per-thread chunks whose boundaries fall in the middle of a cache line, so that two CPUs end up writing to the same line.
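The following hedged C/OpenMP sketch shows the classic pattern: per-thread counters stored in adjacent array elements share a cache line, so every update forces the line to migrate between CPUs, while padding each counter to its own cache line (64 bytes is assumed here) removes the contention:

/* false_share.c - adjacent per-thread counters versus padded counters */
#include <omp.h>

#define NTHREADS 8
#define ITERS    10000000
#define LINE     64                    /* assumed cache line size in bytes */

long hot[NTHREADS];                                               /* counters share lines */
struct { long v; char pad[LINE - sizeof(long)]; } cold[NTHREADS]; /* one counter per line */

void count_false_sharing(void)
{
    #pragma omp parallel num_threads(NTHREADS)
    {
        int me = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++)
            hot[me]++;                 /* threads fight over shared cache lines */
    }
}

void count_padded(void)
{
    #pragma omp parallel num_threads(NTHREADS)
    {
        int me = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++)
            cold[me].v++;              /* each counter lives in its own line */
    }
}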

You can use the following methods to verify that false sharing is happening:

  • Use the performance monitor to look at output from pfmon and the BUS_MEM_READ_BRIL_SELF and BUS_RD_INVAL_ALL_HITM events.

  • Use pfmon to check DEAR events to track common cache lines.

  • Use the Performance Co-Pilot pmshub utility to monitor cache traffic and CPU utilization. You can also use the shubstats(1) tool to monitor Altix cache and directory traffic.

If false sharing is a problem, try the following solutions:

  • Use the hardware counter to run a profile that monitors storage to shared cache lines. This will show the location of the problem.

  • Revise data structures or algorithms.

  • Check shared data, static variables, common blocks, and private and public variables in shared objects.

  • Use critical regions to identify the part of the code that has the problem.

Using dplace and taskset

The dplace command binds processes to specified CPUs in a round-robin fashion. Once bound to a CPU, a process does not migrate. dplace numbering is done in the context of the current CPU memory set. See Chapter 4, “Monitoring Tools” for details about dplace.

The taskset command restricts execution to the listed set of CPUs; however, processes are still free to move among listed CPUs.

Environment Variables for Performance Tuning

You can use several different environment variables to assist in performance tuning. For details about environment variables used to control the behavior of MPI, see the mpi(1) man page.

Several OpenMP environment variables can affect the actions of the OpenMP library. For example, some environment variables control the behavior of threads in the application when they have no work to perform or are waiting for other threads to arrive at a synchronization semantic; other variables can specify how the OpenMP library schedules iterations of a loop across threads. The following environment variables are part of the OpenMP standard:

  • OMP_NUM_THREADS (The default is the number of CPUs in the system.)

  • OMP_SCHEDULE (The default is static.)

  • OMP_DYNAMIC (The default is false.)

  • OMP_NESTED (The default is false.)

In addition to the preceding environment variables, Intel provides several OpenMP extensions, two of which are provided through the use of the KMP_LIBRARY variable.

The KMP_LIBRARY variable sets the run-time execution mode, as follows:

  • If set to serial, single-processor execution is used.

  • If set to throughput, CPUs yield to other processes when waiting for work. This is the default and is intended to provide good overall system performance in a multiuser environment.

  • If set to turnaround, worker threads do not yield while waiting for work. Setting KMP_LIBRARY to turnaround may improve the performance of benchmarks run on dedicated systems, where multiple users are not contending for CPU resources.

If your program gets a segmentation fault immediately upon execution, you may need to increase KMP_STACKSIZE. This is the private stack size for threads. The default is 4 MB. You may also need to increase your shell stacksize limit.

Understanding Parallel Speedup and Amdahl's Law

There are two ways to obtain the use of multiple CPUs. You can take a conventional program in C, C++, or Fortran, and have the compiler find the parallelism that is implicit in the code.

You can write your source code to use explicit parallelism, stating in the source code which parts of the program are to execute asynchronously, and how the parts are to coordinate with each other.

When your program runs on more than one CPU, its total run time should be less. But how much less? What are the limits on the speedup? That is, if you apply 16 CPUs to the program, should it finish in 1/16th the elapsed time?

This section covers the following topics:

Adding CPUs to Shorten Execution Time

You can distribute the work your program does over multiple CPUs. However, there is always some part of the program's logic that has to be executed serially, by a single CPU. This sets the lower limit on program run time.

Suppose there is one loop in which the program spends 50% of the execution time. If you can divide the iterations of this loop so that half of them are done in one CPU while the other half are done at the same time in a different CPU, the whole loop can be finished in half the time. The result: a 25% reduction in program execution time.

The mathematical treatment of these ideas is called Amdahl's law, named for computer pioneer Gene Amdahl, who formalized it. There are two basic limits to the speedup you can achieve by parallel execution:

  • The fraction of the program that can be run in parallel, p, is never 100%.

  • Because of hardware constraints, after a certain point, there is less and less benefit from each added CPU.

Tuning for parallel execution comes down to doing the best that you are able to do within these two limits. You strive to increase the parallel fraction, p, because in some cases even a small change in p (from 0.8 to 0.85, for example) makes a dramatic change in the effectiveness of added CPUs.

Then you work to ensure that each added CPU does a full CPU's work, and does not interfere with the work of other CPUs. In the SGI Altix architectures this means:

  • Spreading the workload equally among the CPUs

  • Eliminating false sharing and other types of memory contention between CPUs

  • Making sure that the data used by each CPU are located in a memory near that CPU's node

Understanding Parallel Speedup

If half the iterations of a DO-loop are performed on one CPU, and the other half run at the same time on a second CPU, the whole DO-loop should complete in half the time. For example, consider the typical C loop in Example 6-1.

Example 6-1. Typical C Loop

for (j=0; j<MAX; ++j) {
   z[j] = a[j]*b[j];
} 

The compiler can automatically distribute such a loop over n CPUs (with n decided at run time based on the available hardware), so that each CPU performs MAX/n iterations.

The speedup gained from applying n CPUs, Speedup(n), is the ratio of the one-CPU execution time to the n-CPU execution time: Speedup(n ) = T(1) ÷ T(n). If you measure the one-CPU execution time of a program at 100 seconds, and the program runs in 60 seconds with two CPUs, Speedup(2) = 100 ÷ 60 = 1.67.

This number captures the improvement from adding hardware. T(n) ought to be less than T(1); if it is not, adding CPUs has made the program slower, and something is wrong! So Speedup(n) should be a number greater than 1.0, and the greater it is, the better. Intuitively you might hope that the speedup would be equal to the number of CPUs (twice as many CPUs, half the time) but this ideal can seldom be achieved.

Understanding Superlinear Speedup

You expect Speedup(n) to be less than n, reflecting the fact that not all parts of a program benefit from parallel execution. However, it is possible, in rare situations, for Speedup(n) to be larger than n. When a program is sped up by more than the increase in CPUs, the result is known as superlinear speedup.

A superlinear speedup does not really result from parallel execution. It comes about because each CPU is now working on a smaller set of memory. The problem data handled by any one CPU fits better in cache, so each CPU executes faster than the single CPU could do. A superlinear speedup is welcome, but it indicates that the sequential program was being held back by cache effects.

Understanding Amdahl's Law

There are always parts of a program that you cannot make parallel, where code must run serially. For example, consider the DO-loop. Some amount of code is devoted to setting up the loop, allocating the work between CPUs. This housekeeping must be done serially. Then comes parallel execution of the loop body, with all CPUs running concurrently. At the end of the loop comes more housekeeping that must be done serially; for example, if n does not divide MAX evenly, one CPU must execute the few iterations that are left over.

The serial parts of the program cannot be sped up by concurrency. Let p be the fraction of the program's code that can be made parallel (p is always a fraction less than 1.0). The remaining fraction (1-p) of the code must run serially. In practical cases, p ranges from 0.2 to 0.99.

The potential speedup for a program is the inverse of the quantity p divided by the number of CPUs you can apply, plus the remaining serial part, 1-p. As an equation, this appears as Example 6-2.

Example 6-2. Amdahl's law: Speedup(n) Given p

                  1 
Speedup(n) = ----------- 
             (p/n)+(1-p) 

Suppose p = 0.8; then Speedup(2) = 1 / (0.4 + 0.2) = 1.67, and Speedup(4) = 1 / (0.2 + 0.2) = 2.5. The maximum possible speedup (if you could apply an infinite number of CPUs) would be 1 / (1-p). The fraction p has a strong effect on the possible speedup.

The reward for parallelization is small unless p is substantial (at least 0.8); or to put the point another way, the reward for increasing p is great no matter how many CPUs you have. The more CPUs you have, the more benefit you get from increasing p. Using only four CPUs, you need only about p = 0.67 to get half the ideal speedup. With eight CPUs, you need p = 0.85 to get half the ideal speedup.

There is a slightly more sophisticated version of Amdahl's law that includes communication overhead. It shows that even if the program has no serial part, as the number of cores increases the amount of computation per core diminishes while the communication overhead increases (unless there is no communication at all, as in trivially parallel programs), which also reduces the efficiency of the code and the speedup. The equation is:

Speedup(n) = n / (1 + a*(n-1) + n*(tc/ts))

where:

  • n: number of processes

  • a: the fraction of the given task not dividable into concurrent subtasks

  • ts: time to execute the task on a single processor

  • tc: communication overhead

If a=0 and tc=0 (no serial part and no communication), as in a trivially parallel program, you get linear speedup.

Calculating the Parallel Fraction of a Program

You do not have to guess at the value of p for a given program. Measure the execution times T(1) and T(2) to calculate a measured Speedup(2) = T(1) / T(2). The Amdahl's law equation can be rearranged to yield p when Speedup(2) is known, as in Example 6-3.

Example 6-3. Amdahl's law: p Given Speedup(2)

        SpeedUp(2) - 1 
p = 2 * -------------- 
          SpeedUp(2)  

Suppose you measure T(1) = 188 seconds and T(2) = 104 seconds.

SpeedUp(2) = 188/104 = 1.81 
p = 2 * ((1.81-1)/1.81) = 2*(0.81/1.81) = 0.895 

In some cases, the Speedup(2) = T(1)/T(2) is a value greater than 2; in other words, a superlinear speedup (“Understanding Superlinear Speedup”). When this occurs, the formula in Example 6-3 returns a value of p greater than 1.0, which is clearly not useful. In this case you need to calculate p from two other more realistic timings, for example T(2) and T(3). The general formula for p is shown in Example 6-4, where n and m are the two CPU counts whose speedups are known, n>m.

Example 6-4. Amdahl's Law: p Given Speedup( n) and Speedup( m)

                Speedup(n) - Speedup(m)
p  =  ------------------------------------------- 
   (1 - 1/n)*Speedup(n) - (1 - 1/m)*Speedup(m) 


Predicting Execution Time with n CPUs

You can use the calculated value of p to extrapolate the potential speedup with higher numbers of CPUs. The following example shows the expected time with four CPUs, if p=0.895 and T(1)=188 seconds:

Speedup(4)= 1/((0.895/4)+(1-0.895)) = 3.04 
T(4)= T(1)/Speedup(4) = 188/3.04 = 61.8 

You can make this calculation routine by creating a small script or program that automates the calculations and extrapolates run times, as in the sketch below.
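The following is a minimal C sketch of such a calculation, using the measured times from the example above as placeholder inputs:

/* extrapolate.c - derive p from two timings and predict T(n) for larger n */
#include <stdio.h>

int main(void)
{
    double t1 = 188.0;                       /* measured one-CPU time (seconds) */
    double t2 = 104.0;                       /* measured two-CPU time (seconds) */

    double s2 = t1 / t2;                     /* Speedup(2)                      */
    double p  = 2.0 * (s2 - 1.0) / s2;       /* Amdahl's law rearranged for p   */

    printf("Speedup(2) = %.2f, p = %.3f\n", s2, p);
    for (int n = 4; n <= 64; n *= 2) {
        double sn = 1.0 / (p / n + (1.0 - p));  /* Example 6-2 */
        printf("n = %2d: predicted Speedup = %.2f, predicted T = %.1f s\n",
               n, sn, t1 / sn);
    }
    return 0;
}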

These calculations are independent of most programming issues such as language, library, or programming model. They are not independent of hardware issues, because Amdahl's law assumes that all CPUs are equal. At some level of parallelism, adding a CPU no longer affects run time in a linear way. For example, on some architectures, cache-friendly codes scale closely with Amdahl's law up to the maximum number of CPUs, but scaling of memory intensive applications slows as the system bus approaches saturation. When the bus bandwidth limit is reached, the actual speedup is less than predicted.

Gustafson's Law

Gustafson's law proposes that programmers set the size of problems to use the available equipment to solve problems within a practical fixed time. Therefore, if faster (more parallel) equipment is available, larger problems can be solved in the same time. Amdahl's law, by contrast, is based on a fixed workload or fixed problem size; it implies that the sequential part of a program does not change with machine size (for example, the number of processors), while the parallel part is evenly distributed over n processors. The impact of Gustafson's law was to shift research goals toward selecting or reformulating problems so that a larger problem could be solved in the same amount of time. In particular, the law redefines efficiency as a need to minimize the sequential part of a program, even if that increases the total amount of computation. The bottom line is that by running larger problems, it is hoped that the bulk of the calculation will increase faster than the serial part of the program, allowing for better scaling. The communication-overhead form of Amdahl's law given in “Understanding Amdahl's Law” applies here as well.

MPInside Profiling Tool

MPInside(3) is an MPI profiling tool that provides valuable information for optimizing your MPI application. It helps you determine where MPI send/receive pairs are not executed synchronously. With non-synchronized send/receive pairs, MPI communication can be very slow, independent of the power of the underlying MPI library and hardware engine. For most MPI applications, the MPI communication time is attributable more to the lack of synchronization of these send/receive pairs than to the MPI/hardware engine.

MPInside, among its other functions, measures this non-synchronized time for all the MPI ranks involved in the application and for all the MPI functions activated. It lets you determine at what actual speed the MPI engine performed the communication, for example, the ratio of bytes received to the time of the MPI function minus the synchronization time. It provides this information accumulated per CPU as well as in a CPU-by-CPU matrix.

In addition, MPInside automatically and precisely reports the timing described above on a branch basis. A branch is an MPI function together with all of its ancestors in the calling sequence. MPInside provides the routine name and the source file line number for all the routines defining a branch. All branches are put in relation with the other CPU branches that had a send/receive partnership with them. For any CPU, any receive branch performed by that CPU has partners. A partner set is described by four numbers: the sending rank number, the sending CPU branch identification, the percentage of time accounted to this partnership relative to the total execution wait time of the receive branch, and the percentage of that time attributable to lack of synchronization.

Even if the MPI/hardware engine performance accounts for some of the non-synchronized communication time, most of it is attributable to the application itself and is therefore the developer's responsibility. MPInside tells you where and how much such non-synchronized communication occurred in the application.

To load the MPInside module into your environment (for any shell), use the following command:

% module load MPInside/3.1

For details about using software modules, see the module(1) man page.

For more information on using MPInside, see the following documents located in the MPInside software module:

  • mpinside_3.1_ref_manual.pdf

    MPInside 3.1 Reference Manual describes how to use the MPI profiling tool.

  • MPInside_window_how_to.pdf

    HOW TO select a window of observation with MPInside slide set

  • MPInside.3

    The MPInside(3) man page describes how to use MPInside

SGI PerfBoost

SGI PerfBoost uses a wrapper library to run applications compiled against other MPI implementations under the SGI Message Passing Toolkit (MPT) product on SGI platforms. PerfBoost allows you to run SGI MPT, a version of MPI optimized for SGI's large shared-memory systems that can take advantage of the Altix UV Hub. For more information, see “Performance Tuning Running MPI on Altix UV 100 and Altix UV 1000 Systems” and Chapter 6, “PerfBoost” in the Message Passing Toolkit (MPT) User's Guide, available on the Tech Pubs Library at http://docs.sgi.com .

Perfcatcher

The simple-to-use Perfcatcher tool uses a wrapper library to return MPI and SHMEM function profiling information. Some analysis is done, and information such as percent CPU time, total time spent per function, message sizes, and load imbalance is reported. See the perfcatch(1) man page and Chapter 8, “MPI Performance Profiling” in the Message Passing Toolkit (MPT) User's Guide, available on the Tech Pubs Library at http://docs.sgi.com.

Performance Tuning Running MPI on Altix UV 100 and Altix UV 1000 Systems

The SGI Altix UV 100 and Altix UV 1000 series systems are scalable nonuniform memory access (NUMA) systems that support a single Linux image of thousands of processors distributed over many sockets and SGI Altix UV Hub application-specific integrated circuits (ASICs). The UV Hub is the heart of the SGI Altix UV 1000 or Altix UV 100 system compute blade. Each "processor" is a hyperthread on a particular core within a particular socket. Each Altix UV Hub normally connects to two sockets. All communication between the sockets and the UV Hub uses Intel QuickPath Interconnect (QPI) channels. The Altix UV Hub has four NUMAlink 5 ports that connect with the NUMAlink 5 interconnect fabric. The UV Hub acts as a crossbar between the processors, local SDRAM memory, and the network interface. The Hub ASIC enables any processor in the single-system image (SSI) to access the memory of all processors in the SSI. For more information on the SGI Altix UV hub, Altix UV compute blades, QPI, and NUMAlink 5, see the SGI Altix UV 1000 System User's Guide or the SGI Altix UV 100 System User's Guide, respectively.

When MPI communicates between processes, two transfer methods are possible on an Altix UV system:

  • By use of shared memory

  • By use of the global reference unit (GRU), part of the Altix UV Hub ASIC

MPI chooses the method depending on internal heuristics, the type of MPI communication that is involved, and some user-tunable variables. When using the GRU to transfer data and messages, the MPI library uses the GRU resources it allocates via the GRU resource allocator, which divides up the available GRU resources. It fairly allocates buffer space and control blocks between the logical processors being used by the MPI job.

General Considerations

Running MPI jobs optimally on Altix UV systems is not very difficult. It is best to pin MPI processes to CPUs and to isolate multiple MPI jobs onto different sets of sockets and Hubs; this is usually achieved by configuring a batch scheduler to create a cpuset for every MPI job. By default, MPI pins its processes to the sequential list of logical processors within the containing cpuset, but you can control and alter the pinning pattern using MPI_DSM_CPULIST. See the MPI_DSM_CPULIST discussion in the Message Passing Toolkit (MPT) User's Guide, and the omplace(1) and dplace(1) man pages.

Job Performance Types

The MPI library chooses buffer sizes and communication algorithms in an attempt to deliver the best performance automatically to a wide variety of MPI applications. However, applications have different performance profiles and bottlenecks, and so user tuning may be of help in improving performance. Here are some application performance types and ways that MPI performance may be improved for them:

  • Odd HyperThreads are idle.

    Most high performance computing MPI programs run best using only one HyperThread per core. When an Altix UV system has multiple HyperThreads per core, logical CPUs are numbered such that odd HyperThreads are the high half of the logical CPU numbers. Therefore, you can schedule only the even HyperThreads by scheduling MPI jobs as if only half the full number of logical CPUs exist, leaving the high logical CPUs idle. You can use the cpumap(1) command to determine whether the cores on your Altix UV system have multiple HyperThreads. The output reports the number of physical and logical processors, whether Hyperthreading is ON or OFF, and how shared processors are paired (towards the bottom of the command's output).

    If an MPI job uses only half of the available logical CPUs, set GRU_RESOURCE_FACTOR to 2 so that the MPI processes can utilize all the available GRU resources on a Hub rather than reserving some of them for the idle HyperThreads. For more information about GRU resource tuning, see the gru_resource(3) man page.

  • MPI large message bandwidth is important.

    MPI can improve the bandwidth of large messages if MPI_GRU_CBS is set to 0. This favors large message bandwidth at the cost of suppressing asynchronous MPI message delivery. In addition, some programs transfer large messages via the MPI_Send function. To switch on the use of unbuffered, single copy transport in these cases you can set MPI_BUFFER_MAX to 0. See the MPI(1) man page for more details.

  • MPI small or near messages are very frequent.

    For small fabric hop counts, shared memory message delivery is faster than GRU messages. To deliver all messages within an Altix UV host via shared memory, set MPI_SHARED_NEIGHBORHOOD to "host". See the MPI(1) man page for more details.

Other ccNUMA Performance Issues

MPI application processes normally perform best if their local memory is allocated on the socket assigned to execute them. This cannot happen if memory on that socket is exhausted by the application or by other system consumption, for example, file buffer cache. Use the nodeinfo(1) command to view memory consumption on the nodes assigned to your job and use bcfree(1) to clear out excessive file buffer cache. PBS Professional batch scheduler installations can be configured to issue bcfree commands in the job prologue. For more information, see the PBS Professional documentation and the bcfree(1) man page.

For detailed information on MPI application tuning, see the Message Passing Toolkit (MPT) User's Guide.