The Intrinsics API

The benefit of coding with MMX™ technology intrinsics and the Streaming SIMD Extensions and Streaming SIMD Extensions 2 intrinsics is that you can use the syntax of C function calls and C variables instead of hardware registers. This frees you from managing registers and programming assembly. Further, the compiler optimizes the instruction scheduling so that your executable runs faster. For each computational and data manipulation instruction in the new instruction set, there is a corresponding C intrinsic that implements it directly. The intrinsics allow you to specify the underlying implementation (instruction selection) of an algorithm yet leave instruction scheduling and register allocation to the compiler.

MMX™ Technology Intrinsics

The MMX technology intrinsics are based on a new __m64 data type to represent the specific contents of an MMX technology register. You can specify values in bytes, short integers, 32-bit values, or a 64-bit object. The __m64 data type, however, is not a basic ANSI C data type, and therefore you must observe the following usage restrictions:

Use __m64 data only on the left-hand side of an assignment, as a return value, or as a parameter. You cannot use it with other arithmetic expressions ("+", ">>", and so on).
You can use __m64 objects in aggregates, such as unions (for example, to access the byte elements) and structures; the address of an __m64 object may be taken.
Use __m64 data only with the MMX technology intrinsics described in this guide and the Intel(R) C++ Compiler User's Guide With Support for the Streaming SIMD Extensions 2 (Order Number 718195-2001).
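
The restrictions above can be sketched as follows. This is a minimal illustration, assuming a compiler that provides <mmintrin.h> and an x86 target with MMX technology enabled. The union pack64 and the function add_four_shorts are hypothetical names introduced for this example only.

```c
#include <mmintrin.h>  /* MMX technology intrinsics and the __m64 type */

/* Hypothetical union: gives element access to the four 16-bit lanes. */
typedef union {
    __m64 v;
    short s[4];
} pack64;

/* Adds four pairs of 16-bit integers with one packed add. */
void add_four_shorts(const short a[4], const short b[4], short out[4])
{
    pack64 x, y, r;
    int i;

    for (i = 0; i < 4; i++) {
        x.s[i] = a[i];
        y.s[i] = b[i];
    }

    /* __m64 values appear only as intrinsic arguments and as the target
       of an assignment; an expression such as x.v + y.v is what the
       restrictions above rule out. */
    r.v = _mm_add_pi16(x.v, y.v);

    for (i = 0; i < 4; i++)
        out[i] = r.s[i];

    _mm_empty();  /* clear the MMX state before any x87 floating-point code */
}
```

Calling add_four_shorts with {1, 2, 3, 4} and {10, 20, 30, 40} fills out with {11, 22, 33, 44}. The union is used only to move data in and out; the arithmetic itself goes through the intrinsic.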

Streaming SIMD Extensions and Streaming SIMD Extensions 2 Intrinsics

The Streaming SIMD Extensions and Streaming SIMD Extensions 2 intrinsics all make use of the xmm registers of the Pentium(R) III and Pentium 4 processors. There are three data types supported by these intrinsics: __m128, __m128d, and __m128i.

The __m128 data type represents the contents of a Streaming SIMD Extensions register used by the Streaming SIMD Extensions intrinsics. It holds either four packed single-precision floating-point values or one scalar single-precision value.
The __m128d data type holds two 64-bit floating point (double-precision) values.
The __m128i data type can hold sixteen 8-bit, eight 16-bit, four 32-bit, or two 64-bit integer values.
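
As a rough illustration of the two Streaming SIMD Extensions 2 types, the sketch below adds two double-precision pairs through __m128d and four 32-bit integers through __m128i. It assumes <emmintrin.h> is available; the function names add_doubles and add_int32 are hypothetical. Unaligned load and store intrinsics are used so the callers' arrays need no special alignment.

```c
#include <emmintrin.h>  /* Streaming SIMD Extensions 2 intrinsics */

/* __m128d: two packed double-precision values. */
void add_doubles(const double a[2], const double b[2], double out[2])
{
    __m128d r = _mm_add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b));
    _mm_storeu_pd(out, r);
}

/* __m128i: here treated as four packed 32-bit integers. */
void add_int32(const int a[4], const int b[4], int out[4])
{
    __m128i x = _mm_loadu_si128((const __m128i *)a);
    __m128i y = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(x, y));
}
```

The same __m128i object could equally be passed to the 8-bit, 16-bit, or 64-bit forms of the integer intrinsics; the element width is chosen by the intrinsic, not by the type.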

The compiler aligns __m128, __m128d, and __m128i local and global data to 16-byte boundaries. To align integer, float, or double arrays, you can use the alignment declspec as described in the Intel(R) C++ Compiler User's Guide With Support for the Streaming SIMD Extensions 2 (Order Number 718195-2001).
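
A minimal sketch of aligning a float array so that it can be used with the aligned load and store intrinsics. The macro ALIGN16, the arrays samples and scaled, and the functions is_aligned16 and scale_samples are hypothetical names for this example. The macro assumes that compilers defining _MSC_VER accept __declspec(align(16)) and that other compilers accept the GCC-style aligned attribute; check your compiler's documentation for the exact spelling.

```c
#include <stdint.h>
#include <xmmintrin.h>  /* Streaming SIMD Extensions intrinsics */

/* Hypothetical helper macro: 16-byte alignment in either dialect. */
#if defined(_MSC_VER)
#define ALIGN16 __declspec(align(16))
#else
#define ALIGN16 __attribute__((aligned(16)))
#endif

ALIGN16 static float samples[8] = {1.0f, 2.0f, 3.0f, 4.0f,
                                   5.0f, 6.0f, 7.0f, 8.0f};
ALIGN16 static float scaled[8];

/* Returns 1 if p lies on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Multiplies every sample by factor, four floats at a time. */
const float *scale_samples(float factor)
{
    __m128 f = _mm_set1_ps(factor);
    int i;

    for (i = 0; i < 8; i += 4) {
        /* Aligned load and store: both arrays are 16-byte aligned and
           each step of four floats advances by exactly 16 bytes. */
        __m128 v = _mm_load_ps(&samples[i]);
        _mm_store_ps(&scaled[i], _mm_mul_ps(v, f));
    }
    return scaled;
}
```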

The __m128, __m128d, and __m128i data types are not basic ANSI C data types, and therefore the following restrictions apply to their usage:

Use __m128, __m128d, and __m128i only on the left-hand side of an assignment, as a return value, or as a parameter. Do not use them in other arithmetic expressions such as "+" and ">>".
Do not initialize __m128, __m128d, and __m128i with literals; there is no way to express 128-bit constants.
You can use __m128, __m128d, and __m128i objects in aggregates, such as unions (for example, to access the float elements) and structures. The address of these objects may be taken.
Use __m128, __m128d, and __m128i data only with the intrinsics described in this user's guide.
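
The following sketch shows one way to work within these restrictions: the value is built with a set intrinsic rather than a literal, and a union provides access to the individual float elements. The union vec4 and the functions make_ramp and horizontal_sum are hypothetical names used only here.

```c
#include <xmmintrin.h>  /* Streaming SIMD Extensions intrinsics */

/* Hypothetical union: exposes the four float elements of an __m128. */
typedef union {
    __m128 v;
    float f[4];
} vec4;

/* Builds a 128-bit value with an intrinsic instead of a literal. */
__m128 make_ramp(void)
{
    return _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
}

/* Sums the four elements by reading them through the union. */
float horizontal_sum(__m128 x)
{
    vec4 u;
    u.v = x;  /* __m128 used as a parameter and as an assignment target */
    return u.f[0] + u.f[1] + u.f[2] + u.f[3];
}
```

With these definitions, horizontal_sum(make_ramp()) evaluates to 10.0f.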

The compiler aligns __m128, __m128d, and __m128i local data to 16-byte boundaries on the stack. Global __m128 data is also aligned on 16-byte boundaries. (To align float arrays, you can use the alignment declspec described in the following section.) Because the new instruction set treats the SIMD floating-point registers in the same way whether you are using packed or scalar data, there is no __m32 data type to represent scalar data as you might expect. For scalar operations, you should use the __m128 objects and the "scalar" forms of the intrinsics; the compiler and the processor implement these operations with 32-bit memory references.
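
A short sketch of the scalar forms, assuming <xmmintrin.h>. The function names add_scalar and add_low_only are hypothetical. The first function performs a single-float addition entirely with __m128 objects; the second shows that a scalar intrinsic operates on the lowest element only and passes the upper three elements of its first operand through unchanged.

```c
#include <xmmintrin.h>  /* Streaming SIMD Extensions intrinsics */

/* Adds two floats using the scalar ("ss") forms. */
float add_scalar(float a, float b)
{
    __m128 x = _mm_set_ss(a);  /* a in the lowest element, upper three zero */
    __m128 y = _mm_set_ss(b);
    __m128 r = _mm_add_ss(x, y);
    float out;

    _mm_store_ss(&out, r);     /* stores only the lowest element */
    return out;
}

/* Applies a scalar add to two full vectors and stores all four elements. */
void add_low_only(const float a[4], const float b[4], float out[4])
{
    __m128 r = _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b));
    _mm_storeu_ps(out, r);     /* out[0] = a[0] + b[0]; out[1..3] = a[1..3] */
}
```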

The suffixes ps and ss are used to denote "packed single" and "scalar single" precision operations. The packed floats are represented in right-to-left order, with the lowest word (right-most) being used for scalar operations: [z, y, x, w]. To explain how memory storage reflects this, consider the following example.

The operation

float a[4] = { 1.0, 2.0, 3.0, 4.0 };
__m128 t = _mm_load_ps(a);

produces the same result as:

__m128 t = _mm_set_ps(4.0, 3.0, 2.0, 1.0);

In other words,

t = [ 4.0, 3.0, 2.0, 1.0 ]

where the "scalar" element is 1.0.
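
The equivalence described above can be checked with a small sketch. The function names load_matches_set and scalar_element are hypothetical. The unaligned load _mm_loadu_ps is used in place of _mm_load_ps because a plain local array is not guaranteed to be 16-byte aligned; the element order is the same for both loads.

```c
#include <xmmintrin.h>  /* Streaming SIMD Extensions intrinsics */

/* Returns 1 if loading {1, 2, 3, 4} from memory gives the same register
   contents as _mm_set_ps(4, 3, 2, 1), and 0 otherwise. */
int load_matches_set(void)
{
    float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    __m128 t = _mm_loadu_ps(a);
    __m128 u = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);

    /* _mm_cmpeq_ps sets an all-ones mask in each equal element;
       _mm_movemask_ps gathers one bit per element, so 0xF means
       that all four elements match. */
    return _mm_movemask_ps(_mm_cmpeq_ps(t, u)) == 0xF;
}

/* Returns the lowest ("scalar") element of the same value. */
float scalar_element(void)
{
    return _mm_cvtss_f32(_mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f));
}
```

Here load_matches_set() returns 1 and scalar_element() returns 1.0f: the first float in memory is the lowest element, which is the last argument of _mm_set_ps.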

Some intrinsics are "composites" because they require more than one instruction to implement them. You should be familiar with the hardware features provided by the Streaming SIMD Extensions, Streaming SIMD Extensions 2, and MMX technology when writing programs with the intrinsics.

Keep the following four important issues in mind:

Certain intrinsics, such as _mm_loadr_ps and _mm_cmpgt_ss, are not directly supported by the instruction set. While these intrinsics are convenient programming aids, be mindful of their implementation cost.
Data loaded or stored as __m128 objects must generally be 16-byte-aligned.
Some intrinsics require that their arguments be immediates, that is, constant integers (literals), due to the nature of the instruction.
The result of arithmetic operations acting on two NaN (Not a Number) arguments is undefined. Therefore, floating-point operations using NaN arguments may not match the expected behavior of the corresponding assembly instructions.
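
To make the cost of a composite intrinsic and the immediate-argument requirement concrete, the sketch below obtains a reversed load in two ways: once with _mm_loadr_ps, and once with an aligned load followed by a shuffle whose selector is a compile-time constant. The function name reversed_two_ways is hypothetical, and the two-step version is only one plausible expansion, not necessarily what a given compiler emits.

```c
#include <xmmintrin.h>  /* Streaming SIMD Extensions intrinsics */

/* aligned_src must point to four floats on a 16-byte boundary.
   Both outputs receive the four floats in reverse order. */
void reversed_two_ways(const float *aligned_src,
                       float out_loadr[4], float out_shuffle[4])
{
    /* Composite intrinsic: convenient, but more than one instruction. */
    __m128 r1 = _mm_loadr_ps(aligned_src);

    /* Equivalent two-step form: aligned load, then shuffle. The third
       argument of _mm_shuffle_ps must be an immediate; _MM_SHUFFLE
       builds that constant at compile time. */
    __m128 t  = _mm_load_ps(aligned_src);
    __m128 r2 = _mm_shuffle_ps(t, t, _MM_SHUFFLE(0, 1, 2, 3));

    /* Unaligned stores, so the output arrays need no special alignment. */
    _mm_storeu_ps(out_loadr, r1);
    _mm_storeu_ps(out_shuffle, r2);
}
```

Passing a run-time variable as the shuffle selector is rejected by the compiler, which is the practical meaning of the immediate requirement.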

For a more detailed description of each intrinsic and additional information related to its usage, refer to the Intel C++ Compiler User's Guide With Support for the Streaming SIMD Extensions 2 (Order Number 718195-2001).

For details, see Volume 2A and Volume 2B of the Intel(R) 64 and IA-32 Architectures Software Developer's Manual. For the latest updates on the instruction set information, go to the web site.