IA-32 Intel® Architecture
Optimization Reference
Manual
Order Number: 248966-013US
April 2006

Content Summary

Page 1 - Optimization Reference

IA-32 Intel® Architecture Optimization Reference Manual, Order Number: 248966-013US, April 2006

Page 2

x  Hardware Prefetch ... 6-19  Example of Effective ...

Page 3 - Contents

2-28  In this example, a loop that executes 100 times assigns x to every even-numbered element and y to every odd-numbered element ...

Page 4

2-29  Memory Accesses. This section discusses guidelines for optimizing code and data memory accesses. The most important ...

Page 5

2-30  Assembly/Compiler Coding Rule 16. (H impact, H generality) Align data on natural operand size address boundaries ...

Page 6

2-31  Alignment of code is less of an issue for the Pentium 4 processor. Alignment of branch targets to maximize ...

Page 7

2-32  Store Forwarding. The processor's memory system only sends stores to memory (including cache) after store retirement ...

Page 8

2-33  If a variable is known not to change between when it is stored and when it is used again, the register that was stored ...

Page 9

2-34  The size and alignment restrictions for store forwarding are illustrated in Figure 2-2. Coding rules to help ...

Page 10

2-35  A load that forwards from a store must wait for the store's data to be written to the store buffer before proceeding ...

Page 11

2-36  Example 2-14 illustrates a stalled store-forwarding situation that may appear in compiler-generated code. So...

Page 12

2-37  When moving data that is smaller than 64 bits between memory locations, 64-bit or 128-bit SIMD register moves are ...

Page 13

xi  Key Practices of System Bus Optimization ... 7-17  Key Practices of Memory Optimization ...

Page 14 - Appendix D: Stack Alignment

2-38  Store-forwarding Restriction on Data Availability. The value to be stored must be available before the load operation ...

Page 15 - Examples

2-39  An example of a loop-carried dependence chain is shown in Example 2-17. Data Layout Optimizations. User/Source Coding ...

Page 16

2-40  Cache line size for Pentium 4 and Pentium M processors can impact streaming applications (for example, multi...

Page 17

2-41  However, if the access pattern of the array exhibits locality, such as if the array index is being swept through, ...

Page 18

2-42  ... non-sequential manner, the automatic hardware prefetcher cannot prefetch the data. The prefetcher can recognize ...

Page 19

2-43  If for some reason it is not possible to align the stack for 64 bits, the routine should access the parameter and ...

Page 20

2-44  Capacity Limits in Set-Associative Caches. Capacity limits may occur if the number of outstanding memory references ...

Page 21

2-45  Aliasing Cases in the Pentium® 4 and Intel® Xeon® Processors. Aliasing conditions that are specific to the Pentium ...

Page 22

2-46  Aliasing Cases in the Pentium M Processor. Pentium M, Intel Core Solo and Intel Core Duo processors have the following ...

Page 23 - Introduction

2-47  Mixing Code and Data. The Pentium 4 processor's aggressive prefetching and pre-decoding of instructions has two related ...

Page 24 - About This Manual

xii  Sign Extension to Full 64-Bits ... 8-3  Alternate Coding Rules ...

Page 25

2-48  ... and cross-modifying code (when more than one processor in a multi-processor system are writing to a code page ...

Page 26

2-49  ... write misses; only four write-combining buffers are guaranteed to be available for simultaneous use. Write combining ...

Página 26

General Optimization Guidelines 22-49write misses; only four write-combining buffers are guaranteed to be available for simultaneous use. Write combin

Página 27 - Related Documentation

IA-32 Intel® Architecture Optimization2-50be no RFO since the line is not cached, and there is no such delay. For details on write-combining, see the

Página 28 - Notational Conventions

General Optimization Guidelines 22-51Locality enhancement to the last level cache can be accomplished with sequencing the data access pattern to take

Página 29 - Processor Family Overview

IA-32 Intel® Architecture Optimization2-52Minimizing Bus LatencyThe system bus on Intel Xeon and Pentium 4 processors provides up to6.4 GB/sec bandwid

Página 30 - SIMD Technology

General Optimization Guidelines 22-53User/Source Coding Rule 8. (H impact, H generality) To achieve effective amortization of bus latency, software

Página 31 - OP OP OP OP

IA-32 Intel® Architecture Optimization2-54Example 2-21 Non-temporal Stores and 64-byte Bus Write TransactionsExample 2-22 Non-temporal Stores and Part

Página 32 - • inherently parallel

General Optimization Guidelines 22-55PrefetchingThe Pentium 4 processor has three prefetching mechanisms: • hardware instruction prefetcher• software

Página 33 - Summary of SIMD Technologies

IA-32 Intel® Architecture Optimization2-56access patterns to suit the hardware prefetcher is highly recommended, and should be a higher-priority consi

Página 34 - Streaming SIMD Extensions 3

General Optimization Guidelines 22-57• new cache line flush instruction• new memory fencing instructionsFor a detailed description of using cacheabili

Página 35

xiiiTime-based Sampling... A-9Event-based Sampling...

Página 36 - Microarchitecture

IA-32 Intel® Architecture Optimization2-58Guidelines for Optimizing Floating-point CodeUser/Source Coding Rule 10. (M impact, M generality) Enable the

Página 37

General Optimization Guidelines 22-59to early out). However, be careful of introducing more than a total of two values for the floating point control

Página 38 - /HVVIUHTXHQWO\XVHGSDWKV

IA-32 Intel® Architecture Optimization2-60desired numeric precision, the size of the look-up tableland taking advantage of the parallelism of the Stre

Página 39 - The Front End

General Optimization Guidelines 22-61executing SSE/SSE2/SSE3 instructions and when speed is more important than complying to IEEE standard. The follow

Página 40 - Retirement

IA-32 Intel® Architecture Optimization2-62Underflow exceptions and denormalized source operands are usually treated according to the IEEE 754 specific

Página 41 - Front End Pipeline Detail

General Optimization Guidelines 22-63FPU control word (FCW), such as when performing conversions to integers. On Pentium M, Intel Core Solo and Intel

Página 42 - Execution Trace Cache

IA-32 Intel® Architecture Optimization2-64Assembly/Compiler Coding Rule 31. (H impact, M generality) Minimize changes to bits 8-12 of the floating poi

Página 43 - Branch Prediction

General Optimization Guidelines 22-65If there is more than one change to rounding, precision and infinity bits and the rounding mode is not important

Página 44 - Execution Core Detail

IA-32 Intel® Architecture Optimization2-66Example 2-23 Algorithm to Avoid Changing the Rounding Mode_fto132proclea ecx,[esp-8]sub esp,16 ; allocate f

Page 45

2-67  Assembly/Compiler Coding Rule 32. (H impact, L generality) Minimize the number of changes to the rounding mode. Do ...

Page 46

xiv  Using Performance Metrics with Hyper-Threading Technology ... B-50  Using Performance Events of Intel Core Solo ...

Page 47

2-68  Assembly/Compiler Coding Rule 33. (H impact, L generality) Minimize the number of changes to the precision mode ...

Page 48

2-69  This in turn allows instructions to be reordered to make instructions available to be executed in parallel. Out-of-order ...

Page 49 - Data Prefetch

2-70  • Scalar floating-point registers may be accessed directly, avoiding fxch and top-of-stack restrictions. On ...

Page 50

2-71  Recommendation: Use the compiler switch to generate SSE2 scalar floating-point code over x87 code. When working with ...

Page 51

2-72  Floating-Point Stalls. Floating-point instructions have a latency of at least two cycles. But, because of the ...

Page 52 - • buffering of writes

2-73  Note that transcendental functions are supported only in x87 floating point, not in Streaming SIMD Extensions or ...

Page 53

2-74  Complex Instructions. Assembly/Compiler Coding Rule 40. (ML impact, M generality) Avoid using complex instructions ...

Page 54 - Pentium

2-75  Use of the inc and dec Instructions. The inc and dec instructions modify only a subset of the bits in the flag register ...

Page 55 - • instruction cache

2-76  ... CMPXCHG8B, various rotate instructions, STC, and STD. An example of assembly with a partial flag register stall ...

Page 56

2-77  ... (model 9) does incur a penalty. This is because every operation on a partial register updates the whole register.

Page 57 - Data Prefetching

xv  Examples. Example 2-1 Assembly Code with an Unpredictable Branch ... 2-17  Example 2-2 Code Optimization to Eliminate Branches ...

Page 58 - Out-of-Order Core

2-78  Table 2-3 illustrates using movzx to avoid a partial register stall when packing three byte values into a register ...

Page 59 - Core™ Duo Processors

2-79  ... less delay than the partial register update problem mentioned above, but the performance gain may vary. If the ad...

Page 60 - • Micro-op fusion

2-80  Prefixes and Instruction Decoding. An IA-32 instruction can be up to 15 bytes in length. Prefixes can change the ...

Page 61

2-81  • Processing an instruction with the 0x66 prefix that (i) has a modr/m byte in its encoding and (ii) the opcode byte ...

Page 62

2-82  String move/store instructions have multiple data granularities. For efficient data movement, larger data granularities ...

Page 63

2-83  • Cache eviction: If the amount of data to be processed by a memory routine approaches half the size of the last level ...

Page 64 - • operational fairness

2-84  ... improve address alignment, a small piece of prolog code using movsb/stosb with count less than 4 can be used ...

Page 65 - Shared Resources

2-85  Memory routines in the runtime library generated by Intel Compilers are optimized across a wide range of address alignments ...

Page 66 - Front End Pipeline

2-86  In some situations, the byte count of the data to operate on is known by the context (versus from a parameter passed ...

Page 67 - Multi-Core Processors

2-87  Clearing Registers. The Pentium 4 processor provides special support to xor, sub, or pxor operations when executed with ...

Page 68

xvi  Example 3-4 Identification of SSE2 with cpuid ... 3-5  Example 3-5 Identification of SSE2 by the OS ...

Page 69

2-88  Using a test instruction between the instruction that may modify part of the flag register and the instruction ...

Page 70 - Load and Store Operations

2-89  Use movapd as an alternative; it writes all 128 bits. Even though this instruction has a longer latency, the μops ...

Page 71

2-90  Prolog Sequences. Assembly/Compiler Coding Rule 57. (M impact, MH generality) In routines that do not need a frame ...

Page 72

2-91  Using memory as a destination operand may further reduce register pressure at the slight risk of making trace cache ...

Page 73 - General Optimization

2-92  Spill Scheduling. The spill scheduling algorithm used by a code generator will be impacted by the Pentium 4 processor ...

Page 74

2-93  Because micro-ops are delivered from the trace cache in the common cases, decoding rules are not required. Scheduling ...

Page 75

2-94  ... data elements in parallel. The number of elements which can be operated on in parallel ranges from four single ...

Page 76

2-95  User/Source Coding Rule 19. (M impact, ML generality) Avoid the use of conditional branches inside loops and consider ...

Page 77 - Optimize Memory Access

2-96  The other NOPs have no special hardware support. Their input and output registers are interpreted by the hardware ...

Page 78

2-97  User/Source Coding Rules. User/Source Coding Rule 1. (M impact, L generality) If an indirect branch has two or more ...

Page 79 - Enable Vectorization

xvii  Example 4-20 Clipping to an Arbitrary Signed Range [high, low] ... 4-27  Example 4-21 Simplified Clipping to an Arbitrary Signed ...

Page 80

2-98  User/Source Coding Rule 8. (H impact, H generality) To achieve effective amortization of bus latency, software ...

Page 81 - Performance Tools

2-99  ... look-up-table-based algorithm using interpolation techniques. It is possible to improve transcendental performance ...

Page 82 - VTune™ Performance Analyzer

2-100  ... order engine. When tuning, note that all IA-32 based processors have very high branch prediction rates. Con...

Page 83 - Processor Perspectives

2-101  Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put more than four branches in 16-byte chunks.

Page 84

2-102  Assembly/Compiler Coding Rule 18. (H impact, M generality) A load that forwards from a store must have the ...

Page 85

2-103  ... first-level cache working set. Avoid having more than 8 cache lines that are some multiple of 64 KB apart in the ...

Page 86

2-104  Assembly/Compiler Coding Rule 32. (H impact, L generality) Minimize the number of changes to the rounding ...

Page 87

2-105  Assembly/Compiler Coding Rule 42. (M impact, H generality) inc and dec instructions should be replaced with an add ...

Page 88 - A and B. If the condition is

2-106  ... instead of a cmp of the register to zero; this saves the need to encode the zero and saves encoding space.

Page 89

2-107  Assembly/Compiler Coding Rule 56. (M impact, ML generality) For arithmetic or logical operations that have their ...

Page 90 - Spin-Wait and Idle Loops

xviii  Example 6-12 Memory Copy Using Hardware Prefetch and Bus Segmentation ... 6-50  Example 7-1 Serial Execution of Producer and Consumer Work Items ...

Page 91 - Static Prediction

2-108  Tuning Suggestions. Tuning Suggestion 1. Rarely, a performance problem may be noted due to executing data on ...

Page 92

3-1  Coding for SIMD Architectures. Intel Pentium 4, Intel Xeon and Pentium M processors include support for Streaming SIMD Extensions 2 (SSE2), Streaming ...

Page 93

3-2  Checking for Processor Support of SIMD Technologies. This section shows how to check whether a processor supports ...

Page 94 - Inlining, Calls and Returns

3-3  For more information on cpuid, see Intel® Processor Identification with CPUID Instruction, order number 241618. Checking ...

Page 95 - Branch Type Selection

3-4  To find out whether the operating system supports SSE, execute an SSE instruction and trap for an exception if ...

Page 96

3-5  Checking for Streaming SIMD Extensions 2 Support. Checking for support of SSE2 is like checking for SSE support. You must ...
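The manual's Examples 3-4/3-5 perform this check with the cpuid instruction in assembly. As a compiler-level stand-in (our substitution, not the manual's code), GCC and Clang expose the same CPUID feature bits through `__builtin_cpu_supports`:

```c
/* Sketch of an SSE2 feature check. The manual reads the CPUID feature
 * flags directly; __builtin_cpu_supports (GCC/Clang) queries the same
 * bits without inline assembly. Returns 1 if SSE2 is available. */
static int sse2_supported(void)
{
#if defined(__GNUC__)
    return __builtin_cpu_supports("sse2") ? 1 : 0;
#else
    return 0;  /* unknown compiler: conservatively report unsupported */
#endif
}
```

Note this covers only the processor check; the OS-support check (whether XMM state is saved on context switch) is a separate step, as the listing above says.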

Page 97

3-6  Checking for Streaming SIMD Extensions 3 Support. SSE3 includes 13 instructions, 11 of which are suited for SIMD ...

Page 98 - Loop Unrolling

3-7  Example 3-6 Identification of SSE3 with cpuid. SSE3 requires the same support from the operating system as SSE. To find ...

Page 99

3-8  Example 3-7 Identification of SSE3 by the OS. Considerations for Code Conversion to SIMD Programming. The VTune P...

Page 100 - • inlining where appropriate

3-9  Figure 3-1 Converting to Streaming SIMD Extensions Chart (flowchart residue: "Identify Hot Spots in Code", "Code benefits from SIMD", "STOP") ...

Page 101 - Memory Accesses

xix  Figures. Figure 1-1 Typical SIMD Operations ... 1-3  Figure 1-2 SIMD Instruction Register ...

Page 102

3-10  To use any of the SIMD technologies optimally, you must evaluate the following situations in your code: • fra...

Page 103 - Line 029e7140h

3-11  ... specific optimizations. Where appropriate, the coach displays pseudo-code to suggest the use of highly optimized in...

Page 104 - Store Forwarding

3-12  ... costly application processing time. However, these routines have potential for increased performance when you ...

Page 105 - Alignment

3-13  Coding Methodologies. Software developers need to compare the performance improvement that can be obtained from assembly ...

Page 106 - Figure 2-2

3-14  The examples that follow illustrate the use of coding adjustments to enable the algorithm to benefit from the ...

Page 107

3-15  Assembly. Key loops can be coded directly in assembly language using an assembler or by using inlined assembly (C-asm) ...

Page 108 - Example 2-14

3-16  ... SIMD Extensions 2 integer SIMD, and __m128d is used for double precision floating-point SIMD. These types enable ...

Page 109

3-17  The intrinsic data types, however, are not basic ANSI C data types, and therefore you must observe the following usage ...

Page 110 - • parameter passing

3-18  Here, fvec.h is the class definition file and F32vec4 is the class representing an array of four floats. The ...

Page 111 - Data Layout Optimizations

3-19  The caveat to this is that only certain types of loops can be automatically vectorized, and in most cases user inte...

Page 112

ii  INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS ...

Page 113

xx  Figure 6-2 Memory Access Latency and Execution Without Prefetch ... 6-23  Figure 6-3 Memory Access Latency and Execution With Prefetch ...

Page 114 - Stack Alignment

3-20  Stack and Data Alignment. To get the most performance out of code written for SIMD technologies, data should be ...

Page 115

3-21  By adding the padding variable pad, the structure is now 8 bytes, and if the first element is aligned to 8 bytes (6...

Page 116

3-22  Assuming you have a 64-bit aligned data vector and a 64-bit aligned coefficients vector, the filter operation ...

Page 117 - Processors

3-23  • Functions that use Streaming SIMD Extensions or Streaming SIMD Extensions 2 data need to provide a 16-byte aligned ...

Page 118

3-24  Another way to improve data alignment is to copy the data into locations that are aligned on 64-bit boundaries ...

Page 119 - Mixing Code and Data

3-25  The __declspec(align(16)) specification can be placed before data declarations to force 16-byte alignment. This is ...
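The `__declspec(align(16))` spelling above is Microsoft-specific. A C11 equivalent (our substitution, not from the manual) uses `_Alignas`; a small helper makes the property checkable:

```c
#include <stdalign.h>
#include <stdint.h>

/* 16-byte alignment sketch: C11 _Alignas as a portable stand-in for
 * the MSVC-style __declspec(align(16)) mentioned in the text. A
 * 16-byte-aligned array is suitable for movaps/movdqa-style loads. */
static _Alignas(16) float samples[64];

/* Returns nonzero if p sits on a 16-byte boundary. */
static int is_16byte_aligned(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

GCC/Clang also accept `__attribute__((aligned(16)))` for the same purpose in C89/C99 code.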

Page 120 - Write Combining

3-26  In C++ (but not in C) it is also possible to force the alignment of a class/struct/union type, as in the code ...

Page 121

3-27  Improving Memory Utilization. Memory performance can be improved by rearranging data and algorithms for SSE2, SSE, and ...

Page 122 - Locality Enhancement

3-28  There are two options for computing data in AoS format: perform the operation on the data as it stands in AoS format ...

Page 123

3-29  Performing SIMD operations on the original AoS format can require more calculations, and some of the operations do not ...

Page 124 - Minimizing Bus Latency

xxi  Tables. Table 1-1 Pentium 4 and Intel Xeon Processor Cache Parameters ... 1-20  Table 1-3 Cache Parameters of Pentium M, Intel® Core™ Solo ...

Page 125

3-30  ... but is somewhat inefficient as there is the overhead of extra instructions during computation. Performing the ...

Page 126

3-31  Note that SoA can have the disadvantage of requiring more independent memory stream references. A computation that ...
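To make the AoS/SoA trade-off above concrete, here is a minimal C sketch (the type names Vertex/VertexSoA and the 4-element batch are illustrative, not the manual's): the SoA layout puts four x components in one contiguous run, ready for a single 128-bit load, at the cost of three independent memory streams.

```c
/* Array-of-structures: one xyz vertex per element. */
typedef struct { float x, y, z; } Vertex;

/* Structure-of-arrays: all x components contiguous, then y, then z. */
typedef struct { float x[4], y[4], z[4]; } VertexSoA;

/* Repack 4 vertices from AoS to SoA so each component run can be
 * consumed by one SIMD load instead of a gather. */
static void aos_to_soa(const Vertex v[4], VertexSoA *out)
{
    for (int i = 0; i < 4; i++) {
        out->x[i] = v[i].x;
        out->y[i] = v[i].y;
        out->z[i] = v[i].z;
    }
}
```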

Page 127 - • software prefetch for data

3-32  Strip Mining. Strip mining, also known as loop sectioning, is a loop transformation technique for enabling SIMD ...

Page 128 - Cacheability Instructions

3-33  The main loop consists of two functions: transformation and lighting. For each object, the main loop calls a transformation ...

Page 129 - Applications

3-34  In Example 3-19, the computation has been strip-mined to a size strip_size. The value strip_size is chosen s...
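A minimal C sketch of the strip mining described above (STRIP_SIZE and the two passes are illustrative choices, not the manual's Example 3-19): the outer loop walks the array in fixed-size strips, so the second pass over each strip touches data the first pass just brought into cache.

```c
#define STRIP_SIZE 64  /* chosen so a strip fits in cache (illustrative) */

/* Scale an array by 2 and sum it, strip by strip, so each strip's
 * working set stays cache resident between the two passes. */
static float strip_mined_sum(float *a, int n)
{
    float total = 0.0f;
    for (int i = 0; i < n; i += STRIP_SIZE) {
        int end = (i + STRIP_SIZE < n) ? i + STRIP_SIZE : n;
        for (int j = i; j < end; j++)   /* pass 1: transform the strip */
            a[j] *= 2.0f;
        for (int j = i; j < end; j++)   /* pass 2: reuse it while cached */
            total += a[j];
    }
    return total;
}
```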

Page 130

3-35  For the first iteration of the inner loop, each access to array B will generate a cache miss. If the size of one row ...

Page 131

3-36  This situation can be avoided if the loop is blocked with respect to the cache size. In Figure 3-3, a block_...

Page 132 - • denormalized operand

3-37  As one can see, all the redundant cache misses can be eliminated by applying this loop blocking technique. If MAX is ...
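The loop blocking technique above can be sketched in C on a matrix transpose (the sizes N and BLOCK are illustrative, not from Figure 3-3): the two outer loops pick a tile, and the inner loops stay within it so both the source and destination tiles remain cache resident.

```c
#define N 8      /* matrix dimension (illustrative) */
#define BLOCK 4  /* tile edge, sized to fit two tiles in cache */

/* Transpose an N x N matrix in BLOCK x BLOCK tiles. */
static void blocked_transpose(const int a[N][N], int b[N][N])
{
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int i = ii; i < ii + BLOCK; i++)
                for (int j = jj; j < jj + BLOCK; j++)
                    b[j][i] = a[i][j];
}
```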

Page 133

3-38  Note that this can be applied to both SIMD integer and SIMD floating-point code. If there are multiple consumers ...

Page 134 - Floating-point Modes

3-39  Recommendation: When targeting code generation for Intel Core Solo and Intel Core Duo processors, favor instructions ...

Page 135

xxii  Table C-5 Streaming SIMD Extension 64-bit Integer Instructions ... C-14  Table C-7 IA-32 x87 Floating-point Instructions ...

Page 136

3-40

Page 137

4-1  Optimizing for SIMD Integer Applications. The SIMD integer instructions provide performance improvements in applications that are integer-intensive ...

Page 138

4-2  For planning considerations of using the new SIMD integer instructions, refer to "Checking for Streaming SIMD ...

Page 139

4-3  Using SIMD Integer with x87 Floating-point. All 64-bit SIMD integer instructions use the MMX registers, which ...

Page 140

4-4  Using emms clears all of the valid bits, effectively emptying the x87 floating-point stack and making it ready ...

Page 141

4-5  • Don't empty when already empty: If the next instruction uses an MMX register, _mm_empty() incurs a cost ...

Page 142 - Core Duo Processors

4-6  Data Alignment. Make sure that 64-bit SIMD integer data is 8-byte aligned and that 128-bit SIMD integer data is ...

Page 143 - Memory Operands

4-7  Signed Unpack. Signed numbers should be sign-extended when unpacking the values. This is similar to the zero ...

Page 144 - Floating-Point Stalls

4-8  Interleaved Pack with Saturation. The pack instructions pack two values into the destination register in a predetermined ...

Page 145 - Instruction Selection

4-9  Figure 4-2 illustrates two values interleaved in the destination register, and Example 4-4 shows code that ...

Page 146 - Use of the lea Instruction

xxiii  Introduction. The IA-32 Intel® Architecture Optimization Reference Manual describes how to optimize software to take advantage of the performance c...

Page 147 - Flag Register Accesses

4-10  The pack instructions always assume that the source operands are signed numbers. The result in the destination ...

Page 148 - Integer Divide

4-11  Non-Interleaved Unpack. The unpack instructions perform an interleave merge of the data elements of the destination ...

Page 149

4-12  The other destination register will contain the opposite combination illustrated in Figure 4-4. Code in the ...

Page 150 - Partial Register Stall

4-13  Extract Word. The pextrw instruction takes the word in the designated MMX register selected by the two least ...

Page 151

4-14  Insert Word. The pinsrw instruction loads a word from the lower half of a 32-bit integer register or from memory ...

Page 152 - • Address size prefix (0x67)

4-15  If all of the operands in a register are being replaced by a series of pinsrw instructions, it can be ...

Page 153 - REP Prefix and Data Movement

4-16  Move Byte Mask to Integer. The pmovmskb instruction returns a bit mask formed from the most significant bits of ...

Page 154 - • Address alignment:

4-17  Figure 4-7 pmovmskb Instruction Example. Example 4-10 pmovmskb Instruction Code; Input: source value; Out...
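The pmovmskb behavior described above is directly available as an intrinsic; this sketch uses the SSE2 128-bit form `_mm_movemask_epi8` (the manual's Example 4-10 shows the MMX/assembly version):

```c
#include <emmintrin.h>

/* pmovmskb sketch: collect the most significant bit of each of the
 * 16 bytes of v into bits 0..15 of the returned integer. */
static int byte_msb_mask(__m128i v)
{
    return _mm_movemask_epi8(v);
}
```

A common use is turning a byte-compare result (pcmpeqb) into a scalar mask that branchy code can test cheaply.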

Page 155 - • Cache eviction:

4-18  Packed Shuffle Word for 64-bit Registers. The pshuf instruction (see Figure 4-8, Example 4-11) uses the immediate ...

Page 156

4-19  Packed Shuffle Word for 128-bit Registers. The pshuflw/pshufhw instruction performs a full shuffle of any ...

Page 157 - Destination

xxiv  ... target the Intel NetBurst microarchitecture and the Pentium M processor microarchitecture. Tuning Your Application ...

Page 158 - • scaled index register

4-20  Unpacking/interleaving 64-bit Data in 128-bit Registers. The punpcklqdq/punpckhqdq instructions interleave the ...

Page 159 - Compares

4-21  Data Movement. There are two additional instructions to enable data movement from the 64-bit SIMD integer ...

Page 160 - Floating Point/SIMD Operands

4-22  pxor MM0, MM0; pcmpeq MM1, MM1; psubb MM0, MM1 [psubw MM0, MM1] (psubd MM0, MM1) ; three instructions above gen...

Page 161

4-23  Building Blocks. This section describes instructions and algorithms which implement common code building blocks ...

Page 162 - Prolog Sequences

4-24  Absolute Difference of Signed Numbers. This section computes the absolute difference of two signed numbers. The t...

Page 163 - Instruction Scheduling

4-25  Absolute Value. Use Example 4-18 to compute |x|, where x is signed. This example assumes signed words to be ...
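In the spirit of the manual's Example 4-18, the absolute value of packed signed words can be built from negate-and-max: |x| = max(x, 0 - x). This is our SSE2 intrinsic adaptation of the MMX example; note that -32768 has no positive counterpart in a signed word and stays -32768, a caveat the manual also makes.

```c
#include <emmintrin.h>

/* Absolute value of 8 packed signed 16-bit words.
 * Caveat: abs(-32768) wraps back to -32768. */
static __m128i abs_epi16(__m128i x)
{
    __m128i neg = _mm_sub_epi16(_mm_setzero_si128(), x);  /* 0 - x */
    return _mm_max_epi16(x, neg);  /* pick whichever lane is non-negative */
}
```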

Page 164 - Spill Scheduling

4-26  Clipping to an Arbitrary Range [high, low]. This section explains how to clip values to a range [high, low].

Page 165 - Vectorization

4-27  Highly Efficient Clipping. For clipping signed words to an arbitrary range, the pmaxsw and pminsw instructions ...
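The pmaxsw/pminsw clipping above can be sketched with SSE2 intrinsics (the manual's Examples 4-20/4-21 are in assembly; the intrinsic spelling is our adaptation): one max raises values below the floor, one min lowers values above the ceiling, with no branches.

```c
#include <emmintrin.h>

/* Clamp 8 packed signed words to [low, high] using pmaxsw/pminsw. */
static __m128i clip_epi16(__m128i x, short low, short high)
{
    x = _mm_max_epi16(x, _mm_set1_epi16(low));   /* raise values below low  */
    x = _mm_min_epi16(x, _mm_set1_epi16(high));  /* lower values above high */
    return x;
}
```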

Page 166 - • avoid global variables

4-28  The code above converts values to unsigned numbers first and then clips them to an unsigned range. The last ...

Page 167 - Miscellaneous

4-29  ... packed-subtract instructions with unsigned saturation; thus this technique can only be used on packed-byte ...

Page 168

xxv  The manual consists of the following parts: Introduction. Defines the purpose and outlines the contents of this manual. Chapter 1: IA-32 ...

Page 169 - User/Source Coding Rules

4-30  Unsigned Byte. The pmaxub instruction returns the maximum between the eight unsigned bytes in either two SIMD ...

Page 170

4-31  The subtraction operation presented above is an absolute difference; that is, t = abs(x-y). The byte val...

Page 171

4-32  The PAVGB instruction operates on packed unsigned bytes and the PAVGW instruction operates on packed unsigned ...
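PAVGB computes the rounded average (a + b + 1) >> 1 per unsigned byte; the SSE2 intrinsic spelling is `_mm_avg_epu8` (our illustration of the instruction described above):

```c
#include <emmintrin.h>

/* Rounded average of 16 packed unsigned bytes: (a + b + 1) >> 1,
 * computed without overflow — exactly what PAVGB does. */
static __m128i avg_bytes(__m128i a, __m128i b)
{
    return _mm_avg_epu8(a, b);
}
```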

Página 172

Optimizing for SIMD Integer Applications 44-33Note that the output is a packed doubleword. If needed, a pack instruction can be used to convert the re

Página 173

IA-32 Intel® Architecture Optimization4-34Memory OptimizationsYou can improve memory accesses using the following techniques:• Avoiding partial memory

Página 174

Optimizing for SIMD Integer Applications 44-35Partial Memory AccessesConsider a case with large load after a series of small stores to the same area o

Página 175

IA-32 Intel® Architecture Optimization4-36Let us now consider a case with a series of small loads after a large store to the same area of memory (begi

Página 176

Optimizing for SIMD Integer Applications 44-37These transformations, in general, increase the number of instructions required to perform the desired o

Página 177

IA-32 Intel® Architecture Optimization4-38SSE3 provides an instruction LDDQU for loading from memory address that are not 16 byte aligned. LDDQU is a

Página 178

Optimizing for SIMD Integer Applications 44-39Increasing Bandwidth of Memory Fills and Video FillsIt is beneficial to understand how memory is accesse

Página 179 - PUSH, CALL, RET). 2-84

IA-32 Intel® Architecture OptimizationxxviChapter 7: Multiprocessor and Hyper-Threading Technology. Describes guidelines and techniques for optimizing

Página 180 - Tuning Suggestions

IA-32 Intel® Architecture Optimization4-40same DRAM page have shorter latencies than sequential accesses to different DRAM pages. In many systems the

Página 181 - Architectures

Optimizing for SIMD Integer Applications 44-41aligned versions; this can reduce the performance gains when using the 128-bit SIMD integer extensions.

Página 182 - Technologies

IA-32 Intel® Architecture Optimization4-42Packed SSE2 Integer versus MMX InstructionsIn general, 128-bit SIMD integer instructions should be favored o

Página 183

5-15Optimizing for SIMD Floating-point ApplicationsThis chapter discusses general rules of optimizing for the single-instruction, multiple-data (SIMD)

Página 184 - bool OSSupportCheck() {

IA-32 Intel® Architecture Optimization5-2• Use MMX technology instructions and registers or for copying data that is not used later in SIMD floating-p

Página 185

Optimizing for SIMD Floating-point Applications 55-3• Is the data arranged for efficient utilization of the SIMD floating-point registers?• Is this ap

Página 186

IA-32 Intel® Architecture Optimization5-4When using scalar floating-point instructions, it is not necessary to ensure that the data appears in vector

Página 187

Optimizing for SIMD Floating-point Applications 55-5For some applications, e.g., 3D geometry, the traditional data arrangement requires some changes t

Página 188 - Programming

IA-32 Intel® Architecture Optimization5-6simultaneously referred to as an xyz data representation, see the diagram below) are computed in parallel, an

Página 189

Optimizing for SIMD Floating-point Applications 55-7To utilize all 4 computation slots, the vertex data can be reorganized to allow computation on eac

Página 190 - Identifying Hot Spots

IntroductionxxviiRelated DocumentationFor more information on the Intel architecture, specific techniques, and processor architecture terminology refe

Página 191

IA-32 Intel® Architecture Optimization 5-8. Figure 5-2 shows how one result would be computed from seven instructions if the data were organized as AoS and usin…

Page 192 - Coding Techniques

Optimizing for SIMD Floating-point Applications 5-9. Now consider the case when the data is organized as SoA. Example 5-2 demonstrates how four results ar…

Page 193 - Coding Methodologies

IA-32 Intel® Architecture Optimization5-10To gather data from 4 different memory locations on the fly, follow steps:1. Identify the first half of the

Página 194

Optimizing for SIMD Floating-point Applications 55-11 y1 x1movhps xmm7, [ecx+16] // xmm7 = y2 x2 y1 x1movlps xmm0, [ecx+32] // xmm0 = -- -- y3 x3m

Página 195 - Intrinsics

IA-32 Intel® Architecture Optimization5-12Example 5-4 shows the same data -swizzling algorithm encoded using the Intel C++ Compiler’s intrinsics for S

Página 196

Optimizing for SIMD Floating-point Applications 55-13 Although the generated result of all zeros does not depend on the specific data contained in the

Página 197 - +”, “>>”)

IA-32 Intel® Architecture Optimization5-14Data DeswizzlingIn the deswizzle operation, we want to arrange the SoA format back into AoS format so the xx

Página 198 - Automatic Vectorization

Optimizing for SIMD Floating-point Applications 55-15You may have to swizzle data in the registers, but not in memory. This occurs when two different

Página 199

IA-32 Intel® Architecture Optimization5-16// Start deswizzling here movaps xmm7, xmm4 // xmm7= a1 a2 a3 a4 movhlps xmm7, xmm3 // xmm7= b3 b4 a

Página 200 - Stack and Data Alignment

Optimizing for SIMD Floating-point Applications 55-17Using MMX Technology Code for Copy or Shuffling FunctionsIf there are some parts in the code that

Página 201

IA-32 Intel® Architecture OptimizationxxviiiNotational ConventionsThis manual uses the following conventions:This type style Indicates an element of s

Página 202 - __m128* datatypes

IA-32 Intel® Architecture Optimization5-18Example 5-8 illustrates how to use MMX technology code for copying or shuffling.Horizontal ADD Using SSEAlth

Página 203 - __m128*

Optimizing for SIMD Floating-point Applications 5-19. Figure 5-3, Horizontal Add Using movhlps/movlhps; Example 5-9, Horizontal Add Using movhlps/movlhps: vo…

Page 204 - Compiler-Supported Alignment

IA-32 Intel® Architecture Optimization 5-20. // START HORIZONTAL ADD: movaps xmm5, xmm0 // xmm5 = A1,A2,A3,A4; movlhps xmm5, xmm1 // xmm5 = A1,A2,B1,…

Page 205

Optimizing for SIMD Floating-point Applications 5-21. Use of cvttps2pi/cvttss2si Instructions: The cvttps2pi and cvttss2si instructions encode the trunca…

Page 206

IA-32 Intel® Architecture Optimization5-22avoided since there is a penalty associated with writing this register; typically, through the use of the cv

Página 207 - Improving Memory Utilization

Optimizing for SIMD Floating-point Applications 55-23 SSE3 and Complex ArithmeticsThe flexibility of SSE3 in dealing with AOS-type of data structure

Página 208 - SoA Data Structure

IA-32 Intel® Architecture Optimization5-24instructions to perform multiplications of single-precision complex numbers. Example 5-12 demonstrates using

Página 209

Optimizing for SIMD Floating-point Applications 55-25Example 5-12 Division of Two Pair of Single-precision Complex Number// Division of (ak + i bk ) /

Página 210

IA-32 Intel® Architecture Optimization 5-26. SSE3 and Horizontal Computation: Sometimes the AoS type of data organization is more natural in many algebrai…

Página 211

Optimizing for SIMD Floating-point Applications 55-27SIMD Optimizations and MicroarchitecturesPentium M, Intel Core Solo and Intel Core Duo processors

Página 212 - Strip Mining

1-11IA-32 Intel® Architecture Processor Family OverviewThis chapter gives an overview of the features relevant to software optimization for the curre

Página 213 - Example 3-19 Strip Mined Code

IA-32 Intel® Architecture Optimization5-28When targeting complex arithmetics on Intel Core Solo and Intel Core Duo processors, using single-precision

Página 214 - Loop Blocking

6-16Optimizing Cache UsageOver the past decade, processor speed has increased more than ten times. Memory access speed has increased at a slower pace.

Página 215 - A. Original Loop

IA-32 Intel® Architecture Optimization6-2• Memory Optimization Using Hardware Prefetching, Software Prefetch and Cacheability Instructions: discusses

Página 216 - Blocking

Optimizing Cache Usage 66-3• Facilitate compiler optimization: — Minimize use of global variables and pointers— Minimize use of complex control flow —

Página 217

IA-32 Intel® Architecture Optimization6-4• Optimize software prefetch scheduling distance:— Far ahead enough to allow interim computation to overlap m

Página 218

Optimizing Cache Usage 66-53. Follows only one stream per 4K page (load or store)4. Can prefetch up to 8 simultaneous independent streams from eight d

Página 219 - Tuning the Final Application

IA-32 Intel® Architecture Optimization6-6Data reference patterns can be classified as follows:Temporal data will be used again soonSpatial data will b

Página 220

Optimizing Cache Usage 66-7The prefetch instruction is implementation-specific; applications need to be tuned to each implementation to maximize perfo

Página 221 - Optimizing for SIMD Integer

IA-32 Intel® Architecture Optimization 6-8. The Prefetch Instructions (Pentium 4 Processor Implementation): Streaming SIMD Extensions include four flavors…

Page 222

Optimizing Cache Usage 66-9Currently, the prefetch instruction provides a greater performance gain than preloading because it:• has no destination reg

Página 223 - Using the EMMS Instruction

iiiContentsIntroductionChapter 1 IA-32 Intel® Architecture Processor Family OverviewSIMD Technology...

Página 224

IA-32 Intel® Architecture Optimization1-2Intel Core Solo and Intel Core Duo processors incorporate microarchitectural enhancements for performance and

Página 225

IA-32 Intel® Architecture Optimization6-10The Non-temporal Store InstructionsThis section describes the behavior of streaming stores and reiterates so

Página 226 - Data Alignment

Optimizing Cache Usage 66-11• Reduce disturbance of frequently used cached (temporal) data, since they write around the processor caches.Streaming sto

Página 227 - Signed Unpack

IA-32 Intel® Architecture Optimization6-12evicting data from all processor caches). The Pentium M processor implements a combination of both approache

Página 228

Optimizing Cache Usage 66-13possible. This behavior should be considered reserved, and dependence on the behavior of any particular implementation ris

Página 229 - MM/M64 mm

IA-32 Intel® Architecture Optimization6-14In case the region is not mapped as WC, the streaming might update in-place in the cache and a subsequent sf

Página 230

Optimizing Cache Usage 66-15The maskmovq/maskmovdqu (non-temporal byte mask store of packed integer in an MMX technology or Streaming SIMD Extensions

Página 231 - Non-Interleaved Unpack

IA-32 Intel® Architecture Optimization6-16The degree to which a consumer of data knows that the data is weakly-ordered can vary for these cases. As a

Página 232

Optimizing Cache Usage 66-17The clflush InstructionThe cache line associated with the linear address specified by the value of byte address is invalid

Página 233 - Extract Word

IA-32 Intel® Architecture Optimization6-18Memory Optimization Using PrefetchThe Pentium 4 processor has two mechanisms for data prefetch: software-con

Página 234 - Insert Word

Optimizing Cache Usage 66-19Hardware PrefetchThe automatic hardware prefetch, can bring cache lines into the unified last-level cache based on prior d

Página 235 - Figure 4-6 pinsrw Instruction

IA-32 Intel® Architecture Processor Family Overview1-3each corresponding pair of data elements (X1 and Y1, X2 and Y2, X3 and Y3, and X4 and Y4). The r

Página 236 - Move Byte Mask to Integer

IA-32 Intel® Architecture Optimization6-20• May consume extra system bandwidth if the application’s memory traffic has significant portions with strid

Página 237 - 55 47 39 23 15 7

Optimizing Cache Usage 6-21. Example 6-2, Populating an Array for Circular Pointer Chasing with Constant Stride: register char **p; char *next; // Populat…

Page 238 - X1 X2 X3 X4

IA-32 Intel® Architecture Optimization6-22 Example of Latency Hiding with S/W Prefetch InstructionAchieving the highest level of memory optimization u

Página 239

Optimizing Cache Usage 66-23execution units sit idle and wait until data is returned. On the other hand, the memory bus sits idle while the execution

Página 240

IA-32 Intel® Architecture Optimization6-24The performance loss caused by poor utilization of resources can be completely eliminated by correctly sched

Página 241 - Generating Constants

Optimizing Cache Usage 66-25• Balance single-pass versus multi-pass execution• Resolve memory bank conflict issues• Resolve cache management issuesThe

Página 242

IA-32 Intel® Architecture Optimization6-26lines of data per iteration. The PSD would need to be increased/decreased if more/less than two cache lines

Página 243 - Building Blocks

Optimizing Cache Usage 66-27This memory de-pipelining creates inefficiency in both the memory pipeline and execution pipeline. This de-pipelining effe

Página 244

IA-32 Intel® Architecture Optimization6-28Prefetch concatenation can bridge the execution pipeline bubbles between the boundary of an inner loop and i

Página 245 - Absolute Value

Optimizing Cache Usage 66-29Minimize Number of Software PrefetchesPrefetch instructions are not completely free in terms of bus cycles, machine cycles

Página 246 - 0x8000800080008000

IA-32 Intel® Architecture Optimization1-4SIMD improves the performance of 3D graphics, speech recognition, image processing, scientific applications a

Página 247 - Highly Efficient Clipping

IA-32 Intel® Architecture Optimization 6-30. Figure 6-5 demonstrates the effectiveness of software prefetches in latency hiding. The X axis indica…

Página 248

Optimizing Cache Usage 6-31. Figure 6-5: Memory Access Latency and Execution With Prefetch (2 load streams, 1 store stream; chart axis data omitted)

Página 249 - Signed Word

IA-32 Intel® Architecture Optimization6-32Mix Software Prefetch with Computation InstructionsIt may seem convenient to cluster all of the prefetch ins

Página 250 - Packed Multiply High Unsigned

Optimizing Cache Usage 66-33 Example 6-6 Spread Prefetch InstructionsNOTE. To avoid instruction execution stalls due to the over-utilization of the

Página 251 - Packed Average (Byte/Word)

IA-32 Intel® Architecture Optimization6-34Software Prefetch and Cache Blocking TechniquesCache blocking techniques, such as strip-mining, are used to

Página 252

Optimizing Cache Usage 66-35In the temporally-adjacent scenario, subsequent passes use the same data and find it already in second-level cache. Prefet

Página 253 - 128-bit Shifts

IA-32 Intel® Architecture Optimization6-36Figure 6-7 shows how prefetch instructions and strip-mining can be applied to increase performance in both o

Página 254 - Memory Optimizations

Optimizing Cache Usage 66-37In scenario to the right, in Figure 6-7, keeping the data in one way of the second-level cache does not improve cache loca

Página 255 - Partial Memory Accesses

IA-32 Intel® Architecture Optimization6-38Without strip-mining, all the x,y,z coordinates for the four vertices must be re-fetched from memory in the

Página 256

Optimizing Cache Usage 66-39Table 6-1 summarizes the steps of the basic usage model that incorporates only software prefetch with strip-mining. The st

Página 257

IA-32 Intel® Architecture Processor Family Overview1-5SSE and SSE2 instructions also introduced cacheability and memory ordering instructions that can

Página 258

IA-32 Intel® Architecture Optimization6-40happen to be powers of 2, aliasing condition due to finite number of way-associativity (see “Capacity Limits

Página 259 - Instruction

Optimizing Cache Usage 66-41references enables the hardware prefetcher to initiate bus requests to read some cache lines before the code actually refe

Página 260

IA-32 Intel® Architecture Optimization6-42selected to ensure that the batch stays within the processor caches through all passes. An intermediate cach

Página 261

Optimizing Cache Usage 66-43The choice of single-pass or multi-pass can have a number of performance implications. For instance, in a multi-pass pipel

Página 262

IA-32 Intel® Architecture Optimization6-44a line burst transaction. To achieve the best possible performance, it is recommended to align data along th

Página 263 - Floating-point Applications

Optimizing Cache Usage 66-45The following examples of using prefetching instructions in the operation of video encoder and decoder as well as in simpl

Página 264 - Planning Considerations

IA-32 Intel® Architecture Optimization6-46Later, the processor re-reads the data using prefetchnta, which ensures maximum bandwidth, yet minimizes dis

Página 265 - Scalar Floating-point Code

Optimizing Cache Usage 66-47The memory copy algorithm can be optimized using the Streaming SIMD Extensions with these considerations:• alignment of da

Página 266

IA-32 Intel® Architecture Optimization 6-48. Using the 8-byte Streaming Stores and Software Prefetch: Example 6-11 presents the copy algorithm that uses se…

Page 267

Optimizing Cache Usage 6-49. In Example 6-11, eight _mm_load_ps and _mm_stream_ps intrinsics are used so that all of the data prefetched (a 128-byte ca…

Page 268

IA-32 Intel® Architecture Optimization1-6SSE instructions are useful for 3D geometry, 3D rendering, speech recognition, and video encoding and decodin

Página 269

IA-32 Intel® Architecture Optimization 6-50. The instruction temp = a[kk+CACHESIZE] is used to ensure the page table entry for array a is entered…

Página 270

Optimizing Cache Usage 66-51prefetch_loop:movaps xmm0, [esi+ecx]movaps xmm0, [esi+ecx+64]add ecx,128cmp ecx,BLOCK_SIZEjne prefetch_loopxor ecx,ecxalig

Página 271 - Data Swizzling

IA-32 Intel® Architecture Optimization6-52Performance Comparisons of Memory Copy RoutinesThe throughput of a large-region, memory copy routine depends

Página 272 - Example 5-3 Swizzling Data

Optimizing Cache Usage 66-53The baseline for performance comparison is the throughput (bytes/sec) of 8-MByte region memory copy on a first-generation

Página 273

IA-32 Intel® Architecture Optimization6-54query each level of the cache hierarchy. Enumeration of each cache level is by specifying an index value (st

Página 274

Optimizing Cache Usage 66-55• Determine multi-threading resource topology in an MP system (See Section 7.10 of IA-32 Intel® Architecture Software Deve

Página 275

IA-32 Intel® Architecture Optimization6-56platform, software can extract information on the number and the identities of each logical processor sharin

Página 276 - Data Deswizzling

7-17Multi-Core and Hyper-Threading TechnologyThis chapter describes software optimization techniques for multithreaded applications running in an envi

Página 277 - Instructions

IA-32 Intel® Architecture Optimization7-2cores but shared by two logical processors in the same core if Hyper-Threading Technology is enabled. This ch

Página 278 - Instructions (continued)

Multi-Core and Hyper-Threading Technology 7-3. Figure 7-1 illustrates how performance gains can be realized for any workload according to Amdahl’s law.

Page 279 - Functions

IA-32 Intel® Architecture Processor Family Overview1-7Intel® Extended Memory 64 Technology (Intel®EM64T)Intel EM64T is an extension of the IA-32 Intel

Página 280 - Horizontal ADD Using SSE

IA-32 Intel® Architecture Optimization7-4When optimizing application performance in a multithreaded environment, control flow parallelism is likely to

Página 281 - C1 C2 D1 D2 C3 C4 D3 D4

Multi-Core and Hyper-Threading Technology 77-5terms of time of completion relative to the same task when in a single-threaded environment) will vary,

Página 282

IA-32 Intel® Architecture Optimization7-6When two applications are employed as part of a multi-tasking workload, there is little synchronization overh

Página 283 - MXCSR register should be

Multi-Core and Hyper-Threading Technology 77-7Parallel Programming ModelsTwo common programming models for transforming independent task requirements

Página 284

IA-32 Intel® Architecture Optimization7-8Functional DecompositionApplications usually process a wide variety of tasks with diverse functions and many

Página 285 - SSE3 and Complex Arithmetics

Multi-Core and Hyper-Threading Technology 77-9overhead when buffers are exchanged between the producer and consumer. To achieve optimal scaling with t

Página 286

IA-32 Intel® Architecture Optimization7-10Producer-Consumer Threading Models Figure 7-3 illustrates the basic scheme of interaction between a pair of

Página 287

Multi-Core and Hyper-Threading Technology 77-11It is possible to structure the producer-consumer model in an interlaced manner such that it can minimi

Página 288

IA-32 Intel® Architecture Optimization7-12corresponding task to use its designated buffer. Thus, the producer and consumer tasks execute in parallel i

Página 289

Multi-Core and Hyper-Threading Technology 77-13Example 7-3 Thread Function for an Interlaced Producer Consumer Model// master thread starts the first

Página 290

IA-32 Intel® Architecture Optimization1-8Intel NetBurst® MicroarchitectureThe Pentium 4 processor, Pentium 4 processor Extreme Edition supporting Hype

Página 291 - Optimizing Cache Usage

IA-32 Intel® Architecture Optimization7-14Tools for Creating Multithreaded ApplicationsProgramming directly to a multithreading application programmin

Página 292

Multi-Core and Hyper-Threading Technology 77-15Automatic Parallelization of Code. While OpenMP directives allow programmers to quickly transform seria

Página 293 - Optimizing Cache Usage 6

IA-32 Intel® Architecture Optimization7-16Optimization GuidelinesThis section summarizes optimization guidelines for tuning multithreaded applications

Página 294 - Hardware Prefetching of Data

Multi-Core and Hyper-Threading Technology 77-17• Place each synchronization variable alone, separated by 128 bytes or in a separate cache line. See “T

Página 295

IA-32 Intel® Architecture Optimization7-18• Adjust the private stack of each thread in an application so the spacing between these stacks is not offse

Página 296 - Prefetch

Multi-Core and Hyper-Threading Technology 77-19• For each processor supporting Hyper-Threading Technology, consider adding functionally uncorrelated t

Página 297

IA-32 Intel® Architecture Optimization7-20The best practice to reduce the overhead of thread synchronization is to start by reducing the application’s

Página 298 - Implementation

Multi-Core and Hyper-Threading Technology 77-21the white paper “Developing Multi-threaded Applications: A Platform Consistent Approach” (referenced in

Página 299 - Cacheability Control

IA-32 Intel® Architecture Optimization7-22Synchronization for Short PeriodsThe frequency and duration that a thread needs to synchronize with other th

Página 300 - Streaming Non-temporal Stores

Multi-Core and Hyper-Threading Technology 77-23the processor must guarantee no violations of memory order occur. The necessity of maintaining the orde

Página 301 - WC semantics)

IA-32 Intel® Architecture Processor Family Overview1-9• to operate at high clock rates and to scale to higher performance and clock rates in the futur

Página 302 - Write-Combining

IA-32 Intel® Architecture Optimization 7-24. Example 7-4, Spin-wait Loop and PAUSE Instructions: (a) An un-optimized spin-wait loop experiences performan…

Page 373 - Example 7-5

Multi-Core and Hyper-Threading Technology 7-25. User/Source Coding Rule 21 (M impact, H generality): Insert the PAUSE instruction in fast spin loops an…

Página 304

IA-32 Intel® Architecture Optimization7-26To reduce the performance penalty, one approach is to reduce the likelihood of many threads competing to acq

Página 305 - • hand-crafted code

Multi-Core and Hyper-Threading Technology 77-27If an application thread must remain idle for a long time, the application should use a thread blocking

Página 306 - The mfence Instruction

IA-32 Intel® Architecture Optimization7-28Avoid Coding Pitfalls in Thread SynchronizationSynchronization between multiple threads must be designed and

Página 307 - The clflush Instruction

Multi-Core and Hyper-Threading Technology 77-29In general, OS function calls should be used with care when synchronizing threads. When using OS-suppor

Página 308 - Software-controlled Prefetch

IA-32 Intel® Architecture Optimization7-30Prevent Sharing of Modified Data and False-SharingOn an Intel Core Duo processor, sharing of modified data i

Página 309 - Hardware Prefetch

Multi-Core and Hyper-Threading Technology 7-31. User/Source Coding Rule 24 (H impact, M generality): Beware of false sharing within a cache line (64 by…

Page 310

IA-32 Intel® Architecture Optimization7-32• Objects allocated dynamically by different threads may share cache lines. Make sure that the variables use

Página 311 - Constant Stride

Multi-Core and Hyper-Threading Technology 77-33• In managed environments that provide automatic object allocation, the object allocators and garbage

Página 312

IA-32 Intel® Architecture Optimization1-10The out-of-order core aggressively reorders µops so that µops whose inputs are ready (and have execution res

Página 313

IA-32 Intel® Architecture Optimization7-34Conserve Bus BandwidthIn a multi-threading environment, bus bandwidth may be shared by memory traffic origin

Página 314

Multi-Core and Hyper-Threading Technology 77-35reads. An approximate working guideline for software to operate below bus saturation is to check if bus

Página 315

IA-32 Intel® Architecture Optimization7-36Avoid Excessive Software PrefetchesPentium 4 and Intel Xeon Processors have an automatic hardware prefetcher

Página 316

Multi-Core and Hyper-Threading Technology 77-37latency of scattered memory reads can be improved by issuing multiple memory reads back-to-back to over

Página 317

IA-32 Intel® Architecture Optimization7-38Frequently, multiple partial writes to WC memory can be combined into full-sized writes using a software wri

Página 318

Multi-Core and Hyper-Threading Technology 77-39block size for loop blocking should be determined by dividing the target cache size by the number of lo

Página 319

IA-32 Intel® Architecture Optimization7-40User/Source Coding Rule 33. (H impact, M generality) Minimize the sharing of data between threads that execu

Página 320

Multi-Core and Hyper-Threading Technology 77-41Example 7-8 shows the batched implementation of the producer and consumer thread functions.Example 7-8

Página 321

IA-32 Intel® Architecture Optimization7-42Eliminate 64-KByte Aliased Data AccessesThe 64 KB aliasing condition is discussed in Chapter 2. Memory acces

Página 322

Multi-Core and Hyper-Threading Technology 77-43Preventing Excessive Evictions in First-Level Data CacheCached data in a first-level data cache are ind

Página 323

IA-32 Intel® Architecture Processor Family Overview1-11The Front EndThe front end of the Intel NetBurst microarchitecture consists of two parts:• fetc

Página 324

IA-32 Intel® Architecture Optimization7-44Per-thread Stack OffsetTo prevent private stack accesses in concurrent threads from thrashing the first-leve

Página 325

Multi-Core and Hyper-Threading Technology 77-45Example 7-9 Adding an Offset to the Stack Pointer of Three ThreadsVoid Func_thread_entry(DWORD *pArg){D

Página 326 - Non-Adjacent Passes Loops

IA-32 Intel® Architecture Optimization 7-46. Per-instance Stack Offset: Each instance of an application runs in its own linear address space, but the address…

Página 327

Multi-Core and Hyper-Threading Technology 77-47However, the buffer space does enable the first-level data cache to be shared cooperatively when two co

Página 328

IA-32 Intel® Architecture Optimization7-48Front-end OptimizationIn the Intel NetBurst microarchitecture family of processors, the instructions are dec

Página 329

Multi-Core and Hyper-Threading Technology 77-49On Hyper-Threading-Technology-enabled processors, excessive loop unrolling is likely to reduce the Trac

Página 330

IA-32 Intel® Architecture Optimization7-50initial APIC_ID (See Section 7.10 of IA-32 Intel Architecture Software Developer’s Manual, Volume 3A for mor

Página 331

Multi-Core and Hyper-Threading Technology 77-51Affinity masks can be used to optimize shared multi-threading resources. Example 7-11 Assembling 3-le

Página 332 - 60 invis

IA-32 Intel® Architecture Optimization7-52Arrangements of affinity-binding can benefit performance more than other arrangements. This applies to: • Sc

Página 333 - • write-once (non-temporal)

Multi-Core and Hyper-Threading Technology 77-53first to the primary logical processor of each processor core. This example is also optimized to the si

Página 334 - Cache Management

ivOut-of-Order Core... 1-30In-Order Retirement...

Página 335 - Video Decoder

IA-32 Intel® Architecture Optimization1-12The execution trace cache and the translation engine have cooperating branch prediction hardware. Branch tar

Página 336

IA-32 Intel® Architecture Optimization7-54Example 7-12 Assembling a Look up Table to Manage Affinity Masks and Schedule Threads to Each Core First AFF

Página 337 - • cache size

Multi-Core and Hyper-Threading Technology 77-55Example 7-13 Discovering the Affinity Masks for Sibling Logical Processors Sharing the Same Cache // Lo

Página 338

IA-32 Intel® Architecture Optimization7-56 PackageID[ProcessorNUM] = PACKAGE_ID;CoreID[ProcessorNum] = CORE_ID;SmtID[Processor

Página 339

Multi-Core and Hyper-Threading Technology 77-57For (ProcessorNum = 1; ProcessorNum < NumStartedLPs; ProcessorNum++) {ProcessorMask << = 1;For

Página 340

IA-32 Intel® Architecture Optimization7-58Optimization of Other Shared ResourcesResource optimization in multi-threaded application depends on the cac

Página 341

Multi-Core and Hyper-Threading Technology 77-59seldom reaches 50% of peak retirement bandwidth. Thus, improving single-thread execution throughput sho

Página 342

IA-32 Intel® Architecture Optimization7-60throughput of a physical processor package. The non-halted CPI metric can be interpreted as the inverse of t

Página 343

Multi-Core and Hyper-Threading Technology 77-61Using a function decomposition threading model, a multithreaded application can pair up a thread with c

Página 344 - Bit Location Name Meaning

IA-32 Intel® Architecture Optimization7-62Write-combining buffers are another example of execution resources shared between two logical processors. Wi

Página 345 - • Determine prefetch stride

8-1864-bit Mode Coding GuidelinesIntroductionThis chapter describes coding guidelines for application software written to run in 64-bit mode. These gu

Página 346 - Parameters

IA-32 Intel® Architecture Processor Family Overview1-13correct execution, the results of IA-32 instructions must be committed in original program orde

Página 347 - Hyper-Threading Technology

IA-32 Intel® Architecture Optimization8-2This optimization holds true for the lower 8 general purpose registers: EAX, ECX, EBX, EDX, ESP, EBP, ESI, ED

Página 348 - Performance and Usage Models

64-bit Mode Coding Guidelines 8-3. If the compiler can determine at compile time that the result of a multiply will not exceed 64 bits, then the compil…

Page 349 - Multi-Thread on MP

IA-32 Intel® Architecture Optimization 8-4. Can be replaced with: movsx r8, r9w ; if bits 63:16 do not need to be preserved. movsx r8, r10b ; if bits 63:…

Page 350 - Multitasking Environment

64-bit Mode Coding Guidelines 88-5IMUL RAX, RCXThe 64-bit version above is more efficient than using the following 32-bit version:MOV EAX, DWORD PTR[

Página 351

IA-32 Intel® Architecture Optimization8-6Use 32-Bit Versions of CVTSI2SS and CVTSI2SD When PossibleThe CVTSI2SS and CVTSI2SD instructions convert a si

Página 352 - • hardware utilization

9-19Power Optimization for Mobile UsagesOverviewMobile computing allows computers to operate anywhere, anytime. Battery life is a key factor in delive

Página 353 - • functional decomposition

IA-32 Intel® Architecture Optimization9-2Pentium M, Intel Core Solo and Intel Core Duo processors implement features designed to enable the reduction

Página 354 - Functional Decomposition

Power Optimization for Mobile Usages 99-3to accommodate demand and adapt power consumption. The interaction between the OS power management policy and

Página 355 - P(1)P(1) C(1)C(1)P(1)

IA-32 Intel® Architecture Optimization9-4ACPI C-StatesWhen computational demands are less than 100%, part of the time the processor is doing useful wo

Página 356 - C: consumer

Power Optimization for Mobile Usages 99-5The index of a C-state type designates the depth of sleep. Higher numbers indicate a deeper sleep state and l

Página 357

IA-32 Intel® Architecture Optimization1-14• a mechanism fetches data only and includes two distinct components: (1) a hardware mechanism to fetch the

Página 358 - Thread 1

IA-32 Intel® Architecture Optimization9-6Figure 9-3 Application of C-states to Idle TimeConsider that a processor is in lowest frequency (LFM- low fre

Página 359

Power Optimization for Mobile Usages 99-7• In an Intel Core Solo or Duo processor, after staying in C4 for an extended time, the processor may enter i

Página 360

IA-32 Intel® Architecture Optimization9-8Adjust Performance to Meet Quality of FeaturesWhen a system is battery powered, applications can extend batte

Página 361

Power Optimization for Mobile Usages 99-9• GetActivePwrScheme: Retrieves the active power scheme (current system power scheme) index. An application c

Página 362 - Optimization Guidelines

IA-32 Intel® Architecture Optimization9-10workload (usually that equates to reducing the number of instructions that the processor needs to execute, o

Página 363

Power Optimization for Mobile Usages 99-11disk operations over time. Use the GetDevicePowerState() Windows API to test disk state and delay the disk

Página 364

IA-32 Intel® Architecture Optimization 9-12: Using Enhanced Intel SpeedStep® Technology. Use Enhanced Intel SpeedStep Technology to adjust the processor

Página 365 - Thread Synchronization

Power Optimization for Mobile Usages 9-13: The same application can be written in such a way that work units are divided into smaller granularity, but

Página 366

IA-32 Intel® Architecture Optimization 9-14: An additional positive effect of continuously operating at a lower frequency is that frequent changes in power

Página 367

Power Optimization for Mobile Usages 9-15: Eventually, if the interval is large enough, the processor will be able to enter deeper sleep and save a

Página 368

IA-32 Intel® Architecture Processor Family Overview 1-15: Branch Prediction. Branch prediction is important to the performance of a deeply pipelined processor

Página 369

IA-32 Intel® Architecture Optimization 9-16: thread enables the physical processor to operate at lower frequency relative to a single-threaded version.

Página 370

Power Optimization for Mobile Usages 9-17: demands only 50% of processor resources (based on idle history). The processor frequency may be reduced by

Página 371 - Optimization with Spin-Locks

IA-32 Intel® Architecture Optimization 9-18: processor to enter the lowest possible C-state type (lower-numbered C state has less power saving). For example

Página 372 - PAUSE instruction in the

Power Optimization for Mobile Usages 9-19: imbalance can be accomplished using performance monitoring events. Intel Core Duo processor provides an event

Página 373 - Example 7-5

IA-32 Intel® Architecture Optimization 9-20

Página 374

A-1: Appendix A, Application Performance Tools. Intel offers an array of application performance tools that are optimized to take advantage of the Intel architecture

Página 375

IA-32 Intel® Architecture Optimization A-2: Intel Performance Libraries. The Intel Performance Library family consists of a set of software libraries optimized

Página 376

Application Performance Tools A-3: family. Vectorization, processor dispatch, inter-procedural optimization, profile-guided optimization and OpenMP

Página 377

IA-32 Intel® Architecture Optimization A-4: default, and targets the Intel Pentium 4 processor and subsequent processors. Code produced will run on any

Página 378

Application Performance Tools A-5: Vectorizer Switch Options. The Intel C++ and Fortran Compiler can vectorize your code using the vectorizer switch options

Página 379 - System Bus Optimization

IA-32 Intel® Architecture Optimization 1-16: To take advantage of the forward-not-taken and backward-taken static predictions, code should be arranged so

Página 380 - Conserve Bus Bandwidth

IA-32 Intel® Architecture Optimization A-6: Multithreading with OpenMP*. Both the Intel C++ and Fortran Compilers support shared memory parallelism via OpenMP
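The shared-memory parallelism mentioned above can be as small as a single pragma. A minimal sketch: the pragma is honored when the code is built with an OpenMP-enabled compiler switch (e.g. -Qopenmp or -fopenmp, an assumption about the toolchain) and is harmlessly ignored otherwise, so the function is correct either way.

```c
#include <stddef.h>

/* OpenMP work-sharing with a reduction: each thread accumulates a
   private partial sum, and OpenMP combines them at the end. */
double dot_product(const double *a, const double *b, size_t n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; i++)
        sum += a[i] * b[i];
    return sum;
}
```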

Página 381

Application Performance Tools A-7: The -Qrcd option disables the change to truncation of the rounding mode in floating-point-to-integer conversions. For

Página 382

IA-32 Intel® Architecture Optimization A-8: When you use PGO, consider the following guidelines: • Minimize the changes to your program after instrumented

Página 383

Application Performance Tools A-9: Sampling. Sampling allows you to profile all active software on your system, including operating system, device drivers

Página 384 - Memory Optimization

IA-32 Intel® Architecture Optimization A-10: Figure A-1 provides an example of a hotspots report by location. Event-based Sampling. Event-based sampling

Página 385 - Shared-Memory Optimization

Application Performance Tools A-11: different events at a time. The number of the events that the VTune analyzer can collect at once on the Pentium 4

Página 386

IA-32 Intel® Architecture Optimization A-12: duration of read traffic compared to the duration of the workload is significantly less than unity, it indicates

Página 387

Application Performance Tools A-13: stride inefficiency is most prominent on memory traffic. A useful indicator for large-stride inefficiency in a workload

Página 388 - 4 KB in each thread

IA-32 Intel® Architecture Optimization A-14: The Call Graph View depicts the caller / callee relationships. Each thread in the application is the root of

Página 389

Application Performance Tools A-15: (SSE), Streaming SIMD Extensions 2 (SSE2) and Streaming SIMD Extensions 3 (SSE3). The library set includes the Intel

Página 390 - Per-thread Stack Offset

IA-32 Intel® Architecture Processor Family Overview 1-17: Some parts of the core may speculate that a common condition holds to allow faster execution.

Página 391

IA-32 Intel® Architecture Optimization A-16: • Performance: Highly-optimized routines with a C interface that give Assembly-level performance in a C/C++

Página 392 - Per-instance Stack Offset

Application Performance Tools A-17: developed with the Intel Performance Libraries benefit from new architectural features of future generations of Intel

Página 393

IA-32 Intel® Architecture Optimization A-18: The Intel Thread Checker product is an Intel VTune Performance Analyzer plug-in data collector that executes

Página 394 - Front-end Optimization

Application Performance Tools A-19: Figure A-2 shows Intel Thread Checker displaying the source code of the selected instance from a list of detected

Página 395 - Resources

IA-32 Intel® Architecture Optimization A-20: Intel® Software College. The Intel® Software College is a valuable resource for classes on Streaming SIMD Extensions

Página 396

B-1: Appendix B, Using Performance Monitoring Events. Performance monitoring events provide facilities to characterize the interaction between programmed sequences

Página 397 - Processor

IA-32 Intel® Architecture Optimization B-2: The performance metrics listed in Tables B-1 through B-5 may be applicable to processors that support Hyper-Threading Technology

Página 398 - Processor (Contd.)

Using Performance Monitoring Events B-3: Replay. In order to maximize performance for the common case, the Intel NetBurst microarchitecture sometimes aggressively

Página 399

IA-32 Intel® Architecture Optimization B-4: miss more than once during its life time, but a Misses Retired metric (for example, 1st-Level Cache Misses Retired

Página 400

Using Performance Monitoring Events B-5: The first two metrics use performance counters, and thus can be used to cause interrupt upon overflow for sampling

Página 401 - Sharing the Same Cache

IA-32 Intel® Architecture Optimization 1-18: execution units are not pipelined (meaning that µops cannot be dispatched in consecutive cycles and the throughput

Página 402

IA-32 Intel® Architecture Optimization B-6: Non-Sleep Clockticks. The performance monitoring counters can also be configured to count clocks whenever the

Página 403

Using Performance Monitoring Events B-7: that logical processor is not halted (it may include some portion of the clock cycles for that logical processor

Página 404

IA-32 Intel® Architecture Optimization B-8: Microarchitecture Notes. Trace Cache Events. The trace cache is not directly comparable to an instruction cache.

Página 405

Using Performance Monitoring Events B-9: There is a simplified block diagram below of the sub-systems connected to the IOQ unit in the front side bus

Página 406

IA-32 Intel® Architecture Optimization B-10: Figure B-1 Relationships Between the Cache Hierarchy, IOQ, BSQ and Front Side Bus

Página 407

Using Performance Monitoring Events B-11: Core references are nominally 64 bytes, the size of a 1st-level cache line. Smaller sizes are called partials

Página 408

IA-32 Intel® Architecture Optimization B-12: • IOQ_allocation, IOQ_active_entries: 64 bytes for hits or misses, smaller for partials' hits or misses

Página 409 - Guidelines

Using Performance Monitoring Events B-13: transactions of the writeback (WB) memory type for the FSB IOQ and the BSQ can be an indication of how often

Página 410 - Only When Necessary

IA-32 Intel® Architecture Optimization B-14: Current implementations of the BSQ_cache_reference event do not distinguish between programmatic read and write

Página 411 - Assembly/Compiler Coding rule

Using Performance Monitoring Events B-15: Usage Notes on Bus Activities. A number of performance metrics in Table B-1 are based on IOQ_active_entries and

Página 412 - 64-Bit Arithmetic

IA-32 Intel® Architecture Processor Family Overview 1-19: Caches. The Intel NetBurst microarchitecture supports up to three levels of on-chip cache. At least

Página 413 - Assembly/Compiler Coding Rule

IA-32 Intel® Architecture Optimization B-16: accesses (i.e., are also 3rd-level misses). This can decrease the average measured BSQ latencies for workloads

Página 414 - Using Software Prefetch

Using Performance Monitoring Events B-17: an expression built up from other metrics; for example, IPC is derived from two single-event metrics. • Column
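As a tiny illustration of a derived metric of this kind, IPC is just the ratio of two raw counts. The counter values themselves would come from the event facilities described in this appendix; the arithmetic is all there is to the derived metric.

```c
/* Derived metric built from two single-event counts:
   IPC = instructions retired / non-halted clockticks. */
double derived_ipc(unsigned long long inst_retired,
                   unsigned long long clockticks) {
    return clockticks ? (double)inst_retired / (double)clockticks : 0.0;
}
```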

Página 415 - Mobile Usages

IA-32 Intel® Architecture OptimizationB-18Table B-1 Pentium 4 Processor Performance MetricsMetric DescriptionEvent Name or Metric ExpressionEvent Mask

Página 416 - Mobile Usage Scenarios

Using Performance Monitoring Events BB-19Speculative Uops Retired Number of uops retired (include both instructions executed to completion and specula

Página 417

IA-32 Intel® Architecture OptimizationB-20Mispredicted returns The number of mispredicted returns including all causes. retired_mispred_branch_typeRET

Página 418 - ACPI C-States

Using Performance Monitoring Events BB-21TC Flushes Number of TC flushes (The counter will count twice for each occurrence. Divide the count by 2 to g

Página 419

IA-32 Intel® Architecture OptimizationB-22Logical Processor 1 Deliver ModeThe number of cycles that the trace and delivery engine (TDE) is delivering

Página 420

Using Performance Monitoring Events BB-23Logical Processor 0 Build ModeThe number of cycles that the trace and delivery engine (TDE) is building trace

Página 421

IA-32 Intel® Architecture OptimizationB-24Trace Cache MissesThe number of times that significant delays occurred in order to decode instructions and b

Página 422

Using Performance Monitoring Events BB-25Memory MetricsPage Walk DTLB All MissesThe number of page walk requests due to DTLB misses from either load o

Página 423 - Reducing Amount of Work

IA-32 Intel® Architecture Optimization1-20Levels in the cache hierarchy are not inclusive. The fact that a line is in level i does not imply that it i

Página 424 - • Switch off unused devices

IA-32 Intel® Architecture Optimization B-26: 64K Aliasing Conflicts¹. The number of 64K aliasing conflicts. A memory reference causing 64K aliasing conflicts
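Since the 64K-aliasing condition involves two addresses that are identical in their low 16 bits (the same offset within a 64 KByte region), a quick screening check over two buffer addresses can be written as follows; the helper name is illustrative, not from the manual.

```c
#include <stdint.h>

/* Returns nonzero when two addresses have the same offset modulo 64K,
   i.e. they could participate in a 64K aliasing conflict. */
int may_alias_64k(const void *a, const void *b) {
    return (((uintptr_t)a ^ (uintptr_t)b) & 0xFFFFu) == 0;
}
```

Offsetting one of two hot buffers by a cache line or more breaks the condition, which is the usual remedy when this metric is high.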

Página 425

Using Performance Monitoring Events BB-27MOB Load ReplaysThe number of replayed loads related to the Memory Order Buffer (MOB). This metric counts onl

Página 426 - Technology

IA-32 Intel® Architecture OptimizationB-282nd-Level Cache Reads Hit Shared The number of 2nd-level cache read references (loads and RFOs) that hit the

Página 427

Using Performance Monitoring Events BB-293rd-Level Cache Reads Hit Modified The number of 3rd-level cache read references (loads and RFOs) that hit th

Página 428 - Enhanced Deeper Sleep

IA-32 Intel® Architecture OptimizationB-30All WCB Evictions The number of times a WC buffer eviction occurred due to any causes (This can be used to d

Página 429 - Multi-Core Considerations

Using Performance Monitoring Events BB-31Bus MetricsBus Accesses from the Processor The number of all bus transactions that were allocated in the IO Q

Página 430

IA-32 Intel® Architecture OptimizationB-32Prefetch Ratio Fraction of all bus transactions (including retires) that were for HW or SW prefetching.(Bus

Página 431

Using Performance Monitoring Events BB-33Writes from the Processor The number of all write transactions on the bus that were allocated in IO Queue fro

Página 432

IA-32 Intel® Architecture OptimizationB-34All WC from the Processor The number of Write Combining memory transactions on the bus that originated from

Página 433 - (C1-C4)

Using Performance Monitoring Events BB-35Bus Accesses from All Agents The number of all bus transactions that were allocated in the IO Queue by all ag

Página 434

IA-32 Intel® Architecture Processor Family Overview1-21back within the processor, and 6-12 bus cycles to access memory if there is no bus congestion.

Página 435 - Application Performance

IA-32 Intel® Architecture OptimizationB-36Bus Reads Underway from the processor7 This is an accrued sum of the durations of all read (includes RFOs) t

Página 436 - Compilers

Using Performance Monitoring Events BB-37All UC Underway from the processor7 This is an accrued sum of the durations of all UC transactions by this pr

Página 437 - Code Optimization Options

IA-32 Intel® Architecture OptimizationB-38Bus Writes Underway from the processor7 This is an accrued sum of the durations of all write transactions by

Página 438

Using Performance Monitoring Events BB-39Write WC Full (BSQ)The number of write (but neither writeback nor RFO) transactions to WC-type memory. BSQ_al

Página 439 - Vectorizer Switch Options

IA-32 Intel® Architecture OptimizationB-40Reads Non-prefetch Full (BSQ) The number of read (excludes RFOs and HW|SW prefetches) transactions to WB-typ

Página 440 - Multithreading with OpenMP*

Using Performance Monitoring Events BB-41UC Write Partial (BSQ) The number of UC write transactions. Beware of granularity issues between BSQ and FSB

Página 441

IA-32 Intel® Architecture OptimizationB-42WB Writes Full Underway (BSQ)8 This is an accrued sum of the durations of writeback (evicted from cache) tra

Página 442 - VTune™ Performance Analyzer

Using Performance Monitoring Events BB-43Write WC Partial Underway (BSQ)8This is an accrued sum of the durations of partial write transactions to WC-t

Página 443 - Sampling

IA-32 Intel® Architecture OptimizationB-44SSE Input AssistsThe number of occurrences of SSE/SSE2 floating-point operations needing assistance to handl

Página 444 - Event-based Sampling

Using Performance Monitoring Events BB-451. A memory reference causing 64K aliasing conflict can be counted more than once in this stat. The resulting

Página 445 - Workload Characterization

vBranch Prediction... 2-15Eliminating B

Página 446

IA-32 Intel® Architecture Optimization1-22• avoids the need to access off-chip caches, which can increase the realized bandwidth compared to a normal

Página 447 - Call Graph

IA-32 Intel® Architecture Optimization B-46: 4. Most commonly used x87 instructions (e.g., fmul, fadd, fdiv, fsqrt, fstp, etc.) decode into a single μop.

Página 448 - Performance Libraries

Using Performance Monitoring Events BB-47Table B-2 Metrics That Utilize Replay Tagging MechanismReplay Metric Tags1Bit field to set:IA32_PEBS_ENABLE B

Página 449 - Benefits Summary

IA-32 Intel® Architecture OptimizationB-48Tags for front_end_eventTable B-3 provides a list of the tags that are used by various metrics derived from

Página 450 - Optimizations with the Intel

Using Performance Monitoring Events BB-49Table B-4 Metrics That Utilize the Execution Tagging MechanismExecution Metric Tags Upstream ESCRTag Value in

Página 451 - Threading Tools

IA-32 Intel® Architecture Optimization B-50: Table B-5 New Metrics for Pentium 4 Processor (Family 15, Model 3). Using Performance Metrics with Hyper-Threading Technology

Página 452

Using Performance Monitoring Events BB-51The performance metrics listed in Table B-1 fall into three categories:• Logical processor specific and suppo

Página 453 - Thread Profiler

IA-32 Intel® Architecture OptimizationB-52Branching Metrics Branches RetiredTagged Mispredicted Branches RetiredMispredicted Branches RetiredAll retur

Página 454 - Software College

Using Performance Monitoring Events BB-53Memory Metrics Split Load Replays1Split Store Replays1MOB Load Replays164k Aliasing Conflicts1st-Level Cache

Página 455 - Using Performance Monitoring

IA-32 Intel® Architecture OptimizationB-54Bus Metrics Bus Accesses from the Processor1Non-prefetch Bus Accesses from the Processor1Reads from the Proc

Página 456 - Bus Ratio

Using Performance Monitoring Events BB-55Characterization Metrics x87 Input Assistsx87 Output AssistsMachine Clear CountMemory Order Machine ClearSelf

Página 457

IA-32 Intel® Architecture Processor Family Overview 1-23: Hardware prefetching for the Pentium 4 processor has the following characteristics: • works with existing

Página 458 - Counting Clocks

IA-32 Intel® Architecture Optimization B-56: Using Performance Events of Intel Core Solo and Intel Core Duo Processors. There are performance events specific

Página 459 - Non-Halted Clockticks

Using Performance Monitoring Events B-57: There are three cycle-counting events which will not progress on a halted core, even if the halted core is being

Página 460 - Non-Sleep Clockticks

IA-32 Intel® Architecture Optimization B-58: • Some events, such as writebacks, may have non-deterministic behavior for different runs. In such a case,

Página 461 - Time Stamp Counter

Using Performance Monitoring Events B-59: • Serial_Execution_Cycles, event number 3C, unit mask 02H. This event counts the bus cycles during which the core

Página 462 - Microarchitecture Notes

IA-32 Intel® Architecture Optimization B-60

Página 463

C-1: Appendix C, IA-32 Instruction Latency and Throughput. This appendix contains tables of the latency, throughput and execution units that are associated with more

Página 464 - Side Bus

IA-32 Intel® Architecture Optimization C-2: Overview. The current generation of the IA-32 family of processors uses out-of-order execution with dynamic scheduling

Página 465 - Reads due to program loads

IA-32 Instruction Latency and Throughput C-3: While several items on the above list involve selecting the right instruction, this appendix focuses on the

Página 466 - Writebacks (dirty evictions)

IA-32 Intel® Architecture Optimization C-4: Definitions. The IA-32 instruction performance data are listed in several tables. The tables contain the following

Página 467

IA-32 Instruction Latency and Throughput C-5: accurately predict realistic performance of actual code sequences based on adding instruction latency data
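The caution above, that adding per-instruction latencies does not accurately predict real performance, comes down to dependence: a dependent chain pays the full latency of each instruction, while independent instructions are limited mainly by issue throughput. The cycle model below is a deliberate simplification for illustration, an assumption rather than data from the tables.

```c
/* Simplified cycle estimates for n identical operations. */
double cycles_dependent(int n, double latency) {
    return n * latency;                  /* each op waits on the previous result */
}

double cycles_independent(int n, double latency, double recip_tput) {
    return latency + (n - 1) * recip_tput;   /* pipelined back-to-back issue */
}
```

For ten 4-cycle-latency, 1-per-cycle-throughput operations, the dependent chain costs 40 cycles but the independent stream only about 13, which is why the tables list latency and throughput separately.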

Página 468

IA-32 Intel® Architecture Optimization 1-24: Thus, software optimization of a data access pattern should emphasize tuning for hardware prefetch first to
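A software-prefetch loop of the kind this tuning applies to can be sketched with the GCC/Clang portable hint __builtin_prefetch; the manual's own examples use the prefetchnta/prefetcht0 instructions directly, and the prefetch distance constant here is a tuning assumption, not a recommended value.

```c
#include <stddef.h>

#define PFETCH_AHEAD 64   /* elements ahead; tune per Appendix E's psd model */

double sum_strided(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PFETCH_AHEAD < n)
            /* args: address, 0 = read, 0 = low temporal locality (NTA-like) */
            __builtin_prefetch(&a[i + PFETCH_AHEAD], 0, 0);
        s += a[i];
    }
    return s;
}
```

The hint is advisory; on hardware whose automatic prefetcher already covers the pattern, the explicit prefetch is redundant, which is the ordering the excerpt recommends (tune for hardware prefetch first).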

Página 469 - Usage Notes on Bus Activities

IA-32 Intel® Architecture OptimizationC-6 Latency and Throughput with Register OperandsIA-32 instruction latency and throughput data are presented in

Página 470

IA-32 Instruction Latency and Throughput CC-7Table C-2 Streaming SIMD Extension 2 128-bit Integer InstructionsInstruction Latency1ThroughputExecution

Página 471

IA-32 Intel® Architecture OptimizationC-8PCMPGTB/PCMPGTD/PCMPGTW xmm, xmm2 2 1 2 2 1 MMX_ALUPEXTRW r32, xmm, imm8 7 7 3 2 2 2 MMX_SHFT,FP_MISCPINSRW x

Página 472

IA-32 Instruction Latency and Throughput CC-9 PSUBB/PSUBW/PSUBD xmm, xmm2 2 1 2 2 1 MMX_ALU PSUBSB/PSUBSW/PSUBUSB/PSUBUSW xmm, xmm2 2 1 2 2 1 MMX_A

Página 473

IA-32 Intel® Architecture OptimizationC-10COMISD xmm, xmm 7 6 1 2 2 1 FP_ADD, FP_MISCCVTDQ2PD xmm, xmm 8 8 4+1 3 3 4 FP_ADD, MMX_SHFTCVTPD2PI mm, xmm

Página 474

IA-32 Instruction Latency and Throughput CC-11DIVPD xmm, xmm 70 69 32+31 70 69 62 FP_DIVDIVSD xmm, xmm 39 38 32 39 38 31 FP_DIVMAXPD xmm, xmm 5 4 4 2

Página 475

IA-32 Intel® Architecture OptimizationC-12Table C-4 Streaming SIMD Extension Single-precision Floating-point Instructions Instruction Latency1Throughp

Página 476

IA-32 Instruction Latency and Throughput CC-13MOVLHPS3 xmm, xmm 44 22 MMX_SHFTMOVMSKPS r32, xmm 6 6 2 2 FP_MISCMOVSS xmm, xmm 4 4 2 2 MMX_SHFTMOVUPS x

Página 477

IA-32 Intel® Architecture OptimizationC-14 Table C-5 Streaming SIMD Extension 64-bit Integer Instructions Instruction Latency1Throughput Execution Uni

Página 478

IA-32 Instruction Latency and Throughput CC-15 PCMPGTB/PCMPGTD/PCMPGTW mm, mm22 11 MMX_ALUPMADDWD3 mm, mm 98 11 FP_MULPMULHW/PMULLW3 mm, mm98 11 FP_M

Página 479

IA-32 Intel® Architecture Processor Family Overview1-25Reordering loads with respect to each other can prevent a load miss from stalling later loads.

Página 480

IA-32 Intel® Architecture OptimizationC-16Table C-7 IA-32 x87 Floating-point Instructions Instruction Latency1ThroughputExecution Unit2CPUID 0F3n 0F2n

Página 481

IA-32 Instruction Latency and Throughput C-17:
FSCALE⁴ latency 60, throughput 7
FRNDINT⁴ latency 30, throughput 11
FXCH⁵ latency 0, throughput 1, FP_MOVE
FLDZ⁶ 0
FINCSTP/FDECSTP⁶ 0
See “Table Footnotes”. Table C-8 IA-32 General Purpose Instructions

Página 482

IA-32 Intel® Architecture OptimizationC-18Jcc7Not Appli-cable0.5 ALULOOP 8 1.5 ALUMOV 1 0.5 0.5 0.5 ALUMOVSB/MOVSW 1 0.5 0.5 0.5 ALUMOVZB/MOVZW 1 0.5

Página 483

IA-32 Instruction Latency and Throughput CC-19Table FootnotesThe following footnotes refer to all tables in this appendix.1. Latency information for m

Página 484

IA-32 Intel® Architecture OptimizationC-204. Latency and Throughput of transcendental instructions can vary substantially in a dynamic execution envir

Página 485

IA-32 Instruction Latency and Throughput CC-21For the sake of simplicity, all data being requested is assumed to reside in the first level data cache

Página 486

IA-32 Intel® Architecture OptimizationC-22

Página 487

D-1: Appendix D, Stack Alignment. This appendix details the alignment of stacks of data for Streaming SIMD Extensions and Streaming SIMD Extensions 2. Stack Frames

Página 488

IA-32 Intel® Architecture Optimization D-2: alignment for __m64 and double type data by enforcing that these 64-bit data items are at least eight-byte aligned
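The eight- and sixteen-byte alignment this appendix enforces reduces to simple mask arithmetic. A sketch: the compiler achieves the stack case with an instruction like and esp, 0fffffff0h, which rounds downward because the stack grows down; for upward-growing buffers the rounding goes the other way.

```c
#include <stdint.h>

/* Round an address down to a 16-byte boundary (stack-style). */
static inline uintptr_t align_down16(uintptr_t p) {
    return p & ~(uintptr_t)15u;
}

/* Round an address up to the next 16-byte boundary (buffer-style). */
static inline uintptr_t align_up16(uintptr_t p) {
    return (p + 15u) & ~(uintptr_t)15u;
}
```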

Página 489

Stack Alignment D-3: As an optimization, an alternate entry point can be created that can be called when proper stack alignment is provided by the caller

Página 490

IA-32 Intel® Architecture Processor Family Overview 1-26: Intel® Pentium® M Processor Microarchitecture. Like the Intel NetBurst microarchitecture, the pipeline of the

Página 491

Stack Alignment D-4: Example D-1 in the following sections illustrates this technique. Note the entry points foo and foo.aligned; the latter is the alternate

Página 492

Stack Alignment DD-5Example D-1 Aligned esp-Based Stack Framesvoid _cdecl foo (int k){ int j; foo: // See Note A push

Página 493

Stack Alignment DD-6Aligned ebp-Based Stack FramesIn ebp-based frames, padding is also inserted immediately before the return address. However, this f

Página 494

Stack Alignment DD-7Example D-2 Aligned ebp-based Stack Framesvoid _stdcall foo (int k){ int j; foo: push ebxmov ebx, espsub esp, 0x00000008and esp, 0

Página 495

Stack Alignment DD-8// the goal is to make esp and ebp// (0 mod 16) herej = k;mov edx, [ebx + 8] // k is (0 mod 16) if caller aligned// its stackmov [

Página 496

Stack Alignment DD-9Stack Frame OptimizationsThe Intel C++ Compiler provides certain optimizations that may improve the way aligned frames are set up

Página 497

IA-32 Intel® Architecture OptimizationD-10Inlined Assembly and ebxWhen using aligned frames, the ebx register generally should not be modified in inli

Página 498

E-1: Appendix E, Mathematics of Prefetch Scheduling Distance. This appendix discusses how far away to insert prefetch instructions. It presents a mathematical model

Página 499

IA-32 Intel® Architecture Optimization E-2: Ninst is the number of instructions in the scope of one loop iteration. Consider the following example of a heuristic

Página 500 - Tags for replay_event

Mathematics of Prefetch Scheduling Distance E-3: Tb: data transfer latency, which is equal to (number of lines per iteration) × (line burst latency). Note that
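One common heuristic built from these terms is to take the ceiling of the miss cost over the per-iteration compute time. Treat the formula below as an assumption-level sketch using the symbols defined here (Tl: memory leadoff latency, Tb: burst transfer latency, Tc: compute latency of one iteration), not as the manual's exact derivation.

```c
/* Estimate a prefetch scheduling distance (in iterations):
   psd ~= ceil((Tl + Tb) / Tc), clamped to at least 1. */
int psd_estimate(int Tl, int Tb, int Tc) {
    if (Tc <= 0) return 1;
    int psd = (Tl + Tb + Tc - 1) / Tc;   /* integer ceiling */
    return psd > 0 ? psd : 1;
}
```

With the appendix's later example values Tl = 18 and Tb = 8, a 12-cycle compute body gives psd = 3, i.e. prefetch data needed three iterations ahead.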

Página 501

IA-32 Intel® Architecture Processor Family Overview 1-27: The Intel Pentium M processor microarchitecture is designed for lower power consumption. There

Página 502 - Tags for execution_event

IA-32 Intel® Architecture Optimization E-4: Memory access plays a pivotal role in prefetch scheduling. For more understanding of a memory subsystem, consider

Página 503

Mathematics of Prefetch Scheduling Distance E-5: Tl varies dynamically and is also system hardware-dependent. The static variants include the core-to-front-side-bus ratio

Página 504 - Technology

IA-32 Intel® Architecture Optimization E-6: No Preloading or Prefetch. The traditional programming approach does not perform data preloading or prefetch.

Página 505 - Parallel Counting

Mathematics of Prefetch Scheduling Distance E-7: The iteration latency is approximately equal to the computation latency plus the memory leadoff latency

Página 506 - Parallel Counting (continued)

IA-32 Intel® Architecture Optimization E-8: The following formula shows the relationship among the parameters. It can be seen from this relationship that

Página 507

Mathematics of Prefetch Scheduling Distance E-9: For this particular example the prefetch scheduling distance is greater than 1. Data being prefetched

Página 508

IA-32 Intel® Architecture Optimization E-10: Memory Throughput Bound (Case: Tb >= Tc). When the application or loop is memory throughput bound, the memory

Página 509

Mathematics of Prefetch Scheduling Distance E-11: memory, you cannot do much about it. Typically, data copy from one space to another space, for example

Página 510 - Intel Core Duo processors

IA-32 Intel® Architecture Optimization E-12: Now for the case Tl = 18, Tb = 8 (2 cache lines are needed per iteration), examine the following graph. Consider

Página 511 - Ratio Interpretation

Mathematics of Prefetch Scheduling Distance E-13: In reality, the front-side bus (FSB) pipelining depth is limited, that is, only four transactions are

Página 512 - Notes on Selected Events

IA-32 Intel® Architecture Optimization1-28The fetch and decode unit includes a hardware instruction prefetcher and three decoders that enable parallel

Página 513

IA-32 Intel® Architecture OptimizationE-14

Página 514

Index-1Index64-bit modedefault operand size, 8-1introduction, 8-1legacy instructions, 8-1multiplication notes, 8-2register usage, 8-2, 8-4sign-extensi

Página 515 - Throughput

IA-32 Intel® Architecture OptimizationIndex-2coding methodologies, 3-13coding techniques, 3-12absolute difference of signed numbers, 4-24absolute diff

Página 516 - Overview

IndexIndex-3floating-point stalls, 2-72flow dependency, E-7flush to zero, 5-22FXCH instruction, 2-70Ggeneral optimization techniques, 2-1branch predic

Página 517 - PADDQ and PMULUDQ, each have

IA-32 Intel® Architecture OptimizationIndex-4Llarge load stalls, 2-37latency, 2-72, 6-5lea instruction, 2-74loading and storing to and from the same D

Página 518 - Latency and Throughput

IndexIndex-5Ooptimizing cache utilizationcache management, 6-44examples, 6-15non-temporal store instructions, 6-10prefetch and load, 6-9prefetch Instr

Página 519

IA-32 Intel® Architecture OptimizationIndex-6Rreciprocal instructions, 5-2rounding control option, A-6Ssamplingevent-based, A-10Self-modifying code, 2

Página 520 - See “Table Footnotes”

INTEL SALES OFFICESASIA PACIFICAustraliaIntel Corp.Level 2448 St Kilda Road Melbourne VIC3004AustraliaFax:613-9862 5599ChinaIntel Corp.Rm 709, Shaanxi

Página 521

Intel Corp.999 CANADA PLACE, Suite 404,#11Vancouver BCV6C 3E2CanadaFax:604-844-2813Intel Corp.2650 Queensview Drive, Suite 250Ottawa ONK2B 8H6CanadaFa

Página 522

IA-32 Intel® Architecture Processor Family Overview 1-29: • Micro-ops (µops) fusion. Some of the most frequent pairs of µops derived from the same instruction

Página 523

IA-32 Intel® Architecture Optimization 1-30: Data is fetched 64 bytes at a time; the instruction and data translation lookaside buffers support 128 entries

Página 524 - Instructions (continued)

IA-32 Intel® Architecture Processor Family Overview 1-31: In-Order Retirement. The retirement unit in the Pentium M processor buffers completed µops and is the

Página 525

viFloating-Point Stalls... 2-72x87 Floating-point

Página 526

IA-32 Intel® Architecture Optimization1-32• Power-optimized busThe system bus is optimized for power efficiency; increased bus speed supports 667 MHz.

Página 527

IA-32 Intel® Architecture Processor Family Overview1-33Data PrefetchingIntel Core Solo and Intel Core Duo processors provide hardware mechanisms to pr

Página 528

IA-32 Intel® Architecture Optimization1-34The two logical processors each have a complete set of architectural registers while sharing one single phys

Página 529

IA-32 Intel® Architecture Processor Family Overview1-35In the first implementation of HT Technology, the physical execution resources are shared and t

Página 530

IA-32 Intel® Architecture Optimization1-36Processor Resources and Hyper-Threading TechnologyThe majority of microarchitecture resources in a physical

Página 531

IA-32 Intel® Architecture Processor Family Overview1-37For example: a cache miss, a branch misprediction, or instruction dependencies may prevent a lo

Página 532

IA-32 Intel® Architecture Optimization1-38Microarchitecture Pipeline and Hyper-Threading TechnologyThis section describes the HT Technology microarchi

Página 533 - Table Footnotes

IA-32 Intel® Architecture Processor Family Overview1-39Execution CoreThe core can dispatch up to six µops per cycle, provided the µops are ready to ex

Página 534

IA-32 Intel® Architecture Optimization1-40Pentium Processor Extreme Edition provide four logical processors in a physical package that has two executi

Página 535

IA-32 Intel® Architecture Processor Family Overview1-41Figure 1-7 Pentium D Processor, Pentium Processor Extreme Edition and Intel Core Duo ProcessorS

Página 536

viiConsiderations for Code Conversion to SIMD Programming... 3-8Identifying Hot Spots ...

Página 537

IA-32 Intel® Architecture Optimization1-42Microarchitecture Pipeline and Multi-Core ProcessorsIn general, each core in a multi-core processor resemble

Página 538

IA-32 Intel® Architecture Processor Family Overview1-43that the cache line that contains the memory location is owned by the first-level data cache of

Página 539 - Stack Alignment D

IA-32 Intel® Architecture Optimization1-44when data is written back to memory, the eviction consumes cache bandwidth and bus bandwidth. For multiple c

Página 540

2-1: Chapter 2, General Optimization Guidelines. This chapter discusses general optimization techniques that can improve the performance of applications running on

Página 541

IA-32 Intel® Architecture Optimization 2-2: The following sections describe practices, tools, coding rules and recommendations associated with these factors

Página 542 - & 0x0f) == 0x08

General Optimization Guidelines 22-3* Streaming SIMD Extensions (SSE)** Streaming SIMD Extensions 2 (SSE2)General Practices and Coding GuidelinesThi

Página 543

IA-32 Intel® Architecture Optimization2-4Use Available Performance Tools• Current-generation compiler, such as the Intel C++ Compiler:— Set this compi

Página 544

General Optimization Guidelines 22-5Optimize Branch Predictability• Improve branch predictability and optimize instruction prefetching by arranging co

Página 545 - Stack Frame Optimizations

IA-32 Intel® Architecture Optimization2-6• Minimize use of global variables and pointers.• Use the const modifier; use the static modifier for global

Página 546 - Inlined Assembly and ebx

General Optimization Guidelines 22-7• Avoid longer latency instructions: integer multiplies and divides. Replace them with alternate code sequences (e

Página 547 - Scheduling Distance

viiiPacked Shuffle Word for 64-bit Registers ... 4-18Packed Shuffle Word for 128-bi

Página 548 - Mathematical Model for PSD

IA-32 Intel® Architecture Optimization2-8• Avoid the use of conditionals.• Keep induction (loop) variable expressions simple.• Avoid using pointers, t

Página 549

General Optimization Guidelines 22-9Performance ToolsIntel offers several tools that can facilitate optimizing your application’s performance.Intel® C

Página 550 - L2 lookup miss latency

IA-32 Intel® Architecture Optimization2-10General Compiler RecommendationsA compiler that has been extensively tuned for the target microarchitec-ture

Página 551 - • Optimize T

General Optimization Guidelines 22-11The VTune Performance Analyzer also enables engineers to use these counters to measure a number of workload chara

Página 552 - No Preloading or Prefetch

IA-32 Intel® Architecture Optimization2-12Intel Core Solo and Intel Core Duo processors have enhanced front end that is less sensitive to the 4-1-1 te

Página 553 - Execution cycles

General Optimization Guidelines 22-13• On the Pentium 4 and Intel Xeon processors, the primary code size limit of interest is imposed by the trace cac

Página 554 - Compute Bound (Case: T

IA-32 Intel® Architecture Optimization2-14Transparent Cache-Parameter StrategyIf CPUID instruction supports function leaf 4, also known as determinist

Página 555

General Optimization Guidelines 22-15Branch PredictionBranch optimizations have a significant impact on performance. By understanding the flow of bran

Página 556 - >= T

IA-32 Intel® Architecture Optimization2-16Assembly/Compiler Coding Rule 1. (MH impact, H generality) Arrange code to make basic blocks contiguous and

Página 557

General Optimization Guidelines 22-17See Example 2-2. The optimized code sets ebx to zero, then compares A and B. If A is greater than or equal to B,

Página 558

ixData Alignment... 5-4Data Arran

Página 559

IA-32 Intel® Architecture Optimization 2-18: The cmov and fcmov instructions are available on the Pentium II and subsequent processors, but not on Pentium

Página 560

General Optimization Guidelines 2-19: Static Prediction. Branches that do not have a history in the BTB (see the “Branch Prediction” section) are predicted

Página 561

IA-32 Intel® Architecture Optimization 2-20: Assembly/Compiler Coding Rule 3. (M impact, H generality) Arrange code to be consistent with the static branch

Página 562

General Optimization Guidelines 2-21: Examples 2-6 and 2-7 provide basic rules for a static prediction algorithm. In Example 2-6, the backward branch
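The static-prediction rules above (forward conditional branches predicted not taken, backward branches predicted taken) reward keeping the common case on the fall-through path. __builtin_expect is an assumed GCC/Clang mechanism for telling the compiler which arm is likely; the manual's own examples express the same idea directly in assembly layout.

```c
/* Rare error check arranged as a forward, predicted-not-taken branch;
   the common case falls straight through. */
int checked_div(int num, int den, int *out) {
    if (__builtin_expect(den == 0, 0)) {   /* unlikely error path */
        return -1;
    }
    *out = num / den;                      /* likely path, fall-through */
    return 0;
}
```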

Página 563

IA-32 Intel® Architecture Optimization2-22Inlining, Calls and ReturnsThe return address stack mechanism augments the static and dynamic predictors to

Página 564

General Optimization Guidelines 22-23Assembly/Compiler Coding Rule 6. (H impact, M generality) Do not inline a function if doing so increases the work

Página 565

IA-32 Intel® Architecture Optimization2-24Placing data immediately following an indirect branch can cause a performance problem. If the data consist o

Página 566

General Optimization Guidelines 22-25indirect branch into a tree where one or more indirect branches are preceded by conditional branches to those tar

Página 567 - INTEL SALES OFFICES

IA-32 Intel® Architecture Optimization2-26best performance from a coding effort. An example of peeling out the most favored target of an indirect bran

Página 568

General Optimization Guidelines 22-27• The Pentium 4 processor can correctly predict the exit branch for an inner loop that has 16 or fewer iterations
