Case Study: Using an Out of Order Execution for Code

int func1(int x) {
  return x * x * x * x; 
}

mov %edi,%eax
imul %edi,%eax 
imul %edi,%eax 
imul %edi,%eax 
retq // Total Cycles = latenency_mult*3; (3 cycles)

int func2(int x, int y, int z) { 
  return x*x + y*y + z*z; 
}

imul %edi,%edi
imul %esi,%esi 
add %esi,%edi 
imul %edx,%edx 
lea (%rdi,%rdx,1),%eax 
retq

// Total Cycles =  Latency-of-mult + Latency-of-add * 2 (3 cycles)

Untitled

Critical Path

Critical Paths and Loops

for(int i = 0; i < n; ++i) 
	c[i] = a[i] * b[i];

mov    (%rdi,%rax,4),%ecx
imul   (%rsi,%rax,4),%ecx
mov    %ecx,(%rdx,%rax,4)
mov    %rax,%rcx
add    $0x1,%rax
cmp    %r8,%rcx
jne    <first instruction>

Untitled

Identifying the Critical Path for Loops

Untitled

<aside> πŸ’‘ Identifying the Critical Path:

  1. Keep in mind the dependencies that are needed for the cycle to complete or execute.
  2. For loops, follow the repeatable paths rather than the single time executable paths. </aside>

The Bounds of Performance Limitations

Untitled

Calculating Bounds

Initial Example:

Untitled

Haswell Execution Units:

Untitled

<aside> πŸ’‘ The reason why many of the FU’s have low throughput is that we can go ahead and split up complicated instructions into smaller stages that can be pipelined.

In the following example, consider that the mult takes 3 time units in that operation.

</aside>