Case Study: Using an Out of Order Execution for Code

int func1(int x) {
  return x * x * x * x; 
}

mov %edi,%eax
imul %edi,%eax 
imul %edi,%eax 
imul %edi,%eax 
retq // Total Cycles = latenency_mult*3; (3 cycles)

int func2(int x, int y, int z) { 
  return x*x + y*y + z*z; 
}

imul %edi,%edi
imul %esi,%esi 
add %esi,%edi 
imul %edx,%edx 
lea (%rdi,%rdx,1),%eax 
retq

// Total Cycles =  Latency-of-mult + Latency-of-add * 2 (3 cycles)

Untitled

Critical Path

this is the critical path for the second function which is the longest path down the execution time
the critical path is the path in the dependency graph that takes the longest time to execute from one variable back to itself!
- we ignore the pipeline and drain when dealing with the critical path

Critical Paths and Loops

for(int i = 0; i < n; ++i) 
	c[i] = a[i] * b[i];

mov    (%rdi,%rax,4),%ecx
imul   (%rsi,%rax,4),%ecx
mov    %ecx,(%rdx,%rax,4)
mov    %rax,%rcx
add    $0x1,%rax
cmp    %r8,%rcx
jne    <first instruction>

Untitled

Identifying the Critical Path for Loops

although we would think it is the load, mult, store → the execution ends there and there are no dependencies below it
add1 is the only one that has dependencies below it (rax keeps changing) and thus that would be the critical path
total cycles: n-iters*lat_add
critical pathsofr loops must go through the repeatable paths
Cycle per element is how long it takes to execute one iteration of the loop

Untitled

<aside> 💡 Identifying the Critical Path:

Keep in mind the dependencies that are needed for the cycle to complete or execute.
For loops, follow the repeatable paths rather than the single time executable paths. </aside>

The Bounds of Performance Limitations

Untitled

Limited by instruction dependences → no instruction level parallelism
- characterized usually by latency bound: the length in cycles of the critical path through the instructions
Limited by execution resources → lots of instruction-level parallelism but not enough hardware to actually conduct the task
- characterized usually by throughput bound: ignoring the dependences, what is the execution time given your limited resources
The overall bound is the slower of the two bounds.

Calculating Bounds

Initial Example:

Untitled

consider you have the dorms, with inifinate washers and dryers, and your parents house with just one washer and dryer
- in the dorms, the slowest thing will be the throughput bound because you have infinate resources (washer dryers)
  - in each washer dryer, you wait 1 cycle for wash, and 2 cycles for dry so latency bound is 3 cycles
- in the parent’s house, you only have 2 washers and dryers so you can pipeline, but the throughput is long
  - Cycle 1: W: L1 D: N/A → Cycle 2: W: L2 D: L1 → Cycle 3: W: L3 D: L1 → Cycle 4: W: L4 D: L2 → Cycle 5: W: L4 D: L2 → Cycle 6: W: N/A D: L3 → Cycle 7: W: L2 D: L3 → Cycle 8: W: L2 D: L4 →
  - Basically it is 2 times the slowest resource that you own

Haswell Execution Units:

Untitled

<aside> 💡 The reason why many of the FU’s have low throughput is that we can go ahead and split up complicated instructions into smaller stages that can be pipelined.

In the following example, consider that the mult takes 3 time units in that operation.

</aside>