Case Study: Using an Out of Order Execution for Code
int func1(int x) {
return x * x * x * x;
}
mov %edi,%eax
imul %edi,%eax
imul %edi,%eax
imul %edi,%eax
retq // Total Cycles = latenency_mult*3; (3 cycles)
int func2(int x, int y, int z) {
return x*x + y*y + z*z;
}
imul %edi,%edi
imul %esi,%esi
add %esi,%edi
imul %edx,%edx
lea (%rdi,%rdx,1),%eax
retq
// Total Cycles = Latency-of-mult + Latency-of-add * 2 (3 cycles)
Critical Path
- this is the critical path for the second function which is the longest path down the execution time
- the critical path is the path in the dependency graph that takes the longest time to execute from one variable back to itself!
- we ignore the pipeline and drain when dealing with the critical path
Critical Paths and Loops
for(int i = 0; i < n; ++i)
c[i] = a[i] * b[i];
mov (%rdi,%rax,4),%ecx
imul (%rsi,%rax,4),%ecx
mov %ecx,(%rdx,%rax,4)
mov %rax,%rcx
add $0x1,%rax
cmp %r8,%rcx
jne <first instruction>
Identifying the Critical Path for Loops
- although we would think it is the load, mult, store β the execution ends there and there are no dependencies below it
- add1 is the only one that has dependencies below it (rax keeps changing) and thus that would be the critical path
- total cycles: n-iters*lat_add
- critical pathsofr loops must go through the repeatable paths
- Cycle per element is how long it takes to execute one iteration of the loop
<aside>
π‘ Identifying the Critical Path:
- Keep in mind the dependencies that are needed for the cycle to complete or execute.
- For loops, follow the repeatable paths rather than the single time executable paths.
</aside>
The Bounds of Performance Limitations
- Limited by instruction dependences β no instruction level parallelism
- characterized usually by latency bound: the length in cycles of the critical path through the instructions
- Limited by execution resources β lots of instruction-level parallelism but not enough hardware to actually conduct the task
- characterized usually by throughput bound: ignoring the dependences, what is the execution time given your limited resources
- The overall bound is the slower of the two bounds.
Calculating Bounds
Initial Example:
- consider you have the dorms, with inifinate washers and dryers, and your parents house with just one washer and dryer
- in the dorms, the slowest thing will be the throughput bound because you have infinate resources (washer dryers)
- in each washer dryer, you wait 1 cycle for wash, and 2 cycles for dry so latency bound is 3 cycles
- in the parentβs house, you only have 2 washers and dryers so you can pipeline, but the throughput is long
- Cycle 1: W: L1 D: N/A β Cycle 2: W: L2 D: L1 β Cycle 3: W: L3 D: L1 β
Cycle 4: W: L4 D: L2 β Cycle 5: W: L4 D: L2 β Cycle 6: W: N/A D: L3 β
Cycle 7: W: L2 D: L3 β Cycle 8: W: L2 D: L4 β
- Basically it is 2 times the slowest resource that you own
Haswell Execution Units:
<aside>
π‘ The reason why many of the FUβs have low throughput is that we can go ahead and split up complicated instructions into smaller stages that can be pipelined.
In the following example, consider that the mult takes 3 time units in that operation.
</aside>