VLIW Instruction Scheduling and Software Pipelining
You are given a 4-issue VLIW machine with these functional units and latencies:
-
2 ALUs (ALU latency = 1 cycle)
-
1 Load/Store (LS) unit (load latency = 3 cycles; assume stores commit in 1 cycle and have no consumer)
-
1 Special-Function Unit (SFU) (latency = 2 cycles)
Goal: Schedule a basic block to maximize parallel issue and minimize stalls. Explain hazard handling (data and structural), discuss register pressure, and then describe a software-pipelined schedule for a looped version. Finally, explain how the schedule changes if average memory latency increases by one cycle (loads: 4 cycles).
Assumptions (to make the problem concrete):
-
Below is the basic block as a dependency DAG (each node lists op type, output register, and inputs). The DAG is typical of a numeric kernel computing two addresses, loading two operands, combining them, applying a special function, and storing the result. All registers are virtual (renaming allowed); memory addresses are independent except where stated.
Nodes (operation → latency → dependencies):
-
A1: r5 = r1 + r2 [ALU→1] deps: —
-
A2: r6 = r3 - r4 [ALU→1] deps: —
-
L1: r7 = LD [r5] [LS→3] deps: A1
-
L2: r8 = LD [r6 + 8] [LS→3] deps: A2
-
A3: r9 = r7 + r8 [ALU→1] deps: L1, L2
-
S1: r10 = SFU(r9) [SF→2] deps: A3
-
A4: r11 = r10 * r12 [ALU→1] deps: S1
-
ST1: ST [r13] = r11 [LS→1] deps: A4
-
A5: r14 = r15 + r16 [ALU→1] deps: — (independent work to hide latency)
-
S2: r17 = SFU(r14) [SF→2] deps: A5 (independent special op)
Constraints per cycle (issue width): at most 2 ALUs, 1 LS, and 1 SFU can be issued in the same cycle.
Tasks:
-
Produce a legal cycle-by-cycle instruction schedule for the basic block that maximizes parallel issue and minimizes stalls.
-
Explain how you handled hazards (RAW/WAR/WAW, functional-unit conflicts) and how you managed register pressure.
-
Suppose this basic block is the loop body. Propose a software-pipelined (modulo) schedule for the steady-state kernel; explain your chosen initiation interval (II) and provide a kernel mapping.
-
Re-do the analysis when average memory latency increases by one cycle (load latency = 4). Explain the changes to the basic-block schedule and to the loop kernel (including any change in II).