Assume a valid‑mode 1‑D convolution that produces an output y of length M = N − K + 1 from an input x (length N) and a kernel h (length K), with K ≤ N.
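For reference, here is a minimal single‑threaded sketch of that definition. It assumes the flipped‑kernel convention, y[i] = sum over k of x[i + k] * h[K − 1 − k], which the question does not state; the function name `conv1d_valid` is hypothetical. A multithreaded version can be checked against it.

```python
def conv1d_valid(x, h):
    """Reference valid-mode 1-D convolution (single-threaded).

    Assumed convention (flipped kernel):
        y[i] = sum_{k=0}^{K-1} x[i + k] * h[K - 1 - k],  0 <= i < M
    where M = N - K + 1.
    """
    n, k = len(x), len(h)
    if k == 0 or k > n:
        raise ValueError("requires 1 <= K <= N")
    m = n - k + 1  # output length
    return [sum(x[i + j] * h[k - 1 - j] for j in range(k)) for i in range(m)]
```

Note the two extremes this formula implies: with K = 3 the output has N − 2 elements, each a 3‑term sum, while with K = N the output is a single N‑term dot product.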
Design and implement a multithreaded CPU version. For each case, describe work partitioning, scheduling, synchronization, cache locality, false‑sharing avoidance, vectorization (SIMD), and how to combine partial results. Provide pseudocode or an API‑level design.
Cases:
(a) input length N = 1,000,000; kernel length K = 3.
(b) input length N = 1,000,000; kernel length K = 1,000,000.
(c) maximum worker threads = 100.
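One possible starting point for the partitioning, sketched under stated assumptions rather than as a definitive answer: when there are many outputs and a short kernel, as in (a), split the output range into contiguous blocks, one per worker, so each thread writes only its own block and no synchronization is needed beyond the final join. When M is tiny and K is huge, as in (b) where M = 1, split the reduction over k instead and add the partial sums at the end. The names `split_range` and `conv1d_valid_parallel`, the `m >= k` switch between the two strategies, and the flipped‑kernel convention are all assumptions made for illustration. Python threads are used only to show the structure; because of the interpreter lock they will not give a real speedup here, and a production version would use native threads with SIMD inner loops.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 100  # thread cap taken from (c)


def split_range(total, parts):
    """Split range(total) into at most `parts` contiguous, near-equal (start, stop) pairs."""
    parts = max(1, min(parts, total))
    base, extra = divmod(total, parts)
    bounds, start = [], 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def conv1d_valid_parallel(x, h, workers=MAX_WORKERS):
    n, k = len(x), len(h)
    if k == 0 or k > n:
        raise ValueError("requires 1 <= K <= N")
    m = n - k + 1
    workers = max(1, min(workers, MAX_WORKERS))
    hr = h[::-1]  # flip the kernel once so every inner loop is a plain dot product

    with ThreadPoolExecutor(max_workers=workers) as pool:
        if m >= k:  # assumed heuristic: many outputs, short kernel, as in (a)
            def block(bounds):
                lo, hi = bounds
                # Each worker produces a private contiguous slice of y.
                return [sum(x[i + j] * hr[j] for j in range(k)) for i in range(lo, hi)]

            y = []
            for part in pool.map(block, split_range(m, workers)):  # map keeps order
                y.extend(part)
            return y

        # Few outputs, long kernel, as in (b): partition the reduction over k.
        def partial(args):
            i, (lo, hi) = args
            return sum(x[i + j] * hr[j] for j in range(lo, hi))

        chunks = split_range(k, workers)
        # Combine step: add the per-worker partial sums for each output element.
        return [sum(pool.map(partial, [(i, c) for c in chunks])) for i in range(m)]
```

In the output‑partitioned path, adjacent workers read K − 1 overlapping input elements at block boundaries but never write to the same location, so false sharing can only occur at block edges; aligning block boundaries to cache‑line multiples would avoid even that. In the reduction path, the order of floating‑point additions differs from the serial loop, so results may differ in the last bits.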