Optimize CPU Multithreaded 1D "Valid" Convolution
Context
You are given a 1D input array x of length N and a kernel h of length K. The "valid" convolution produces an output array y of length L_out = N − K + 1, where:
-
y[i] = sum_{j=0}^{K-1} x[i + j] * h[K − 1 − j]
-
Index i ranges from 0 to L_out − 1.
Assume a typical CPU with a shared last-level cache, private L1/L2 per core, 64-byte cache lines, and support for multiple threads.
Task
Optimize the valid 1D convolution for CPU hardware using multithreading. For each case, describe and implement (code or pseudocode) how to:
-
Partition work,
-
Assign threads,
-
Combine results,
-
Address load balancing, cache locality, and avoiding false sharing,
-
Synchronize/merge outputs,
-
Choose chunk sizes and the number of threads.
Consider these cases:
-
N = 1,000,000; K = 3,
-
N = 1,000,000; K = 1,000,000,
-
Maximum available threads = 100.