You are given a 1D input array x of length N and a kernel h of length K. The "valid" convolution produces an output array y of length L_out = N − K + 1, where:
Assume a typical CPU with a shared last-level cache, private L1/L2 per core, 64-byte cache lines, and support for multiple threads.
Optimize the valid 1D convolution for CPU hardware using multithreading. For each case, describe and implement (code or pseudocode) how to:
Consider these cases:
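As a starting point that does not depend on the specific case, here is a minimal sketch of a multithreaded valid convolution. Everything in it is an assumption layered on top of the problem text: the function name `conv1d_valid_mt`, the `float` element type, the use of `std::thread`, and the flipped-kernel convention y[i] = Σ_{k=0}^{K−1} x[i + k] · h[K − 1 − k] (if cross-correlation is intended, index `h[k]` instead). Each thread owns a contiguous slice of y, and the slice length is rounded up to a multiple of 16 floats (one 64-byte cache line) to reduce false sharing at slice boundaries.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper, not part of the problem statement.
// Assumed convention: y[i] = sum_{k=0}^{K-1} x[i + k] * h[K - 1 - k].
std::vector<float> conv1d_valid_mt(const std::vector<float>& x,
                                   const std::vector<float>& h,
                                   unsigned num_threads) {
    const std::size_t N = x.size(), K = h.size();
    assert(K >= 1 && N >= K);  // "valid" mode needs at least one full overlap
    const std::size_t L_out = N - K + 1;
    std::vector<float> y(L_out, 0.0f);

    // A 64-byte line holds 16 floats. Rounding the chunk size up to a multiple
    // of 16 outputs keeps two threads from writing the same line, provided
    // y's storage is itself 64-byte aligned (std::vector does not guarantee it).
    constexpr std::size_t kFloatsPerLine = 64 / sizeof(float);
    num_threads = std::max(1u, num_threads);
    std::size_t chunk = (L_out + num_threads - 1) / num_threads;
    chunk = (chunk + kFloatsPerLine - 1) / kFloatsPerLine * kFloatsPerLine;

    // Each worker computes outputs in [begin, end). The inputs it reads,
    // x[begin .. end + K - 2], overlap the neighbouring chunk by K - 1
    // elements, which is harmless because x and h are read-only.
    auto worker = [&](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) {
            float acc = 0.0f;  // accumulate locally, store to y once
            for (std::size_t k = 0; k < K; ++k)
                acc += x[i + k] * h[K - 1 - k];
            y[i] = acc;
        }
    };

    // Static partition: at most num_threads chunks, one thread per chunk.
    std::vector<std::thread> pool;
    for (std::size_t begin = 0; begin < L_out; begin += chunk)
        pool.emplace_back(worker, begin, std::min(begin + chunk, L_out));
    for (auto& t : pool) t.join();
    return y;
}
```

For example, with x = {1, 2, 3, 4, 5} and h = {1, 0, −1}, every output is x[i + 2] − x[i] = 2, so the result has length 3 and each entry equals 2. The static, contiguous partition is chosen because every output costs the same K multiply-adds, so load is balanced without dynamic scheduling. When K is small the kernel stays resident in L1, and the inner loop is a natural candidate for compiler or manual SIMD vectorization. On some toolchains the program must be built with `-pthread`.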