You are given a mocked “core kernel” function (similar in spirit to a GPU kernel / tight compute loop) that is functionally correct but slow.
Task
- Optimize the kernel to improve performance as much as possible within a fixed timebox (e.g., ~2 hours).
- You may use typical low-level optimization techniques such as:
  - loop unrolling
  - memory access optimization (e.g., coalescing / cache-friendly access)
  - reducing allocations and copies
  - operator fusion / reducing intermediate buffers
  - vectorization (SIMD) and/or parallelism where applicable
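A minimal sketch of the "reducing allocations" and "operator fusion" points, using a hypothetical NumPy kernel (the actual kernel in the task is mocked, so the function names and the computation here are illustrative assumptions): the baseline allocates a fresh intermediate array at every step, while the optimized version reuses a single output buffer with in-place operations.

```python
import numpy as np

def kernel_baseline(x):
    # Hypothetical slow kernel: every step allocates a new temporary array.
    a = x * 2.0        # temporary 1
    b = np.sin(a)      # temporary 2
    c = b + 1.0        # temporary 3 (the result)
    return c

def kernel_fused(x, out=None):
    # Same math, sin(2*x) + 1, fused into in-place ufunc calls
    # on one preallocated buffer: no intermediates beyond `out`.
    if out is None:
        out = np.empty_like(x)
    np.multiply(x, 2.0, out=out)
    np.sin(out, out=out)
    out += 1.0
    return out
```

Passing a reusable `out` buffer also lets a caller in a hot loop avoid one allocation per invocation, which is the same idea as reducing copies in the bullet list above.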
- Provide:
  - Your optimized implementation
  - A short write-up explaining what you changed and why
  - Benchmarks showing the speedup vs. the baseline
  - Evidence that you preserved correctness (tests or checks)
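The benchmark and correctness deliverables can share one small harness. A sketch, assuming Python and two hypothetical functions `kernel_baseline` and `kernel_optimized` (names are placeholders, not part of the task): time each implementation with warmup runs, report the best-of-N wall-clock time, and assert the outputs match.

```python
import time
import numpy as np

def bench(fn, x, repeats=5, warmup=2):
    # Warmup runs absorb one-time costs (JIT, cache fill, lazy imports).
    for _ in range(warmup):
        fn(x)
    # Best-of-N is a common choice: it filters out scheduler noise.
    best = min(
        (time.perf_counter() - t0)
        for t0 in (time.perf_counter() for _ in range(repeats))
        if fn(x) is not None or True  # run fn(x) inside the timed window
    )
    return best

def compare(baseline, optimized, x):
    # Correctness first: identical output semantics (use rtol/atol
    # appropriate to the kernel's precision requirements).
    assert np.allclose(baseline(x), optimized(x), rtol=1e-6, atol=1e-9)
    tb, to = bench(baseline, x), bench(optimized, x)
    return tb / to  # speedup factor vs. baseline
```

A usage note: report end-to-end numbers (the whole kernel call, including any buffer setup the caller must do), since that is what the constraints below ask you to optimize for.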
Constraints / expectations
- Maintain identical output semantics.
- Optimize for end-to-end runtime (not just micro-benchmarks of one line).
- Explain tradeoffs (readability vs. performance, portability, precision, etc.).