Design a Parallel Image Processor
Company: Anthropic
Role: Software Engineer
Category: Software Engineering Fundamentals
Difficulty: medium
Interview Round: Technical Screen
Design an **image-processing component** that applies one or more filters to an image and produces a new image. An image is a 2D grid of pixels (with one or more channels per pixel, e.g. RGB or RGBA). The filters of interest include:
- **Pixel-wise transforms** (e.g. grayscale, brightness, threshold), where each output pixel depends only on the corresponding input pixel.
- **Neighborhood / convolution-style transforms** (e.g. Gaussian blur, box blur, edge detection, sharpen), where each output pixel depends on a small window of surrounding input pixels.
Walk through your design in two stages:
1. **Single-processor design** — a correct, clean sequential implementation: the data structures, the filter interface, how filters chain, and how boundary pixels are handled.
2. **Parallel design** — how you extend the same design to run efficiently across multiple threads or processors on a single machine, and how you validate that it is both correct and faster.
This is an open-ended design discussion: there is no single right answer, but the interviewer is looking for a clear API, correct handling of neighborhoods and boundaries, a race-free parallelization strategy, and a credible story for measuring speedup.
### Constraints & Assumptions
State your assumptions explicitly; reasonable defaults for this problem:
- **Single machine, shared memory.** "Multiple processors" means threads/cores in one process (not a distributed cluster). Design so the core abstraction could later extend to multiple machines, but optimize for shared memory first.
- **Image sizes** range from small thumbnails ($\sim 256 \times 256$) up to large images ($\sim 8000 \times 8000$, tens to hundreds of MB). The full image fits in RAM.
- **Channels:** 1 (grayscale), 3 (RGB), or 4 (RGBA), 8 bits per channel; intermediate computation may use a wider type (e.g. `float`/`int32`) to avoid overflow.
- **Filters** are provided as plug-ins implementing a common interface. A request may chain several filters in sequence.
- **Goal:** correctness first, then throughput; output must be deterministic and identical regardless of thread count.
### The Problem
Address, at minimum:
- Core data structures and the **filter interface** (how a filter is invoked, what it reads and writes).
- How a neighborhood filter is expressed (kernel) and how **boundary pixels** are handled at the image edges.
- How parallel **work is partitioned** across threads.
- How you **avoid race conditions** and incorrect shared-state updates.
- **Synchronization, scheduling, and chaining** of multiple filters.
- **Performance trade-offs**: memory locality, false sharing, memory-bandwidth limits, scalability.
- How you **test correctness** and **measure speedup**.
```hint Where to start
Separate the two hard questions. (1) *Correctness*: an output pixel that reads a neighborhood must read the **original** input, never partially-filtered values — so keep a read-only input buffer and a separate output buffer rather than filtering in place. (2) *Parallelism*: which pixels can be computed independently?
```
```hint Think about what's safe to share
During a single filter pass, is the input ever modified? If not, what does that let many workers do with it at the same time — and what must be true about *where* each worker is allowed to write so that two of them never collide?
```
```hint One word, two meanings
"Boundary" could mean two different things here: a pixel at the edge of the *image*, and a pixel at the edge of one *worker's* region. Are they the same problem? Does it matter whether all workers read from one shared input buffer or each copies its own slice — and for which of the two does that choice actually change anything?
```
```hint Splitting the work and chaining stages
When you carve the image up among workers, what shape of piece keeps memory access fast, and how do you keep every core busy if some pieces turn out more expensive than others? And when one filter feeds into the next, what has to be true before the second filter may start reading — and where does its output go?
```
```hint When more cores stop helping
Suppose you parallelize a trivially cheap filter and speedup stalls far below the core count. What shared resource might you be hitting that adding threads can't fix? And what subtle interaction between threads writing *near* each other (rather than *to* the same pixel) could quietly destroy throughput?
```
### Clarifying Questions to Ask
- Are filters strictly pixel-wise, or must I support arbitrary convolution kernels (which size — $3\times3$, $5\times5$, separable)?
- Is the deployment single-machine shared-memory, or do I need to scale across machines later?
- What pixel format(s) must I support (channels, bit depth), and must output be bit-exact deterministic across thread counts?
- What is the latency vs. throughput priority — one large image fast, or many images concurrently?
- Which edge-handling policy is required, or is it configurable per request?
- Are filters trusted in-process plug-ins, or untrusted code that needs isolation?
### What a Strong Answer Covers
- **Clear data model**: contiguous buffer with `(width, height, channels, stride)` metadata; separate immutable input and writable output buffers.
- **Clean filter abstraction** that works for both pixel-wise and neighborhood filters (e.g. a per-region `apply(input, output, region)` or per-pixel kernel), composable into a chain.
- **Correct boundary semantics** — both image-edge policy and partition-edge handling — kept separate and explicit.
- **Race-free parallel model**: read-only input + disjoint output partitions ⇒ no locks in the inner loop; per-thread metric accumulation merged at the end.
- **Sound scheduling**: thread pool, row/tile tasks, barrier between pipeline stages, buffer ping-pong, work-stealing for load balance.
- **Performance reasoning**: cache locality, false sharing, bandwidth limits, Amdahl/overhead on small inputs, SIMD as an orthogonal axis.
- **A real test & measurement plan**: equality against a sequential reference, edge/format/determinism tests, and a speedup curve vs. thread count with an explanation of where and why it flattens.
### Follow-up Questions
- Your blur is a separable Gaussian. How does separating it into a horizontal then vertical 1-D pass change the parallelization, the buffering, and the cost from $O(k^2)$ to $O(k)$ per pixel?
- A worker finishes its rows much earlier than the others on a filter whose cost varies across the image. How do you keep all cores busy, and what's the trade-off vs. static partitioning?
- How would you extend this to images too large to fit in RAM, or to a cluster of machines? What changes in the data model, partitioning, and boundary handling?
- You observe only $3\times$ speedup on 16 cores for a grayscale filter. Walk through how you'd diagnose whether it's bandwidth, false sharing, synchronization, or scheduling.
Quick Answer: This question evaluates a candidate's competency in designing a correct, deterministic image-processing component for shared-memory systems, including data structures, filter interfaces, neighborhood and boundary handling, and thread-safe parallelization.