How do I approach Software Engineering Fundamentals interview questions?

Software Engineering Fundamentals questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master software engineering fundamentals interviews.

What difficulty level is this interview question?

This is a medium difficulty Software Engineering Fundamentals question, commonly asked during Technical Screen rounds at Anthropic.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Anthropic during technical interviews.

Design a Parallel Image Processor | Anthropic Interview Question

Q: Design a Parallel Image Processor

This question evaluates a candidate's competency in designing a correct, deterministic image-processing component for shared-memory systems, including data structures, filter interfaces, neighborhood and boundary handling, and thread-safe parallelization.

Design an image-processing component that applies one or more filters to an image and produces a new image. An image is a 2D grid of pixels (with one or more channels per pixel, e.g. RGB or RGBA). The filters of interest include:

Pixel-wise transforms (e.g. grayscale, brightness, threshold), where each output pixel depends only on the corresponding input pixel.
Neighborhood / convolution-style transforms (e.g. Gaussian blur, box blur, edge detection, sharpen), where each output pixel depends on a small window of surrounding input pixels.

Walk through your design in two stages:

Single-processor design — a correct, clean sequential implementation: the data structures, the filter interface, how filters chain, and how boundary pixels are handled.
Parallel design — how you extend the same design to run efficiently across multiple threads or processors on a single machine, and how you validate that it is both correct and faster.

This is an open-ended design discussion: there is no single right answer, but the interviewer is looking for a clear API, correct handling of neighborhoods and boundaries, a race-free parallelization strategy, and a credible story for measuring speedup.

Constraints & Assumptions

State your assumptions explicitly; reasonable defaults for this problem:

Single machine, shared memory. "Multiple processors" means threads/cores in one process (not a distributed cluster). Design so the core abstraction could later extend to multiple machines, but optimize for shared memory first.
Image sizes range from small thumbnails ( $\sim 256 \times 256$ ) up to large images ( $\sim 8000 \times 8000$ , tens to hundreds of MB). The full image fits in RAM.
Channels: 1 (grayscale), 3 (RGB), or 4 (RGBA), 8 bits per channel; intermediate computation may use a wider type (e.g. float / int32 ) to avoid overflow.
Filters are provided as plug-ins implementing a common interface. A request may chain several filters in sequence.
Goal: correctness first, then throughput; output must be deterministic and identical regardless of thread count.

The Problem

Address, at minimum:

Core data structures and the filter interface (how a filter is invoked, what it reads and writes).
How a neighborhood filter is expressed (kernel) and how boundary pixels are handled at the image edges.
How parallel work is partitioned across threads.
How you avoid race conditions and incorrect shared-state updates.
Synchronization, scheduling, and chaining of multiple filters.
Performance trade-offs : memory locality, false sharing, memory-bandwidth limits, scalability.
How you test correctness and measure speedup .

Clarifying Questions to Ask

Are filters strictly pixel-wise, or must I support arbitrary convolution kernels (which size — $3\times3$ , $5\times5$ , separable)?
Is the deployment single-machine shared-memory, or do I need to scale across machines later?
What pixel format(s) must I support (channels, bit depth), and must output be bit-exact deterministic across thread counts?
What is the latency vs. throughput priority — one large image fast, or many images concurrently?
Which edge-handling policy is required, or is it configurable per request?
Are filters trusted in-process plug-ins, or untrusted code that needs isolation?

What a Strong Answer Covers

Clear data model : contiguous buffer with (width, height, channels, stride) metadata; separate immutable input and writable output buffers.
Clean filter abstraction that works for both pixel-wise and neighborhood filters (e.g. a per-region apply(input, output, region) or per-pixel kernel), composable into a chain.
Correct boundary semantics — both image-edge policy and partition-edge handling — kept separate and explicit.
Race-free parallel model : read-only input + disjoint output partitions ⇒ no locks in the inner loop; per-thread metric accumulation merged at the end.
Sound scheduling : thread pool, row/tile tasks, barrier between pipeline stages, buffer ping-pong, work-stealing for load balance.
Performance reasoning : cache locality, false sharing, bandwidth limits, Amdahl/overhead on small inputs, SIMD as an orthogonal axis.
A real test & measurement plan : equality against a sequential reference, edge/format/determinism tests, and a speedup curve vs. thread count with an explanation of where and why it flattens.

Follow-up Questions

Your blur is a separable Gaussian. How does separating it into a horizontal then vertical 1-D pass change the parallelization, the buffering, and the cost from $O(k^2)$ to $O(k)$ per pixel?
A worker finishes its rows much earlier than the others on a filter whose cost varies across the image. How do you keep all cores busy, and what's the trade-off vs. static partitioning?
How would you extend this to images too large to fit in RAM, or to a cluster of machines? What changes in the data model, partitioning, and boundary handling?
You observe only $3\times$ speedup on 16 cores for a grayscale filter. Walk through how you'd diagnose whether it's bandwidth, false sharing, synchronization, or scheduling.

Pixel-wise transforms (e.g. grayscale, brightness, threshold), where each output pixel depends only on the corresponding input pixel.
Neighborhood / convolution-style transforms (e.g. Gaussian blur, box blur, edge detection, sharpen), where each output pixel depends on a small window of surrounding input pixels.

Walk through your design in two stages:

Single-processor design — a correct, clean sequential implementation: the data structures, the filter interface, how filters chain, and how boundary pixels are handled.
Parallel design — how you extend the same design to run efficiently across multiple threads or processors on a single machine, and how you validate that it is both correct and faster.

Constraints & Assumptions

State your assumptions explicitly; reasonable defaults for this problem:

Single machine, shared memory. "Multiple processors" means threads/cores in one process (not a distributed cluster). Design so the core abstraction could later extend to multiple machines, but optimize for shared memory first.
Image sizes range from small thumbnails ( $\sim 256 \times 256$ ) up to large images ( $\sim 8000 \times 8000$ , tens to hundreds of MB). The full image fits in RAM.
Channels: 1 (grayscale), 3 (RGB), or 4 (RGBA), 8 bits per channel; intermediate computation may use a wider type (e.g. float / int32 ) to avoid overflow.
Filters are provided as plug-ins implementing a common interface. A request may chain several filters in sequence.
Goal: correctness first, then throughput; output must be deterministic and identical regardless of thread count.

The Problem

Address, at minimum:

Core data structures and the filter interface (how a filter is invoked, what it reads and writes).
How a neighborhood filter is expressed (kernel) and how boundary pixels are handled at the image edges.
How parallel work is partitioned across threads.
How you avoid race conditions and incorrect shared-state updates.
Synchronization, scheduling, and chaining of multiple filters.
Performance trade-offs : memory locality, false sharing, memory-bandwidth limits, scalability.
How you test correctness and measure speedup .

Clarifying Questions to Ask

Are filters strictly pixel-wise, or must I support arbitrary convolution kernels (which size — $3\times3$ , $5\times5$ , separable)?
Is the deployment single-machine shared-memory, or do I need to scale across machines later?
What pixel format(s) must I support (channels, bit depth), and must output be bit-exact deterministic across thread counts?
What is the latency vs. throughput priority — one large image fast, or many images concurrently?
Which edge-handling policy is required, or is it configurable per request?
Are filters trusted in-process plug-ins, or untrusted code that needs isolation?

What a Strong Answer Covers

Clear data model : contiguous buffer with (width, height, channels, stride) metadata; separate immutable input and writable output buffers.
Clean filter abstraction that works for both pixel-wise and neighborhood filters (e.g. a per-region apply(input, output, region) or per-pixel kernel), composable into a chain.
Correct boundary semantics — both image-edge policy and partition-edge handling — kept separate and explicit.
Race-free parallel model : read-only input + disjoint output partitions ⇒ no locks in the inner loop; per-thread metric accumulation merged at the end.
Sound scheduling : thread pool, row/tile tasks, barrier between pipeline stages, buffer ping-pong, work-stealing for load balance.
Performance reasoning : cache locality, false sharing, bandwidth limits, Amdahl/overhead on small inputs, SIMD as an orthogonal axis.
A real test & measurement plan : equality against a sequential reference, edge/format/determinism tests, and a speedup curve vs. thread count with an explanation of where and why it flattens.

Follow-up Questions

Your blur is a separable Gaussian. How does separating it into a horizontal then vertical 1-D pass change the parallelization, the buffering, and the cost from $O(k^2)$ to $O(k)$ per pixel?
A worker finishes its rows much earlier than the others on a filter whose cost varies across the image. How do you keep all cores busy, and what's the trade-off vs. static partitioning?
How would you extend this to images too large to fit in RAM, or to a cluster of machines? What changes in the data model, partitioning, and boundary handling?
You observe only $3\times$ speedup on 16 cores for a grayscale filter. Walk through how you'd diagnose whether it's bandwidth, false sharing, synchronization, or scheduling.

Design a Parallel Image Processor

Quick Overview

Constraints & Assumptions

The Problem

Clarifying Questions to Ask

What a Strong Answer Covers

Follow-up Questions

Solution

Submit Your Answer to Earn 20XP

Design a Parallel Image Processor

Quick Overview

Constraints & Assumptions

The Problem

Clarifying Questions to Ask

What a Strong Answer Covers

Follow-up Questions

Solution

Submit Your Answer to Earn 20XP