Build a Custom CompletableFuture: Async Primitive and Parallel Array Processing
Company: Google
Role: Software Engineer
Category: Software Engineering Fundamentals
Interview Round: Technical Screen
You are asked to design and implement, from scratch, a simplified asynchronous "future / promise" abstraction in the spirit of Java's `CompletableFuture` — call it `AsyncFuture<T>`. Go deep into the underlying mechanics, not just the public API: the thread-pool execution model, the completion state machine, how callbacks behave when they are registered *before* versus *after* completion, and how you guarantee thread safety and memory visibility across threads.
Then use that primitive to build a parallel data-processing routine that splits a large input array into chunks, processes the chunks concurrently on the same primitive, and combines the per-chunk results into a single final result — without blocking a worker thread.
The interviewer is probing two things: (1) your grasp of the multi-threading architecture and concurrency-control scheme behind a future, and (2) your ability to write correct array-splitting / fan-out-fan-in code on top of it.
### Constraints & Assumptions
- Language is Java. You may **not** use `java.util.concurrent.CompletableFuture` itself, but you may use lower-level primitives (`ExecutorService`, `ReentrantLock`/`Condition`, `synchronized`, atomics, queues).
- Single JVM, in-memory; no distributed concerns.
- A future is completed **at most once** (either with a value or with a throwable). Every registered callback must run **exactly once**, regardless of whether it was registered before or after completion.
- Composition must be **non-blocking**: transforming or combining futures must not block a pool thread waiting on `get()`.
- The array-processing reduce operation is **associative**; element order is not significant to the final reduced value.
- Input arrays can be large (think millions of elements); the per-chunk work may be CPU-bound or IO-bound.
- The implementation must be correct under concurrent callback registration and completion from multiple threads.
### Clarifying Questions to Ask
- Do we need cancellation and timeouts, or only completion plus composition (`thenApply` / `thenCompose` / combine)?
- On which thread should callbacks run — the thread that completes the future, the pool, or a caller-supplied executor? Are there ordering guarantees among multiple callbacks on the same future?
- Is the chunk reduce operation guaranteed associative, and does element order matter for the final result?
- What is the expected input size, and is the per-chunk work CPU-bound or IO-bound (this drives pool sizing)?
- If one chunk fails, should the whole computation fail fast, or should we collect partial results?
- May I rely on JDK concurrency utilities (`ExecutorService`, `ReentrantLock`, atomics), or must this be built from raw threads and `synchronized` only?
### Part 1 — The core async future primitive
Design and implement `AsyncFuture<T>` with at least: a static factory `supplyAsync(Supplier<T>, Executor)` that runs the supplier on the pool and completes the future; the composition operators `thenApply(Function<T,U>)`, `thenCompose(Function<T, AsyncFuture<U>>)`, and `thenCombine(AsyncFuture<U>, BiFunction<T,U,R>)`; a blocking `get()`; and exception propagation (`completeExceptionally`, with errors flowing through the composition chain). Explain the completion state machine and how you resolve the race between registering a callback and the future completing.
```hint State machine
Model exactly two observable phases — PENDING and COMPLETED — backed by a single result slot that holds *either* a value *or* a throwable. The transition to COMPLETED must happen exactly once; gate it with a lock (or a CAS) so a second `complete*` call is a no-op.
```
```hint The core race
A callback may be added before *or* after completion. Resolve it under the same lock that guards the state: if still PENDING, enqueue the callback; if already COMPLETED, run it immediately with the stored result. Never lose a callback and never run one twice. Run user callbacks *outside* the lock to avoid holding it during arbitrary code.
```
```hint Composition
`thenApply` creates a fresh downstream `AsyncFuture`, then registers a callback on `this` that — on success — applies the function and completes the downstream, and on failure propagates the throwable. `thenCompose` does the same but the function returns a future, so you must *flatten*: register a second callback on the inner future to complete the downstream.
```
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### Part 2 — Array-splitting parallel processing
Using `AsyncFuture`, implement a routine `parallelProcess(int[] data, int chunkSize, Function<int[], Long> mapChunk, BinaryOperator<Long> reduce, Executor pool)` that splits `data` into chunks of size `chunkSize`, processes each chunk asynchronously (one `supplyAsync` per chunk), and asynchronously folds the per-chunk results into a single `AsyncFuture<Long>`. Walk through how you split, dispatch, and combine, and how you choose `chunkSize` relative to the pool size.
```hint Splitting
Compute chunk boundaries by index: chunk `start` covers `[start, min(start + chunkSize, n))`. Guard the last partial chunk, the empty-input case, and `chunkSize >= n`. Decide whether to copy each slice (self-contained tasks) or pass offsets to avoid copying.
```
```hint Combine without blocking
Launch one `supplyAsync` per chunk to get a `List<AsyncFuture<Long>>`, then fold that list with your `thenCombine` + `reduce` so the aggregate future completes only when every chunk is done. Do **not** call `get()` inside a pool task — a blocked worker waiting on work that needs another worker can starve or deadlock the pool.
```
#### What This Part Should Cover
```premium-lock What This Part Should Cover
```
### What a Strong Answer Covers
```premium-lock What a Strong Answer Covers
```
### Follow-up Questions
- How would you add cancellation and timeouts? What should happen to a callback that is already mid-execution when cancellation arrives?
- Your callbacks currently run on whichever thread completes the future. What are the risks (deep recursion through a long chain, a slow callback blocking the completer), and how would you offload callbacks to an executor?
- How exactly does `thenCompose` differ from `thenApply` when the function returns a future, and why does flattening matter for correctness?
- How would you implement `allOf` over N futures using a single atomic counter instead of nesting `thenCombine` calls pairwise, and how does that change the critical-path depth?
Quick Answer: This question evaluates a candidate's understanding of concurrency primitives, specifically the internal mechanics of future/promise abstractions such as thread-pool execution, completion state machines, and thread-safe callback registration. It tests practical software engineering skill by requiring a non-blocking fan-out/fan-in implementation for parallel array processing built on that primitive, commonly used to assess depth of multi-threading knowledge beyond basic API usage.