How do I practice coding and algorithm questions?

Use PracHub's coding console to write, test, and debug your solutions in Python or JavaScript. View hints, test against sample inputs, and compare with official solutions.

What difficulty level is this coding question?

This is a Medium difficulty Coding & Algorithms question, commonly asked during Technical Screen rounds at OpenAI.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at OpenAI during technical interviews.

Optimize C++ Performance with Provided Concurrency

Quick Overview

This question evaluates a developer's skills in C++ performance engineering and concurrent systems, covering profiling, bottleneck identification (e.g., cache locality and lock contention), and code-level performance reasoning to improve throughput and latency.

Company: OpenAI

Role: Software Engineer

Category: Coding & Algorithms

Difficulty: Medium

Interview Round: Technical Screen

Given a C++ codebase where threading components (threads, work queues, and synchronization primitives) are already provided, profile and optimize the program for throughput and latency. Identify likely bottlenecks (cache locality, memory allocation patterns, unnecessary copying vs. moving, branch misprediction, false sharing) and propose concrete code-level optimizations (container selection, preallocation/reservations, small-buffer optimization, move semantics, RAII, avoiding needless virtual dispatch). Explain how you would minimize lock contention and ensure correctness without implementing the threading primitives, including the use of lock-free data access patterns when appropriate. Outline the profiling tools and metrics you would use, how you would measure impact, and how you would validate both performance and correctness under concurrency.

Quick Answer: This question evaluates a developer's skills in C++ performance engineering and concurrent systems, covering profiling, bottleneck identification (e.g., cache locality and lock contention), and code-level performance reasoning to improve throughput and latency.

The threading primitives in a C++ pipeline are already implemented. Your job is to model a high-level optimization decision: how many workers to activate and how to split the work. You are given an array `tasks` where `tasks[i]` is the cost of the i-th task in execution order. To preserve cache locality, each active worker must receive exactly one non-empty contiguous block of tasks, and the blocks must cover the full array in order. If you activate `m` workers, the program's total runtime is defined as `max(block_sum) + overhead * m`, where `max(block_sum)` is the largest total task cost assigned to any worker and `overhead` is a fixed global cost per active worker representing scheduling, locking, and synchronization overhead. Using more workers can reduce the heaviest block, but increases overhead. Return the minimum possible runtime using at most `k` workers. If `tasks` is empty, return `0`.

Constraints

0 <= len(tasks) <= 300
0 <= tasks[i] <= 10^6
1 <= k <= 50
0 <= overhead <= 10^6

Examples

Input: ([10, 3, 3], 2, 8)

Expected Output: 24

Explanation: Using 1 worker gives runtime 16 + 8 = 24. Using 2 workers gives best partition [10] and [3,3], so runtime is 10 + 16 = 26. The minimum is 24.

Input: ([5, 1, 2], 10, 2)

Expected Output: 9

Explanation: Even though up to 10 workers are allowed, only up to 3 can be used because blocks must be non-empty. The best choice is 2 workers: [5] and [1,2], giving max block sum 5 and runtime 5 + 2*2 = 9.

Input: ([6, 2, 4, 7], 4, 5)

Expected Output: 21

Explanation: With 2 workers, partition as [6,2] and [4,7]. The largest block sum is 11, so runtime is 11 + 10 = 21. Using more workers lowers the max block sum but increases overhead too much.

Input: ([7, 2, 5, 10, 8], 3, 2)

Expected Output: 20

Explanation: The best choice is 3 workers: [7,2,5], [10], [8]. The largest block sum is 14, so runtime is 14 + 6 = 20.

Input: ([], 5, 7)

Expected Output: 0

Explanation: There are no tasks, so no workers are needed and the runtime is 0.

Input: ([0, 0, 0], 2, 3)

Expected Output: 3

Explanation: The maximum block sum is 0 no matter how tasks are split. Since each worker adds overhead, using 1 worker is best: 0 + 3 = 3.

Solution

def solution(tasks, k, overhead):
    n = len(tasks)
    if n == 0:
        return 0

    limit = min(k, n)
    max_task = max(tasks)
    total = sum(tasks)

    def workers_needed(max_block_sum):
        workers = 1
        current = 0
        for cost in tasks:
            if current + cost <= max_block_sum:
                current += cost
            else:
                workers += 1
                current = cost
        return workers

    best = float('inf')

    for m in range(1, limit + 1):
        lo, hi = max_task, total
        while lo < hi:
            mid = (lo + hi) // 2
            if workers_needed(mid) <= m:
                hi = mid
            else:
                lo = mid + 1

        runtime = lo + overhead * m
        if runtime < best:
            best = runtime

    return best

Time complexity: O(min(k, n) * n * log(sum(tasks))). Space complexity: O(1).

Hints

For a fixed number of active workers, this becomes partitioning the array into contiguous blocks while minimizing the largest block sum.
Use prefix sums to get any block sum in O(1), then apply dynamic programming over the number of workers and the last split point.

Quick Overview