Debug a Concurrent Job Scheduler
Company: OpenAI
Role: Machine Learning Engineer
Category: Software Engineering Fundamentals
Difficulty: medium
Interview Round: Technical Screen
You are given a buggy Python job scheduler that runs many independent jobs concurrently. Each job has an ID, a callable, a maximum retry count, and a terminal status: succeeded or failed. The scheduler maintains pending, running, completed, and failed job sets; uses worker threads or asynchronous tasks; enforces a rate limit of at most R job starts per second; and records metrics such as job start time, finish time, latency, retry count, and final status.
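To make the description concrete, here is a minimal sketch of the job model in Python. The names `JobStatus`, `Job`, and `JobTable` are hypothetical, not taken from the scheduler under test. The sketch assumes a single status field guarded by one lock in place of four separate sets, so a job cannot appear in two sets at once.

```python
import enum
import threading
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


class JobStatus(enum.Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Job:
    job_id: str
    fn: Callable[[], Any]
    max_retries: int = 0
    status: JobStatus = JobStatus.PENDING
    attempts: int = 0
    start_time: Optional[float] = None
    finish_time: Optional[float] = None


class JobTable:
    """Hypothetical registry: every state change happens under one lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._jobs: Dict[str, Job] = {}

    def add(self, job: Job) -> None:
        with self._lock:
            self._jobs[job.job_id] = job

    def transition(self, job_id: str, expected: JobStatus, new: JobStatus) -> bool:
        # Compare-and-set: only one caller can win a given transition,
        # which prevents two workers from claiming the same job.
        with self._lock:
            job = self._jobs[job_id]
            if job.status is not expected:
                return False
            job.status = new
            return True

    def count(self, status: JobStatus) -> int:
        with self._lock:
            return sum(1 for j in self._jobs.values() if j.status is status)
```

A worker would claim a job with `table.transition(job_id, JobStatus.PENDING, JobStatus.RUNNING)` and skip the job when the call returns `False`.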
Your task is to debug and improve the scheduler.
Address the following:
1. Find possible data races, deadlocks, and lock-contention hot spots.
2. Verify whether the rate limiter is correct under concurrency.
3. Write tests that demonstrate the scheduler works correctly under success, failure, retry, cancellation, and high-concurrency scenarios.
4. Compute the total time needed to schedule a batch of jobs and the final job success rate.
5. Explain what instrumentation or logs you would add to make future debugging easier.
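For item 2, one way to reason about rate-limiter correctness is to compare the buggy code against a small reference implementation. The sketch below is an assumption, not the scheduler's actual limiter. It reads "at most R job starts per second" as "at most R starts in any one-second sliding window". The class `SlidingWindowLimiter` and the helper `max_starts_per_window` are hypothetical names. The clock and sleep functions are injectable so tests can run without real waiting.

```python
import collections
import threading
import time


class SlidingWindowLimiter:
    """Admit at most `rate` starts in any one-second window."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        if rate < 1:
            raise ValueError("rate must be at least 1")
        self._rate = rate
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._starts = collections.deque()  # timestamps of admitted starts

    def acquire(self):
        """Block until a start is allowed; return the admitted timestamp."""
        while True:
            with self._lock:
                now = self._clock()
                # Drop starts that have left the one-second window.
                while self._starts and now - self._starts[0] >= 1.0:
                    self._starts.popleft()
                if len(self._starts) < self._rate:
                    # Check and record happen under the same lock, so two
                    # threads cannot both take the last free slot.
                    self._starts.append(now)
                    return now
                wait = 1.0 - (now - self._starts[0])
            # Sleep outside the lock so waiting threads do not block others.
            self._sleep(wait)


def max_starts_per_window(stamps, window=1.0):
    """Largest number of timestamps that fall within any single window."""
    stamps = sorted(stamps)
    best = 0
    lo = 0
    for hi, t in enumerate(stamps):
        while t - stamps[lo] >= window:
            lo += 1
        best = max(best, hi - lo + 1)
    return best
```

A concurrency test can collect the timestamps returned by `acquire()` from many threads and assert that `max_starts_per_window(stamps) <= R`. Two bugs worth looking for in the original limiter are a check-then-increment that is not atomic, and a sleep performed while holding the lock.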
Quick Answer: This question evaluates debugging and design skills for concurrent systems, covering identification of data races, deadlocks, lock contention, correctness of rate limiting under concurrency, testing for retries and failures, and adding instrumentation for observability.
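The batch metrics from item 4 can be derived from per-job records. The sketch below is illustrative and rests on stated assumptions. `JobRecord`, `summarize`, and `min_schedule_seconds` are hypothetical names. Total time is taken as the span from the earliest start to the latest finish. The scheduling lower bound assumes a limiter that admits up to R starts at once and then R more each second, with every retry counting as one start.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class JobRecord:
    job_id: str
    start_time: float   # first attempt start, in seconds
    finish_time: float  # final attempt finish, in seconds
    attempts: int       # 1 + number of retries
    succeeded: bool


def summarize(records: Iterable[JobRecord]) -> Dict[str, float]:
    """Aggregate batch-level metrics from finished job records."""
    records = list(records)
    if not records:
        return {"total_time": 0.0, "success_rate": 0.0,
                "mean_latency": 0.0, "total_attempts": 0}
    first_start = min(r.start_time for r in records)
    last_finish = max(r.finish_time for r in records)
    latencies = [r.finish_time - r.start_time for r in records]
    return {
        "total_time": last_finish - first_start,
        "success_rate": sum(r.succeeded for r in records) / len(records),
        "mean_latency": sum(latencies) / len(latencies),
        "total_attempts": sum(r.attempts for r in records),
    }


def min_schedule_seconds(total_starts: int, rate: int) -> int:
    """Earliest second at which the last start can be admitted.

    Assumes R starts are allowed immediately, then R more per second.
    """
    if total_starts <= 0:
        return 0
    return (total_starts - 1) // rate
```

For example, 5 starts at R = 2 cannot finish being scheduled before second 2, because the starts are admitted at seconds 0, 0, 1, 1, and 2. Retries raise `total_starts` and therefore raise this bound.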