You are given a small codebase for a GPU resource manager. The repository includes a README, logs, and unit tests. Your task is to use the failing tests and logs to identify root causes, make minimal fixes, and explain your changes.
The system assigns jobs to GPUs. Each job has fields such as required GPU type, memory requirement, priority, and optional user preferences for specific GPU IDs or GPU types. Each GPU has fields such as ID, type, available memory, health status, current lease owner, and lease expiration time.
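The fields described above could be modeled roughly as follows. This is a minimal sketch, not the repository's actual data model: the class names, field names, units (GB, epoch seconds), and the `is_leased` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Job:
    # All names here are hypothetical; the real codebase may differ.
    job_id: str
    gpu_type: str          # required GPU type
    memory_gb: int         # memory requirement
    priority: int          # assumed: higher value means more important
    preferred_gpu_ids: List[str] = field(default_factory=list)
    preferred_gpu_types: List[str] = field(default_factory=list)


@dataclass
class GPU:
    gpu_id: str
    gpu_type: str
    available_memory_gb: int
    healthy: bool = True
    lease_owner: Optional[str] = None         # job_id holding the lease
    lease_expires_at: Optional[float] = None  # assumed: epoch seconds

    def is_leased(self, now: float) -> bool:
        """A lease counts as active only if it has an owner and has not expired."""
        return (
            self.lease_owner is not None
            and self.lease_expires_at is not None
            and now < self.lease_expires_at
        )
```

Treating an expired lease as "not leased" in one helper keeps the expiry rule in a single place, which makes it easier to check against the logs when a double allocation is suspected.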
The failing tests cover three areas:
- GPU scoring and preference handling: the selected GPU does not respect the user's preference when multiple GPUs are otherwise valid.
- Resource allocation and preemption: a GPU can be assigned without creating or updating a lease, which can lead to double allocation.
- Smarter preemption: when preemption is necessary, candidate GPUs should be ranked by score before choosing which running job to preempt.
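The three behaviors the tests expect can be sketched together in one small allocator. Everything below is an illustrative assumption rather than the repository's code: the names (`score`, `allocate`, `LEASE_SECONDS`), the scoring policy (a preferred GPU ID wins, then the tightest memory fit), the rule that only strictly lower-priority jobs may be preempted, and the simplification that one lease covers a whole GPU so `available_memory_gb` is not decremented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

LEASE_SECONDS = 60.0  # hypothetical lease duration


@dataclass
class Job:
    job_id: str
    gpu_type: str
    memory_gb: int
    priority: int  # assumed: higher value means more important
    preferred_gpu_ids: List[str] = field(default_factory=list)


@dataclass
class GPU:
    gpu_id: str
    gpu_type: str
    available_memory_gb: int
    healthy: bool = True
    lease_owner: Optional[str] = None
    lease_expires_at: Optional[float] = None


def is_valid(gpu: GPU, job: Job) -> bool:
    """Hard constraints: health, type, and memory."""
    return (gpu.healthy and gpu.gpu_type == job.gpu_type
            and gpu.available_memory_gb >= job.memory_gb)


def is_free(gpu: GPU, now: float) -> bool:
    """A GPU is free if it has no lease or its lease has expired."""
    return (gpu.lease_owner is None or gpu.lease_expires_at is None
            or gpu.lease_expires_at <= now)


def score(gpu: GPU, job: Job) -> Tuple[int, int]:
    """Assumed policy: user preference dominates, tighter memory fit breaks ties."""
    preferred = 1 if gpu.gpu_id in job.preferred_gpu_ids else 0
    return (preferred, -(gpu.available_memory_gb - job.memory_gb))


def allocate(job: Job, gpus: List[GPU], running: Dict[str, Job],
             now: float) -> Optional[GPU]:
    """Pick the best-scoring GPU, preempting only if no valid GPU is free.

    `running` maps job_id -> Job for jobs that currently hold a lease.
    """
    valid = [g for g in gpus if is_valid(g, job)]
    candidates = [g for g in valid if is_free(g, now)]
    if not candidates:
        # Preemption path: only GPUs held by a strictly lower-priority job.
        candidates = [g for g in valid
                      if g.lease_owner in running
                      and running[g.lease_owner].priority < job.priority]
    if not candidates:
        return None
    # Rank by score in both paths, so preemption respects preferences too.
    chosen = max(candidates, key=lambda g: score(g, job))
    if chosen.lease_owner is not None:
        running.pop(chosen.lease_owner, None)  # drop preempted or stale owner
    # Every assignment writes the lease; this is what prevents double allocation.
    chosen.lease_owner = job.job_id
    chosen.lease_expires_at = now + LEASE_SECONDS
    running[job.job_id] = job
    return chosen
```

In this sketch the lease is written in exactly one place, after the GPU is chosen, so no code path can return a GPU without recording its owner and expiration. A real fix would also need to consider concurrency (for example, holding a lock around the select-and-lease step), which this single-threaded example does not cover.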
Explain how you would debug the issue, what code areas you would inspect, what minimal fixes you would make, and how you would validate the result with tests and logs.