Query Machines and Mark Them Offline
Company: CoreWeave
Role: Site Reliability Engineer
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Onsite
Quick Answer: This question evaluates competency in interacting with RESTful HTTP APIs, JSON parsing and filtering, automation of operational workflows, resilient error handling (including retries and malformed responses), and command-line configuration for production-like services.
Constraints
- 0 <= number of machine records <= 100000
- 0 <= max_retries <= 10
- The total number of simulated POST status codes across all machines is at most 200000
Examples
Input: ((200, '{"rack":"rack-17","machine_type":"gpu","minimum_error_count":3}'), (200, '[{"id":"m1","rack":"rack-17","machine_type":"gpu","error_count":5,"state":"online"},{"id":"m2","rack":"rack-17","machine_type":"cpu","error_count":4,"state":"online"},{"id":"m3","rack":"rack-17","machine_type":"gpu","error_count":4,"state":"offline"},{"id":"m4","rack":"rack-17","machine_type":"gpu","error_count":3,"state":"online"}]'), {'m1': [503, 204], 'm4': [204]}, 2)
Expected Output: {'offline_marked': ['m1', 'm4'], 'failed': [], 'errors': []}
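Explanation: `m1` and `m4` are the only machines that match the task's rack and machine type, meet the error threshold, and are online. `m2` is a `cpu` machine and `m3` is already offline, so both are skipped; `m4` qualifies because its `error_count` of 3 equals `minimum_error_count`, which shows the threshold is inclusive. `m1` succeeds on its second attempt after one transient 503, and `m4` succeeds immediately.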
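The selection rule this example implies can be sketched as a small filter. The function name `select_targets` is hypothetical, and the sketch assumes every record is already well formed; it is an illustration of the matching logic rather than a reference solution.

```python
from typing import Any


def select_targets(task: dict[str, Any], machines: list[dict[str, Any]]) -> list[str]:
    """Return ids of online machines matching the task's rack, type, and error threshold.

    Hypothetical helper: assumes each record carries all five fields.
    """
    targets = []
    for machine in machines:
        if (
            machine["rack"] == task["rack"]
            and machine["machine_type"] == task["machine_type"]
            # Inclusive comparison: in the example, error_count 3 meets a minimum of 3.
            and machine["error_count"] >= task["minimum_error_count"]
            and machine["state"] == "online"
        ):
            targets.append(machine["id"])
    return targets
```

Input order is preserved, which matches the order of `offline_marked` in the expected output.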
Input: ((500, '{}'), (200, '[]'), {}, 1)
Expected Output: {'offline_marked': [], 'failed': [], 'errors': ['GET /task returned status 500']}
Explanation: A non-200 response from `GET /task` is a fatal error, so processing stops immediately.
Input: ((200, '{"rack":"rack-1","machine_type":"cpu","minimum_error_count":2}'), (200, 'not-json'), {}, 1)
Expected Output: {'offline_marked': [], 'failed': [], 'errors': ['Malformed JSON from /machines']}
Explanation: The machines body is not valid JSON, so the function returns an error summary without attempting any POST requests.
Input: ((200, '{"rack":"rack-1","machine_type":"cpu","minimum_error_count":2}'), (200, '[]'), {}, 2)
Expected Output: {'offline_marked': [], 'failed': [], 'errors': []}
Explanation: The machines list is empty, so there is nothing to mark offline. This is a valid edge case.
Input: ((200, '{"rack":"rack-17","machine_type":"gpu","minimum_error_count":3}'), (200, '[{"id":"m1","rack":"rack-17","machine_type":"gpu","error_count":6,"state":"online"},{"id":"m2","rack":"rack-17","machine_type":"gpu","error_count":4,"state":"online"}]'), {'m1': [503, 503, 503], 'm2': [409]}, 2)
Expected Output: {'offline_marked': [], 'failed': ['m1', 'm2'], 'errors': ['POST /machines/m1/offline failed after 3 attempts with status 503', 'POST /machines/m2/offline returned permanent status 409']}
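Explanation: With `max_retries` set to 2, `m1` gets three attempts (the initial request plus two retries), and every one returns the transient status 503, so it is reported as failed. `m2` returns 409, a permanent failure code, so it is not retried.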
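The retry behavior in this example can be sketched as follows. The names `is_transient` and `simulate_post` are hypothetical. The examples only show 503 as transient and 409 as permanent, so treating all 5xx codes plus 429 as transient, and any 2xx code as success, is an assumption rather than part of the stated problem.

```python
from typing import Optional


def is_transient(status: int) -> bool:
    # Assumption: 5xx and 429 are retryable; the source only confirms 503.
    return status == 429 or 500 <= status < 600


def simulate_post(
    machine_id: str, statuses: list[int], max_retries: int
) -> tuple[bool, Optional[str]]:
    """Replay simulated status codes; return (succeeded, error message or None)."""
    attempts = 0
    last_status = None
    # One initial attempt plus up to max_retries retries.
    for status in statuses[: max_retries + 1]:
        attempts += 1
        last_status = status
        if 200 <= status < 300:  # assumption: any 2xx counts as success
            return True, None
        if not is_transient(status):
            return False, (
                f"POST /machines/{machine_id}/offline "
                f"returned permanent status {status}"
            )
    # Every allowed attempt was transient. If the simulated list is empty or runs
    # out early, this sketch still reports a failure; the source does not define
    # that case.
    return False, (
        f"POST /machines/{machine_id}/offline "
        f"failed after {attempts} attempts with status {last_status}"
    )
```

Returning the message alongside the boolean lets the caller append to `failed` and `errors` in machine order, as the expected output does.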
Input: ((200, '{"rack":"rack-2","machine_type":"cpu","minimum_error_count":1}'), (200, '[{"id":"a","rack":"rack-2","machine_type":"cpu","error_count":1,"state":"online"},{"id":"bad","rack":"rack-2"}]'), {'a': [200]}, 1)
Expected Output: {'offline_marked': ['a'], 'failed': [], 'errors': ['Skipped malformed machine record']}
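Explanation: The first machine is valid and is marked offline; its POST returns 200, which counts as success just as 204 does. The second record is missing required fields (`machine_type`, `error_count`, and `state`), so it is skipped and reported in `errors` as malformed.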
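One way to separate usable records from malformed ones is sketched below. The names `REQUIRED_FIELDS` and `split_records` are hypothetical. The sketch assumes a record is malformed when it is not an object or lacks any of the five fields seen in the examples, and that malformed records are reported regardless of rack; the source does not spell out either rule.

```python
from typing import Any

# Assumption: these are the required fields, inferred from the example records.
REQUIRED_FIELDS = ("id", "rack", "machine_type", "error_count", "state")


def split_records(machines: list[Any]) -> tuple[list[dict[str, Any]], list[str]]:
    """Return (well-formed records, one error message per malformed record)."""
    valid: list[dict[str, Any]] = []
    errors: list[str] = []
    for record in machines:
        if isinstance(record, dict) and all(field in record for field in REQUIRED_FIELDS):
            valid.append(record)
        else:
            errors.append("Skipped malformed machine record")
    return valid, errors
```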
Hints
- Stop early if `/task` or `/machines` cannot be parsed correctly; without both responses, you cannot safely choose machines to update.
- For each matching machine, consume its simulated POST status codes from left to right and stop at the first success, the first permanent failure, or once `max_retries` + 1 attempts have been used.
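The early-stop check from the first hint can be sketched as below. The names `parse_response` and `load_inputs` are hypothetical. The messages for a non-200 `/task` and a malformed `/machines` body come from the examples; applying the same two message shapes to the other endpoint is an assumption.

```python
import json
from typing import Any, Optional

Response = tuple[int, str]  # (status code, raw body), as in the example inputs


def parse_response(path: str, response: Response) -> tuple[Any, Optional[str]]:
    """Return (parsed JSON, None) on success or (None, error message) on failure."""
    status, body = response
    if status != 200:
        return None, f"GET {path} returned status {status}"
    try:
        return json.loads(body), None
    except json.JSONDecodeError:
        return None, f"Malformed JSON from {path}"


def load_inputs(
    task_resp: Response, machines_resp: Response
) -> tuple[Any, Any, Optional[str]]:
    """Validate /task first, then /machines; stop at the first fatal error."""
    task, err = parse_response("/task", task_resp)
    if err is not None:
        return None, None, err
    machines, err = parse_response("/machines", machines_resp)
    if err is not None:
        return None, None, err
    return task, machines, None
```

When the third element is not `None`, a caller would return the summary with that single message in `errors` and make no POST attempts.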