Build a GPU VM Fleet CLI
Company: Heygen
Role: Software Engineer
Category: Software Engineering Fundamentals
Difficulty: medium
Interview Round: Technical Screen
You are given a repository containing three mock cloud-provider servers that simulate GPU VM providers — **Crusoe Cloud**, **Lambda Cloud**, and **Nebius AI Cloud**. This is a live AI-assisted coding exercise: build a provider-agnostic CLI tool for requesting and managing GPU virtual machines and **fleets** of machines across all three providers.
## Background
Your company rents GPU VMs from three providers, each with a different API shape and lifecycle model:
| Provider | Protocol | Resource model | Operations | Reservations |
|----------|----------|----------------|------------|--------------|
| **Crusoe Cloud** | REST | Project-scoped | Asynchronous (operation IDs) | Auto-placed into cheapest matching reservation; explicit `reservation_id` optional; stop releases capacity, start reclaims it. Lifecycle includes reboot/reset/restart semantics. |
| **Lambda Cloud** | REST | Flat / simple | Mostly synchronous | Instances flagged `is_reserved: true`; reserved instances **cannot be terminated** via the API; launching with `reservation_id` uses reserved capacity. |
| **Nebius AI Cloud** | gRPC | Parent-scoped | Asynchronous (operation IDs) | A `ReservationPolicy` in the instance spec: `AUTO` (try reservation first), `FORBID` (always on-demand), `STRICT` (must use a specific reservation, else fail). |
The deliverable is split into two layers (Part 1 and Part 2 below) plus a follow-up discussion. The interviewer cares far more about your design — how you isolate provider differences, model state, and handle partial failure — than about exhaustively wiring every endpoint.
### Constraints & Assumptions
- **Mock servers are provided.** You do not have to handle real auth/billing, but you must call the three mock APIs (two REST, one gRPC) through their real interfaces.
- **Scale of the exercise:** a fleet is on the order of single digits to a few dozen VMs; the CLI runs as a short-lived process invoked repeatedly from a shell.
- **Time budget (guidance):** Layer 1 ~30–40 min, Layer 2 ~40–50 min. Favor a clean, extensible design over feature completeness.
- **Durability:** fleet membership must survive process exit (the CLI is invoked once per command), so state lives outside the process.
- **Mixed sync/async:** Crusoe and Nebius return operation IDs that must be polled; Lambda is mostly synchronous. Higher layers should not care which is which.
- **GPU types / regions** are passed as opaque strings (e.g. `h100`, `us-east`); a provider may not support a given type or region.
### Clarifying Questions to Ask
- What is the source of truth for "all instances" — do we list only VMs this CLI created, or every VM in each provider account?
- For `vm fleet create`, what allocation policy is expected (cheapest-first, spread evenly, reservation-first), and is the count a hard requirement or best-effort?
- If a fleet can only partially fill (e.g. 6 of 10), should it roll back to zero, or keep the partial fleet and report a shortfall?
- Where should fleet state live — local file/SQLite for the exercise, or are we expected to design for a shared server-side store?
- What output format(s) must commands support (human table, JSON, both)?
- How are credentials / provider endpoints supplied (env vars, config file)?
### Part 1 — Unified CLI (Layer 1)
Build a CLI that presents one consistent interface for managing individual VMs across all three providers. Each command targets a single provider (except `list`, which may span all). Required commands:
```bash
# List all instances across all providers, or filter by provider
vm list [--provider <name>]
# Create new instance(s)
vm create --provider <name> --gpu <type> --count <n> [--name <name>] [--region <region>]
# Get instance details
vm get <instance_id> --provider <name>
# Stop an instance
vm stop <instance_id> --provider <name>
# Start an instance
vm start <instance_id> --provider <name>
# Destroy / terminate an instance
vm destroy <instance_id> --provider <name>
```
The hard part is not argument parsing — it is designing a single abstraction (a `ProviderClient`-style interface plus a normalized `Instance` model) so that the CLI layer never branches on provider, and so a fourth provider could be added by writing one new adapter.
```hint Where to start
Define one provider-agnostic interface (`list / create / get / stop / start / destroy` + a `capabilities()` probe) and a normalized `Instance` model. Implement three **adapters** behind it. The CLI dispatches by provider name to an adapter; it must contain zero provider-specific `if` branches.
```
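The interface described in the hint could look roughly like the sketch below. It is an illustration under stated assumptions, not the repository's actual code: `InMemoryProvider`, `REGISTRY`, `register`, and `get_client` are hypothetical names, the list method is called `list_instances` to avoid shadowing Python's built-in `list`, and `capabilities()` is reduced to a set of supported GPU types.

```python
import itertools
from abc import ABC, abstractmethod
from dataclasses import dataclass, replace
from enum import Enum
from typing import Dict, FrozenSet, List, Optional


class InstanceState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    STOPPED = "stopped"
    TERMINATED = "terminated"


@dataclass(frozen=True)
class Instance:
    """Normalized view of a VM; adapters never return provider-shaped payloads."""
    provider: str
    instance_id: str
    gpu_type: str
    state: InstanceState
    name: Optional[str] = None
    region: Optional[str] = None


class ProviderClient(ABC):
    """The only surface the CLI and fleet layers are allowed to call."""
    name: str

    @abstractmethod
    def capabilities(self) -> FrozenSet[str]: ...  # supported GPU types
    @abstractmethod
    def list_instances(self) -> List[Instance]: ...
    @abstractmethod
    def create(self, gpu_type: str, count: int, name: Optional[str] = None,
               region: Optional[str] = None) -> List[Instance]: ...
    @abstractmethod
    def get(self, instance_id: str) -> Instance: ...
    @abstractmethod
    def stop(self, instance_id: str) -> Instance: ...
    @abstractmethod
    def start(self, instance_id: str) -> Instance: ...
    @abstractmethod
    def destroy(self, instance_id: str) -> Instance: ...


class InMemoryProvider(ProviderClient):
    """Hypothetical stand-in adapter; a real one would call a mock server."""

    def __init__(self, name: str, gpu_types: FrozenSet[str]):
        self.name = name
        self._gpu_types = frozenset(gpu_types)
        self._vms: Dict[str, Instance] = {}
        self._ids = itertools.count(1)

    def capabilities(self) -> FrozenSet[str]:
        return self._gpu_types

    def list_instances(self) -> List[Instance]:
        return list(self._vms.values())

    def create(self, gpu_type, count, name=None, region=None):
        if gpu_type not in self._gpu_types:
            raise ValueError(f"{self.name} does not offer {gpu_type!r}")
        made = []
        for _ in range(count):
            vm = Instance(self.name, f"{self.name}-{next(self._ids)}", gpu_type,
                          InstanceState.RUNNING, name, region)
            self._vms[vm.instance_id] = vm
            made.append(vm)
        return made

    def get(self, instance_id):
        return self._vms[instance_id]

    def _set(self, instance_id, state):
        # Instances are frozen, so a state change stores an updated copy.
        self._vms[instance_id] = replace(self._vms[instance_id], state=state)
        return self._vms[instance_id]

    def stop(self, instance_id):
        return self._set(instance_id, InstanceState.STOPPED)

    def start(self, instance_id):
        return self._set(instance_id, InstanceState.RUNNING)

    def destroy(self, instance_id):
        return self._set(instance_id, InstanceState.TERMINATED)


# The CLI looks adapters up by name; it never inspects which provider it got.
REGISTRY: Dict[str, ProviderClient] = {}


def register(client: ProviderClient) -> None:
    REGISTRY[client.name] = client


def get_client(provider: str) -> ProviderClient:
    if provider not in REGISTRY:
        raise ValueError(f"unknown provider {provider!r}; known: {sorted(REGISTRY)}")
    return REGISTRY[provider]
```

With this shape, adding a fourth provider means writing one more `ProviderClient` subclass and one `register(...)` call; the command handlers stay unchanged.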
```hint Hiding sync vs async
Crusoe and Nebius return an operation ID, not a finished resource. Put a `wait_for_operation(op_id)` poller **inside** each async adapter so `create`/`stop`/etc. return a settled `Instance` to the caller — the CLI shouldn't know which providers are async. Lambda's adapter just returns the synchronous result directly.
```
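One possible shape for that poller is sketched here. The `OperationStatus` fields and the `fetch_status` callback are assumptions standing in for whatever each mock server's operation endpoint actually returns; the exponential backoff and the default timeout are illustrative choices, not requirements from the exercise.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


class OperationTimeout(Exception):
    """The operation did not settle before the deadline."""


class OperationFailed(Exception):
    """The provider reported the operation as finished with an error."""


@dataclass
class OperationStatus:
    # Hypothetical normalized status; each adapter maps its own payload to this.
    done: bool
    error: Optional[str] = None
    resource_id: Optional[str] = None


def wait_for_operation(
    fetch_status: Callable[[str], OperationStatus],
    op_id: str,
    timeout_s: float = 300.0,
    initial_delay_s: float = 0.5,
    max_delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll until the operation settles, then return the affected resource ID."""
    deadline = clock() + timeout_s
    delay = initial_delay_s
    while True:
        status = fetch_status(op_id)
        if status.done:
            if status.error:
                raise OperationFailed(f"operation {op_id} failed: {status.error}")
            return status.resource_id
        if clock() >= deadline:
            raise OperationTimeout(f"operation {op_id} still pending after {timeout_s}s")
        # Never sleep past the deadline; double the delay up to a cap.
        sleep(min(delay, max(0.0, deadline - clock())))
        delay = min(delay * 2, max_delay_s)
```

Injecting `sleep` and `clock` keeps the poller testable without real waiting, and a timeout leaves the operation ID available for a later retry or reconciliation pass.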
```hint Cross-provider list
For `vm list` with no `--provider`, fan out to all three providers (running the calls concurrently is fine because they are independent), normalize each result into the common `Instance` shape, and merge. Decide up front whether one provider failing fails the whole command or yields partial results with a warning, and state that choice.
```
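The fan-out can be sketched with a thread pool, as below. This version picks the "partial results with a warning" policy; that choice, the `list_all` name, and the use of plain dicts for instances are assumptions made for brevity.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def list_all(
    listers: Dict[str, Callable[[], List[dict]]],
) -> Tuple[List[dict], Dict[str, str]]:
    """Call every provider's list function concurrently.

    Returns (instances, errors): instances from providers that answered, and a
    provider -> message map for those that failed, so the CLI can print a warning.
    """
    instances: List[dict] = []
    errors: Dict[str, str] = {}
    if not listers:
        return instances, errors
    with ThreadPoolExecutor(max_workers=len(listers)) as pool:
        futures = {name: pool.submit(fn) for name, fn in listers.items()}
        # Collect in sorted provider order so the output is deterministic.
        for name in sorted(futures):
            try:
                instances.extend(futures[name].result())
            except Exception as exc:  # one provider failing must not hide the others
                errors[name] = f"{type(exc).__name__}: {exc}"
    return instances, errors
```

A caller that prefers all-or-nothing semantics can treat a non-empty `errors` map as a command failure and exit non-zero.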
#### What This Part Should Cover
- A clean provider abstraction (common interface + normalized model + per-provider adapters) with no provider branching above the adapter layer.
- Correct mapping of each command to each provider's real API, including project-/parent-scoping for Crusoe/Nebius and the gRPC vs REST split.
- Normalization of heterogeneous provider states (e.g. `pending`/`provisioning`/`running`) into a single canonical state enum.
- Consistent, ideally machine-readable output (a table plus `--output json`), and sensible argument validation.
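State normalization can be as small as a lookup table per adapter. In the sketch below the raw state strings are illustrative guesses, not values confirmed by the mock servers; the real strings should be read from each server's responses. Unmapped values fall back to `UNKNOWN` rather than raising, so a new provider state cannot break `vm list`.

```python
from enum import Enum


class InstanceState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    STOPPED = "stopped"
    TERMINATED = "terminated"
    UNKNOWN = "unknown"


# Raw strings here are hypothetical placeholders for each provider's vocabulary.
_RAW_TO_CANONICAL = {
    "crusoe": {
        "provisioning": InstanceState.PENDING,
        "running": InstanceState.RUNNING,
        "stopped": InstanceState.STOPPED,
        "deleted": InstanceState.TERMINATED,
    },
    "lambda": {
        "pending": InstanceState.PENDING,
        "running": InstanceState.RUNNING,
        "terminated": InstanceState.TERMINATED,
    },
    "nebius": {
        "creating": InstanceState.PENDING,
        "running": InstanceState.RUNNING,
        "stopped": InstanceState.STOPPED,
        "deleting": InstanceState.TERMINATED,
    },
}


def normalize_state(provider: str, raw: str) -> InstanceState:
    """Map a provider-specific state string onto the canonical enum."""
    table = _RAW_TO_CANONICAL.get(provider, {})
    return table.get(raw.strip().lower(), InstanceState.UNKNOWN)
```

Keeping the table next to each adapter makes the mapping easy to test exhaustively and keeps raw provider strings out of the CLI and fleet layers.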
### Part 2 — Fleet Manager (Layer 2)
Build on top of Part 1 to manage a **fleet** — a logical group of VMs of one GPU type that may be spread across multiple providers and must be tracked as a unit. Required commands:
```bash
# Request N machines of a given GPU type, spread across providers
vm fleet create --gpu <type> --count <n> [--name <fleet_name>]
# List all fleets
vm fleet list
# Show fleet status (which VMs, which providers, which states)
vm fleet status <fleet_name>
# Destroy an entire fleet
vm fleet destroy <fleet_name>
```
`fleet create` is the centerpiece: it must allocate `N` machines across providers, persist membership durably as it goes, and behave sanely when it can only partially fill the request or fails midway.
```hint Allocation loop
Walk providers in a defined order (or by a pluggable policy: cheapest-first, reservation-first, spread). At each step, request only `remaining = count - created_so_far`, skip providers that don't support the GPU type, and stop once the count is met. Encapsulate the policy so it can be swapped without touching the loop.
```
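A minimal sketch of that loop follows, assuming each adapter exposes `supports(gpu_type)` and a `create(gpu_type, count)` that may grant fewer machines than requested. `FakeProvider`, `hourly_price`, and the two policy functions are hypothetical and exist only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


class InsufficientCapacity(Exception):
    """The provider cannot supply any machine of the requested type right now."""


@dataclass
class FakeProvider:
    # Hypothetical stand-in for an adapter; real adapters call the mock servers.
    name: str
    gpu_types: frozenset
    capacity: int
    hourly_price: float
    issued: int = 0

    def supports(self, gpu_type: str) -> bool:
        return gpu_type in self.gpu_types

    def create(self, gpu_type: str, count: int) -> List[str]:
        if self.capacity == 0:
            raise InsufficientCapacity(self.name)
        granted = min(count, self.capacity)  # may be fewer than requested
        ids = [f"{self.name}-{self.issued + i + 1}" for i in range(granted)]
        self.capacity -= granted
        self.issued += granted
        return ids


Policy = Callable[[Sequence[FakeProvider]], Sequence[FakeProvider]]


def in_given_order(providers):
    return list(providers)


def cheapest_first(providers):
    return sorted(providers, key=lambda p: p.hourly_price)


def allocate(providers, gpu_type: str, count: int, policy: Policy,
             on_created: Callable[[str, str], None] = lambda p, i: None,
             ) -> List[Tuple[str, str]]:
    """Fill up to `count` machines, asking each provider only for what remains."""
    created: List[Tuple[str, str]] = []
    for provider in policy(providers):
        remaining = count - len(created)
        if remaining <= 0:
            break
        if not provider.supports(gpu_type):
            continue
        try:
            ids = provider.create(gpu_type, remaining)
        except InsufficientCapacity:
            continue  # try the next provider instead of failing the fleet
        for instance_id in ids:
            on_created(provider.name, instance_id)  # hook for durable persistence
            created.append((provider.name, instance_id))
    return created
```

The caller compares `len(created)` with `count` to decide between success and a shortfall; swapping the policy is a one-argument change and the loop itself never changes.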
```hint Persist as you go
Write a fleet record (status `CREATING`) **before** allocating, and persist each VM into the fleet's membership the instant the provider confirms it — not at the end. Cleanup and `fleet status` can only act on what was durably recorded, so an in-memory list that dies with the process is not enough.
```
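For the exercise, a local SQLite file is one reasonable store. The sketch below assumes a two-table schema and a `FleetStore` class; both are hypothetical, and the column set is a minimal guess at what `status` and cleanup need.

```python
import sqlite3
from typing import List, Optional, Tuple

SCHEMA = """
CREATE TABLE IF NOT EXISTS fleets (
    name      TEXT PRIMARY KEY,
    gpu_type  TEXT NOT NULL,
    requested INTEGER NOT NULL,
    status    TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS members (
    fleet_name     TEXT NOT NULL,
    provider       TEXT NOT NULL,
    instance_id    TEXT NOT NULL,
    state          TEXT NOT NULL,
    reservation_id TEXT,
    PRIMARY KEY (provider, instance_id)
);
"""


class FleetStore:
    """Durable fleet membership; every write commits before returning."""

    def __init__(self, path: str):
        self.db = sqlite3.connect(path)
        self.db.executescript(SCHEMA)

    def begin_fleet(self, name: str, gpu_type: str, requested: int) -> None:
        # Written BEFORE any provider call, so a crash leaves a CREATING record.
        with self.db:
            self.db.execute("INSERT INTO fleets VALUES (?, ?, ?, 'CREATING')",
                            (name, gpu_type, requested))

    def add_member(self, fleet: str, provider: str, instance_id: str,
                   state: str = "running",
                   reservation_id: Optional[str] = None) -> None:
        # OR IGNORE makes re-recording the same VM after a retry harmless.
        with self.db:
            self.db.execute("INSERT OR IGNORE INTO members VALUES (?, ?, ?, ?, ?)",
                            (fleet, provider, instance_id, state, reservation_id))

    def set_status(self, name: str, status: str) -> None:
        with self.db:
            self.db.execute("UPDATE fleets SET status = ? WHERE name = ?",
                            (status, name))

    def status(self, name: str) -> Optional[str]:
        row = self.db.execute("SELECT status FROM fleets WHERE name = ?",
                              (name,)).fetchone()
        return row[0] if row else None

    def members(self, fleet: str) -> List[Tuple[str, str, str]]:
        return self.db.execute(
            "SELECT provider, instance_id, state FROM members "
            "WHERE fleet_name = ? ORDER BY provider, instance_id",
            (fleet,)).fetchall()

    def close(self) -> None:
        self.db.close()
```

Because `begin_fleet` inserts on a primary key, calling it twice with the same name raises `sqlite3.IntegrityError`, which gives the CLI a natural place to decide between resuming and rejecting a duplicate fleet.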
```hint Partial failure & rollback
If you can't reach `count`, decide on and implement one explicit policy: either a best-effort destroy of everything created so far (true rollback), or keep the partial fleet and report the shortfall. Either way, run cleanup against the **persisted** membership, make destroys best-effort and retryable, and never touch VMs outside this fleet. Watch for the Lambda reserved-instance "cannot terminate" case.
```
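Best-effort cleanup over the persisted membership might look like the sketch below. `destroy_members`, `CleanupReport`, and the `destroy` callback are hypothetical names; the sketch assumes the callback is idempotent (destroying an already-gone VM succeeds) and omits backoff between attempts to stay short.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


class ReservedInstanceCannotBeTerminated(Exception):
    """Raised by an adapter when the provider refuses to terminate a reserved VM."""


@dataclass
class CleanupReport:
    destroyed: List[Tuple[str, str]] = field(default_factory=list)
    blocked: List[Tuple[str, str]] = field(default_factory=list)      # non-terminable
    failed: List[Tuple[str, str, str]] = field(default_factory=list)  # retry later


def destroy_members(
    members: Iterable[Tuple[str, str]],
    destroy: Callable[[str, str], None],
    attempts: int = 3,
) -> CleanupReport:
    """Try to destroy every recorded member; one failure never stops the rest.

    `members` must come from the durable fleet record, so VMs outside the
    fleet are never touched.
    """
    report = CleanupReport()
    for provider, instance_id in members:
        for attempt in range(1, attempts + 1):
            try:
                destroy(provider, instance_id)
            except ReservedInstanceCannotBeTerminated:
                # Retrying cannot help; surface it so a human can release it.
                report.blocked.append((provider, instance_id))
                break
            except Exception as exc:
                if attempt == attempts:
                    report.failed.append((provider, instance_id, str(exc)))
            else:
                report.destroyed.append((provider, instance_id))
                break
    return report
```

Anything left in `failed` or `blocked` stays in the fleet record, so a later `vm fleet destroy` can retry exactly those members and the command can exit non-zero with a precise list.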
#### Clarifying Questions for this Part
- Is `--name` optional, and if so how are unnamed fleets identified (generated name, sequence)?
- Should `fleet create` be idempotent on retry (does re-running with the same name resume the existing fleet or create a second one)?
- Does `fleet destroy` need to handle a fleet that's already partially destroyed or has cleanup-failed members?
#### What This Part Should Cover
- A durable fleet store (membership survives process exit) with the right records: fleet metadata + per-VM `(provider, provider_instance_id, state, reservation info)`.
- An explicit, encapsulated allocation strategy and a clear definition of success vs. partial fill.
- Correct partial-failure handling: incremental persistence, a stated rollback-vs-keep policy, best-effort retryable cleanup, and not destroying out-of-fleet VMs.
- A coherent fleet state machine (`CREATING → ACTIVE / PARTIAL / FAILED → DESTROYED`) reflected consistently in `fleet status`.
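The state machine in the last bullet can be enforced with a small transition table. The `settle` rule below assumes the keep-partial policy (a shortfall yields `PARTIAL`, zero machines yields `FAILED`); under a rollback policy the mapping would differ. All names are illustrative.

```python
from enum import Enum


class FleetState(Enum):
    CREATING = "CREATING"
    ACTIVE = "ACTIVE"
    PARTIAL = "PARTIAL"
    FAILED = "FAILED"
    DESTROYED = "DESTROYED"


class InvalidTransition(ValueError):
    pass


# CREATING -> ACTIVE / PARTIAL / FAILED -> DESTROYED, and nothing else.
ALLOWED = {
    FleetState.CREATING: {FleetState.ACTIVE, FleetState.PARTIAL, FleetState.FAILED},
    FleetState.ACTIVE: {FleetState.DESTROYED},
    FleetState.PARTIAL: {FleetState.DESTROYED},
    FleetState.FAILED: {FleetState.DESTROYED},
    FleetState.DESTROYED: set(),
}


def transition(current: FleetState, target: FleetState) -> FleetState:
    """Return `target` if the move is legal, otherwise raise."""
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value} is not allowed")
    return target


def settle(requested: int, created: int) -> FleetState:
    """Pick the post-CREATING state from how much of the request was filled."""
    if created >= requested:
        return FleetState.ACTIVE
    if created > 0:
        return FleetState.PARTIAL
    return FleetState.FAILED
```

Routing every status write through `transition` means `fleet status` can never show a combination the design did not intend, such as a destroyed fleet becoming active again.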
### What a Strong Answer Covers
Across both parts, the interviewer is watching for the design instincts that separate a thin wrapper from a maintainable tool:
- **Separation of concerns:** provider differences (protocol, scoping, sync/async, reservation semantics) are quarantined inside adapters; the CLI and fleet layers speak only the normalized model.
- **Idempotency & duplicate-creation safety:** retries of expensive GPU `create` calls must not silently double-allocate — via request IDs, idempotency keys, or tagging/naming instances with fleet metadata so an interrupted create can be reconciled.
- **Durable, recoverable state:** state is written before/after each meaningful transition so a crash mid-create leaves a recoverable record, not orphaned VMs.
- **Honest failure reporting:** structured error types (e.g. `InsufficientCapacity`, `ReservedInstanceCannotBeTerminated`, `OperationTimeout`) surfaced as clear human-readable messages, never silent success.
- **Testability:** the adapter seam makes state mapping, reservation-policy translation, and partial-failure rollback testable in isolation, with the adapters themselves exercised against the mock servers.
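The idempotency bullet can be made concrete with deterministic member names and keys. The sketch assumes providers let the caller set an instance name and return it in list results, which has to be checked against each mock server; `member_name`, `idempotency_key`, and `find_unrecorded` are hypothetical helpers.

```python
import hashlib
import re
from typing import Iterable, List, Set


def member_name(fleet: str, slot: int) -> str:
    """Deterministic name for the Nth machine of a fleet, e.g. train-a-007."""
    return f"{fleet}-{slot:03d}"


def idempotency_key(fleet: str, gpu_type: str, slot: int) -> str:
    """Stable key for one create request; a retry of the same slot reuses it."""
    raw = f"{fleet}|{gpu_type}|{slot}".encode()
    return hashlib.sha256(raw).hexdigest()[:32]


def find_unrecorded(fleet: str, recorded_ids: Set[str],
                    provider_instances: Iterable[dict]) -> List[str]:
    """Find VMs that carry this fleet's naming pattern but are missing locally.

    These are the machines a crashed or timed-out `create` may have left
    behind; they can be adopted into the fleet or destroyed.
    """
    pattern = re.compile(re.escape(fleet) + r"-\d{3,}")
    return [
        vm["id"]
        for vm in provider_instances
        if pattern.fullmatch(vm.get("name") or "") and vm["id"] not in recorded_ids
    ]
```

Matching on the full name rather than a prefix avoids adopting machines from a different fleet whose name merely starts with the same string, such as `train-a-b`.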
### Follow-up Questions
1. What are the major API differences between the three providers (protocol, scoping, operation style, reservation semantics), and how does your code keep them from leaking past the adapter layer?
2. How do you store the final set of machines that belong to a fleet, and what schema makes `status`, cleanup, and idempotent retry possible?
3. How do you clean up partially created machines when `fleet create` fails midway — including the cases where destroy is async, a Lambda member is reserved/non-terminable, or the CLI crashes during cleanup?
4. What problems could occur in implementation or production (duplicate allocation, rate limits, lost responses, stale local state), and how would you mitigate each?
Quick Answer: This question evaluates competency in designing provider-agnostic tooling for managing GPU virtual machines and fleets, covering API integration, lifecycle and reservation semantics, state reconciliation, and fault handling.