Reviewing a Freight-Scheduler Codebase: Bugs, Data Structures, and Concurrency
Company: Optiver
Role: Software Engineer
Category: Software Engineering Fundamentals
Difficulty: hard
Interview Round: Technical Screen
You have just finished implementing `process_order(...)` inside a large provided codebase for a freight-scheduling system (this is the review stage of a trading-firm phone screen that supplies a long Python skeleton with `Order`, `Plane`, `Cargo`, and `Scheduler` classes plus many helpers). The system's rules: each plane has a fixed departure time and a fixed capacity; each order has a placement time and a set of cargo pieces with sizes; an order's cargo may only be loaded onto planes departing after the order's time; cargo may be split across multiple planes; an order is either accepted (cargo fully allocated, remaining capacities updated) or rejected (no state change).
The interviewer now asks you to step back from your own function and critically review the **entire** codebase: find likely bugs and unhandled corner cases, identify complexity bottlenecks and better data structures, and reason about what happens when the system must handle concurrent and high-volume order traffic.
### Constraints & Assumptions
- Single-process, in-memory scheduler written in Python; no database or external queue in the baseline design.
- Baseline implementation: planes are kept in an unsorted list; every `process_order` call linearly scans all planes to filter eligible ones, sorts the eligible list, then allocates.
- Fleet size on the order of $10^5$ planes; order volume may grow to millions per day, with bursts of many orders per second.
- `process_order` must be all-or-nothing: a rejected order must leave every plane's remaining capacity untouched.
- Deterministic allocation policy (e.g., fill eligible planes by earliest departure) is required so results are reproducible and auditable.
- You may assume 64-bit integer arithmetic for sizes and capacities.
### Clarifying Questions to Ask
- Is the acceptance check based on the *sum* of eligible remaining capacities (divisible cargo), or must each cargo piece fit entirely on a single plane (bin-packing semantics)?
- Is "departs after the order time" strict (`>`) or inclusive (`>=`), and are times unique?
- Do planes ever get removed (departed flights) or added (new flights), or is the fleet fixed for the lifetime of the process?
- Are order cancellations or capacity roll-backs ever required, or is allocation permanent once committed?
- Under concurrency, what consistency guarantee is expected — must results be identical to *some* serial processing order, or to the exact arrival order?
- What is the read/write mix — are there frequent queries (e.g., "remaining capacity by route") alongside order processing?
### Part 1
Review the provided code for correctness. Which categories of bugs and unhandled corner cases would you look for in this kind of scheduler, and how would you verify or debug them under interview time pressure?
```hint Boundary sweep
Walk every comparison and boundary systematically: the departure-vs-order-time comparison (strict vs inclusive), empty or zero-size cargo, an exact-fit allocation that drives remaining capacity to exactly $0$, and — most importantly — whether the **rejection path can leave partially mutated state** if feasibility is checked plane-by-plane while allocating.
```
#### What This Part Should Cover
- A systematic taxonomy of likely defects (boundary conditions, state-mutation/atomicity bugs, aliasing or shared-mutable-state issues, numeric issues) rather than a random grab-bag.
- Recognition that the accept/allocate sequence is the highest-risk area: feasibility computed from stale or partially updated capacities, and rollback on rejection.
- A concrete verification strategy: targeted unit tests on boundaries, invariant assertions, and differential testing against a brute-force reference implementation.
### Part 2
Analyze the time complexity of the baseline `process_order` and propose data-structure changes. Which structures would you replace with a heap, tree, or indexed structure, what does each query become, and what is the resulting overall complexity for $m$ orders over $n$ planes?
```hint Which queries dominate
Per order you need two things: (a) the **sum of remaining capacity over planes departing after $t$**, and (b) those planes **in ascending departure order**. Sorting planes by departure once turns eligibility into a suffix of an array found by binary search; think about what structure gives you a dynamic suffix sum with point updates.
```
```hint Amortize the fill
A plane can be drained to zero remaining capacity at most **once** over the whole run, and each accepted order partially fills at most one plane. Count total plane-touches across all orders, not per order.
```
#### What This Part Should Cover
- A correct baseline analysis (per-order full scan plus sort, i.e., $O(m \cdot n \log n)$ worst case) before proposing fixes.
- Mapping each logical query (eligibility filter, feasibility sum, ordered fill) to an appropriate structure (sorted array + binary search, Fenwick/segment tree or heap with lazy deletion) with per-operation costs.
- An amortized argument for the allocation loop and a final end-to-end complexity, with brief discussion of what changes if the allocation policy were "most remaining capacity first."
### Part 3
Is this scheduler thread-safe as written? Explain exactly what goes wrong if multiple threads call `process_order` concurrently, and design an approach to support concurrent orders safely. Then discuss what you would change as order volume grows well beyond what a single machine handles comfortably.
```hint Find the race
The core hazard is **check-then-act**: two orders concurrently read remaining capacities, both conclude the order fits, and both commit — overselling capacity. Ask yourself what makes the feasibility check and the multi-plane update a single atomic step.
```
```hint Serialize or partition
Compare three families: one global lock / single-writer queue (simple, serial); fine-grained per-plane locking with a consistent lock order (or two-phase reserve-then-commit with rollback); and optimistic versioned updates with retry. Each trades throughput for complexity differently — and note how cheap each order actually is after Part 2's optimization.
```
#### What This Part Should Cover
- A precise statement of the data races: lost updates on remaining capacity, torn multi-plane allocations, and iteration over structures being mutated — not just "add a lock."
- At least two concrete concurrency designs with their trade-offs (global serialization vs fine-grained locking vs optimistic concurrency), including deadlock avoidance and how atomic all-or-nothing acceptance is preserved.
- A scaling story: partitioning/sharding choices, ordering and idempotency of order processing, and when to move state into a transactional store or log-based pipeline.
### What a Strong Answer Covers
Across all parts, the interviewer is evaluating how you reason about an unfamiliar, realistic codebase end-to-end — not whether you memorize APIs:
- Reading and critiquing provided code methodically: stating the system's invariants (capacity never negative, accepted orders fully allocated, rejected orders leave no trace) and checking the code against them.
- Correctness before optimization: fixing atomicity and boundary bugs first, then optimizing with a quantified before/after complexity comparison.
- Trade-off fluency: every proposal (data structure, locking scheme, sharding) comes with its cost — memory, code complexity, contention, or lost ordering guarantees.
- Clear communication under time pressure: prioritizing the highest-impact issues instead of enumerating everything at equal depth.
### Follow-up Questions
- How would you support order cancellation or plane removal (a flight departs and becomes unavailable) without breaking the amortized complexity from Part 2?
- If the process crashes mid-allocation, what state can be corrupted, and how would you add durability (write-ahead logging, snapshot + replay) to recover cleanly?
- How would you test the concurrent version — what would a stress or property-based test assert to catch oversold capacity?
- If planes belong to independent routes, how would you shard the scheduler across machines, and what new consistency problems appear for orders that could use planes on multiple shards?