Design an Online Coding Judge Platform
Company: Google
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Onsite
## Design an Online Coding Judge Platform
Design an online coding-practice and judging platform (similar to LeetCode or Codeforces).
Users should be able to browse a catalog of programming problems, write and submit source code in multiple languages, and have that code compiled and executed against both **sample** (visible) and **hidden** test cases. For each submission the system returns a verdict — `Accepted`, `Wrong Answer`, `Compile Error`, `Runtime Error`, `Memory Limit Exceeded`, or `Time Limit Exceeded` — along with runtime and memory usage, and users can review their full submission history.
The submitted code is **untrusted**, so secure isolation of execution is a first-class requirement, not an afterthought. Produce an end-to-end design that covers: functional and non-functional requirements, an MVP, capacity estimates with a bottleneck call-out, the API surface, the data model, the high-level architecture and submission flow, the sandboxing/security model, how judging scales from one machine to many, and the key trade-offs (consistency, latency, cost, operational complexity). Expect the interviewer to keep changing requirements and to probe **where the bottleneck is** — be ready to argue whether the system is CPU-, memory-, disk-, network-, or queue-bound.
```hint Where to start
Split the system into a **read plane** (browse problems — cacheable, low-latency) and a **compute plane** (judge untrusted code — the real engineering problem). Almost all the difficulty lives in the compute plane; scope the read plane fast and spend your depth on judging.
```
```hint Decouple ingestion from execution
Think about what a *synchronous* "submit → run → return" call does under a 2,000/sec spike, when each run takes a couple of seconds. What can sit between accepting a submission and executing it so ingestion stays fast while execution scales on its own clock? How does the client learn the final verdict if the submit call no longer waits for it?
```
```hint Sizing the judge fleet
Put numbers on it. From your peak arrival rate and average judge time, what does that tell you about how much work is in flight simultaneously — and therefore how many machines you need once you decide how many sandboxes one machine can safely host? Now do the same sizing for the database write path. Comparing the two should tell you which side of the system to design around.
```
```hint Isolation is the hard part
Where do you sit on the spectrum from cheap-but-shared-kernel to expensive-but-strongly-isolated? Name the mechanisms at each end and pick a default you can defend on cost vs. blast radius. Whatever you pick, enumerate the threats it has to stop: network exfiltration, fork bombs, disk fill, CPU/wall-clock overrun, breakout to the host.
```
### Constraints & Assumptions
State your own numbers, but a reasonable working set is:
- ~1M registered users; ~100K daily active users.
- Steady-state submissions are modest, but **contest / recruiting spikes** drive peaks of ~2,000 submissions/sec.
- Per-problem **time limit** (e.g. 1–10 s wall/CPU) and **memory limit** (e.g. 256 MB) are configurable; the judge must enforce both.
- Supported languages include at least C++, Java, Python, and Go (compiled and interpreted).
- Submitted code is **untrusted and adversarial** — assume it will attempt to fork-bomb, fill the disk, open sockets, read another problem's hidden tests, or escape the sandbox.
- Submission verdicts may be **eventually consistent** — a few seconds of latency to a final verdict is acceptable; strict synchronous completion is not required.
- Problem statements are read-heavy and highly cacheable; submissions are write/compute-heavy.
### Clarifying Questions to Ask
- Is this for everyday practice, timed competitive contests, or recruiting assessments? (Determines the spike profile and fairness/anti-cheat needs.)
- Do we need interactive/streaming problems and custom/special judges (floating-point tolerance, multiple correct answers), or only batch "run all test cases" problems?
- How large can a single problem's hidden test set be (KB vs. hundreds of MB)? This drives the test-data distribution strategy.
- Are partial scores per test group required, or is it all-or-nothing per submission?
- What languages and runtime versions must we support on day one?
- What is the acceptable latency from "submit" to "final verdict" under normal load vs. peak, and must sample runs feel near-instant?
### What a Strong Answer Covers
- **Requirements framing**: clear functional vs. non-functional split, with untrusted-code isolation called out as a top non-functional driver.
- **Async architecture**: submission API → queue → judge worker pool, with status polling/push and a clear submission flow.
- **Back-of-envelope sizing**: numeric estimates for both the compute side and the storage/write side, leading to a defended call on which resource is the binding constraint — and the awareness to re-argue it when load shifts to memory, disk, or network.
- **Sandboxing depth**: concrete isolation mechanisms (no network, read-only FS, cgroup CPU/memory caps, process/thread limits, wall + CPU timeouts, unprivileged user, ephemeral environment) and the trade-off between plain containers and stronger boundaries (microVM / syscall-intercepting runtime).
- **Data model & API** that separate problem metadata, large test data (object storage), and submission records (partitionable).
- **Scaling strategy** per plane: cache/CDN for reads; queue-depth autoscaling, test-data caching, and per-resource-class worker pools for judging.
- **Reliability**: idempotent jobs, worker heartbeats/requeue, a submission state machine, retry-on-infra-failure-only.
- **Explicit trade-offs** with a defended default choice, plus what to observe (queue depth, judge latency, sandbox-escape alerts).
- **Communication**: scopes quickly, surfaces 2–3 alternatives with pros/cons, and adapts as requirements change instead of monologuing.
### Follow-up Questions
- How do you distribute and cache **hundreds of MB of hidden test data** so workers don't re-download it on every run, while still allowing test-set updates?
- A submission gets `Accepted` on a worker but the result write fails before it's persisted. Walk through how your design avoids both a lost verdict and a double-judge.
- During a contest, one problem with a 10-second time limit starves all the fast jobs in a shared queue. How do you keep short submissions responsive?
- How would you make judging **deterministic and fair** (consistent CPU-time measurement across heterogeneous hardware, anti-cheat, reproducible verdicts)?
Quick Answer: This question evaluates a candidate's competencies in large-scale system design, secure multi-tenant execution and sandboxing, capacity planning, queuing and autoscaling, and trade-off reasoning across performance, cost, and operational complexity.