Design a Prompt Playground
Company: Anthropic
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Onsite
# Design a Prompt Playground (ChatGPT / Claude Playground)
Design a "prompt playground" web product: an interactive surface where a developer types a prompt, tunes generation parameters (model, system prompt, temperature, max tokens, stop sequences), runs it against an LLM, and watches the response stream back token by token. Users can save, name, and re-open prompts, compare two or more variations side by side, and copy the resulting API call. Unlike a pure distributed-systems question, this one rewards **product sense** and **full-stack thinking**: reason from what the user needs, design the experience and the data behind it, then scale it.
### Constraints & Assumptions
- Primary users are developers; the product must feel instant and show output as it is generated (streaming), not after a multi-second wall of silence.
- The model itself is provided as an inference service exposing a streaming completion API; you are designing the product and platform around it, including a gateway, persistence, and the front end.
- Requests are bursty and long-lived: a single generation can stream for many seconds. Plan for growth from thousands to **millions** of concurrent users.
- Assume authentication, per-user rate limits / quotas, and cost accounting are required (LLM calls are expensive).
### Clarifying Questions to Ask
- Who is the user and what is the core job — quick experimentation, saving a prompt library, or prompt comparison/evaluation? That decides what to optimize first.
- Do we need multi-turn chat, or single-shot prompt → completion, or both?
- What transport is acceptable for streaming (Server-Sent Events vs. WebSocket) and what clients must we support?
- What are the latency targets (time-to-first-token vs. total time) and the per-user quota/cost limits?
- Is prompt/version history private per user, shareable, or collaborative in real time?
### Part 1 — Product and functional design
Define the product. List the core features and the primary user flow (open → edit prompt + params → run → stream → save/iterate). Specify the data model for a saved prompt and its versions, and the main API surface the front end calls. Make explicit which decisions come from *product sense* (what to build and in what order) versus raw engineering.
```hint Anchor on the job-to-be-done
Start from "a developer wants to try a prompt and immediately see and iterate on the result." Everything (params panel, streaming output, save/version, copy-as-code) follows from making that loop fast and repeatable.
```
```hint Data model
A `Prompt` with ordered `PromptVersion`s (text + params + model), each `Run` capturing inputs, output, token counts, latency, and cost — enough to reproduce and compare.
```
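One way to make the data model hint concrete is the minimal sketch below. The field names, the `GenerationParams` grouping, and the `save_version` helper are illustrative assumptions, not a prescribed schema; a real design would back these with database tables and indexes.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class GenerationParams:
    """Tunable parameters from the playground's params panel."""
    model: str
    system_prompt: str = ""
    temperature: float = 1.0
    max_tokens: int = 1024
    stop_sequences: tuple[str, ...] = ()


@dataclass(frozen=True)
class PromptVersion:
    """Immutable snapshot of prompt text plus the params it was saved with."""
    number: int  # 1-based position within the parent Prompt
    text: str
    params: GenerationParams


@dataclass
class Run:
    """One execution of a version: enough to reproduce and compare."""
    version_number: int
    output: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    cost_usd: float


@dataclass
class Prompt:
    prompt_id: str
    owner_id: str
    name: str
    versions: list[PromptVersion] = field(default_factory=list)
    runs: list[Run] = field(default_factory=list)

    def save_version(self, text: str, params: GenerationParams) -> PromptVersion:
        """Append a new version; earlier versions are never edited in place."""
        version = PromptVersion(number=len(self.versions) + 1, text=text, params=params)
        self.versions.append(version)
        return version
```

Keeping versions immutable and append-only means a `Run` can always point back at the exact text and parameters that produced it, which is what makes side-by-side comparison meaningful.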
### Part 2 — Backend architecture and streaming
Design the request path from the browser to the model and back, with the response **streamed** to the client. Cover the gateway/API tier, how a generation request reaches the inference service, how tokens stream back (transport choice and why), and how prompts/versions/runs are persisted. Address auth, per-user rate limiting/quota, and cost accounting on the hot path.
```hint Streaming transport
Server-Sent Events over one long-lived HTTP response is the natural fit for unidirectional token streaming (simple, proxy-friendly, auto-reconnect); reach for WebSocket only if you need bidirectional/interactive control mid-generation.
```
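To illustrate what the gateway writes on an SSE response, here is a small sketch of the wire framing. The `id`, `event`, and `data` fields and the blank-line terminator are part of the SSE format; the event names (`token`, `done`) and the JSON payload shape are assumptions made for this example.

```python
import json


def sse_event(event_id: int, data: dict, event: str = "token") -> str:
    """Frame one SSE message: id/event/data fields, ended by a blank line."""
    # json.dumps escapes newlines inside strings, so the payload stays on one data line.
    payload = json.dumps(data, separators=(",", ":"))
    return f"id: {event_id}\nevent: {event}\ndata: {payload}\n\n"


def stream_tokens(tokens):
    """Yield one SSE frame per token, then a final 'done' event."""
    count = 0
    for count, token in enumerate(tokens, start=1):
        yield sse_event(count, {"text": token})
    yield sse_event(count + 1, {"tokens": count}, event="done")
```

Giving every frame a monotonically increasing `id` costs almost nothing and lets a reconnecting client report the last frame it received.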
```hint Keep the write path off the critical latency path
Persist run metadata and accumulate the streamed output asynchronously (e.g. write the final record when the stream completes) so logging/cost accounting never delays time-to-first-token.
```
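The idea in this hint can be sketched with `asyncio`: forward each token to the client as soon as it arrives, accumulate the output in memory, and write the run record only after the stream ends. `fake_model_stream` and `persist_run` are hypothetical stand-ins for the inference service and the database, and the record fields are assumptions.

```python
import asyncio
import time


async def fake_model_stream(prompt: str):
    """Hypothetical stand-in for the inference service's streaming API."""
    for word in prompt.split():
        await asyncio.sleep(0)
        yield word + " "


async def persist_run(store: list, record: dict) -> None:
    """Hypothetical stand-in for a database write; appends to a list."""
    await asyncio.sleep(0)
    store.append(record)


async def run_generation(prompt: str, send, store: list) -> asyncio.Task:
    """Forward tokens immediately; persist the run only after the stream ends."""
    started = time.monotonic()
    chunks = []
    async for token in fake_model_stream(prompt):
        await send(token)  # the client sees the token before any persistence work
        chunks.append(token)
    record = {
        "output": "".join(chunks),
        "output_tokens": len(chunks),
        "latency_ms": int((time.monotonic() - started) * 1000),
    }
    # Schedule the write in the background so the response can close without waiting.
    return asyncio.create_task(persist_run(store, record))


async def demo():
    sent, store = [], []

    async def send(token: str) -> None:
        sent.append(token)

    task = await run_generation("a b c", send, store)
    await task  # a real server would keep a reference and handle write failures
    return sent, store
```

The trade-off is durability: if the gateway process dies mid-stream, the partial run is lost unless tokens are also checkpointed somewhere, which is a decision worth stating explicitly in the interview.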
### Part 3 — Scaling to millions and streaming reliability
Scale the design to millions of concurrent users while keeping streams stable. Address connection concurrency (many simultaneous long-lived streams), load balancing of long requests, protecting the expensive and capacity-limited inference backend (queueing/backpressure/admission), caching, and graceful degradation when the model tier is saturated or a stream breaks mid-generation.
```hint The scarce resource is the model tier
The bottleneck is GPU inference capacity, not web servers. Put a queue / admission control in front of the inference fleet, apply backpressure and per-user concurrency caps, and shed load with clear errors before the model tier melts down.
```
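A minimal sketch of admission control in front of the inference fleet, assuming a single in-process controller with a global in-flight limit, a bounded queue, and a per-user concurrency cap. The class, its limits, and the suggested HTTP status codes are illustrative assumptions; a production version would need shared state across gateway instances.

```python
from collections import defaultdict


class AdmissionController:
    """Admit a generation only if global and per-user limits allow it."""

    def __init__(self, max_in_flight: int, max_queued: int, per_user_cap: int):
        self.max_in_flight = max_in_flight
        self.max_queued = max_queued
        self.per_user_cap = per_user_cap
        self.in_flight = 0
        self.queued = 0
        self.per_user = defaultdict(int)  # running + queued requests per user

    def try_admit(self, user_id: str) -> str:
        """Return 'run', 'queue', or 'reject' for a new request."""
        if self.per_user[user_id] >= self.per_user_cap:
            return "reject"  # e.g. HTTP 429: user is at their concurrency cap
        if self.in_flight < self.max_in_flight:
            self.in_flight += 1
            self.per_user[user_id] += 1
            return "run"
        if self.queued < self.max_queued:
            self.queued += 1
            self.per_user[user_id] += 1
            return "queue"
        return "reject"  # e.g. HTTP 503 with Retry-After: shed load early

    def release(self, user_id: str) -> None:
        """Call when a running generation finishes; promotes one queued request."""
        self.per_user[user_id] -= 1
        if self.queued > 0:
            self.queued -= 1  # a queued request takes over the freed slot
        else:
            self.in_flight -= 1
```

Bounding the queue matters as much as having one: an unbounded queue turns overload into unbounded latency, whereas a fast, explicit rejection lets the client back off and retry.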
```hint Long-lived connections change the math
Millions of concurrent SSE streams means connection count, not CPU, can be the limit. Use many lightweight async gateway instances, sticky-enough routing for a stream's lifetime, and a resumable token offset so a dropped stream can reconnect without re-running the whole generation.
```
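The resumable token offset can be sketched as a per-generation buffer keyed by offset. This assumes the gateway (or a shared store behind it) keeps generated tokens for a short time, and that the client reports the last offset it received, for example through the SSE `Last-Event-ID` request header that browsers send on reconnect. The class and method names are hypothetical.

```python
class ResumableStream:
    """Buffer generated tokens by offset so a reconnecting client can catch up."""

    def __init__(self):
        self.tokens = []
        self.done = False

    def append(self, token: str) -> int:
        """Record a token from the model and return its 1-based offset."""
        self.tokens.append(token)
        return len(self.tokens)

    def finish(self) -> None:
        """Mark the generation as complete."""
        self.done = True

    def replay_from(self, last_seen_offset: int):
        """Return (offset, token) pairs the client has not yet received."""
        start = max(0, last_seen_offset)
        return [
            (index + 1, token)
            for index, token in enumerate(self.tokens[start:], start=start)
        ]
```

On reconnect, the gateway would replay `replay_from(last_seen_offset)` and then continue with live tokens, so the model keeps generating once and no tokens are paid for twice.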
### Follow-up Questions
- How would you implement side-by-side comparison of two prompt versions running concurrently, including streaming both and aligning their results and costs?
- Caching is tricky for generative output (temperature > 0 makes responses non-deterministic). Where can caching still help (e.g. deterministic/`temperature=0` runs, embeddings, autocomplete) and where is it unsafe?
- How do you enforce fair per-user quotas and prevent a single user from monopolizing inference capacity with many concurrent long generations?
- A generation is expected to produce 800 tokens, but the connection drops at token 600. What does the client see, and how do you let it resume without paying to regenerate the first 600 tokens?
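As a starting point for the caching question above, here is a hedged sketch of a cache-key function that only marks a request as cacheable when `temperature` is 0. The function name and key layout are assumptions, and treating `temperature=0` as fully deterministic is itself an assumption: real inference stacks can still produce small variations.

```python
import hashlib
import json
from typing import List, Optional


def cache_key(model: str, system_prompt: str, prompt: str,
              temperature: float, max_tokens: int,
              stop_sequences: List[str]) -> Optional[str]:
    """Return a cache key for deterministic requests, or None if caching is unsafe."""
    if temperature != 0:
        return None  # sampled output varies run to run; never serve a cached copy
    # Canonical JSON so the same inputs always hash to the same key.
    canonical = json.dumps(
        {
            "model": model,
            "system": system_prompt,
            "prompt": prompt,
            "max_tokens": max_tokens,
            "stop": stop_sequences,
        },
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Every input that can change the output belongs in the key, including the model identifier, so that a model upgrade does not silently serve stale completions.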
Quick Answer: This system design question evaluates a candidate's ability to design a full-stack product around a streaming LLM inference service, covering gateway architecture, persistence, and real-time token delivery. It tests product sense alongside distributed-systems fundamentals, assessing how well someone reasons from user needs to a scalable, cost-aware architecture. This conceptual and practical design question is common in software engineering interviews for AI-product platforms.