How do I approach System Design interview questions?

System Design questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master system design interviews.

What difficulty level is this interview question?

This is a medium difficulty System Design question, commonly asked during Onsite rounds at Anthropic.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Anthropic during technical interviews.

Design a Prompt Playground | Anthropic Interview Question

Q: Design a Prompt Playground

This system design question evaluates a candidate's ability to design a full-stack product around a streaming LLM inference service, covering gateway architecture, persistence, and real-time token delivery. It tests product sense alongside distributed-systems fundamentals, assessing how well someone reasons from user needs to a scalable, cost-aware architecture. This conceptual and practical design question is common in software engineering interviews for AI-product platforms.

Design a Prompt Playground (ChatGPT / Claude Playground)

Design a "prompt playground" web product: an interactive surface where a developer types a prompt, tunes generation parameters (model, system prompt, temperature, max tokens, stop sequences), runs it against an LLM, and watches the response stream back token by token. Users can save, name, and re-open prompts, compare two or three variations side by side, and copy the resulting API call. Unlike a pure distributed-systems question, this one rewards product sense and full-stack thinking: reason from what the user needs, design the experience and the data behind it, then scale it.

Constraints & Assumptions

Primary users are developers; the product must feel instant and show output as it is generated (streaming), not after a multi-second wall of silence.
The model itself is provided as an inference service exposing a streaming completion API; you are designing the product and platform around it, including a gateway, persistence, and the front end.
Requests are bursty and long-lived: a single generation can stream for many seconds. Plan for growth from thousands to millions of concurrent users.
Assume authentication, per-user rate limits / quotas, and cost accounting are required (LLM calls are expensive).

Clarifying Questions to Ask

Who is the user and what is the core job — quick experimentation, saving a prompt library, or prompt comparison/evaluation? That decides what to optimize first.
Do we need multi-turn chat, or single-shot prompt → completion, or both?
What transport is acceptable for streaming (Server-Sent Events vs. WebSocket) and what clients must we support?
What are the latency targets (time-to-first-token vs. total time) and the per-user quota/cost limits?
Is prompt/version history private per user, shareable, or collaborative in real time?

Part 1 — Product and functional design

Define the product. List the core features and the primary user flow (open → edit prompt + params → run → stream → save/iterate). Specify the data model for a saved prompt and its versions, and the main API surface the front end calls. Make explicit which decisions come from product sense (what to build and in what order) versus raw engineering.
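
One possible shape for the saved-prompt data model is sketched below. It is an illustration only, not a reference answer: the class and field names (`GenerationParams`, `PromptVersion`, `SavedPrompt`, `save_version`) are hypothetical, and the sketch assumes versions are immutable and append-only so that any earlier run can be reproduced from the exact text and parameters that produced it.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class GenerationParams:
    """Parameters the user tunes in the playground (names are assumptions)."""
    model: str
    system_prompt: str = ""
    temperature: float = 1.0
    max_tokens: int = 1024
    stop_sequences: Tuple[str, ...] = ()


@dataclass(frozen=True)
class PromptVersion:
    """One immutable snapshot of prompt text plus parameters."""
    version: int  # 1-based, increases by one per save
    text: str
    params: GenerationParams


@dataclass
class SavedPrompt:
    """A named prompt owned by one user, with its version history."""
    prompt_id: str
    owner_id: str
    name: str
    versions: List[PromptVersion] = field(default_factory=list)

    def save_version(self, text: str, params: GenerationParams) -> PromptVersion:
        # Append a new snapshot; existing versions are never modified.
        snapshot = PromptVersion(len(self.versions) + 1, text, params)
        self.versions.append(snapshot)
        return snapshot

    def latest(self) -> Optional[PromptVersion]:
        # Return the most recent snapshot, or None if nothing was saved yet.
        return self.versions[-1] if self.versions else None
```

Keeping versions append-only makes side-by-side comparison straightforward, because each run can point at a `(prompt_id, version)` pair that never changes.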

Part 2 — Backend architecture and streaming

Design the request path from the browser to the model and back, with the response streamed to the client. Cover the gateway/API tier, how a generation request reaches the inference service, how tokens stream back (transport choice and why), and how prompts/versions/runs are persisted. Address auth, per-user rate limiting/quota, and cost accounting on the hot path.
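
If Server-Sent Events is the chosen transport, the gateway's framing of tokens might look like the sketch below. The function names (`sse_event`, `stream_generation`) and the `token` / `done` event names are assumptions made for illustration; only the `id:` / `event:` / `data:` line format, with a blank line ending each event, comes from the SSE wire format itself.

```python
import json
from typing import Iterable, Iterator


def sse_event(event_id: int, event: str, payload: dict) -> str:
    """Serialize one Server-Sent Events frame.

    JSON-encoding the payload keeps it on a single `data:` line,
    because json.dumps escapes any newline inside a token.
    """
    return f"id: {event_id}\nevent: {event}\ndata: {json.dumps(payload)}\n\n"


def stream_generation(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap a token iterator (standing in for the inference stream) in SSE frames."""
    count = 0
    for count, tok in enumerate(tokens, start=1):
        # Each token carries a sequence number as its event id.
        yield sse_event(count, "token", {"text": tok})
    # The final frame reports the token count, which cost accounting could record.
    yield sse_event(count + 1, "done", {"output_tokens": count})
```

Numbering each frame gives the client a position to report on reconnect, and the closing `done` frame is one natural place to attach usage figures for quota and cost tracking.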

Part 3 — Scaling to millions and streaming reliability

Scale the design to millions of concurrent users while keeping streams stable. Address connection concurrency (many simultaneous long-lived streams), load balancing of long requests, protecting the expensive and capacity-limited inference backend (queueing/backpressure/admission), caching, and graceful degradation when the model tier is saturated or a stream breaks mid-generation.
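
Admission control in front of the inference tier can be illustrated with a minimal in-memory sketch. Everything here is an assumption for the sake of the example: the class name `AdmissionController`, the two limits, and the three result strings. A real deployment would presumably keep these counters in a shared store and make the check atomic, which this single-process sketch does not attempt.

```python
from collections import defaultdict


class AdmissionController:
    """Caps concurrent generations globally and per user (single-process sketch)."""

    def __init__(self, global_limit: int, per_user_limit: int) -> None:
        self.global_limit = global_limit
        self.per_user_limit = per_user_limit
        self.active_total = 0
        self.active_by_user = defaultdict(int)

    def try_admit(self, user_id: str) -> str:
        """Return 'admitted', 'user_limit', or 'saturated'."""
        # Per-user cap first, so one user cannot monopolize capacity.
        if self.active_by_user[user_id] >= self.per_user_limit:
            return "user_limit"
        # Global cap protects the capacity-limited inference backend.
        if self.active_total >= self.global_limit:
            return "saturated"
        self.active_by_user[user_id] += 1
        self.active_total += 1
        return "admitted"

    def release(self, user_id: str) -> None:
        """Call when a stream finishes or its connection closes."""
        if self.active_by_user[user_id] > 0:
            self.active_by_user[user_id] -= 1
            self.active_total -= 1
```

Distinguishing `user_limit` from `saturated` lets the gateway respond differently: the first is a quota message for that user, the second is a signal to queue the request or degrade gracefully.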

Follow-up Questions

How would you implement side-by-side comparison of two prompt versions running concurrently, including streaming both and aligning their results and costs?
Caching is tricky for generative output (temperature > 0 makes responses non-deterministic). Where can caching still help (e.g. deterministic temperature=0 runs, embeddings, autocomplete) and where is it unsafe?
How do you enforce fair per-user quotas and prevent a single user from monopolizing inference capacity with many concurrent long generations?
A generation will produce 800 tokens in total, but the client's connection drops at token 600. What does the client see, and how do you let it resume without paying to regenerate the first 600 tokens?
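
For the resume question, one direction a candidate might explore is buffering generated tokens on the server under a run identifier, so a reconnecting client only asks for what it missed. The sketch below is a hypothetical illustration: `GenerationBuffer`, `append`, and `replay_after` are invented names, and it assumes generation continues server-side after the client disconnects. With SSE, the last sequence number received could arrive in the `Last-Event-ID` header that browsers send when an `EventSource` reconnects.

```python
from typing import Dict, List, Tuple


class GenerationBuffer:
    """Server-side record of the tokens already produced for each run."""

    def __init__(self) -> None:
        self._tokens: Dict[str, List[str]] = {}

    def append(self, run_id: str, token: str) -> int:
        """Store a token and return its 1-based sequence number."""
        buf = self._tokens.setdefault(run_id, [])
        buf.append(token)
        return len(buf)

    def replay_after(self, run_id: str, last_seen: int) -> List[Tuple[int, str]]:
        """Return (sequence, token) pairs the client has not received yet."""
        buf = self._tokens.get(run_id, [])
        start = max(last_seen, 0)
        # Slicing skips everything up to and including `last_seen`.
        return list(enumerate(buf[start:], start=start + 1))
```

A client that last saw token 600 would call `replay_after(run_id, 600)` and receive tokens 601 onward from the buffer, with no second inference call for the first 600.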
