Design a Prompt Playground

Last updated: Jul 1, 2026

Quick Overview

This system design question evaluates a candidate's ability to design a full-stack product around a streaming LLM inference service, covering gateway architecture, persistence, and real-time token delivery. It tests product sense alongside distributed-systems fundamentals, assessing how well someone reasons from user needs to a scalable, cost-aware architecture. This conceptual and practical design question is common in software engineering interviews for AI-product platforms.

Company: Anthropic

Role: Software Engineer

Category: System Design

Difficulty: medium

Interview Round: Onsite

# Design a Prompt Playground (ChatGPT / Claude Playground)

Design a "prompt playground" web product: an interactive surface where a developer types a prompt, tunes generation parameters (model, system prompt, temperature, max tokens, stop sequences), runs it against an LLM, and watches the response stream back token by token. Users can save, name, and re-open prompts, compare a couple of variations side by side, and copy the resulting API call.

Unlike a pure distributed-systems prompt, this question rewards **product sense** and **full-stack thinking**: reason from what the user needs, design the experience and the data behind it, then scale it.

### Constraints & Assumptions

- Primary users are developers; the product must feel instant and show output as it is generated (streaming), not after a multi-second wall of silence.
- The model itself is provided as an inference service exposing a streaming completion API; you are designing the product and platform around it, including a gateway, persistence, and the front end.
- Requests are bursty and long-lived: a single generation can stream for many seconds. Plan for growth from thousands to **millions** of concurrent users.
- Assume authentication, per-user rate limits / quotas, and cost accounting are required (LLM calls are expensive).

### Clarifying Questions to Ask

- Who is the user and what is the core job — quick experimentation, saving a prompt library, or prompt comparison/evaluation? That decides what to optimize first.
- Do we need multi-turn chat, or single-shot prompt → completion, or both?
- What transport is acceptable for streaming (Server-Sent Events vs. WebSocket) and what clients must we support?
- What are the latency targets (time-to-first-token vs. total time) and the per-user quota/cost limits?
- Is prompt/version history private per user, shareable, or collaborative in real time?

### Part 1 — Product and functional design

Define the product. List the core features and the primary user flow (open → edit prompt + params → run → stream → save/iterate). Specify the data model for a saved prompt and its versions, and the main API surface the front end calls. Make explicit which decisions come from *product sense* (what to build and in what order) versus raw engineering.

```hint Anchor on the job-to-be-done
Start from "a developer wants to try a prompt and immediately see and iterate on the result." Everything (params panel, streaming output, save/version, copy-as-code) follows from making that loop fast and repeatable.
```

```hint Data model
A `Prompt` with ordered `PromptVersion`s (text + params + model), each `Run` capturing inputs, output, token counts, latency, and cost — enough to reproduce and compare.
```

### Part 2 — Backend architecture and streaming

Design the request path from the browser to the model and back, with the response **streamed** to the client. Cover the gateway/API tier, how a generation request reaches the inference service, how tokens stream back (transport choice and why), and how prompts/versions/runs are persisted. Address auth, per-user rate limiting/quota, and cost accounting on the hot path.

```hint Streaming transport
Server-Sent Events over one long-lived HTTP response is the natural fit for unidirectional token streaming (simple, proxy-friendly, auto-reconnect); reach for WebSocket only if you need bidirectional/interactive control mid-generation.
```

```hint Keep the write path off the critical latency path
Persist run metadata and accumulate the streamed output asynchronously (e.g. write the final record when the stream completes) so logging/cost accounting never delays time-to-first-token.
```

### Part 3 — Scaling to millions and streaming reliability

Scale the design to millions of concurrent users while keeping streams stable. Address connection concurrency (many simultaneous long-lived streams), load balancing of long requests, protecting the expensive and capacity-limited inference backend (queueing/backpressure/admission), caching, and graceful degradation when the model tier is saturated or a stream breaks mid-generation.

```hint The scarce resource is the model tier
The bottleneck is GPU inference capacity, not web servers. Put a queue / admission control in front of the inference fleet, apply backpressure and per-user concurrency caps, and shed load with clear errors before the model tier melts down.
```

```hint Long-lived connections change the math
Millions of concurrent SSE streams means connection count, not CPU, can be the limit. Use many lightweight async gateway instances, sticky-enough routing for a stream's lifetime, and a resumable token offset so a dropped stream can reconnect without re-running the whole generation.
```

### Follow-up Questions

- How would you implement side-by-side comparison of two prompt versions running concurrently, including streaming both and aligning their results and costs?
- Caching is tricky for generative output (temperature > 0 makes responses non-deterministic). Where can caching still help (e.g. deterministic/`temperature=0` runs, embeddings, autocomplete) and where is it unsafe?
- How do you enforce fair per-user quotas and prevent a single user from monopolizing inference capacity with many concurrent long generations?
- A generation streams 800 tokens, then the connection drops at token 600. What does the client see, and how do you let it resume without paying to regenerate the first 600 tokens?
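
The Part 1 data-model hint (a `Prompt` with ordered `PromptVersion`s, each `Run` capturing inputs, output, token counts, latency, and cost) could be sketched roughly as follows. This is one possible in-memory shape, not a prescribed schema: the field names, the `GenerationParams` grouping, the default values, and the `save_version` / `latest` helpers are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class GenerationParams:
    # Hypothetical grouping of the tunable parameters named in the question.
    model: str
    system_prompt: str = ""
    temperature: float = 1.0
    max_tokens: int = 1024
    stop_sequences: tuple = ()


@dataclass(frozen=True)
class PromptVersion:
    number: int  # 1-based position in the prompt's history
    text: str
    params: GenerationParams


@dataclass
class Run:
    # One execution of a specific version, with enough data to compare runs.
    version_number: int
    output: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    cost_usd: float


@dataclass
class Prompt:
    name: str
    owner_id: str
    versions: List[PromptVersion] = field(default_factory=list)
    runs: List[Run] = field(default_factory=list)

    def save_version(self, text: str, params: GenerationParams) -> PromptVersion:
        # Versions are append-only, so an old run can always be reproduced.
        version = PromptVersion(len(self.versions) + 1, text, params)
        self.versions.append(version)
        return version

    def latest(self) -> Optional[PromptVersion]:
        return self.versions[-1] if self.versions else None
```

Making versions immutable and append-only is a design choice in this sketch: it keeps every `Run` tied to the exact text and parameters that produced it, which is what side-by-side comparison needs.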
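
For the streaming-transport hint and the last follow-up question (resuming after a dropped connection), a minimal sketch of Server-Sent Events framing with a resumable token offset might look like this. The `id:`, `event:`, and `data:` fields and the `Last-Event-ID` reconnect header are part of the SSE standard; everything else here is an assumption: the function names are hypothetical, each event id is taken to be the 0-based token offset, and the gateway is assumed to keep the tokens generated so far in a buffer it can replay from.

```python
from typing import Iterator, List, Optional


def format_sse(event_id: int, data: str, event: str = "token") -> str:
    """Frame one token as an SSE event; a blank line terminates the event."""
    lines = [f"id: {event_id}", f"event: {event}"]
    # Multi-line payloads need one "data:" field per line.
    lines.extend(f"data: {part}" for part in data.split("\n"))
    return "\n".join(lines) + "\n\n"


def resume_stream(
    buffered_tokens: List[str], last_event_id: Optional[int]
) -> Iterator[str]:
    """Replay buffered tokens the client has not yet acknowledged.

    `last_event_id` is the value of the Last-Event-ID header on reconnect,
    or None for a fresh stream. Nothing is regenerated; tokens are only
    re-sent from the buffer.
    """
    start = 0 if last_event_id is None else last_event_id + 1
    for offset in range(start, len(buffered_tokens)):
        yield format_sse(offset, buffered_tokens[offset])
```

Under these assumptions, a client that received events up to id 599 reconnects with `Last-Event-ID: 599`, and the gateway replays from offset 600 while the original generation keeps appending to the buffer.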
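
The Part 3 hint about admission control and per-user concurrency caps can be illustrated with a small single-process sketch. It is deliberately simplified: the class name, the limits, and the returned status strings are hypothetical, and a real deployment would need shared state across gateway instances and probably a queue rather than immediate rejection.

```python
from collections import defaultdict


class AdmissionController:
    """Caps concurrent generations globally and per user (illustrative only)."""

    def __init__(self, global_limit: int, per_user_limit: int):
        self.global_limit = global_limit
        self.per_user_limit = per_user_limit
        self.active_total = 0
        self.active_by_user = defaultdict(int)

    def try_admit(self, user_id: str) -> str:
        # Check the per-user cap first so one user cannot monopolize capacity.
        if self.active_by_user[user_id] >= self.per_user_limit:
            return "rejected_user_cap"
        # Shed load with a clear status once the model tier is saturated.
        if self.active_total >= self.global_limit:
            return "rejected_saturated"
        self.active_by_user[user_id] += 1
        self.active_total += 1
        return "admitted"

    def release(self, user_id: str) -> None:
        # Call when a stream completes, fails, or is cancelled.
        if self.active_by_user[user_id] > 0:
            self.active_by_user[user_id] -= 1
            self.active_total -= 1
```

Returning a distinct status for each rejection reason lets the gateway map them to different client-facing errors, for example a quota message for the user cap and a retry-later message for saturation.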

Jun 23, 2026, 12:00 AM