LLM Chat Product Architecture

What's being tested

You’re being tested on end-to-end architecture for an LLM chat product, not on how transformers work. A strong answer shows you can design a reliable, secure, multi-tenant web application that streams model output, persists conversation state, supports sharing/versioning, and handles failures gracefully. OpenAI cares because the product surface around the model is a distributed system: latency, state consistency, auth, abuse controls, and UX recovery all matter as much as calling the model. The interviewer is probing whether you can separate chat orchestration, storage, streaming transport, client state, and model-provider integration into clean responsibilities.

Core knowledge

Streaming response architecture is central. Most chat UIs should start rendering tokens before the full completion is available, using Server-Sent Events (SSE), WebSocket, or a streaming fetch response. SSE is simple for one-way token streams; WebSocket is better for bidirectional collaboration or multiplexed sessions.
Request lifecycle design should be explicit: client sends user message, backend authenticates, persists or stages the message, calls the LLM provider, streams chunks back, assembles the final assistant message, then commits final state. Treat partial output as a distinct state from completed output.
Conversation data modeling usually separates conversations, messages, and message_parts. A minimal schema has conversation_id, tenant_id, user_id, role, content, created_at, parent_message_id, status, and model_config. For branching or snapshots, avoid assuming a conversation is a simple append-only list.
Snapshots and sharing require versioned, immutable records. A good design stores a snapshot pointing to a stable set of message versions rather than sharing the live conversation. This prevents “I shared answer X, but later edits changed what the recipient sees,” which is a serious trust and privacy bug.
Context-window management is a product-engineering concern even in a SWE interview. The backend should build the prompt from selected prior messages, pinned system instructions, files, and presets while respecting a token budget. A common heuristic: reserve room for output tokens, include the latest turns first, and summarize or drop older turns as the prompt nears the limit.
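
The token-budget heuristic can be sketched as follows. This is a minimal illustration, not a production prompt builder: `Turn`, `estimate_tokens`, and `build_prompt` are hypothetical names, and the four-characters-per-token estimate is a crude assumption standing in for the provider's real tokenizer. Older turns are simply dropped here; summarizing them is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    role: str
    content: str


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. Use the provider's tokenizer in practice.
    return max(1, len(text) // 4)


def build_prompt(system_prompt: str, turns: List[Turn],
                 context_limit: int, reserved_output: int) -> List[Turn]:
    # Reserve room for the reply and the pinned system prompt before adding history.
    budget = context_limit - reserved_output - estimate_tokens(system_prompt)
    if budget < 0:
        raise ValueError("system prompt and output reservation exceed the context limit")

    selected: List[Turn] = []
    for turn in reversed(turns):  # newest turns first
        cost = estimate_tokens(turn.content)
        if cost > budget:
            break  # drop this turn and everything older
        selected.append(turn)
        budget -= cost

    selected.reverse()  # restore chronological order
    return [Turn("system", system_prompt)] + selected
```

Stopping at the first turn that does not fit keeps the retained history contiguous, which is usually easier to reason about than a prompt with gaps in the middle.
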
Presets should be versioned configuration objects, not copied free-form strings everywhere. Store fields like preset_id, version, system_prompt, temperature, tools_enabled, and model. Existing conversations should reference the preset version they used so later preset edits do not silently alter old behavior.
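
One way to picture versioned presets is an append-only registry in which every edit publishes a new immutable version. The sketch below is illustrative only: `PresetRegistry`, `publish`, and `get` are hypothetical names, and an in-memory dict stands in for a database table.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class PresetVersion:
    preset_id: str
    version: int
    system_prompt: str
    temperature: float
    tools_enabled: Tuple[str, ...]
    model: str


class PresetRegistry:
    def __init__(self) -> None:
        self._versions: Dict[str, List[PresetVersion]] = {}

    def publish(self, preset_id: str, system_prompt: str, temperature: float,
                tools_enabled: Iterable[str], model: str) -> PresetVersion:
        # Editing a preset never mutates an old version; it appends a new one.
        history = self._versions.setdefault(preset_id, [])
        preset = PresetVersion(preset_id, len(history) + 1, system_prompt,
                               temperature, tuple(tools_enabled), model)
        history.append(preset)
        return preset

    def get(self, preset_id: str, version: int) -> PresetVersion:
        # Versions are 1-based and stored in publish order.
        return self._versions[preset_id][version - 1]
```

A conversation would then store the pair (`preset_id`, `version`) rather than a copy of the prompt text, so it keeps resolving to the configuration it was created with.
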
Multi-tenancy must be enforced at every layer. Include tenant_id in primary data access paths, authorization checks, rate limits, audit logs, and search filters. Do not rely only on frontend filtering; every backend query should be scoped to the authenticated principal and tenant.
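
As a small illustration of backend-enforced scoping, the sketch below puts `tenant_id` and `user_id` in the WHERE clause of every read instead of filtering after the fact. The table layout and the helper names `open_db` and `get_conversation` are assumptions for this example, and in-memory SQLite stands in for the real database.

```python
import sqlite3
from typing import Optional, Tuple


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE conversations ("
        "id TEXT PRIMARY KEY, tenant_id TEXT NOT NULL, "
        "user_id TEXT NOT NULL, title TEXT NOT NULL)"
    )
    return db


def get_conversation(db: sqlite3.Connection, tenant_id: str, user_id: str,
                     conversation_id: str) -> Optional[Tuple[str, str]]:
    # The tenant and user come from the authenticated session, never from the request body.
    return db.execute(
        "SELECT id, title FROM conversations "
        "WHERE id = ? AND tenant_id = ? AND user_id = ?",
        (conversation_id, tenant_id, user_id),
    ).fetchone()
```

A caller from another tenant who guesses a valid conversation ID gets `None`, the same result as for an ID that does not exist, so the response does not reveal whether the record is there.
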
Secret management is non-negotiable. Browser clients should not hold long-lived provider API keys. In server-backed designs, the backend acts as a relay and stores secrets in a managed system such as AWS Secrets Manager, GCP Secret Manager, or Vault; in browser-only exercises, use user-provided keys or short-lived delegated tokens and explain the limitations.
Rate limiting and abuse controls should exist before the model call. Use per-user, per-tenant, and per-IP limits with token-bucket or leaky-bucket algorithms. Limit both request count and estimated token usage, because LLM cost scales approximately with input_tokens + output_tokens (providers often price input and output tokens at different rates).
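
A token bucket that charges by estimated token usage rather than by request count might look like the sketch below. The class name, parameters, and the explicit `now` argument (added so the behavior is deterministic in tests) are assumptions for illustration; a real deployment would typically keep this state in a shared store, one bucket per user, tenant, or IP.

```python
import time
from typing import Optional


class TokenBucket:
    def __init__(self, capacity: float, refill_per_second: float,
                 now: Optional[float] = None) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.level = capacity  # start full
        self.updated_at = time.monotonic() if now is None else now

    def try_consume(self, cost: float, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.level = min(self.capacity, self.level + elapsed * self.refill_per_second)
        self.updated_at = now
        if cost <= self.level:
            self.level -= cost
            return True
        return False  # caller should reject the request before calling the model
```

Passing the estimated `input_tokens + output_tokens` as `cost` means one very large prompt drains the bucket as much as many small ones, which tracks spend more closely than counting requests alone.
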
Idempotency matters because chat submissions are easy to double-send on refresh or network retry. Use a client-generated idempotency_key for each user message so the backend can return the same in-flight or completed response rather than creating duplicate assistant messages.
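
The idempotency check can be sketched with an in-memory map keyed by conversation and client key. `ChatService` and `submit` are hypothetical names, and a real service would more likely back this with a unique database constraint on the same pair so that concurrent retries are also deduplicated.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class AssistantMessage:
    id: str
    status: str = "streaming"
    content: str = ""


class ChatService:
    def __init__(self) -> None:
        # (conversation_id, idempotency_key) -> the response created for that submission
        self._by_key: Dict[Tuple[str, str], AssistantMessage] = {}

    def submit(self, conversation_id: str, idempotency_key: str,
               user_text: str) -> AssistantMessage:
        key = (conversation_id, idempotency_key)
        existing = self._by_key.get(key)
        if existing is not None:
            # A retry or double-send: return the in-flight or completed response.
            return existing
        message = AssistantMessage(id=str(uuid.uuid4()))
        self._by_key[key] = message
        # The model call for user_text would start here; omitted in this sketch.
        return message
```
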
Search and retrieval over chat history can start with Postgres full-text search or OpenSearch for metadata/content search. Keep the SWE answer focused on indexing fields, access control filters, pagination, and freshness; avoid drifting into embedding-model quality unless the interviewer asks.
Failure handling should distinguish user-visible states: queued, streaming, completed, cancelled, failed, and partial. If the model stream drops after 300 tokens, the UI should show a recoverable partial answer and allow “retry from here” rather than corrupting the conversation.
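
The lifecycle states above can be made explicit as a small state machine. The state names come from the list above, but the allowed transitions in `ALLOWED` are an assumption chosen for this sketch (for example, that a partial answer can resume streaming on "retry from here"), and `AssistantResponse` is a hypothetical class.

```python
ALLOWED = {
    "queued": {"streaming", "cancelled", "failed"},
    "streaming": {"completed", "cancelled", "partial", "failed"},
    "partial": {"streaming"},   # "retry from here"
    "failed": {"queued"},       # full retry
    "completed": set(),         # terminal
    "cancelled": set(),         # terminal
}


class AssistantResponse:
    def __init__(self) -> None:
        self.status = "queued"
        self.content = ""

    def transition(self, new_status: str) -> None:
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        self.status = new_status

    def append_chunk(self, text: str) -> None:
        if self.status != "streaming":
            raise ValueError("chunks are only accepted while streaming")
        self.content += text

    def stream_dropped(self) -> None:
        # Keep whatever arrived: some text means "partial", none means "failed".
        self.transition("partial" if self.content else "failed")
```

Rejecting illegal transitions on the server is what keeps a late chunk or a duplicate completion event from overwriting a message that has already been finalized.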

Worked example

For “Design a GPT chat UI with snapshots and sharing”, a strong candidate starts by clarifying scope: “Is this a web app? Do we need real-time collaboration or just one-user chat? Are shared snapshots public links, workspace-only links, or permissioned documents? Should shared views update live or remain immutable?” Then they declare assumptions: multi-tenant web app, authenticated users, streaming assistant responses, persisted history, immutable share links.

The answer can be organized around four pillars: frontend interaction model, backend chat orchestration, data model, and sharing/security. For the frontend, describe a React or SPA client that optimistically renders the user message, opens an SSE stream, appends token chunks, supports cancellation via AbortController, and reconciles final server state when completion ends. For the backend, describe an authenticated POST /conversations/{id}/messages endpoint that validates access, stores the user message with an idempotency key, calls the model gateway, streams chunks, and finalizes the assistant message.
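
The "streams chunks, and finalizes the assistant message" step could be framed as SSE events along these lines. The event names `delta` and `done`, the payload fields, and the function names are assumptions for this sketch; the `event:` and `data:` line framing with a blank-line terminator is the standard SSE wire format.

```python
import json
from typing import Dict, Iterable, Iterator, List


def sse_event(event: str, payload: Dict[str, str]) -> str:
    # JSON-encoding the payload escapes newlines, so each event fits on one data line.
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def stream_assistant_message(message_id: str, chunks: Iterable[str]) -> Iterator[str]:
    parts: List[str] = []
    for chunk in chunks:
        parts.append(chunk)
        yield sse_event("delta", {"message_id": message_id, "text": chunk})
    # The final event carries the assembled text so the client can reconcile its local state.
    yield sse_event("done", {"message_id": message_id, "text": "".join(parts)})
```

In a web framework, a generator like this would be returned as the response body with the content type `text/event-stream`; the persisted assistant message would be committed at the point where `done` is emitted.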

For the schema, avoid a flat transcript-only design. Use conversations, messages, message_versions, and snapshots, where a snapshot stores stable references to message versions plus metadata like created_by, visibility, and expires_at. The explicit tradeoff to flag is immutable snapshots versus live shared conversations: immutable snapshots are safer, auditable, and easier to cache, while live shares are more interactive but require complex permission propagation and change visibility rules.
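
To make the snapshot idea concrete, here is a minimal in-memory sketch in which a snapshot records (message ID, version) pairs at creation time. `ConversationStore`, `write`, `create_snapshot`, and `render` are hypothetical names; only `created_by` and `visibility` are taken from the metadata listed above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MessageVersion:
    message_id: str
    version: int
    content: str


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: str
    created_by: str
    visibility: str
    refs: Tuple[Tuple[str, int], ...]  # stable (message_id, version) references


class ConversationStore:
    def __init__(self) -> None:
        self.versions: Dict[Tuple[str, int], MessageVersion] = {}
        self.latest: Dict[str, int] = {}  # message_id -> latest version, in creation order

    def write(self, message_id: str, content: str) -> None:
        # Edits add a new version; earlier versions are never overwritten.
        version = self.latest.get(message_id, 0) + 1
        self.versions[(message_id, version)] = MessageVersion(message_id, version, content)
        self.latest[message_id] = version

    def create_snapshot(self, snapshot_id: str, created_by: str,
                        visibility: str = "workspace") -> Snapshot:
        return Snapshot(snapshot_id, created_by, visibility, tuple(self.latest.items()))

    def render(self, snapshot: Snapshot) -> List[str]:
        return [self.versions[ref].content for ref in snapshot.refs]
```

Because the snapshot holds version references rather than a pointer to the conversation, later edits and new messages do not change what a recipient sees.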

Close by mentioning operational concerns: rate limits, audit logs for share creation/access, encrypted storage for sensitive content, and background indexing for search. If there were more time, you would add branch navigation, retention policies, and admin controls for enterprise tenants.

A second angle

For “Design an AI chatbot with browser storage”, the architecture shifts because the browser, not the server database, owns most conversation state. You should frame it as a privacy-preserving or lightweight deployment where messages live in IndexedDB or localStorage, with IndexedDB preferred because it handles larger structured data and offers asynchronous access, whereas localStorage is a small, synchronous, string-only store. The streaming piece still applies, but the backend may be only a stateless relay that protects provider credentials, enforces rate limits, and forwards chunks to the client.

The main design tension is privacy versus durability and cross-device sync. Browser-only storage minimizes server-side data retention, but users lose history when clearing site data and cannot seamlessly continue across devices. A strong answer makes that tradeoff explicit rather than pretending browser storage is equivalent to a real backend.

Common pitfalls

Pitfall: Designing chat as a normal request/response CRUD app.

A tempting answer is “send prompt, wait for response, save it,” but that misses the defining UX and systems challenge: streaming partial output, cancellation, retries, and final reconciliation. A better answer models the assistant response as an in-progress resource with lifecycle states and a transport designed for incremental delivery.

Pitfall: Treating sharing as just another public=true column on conversations.

That design can leak future private messages if a user keeps chatting in the same conversation after sharing. Safer designs use immutable snapshots, scoped permissions, expiration, and access checks on every read path. If live sharing is required, say so explicitly and design a permission model for ongoing updates.
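
An access check that runs on every read of a shared snapshot might look like the sketch below. The `Share` and `Viewer` shapes, the two visibility values, and the function name `can_read` are assumptions for illustration; timestamps are plain epoch seconds passed in by the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Share:
    snapshot_id: str
    tenant_id: str
    visibility: str              # assumed values: "public" or "workspace"
    expires_at: Optional[float]  # epoch seconds, or None for no expiry


@dataclass(frozen=True)
class Viewer:
    user_id: Optional[str]       # None for an anonymous visitor
    tenant_id: Optional[str]


def can_read(share: Share, viewer: Viewer, now: float) -> bool:
    # Expiry is checked first so that even public links stop working on time.
    if share.expires_at is not None and now >= share.expires_at:
        return False
    if share.visibility == "public":
        return True
    if share.visibility == "workspace":
        return viewer.tenant_id is not None and viewer.tenant_id == share.tenant_id
    return False  # unknown visibility values are denied by default
```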

Pitfall: Going too deep into model internals instead of product architecture.

For a SWE interview, avoid spending five minutes on transformer attention, fine-tuning, or sampling theory. It is enough to mention configurable parameters such as model, temperature, and max tokens; the real signal is how you handle API boundaries, state, security, latency, persistence, and failure modes.

Connections

Interviewers often pivot from this topic into real-time systems, API design, authorization/multi-tenancy, frontend state management, or search over user-generated content. They may also ask about cost controls, observability, and p95/p99 latency debugging for streaming endpoints. Be ready to explain how you would instrument each stage: client send time, backend queue time, first-token latency, tokens per second, completion status, and model-provider errors.
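
The per-stage instrumentation can be reduced to a few timestamps per request. The sketch below is one possible shape; the field names and `summarize` are hypothetical, and it assumes all timestamps are on a comparable clock, which is not true in practice for the client-side value unless clock skew is corrected.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class StreamTimings:
    client_sent_at: float
    backend_received_at: float
    model_called_at: float
    first_token_at: float
    last_token_at: float
    output_tokens: int


def summarize(t: StreamTimings) -> Dict[str, Optional[float]]:
    generation_seconds = t.last_token_at - t.first_token_at
    return {
        # Time spent in the backend before the provider call started.
        "queue_seconds": t.model_called_at - t.backend_received_at,
        # What the user perceives as "time until something appears".
        "first_token_latency_seconds": t.first_token_at - t.client_sent_at,
        # Throughput once streaming has begun; undefined for a single-instant stream.
        "tokens_per_second": (t.output_tokens / generation_seconds
                              if generation_seconds > 0 else None),
    }
```

Tracking first-token latency separately from tokens per second matters for p95/p99 debugging because the two regress for different reasons: the former is dominated by queueing and provider start-up, the latter by generation speed and transport.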
