LLM Chat Applications, RAG, And ML Evaluation
Asked of: Software Engineer
Last updated

What's being tested
You’re being tested on whether you can design a production-grade LLM chat application: streaming UX, backend orchestration, secure API access, conversation state, retrieval, ranking, and evaluation loops. OpenAI cares because the hardest parts are rarely “call the model API”; they are latency, reliability, privacy, abuse prevention, state consistency, and making model outputs observable enough to improve. The interviewer is probing whether you can separate concerns between browser, backend, model provider, retrieval layer, and evaluation pipeline while making explicit tradeoffs. Strong answers sound like software architecture with ML-aware interfaces, not like a research discussion about model internals.
Core knowledge
-
Streaming response delivery is central to chat UX. Common choices are server-sent events (
SSE), WebSockets, or HTTP chunked transfer.SSEis usually simpler for one-way token streaming;WebSocketsfit bidirectional collaboration, cancellation, or multi-agent updates. -
Frontend conversation state should distinguish ephemeral UI state from durable history. Browser-only designs may use
IndexedDBorlocalStorage, but sensitive chats, cross-device sync, and enterprise retention usually require server-side storage with encryption, deletion, and access controls. -
Backend relay services protect credentials and policy. The browser should not hold raw provider API keys; a backend can authenticate users, enforce quotas, add system prompts, redact secrets, call the model API, and stream tokens back to the client.
-
Rate limiting should be layered: per-user, per-IP, per-org, per-model, and sometimes per-token. Algorithms include token bucket and leaky bucket; token-based limits are often better than request counts because one request may consume 100 tokens or 100k tokens.
-
Conversation persistence needs an append-only model. Store
conversation_id,message_id,role,content,created_at,parent_message_id, model metadata, and status. This supports retries, regeneration, branching conversations, audit trails, and partial responses after stream interruption. -
Retrieval-augmented generation (
RAG) adds a retrieval path before generation: ingest documents, chunk them, embed chunks, search a vector index, optionally rerank, then pass top passages into the prompt. Typical vector stores use approximate nearest neighbor search such asHNSW; exact search becomes expensive beyond millions of chunks. -
Chunking strategy is a software tradeoff, not just an ML detail. Small chunks improve precise retrieval but lose context; large chunks preserve context but waste prompt budget. A common starting point is 300–800 tokens per chunk with overlap, plus document title and ACL metadata.
-
Enterprise access control must happen during retrieval, not after generation. Filter candidate chunks by user permissions, tenant, document classification, and freshness before they enter the prompt. “Retrieve everything, then ask the model not to reveal secrets” is an unacceptable security boundary.
-
Reranking and response ranking are service-level components. A first-stage retriever returns maybe 50–200 candidates; a reranker or verifier reduces to 5–20 high-confidence passages. For response ranking, generate multiple candidates, score them using heuristics, preference models, or LLM judges, then return the best with traceable metadata.
-
Evaluation harnesses should be designed as repeatable software systems. Maintain golden prompts, expected citations, policy checks, latency budgets, and regression tests. Useful metrics include retrieval
Recall@k, answer faithfulness, citation precision, refusal correctness,p50/p95/p99latency, error rate, and cost per successful answer. -
Failure handling matters because model calls are slow and expensive. Support cancellation, timeout budgets, exponential backoff, idempotency keys for retries, partial transcript recovery, model fallback, and graceful degradation such as “retrieval unavailable; answer from conversation only” when appropriate.
-
Observability needs request-level tracing across UI, backend, retrieval, ranking, and model calls. Log prompt template version, retrieved document IDs, token counts, latency by stage, finish reason, safety outcomes, and user feedback, while redacting sensitive content and respecting retention rules.
Worked example
For Design ChatGPT homepage with streaming choices, start by clarifying whether the page is authenticated, whether conversations persist across devices, which clients are supported, and what streaming semantics are expected: token-by-token, sentence-by-sentence, or final-only fallback. Then state assumptions: a web SPA, authenticated users, server-side conversation history, and a backend relay that calls an LLM provider rather than exposing secrets to the browser. Organize the answer around four pillars: frontend state and rendering, backend streaming API, persistence and retry semantics, and safety/limits/observability.
On the frontend, describe a message composer, optimistic user-message insertion, a streaming assistant placeholder, cancellation, and reconnection behavior. On the backend, propose POST /conversations/{id}/messages returning an SSE stream, with the server persisting the user message, invoking the model, streaming deltas, and committing the final assistant message when complete. For persistence, use a relational store such as Postgres for metadata and messages, with object storage if attachments or long transcripts are needed. The explicit tradeoff to flag is SSE versus WebSockets: SSE is simpler and robust for one-way model output, while WebSockets are more flexible but add connection management complexity. Close by saying that, with more time, you would cover abuse detection, prompt-injection handling for tool calls, multi-region failover, and an eval dashboard tracking latency, cost, and bad-output reports.
A second angle
For Design an enterprise RAG assistant for internal docs, the same core architecture shifts from chat transport to retrieval correctness and authorization. The browser and streaming path still matter, but the critical path becomes document ingestion, ACL-aware retrieval, reranking, prompt construction, citation display, and audit logging. A strong answer should explicitly say that permissions are enforced before retrieved chunks are placed into context, and that each answer should cite source documents with stable IDs. The main tradeoff is freshness versus retrieval performance: near-real-time indexing helps users trust the system, but batch indexing is simpler and cheaper. Evaluation also changes: instead of only tracking chat latency, you measure whether the assistant retrieved the right internal document, cited it correctly, and avoided hallucinating unsupported policy.
Common pitfalls
Pitfall: Treating the model API call as the whole system.
A weak answer says “the frontend sends the prompt to the LLM and displays the response.” A stronger answer adds a backend relay, authentication, streaming, persistence, rate limits, cancellation, retry behavior, logging, and a plan for partial failures.
Pitfall: Hand-waving RAG security.
A tempting but wrong design retrieves documents globally and asks the generator to obey access rules. The better design filters by tenant and document ACL before ranking, logs which chunks were used, and treats the prompt as an untrusted boundary rather than a security mechanism.
Pitfall: Over-indexing on ML details instead of SWE responsibilities.
Do not spend most of the interview comparing transformer architectures or training losses. Mention retrievers, rerankers, and evaluators as components with APIs, latency, cost, and observability requirements; then focus on how they fit into a reliable user-facing system.
Connections
Interviewers may pivot into distributed rate limiting, API design for streaming, vector search infrastructure, browser storage security, ranking service design, or online evaluation and A/B rollout mechanics. Be ready to discuss how latency budgets, state consistency, and access control change when the system moves from a toy chatbot to enterprise or high-traffic production use.
Further reading
-
OpenAI Cookbook — practical examples for streaming, retrieval, evals, and production integration patterns.
-
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — the original RAG paper; useful for understanding the retriever-generator split.
-
HNSW: Efficient and Robust Approximate Nearest Neighbor Search — background on the graph-based ANN approach used by many vector search systems.
Featured in interview prep guides
Practice questions
- Design a Text-to-Video Generation SystemOpenAI · Software Engineer · Onsite · hard
- Build a Reliable Streaming Chat UIOpenAI · Software Engineer · HR Screen · hard
- Design an enterprise RAG assistant for internal docsOpenAI · Software Engineer · Technical Screen · hard
- Design a GPT chat UI with snapshots and sharingOpenAI · Software Engineer · Technical Screen · hard
- Design an AI chatbot with browser storageOpenAI · Software Engineer · Technical Screen · medium
- Implement and Debug Backprop in NumPyOpenAI · Software Engineer · Technical Screen · medium
- Design a minimal ChatGPT with presetsOpenAI · Software Engineer · Technical Screen · hard
- Design AI chat bot systemOpenAI · Software Engineer · Technical Screen · medium
- Debug a Machine Learning PipelineOpenAI · Software Engineer · Technical Screen · medium
- Design an End-to-End ML SystemOpenAI · Software Engineer · Technical Screen · hard
- Debug a failing ML classifierOpenAI · Software Engineer · Technical Screen · hard
- Design a response-ranking ML systemOpenAI · Software Engineer · Technical Screen · hard