LLM Chat Applications, RAG, And ML Evaluation

What's being tested

You’re being tested on whether you can design a production-grade LLM chat application: streaming UX, backend orchestration, secure API access, conversation state, retrieval, ranking, and evaluation loops. OpenAI cares because the hardest parts are rarely “call the model API”; they are latency, reliability, privacy, abuse prevention, state consistency, and making model outputs observable enough to improve. The interviewer is probing whether you can separate concerns between browser, backend, model provider, retrieval layer, and evaluation pipeline while making explicit tradeoffs. Strong answers sound like software architecture with ML-aware interfaces, not like a research discussion about model internals.

Core knowledge

Streaming response delivery is central to chat UX. Common choices are server-sent events (SSE), WebSockets, or HTTP chunked transfer. SSE is usually simpler for one-way token streaming; WebSockets fit bidirectional collaboration, cancellation, or multi-agent updates.
Frontend conversation state should distinguish ephemeral UI state from durable history. Browser-only designs may use IndexedDB or localStorage, but sensitive chats, cross-device sync, and enterprise retention usually require server-side storage with encryption, deletion, and access controls.
Backend relay services protect credentials and policy. The browser should not hold raw provider API keys; a backend can authenticate users, enforce quotas, add system prompts, redact secrets, call the model API, and stream tokens back to the client.
Rate limiting should be layered: per-user, per-IP, per-org, per-model, and sometimes per-token. Algorithms include token bucket and leaky bucket; token-based limits are often better than request counts because one request may consume 100 tokens or 100k tokens.
Conversation persistence needs an append-only model. Store conversation_id, message_id, role, content, created_at, parent_message_id, model metadata, and status. This supports retries, regeneration, branching conversations, audit trails, and partial responses after stream interruption.
Retrieval-augmented generation (RAG) adds a retrieval path before generation: ingest documents, chunk them, embed chunks, search a vector index, optionally rerank, then pass top passages into the prompt. Typical vector stores use approximate nearest neighbor search such as HNSW; exact search becomes expensive beyond millions of chunks.
Chunking strategy is a software tradeoff, not just an ML detail. Small chunks improve precise retrieval but lose context; large chunks preserve context but waste prompt budget. A common starting point is 300–800 tokens per chunk with overlap, plus document title and ACL metadata.
Enterprise access control must happen during retrieval, not after generation. Filter candidate chunks by user permissions, tenant, document classification, and freshness before they enter the prompt. “Retrieve everything, then ask the model not to reveal secrets” is an unacceptable security boundary.
Reranking and response ranking are service-level components. A first-stage retriever returns maybe 50–200 candidates; a reranker or verifier reduces to 5–20 high-confidence passages. For response ranking, generate multiple candidates, score them using heuristics, preference models, or LLM judges, then return the best with traceable metadata.
Evaluation harnesses should be designed as repeatable software systems. Maintain golden prompts, expected citations, policy checks, latency budgets, and regression tests. Useful metrics include retrieval Recall@k, answer faithfulness, citation precision, refusal correctness, p50/p95/p99 latency, error rate, and cost per successful answer.
Failure handling matters because model calls are slow and expensive. Support cancellation, timeout budgets, exponential backoff, idempotency keys for retries, partial transcript recovery, model fallback, and graceful degradation such as “retrieval unavailable; answer from conversation only” when appropriate.
Observability needs request-level tracing across UI, backend, retrieval, ranking, and model calls. Log prompt template version, retrieved document IDs, token counts, latency by stage, finish reason, safety outcomes, and user feedback, while redacting sensitive content and respecting retention rules.

Worked example

For Design ChatGPT homepage with streaming choices, start by clarifying whether the page is authenticated, whether conversations persist across devices, which clients are supported, and what streaming semantics are expected: token-by-token, sentence-by-sentence, or final-only fallback. Then state assumptions: a web SPA, authenticated users, server-side conversation history, and a backend relay that calls an LLM provider rather than exposing secrets to the browser. Organize the answer around four pillars: frontend state and rendering, backend streaming API, persistence and retry semantics, and safety/limits/observability.

On the frontend, describe a message composer, optimistic user-message insertion, a streaming assistant placeholder, cancellation, and reconnection behavior. On the backend, propose POST /conversations/{id}/messages returning an SSE stream, with the server persisting the user message, invoking the model, streaming deltas, and committing the final assistant message when complete. For persistence, use a relational store such as Postgres for metadata and messages, with object storage if attachments or long transcripts are needed. The explicit tradeoff to flag is SSE versus WebSockets: SSE is simpler and robust for one-way model output, while WebSockets are more flexible but add connection management complexity. Close by saying that, with more time, you would cover abuse detection, prompt-injection handling for tool calls, multi-region failover, and an eval dashboard tracking latency, cost, and bad-output reports.

A second angle

For Design an enterprise RAG assistant for internal docs, the same core architecture shifts from chat transport to retrieval correctness and authorization. The browser and streaming path still matter, but the critical path becomes document ingestion, ACL-aware retrieval, reranking, prompt construction, citation display, and audit logging. A strong answer should explicitly say that permissions are enforced before retrieved chunks are placed into context, and that each answer should cite source documents with stable IDs. The main tradeoff is freshness versus retrieval performance: near-real-time indexing helps users trust the system, but batch indexing is simpler and cheaper. Evaluation also changes: instead of only tracking chat latency, you measure whether the assistant retrieved the right internal document, cited it correctly, and avoided hallucinating unsupported policy.

Common pitfalls

Pitfall: Treating the model API call as the whole system.

A weak answer says “the frontend sends the prompt to the LLM and displays the response.” A stronger answer adds a backend relay, authentication, streaming, persistence, rate limits, cancellation, retry behavior, logging, and a plan for partial failures.

Pitfall: Hand-waving RAG security.

A tempting but wrong design retrieves documents globally and asks the generator to obey access rules. The better design filters by tenant and document ACL before ranking, logs which chunks were used, and treats the prompt as an untrusted boundary rather than a security mechanism.

Pitfall: Over-indexing on ML details instead of SWE responsibilities.

Do not spend most of the interview comparing transformer architectures or training losses. Mention retrievers, rerankers, and evaluators as components with APIs, latency, cost, and observability requirements; then focus on how they fit into a reliable user-facing system.

Connections

Interviewers may pivot into distributed rate limiting, API design for streaming, vector search infrastructure, browser storage security, ranking service design, or online evaluation and A/B rollout mechanics. Be ready to discuss how latency budgets, state consistency, and access control change when the system moves from a toy chatbot to enterprise or high-traffic production use.

What's being tested

Core knowledge

Worked example

A second angle

Common pitfalls

Connections

Further reading

Featured in interview prep guides

Practice questions

Related concepts