PracHub

Design RAG Evaluation and Debugging

Last updated: Jun 19, 2026

Quick Overview

This question evaluates a candidate's ability to design evaluation and debugging frameworks for production RAG systems, testing practical ML engineering skills in retrieval-augmented generation pipelines. It assesses system design competency around offline and online quality metrics, regression detection, and failure localization in high-stakes domains requiring strict data permissions and faithfulness guarantees.

Company: Plaid

Role: Machine Learning Engineer

Category: ML System Design

Difficulty: medium

Interview Round: Onsite

Hints

Hints for Part 1 (Offline Evaluation)

  • Separate the stages: Evaluate retrieval and generation independently as well as end-to-end. A great retriever can still produce a bad answer (and vice versa); if you only measure the final answer, you can't attribute a regression to a stage.
  • Retrieval metrics to name: Recall the standard ranked-IR metrics and think about which require graded relevance labels vs. binary. Don't forget a fintech-specific correctness requirement that should be a hard gate rather than a soft target.
  • Answer quality without ground-truth strings: Exact-match rarely works for free-form answers. Think about what properties a faithful answer must have in a compliance-sensitive context, and how you can measure those properties at scale without a large human-labeling team.

Hints for Part 2 (Online Evaluation)

  • Offline ≠ online: Offline benchmarks are stale and narrow; only live traffic tells you whether users actually got value. Map each layer: product (CTR, answer-acceptance, reformulation rate, time-to-answer), business (support-ticket deflection, task completion), guardrails.
  • Experiment design pitfalls: Bucket by user/tenant (not per-request) for stable assignment; segment results by query type / tenant size / language; run long enough for low-frequency query classes and seasonality. Always read guardrail metrics alongside the win metric; a CTR lift that doubles hallucination rate is not a win.

Hints for Part 3 (Evaluation Datasets)

  • Mine production, don't invent: The best eval data is your own logs: queries with negative feedback, reformulated queries, support tickets, incident-derived queries. Synthetic-from-documents is fine for coverage but must be human-audited.
  • Why hard negatives matter here: In fintech, the dangerous failure is retrieving a doc that looks right but isn't: same entity wrong tenant, superseded policy, similar wording wrong account type. Build negatives that probe exactly these confusions; they're what separate a strong embedding/reranker from a weak one.
  • Fresh vs. stable: Keep two sets: a frozen regression suite for apples-to-apples trend lines, plus a rolling fresh set sampled from recent traffic/incidents. Don't silently mutate the frozen set or your historical comparisons become meaningless.

Hints for Part 4 (Automation, Gates, Dashboards & Alerts)

  • Gate on what's cheap and what's catastrophic: Not every change needs the full eval. Run unit/filter tests + retrieval eval on every change to prompts/chunking/embeddings/index/reranker; reserve the heavier answer-quality eval for a representative sample. Make permission-filter violations a hard zero-tolerance gate, and express quality gates as relative regression vs. the current production baseline, not absolute floors.
  • Dashboards and alerts should be stage- and segment-aware: Break metrics down by pipeline stage (so you can attribute) and by query category/tenant (so you catch localized regressions a global average hides). Alert on leading user signals (no-result spike, thumbs-down spike) and operational ones (index lag, vector-DB latency, embedding error rate).

Hints for Part 5 (Failure Analysis: Sudden Drop in Semantic Search Quality)

  • Quantify the symptom first: Before touching anything, pin down which metric dropped, when it started relative to deploys or infrastructure events, and who is affected: global or a specific slice (tenant, language, query category). A precise symptom statement already constrains the search space considerably.
  • Bisect the pipeline: Search regressions are rarely caused by the embedding model itself. Think about what categories of pipeline component could silently degrade relevance without any obvious error signal, and design experiments that isolate one stage at a time.
  • Proving the fix: A green global metric is not sufficient evidence. Think about what a skeptical PM or regulator would need to see before you call the incident closed; breadth of evidence matters as much as the fix itself.

You own a production RAG-powered semantic search feature for a fintech product. Users enter natural-language questions; the system retrieves relevant internal documents or records, and an LLM generates an answer with inline citations.

Design a lightweight but production-ready evaluation and debugging framework for this system. The interviewer cares less about a sprawling architecture diagram and more about whether you can measure quality at every stage, catch regressions before they ship, and localize a live failure with evidence rather than guesswork.

The problem has five parts (below). Treat each as its own discussion, but keep them coherent — the offline metrics you define should be the same ones your CI gates and dashboards track, and the debugging walkthrough should reference the instrumentation you proposed.

Constraints & Assumptions

  • Domain: fintech (banking/payments/compliance). Wrong or unsupported answers carry regulatory and trust risk; faithfulness and permission-correctness are non-negotiable.
  • Scale: assume on the order of $10^5$–$10^6$ queries/day, a corpus of hundreds of thousands to a few million documents/records, with frequent document churn (policies and account data change daily).
  • Pipeline (assume this is roughly what exists): query normalization → optional query rewrite → query embedding → vector (ANN) retrieval, optionally hybrid with lexical/BM25 → metadata/permission filtering → reranking → context construction/truncation → LLM generation → citation extraction → logging/feedback.
  • Multi-tenant: documents belong to tenants/users; retrieval must respect permission and tenant boundaries.
  • "Lightweight" means: prefer a small, maintainable set of metrics and automated checks over an exhaustive eval platform; reuse production logs rather than building a large labeling org.
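The multi-tenant constraint above lends itself to an automated check over logged or replayed retrieval results. The following is a minimal sketch, not a reference implementation: the `Hit` record, its `tenant_id` and `allowed_users` fields, and the rule that an empty allow-list means "visible to the whole tenant" are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Hit:
    """One retrieved document. All field names are hypothetical."""
    doc_id: str
    tenant_id: str
    allowed_users: frozenset = field(default_factory=frozenset)


def permission_violations(user_id: str, tenant_id: str, hits: list[Hit]) -> list[str]:
    """Return the doc_ids that this caller should never have been shown."""
    leaked = []
    for hit in hits:
        wrong_tenant = hit.tenant_id != tenant_id
        # Assumption: an empty allow-list means "visible to everyone in the tenant".
        not_allowed = bool(hit.allowed_users) and user_id not in hit.allowed_users
        if wrong_tenant or not_allowed:
            leaked.append(hit.doc_id)
    return leaked
```

Run against every benchmark query and a sample of production traces, a non-empty result would be treated as a hard failure rather than a metric to trend.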

Part 1 — Offline Evaluation

Define the metrics you would compute offline (on a fixed benchmark, before any live traffic is served) for: retrieval quality, answer quality, faithfulness, latency, and cost. For each, say what it measures and roughly what target/threshold you'd hold it to. Make clear which metrics evaluate a single stage (e.g. retrieval) versus the end-to-end answer.
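For the retrieval-quality part, the usual candidates are recall@k, reciprocal rank (averaged into MRR), and NDCG@k. The sketch below shows one way to compute them per query; the function names are illustrative, recall and reciprocal rank assume binary relevance labels, and NDCG here uses linear gain over graded labels (an exponential-gain variant is also common).

```python
from __future__ import annotations

import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], gains: dict[str, float], k: int) -> float:
    """NDCG@k with linear gain; `gains` maps doc_id to a graded relevance label."""
    def dcg(values: list[float]) -> float:
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))

    actual = dcg([gains.get(doc_id, 0.0) for doc_id in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0
```

These are single-stage metrics: they score the retriever or reranker output only and say nothing about the generated answer, which needs its own end-to-end measures.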

Part 2 — Online Evaluation

Describe the online signals: product metrics, business metrics, user feedback, A/B testing methodology, and guardrail metrics. Explain why offline metrics alone are insufficient and how you'd run a controlled experiment for a retriever/prompt change.
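A controlled experiment needs a stable assignment unit and a significance check. The sketch below assumes assignment by tenant via a salted hash and a pooled two-proportion z-statistic on a rate metric such as answer acceptance; `assign_arm`, `two_proportion_z`, and the experiment-name salt are hypothetical names, not part of any existing system.

```python
import hashlib
import math


def assign_arm(tenant_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a tenant to an arm, so assignment is stable across requests."""
    digest = hashlib.sha256(f"{experiment}:{tenant_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """Pooled z-statistic for the difference in rates between arm B and arm A."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se if se > 0 else 0.0
```

One caveat on the design: when the assignment unit is the tenant, requests inside a tenant are correlated, so the counts fed to the test should be aggregated per tenant (or the variance estimated with a cluster-aware method) rather than treating every request as independent.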

Part 3 — Evaluation Datasets

Explain where the benchmark data comes from, how labels are created, how you generate and use hard negatives, and how you keep the benchmark fresh without destroying trend comparability.
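One cheap source of hard negatives is document metadata: documents about the same entity that belong to another tenant, or that are superseded versions of the correct one. The following sketch assumes a hypothetical `Doc` record with `tenant_id`, `entity`, and `superseded` fields; real corpora would need their own notion of "same entity".

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Minimal document metadata. All field names are hypothetical."""
    doc_id: str
    tenant_id: str
    entity: str  # e.g. a policy family or account type
    superseded: bool = False


def hard_negatives(gold: Doc, corpus: list[Doc]) -> dict[str, list[str]]:
    """Group look-alike documents by the kind of confusion they probe."""
    out: dict[str, list[str]] = {"wrong_tenant": [], "superseded": []}
    for doc in corpus:
        if doc.doc_id == gold.doc_id or doc.entity != gold.entity:
            continue  # only same-entity documents are plausible look-alikes
        if doc.tenant_id != gold.tenant_id:
            out["wrong_tenant"].append(doc.doc_id)
        elif doc.superseded:
            out["superseded"].append(doc.doc_id)
    return out
```

Keeping the negatives grouped by confusion type lets the benchmark report a separate error rate per type instead of one blended number.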

Part 4 — Automation, Gates, Dashboards & Alerts

Describe how evaluation runs in CI/CD, what your model-release gates are, what dashboards you'd build, what you'd alert on, and how you'd do regression analysis when a metric moves.
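A release gate can be reduced to a small function that CI calls after the eval run. This is a sketch under stated assumptions: metrics are higher-is-better, thresholds are expressed as a maximum relative drop against the current production baseline, and any permission violation fails the gate outright. The metric names and threshold values in the usage example are invented for illustration.

```python
from __future__ import annotations


def release_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_relative_drop: dict[str, float],
    permission_violations: int,
) -> list[str]:
    """Return human-readable gate failures; an empty list means the release may proceed."""
    failures = []
    if permission_violations > 0:
        failures.append(f"permission_violations={permission_violations} (must be 0)")
    for metric, allowed in max_relative_drop.items():
        base, cand = baseline[metric], candidate[metric]
        # Assumption: every gated metric is higher-is-better.
        drop = (base - cand) / base if base > 0 else 0.0
        if drop > allowed:
            failures.append(f"{metric}: relative drop {drop:.3f} exceeds {allowed:.3f}")
    return failures


if __name__ == "__main__":
    problems = release_gate(
        baseline={"recall_at_10": 0.80},
        candidate={"recall_at_10": 0.79},
        max_relative_drop={"recall_at_10": 0.02},
        permission_violations=0,
    )
    print("PASS" if not problems else "\n".join(problems))
```

Returning the list of failures rather than a boolean keeps the CI log self-explanatory when a gate blocks a release.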

Part 5 — Failure Analysis: Sudden Drop in Semantic Search Quality

Search relevance suddenly drops in production. Walk through a concrete debugging process: how you instrument and localize the failure, the likely root causes (ranked), the experiments you'd run to confirm, the fix, and crucially how you prove the fix worked. Do not simply say "I'd swap the embedding model" — show how you'd attribute the regression to a specific stage and validate the fix with evidence.
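Attributing a regression to a stage is easier when each request trace records the document ids that survived every stage. The sketch below assumes such traces exist for benchmark queries with a known correct document; the stage names and trace shape are hypothetical and simply mirror the pipeline listed in the constraints.

```python
from __future__ import annotations

from collections import Counter

# Hypothetical stage names, ordered as they run in the pipeline.
STAGES = ["ann_retrieval", "permission_filter", "rerank", "context"]


def first_loss_stage(trace: dict[str, list[str]], gold_doc: str) -> str:
    """Name the first stage whose output no longer contains the correct document."""
    for stage in STAGES:
        if gold_doc not in trace.get(stage, []):
            return stage
    return "survived"


def localize(traces: list[tuple[dict[str, list[str]], str]]) -> Counter:
    """Count, across queries, where the correct document was first lost."""
    return Counter(first_loss_stage(trace, gold) for trace, gold in traces)
```

Computing this histogram on the same benchmark before and after the onset of the drop shows whether the losses shifted toward one stage, and rerunning it after the fix gives stage-level evidence that the shift was reversed.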

Clarifying Questions to Ask

  • What is the primary product surface — a chatbot answer, a ranked list of documents, or both? (Determines whether retrieval metrics or answer metrics dominate.)
  • What does "answer with citations" require — must every sentence be cited, or just the answer overall? Is a wrong-but-cited answer worse than a refusal?
  • What's the labeling (and relabeling) budget: do we have domain experts (compliance/CS) available, or must we lean on LLM-as-judge?
  • How often does the corpus change, and is there an existing ground-truth notion of "the right document" for a query (e.g. analyst-curated answers)?
  • What are the hard latency and cost SLAs the product commits to, and is there a fallback path when the primary retriever times out?
  • Are there regulatory constraints (data residency, PII handling, audit logging) that the eval and logging system itself must satisfy?

Follow-up Questions

  • Your LLM-as-judge for faithfulness is itself a model — how do you evaluate and trust the judge, and what do you do when the judge and human labels disagree?
  • A new retriever improves NDCG@10 offline but loses the online A/B on answer-acceptance. How do you reconcile that, and which do you believe?
  • A regulator asks you to demonstrate that a specific user could not have retrieved a document they weren't entitled to. What in your eval/logging system lets you answer that, and how would you test for permission leaks proactively?
  • The corpus is 90% unchanged week-over-week but 10% churns daily. How do you keep retrieval quality stable across re-indexing, and how would you detect a silent indexing failure on the churning slice?
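For the first follow-up, one common way to quantify how far an LLM judge can be trusted is chance-corrected agreement with human labels on a shared sample. The sketch below computes Cohen's kappa from two label lists; the label values and the choice of kappa rather than raw agreement are assumptions for illustration, not something the question prescribes.

```python
from __future__ import annotations

from collections import Counter


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement between judge labels and human labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_freq, human_freq = Counter(judge), Counter(human)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Tracking this value per query category, and reviewing the disagreeing examples by hand, gives a concrete basis for deciding when the judge's scores can gate a release and when human review is still required.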
