Design RAG Retrieval for Data Assets
Company: Databricks
Role: Software Engineer
Category: ML System Design
Difficulty: medium
Interview Round: Technical Screen
Design the **retrieval component** for an internal AI coding/analytics assistant. When a user asks a data-related question, the assistant must identify the most relevant **data assets** — primarily database tables and analysis notebooks, but also dashboards, metric definitions, and documentation — so it can answer accurately and ground its response in real evidence.
The core of the problem is a **retrieval + reranking** pipeline: what to index, how to represent and search over heterogeneous assets, how to handle noisy/stale/duplicated/low-quality notebooks and tables, when to answer versus ask a clarifying question, and how to measure success.
Example user questions the system must serve:
- "Which table contains daily active users by country?"
- "How do I calculate net revenue retention?"
- "Is there an existing notebook for enterprise churn analysis?"
- "What is the source of truth for subscription events?"
### Constraints & Assumptions
- **Scale:** assume an enterprise lakehouse with $10^5$–$10^6$ tables and a similar number of notebooks, plus dashboards, queries, and catalog docs. Assets change continuously (new tables daily, notebooks edited constantly).
- **Heterogeneity:** assets range from canonical, well-documented production tables to one-off personal scratch notebooks. Metadata quality is uneven and often missing.
- **Governance is mandatory:** the system must respect access control (table/row/column-level), data classification (PII, financial, restricted), and never surface assets a user cannot see.
- **Latency:** interactive use; aim for end-to-end retrieval in the low hundreds of milliseconds for candidate generation, with a heavier reranking budget reserved for a small top-$k$.
- **Output is evidence, not just prose:** the assistant should return ranked assets with justifications, not only a generated paragraph.
- You may assume access to a data catalog, query/usage logs, lineage graph, and data-quality test results as input signals.
### The Problem
Design the retrieval system end-to-end, covering at minimum:
1. **What assets and metadata to index**, and how to represent them.
2. **The retrieval + reranking pipeline** (query understanding → candidate generation → filtering → reranking).
3. **Handling noisy, stale, duplicated, and low-quality** notebooks and tables.
4. **The answer-vs-clarify decision** — when does the system have enough evidence to answer, and when should it ask a scoping question?
5. **Offline and online evaluation metrics**, including governance guardrails.
```hint Where to start
Frame this as **hybrid retrieval over enterprise metadata**, not plain document RAG. The hard part isn't embedding text — it's that the same query must hit *exact* identifiers (table/column names) **and** *semantic* intent (business concepts), under hard permission constraints, over assets of wildly varying quality.
```
```hint Representation
A single embedding per asset quietly loses something. Ask what *distinct kinds of matching* an enterprise data query needs — an exact identifier lookup behaves differently from a fuzzy business-concept search, and both differ from "filter to only assets I'm allowed to see that are fresh enough." Which of those does a vector index actually do well, and what would you reach for when it doesn't? Also: is one representation per asset the right granularity, or does a wide table / long notebook want to be addressable at finer pieces?
```
```hint Ranking signals
Topical relevance is only part of what makes an asset the *right* one to trust. Beyond "does this match the query," what would make you prefer one matching table over another — and which of those preferences are you willing to encode as a hard yes/no gate versus a tunable nudge? Be deliberate about quality and staleness here: is a rarely-touched asset always wrong to surface, or can it still be the only thing holding the correct logic?
```
```hint Answer vs. clarify
The single top score is a weak confidence signal on its own. What else about your candidate list — beyond the absolute top score — could tell you whether the system has actually pinned down one trustworthy answer versus stumbled into ambiguity? And separately, which *query* characteristics should push you toward asking a scoping question rather than guessing?
```
```hint Evaluation pitfall
If you only score the final generated answer, a wrong answer leaves you unable to say whether retrieval or generation failed — so think about how to measure each layer on its own. Then ask where your relevance labels even come from when you have no judgments yet, and what failure modes are so unacceptable (think about the governance constraints) that they deserve their own metrics rather than being averaged into a quality number.
```
### Clarifying Questions to Ask
- What asset types are in scope at launch — tables and notebooks only, or also dashboards, metric definitions, and docs? Is one type prioritized?
- Is the output a ranked list of assets, a generated natural-language answer, generated SQL, or all three?
- How rich and trustworthy is existing metadata (descriptions, ownership, tags), and can we rely on query/usage logs and a lineage graph?
- What are the access-control granularities we must enforce (table / row / column), and how current is the permission data at query time?
- What are the latency and cost budgets per query, and how many concurrent users?
- Is there existing labeled relevance data (click logs, catalog search logs), or do we need to bootstrap an evaluation set?
### What a Strong Answer Covers
*This rubric names the dimensions a strong answer is judged on, not the answers themselves — the design choices are yours to make and defend.*
- **Scopes what to index beyond raw text:** identifies that trust, usage, quality, freshness, and lineage signals — not just descriptions — drive ranking, and tailors the inventory to each asset type.
- **Chooses and justifies appropriate representations:** recognizes that no single representation serves exact-identifier lookups, semantic intent, and hard filtering equally well, and argues for whatever mix they pick rather than defaulting to one embedding per asset.
- **Reasons about granularity and ingestion-time normalization:** addresses how heterogeneous, noisy assets are cleaned and broken into retrievable units, rather than indexing raw token windows.
- **Defines a staged retrieval pipeline with tradeoffs:** separates the recall-oriented and precision-oriented phases, places governance/validity constraints deliberately, and is explicit about the latency/cost/precision tradeoffs at each stage.
- **Handles noise, staleness, and duplication principledly:** distinguishes signals worth a hard gate from those worth a soft penalty, and has a defensible policy for collapsing near-duplicates to a trustworthy representative.
- **Treats permissions/governance as a hard constraint** applied at retrieval *and* generation time, not bolted on.
- **Gives a calibrated answer-vs-clarify policy** grounded in measurable confidence signals rather than a single score.
- **Separates evaluation by layer with appropriate metrics:** measures retrieval, generation, and governance independently and selects metrics suited to each (ranking quality, answer grounding, and safety guardrails) instead of one blended score.
- **Surfaces the real tradeoffs** (e.g. recall vs. noise, summarization error, rerank cost, popularity vs. discoverability of new assets).
### Follow-up Questions
- How would you **bootstrap evaluation and ranking** on day one with no click logs — cold start for both the labeled set and any learning-to-rank model?
- How do you keep the index **fresh and consistent** when tables and notebooks change continuously and permissions can be revoked at any moment? What's your invalidation/re-embedding strategy?
- A new, high-quality canonical table has zero usage history, so popularity signals bury it under a legacy table everyone still queries. How do you surface it without breaking trust in popularity as a signal?
- How would you extend this from "find the right asset" to **agentic multi-hop retrieval** (e.g., follow lineage to find the upstream source of truth, or assemble the tables needed to write a join)?
Quick Answer: This question evaluates skills in designing hybrid retrieval pipelines over enterprise metadata, including competencies in information retrieval, exact-match and semantic representations, access control and data governance, relevance reranking, and scalable system architecture.