How do I approach ML System Design interview questions?

ML System Design questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master ml system design interviews.

What difficulty level is this interview question?

This is a medium difficulty ML System Design question, commonly asked during Technical Screen rounds at Databricks.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Databricks during technical interviews.

Design RAG Retrieval for Data Assets | Databricks Interview Question

Q: Design RAG Retrieval for Data Assets

This question evaluates skills in designing hybrid retrieval pipelines over enterprise metadata, including competencies in information retrieval, exact-match and semantic representations, access control and data governance, relevance reranking, and scalable system architecture.

Design the retrieval component for an internal AI coding/analytics assistant. When a user asks a data-related question, the assistant must identify the most relevant data assets — primarily database tables and analysis notebooks, but also dashboards, metric definitions, and documentation — so it can answer accurately and ground its response in real evidence.

The core of the problem is a retrieval + reranking pipeline: what to index, how to represent and search over heterogeneous assets, how to handle noisy/stale/duplicated/low-quality notebooks and tables, when to answer versus ask a clarifying question, and how to measure success.

Example user questions the system must serve:

"Which table contains daily active users by country?"
"How do I calculate net revenue retention?"
"Is there an existing notebook for enterprise churn analysis?"
"What is the source of truth for subscription events?"

Constraints & Assumptions

Scale: assume an enterprise lakehouse with $10^5$ – $10^6$ tables and a similar number of notebooks, plus dashboards, queries, and catalog docs. Assets change continuously (new tables daily, notebooks edited constantly).
Heterogeneity: assets range from canonical, well-documented production tables to one-off personal scratch notebooks. Metadata quality is uneven and often missing.
Governance is mandatory: the system must respect access control (table/row/column-level), data classification (PII, financial, restricted), and never surface assets a user cannot see.
Latency: interactive use; aim for end-to-end retrieval in the low hundreds of milliseconds for candidate generation, with a heavier reranking budget reserved for a small top- $k$ .
Output is evidence, not just prose: the assistant should return ranked assets with justifications, not only a generated paragraph.
You may assume access to a data catalog, query/usage logs, lineage graph, and data-quality test results as input signals.

The Problem

Design the retrieval system end-to-end, covering at minimum:

What assets and metadata to index , and how to represent them.
The retrieval + reranking pipeline (query understanding → candidate generation → filtering → reranking).
Handling noisy, stale, duplicated, and low-quality notebooks and tables.
The answer-vs-clarify decision — when does the system have enough evidence to answer, and when should it ask a scoping question?
Offline and online evaluation metrics , including governance guardrails.

Clarifying Questions to Ask

What asset types are in scope at launch — tables and notebooks only, or also dashboards, metric definitions, and docs? Is one type prioritized?
Is the output a ranked list of assets, a generated natural-language answer, generated SQL, or all three?
How rich and trustworthy is existing metadata (descriptions, ownership, tags), and can we rely on query/usage logs and a lineage graph?
What are the access-control granularities we must enforce (table / row / column), and how current is the permission data at query time?
What are the latency and cost budgets per query, and how many concurrent users?
Is there existing labeled relevance data (click logs, catalog search logs), or do we need to bootstrap an evaluation set?

What a Strong Answer Covers

This rubric names the dimensions a strong answer is judged on, not the answers themselves — the design choices are yours to make and defend.

Scopes what to index beyond raw text: identifies that trust, usage, quality, freshness, and lineage signals — not just descriptions — drive ranking, and tailors the inventory to each asset type.
Chooses and justifies appropriate representations: recognizes that no single representation serves exact-identifier lookups, semantic intent, and hard filtering equally well, and argues for whatever mix they pick rather than defaulting to one embedding per asset.
Reasons about granularity and ingestion-time normalization: addresses how heterogeneous, noisy assets are cleaned and broken into retrievable units, rather than indexing raw token windows.
Defines a staged retrieval pipeline with tradeoffs: separates the recall-oriented and precision-oriented phases, places governance/validity constraints deliberately, and is explicit about the latency/cost/precision tradeoffs at each stage.
Handles noise, staleness, and duplication principledly: distinguishes signals worth a hard gate from those worth a soft penalty, and has a defensible policy for collapsing near-duplicates to a trustworthy representative.
Treats permissions/governance as a hard constraint applied at retrieval and generation time, not bolted on.
Gives a calibrated answer-vs-clarify policy grounded in measurable confidence signals rather than a single score.
Separates evaluation by layer with appropriate metrics: measures retrieval, generation, and governance independently and selects metrics suited to each (ranking quality, answer grounding, and safety guardrails) instead of one blended score.
Surfaces the real tradeoffs (e.g. recall vs. noise, summarization error, rerank cost, popularity vs. discoverability of new assets).

Follow-up Questions

How would you bootstrap evaluation and ranking on day one with no click logs — cold start for both the labeled set and any learning-to-rank model?
How do you keep the index fresh and consistent when tables and notebooks change continuously and permissions can be revoked at any moment? What's your invalidation/re-embedding strategy?
A new, high-quality canonical table has zero usage history, so popularity signals bury it under a legacy table everyone still queries. How do you surface it without breaking trust in popularity as a signal?
How would you extend this from "find the right asset" to agentic multi-hop retrieval (e.g., follow lineage to find the upstream source of truth, or assemble the tables needed to write a join)?

Example user questions the system must serve:

"Which table contains daily active users by country?"
"How do I calculate net revenue retention?"
"Is there an existing notebook for enterprise churn analysis?"
"What is the source of truth for subscription events?"

Constraints & Assumptions

Scale: assume an enterprise lakehouse with $10^5$ – $10^6$ tables and a similar number of notebooks, plus dashboards, queries, and catalog docs. Assets change continuously (new tables daily, notebooks edited constantly).
Heterogeneity: assets range from canonical, well-documented production tables to one-off personal scratch notebooks. Metadata quality is uneven and often missing.
Governance is mandatory: the system must respect access control (table/row/column-level), data classification (PII, financial, restricted), and never surface assets a user cannot see.
Latency: interactive use; aim for end-to-end retrieval in the low hundreds of milliseconds for candidate generation, with a heavier reranking budget reserved for a small top- $k$ .
Output is evidence, not just prose: the assistant should return ranked assets with justifications, not only a generated paragraph.
You may assume access to a data catalog, query/usage logs, lineage graph, and data-quality test results as input signals.

The Problem

Design the retrieval system end-to-end, covering at minimum:

What assets and metadata to index , and how to represent them.
The retrieval + reranking pipeline (query understanding → candidate generation → filtering → reranking).
Handling noisy, stale, duplicated, and low-quality notebooks and tables.
The answer-vs-clarify decision — when does the system have enough evidence to answer, and when should it ask a scoping question?
Offline and online evaluation metrics , including governance guardrails.

Clarifying Questions to Ask

What asset types are in scope at launch — tables and notebooks only, or also dashboards, metric definitions, and docs? Is one type prioritized?
Is the output a ranked list of assets, a generated natural-language answer, generated SQL, or all three?
How rich and trustworthy is existing metadata (descriptions, ownership, tags), and can we rely on query/usage logs and a lineage graph?
What are the access-control granularities we must enforce (table / row / column), and how current is the permission data at query time?
What are the latency and cost budgets per query, and how many concurrent users?
Is there existing labeled relevance data (click logs, catalog search logs), or do we need to bootstrap an evaluation set?

What a Strong Answer Covers

This rubric names the dimensions a strong answer is judged on, not the answers themselves — the design choices are yours to make and defend.

Scopes what to index beyond raw text: identifies that trust, usage, quality, freshness, and lineage signals — not just descriptions — drive ranking, and tailors the inventory to each asset type.
Chooses and justifies appropriate representations: recognizes that no single representation serves exact-identifier lookups, semantic intent, and hard filtering equally well, and argues for whatever mix they pick rather than defaulting to one embedding per asset.
Reasons about granularity and ingestion-time normalization: addresses how heterogeneous, noisy assets are cleaned and broken into retrievable units, rather than indexing raw token windows.
Defines a staged retrieval pipeline with tradeoffs: separates the recall-oriented and precision-oriented phases, places governance/validity constraints deliberately, and is explicit about the latency/cost/precision tradeoffs at each stage.
Handles noise, staleness, and duplication principledly: distinguishes signals worth a hard gate from those worth a soft penalty, and has a defensible policy for collapsing near-duplicates to a trustworthy representative.
Treats permissions/governance as a hard constraint applied at retrieval and generation time, not bolted on.
Gives a calibrated answer-vs-clarify policy grounded in measurable confidence signals rather than a single score.
Separates evaluation by layer with appropriate metrics: measures retrieval, generation, and governance independently and selects metrics suited to each (ranking quality, answer grounding, and safety guardrails) instead of one blended score.
Surfaces the real tradeoffs (e.g. recall vs. noise, summarization error, rerank cost, popularity vs. discoverability of new assets).

Follow-up Questions

How would you bootstrap evaluation and ranking on day one with no click logs — cold start for both the labeled set and any learning-to-rank model?
How do you keep the index fresh and consistent when tables and notebooks change continuously and permissions can be revoked at any moment? What's your invalidation/re-embedding strategy?
A new, high-quality canonical table has zero usage history, so popularity signals bury it under a legacy table everyone still queries. How do you surface it without breaking trust in popularity as a signal?
How would you extend this from "find the right asset" to agentic multi-hop retrieval (e.g., follow lineage to find the upstream source of truth, or assemble the tables needed to write a join)?

Design RAG Retrieval for Data Assets

Quick Overview

Constraints & Assumptions

The Problem

Clarifying Questions to Ask

What a Strong Answer Covers

Follow-up Questions

Solution

Submit Your Answer to Earn 20XP

Design RAG Retrieval for Data Assets

Quick Overview

Constraints & Assumptions

The Problem

Clarifying Questions to Ask

What a Strong Answer Covers

Follow-up Questions

Solution

Submit Your Answer to Earn 20XP