Design a multimodal RAG assistant
Company: Apple
Role: Machine Learning Engineer
Category: System Design
Difficulty: medium
Interview Round: Technical Screen
## Prompt
Design a **Retrieval-Augmented Generation (RAG)** system that can answer user questions using an internal knowledge base containing **multiple modalities** (at least text and images; optionally PDFs/tables).
### Requirements
- Users ask natural-language questions and want grounded answers with citations.
- Knowledge base items may include:
- Plain text docs (wiki pages, tickets)
- PDFs (mixed text + images)
- Images (diagrams/screenshots) with minimal surrounding metadata
- The system should retrieve relevant evidence across modalities and use an LLM to generate an answer.
### What to cover
1. Data ingestion and preprocessing for each modality
2. Indexing strategy (vector, keyword, hybrid) and how you would store metadata
3. Retrieval at query time (including cross-modal retrieval)
4. How you would handle chunking, embeddings, and re-ranking
5. Prompting / grounding strategy and citation generation
6. Quality evaluation (offline + online), latency, and cost considerations
7. Failure modes (hallucinations, stale data, missing modality) and mitigations
You may make reasonable assumptions and state them clearly.
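Items 2–4 above (indexing, retrieval, chunking/embeddings/re-ranking) are usually where a screen-level answer takes shape. As one possible illustration only, the sketch below fuses dense cosine similarity with a crude keyword-overlap signal over a single cross-modal chunk index; the `Chunk` type, the score fusion, and the `alpha` weight are assumptions made for this sketch, not a reference design, and the embeddings are assumed to come from a shared text–image encoder (e.g. a CLIP-style model).

```python
# Minimal sketch (not a production design): one flat index of chunks across modalities,
# dense cosine similarity fused with a rough keyword-overlap score.
from dataclasses import dataclass

import numpy as np


@dataclass
class Chunk:
    doc_id: str
    modality: str          # "text", "image", or "pdf_page"
    content: str           # raw text, OCR output, or an image caption
    embedding: np.ndarray  # unit-normalised dense vector from a shared encoder (assumed)


def dense_scores(query_vec: np.ndarray, chunks: list[Chunk]) -> np.ndarray:
    """Cosine similarity, assuming all vectors are already unit-normalised."""
    return np.stack([c.embedding for c in chunks]) @ query_vec


def keyword_scores(query: str, chunks: list[Chunk]) -> np.ndarray:
    """Very rough lexical signal: shared-term count (a real system would use BM25)."""
    terms = set(query.lower().split())
    return np.array(
        [len(terms & set(c.content.lower().split())) for c in chunks], dtype=float
    )


def hybrid_retrieve(query: str, query_vec: np.ndarray, chunks: list[Chunk],
                    k: int = 5, alpha: float = 0.7) -> list[tuple[Chunk, float]]:
    """Fuse dense and lexical scores; return top-k chunks for re-ranking/grounding."""
    dense = dense_scores(query_vec, chunks)
    sparse = keyword_scores(query, chunks)
    if sparse.max() > 0:
        sparse = sparse / sparse.max()   # scale lexical scores into [0, 1]
    fused = alpha * dense + (1 - alpha) * sparse
    top = np.argsort(-fused)[:k]
    return [(chunks[i], float(fused[i])) for i in top]
```

A stronger answer would typically follow this first-stage retrieval with a cross-encoder re-ranker over the top candidates before prompting the LLM.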
Quick Answer: This question evaluates system design and machine learning engineering skills for building a multimodal Retrieval-Augmented Generation (RAG) assistant: data ingestion and preprocessing across modalities, indexing and retrieval strategy, embeddings and re-ranking, grounding and prompting with citations, and evaluation with failure-mode mitigation. It is commonly asked to gauge whether an engineer can architect a scalable, grounded QA pipeline that balances retrieval quality, latency, and cost; it sits in the System Design domain and tests both high-level architectural reasoning and practical implementation trade-offs.
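For the grounding and citation piece in particular, a screen-level answer often comes down to numbering the retrieved evidence and constraining the model to cite it, with an explicit insufficient-evidence escape hatch to reduce hallucination risk. A minimal sketch, reusing the hypothetical `Chunk` type from the retrieval sketch above:

```python
# Minimal sketch of the grounding/citation step: number the evidence, force inline
# citations, and allow the model to decline when the evidence does not cover the question.
def build_grounded_prompt(question: str, retrieved: list[tuple["Chunk", float]]) -> str:
    evidence = "\n".join(
        f"[{i}] ({chunk.modality}, doc={chunk.doc_id}) {chunk.content[:500]}"
        for i, (chunk, _score) in enumerate(retrieved, start=1)
    )
    return (
        "Answer the question using ONLY the evidence below. "
        "Cite evidence inline as [n]. If the evidence is insufficient, say so.\n\n"
        f"Evidence:\n{evidence}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The bracketed indices in the generated answer can then be mapped back to `doc_id` values to render user-facing citations and to audit groundedness offline.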