What does the Amazon Data Scientist interview process look like?

Based on candidate reports compiled in this guide, the Amazon Data Scientist loop typically includes three stages: Technical Screen, Onsite, and Take-home Project. Each stage covers a distinct set of topics, which this guide walks through in detail.

What topics does Amazon focus on in Data Scientist interviews?

Amazon Data Scientist interviews cover Data Manipulation (SQL/Python), Analytics & Experimentation, Statistics & Math, Machine Learning, Coding & Algorithms, and Behavioral & Leadership. The guide breaks each topic down into core concepts, worked examples, and the real questions candidates were asked.

How many real Amazon Data Scientist interview questions are in this guide?

This guide is anchored to 32 real Amazon Data Scientist interview questions sourced from candidate reports, each linked to a full practice page with starter code, solution discussion, and community comments.

Amazon Data Scientist Interview Prep Guide

Everything Amazon actually asks Data Scientist candidates — concept walkthroughs, worked examples, and the real interview questions, drawn from candidate reports. Free to read.


Technical Screen

Data Manipulation (SQL/Python)

SQL Analytical Querying And Data Modeling — covered in depth under Take-home Project below.
Python/Pandas Data Manipulation — covered in depth under Onsite below.

Analytics & Experimentation

A/B Testing And Statistical Inference — covered in depth under Onsite below.
Product Metrics, Root-Cause Analysis And Visualization — covered in depth under Onsite below.

Statistics & Math

Propensity Score Matching — covered in depth under Onsite below.

Machine Learning

Supervised ML Fundamentals, Evaluation And Feature Engineering — covered in depth under Onsite below.
ML System Design, Recommenders, Forecasting And Allocation — covered in depth under Onsite below.

Retrieval-Augmented Generation

What's being tested

Interviewers are probing whether you can reason about retrieval-augmented generation as an applied ML system: when it is preferable to fine-tuning, how to evaluate answer quality, and how to control hallucination, latency, and cost. For a Data Scientist, the emphasis is not on building the serving stack, but on defining success metrics, designing offline and online evaluations, diagnosing failure modes, and making evidence-based tradeoffs. Amazon cares because many internal and customer-facing products depend on accurate answers over changing catalogs, policies, reviews, support docs, and seller content. A strong answer connects model behavior to business risk: wrong answers, unsupported claims, poor coverage, high inference cost, and degraded customer trust.

Core knowledge

RAG pipeline anatomy: a typical system has document selection, chunking, embedding, vector retrieval, optional lexical retrieval, reranking, prompt construction, generation, and post-generation validation. As a DS, focus on which stage explains failures: missing document, bad chunk, weak ranking, poor prompt grounding, or model hallucination.
RAG vs. fine-tuning: use RAG when knowledge changes frequently, answers require citations, or the source corpus is large and dynamic. Use fine-tuning when you need style, task format, domain-specific reasoning patterns, or classification behavior. Fine-tuning does not reliably “store” thousands of facts, and a fine-tuned model can still hallucinate.
Retrieval metrics: evaluate retrieval before generation using Recall@k, Precision@k, MRR, and nDCG@k. If the correct supporting passage appears in the top $k$, retrieval recall is high; if it appears near rank 1, MRR and nDCG improve. Poor retrieval caps final answer quality no matter how strong the LLM is.
Answer-quality metrics: evaluate final responses on faithfulness, answer correctness, citation accuracy, coverage, refusal quality, and helpfulness. For factual systems, split “is the answer true?” from “is the answer supported by retrieved context?” because a true answer can still be ungrounded.
Human evaluation design: create a labeled test set stratified by query type: lookup, comparison, multi-hop, ambiguous, out-of-scope, freshness-sensitive, and adversarial. Use blinded raters, rubric-based labels, inter-rater agreement such as Cohen’s kappa, and adjudication for ambiguous cases.
Offline-to-online gap: offline metrics like Recall@5 and judge-rated correctness are necessary but not sufficient. Online metrics may include CTR, task completion, deflection rate, escalation rate, answer acceptance, repeat-contact rate, refund/contact outcomes, and guardrail metrics like harmful-answer rate.
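Cost metrics: total expected cost per query is roughly $C = C_\text{embed} + k\,C_\text{rerank} + T_\text{input}\,c_\text{in} + T_\text{output}\,c_\text{out}$, where $C_\text{embed}$ is the cost of embedding the query, $k$ is the number of retrieved candidates, $C_\text{rerank}$ is the reranking cost per candidate, $T_\text{input}$ and $T_\text{output}$ are the input and output token counts, and $c_\text{in}$ and $c_\text{out}$ are the per-token prices. DS tradeoffs include reducing top_k, compressing context, using cheaper rerankers, caching frequent answers, or routing simple queries to smaller models.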
Chunking tradeoffs: small chunks improve precise retrieval but can lose context; large chunks preserve context but add noise and token cost. A common starting point is 200–800 tokens with overlap; tune from there using retrieval recall and downstream answer accuracy rather than settling on an arbitrary chunk size.
Embedding tradeoffs: higher-dimensional embeddings can improve semantic resolution but increase storage, retrieval cost, and risk of overfitting to benchmark-like queries. Compare embedding models with a fixed evaluation set, including domain-specific synonyms, abbreviations, multilingual queries, and entity-heavy queries.
Hybrid retrieval: dense retrieval captures semantic similarity, while BM25 or lexical retrieval is better for exact product IDs, policy names, error codes, and rare entities. A hybrid system with reranking often beats either alone, especially for Amazon-like catalogs and support documents with many near-duplicate entities.
Reranking role: a cross-encoder reranker scores query-document pairs more accurately than vector similarity but is slower and costlier. Use it on a candidate pool, e.g. retrieve top 50–200, rerank to top 5–10, then pass only the most relevant passages to the LLM.
Grounding and refusal: good systems explicitly handle “answer not in context.” Measure false-answer rate on unanswerable queries, not just accuracy on answerable ones. The prompt should instruct the model to cite evidence and refuse unsupported claims, but prompt instructions are not a substitute for evaluation.
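
The retrieval metrics listed above can be computed in a few lines once you have, for each query, a ranked list of retrieved document IDs and a set of gold supporting documents. The sketch below is a minimal illustration under stated assumptions: relevance is binary, the function names are hypothetical, and the two toy queries are made up for the example rather than taken from any real benchmark.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the gold documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are gold documents."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first gold document, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """nDCG@k with binary relevance (gain 1 for a gold document, else 0)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy evaluation set (made up): query -> (ranked retrieved IDs, gold IDs).
eval_set = {
    "q1": (["d3", "d1", "d7", "d2"], {"d1"}),
    "q2": (["d5", "d6", "d8", "d9"], {"d9", "d4"}),
}

k = 3
mean_recall = sum(recall_at_k(r, g, k) for r, g in eval_set.values()) / len(eval_set)
mrr = sum(reciprocal_rank(r, g) for r, g in eval_set.values()) / len(eval_set)
mean_ndcg = sum(ndcg_at_k(r, g, k) for r, g in eval_set.values()) / len(eval_set)
print(f"Recall@{k}={mean_recall:.3f}  MRR={mrr:.3f}  nDCG@{k}={mean_ndcg:.3f}")
```

In an interview, the useful point is that these numbers are computed per query and then averaged, so you can also report them by query segment instead of as a single aggregate.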

Worked example

For “Design and evaluate a RAG system,” start by framing the use case: “What corpus are we answering from, how fresh is it, what is the cost of a wrong answer, and do we need citations or just conversational help?” Then declare assumptions, such as a customer-support assistant over policy and troubleshooting documents where correctness and groundedness matter more than creativity. Organize the answer into four pillars: data and query taxonomy, retrieval quality, generation quality, and online experiment design.

For retrieval, say you would build an offline benchmark of real and synthetic queries with gold supporting documents, then track Recall@k, MRR, and coverage by segment. For generation, evaluate answer correctness, faithfulness to retrieved context, citation precision, refusal behavior, and latency/cost per resolved query. A concrete tradeoff to flag is top_k: increasing it may improve recall but can add irrelevant context, raise token cost, and sometimes reduce answer faithfulness. For launch, propose an A/B test against the current experience with primary metrics like successful resolution or accepted answer rate, guardrails like escalation rate and complaint rate, and segmented analysis for long-tail topics. Close by saying that, with more time, you would add error taxonomy reviews: no relevant doc retrieved, relevant doc retrieved but ignored, conflicting docs, stale source, and ambiguous user intent.
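
To make the top_k tradeoff concrete, you can sketch how per-query cost moves as more passages are passed to the model, using the cost formula from the Core knowledge section. Everything numeric below is a hypothetical placeholder (prices, token counts, chunk size), and the split between "candidates reranked" and "passages sent to the LLM" is an assumption about how the pipeline is wired, not a description of any specific system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CostAssumptions:
    # All values are hypothetical placeholders, not real vendor prices.
    embed_cost: float = 0.00001               # C_embed: embedding the query
    rerank_cost_per_doc: float = 0.00002      # C_rerank: scoring one candidate
    price_per_input_token: float = 0.000003   # c_in
    price_per_output_token: float = 0.000015  # c_out
    prompt_overhead_tokens: int = 300         # instructions + user question
    tokens_per_chunk: int = 400               # assumed average chunk length
    output_tokens: int = 200                  # T_output

def cost_per_query(n_candidates, top_k, a=CostAssumptions()):
    """C = C_embed + k * C_rerank + T_input * c_in + T_output * c_out."""
    # T_input grows linearly with the number of passages placed in the prompt.
    input_tokens = a.prompt_overhead_tokens + top_k * a.tokens_per_chunk
    return (
        a.embed_cost
        + n_candidates * a.rerank_cost_per_doc
        + input_tokens * a.price_per_input_token
        + a.output_tokens * a.price_per_output_token
    )

# Sweep top_k while holding the reranked candidate pool fixed at 100.
for top_k in (3, 5, 10, 20):
    print(f"top_k={top_k:>2}  cost_per_query={cost_per_query(100, top_k):.5f}")
```

Pairing this sweep with Recall@k and faithfulness measured at the same top_k values gives you the evidence for picking an operating point, rather than arguing for a value in the abstract.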

A second angle

For “Choose Between Fine-Tuning and RAG for Client Chatbot,” the same concepts apply, but the decision is framed as model adaptation rather than system evaluation. A strong answer says RAG is the default if the chatbot must answer from changing client documents, provide citations, or support auditability. Fine-tuning is more appropriate if the main gap is tone, output schema, intent classification, or domain-specific phrasing. The best answer often combines them: RAG for factual grounding and a lightly fine-tuned or instruction-tuned model for consistent behavior. The evaluation should compare variants on the same labeled query set and include cost per successful resolution, not just model accuracy.
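
A comparison on cost per successful resolution could be sketched as follows. This assumes each variant was run on the same labeled query set and that every query has a resolved/unresolved label plus a measured cost; the variant names, the helper `summarize_variant`, and all numbers are hypothetical and exist only to show the calculation.

```python
def summarize_variant(results):
    """Summarize one variant.

    results: list of (resolved: bool, cost: float), one entry per query
    in the shared evaluation set.
    """
    n = len(results)
    successes = sum(1 for resolved, _ in results if resolved)
    total_cost = sum(cost for _, cost in results)
    return {
        "success_rate": successes / n if n else 0.0,
        "cost_per_query": total_cost / n if n else 0.0,
        # Infinite when nothing was resolved, so the variant sorts last.
        "cost_per_success": total_cost / successes if successes else float("inf"),
    }

# Made-up outcomes on a 10-query evaluation set.
variants = {
    "rag_only": [(True, 0.010)] * 7 + [(False, 0.010)] * 3,
    "fine_tuned_only": [(True, 0.004)] * 5 + [(False, 0.004)] * 5,
    "rag_plus_fine_tuned": [(True, 0.012)] * 8 + [(False, 0.012)] * 2,
}

summaries = {name: summarize_variant(rows) for name, rows in variants.items()}
for name, s in summaries.items():
    print(
        f"{name:<20} success={s['success_rate']:.2f} "
        f"cost/query={s['cost_per_query']:.4f} "
        f"cost/success={s['cost_per_success']:.4f}"
    )
```

Reporting success rate next to cost per success matters because the cheapest variant per success can also be the one that resolves the fewest queries, which is a tradeoff the interviewer will expect you to name.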

Common pitfalls

Pitfall: Treating RAG as “just add a vector database.”

That answer is too shallow for a Data Scientist interview because it skips measurement. A better answer decomposes performance into retrieval recall, ranking quality, grounding, and end-user outcome metrics, then explains how each would be evaluated and improved.
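
One way to show this decomposition is to attribute each failed query to the earliest pipeline stage that broke. The sketch below assumes every evaluated query has already been labeled with four boolean flags; the flag names and stage labels are hypothetical, and the five records are made up for illustration.

```python
from collections import Counter

def failure_stage(record):
    """Return the earliest stage that failed for one labeled query."""
    if not record["gold_retrieved"]:
        return "retrieval_miss"          # gold doc never entered the candidate pool
    if not record["gold_in_context"]:
        return "ranking_miss"            # retrieved, but ranked below the prompt cutoff
    if not record["answer_grounded"]:
        return "ungrounded_generation"   # evidence was in the prompt but not used
    if not record["answer_correct"]:
        return "grounded_but_wrong"      # e.g. stale or conflicting source
    return "ok"

# Made-up labeled records, one per evaluated query.
records = [
    {"gold_retrieved": True,  "gold_in_context": True,  "answer_grounded": True,  "answer_correct": True},
    {"gold_retrieved": False, "gold_in_context": False, "answer_grounded": False, "answer_correct": False},
    {"gold_retrieved": True,  "gold_in_context": False, "answer_grounded": False, "answer_correct": False},
    {"gold_retrieved": True,  "gold_in_context": True,  "answer_grounded": False, "answer_correct": True},
    {"gold_retrieved": True,  "gold_in_context": True,  "answer_grounded": True,  "answer_correct": False},
]

stage_counts = Counter(failure_stage(r) for r in records)
for stage, count in stage_counts.most_common():
    print(f"{stage:<22} {count}")
```

The fourth record is the case worth calling out: the answer is correct but not supported by the retrieved context, so it still counts as a grounding failure.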

Pitfall: Optimizing only average answer accuracy.

Average accuracy can hide severe failures on high-risk or low-frequency segments such as policy exceptions, medical/legal disclaimers, seller disputes, or fresh catalog changes. Segment by query type, document domain, language, customer cohort, and answerability; also track worst-case or tail metrics like unsupported-answer rate.
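
A small sketch of segmented reporting shows how a healthy average can coexist with a failing tail. The segment names, field names, and counts below are made up for illustration; in practice the rows would come from your labeled evaluation set.

```python
from collections import defaultdict

def metrics_by_segment(rows):
    """rows: iterable of dicts with 'segment', 'correct' (bool), 'supported' (bool)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["segment"]].append(row)

    summary = {}
    for segment, items in grouped.items():
        n = len(items)
        summary[segment] = {
            "n": n,
            "accuracy": sum(r["correct"] for r in items) / n,
            # Share of answers not backed by the retrieved context.
            "unsupported_rate": sum(not r["supported"] for r in items) / n,
        }
    return summary

# Made-up evaluation rows: a large easy segment and a small high-risk one.
rows = (
    [{"segment": "lookup", "correct": True, "supported": True}] * 90
    + [{"segment": "policy_exception", "correct": True, "supported": True}] * 4
    + [{"segment": "policy_exception", "correct": False, "supported": True}] * 1
    + [{"segment": "policy_exception", "correct": False, "supported": False}] * 5
)

overall_accuracy = sum(r["correct"] for r in rows) / len(rows)
by_segment = metrics_by_segment(rows)

print(f"overall accuracy: {overall_accuracy:.2f}")
for segment, m in by_segment.items():
    print(
        f"{segment:<18} n={m['n']:>3} "
        f"accuracy={m['accuracy']:.2f} unsupported_rate={m['unsupported_rate']:.2f}"
    )
```

With these toy numbers the overall accuracy looks strong while the policy-exception segment fails more often than it succeeds, which is the pattern the pitfall describes.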

Pitfall: Claiming fine-tuning “teaches the model the knowledge base.”

Fine-tuning can improve format and behavior, but it is unreliable for frequently changing facts and hard to audit. For factual enterprise chatbots, say that RAG provides freshness and traceability, while fine-tuning may complement it for style, routing, or specialized reasoning patterns.

Connections

Interviewers may pivot from here to ranking evaluation, LLM hallucination measurement, A/B testing, semantic search, or transformer attention. They may also ask about RNNs vs. Transformers to check whether you understand why modern retrieval and generation systems rely on attention-based models for long-context language tasks.

Design and evaluate a RAG system

Evaluates designing and evaluating retrieval-augmented generation (RAG) systems, including document ingestion, chunking, embedding and retrieval...
