What does the Amazon Machine Learning Engineer interview process look like?

Based on candidate reports, the Amazon Machine Learning Engineer loop typically includes two stages: a Technical Screen and an Onsite. Each stage covers a distinct set of topics, walked through in detail in this guide.

What topics does Amazon focus on in Machine Learning Engineer interviews?

Amazon Machine Learning Engineer interviews cover Machine Learning, ML System Design, Coding & Algorithms, and Behavioral & Leadership. This guide breaks each topic down into core concepts, worked examples, and the real questions candidates were asked.

How many real Amazon Machine Learning Engineer interview questions are in this guide?

This guide is anchored to 33 real Amazon Machine Learning Engineer interview questions sourced from candidate reports, each linked to a full practice page with starter code, solution discussion, and community comments.

Amazon Machine Learning Engineer Interview Prep Guide

Everything Amazon actually asks Machine Learning Engineer candidates — concept walkthroughs, worked examples, and the real interview questions, drawn from candidate reports. Free to read.


Technical Screen

Machine Learning

Transformer Architectures And Attention — covered in depth under Onsite below.
LLM Architecture, Tuning, And Evaluation — covered in depth under Onsite below.
Logistic Regression And Linear Models — covered in depth under Onsite below.
Evaluation, Statistical Inference, And Class Imbalance — covered in depth under Onsite below.

ML System Design

Production ML Pipelines And System Design — covered in depth under Onsite below.

Coding & Algorithms

Coding Algorithms And Data Structures — covered in depth under Onsite below.
PyTorch Training And Model Implementation — covered in depth under Onsite below.

Behavioral & Leadership

Leadership Principles, Ownership, And Measurable Impact — covered in depth under Onsite below.

Onsite

Machine Learning

Transformer Architectures And Attention

Landscape infographic: left-to-right transformer pipeline showing tokens → embeddings + pos enc → multi-head attention (Q,K,V, mask) → add & norm (pre-norm) → feed-forward → output; callouts for scaled dot-product, causal mask, layer norm, and positional encoding types.

What's being tested

Interviewers are probing whether you understand Transformer internals well enough to implement, debug, train, and serve them—not just describe them at a high level. For a Machine Learning Engineer, the bar is shape reasoning, masking correctness, numerical stability, training efficiency, and knowing how architectural choices affect latency, memory, and model quality. Amazon cares because production ML systems often use Transformer-based encoders, decoders, rankers, retrieval models, and LLM-backed services where small implementation errors can silently degrade relevance, safety, or cost. Expect follow-ups that connect math to `PyTorch` tensors, GPU memory, distributed training, and online serving constraints.

Core knowledge

Scaled dot-product attention computes $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$, where $M$ is an additive mask, typically $0$ for allowed positions and $-\infty$ for blocked positions. Dividing by $\sqrt{d_k}$ keeps the variance of the logits roughly independent of the head dimension, which prevents large logits from saturating the softmax and shrinking its gradients.
Multi-head attention projects an input $X \in \mathbb{R}^{B \times T \times d_{\text{model}}}$ into $Q,K,V$ tensors shaped roughly $B \times h \times T \times d_{\text{head}}$. Heads let the model learn different relation patterns; implementation bugs usually come from an incorrect `view` or `transpose`, or from calling `view` on a non-contiguous tensor.
Causal masking is mandatory for decoder-only language models. Token $t$ can attend only to positions $\leq t$, usually via a lower-triangular mask of shape $T \times T$. Forgetting this causes label leakage: training loss looks excellent because each position can see the token it is asked to predict, but generation quality collapses because future tokens are unavailable at inference.
Decoder-only GPT-style blocks typically contain masked self-attention, a position-wise feed-forward network, residual connections, and normalization. A common pre-norm block is `x = x + attention(LN(x))`, then `x = x + MLP(LN(x))`, which improves deep-network training stability versus post-norm.
Layer normalization normalizes across the feature dimension for each token independently: $\mathrm{LN}(x)=\gamma \frac{x-\mu}{\sqrt{\sigma^2+\epsilon}}+\beta$ where $\mu,\sigma^2$ are computed over hidden features, and $\gamma,\beta$ are learned per-feature parameters. Unlike batch norm, it does not depend on batch statistics.
Position information is required because self-attention alone is permutation-equivariant. Common choices include learned absolute embeddings, sinusoidal embeddings, rotary position embeddings (`RoPE`), and relative position bias. For long-context serving, extrapolation behavior and `KV cache` compatibility matter.
Feed-forward networks in Transformers are usually two linear layers with an activation such as `GELU`, often expanding hidden size by about 4x: $d_{\text{model}} \rightarrow 4d_{\text{model}} \rightarrow d_{\text{model}}$. In LLMs, gated variants like `SwiGLU` often improve quality but add a third weight matrix, which raises parameter count and memory unless the hidden width is reduced to compensate.
Training-time memory is dominated by activations, attention matrices, and optimizer states. Full attention is $O(T^2)$ in sequence length for both time and memory, and materializing the attention probabilities takes $O(BhT^2)$ memory for batch size $B$ and $h$ heads; for long sequences, consider `FlashAttention`, gradient checkpointing, mixed precision, sequence packing, or sparse/local attention.
Inference-time bottlenecks differ from training. Autoregressive decoding is sequential over generated tokens, but `KV cache` avoids recomputing keys and values for previous tokens, changing per-step attention from recomputing full prefix projections to attending against cached state. Cache memory grows with layers, heads, context length, and batch size.
Mixture-of-Experts replaces dense MLPs with multiple expert networks and a learned router, often top-1 or top-2 routing. It increases parameter count without activating all parameters per token, but introduces load-balancing losses, capacity factors, token dropping risk, and distributed communication such as `all-to-all`.
Transformers versus CNNs is about inductive bias and scaling tradeoffs. CNNs encode locality and translation equivariance through shared kernels, making them sample-efficient for images; Transformers learn global interactions with weaker priors but scale well with data, compute, multimodal inputs, and variable-length sequences.
Evaluation and deployment require more than perplexity. For production MLE work, track offline loss, task metrics like `AUC` or `NDCG`, calibration, latency `p50/p99`, GPU memory, throughput, drift, and online/offline parity. A lower loss model may be unacceptable if it doubles serving cost or violates latency SLOs.
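To make the attention equation and the additive causal mask concrete, here is a minimal single-head sketch in plain Python. It is not how a production model would be written (that would use batched `PyTorch` tensors); the function names `softmax` and `causal_attention` and the toy inputs are assumptions chosen for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats (tolerates -inf entries)."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with an additive causal mask.

    Q and K are lists of T vectors of length d_k; V is a list of T vectors.
    Returns (outputs, weights), where weights is the T x T attention matrix.
    """
    T, d_k = len(Q), len(Q[0])
    scale = 1.0 / math.sqrt(d_k)
    weights = []
    for t in range(T):
        logits = []
        for s in range(T):
            score = sum(q * k for q, k in zip(Q[t], K[s])) * scale
            mask = 0.0 if s <= t else -math.inf  # additive mask M
            logits.append(score + mask)
        weights.append(softmax(logits))
    d_v = len(V[0])
    outputs = [
        [sum(weights[t][s] * V[s][j] for s in range(T)) for j in range(d_v)]
        for t in range(T)
    ]
    return outputs, weights

# Toy inputs, illustration only.
Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
out, w = causal_attention(Q, K, V)
print(w[0])  # the first token can only attend to itself
```

Every entry above the diagonal of `w` is exactly zero, because `exp(-inf)` evaluates to `0.0`, and each row still sums to one.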

Worked example

For “Implement decoder-only GPT-style transformer,” a strong candidate first clarifies the expected scope: “Should I implement a minimal `PyTorch` module with token embeddings, positional embeddings, masked multi-head self-attention, MLP blocks, and logits, or also include training and generation?” They state assumptions early: input token IDs have shape `B x T`, vocabulary size is `V`, embedding dimension is `C`, the number of heads divides `C`, and the model returns logits shaped `B x T x V`. The answer should be organized around four pillars: embeddings and positions, Transformer block structure, attention tensor shapes and causal mask, and output projection/loss.

The implementation skeleton would include `nn.Embedding` for tokens, either learned positional embeddings or `RoPE`, a stack of blocks using pre-norm residuals, and a final `LayerNorm` plus linear language-model head. The candidate should explicitly explain that `q`, `k`, and `v` are projected from `x`, reshaped from `B x T x C` to `B x n_heads x T x head_dim`, and that the attention scores then have shape `B x n_heads x T x T`. A specific tradeoff to flag is whether to use a simple registered triangular mask, which is clear and interview-friendly, or optimized kernels like `torch.nn.functional.scaled_dot_product_attention` / `FlashAttention` for production efficiency. They should call out common edge cases: `T` exceeding the configured context length, incorrect mask broadcasting, dropout that is not disabled in eval mode, and using `.view()` after `transpose` without `.contiguous()` or `.reshape()`. A good close is: “If I had more time, I’d add generation with `KV cache`, unit tests for shape and causal leakage, and a small overfit test to verify the implementation learns.”
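The causal-leakage unit test mentioned in that close can be sketched as a prefix-invariance check: changing a token at position `t` must not change any output before `t`. The helper `has_causal_leak` and the two toy models below are hypothetical stand-ins; in practice `model_fn` would presumably wrap the real module in eval mode and compare logits.

```python
def has_causal_leak(model_fn, tokens, swap_token):
    """Return True if changing a later token alters any earlier output.

    model_fn maps a list of T token values to a list of T outputs.
    """
    base = model_fn(tokens)
    for t in range(1, len(tokens)):
        perturbed = list(tokens)
        perturbed[t] = swap_token  # change only position t
        out = model_fn(perturbed)
        # Outputs strictly before t must be unchanged for a causal model.
        if any(abs(a - b) > 1e-9 for a, b in zip(base[:t], out[:t])):
            return True
    return False

def causal_toy(tokens):
    """Stand-in for a correctly masked model: running mean over the prefix."""
    out, total = [], 0.0
    for i, x in enumerate(tokens, start=1):
        total += x
        out.append(total / i)
    return out

def leaky_toy(tokens):
    """Stand-in for a model missing its mask: every position sees all tokens."""
    mean = sum(tokens) / len(tokens)
    return [mean for _ in tokens]

tokens = [3.0, 1.0, 4.0, 1.0, 5.0]
print(has_causal_leak(causal_toy, tokens, swap_token=9.0))  # False
print(has_causal_leak(leaky_toy, tokens, swap_token=9.0))   # True
```

The same check catches a mask that is built correctly but broadcast over the wrong dimension, since the symptom is identical: earlier outputs move when a later input changes.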

A second angle

For “Explain Layer Normalization in Transformers,” the same concept shifts from architecture construction to training stability and numerical behavior. Instead of writing modules end to end, the candidate should derive the equation, identify the normalized dimension, and explain the roles of $\gamma$, $\beta$, and $\epsilon$. The key MLE insight is that LayerNorm behaves consistently across small or variable batch sizes, which is important for sequence models and inference workloads where batch composition changes. A strong answer also distinguishes pre-norm from post-norm: pre-norm improves gradient flow in deeper networks, while post-norm matches the original Transformer but can be harder to optimize at scale. The interviewer may then pivot to why incorrect normalization placement can cause divergence, slow convergence, or degraded perplexity even when tensor shapes are correct.
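As a rough illustration of the LayerNorm equation, the sketch below normalizes one token's hidden vector in plain Python. The function name `layer_norm` and the example values are assumptions; the variance is the population (biased) variance over the hidden features, matching the formula given under Core knowledge.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm for a single token vector x, computed over its hidden features."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n  # population variance
    inv_std = 1.0 / math.sqrt(var + eps)     # eps guards against division by zero
    return [g * (v - mu) * inv_std + b for v, g, b in zip(x, gamma, beta)]

# Example values, illustration only: identity scale and zero shift.
x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)

# Rescaling the input leaves the output almost unchanged (up to the eps term).
y_scaled = layer_norm([10.0 * v for v in x], gamma=[1.0] * 4, beta=[0.0] * 4)
print(y)
print(y_scaled)
```

With identity $\gamma$ and zero $\beta$, the output has mean close to zero and variance close to one; $\gamma$ and $\beta$ then let the model restore whatever per-feature scale and offset it finds useful.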

Common pitfalls

Pitfall: Treating attention as “the model looks at important tokens” without discussing masks, shapes, or scaling.

That answer is too abstract for an MLE interview. A better response names $Q,K,V$, gives the attention equation, describes the `B x heads x T x T` score tensor, and explains how masking prevents future-token leakage in decoder-only models.

Pitfall: Confusing LayerNorm with BatchNorm.

BatchNorm normalizes using batch-level statistics and behaves differently between training and inference, while LayerNorm normalizes per token across hidden features and uses the same computation at inference. This distinction matters for variable-length sequences, small batches, autoregressive decoding, and distributed training.
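The batch-dependence difference can be shown with a small plain-Python sketch. Both functions are simplified assumptions: the learned scale and shift are omitted, and `batch_norm_train` models training-mode batch statistics only, not the running averages that BatchNorm uses at inference.

```python
import math

def layer_norm_rows(batch, eps=1e-5):
    """Normalize each row (token) over its own features."""
    out = []
    for row in batch:
        mu = sum(row) / len(row)
        var = sum((v - mu) ** 2 for v in row) / len(row)
        out.append([(v - mu) / math.sqrt(var + eps) for v in row])
    return out

def batch_norm_train(batch, eps=1e-5):
    """Training-mode batch norm: normalize each feature over the batch dimension."""
    n, d = len(batch), len(batch[0])
    mus = [sum(row[j] for row in batch) / n for j in range(d)]
    variances = [sum((row[j] - mus[j]) ** 2 for row in batch) / n for j in range(d)]
    return [
        [(row[j] - mus[j]) / math.sqrt(variances[j] + eps) for j in range(d)]
        for row in batch
    ]

# The same token placed in two different batches (illustrative values).
token = [1.0, 2.0, 4.0]
batch_a = [token, [0.0, 0.0, 0.0]]
batch_b = [token, [10.0, -5.0, 7.0]]

ln_same = layer_norm_rows(batch_a)[0] == layer_norm_rows(batch_b)[0]
bn_same = batch_norm_train(batch_a)[0] == batch_norm_train(batch_b)[0]
print(ln_same, bn_same)  # True False
```

The LayerNorm output for the token is identical in both batches, while the batch-statistics output changes with its neighbors, which is why batch composition matters for one and not the other.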

Pitfall: Over-indexing on model architecture while ignoring production constraints.

Saying “use a larger Transformer” is incomplete if you do not mention latency, memory, throughput, quantization, `KV cache`, or online metrics. For an Amazon MLE, a strong answer connects architecture choices to deployability: model size, sequence length, batching strategy, GPU utilization, and monitoring for quality regressions.
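One way to connect architecture to serving cost is a back-of-the-envelope `KV cache` estimate. The sketch below assumes one key and one value vector are stored per layer, per KV head, per position, per sequence; the function name and the configuration are hypothetical and do not describe any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Approximate KV cache size in bytes.

    The leading factor of 2 covers keys and values; bytes_per_elem=2 assumes
    a 16-bit dtype such as fp16 or bf16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Hypothetical configuration, chosen only to make the arithmetic easy to follow.
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096, batch_size=1)
print(size / 2**30)  # 2.0 (GiB)

size_b8 = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096, batch_size=8)
print(size_b8 / 2**30)  # 16.0 (GiB)
```

Because the estimate is linear in context length and batch size, doubling either doubles cache memory, which is often the real constraint on batching strategy for long-context serving.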

Connections

Interviewers may pivot from this topic into distributed training, including data parallelism, tensor parallelism, pipeline parallelism, and MoE `all-to-all` communication. They may also ask about model compression techniques such as quantization, distillation, pruning, and low-rank adaptation, or about evaluation for LLM-style systems, including perplexity, task accuracy, hallucination checks, and latency/cost tradeoffs.

Analyze attention complexity and improvements

Evaluates understanding of Transformer self-attention in the Machine Learning domain, testing the ability to analyze time and space complexity...
