Meta Machine Learning Engineer Interview Prep Guide
Everything Meta actually asks Machine Learning Engineer candidates — concept walkthroughs, worked examples, and the real interview questions, drawn from candidate reports. Free to read.
Last updated

Technical Screen
ML System Design
-
ML Model Evaluation, Metrics, And Experimentation — covered in depth under Onsite below.
-
Generative AI Training, Attention, And Post-Training — covered in depth under Onsite below.
Coding & Algorithms
-
Arrays, Strings, And Matrix Fundamentals — covered in depth under Onsite below.
-
Heaps And Selection Algorithms — covered in depth under Onsite below.

What's being tested
Sliding-window streaming statistics test whether you can maintain aggregates over the last k items without recomputing from scratch. Expect efficient data-structure choices for moving average, minimum/maximum, median, and adjacent array/matrix basics like diagonal checks or BFS path reconstruction.
Patterns & templates
-
Fixed-size sliding window — use
dequeplus runningsum;next(x)isO(1)time,O(k)space. -
Moving average — add new value, evict oldest when size exceeds
k, returnsum / len(window); watch integer overflow. -
Sliding minimum/maximum — use monotonic deque storing indices; each element enters/exits once, so total
O(n)time. -
Sliding median — use two heaps: max-heap lower half, min-heap upper half; rebalance sizes after insert/delete.
-
Lazy deletion for heaps — maintain
delayed[value]counts because arbitrary heap removal is notO(log k)in Python. -
Matrix diagonal / Toeplitz check — verify
matrix[r][c] == matrix[r-1][c-1];O(mn)time,O(1)space. -
2D pathfinding — use BFS for shortest unweighted path; store
parent[(r,c)]to reconstruct path after reaching target.
Common pitfalls
Pitfall: Recomputing
sum(window)or sorting each window givesO(nk)orO(nk log k)and usually misses the intended solution.
Pitfall: Median implementations often fail on duplicates unless deletion is counted by value and heap sizes track only valid elements.
Pitfall: Returning average over
kbefore the stream haskelements; usually divide by current window length unless specified otherwise.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions
-
BFS, DFS, Graph, And Grid Traversal — covered in depth under Onsite below.
-
Trees, Linked Lists, And Pointer Algorithms — covered in depth under Onsite below.
Behavioral & Leadership
- Behavioral Leadership, Ownership, And Compliance — covered in depth under Onsite below.
Onsite
ML System Design

What's being tested
These interviews test whether you can design a large-scale recommender ML system as an MLE: candidate generation, ranking, feature computation, model serving, evaluation, and deployment under latency constraints. Meta cares because surfaces like notifications, local recommendations, feed units, and nearby places require fast personalization from billions of possible items while preserving user experience and trust. The interviewer is probing whether you can separate retrieval from ranking, reason about online/offline feature parity, handle real-time context, and describe how you would measure and safely ship model improvements. Strong answers are not just “train a model”; they explain where features come from, how candidates are narrowed, how rankers are served, and how the system degrades when signals are missing or stale.
Core knowledge
-
Two-stage recommendation architecture is the default pattern: first retrieve hundreds or thousands of candidates cheaply, then score tens to hundreds with a heavier ranker. Retrieval optimizes recall under tight latency; ranking optimizes calibrated utility such as , , or expected long-term value.
-
Candidate generation usually combines multiple sources: collaborative filtering, embedding retrieval, popularity/trending lists, graph-based neighbors, geographic radius filters, and business-rule-safe fallbacks. For nearby places, sources might include “places within 2 km,” “friends visited,” “similar to places user saved,” and “currently popular nearby.”
-
Approximate nearest neighbor search with algorithms like HNSW, IVF, or product quantization enables embedding retrieval at large scale. Exact nearest-neighbor search is fine for up to roughly millions of vectors in controlled settings, but at hundreds of millions or billions, systems like
`FAISS`,`ScaNN`, or`Annoy`are used to trade a small recall loss for major latency savings. -
Ranking models should match serving constraints.
`XGBoost`/GBDT-style models are strong baselines for tabular ranking; neural rankers can model richer user-item interactions but cost more at serving time. A common production pattern is: lightweight first-pass ranker, heavier second-pass ranker, then rule-based re-ranking for diversity, integrity, freshness, or notification fatigue. -
Feature stores solve consistency between training and serving. The offline store supports point-in-time training examples; the online store serves low-latency feature lookups for inference. The MLE responsibility is not pipeline plumbing, but defining feature semantics, freshness requirements, defaults, transformations, and ensuring online values match the offline training view.
-
Point-in-time correctness prevents leakage. A training row for user , item , and timestamp must only use features known before . A common bug is using “total clicks in the next 7 days” or a post-impression aggregate as a feature, which inflates offline
`AUC`and collapses online. -
Feature freshness depends on signal type. Static features like category, language, and long-term user embeddings can refresh daily; session context, location, notification history, and recent interactions may need minute-level or request-time computation. Interviewers expect you to explicitly separate batch, near-real-time, and request-time features.
-
Online inference latency budget should be decomposed. For example, candidate generation might get 30–50 ms, feature lookup 20 ms, ranking 20–40 ms, and re-ranking/filtering 5–10 ms under a
`p99`target. If the model is too slow, use caching, precomputed embeddings, model distillation, feature pruning, or staged ranking. -
Objective design often requires multi-task learning or weighted utility. A notification ranker may combine open probability, downstream engagement, hide/report probability, and fatigue cost:
The exact weights are usually tuned through offline evaluation and online experiments. -
Evaluation should include both offline and online layers. Offline metrics include
`AUC`,`log loss`,`NDCG@K`,`Recall@K`, calibration plots, and slice metrics by geography/device/new users. Online metrics include click-through rate, saves, hides, notification disables, session depth, latency, and guardrails like crashes or integrity violations. -
Cold start needs explicit handling. New users rely more on context, location, demographics if allowed, onboarding choices, and global popularity. New places/items rely on metadata, category, geospatial priors, creator quality, and exploration buckets rather than historical engagement.
-
Monitoring should cover model and data behavior. Track feature missingness, value distributions, embedding norms, candidate-source mix, score calibration, serving latency, fallback rate, and business guardrails. Feature drift and label drift are especially important when location patterns, seasonality, or notification policies change.
Worked example
For Design Nearby and Notification Ranking, a strong candidate would start by clarifying whether “nearby” means physical location recommendations, people nearby, or local events, and whether notifications are push notifications, in-app notifications, or both. They would ask about the primary objective: opens, meaningful actions after open, long-term engagement, or minimizing notification fatigue. Then they would declare assumptions: user has location permission, ranking must run under a tight `p99` latency budget, and negative feedback such as hides or notification disables must be modeled.
The answer skeleton should have four pillars: candidate generation, feature design, ranking/serving, and evaluation/monitoring. Candidate generation would combine geospatial filters, user-place embeddings, social graph signals, and popularity/freshness candidates. Features would include distance, time of day, historical engagement with similar places, recent notification count, friend activity, place quality, and contextual availability. Ranking could use a multi-task model predicting open probability, downstream engagement, and negative feedback, with a final utility score penalizing fatigue and low-quality notifications.
One explicit tradeoff to flag is precompute versus request-time personalization: precomputing nearby candidates improves latency, but request-time location and freshness improve relevance. A good compromise is to precompute stable user/place embeddings and retrieve candidates online using current location and context. Close by saying: if there were more time, you would discuss exploration for underexposed places, privacy constraints around location, and how to run a staged rollout with calibration and fatigue guardrails.
A second angle
For Design nearby place recommendations, the same concepts apply, but the emphasis shifts from notification interruption cost to local relevance and geospatial retrieval. Candidate generation becomes more constrained by distance, open hours, category, safety filters, and place availability, while ranking can optimize saves, directions, check-ins, or profile views. Real-time context matters more: current location, day of week, weather-like context if available, and whether the user is traveling versus near home. The feature store discussion should emphasize point-in-time historical place engagement and request-time features like distance and open status. Unlike notification ranking, the system may tolerate slightly more latency in an in-app surface, but it needs stronger diversity so the top results are not ten identical restaurants from the same chain.
Common pitfalls
Pitfall: Treating ranking as a single model over all possible items.
A tempting but weak answer is “score every place or notification with a neural network and sort.” At Meta scale, the candidate space is too large, so you need retrieval layers, approximate search, source blending, filtering, and then ranking. Say explicitly how you reduce billions of possible items to thousands, then hundreds, then the final top `K`.
Pitfall: Ignoring online/offline feature mismatch.
Many candidates describe rich training features but never explain whether those values are available at serving time. A better answer names feature freshness, default values, point-in-time joins, and monitoring for missingness or drift. If a feature cannot be computed before the ranking request, it should not be in the online model.
Pitfall: Optimizing only click-through rate.
For recommendations and notifications, higher `CTR` can harm users if it increases spam, fatigue, hides, or low-quality engagement. A stronger answer frames ranking as utility optimization with guardrails: engagement, negative feedback, calibration, latency, fairness/slice performance, and long-term retention proxies where appropriate.
Connections
Interviewers may pivot from this topic into embedding-based retrieval, learning-to-rank, real-time feature serving, A/B testing for recommender systems, or model monitoring and drift detection. They may also ask how you would debug a drop in `NDCG@K`, a spike in feature missingness, or a mismatch between offline `AUC` gains and flat online metrics.
Further reading
-
Facebook DLRM: Deep Learning Recommendation Model for Personalization and Recommendation Systems — Meta-authored architecture paper for large-scale recommendation modeling.
-
FAISS: A Library for Efficient Similarity Search and Clustering of Dense Vectors — practical background on vector retrieval at scale.
-
Hidden Technical Debt in Machine Learning Systems — classic paper on production ML failure modes, including feature dependencies and monitoring.
Practice questions

What's being tested
Interviewers are probing whether you can design evaluation and experimentation loops for ML systems whose success cannot be judged by offline accuracy alone. For recommendation, nearby ranking, notification ranking, and multimodal generation, Meta cares about whether an MLE can connect training objectives, offline metrics, online A/B metrics, guardrails, and monitoring into one coherent launch process. You are expected to know how to choose metrics that match user value, detect offline/online mismatch, reason about bias in logged data, and safely iterate models in production. The strongest answers do not just say “run an A/B test”; they explain what is measured before launch, during ramp, and after deployment drift begins.
Core knowledge
-
Offline evaluation is necessary but insufficient. For ranking systems, common metrics include
`AUC`,`log loss`,`NDCG@K`,`MAP@K`,`MRR`,`Recall@K`, and calibration error.`AUC`measures pairwise ordering globally, while`NDCG@K`emphasizes top-ranked items, which is usually closer to feed, notification, and place recommendation behavior. -
Ranking metrics should match the product surface. For a notification ranker,
`precision@1`, expected click/open probability, hide/report rate, and send suppression quality matter more than long-list`Recall@100`. For nearby places,`NDCG@K`may need distance-aware gains such as relevance discounted by travel distance or freshness. -
Calibration matters when scores drive thresholds, auctions, notification sends, or multi-objective ranking. A model with good ranking may still be poorly calibrated. Check bins of predicted probability versus empirical rate, use expected calibration error:
and consider`Platt scaling`, isotonic regression, or temperature scaling when appropriate. -
Counterfactual bias appears because training labels come from previously ranked or filtered items. If the old system never showed a place, you do not observe whether the user would have clicked it. Common mitigations include randomized exploration buckets, inverse propensity scoring, doubly robust estimators, and careful comparison against the serving policy that generated the logs.
-
Online experimentation should define one or two primary metrics, several secondary metrics, and hard guardrails. For recommenders, primary metrics might be meaningful engagement or long-term retention; guardrails often include
`p95`/`p99`latency, notification opt-outs, hides, reports, integrity violations, battery/network usage, and diversity or fairness constraints. -
A/B test power depends on variance, baseline rate, minimum detectable effect, and traffic. A simplified sample size intuition is:
Small lifts in rare events like place visits or notification disables may require large samples or proxy metrics, but proxies must be validated. -
Sequential testing and peeking can inflate false positives. In production ramps, metrics are often checked daily, but launch decisions should respect preplanned analysis windows, alpha spending, or sequential methods. A candidate should say “I would monitor guardrails continuously, but judge success on a prespecified window.”
-
Offline/online parity is an MLE responsibility. Training-time feature definitions, timestamp joins, normalization, embedding versions, missing-value handling, and model preprocessing should match serving. A small mismatch can produce a model that looks strong offline but regresses online, especially with real-time context like location, time of day, or session intent.
-
Slice-based evaluation catches average-metric failures. Break down quality by geography, language, device class, new versus mature users, cold-start items, low-connectivity regions, high-activity users, and sensitive integrity segments. A recommender can improve global
`NDCG@10`while harming new users or causing over-notification for a small cohort. -
Multi-objective optimization is common at Meta scale. Ranking often combines predicted click, dwell, conversion, hide, report, freshness, distance, and diversity:
The key is to explain how weights are tuned offline, validated online, and constrained by guardrails. -
Generative model evaluation needs both automatic and human/safety metrics. For image or multimodal generation, use prompt adherence, aesthetic quality, diversity, toxicity, policy violation rate, memorization checks, latency, and cost per generation. Automatic metrics like
`FID`,`CLIPScore`, or embedding similarity are useful but can be gamed and should not replace human eval. -
Post-launch monitoring should watch data drift, prediction drift, label drift, calibration drift, serving latency, feature freshness, and business guardrails. Use population stability index, embedding distribution shifts, calibration-by-time, and alerting on large deviations. Retraining should be triggered by measured degradation, not just a fixed schedule.
Worked example
For Design Nearby and Notification Ranking, a strong candidate would start by clarifying the surface: “Are we ranking nearby friend/activity suggestions, place notifications, or general push notifications, and is the objective opens, downstream engagement, or long-term notification health?” They would declare assumptions: candidates are generated elsewhere, the ranker scores a few hundred items per user request, and the system must respect latency and notification fatigue. The answer can be organized around four pillars: offline evaluation, online experiment design, guardrails, and monitoring.
For offline evaluation, they would propose `NDCG@K` or `precision@K` for top-ranked relevance, `log loss` and calibration for send thresholds, and slice metrics by geography, user activity level, and cold-start locations. For online testing, they would define a primary metric such as qualified notification opens or downstream sessions, with guardrails like opt-out rate, hides, reports, duplicate notifications, and `p99` ranking latency. A specific tradeoff to flag is exploration versus user trust: adding randomized exploration improves counterfactual learning, but too much random notification traffic can increase disables, so exploration should be small, controlled, and excluded or propensity-corrected in evaluation. They should also call out offline/online mismatch: location freshness, time-of-day features, and notification cooldown features must be computed consistently in training and serving. A good close would be: “If I had more time, I would add long-term holdouts to measure notification fatigue, plus calibration monitoring so the send threshold remains stable after retraining.”
A second angle
For Design image and multimodal generation systems, the same evaluation discipline applies, but the metrics shift from ranking relevance to generation quality, safety, and cost. Offline metrics like `FID`, `CLIPScore`, prompt adherence classifiers, and human preference ratings help compare model checkpoints, but they do not fully predict user satisfaction or policy risk. Online experimentation would measure generation completion rate, regeneration rate, explicit negative feedback, share/save rate, safety violation rate, latency, and GPU cost per successful generation. The harder constraint is that rare but severe failures matter: a model with better average aesthetic quality may still be unlaunchable if policy violation rate or memorization risk increases. A strong MLE frames evaluation as a layered gate: automated eval, red-team/safety eval, shadow or limited rollout, then controlled A/B testing with strict guardrails.
Common pitfalls
Pitfall: Optimizing only for
`CTR`or opens.
For ranking and notification systems, `CTR` is tempting because it is dense and easy to measure, but it can reward clickbait, over-notification, or low-quality engagement. A better answer pairs an engagement metric with negative feedback, long-term retention, integrity metrics, and calibration checks.
Pitfall: Treating offline metric lift as proof of launch readiness.
An offline `NDCG@10` gain can disappear online because logs are biased by the old ranker, features differ between training and serving, or the new model changes the candidate distribution. Strong candidates explicitly discuss counterfactual bias, shadow evaluation, ramped A/B tests, and slice-level regressions.
Pitfall: Communicating a metric list without a decision framework.
Interviewers do not want a catalog of every metric you know. They want to hear which metric is primary, which metrics are guardrails, what launch threshold you would use, and what you would do if metrics conflict—for example, engagement rises but opt-outs also rise.
Connections
This topic often leads into feature store design, online serving latency, model monitoring, retraining pipelines, and candidate generation versus ranking tradeoffs. Interviewers may also pivot to multi-task learning, counterfactual learning-to-rank, or safe rollout mechanisms such as shadow mode, canaries, and gradual traffic ramps.
Further reading
-
Trustworthy Online Controlled Experiments — practical reference for A/B testing, guardrails, variance, and launch decision-making.
-
Hidden Technical Debt in Machine Learning Systems — explains why monitoring, data dependencies, and evaluation debt matter in production ML.
-
Counterfactual Learning and Evaluation for Recommender Systems — useful background on logged-bandit feedback, propensity weighting, and biased recommendation logs.
Practice questions

What's being tested
Interviewers are probing whether you can design large-scale generative model training systems that are correct, scalable, debuggable, and safe to deploy. For a Machine Learning Engineer, the emphasis is not just knowing model families, but explaining how data, distributed training, attention kernels, evaluation, checkpointing, and post-training fit into a production-grade pipeline. Meta cares because generative AI systems stress every part of ML infrastructure: GPU utilization, memory, communication, model quality, safety, latency, and continuous improvement loops. Strong answers show you can reason from first principles while naming concrete tools such as `PyTorch`, `FSDP`, `Megatron-LM`, `DeepSpeed`, `NCCL`, and online evaluation infrastructure.
Core knowledge
-
Scaled dot-product attention maps queries, keys, and values via The scaling prevents logits from growing with dimension; the mask handles causal or padding constraints.
-
Attention implementation details matter because numerical instability often appears before modeling mistakes. Use a stable softmax by subtracting row-wise max logits, apply masks before softmax, avoid attending to padded tokens, and verify tensor shapes:
`Q: [B,H,T,D]`,`K: [B,H,S,D]`,`V: [B,H,S,D]`. -
Memory-efficient attention is critical for long contexts. Vanilla attention stores a matrix, giving memory and compute.
`FlashAttention`reduces memory traffic by tiling attention computation and recomputing softmax statistics, often improving GPU utilization without changing outputs. -
Distributed pretraining combines several parallelism dimensions. Data parallelism replicates models across workers, tensor parallelism shards matrix multiplications, pipeline parallelism splits layers, and sequence parallelism shards activations across sequence length. Large models usually require a hybrid plan rather than one strategy.
-
Fully Sharded Data Parallel systems such as
`FSDP`and`ZeRO`shard parameters, gradients, and optimizer states. This matters because`AdamW`stores multiple tensors per parameter; memory can be roughly 12–16 bytes per parameter in mixed precision once weights, gradients, and optimizer states are included. -
Mixture-of-Experts models activate only a subset of experts per token, commonly top-1 or top-2 routing. This increases parameter count without proportional FLOPs but introduces routing imbalance, all-to-all communication, expert capacity limits, and token dropping if capacity is exceeded.
-
MoE load balancing requires explicit losses and monitoring. A common auxiliary objective penalizes skew between router probabilities and actual expert assignment. Track expert utilization entropy, dropped-token rate, routing collapse, per-expert gradient norms, and all-to-all communication time.
-
Multimodal generation usually requires modality-specific encoders or tokenizers plus a shared generative backbone. Image generation may use diffusion models, autoregressive image tokens, or latent diffusion; multimodal chat often uses a vision encoder like
`ViT`projected into an LLM embedding space. -
Training objectives differ by modality and phase. LLM pretraining uses next-token prediction with cross-entropy; diffusion uses denoising score matching or noise prediction; contrastive image-text pretraining uses objectives like
`CLIP`; post-training may use supervised fine-tuning, preference optimization, or reinforcement learning. -
Post-training pipelines separate model serving from learner updates. In RLHF-style systems, actors generate completions, reward models or preference models score them, a learner updates the policy, and evaluation gates decide promotion. Asynchrony improves throughput but increases policy lag and reproducibility complexity.
-
Evaluation gates should include offline and online checks. For generative systems, use held-out perplexity or loss, task benchmarks, human preference win rate, safety violation rate, toxicity classifiers, hallucination probes, latency, memory footprint, and regression tests on known failure prompts.
-
Production readiness depends on observability and rollback. Track training loss spikes, gradient norm, NaN rate, GPU utilization, checkpoint restore success, data mixture proportions, router health for MoE, reward-model drift, and serving metrics such as
`p50`,`p95`,`p99`latency and tokens/sec.
Worked example
For Design a scalable MoE pretraining pipeline, a strong candidate first clarifies the target: model size, token budget, context length, available GPU cluster, expected training duration, and whether the priority is lowest cost, fastest time-to-train, or best quality. Then they state assumptions, for example: “I’ll design for a trillion-token-scale decoder-only transformer with top-2 MoE layers, trained on a multi-node `GPU` cluster using mixed precision.” The answer can be organized around four pillars: data and batching, model architecture and routing, distributed training strategy, and reliability/evaluation.
In the architecture pillar, explain where MoE layers sit, how the router chooses experts, and how capacity factor controls the tradeoff between utilization and dropped tokens. In distributed training, describe a hybrid of data parallelism, tensor parallelism, expert parallelism, and `FSDP`/`ZeRO` optimizer sharding, with `NCCL` all-reduce and all-to-all communication as the main bottlenecks. In reliability, include checkpointing of model, optimizer, router state, RNG state, and data iterator state so training can resume deterministically enough after preemption.
A specific tradeoff to flag is top-1 versus top-2 routing: top-1 is cheaper and simpler, while top-2 often improves quality and gradient flow but doubles expert communication and can worsen all-to-all congestion. A good close is: “If I had more time, I’d go deeper on kernel-level optimization, exact cluster topology-aware placement, and the evaluation suite used to decide whether the MoE model beats a dense baseline at equal training FLOPs.”
A second angle
For Architect an asynchronous RL post-training system, the same training-systems concepts apply, but the bottleneck shifts from pure throughput to coordination between generation, scoring, learning, and evaluation. Instead of static pretraining data, the system continuously produces new trajectories from a changing policy, so versioning is central: every sample should carry policy version, reward model version, prompt source, decoding parameters, and safety filters applied. Asynchrony improves hardware utilization because actors can keep generating while learners update, but it creates policy lag, where the learner trains on samples from stale policies. A strong answer discusses bounded staleness, replay buffers, KL penalties to a reference model, canary evaluation, and rollback if reward hacking or safety regressions appear. The framing is less about expert routing and more about controlling feedback loops without letting the system optimize the wrong reward.
Common pitfalls
Pitfall: Treating model architecture as the whole system.
A tempting weak answer lists “use a transformer, add attention, train on GPUs” and stops there. A better answer covers the operational pieces that make the model trainable: sharding, checkpointing, data mixture validation, memory pressure, communication bottlenecks, evaluation gates, and rollback.
Pitfall: Ignoring numerical and shape correctness in attention.
For attention implementation, many candidates know the formula but miss mask semantics or stable softmax. Interviewers often look for details like applying the causal mask before softmax, using or a large negative value correctly under mixed precision, and ensuring padding tokens receive zero probability.
Pitfall: Overclaiming evaluation quality.
Generative systems cannot be validated with one metric. Saying “we’ll check loss” is insufficient because lower loss can coexist with worse instruction following, unsafe outputs, or reward hacking. A stronger answer separates pretraining metrics, task benchmarks, human preference evaluation, safety classifiers, and production latency/cost metrics.
Connections
Interviewers may pivot from here into model serving, especially batching, KV-cache management, quantization, and latency/cost tradeoffs. They may also ask about feature/data quality for ML pipelines, online/offline evaluation parity, distributed systems debugging, or model safety and red-teaming for generative outputs.
Further reading
-
Attention Is All You Need — original transformer paper defining scaled dot-product attention and multi-head attention.
-
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity — practical reference for sparse MoE routing, capacity, and load balancing.
-
Training language models to follow instructions with human feedback — foundational RLHF system design, including supervised fine-tuning, reward modeling, and policy optimization.
Practice questions
Coding & Algorithms

What's being tested
These prompts test linear-time array processing, two-pointer in-place mutation, matrix indexing, BFS pathfinding, and streaming window aggregates. For an MLE, the interviewer is probing whether you can write production-safe primitives for feature streams, batched tensors, ranking lists, and grid/state traversal with clear O(n) / O(mn) complexity.
Patterns & templates
-
Single-pass extrema tracking — for
maxProfit(prices), maintainmin_so_farandbest;O(n)time,O(1)space. -
Reverse two-pointer merge — for
merge(nums1, m, nums2, n), fill from the end to avoid overwriting; handle exhausted-array tails. -
Matrix diagonal traversal — compare
matrix[i][i]or anti-diagonalmatrix[i][cols-1-i]; validate rectangular shape and dimensions first. -
Bounds checking helper — centralize
0 <= r < rows and 0 <= c < cols; prevents repeated off-by-one bugs in grid logic. -
BFS for shortest path — use
deque,visited, andparentmap for path reconstruction;O(V+E)over grid cells. -
Sliding moving average — maintain
dequeplus runningsum; update inO(1)per event, avoiding recomputation over the full window. -
Sliding window order statistic — for median/min/max, use heaps, monotonic queues, or balanced trees; average alone only needs sum.
Common pitfalls
Pitfall: Treating empty arrays, single-element arrays, or all-decreasing stock prices as afterthoughts; define return values before coding.
Pitfall: Merging sorted arrays from the front in-place, which overwrites unprocessed values in
nums1.
Pitfall: Using DFS when the prompt asks for shortest path; BFS is the default for unweighted grid distances.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions

What's being tested
Heap-based selection and sorted-structure traversal: finding top-k, kth, or median-like elements without fully sorting or materializing all candidates. Interviewers probe whether you can exploit sorted inputs, bound memory to O(k), and reason clearly about duplicates, negative values, and boundary cases.
Patterns & templates
-
Max-heap for k smallest — scan
nvalues, keep heap sizek; use negated values in Python;O(n log k)time,O(k)space. -
Min-heap k-way merge — push one head per sorted list/row, repeatedly
heappop; complexityO(k log m)formsources. -
Best-first search over pair sums — start at
(0,0), push neighbors(i+1,j)and(i,j+1); usevisitedto avoid duplicates;O(k log k). -
Sorted matrix selection — either min-heap over rows,
O(k log n), or value-space binary search with count<= mid,O(n log range). -
Quickselect — average
O(n)for kth element when input is unsorted; handle equal pivots with 3-way partitioning. -
Median of two sorted arrays — binary search partition, not heap; target left size
(m+n+1)//2, checkmaxLeft <= minRight. -
Tie and duplicate handling — duplicate values may be valid outputs, but duplicate heap states like
(i,j)should usually be suppressed.
Common pitfalls
Pitfall: Generating all pair sums or flattening a matrix is usually
O(nm log nm)orO(n^2 log n)and misses the intended selection pattern.
Pitfall: Confusing “k smallest elements” with “kth smallest element”; one returns a collection, the other returns a single rank statistic.
Pitfall: Forgetting
k == 0,k > n, empty arrays, duplicate values, and negative numbers leads to brittle code even if the core heap idea is correct.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions

What's being tested
Binary search on answer: recognizing when the value you need is not directly searchable in an array, but a monotonic predicate over possible answers is. Interviewers are probing whether you can define bounds, implement can(x) correctly, prove monotonicity, and handle edge cases without off-by-one bugs.
Patterns & templates
-
Maximize feasible value — use
while lo <= hi, movelo = mid + 1whencan(mid)is true; returnhi. -
Minimize feasible value — move
hi = mid - 1whencan(mid)is true; returnloas the smallest valid answer. -
Capacity partitioning —
can(capacity)scans once, counting days/partitions;O(n log S)time,O(1)space, preserve input order. -
Piece-count feasibility —
sum(length // x) >= kis monotonic decreasing asxgrows; avoid testingx = 0. -
Value-space matrix search — count elements
<= midinO(n)from bottom-left/top-right; totalO(n log range)time. -
Integer-safe midpoint — compute
mid = lo + (hi - lo) // 2; in Python overflow is irrelevant, but still communicate the habit.
Common pitfalls
Pitfall: Confusing searching indices with searching answer values; the answer may not appear as an array element.
Pitfall: Using the wrong return value after convergence; for “minimum feasible,” return
lo, for “maximum feasible,” usually returnhi.
Pitfall: Weak bounds cause bugs: shipping lower bound is
max(weights), ribbon lower bound is1, matrix bounds are min/max values.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions

What's being tested
Graph traversal fluency: modeling matrices, grids, nested structures, and dependency lists as nodes plus edges, then choosing BFS for shortest paths or DFS for exhaustive exploration. Interviewers are probing correctness under edge cases, clean implementation, and complexity reasoning: for graphs, for grids.
Patterns & templates
-
Grid DFS component sizing — iterate every cell, launch
dfs(r,c)on unvisited valid cells, count region size;O(RC)time,O(RC)worst stack. -
Iterative DFS with
stack— safer than recursion for large matrices; mark visited when pushing, not popping, to avoid duplicates. -
BFS shortest path — use
deque, level distance, and optionalparent[(r,c)]map for reconstruction; handles obstacles and unweighted moves. -
Cycle detection in directed graphs — use
WHITE/GRAY/BLACKstates indfs(node); encounteringGRAYmeans cycle, postorder gives topological order. -
Adjacency construction — for dependency/order problems, build
dict[node] -> list[neighbors]; include isolated nodes so output size is correct. -
Bounds helper — centralize
in_bounds(r,c)andDIRS = [(1,0),(-1,0),(0,1),(0,-1)]; clarify whether diagonals count. -
Nested DFS aggregation — pass
depthinto recursion for depth-weighted sums; watch for empty lists and mixed integer/list nodes.
Common pitfalls
Pitfall: Using DFS for shortest path in an unweighted grid. DFS may find a path, but BFS guarantees minimum distance.
Pitfall: Marking grid cells visited too late. If multiple neighbors enqueue the same cell, runtime and parent reconstruction can break.
Pitfall: Forgetting recursion-depth limits. In
Python, a long chain or huge connected region can exceed the call stack; switch to iterative traversal.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions
What's being tested
Recursive decomposition, backtracking search, and combinatorial pruning over permutations, expression generation, nested structures, and constrained selections. Interviewers want to see clean state management, duplicate avoidance, complexity reasoning, and when to switch from exhaustive search to binary search, BFS, or dynamic pruning.
Patterns & templates
-
Backtracking template —
dfs(path, used)with choose/recurse/unchoose; permutations costO(n! * n)time andO(n)stack space. -
Duplicate-safe permutations — sort first, then skip
nums[i] == nums[i-1] and not used[i-1]; prevents generating equivalent branches. -
Expression insertion DFS — track
index,value,last_operand, andexpr; multiplication needs undoing previous operand viavalue - last + last * cur. -
Subset/combination pruning —
dfs(start, mask)for character uniqueness; use bitmasks forO(1)overlap checks and early reject invalid strings. -
Nested recursion — compute
depthSum(node, depth)by accumulating integers and recursing into lists; stack depth equals maximum nesting depth. -
Constraint search alternative — shipping capacity is binary search on answer using
canShip(capacity)inO(n log(sum(weights))), not backtracking. -
Grid distance fallback — shortest path in an unweighted grid is BFS,
O(mn)time; recursion risks stack overflow and wrong shortest-path ordering.
Common pitfalls
-
Pitfall: Using a
setof full permutations to remove duplicates works but wastes memory; sort-and-skip is the expected interview solution. -
Pitfall: Forgetting multiplication precedence in expression generation gives plausible but incorrect results; carry
last_operandexplicitly. -
Pitfall: Describing exponential complexity vaguely is weak; state output-sensitive bounds like
O(n! * n)orO(4^n)where applicable.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions
What's being tested
These problems test pointer-based traversal, in-place state updates, and edge-case-safe algorithm implementation across trees, linked lists, and numeric routines. For an MLE, the signal is whether you can write production-quality code for core data structures without relying on library shortcuts.
Patterns & templates
-
Fast exponentiation — implement
`pow`(x,n) with exponent halving inO(log n)time; handlen < 0,INT_MIN,x == 0. -
Floyd cycle detection — use
slowandfastpointers; after collision, reset one pointer toheadto find cycle entry inO(n). -
Tree level traversal — use
queueBFS for left-side view; first node per level is answer,O(n)time andO(width)space. -
DFS with depth tracking — pre-order left-first recursion records first value at each depth; simpler code, but stack can hit
O(height). -
In-order successor with parent pointers — if right child exists, return leftmost right subtree; otherwise climb until leaving a left edge.
-
Perfect binary tree neighbor linking — use already-established
nextpointers level by level; achieveO(n)time andO(1)extra space. -
Constant-time board state tracking — maintain row, column, diagonal counters; update each move in
O(1)instead of rescanningn×n.
Common pitfalls
Pitfall: Treating negative exponents as
1 / pow(x, -n)can overflow forINT_MIN; cast tolongbefore negation.
Pitfall: For tree views, confusing “leftmost node per level” with “always follow left children” fails on missing-child and skewed cases.
Pitfall: In linked-list cycle detection, stopping at the meeting point gives only cycle existence, not the cycle entry.
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions
Behavioral & Leadership
What's being tested
Interviewers are probing whether you can exercise ownership over an ML system when the hard part is not only model quality, but risk, ambiguity, and cross-functional execution. For a Machine Learning Engineer at Meta, that means knowing how training data, features, model outputs, logs, and deployment paths can create privacy, compliance, fairness, or operational risks. Strong answers show that you can lead without formal authority: clarify constraints, make tradeoffs explicit, align Legal/Privacy/Product/Infra partners, and still deliver measurable ML impact. The interviewer is also checking whether you can communicate under pressure without hiding uncertainty or blaming other teams.
Core knowledge
-
STAR+L framing is the default behavioral structure: Situation, Task, Action, Result, and Learning. For senior MLE answers, spend less time on background and more on technical judgment, tradeoffs, stakeholder alignment, and measurable outcomes such as latency, model quality, incident reduction, or launch safety.
-
Data minimization means using only features needed for the model objective, not “everything available.” A strong MLE can explain dropping high-risk attributes, aggregating user-level signals, shortening lookback windows, or using embeddings/derived features when raw sensitive fields are unnecessary.
-
Purpose limitation is central to compliance: data collected for one purpose may not be valid for another model or product surface. Before training, confirm the allowed use, retention policy, consent state, and whether downstream predictions change the user experience in a regulated or sensitive way.
-
Privacy-by-design should appear in the ML lifecycle: dataset creation, feature review, training, evaluation, model artifact storage, serving logs, debugging traces, and rollback. Common controls include access reviews, audit logs, deletion propagation, retention limits, encryption, and redaction of sensitive fields from model telemetry.
-
Offline/online parity is both a quality and governance issue. If training features differ from serving features, you can get silent regressions, leakage, or policy violations. Validate feature definitions, freshness, default values, and eligibility filters before launch, especially when using a feature store such as
Feastor an internal equivalent. -
Risk scoring can be communicated simply as . For a model launch, likelihood includes data exposure probability, drift risk, and rollback complexity; impact includes user harm, regulatory exposure, business criticality, and blast radius across surfaces or countries.
-
Progressive delivery is a leadership tool, not only an infra tactic. Use shadow mode, canaries, holdouts, and staged ramps such as 1%, 5%, 25%, 50%, 100% while monitoring
p95/p99latency, error rate, calibration, fairness slices, complaint rate, and guardrail metrics. -
Model evaluation should include more than aggregate
AUC,NDCG, or loss. For compliance-sensitive systems, evaluate cohorts, geographies, age-eligibility buckets when applicable, sparse-data users, deletion-request users, and cold-start cases. Averages can hide harmful behavior on small but important slices. -
Incident ownership means naming the current highest-risk unknown, creating an immediate mitigation, and assigning clear owners. In ML systems this may involve disabling a feature, reverting a model version, freezing training data, widening a guardrail threshold, or switching to a known-safe baseline.
-
Stakeholder management is not “getting approval at the end.” Bring Privacy, Legal, Security, Product, and Infra into the decision early with a concrete artifact: model card, data lineage summary, feature list, launch checklist, risk register, or decision log with open questions and owners.
-
Conflict resolution should be evidence-driven. If a PM wants to launch and Privacy wants more review, translate the disagreement into options: reduce data scope, launch to a smaller population, use a less personalized baseline, delay high-risk features, or add monitoring and rollback gates.
-
Senior-level impact is measured by systems that continue working after you leave. Good examples include creating reusable compliance checklists, feature review templates, model-serving safeguards, automated PII scans, or launch gates that reduce future review time without weakening safety.
Worked example
For “Demonstrate leadership and ensure data compliance,” a strong candidate would frame the first 30 seconds by clarifying the ML system, the data involved, the regulatory or policy constraint, and the launch deadline. They might say: “I’ll describe a ranking model where we discovered late that one candidate feature had ambiguous usage rights; my goal was to protect users and the company while preserving as much model quality as possible.” The answer skeleton should have four pillars: identify the risk, align stakeholders, redesign the ML approach, and operationalize prevention.
In the action section, the candidate should explain how they audited the feature list, checked training-versus-serving usage, and partnered with Privacy/Legal rather than making a unilateral call. They should describe a concrete technical mitigation, such as removing the questionable feature, replacing it with an aggregated non-sensitive proxy, retraining the model, and validating that NDCG or calibration degradation stayed within an acceptable threshold. A useful tradeoff to flag is launch speed versus compliance confidence: “We accepted a 0.7% offline quality drop to unblock launch safely, then planned a follow-up review for a more expressive compliant feature.” The result should include measurable outcomes: model launched on time or after a bounded delay, no policy exception required, reduced review time for future launches, or a reusable governance artifact adopted by other MLEs. Close with learning: “If I had more time, I would have added automated feature-policy checks earlier in the training pipeline so this surfaced before launch week.”
A second angle
For “Describe handling intense time pressure,” the same ownership concept applies, but the interviewer is less focused on privacy details and more focused on prioritization under stress. A strong answer should separate urgent from important: what had to be fixed before launch, what could be mitigated with a guardrail, and what could safely move to a follow-up. In an MLE context, that could mean choosing to rollback to the previous model, disable a new feature family, or ramp only a canary population while debugging drift. The candidate should explicitly mention communication cadence: short status updates, decision owners, risk register, and a clear launch/no-launch criterion. The key transfer is that leadership is demonstrated through structured decisions, not heroics or working longer hours.
Common pitfalls
Pitfall: Treating compliance as someone else’s job.
A weak answer says, “Legal approved it, so we moved forward.” A stronger answer says you involved Legal or Privacy, but as the MLE you still owned the feature inventory, model behavior, logging paths, deletion implications, and launch gates because those details live in the ML implementation.
Pitfall: Over-indexing on model metrics while ignoring risk.
A tempting answer is, “We kept the feature because it improved AUC by 2%.” That fails if the feature has unclear consent, sensitive inference risk, or weak deletion support; better is to quantify both model impact and compliance risk, then present safer alternatives with measured quality tradeoffs.
Pitfall: Giving a personality-based conflict story.
Avoid framing conflict as “the PM was unreasonable” or “the privacy reviewer blocked us.” Strong candidates translate disagreement into constraints and options: what each stakeholder optimized for, what evidence changed the decision, and how you preserved trust while making progress.
Connections
Interviewers may pivot from this topic into ML system design, especially feature stores, model serving, drift monitoring, and rollback strategy. They may also ask about model evaluation, responsible AI, privacy-preserving ML, launch readiness, incident response, or senior-level influence across teams.
Further reading
-
NIST AI Risk Management Framework — useful vocabulary for mapping, measuring, managing, and governing AI risks.
-
Google: The ML Test Score — practical checklist mindset for production ML readiness and technical debt.
-
Model Cards for Model Reporting — foundational paper on documenting intended use, evaluation, limitations, and ethical considerations for models.
Practice questions