Design a scalable MoE pretraining pipeline

Last updated: Mar 29, 2026

Quick Overview

This question evaluates expertise in large-scale ML system design and distributed training for Mixture-of-Experts (MoE) models. It covers model architecture, parallelism and communication strategies, memory and optimizer sharding, data curation and tokenization, monitoring, checkpointing, and fault tolerance, and it targets practical, systems-level architectural reasoning rather than purely theoretical concepts. Interviewers commonly use it to assess whether a candidate can balance throughput, stability, and cost when scaling pretraining to hundreds of GPUs, reason about routing and capacity trade-offs, and design production-grade pipelines that extend to larger clusters.

Company: Meta

Role: Machine Learning Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Onsite

Design a large-scale Mixture-of-Experts (MoE) pretraining pipeline for a bilingual LLM trained on approximately 1T tokens across 256 A100-80GB GPUs. Specify the model architecture (number of experts, gating/routing strategy, top-k selection), parallelism plan (data/tensor/pipeline/expert), communication patterns (all-to-all), capacity factor and token dropping policy, load-balancing/auxiliary losses, memory and optimizer sharding (e.g., FSDP/ZeRO), checkpointing and fault tolerance, dataset curation and deduplication, tokenization, scheduling and hyperparameters, monitoring (tokens/sec, step time) and evaluation. Discuss key failure modes (router collapse, stragglers, expert overload) and concrete mitigations, and explain how you would scale the same system to 1,024 GPUs while controlling cost.

Design a Large-Scale MoE Pretraining Pipeline (Bilingual LLM, 1T Tokens, 256×A100-80GB)

Context

You are designing a pretraining pipeline for a decoder-only, bilingual large language model (LLM) using a Mixture-of-Experts (MoE) architecture. The target is to process approximately 1 trillion tokens on a cluster of 256 NVIDIA A100-80GB GPUs (assume 32 nodes × 8 GPUs/node with NVLink/NVSwitch within nodes and 200–400 Gbps interconnect across nodes).

Make minimal, explicit assumptions as needed. Aim for a production-grade plan that balances quality, throughput, stability, and cost.
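
Before committing to an architecture, it helps to size the job. The sketch below is a back-of-envelope wall-clock estimate; the active parameter count and the 40% Model FLOPs Utilization (MFU) figure are illustrative assumptions, not givens of the question:

```python
# Rough training-time estimate under stated assumptions (illustrative only).
N_ACTIVE = 12e9          # assumed active parameters per token (top-k experts)
TOKENS = 1e12            # token budget from the question
GPUS = 256
A100_BF16_PEAK = 312e12  # A100 peak dense BF16 throughput, FLOP/s
MFU = 0.40               # assumed achievable utilization for MoE pretraining

flops_total = 6 * N_ACTIVE * TOKENS          # ~6*N FLOPs per token, fwd + bwd
flops_per_sec = GPUS * A100_BF16_PEAK * MFU
days = flops_total / flops_per_sec / 86400
print(f"~{days:.0f} days at {MFU:.0%} MFU")  # ~26 days under these numbers
```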

Requirements

Specify the following:

  1. Model architecture
    • Number of experts per MoE layer, expert MLP shape, where MoE layers are placed, top-k selection, gating/routing strategy (a minimal routing sketch follows this list).
  2. Parallelism plan
    • Data/tensor/pipeline/expert parallelism, group sizes, and how experts are placed on GPUs.
  3. Communication patterns
    • All-to-all for token routing/combine, all-reduce/reduce-scatter for grads, P2P for pipeline; overlap strategies.
  4. Capacity factor and token dropping policy
    • How capacity per expert is computed; dropless vs. dropping; overflow handling (see the capacity sketch after this list).
  5. Load balancing and auxiliary losses
    • Auxiliary loss form, z-loss/noise, and any stabilization tricks (the routing sketch below includes a load-balancing loss).
  6. Memory and optimizer sharding
    • FSDP/ZeRO strategy, precision, activation checkpointing, offload options, and expected memory budget per GPU.
  7. Checkpointing and fault tolerance
    • What gets saved, how often, shard format, elastic/restart behavior.
  8. Dataset curation and deduplication
    • Sources, bilingual balance, filtering, near-dup detection (see the MinHash sketch after this list), temperature sampling.
  9. Tokenization
    • Vocabulary type/size, normalization, special tokens, bilingual specifics.
  10. Scheduling and hyperparameters
    • LR schedule (see the warmup/cosine sketch after this list), batch sizing (micro-batch, grad accumulation), optimizer, regularization.
  11. Monitoring and evaluation
    • Online throughput and comm metrics; held-out and downstream evals for both languages.
  12. Failure modes and mitigations
    • Router collapse, stragglers, expert overload, OOM, and concrete mitigations.
  13. Scaling to 1,024 GPUs
    • How to extend the same design while controlling cost and maintaining stability.
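
To make items 1 and 5 concrete, here is a minimal top-k router with a Switch-Transformer-style load-balancing auxiliary loss. The k=2 default and the alpha coefficient are illustrative assumptions, not a reference implementation:

```python
import torch
import torch.nn.functional as F

def topk_router(logits: torch.Tensor, k: int = 2, aux_alpha: float = 1e-2):
    """Top-k gating with a Switch-style load-balancing auxiliary loss.

    logits: [num_tokens, num_experts] raw router scores.
    Returns (expert indices, combine weights, auxiliary loss).
    Illustrative sketch only; production routers add z-loss, jitter noise,
    and fused dispatch/combine kernels.
    """
    num_tokens, num_experts = logits.shape
    probs = F.softmax(logits, dim=-1)                  # router probabilities
    weights, experts = probs.topk(k, dim=-1)           # top-k expert choice
    weights = weights / weights.sum(-1, keepdim=True)  # renormalized combine weights

    # f_e: fraction of tokens whose top-1 choice is expert e;
    # p_e: mean router probability mass on expert e.
    # aux = alpha * E * sum_e f_e * p_e is minimized by a uniform assignment,
    # which pushes back against router collapse onto a few experts.
    top1 = experts[:, 0]
    f = torch.bincount(top1, minlength=num_experts).float() / num_tokens
    p = probs.mean(dim=0)
    aux_loss = aux_alpha * num_experts * torch.sum(f * p)
    return experts, weights, aux_loss
```

In practice this loss is summed across MoE layers and added to the language-modeling loss with a small coefficient, and per-expert token counts are logged so collapse is caught early.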
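
For item 4, per-expert capacity is typically a fixed slot budget derived from the batch size. The helper below follows the common GShard-style formula; every concrete number is an assumption for illustration:

```python
import math

def expert_capacity(tokens_per_batch: int, num_experts: int,
                    capacity_factor: float = 1.25, k: int = 2) -> int:
    """Slots per expert per batch. With top-k routing each token consumes k
    slots, so capacity scales with k; capacity_factor > 1 leaves headroom
    for routing imbalance. Illustrative sketch only."""
    return math.ceil(capacity_factor * k * tokens_per_batch / num_experts)

# Under assumed numbers (global batch of 4096 sequences x 2048 tokens, 64 experts):
print(expert_capacity(tokens_per_batch=4096 * 2048, num_experts=64))  # 327680
```

With a dropping policy, tokens beyond this budget skip the overloaded expert and flow through the residual connection; a dropless design instead re-routes overflow to the next-ranked expert or uses variable-size expert batches, trading extra all-to-all volume for zero dropped tokens.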
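
For the near-dup detection in item 8, MinHash over shingles is a standard choice. In the sketch below, the hash count, shingle size, and the blake2b hash are all assumptions; the key property is that the fraction of matching signature positions approximates Jaccard similarity:

```python
import hashlib

def minhash_signature(text: str, num_hashes: int = 128, ngram: int = 5) -> list[int]:
    """MinHash over word 5-gram shingles. Documents whose signatures agree on
    a large fraction of positions are near-duplicate candidates, which are
    then verified (e.g., exact Jaccard) or bucketed via LSH at corpus scale.
    Illustrative sketch only."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + ngram]) for i in range(len(words) - ngram + 1)}
    if not shingles:
        return [0] * num_hashes
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in shingles))
    return sig
```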
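
And for the schedule in item 10, linear warmup followed by cosine decay is the usual baseline. Every constant below (peak LR, warmup steps, total steps) is an assumed placeholder to be tuned for the actual model size and batch:

```python
import math

def lr_at(step: int, max_lr: float = 3e-4, min_lr: float = 3e-5,
          warmup: int = 2000, total_steps: int = 500_000) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr. Sketch only."""
    if step < warmup:
        return max_lr * step / warmup                # linear warmup
    t = (step - warmup) / (total_steps - warmup)     # decay progress in [0, 1]
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```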

