Design a scalable MoE pretraining pipeline

Q: Design a scalable MoE pretraining pipeline

This question evaluates expertise in large-scale ML system design and distributed training for Mixture-of-Experts (MoE) models, covering competencies in model architecture, parallelism and communication strategies, memory and optimizer sharding, data curation and tokenization, monitoring, checkpointing, and fault tolerance; it is categorized under ML System Design and targets practical application with systems-level architectural reasoning rather than purely theoretical concepts. It is commonly asked to evaluate the ability to balance throughput, stability, and cost when scaling pretraining to hundreds of GPUs, reason about routing and capacity trade-offs, and design production-grade pipelines that can extend to larger clusters.

Q: How do I approach ML System Design interview questions?

ML System Design questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master ml system design interviews.

Question

Design a Large-Scale MoE Pretraining Pipeline (Bilingual LLM, 1T Tokens, 256×A100-80GB)

Context

You are designing a pretraining pipeline for a decoder-only, bilingual large language model (LLM) using a Mixture-of-Experts (MoE) architecture. The target is to process approximately 1 trillion tokens on a cluster of 256 NVIDIA A100-80GB GPUs (assume 32 nodes × 8 GPUs/node with NVLink/NVSwitch within nodes and 200–400 Gbps interconnect across nodes).

Make minimal, explicit assumptions as needed. Aim for a production-grade plan that balances quality, throughput, stability, and cost.

Requirements

Specify the following:

Model architecture
- Number of experts per MoE layer, expert MLP shape, where MoE layers are placed, top-k selection, gating/routing strategy.
Parallelism plan
- Data/tensor/pipeline/expert parallelism, group sizes, and how experts are placed on GPUs.
Communication patterns
- All-to-all for token routing/combine, all-reduce/reduce-scatter for grads, P2P for pipeline; overlap strategies.
Capacity factor and token dropping policy
- How capacity per expert is computed; dropless vs. dropping; overflow handling.
Load balancing and auxiliary losses
- Auxiliary loss form, z-loss/noise, and any stabilization tricks.
Memory and optimizer sharding
- FSDP/ZeRO strategy, precision, activation checkpointing, offload options, and expected memory budget per GPU.
Checkpointing and fault tolerance
- What gets saved, how often, shard format, elastic/restart behavior.
Dataset curation and deduplication
- Sources, bilingual balance, filtering, near-dup detection, temperature sampling.
Tokenization
- Vocabulary type/size, normalization, special tokens, bilingual specifics.
Scheduling and hyperparameters
- LR schedule, batch sizing (micro-batch, grad accumulation), optimizer, regularization.
Monitoring and evaluation
- Online throughput and comm metrics; held-out and downstream evals for both languages.
Failure modes and mitigations
- Router collapse, stragglers, expert overload, OOM, and concrete mitigations.
Scaling to 1,024 GPUs
- How to extend the same design while controlling cost and maintaining stability.

Design a scalable MoE pretraining pipeline

Design a Large-Scale MoE Pretraining Pipeline (Bilingual LLM, 1T Tokens, 256×A100-80GB)

Context

Requirements

Solution

Comments (0)

Design a scalable MoE pretraining pipeline

Overview

Design a Large-Scale MoE Pretraining Pipeline (Bilingual LLM, 1T Tokens, 256×A100-80GB)

Context

Requirements

Solution

Comments (0)