This question evaluates expertise in large-scale ML system design and distributed training for Mixture-of-Experts (MoE) models. It covers model architecture, parallelism and communication strategies, memory and optimizer sharding, data curation and tokenization, monitoring, checkpointing, and fault tolerance. Categorized under ML System Design, it targets practical application and systems-level architectural reasoning rather than purely theoretical concepts. It is commonly asked to assess the ability to balance throughput, stability, and cost when scaling pretraining to hundreds of GPUs, to reason about routing and capacity trade-offs, and to design production-grade pipelines that can extend to larger clusters.

You are designing a pretraining pipeline for a decoder-only, bilingual large language model (LLM) using a Mixture-of-Experts (MoE) architecture. The target is to process approximately 1 trillion tokens on a cluster of 256 NVIDIA A100-80GB GPUs (assume 32 nodes × 8 GPUs/node with NVLink/NVSwitch within nodes and 200–400 Gbps interconnect across nodes).
Make minimal, explicit assumptions as needed. Aim for a production-grade plan that balances quality, throughput, stability, and cost.
Specify the following:
1. MoE model architecture, including expert count, routing, and capacity trade-offs.
2. Parallelism and communication strategy across the 256-GPU cluster.
3. Memory and optimizer-state sharding.
4. Data curation and tokenization for the bilingual corpus.
5. Monitoring, checkpointing, and fault tolerance.
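To ground the scale of the target before designing the plan, a minimal back-of-envelope sketch of the sustained token throughput the cluster must deliver is shown below. The ~30-day wall-clock budget is an assumption for illustration only; it is not part of the prompt, and the numbers change proportionally for a different budget.

```python
# Back-of-envelope throughput check for the stated cluster and token budget.
# Assumption (not given in the prompt): a ~30-day wall-clock training window.

total_tokens = 1e12                      # ~1 trillion training tokens
num_gpus = 256                           # 32 nodes x 8 A100-80GB GPUs
train_days = 30                          # assumed wall-clock budget
train_seconds = train_days * 24 * 3600   # 2,592,000 seconds

cluster_tokens_per_sec = total_tokens / train_seconds
per_gpu_tokens_per_sec = cluster_tokens_per_sec / num_gpus

print(f"Cluster throughput needed: {cluster_tokens_per_sec:,.0f} tokens/s")
print(f"Per-GPU throughput needed: {per_gpu_tokens_per_sec:,.0f} tokens/s")
# Under these assumptions: ~385,800 tokens/s for the cluster,
# i.e. ~1,500 tokens/s sustained per GPU, which any proposed
# parallelism and batching scheme should be checked against.
```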