Generative AI Training, Attention, And Post-Training

What's being tested

Interviewers are probing whether you can design large-scale generative model training systems that are correct, scalable, debuggable, and safe to deploy. For a Machine Learning Engineer, the emphasis is not just knowing model families, but explaining how data, distributed training, attention kernels, evaluation, checkpointing, and post-training fit into a production-grade pipeline. Meta cares because generative AI systems stress every part of ML infrastructure: GPU utilization, memory, communication, model quality, safety, latency, and continuous improvement loops. Strong answers show you can reason from first principles while naming concrete tools such as `PyTorch`, `FSDP`, `Megatron-LM`, `DeepSpeed`, `NCCL`, and online evaluation infrastructure.

Core knowledge

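Scaled dot-product attention maps queries, keys, and values via $\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V.$ The $1/\sqrt{d_k}$ scaling keeps the variance of the logits roughly independent of $d_k$, so the softmax does not saturate as the head dimension grows; the additive mask $M$ (0 where attention is allowed, $-\infty$ where it is not) handles causal or padding constraints.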
Attention implementation details matter because numerical instability often appears before modeling mistakes. Use a stable softmax by subtracting row-wise max logits, apply masks before softmax, avoid attending to padded tokens, and verify tensor shapes: `Q: [B,H,T,D]`, `K: [B,H,S,D]`, `V: [B,H,S,D]`.
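
The checks above can be sketched for a single head in plain Python. This is a minimal illustration, not a production kernel: the function names are hypothetical, the mask is a boolean "allowed" matrix rather than an additive one, and returning zeros for a fully masked row is one possible convention. A real implementation would use batched `PyTorch` tensors.

```python
import math

def masked_softmax(scores, mask):
    """Row-wise softmax; mask[i][j] is True where attention is allowed."""
    out = []
    for row, allowed in zip(scores, mask):
        kept = [s for s, ok in zip(row, allowed) if ok]
        if not kept:
            # Fully masked row: return zeros instead of dividing 0 by 0.
            out.append([0.0] * len(row))
            continue
        row_max = max(kept)  # subtract the row max so exp() cannot overflow
        exps = [math.exp(s - row_max) if ok else 0.0
                for s, ok in zip(row, allowed)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q, k, v, mask):
    """Single-head attention: q is [T][D], k and v are [S][D], mask is [T][S]."""
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
              for qi in q]
    weights = masked_softmax(scores, mask)  # mask is applied before normalizing
    out = [[sum(w * vj[d] for w, vj in zip(w_row, v))
            for d in range(len(v[0]))]
           for w_row in weights]
    return out, weights

def causal_mask(t):
    """Position i may attend to positions j <= i."""
    return [[j <= i for j in range(t)] for i in range(t)]
```

With a causal mask, the first query can only attend to the first key, and logits near 1000 do not overflow because the row maximum is subtracted first.
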
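Memory-efficient attention is critical for long contexts. Vanilla attention materializes a $T \times T$ score matrix, giving $O(T^2)$ memory and compute. `FlashAttention` keeps compute at $O(T^2)$ but avoids writing that matrix to GPU high-bandwidth memory: it tiles the computation, maintains running softmax statistics (row max and normalizer) across tiles, and recomputes attention blocks in the backward pass instead of storing them. The result is exact attention, up to floating-point rounding, with extra memory that grows linearly in $T$ and often better GPU utilization.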
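
The tiling idea rests on an online softmax: process keys block by block while carrying a running max, a running normalizer, and a running weighted sum. The sketch below shows this for one query row in plain Python and compares it with the naive computation. The function names and `block_size` are illustrative assumptions; it demonstrates the arithmetic only, not the GPU memory behavior of a fused kernel.

```python
import math

def naive_attention_row(q, keys, values):
    """Reference: materialize all scores for one query, then softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e * v[d] for e, v in zip(exps, values)) / z
            for d in range(len(values[0]))]

def tiled_attention_row(q, keys, values, block_size=2):
    """Same result, but only one block of scores exists at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new max; exp(-inf) is 0.0
        # on the first block, so the empty accumulator is simply discarded.
        correction = math.exp(running_max - new_max)
        exps = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(exps)
        acc = [a * correction + sum(e * v[d] for e, v in zip(exps, v_blk))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

The two functions agree up to floating-point rounding for any block size, which is the sense in which tiling leaves outputs unchanged.
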
Distributed pretraining combines several parallelism dimensions. Data parallelism replicates models across workers, tensor parallelism shards matrix multiplications, pipeline parallelism splits layers, and sequence parallelism shards activations across sequence length. Large models usually require a hybrid plan rather than one strategy.
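Fully sharded data-parallel systems such as `FSDP` and `ZeRO` (stage 3) shard parameters, gradients, and optimizer states across data-parallel workers. This matters because `AdamW` stores multiple tensors per parameter; memory can be roughly 12–16 bytes per parameter in mixed precision before activations. The common 16-byte figure comes from 2 bytes of half-precision weights, 2 bytes of gradients, and 12 bytes for fp32 master weights plus the two Adam moments.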
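
The per-parameter arithmetic can be checked with a few lines. This is a back-of-the-envelope sketch: the function names are hypothetical, the byte counts assume half-precision weights and gradients with fp32 master weights and Adam moments, and activations, buffers, and fragmentation are ignored.

```python
def adamw_bytes_per_param(weight_bytes=2, grad_bytes=2,
                          master_bytes=4, moment_bytes=4):
    """Bytes of training state per parameter (two Adam moments)."""
    return weight_bytes + grad_bytes + master_bytes + 2 * moment_bytes

def training_state_gib(num_params, num_shards=1):
    """Per-worker state in GiB when everything is sharded evenly."""
    return num_params * adamw_bytes_per_param() / num_shards / 2**30

if __name__ == "__main__":
    for shards in (1, 8, 64):
        gib = training_state_gib(7_000_000_000, shards)
        print(f"7B params, {shards:>2} shards: {gib:.1f} GiB per worker")
```

Even a 7B-parameter model needs on the order of 100 GiB of state unsharded, which is why full sharding is usually required before activations are even considered.
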
Mixture-of-Experts models activate only a subset of experts per token, commonly top-1 or top-2 routing. This increases parameter count without proportional FLOPs but introduces routing imbalance, all-to-all communication, expert capacity limits, and token dropping if capacity is exceeded.
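
A small sketch makes the capacity mechanism concrete. It assumes a capacity of `ceil(capacity_factor * tokens * top_k / num_experts)` slots per expert and a first-come, first-served fill order; both are common conventions but not the only ones, and the function names are hypothetical.

```python
import math

def expert_capacity(num_tokens, num_experts, top_k, capacity_factor):
    """Slots per expert under the assumed capacity formula."""
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)

def route_with_capacity(router_probs, top_k, capacity_factor):
    """router_probs is [tokens][experts]; returns (assignments, dropped)."""
    num_tokens, num_experts = len(router_probs), len(router_probs[0])
    capacity = expert_capacity(num_tokens, num_experts, top_k, capacity_factor)
    load = [0] * num_experts
    assignments = []  # (token, expert) pairs that fit
    dropped = []      # (token, expert) pairs rejected because the expert was full
    for t, probs in enumerate(router_probs):
        top = sorted(range(num_experts), key=lambda e: probs[e],
                     reverse=True)[:top_k]
        for e in top:
            if load[e] < capacity:
                load[e] += 1
                assignments.append((t, e))
            else:
                dropped.append((t, e))
    return assignments, dropped
```

If four tokens all prefer the same one of two experts with top-1 routing, a capacity factor of 1.0 gives each expert two slots and drops two tokens; raising the factor to 2.0 drops none but reserves more memory and compute per expert.
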
MoE load balancing requires explicit losses and monitoring. A common auxiliary objective, used in Switch Transformer-style routers, multiplies each expert's fraction of routed tokens by its mean router probability and sums over experts, which pushes the router toward uniform load. Track expert utilization entropy, dropped-token rate, routing collapse, per-expert gradient norms, and all-to-all communication time.
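
One widely used form of the auxiliary loss is $N \sum_i f_i P_i$, where $f_i$ is the fraction of tokens whose top-1 expert is $i$ and $P_i$ is the mean router probability for expert $i$. The sketch below computes it alongside utilization entropy; the function names are hypothetical, and real systems compute this on tensors and scale it by a small coefficient.

```python
import math

def expert_fractions(router_probs):
    """Fraction of tokens whose top-1 expert is each expert."""
    num_tokens, num_experts = len(router_probs), len(router_probs[0])
    counts = [0] * num_experts
    for probs in router_probs:
        counts[max(range(num_experts), key=lambda e: probs[e])] += 1
    return [c / num_tokens for c in counts]

def load_balance_loss(router_probs):
    """N * sum_i f_i * P_i; equals 1.0 when routing is perfectly uniform."""
    num_tokens, num_experts = len(router_probs), len(router_probs[0])
    frac = expert_fractions(router_probs)
    mean_prob = [sum(p[e] for p in router_probs) / num_tokens
                 for e in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))

def utilization_entropy(frac):
    """Entropy in nats; log(num_experts) is the balanced maximum."""
    return -sum(f * math.log(f) for f in frac if f > 0)
```

A balanced router gives a loss of 1.0 and entropy of $\log N$; a collapsed router that sends everything to one expert gives a larger loss and zero entropy, which is the signature to alert on.
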
Multimodal generation usually requires modality-specific encoders or tokenizers plus a shared generative backbone. Image generation may use pixel-space or latent diffusion models, or autoregressive models over discrete image tokens; multimodal chat often uses a vision encoder like `ViT` whose outputs are projected into an LLM embedding space.
Training objectives differ by modality and phase. LLM pretraining uses next-token prediction with cross-entropy; diffusion uses denoising score matching or noise prediction; contrastive image-text pretraining, as in `CLIP`, uses a symmetric contrastive loss over matched image-text pairs; post-training may use supervised fine-tuning, preference optimization, or reinforcement learning.
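
For reference, the three pretraining objectives named above can be written in their standard textbook forms. The notation is an assumption introduced here: $x_t$ are tokens, $\epsilon_\theta$ is the noise-prediction network with noise schedule $\bar\alpha_t$, and $s_{ij}$ is the similarity between image $i$ and text $j$ in a batch of $N$ pairs with temperature $\tau$.

$$
\begin{aligned}
\mathcal{L}_{\text{LM}} &= -\frac{1}{T}\sum_{t=1}^{T}\log p_\theta(x_t \mid x_{<t}) \\
\mathcal{L}_{\text{noise}} &= \mathbb{E}_{x_0,\epsilon,t}\left[\left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\; t\right)\right\rVert^2\right] \\
\mathcal{L}_{\text{contrastive}} &= -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp(s_{ii}/\tau)}{\sum_{j}\exp(s_{ij}/\tau)} + \log\frac{\exp(s_{ii}/\tau)}{\sum_{j}\exp(s_{ji}/\tau)}\right]
\end{aligned}
$$
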
Post-training pipelines separate generation from learner updates. In RLHF-style systems, actors generate completions from a served copy of the policy, reward models or preference models score them, a learner updates the policy, and evaluation gates decide promotion. Asynchrony improves throughput but increases policy lag and makes runs harder to reproduce.
Evaluation gates should include offline and online checks. For generative systems, use held-out perplexity or loss, task benchmarks, human preference win rate, safety violation rate, toxicity classifiers, hallucination probes, latency, memory footprint, and regression tests on known failure prompts.
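
A promotion gate can be expressed as a comparison against a baseline with per-metric tolerances. The sketch below is illustrative only: the metric names, the `HIGHER_IS_BETTER` set, and the tolerance values are hypothetical, and a real gate would also account for statistical noise in each metric.

```python
# Metrics where a larger value is an improvement; all others are
# treated as "lower is better" (loss, violation rate, latency).
HIGHER_IS_BETTER = {"benchmark_accuracy", "preference_win_rate"}

def gate_failures(candidate, baseline, tolerances):
    """Return the names of metrics that regressed beyond their tolerance."""
    failures = []
    for name, tol in tolerances.items():
        delta = candidate[name] - baseline[name]
        regression = -delta if name in HIGHER_IS_BETTER else delta
        if regression > tol:
            failures.append(name)
    return failures

if __name__ == "__main__":
    baseline = {"heldout_loss": 2.10, "benchmark_accuracy": 0.62,
                "safety_violation_rate": 0.010, "p95_latency_ms": 180.0}
    candidate = {"heldout_loss": 2.05, "benchmark_accuracy": 0.64,
                 "safety_violation_rate": 0.015, "p95_latency_ms": 185.0}
    tolerances = {"heldout_loss": 0.0, "benchmark_accuracy": 0.01,
                  "safety_violation_rate": 0.0, "p95_latency_ms": 10.0}
    print(gate_failures(candidate, baseline, tolerances))
```

In this made-up example the candidate improves loss and accuracy but is blocked because its safety violation rate rose, which is the behavior a gate should have: quality gains do not offset a safety regression.
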
Production readiness depends on observability and rollback. Track training loss spikes, gradient norm, NaN rate, GPU utilization, checkpoint restore success, data mixture proportions, router health for MoE, reward-model drift, and serving metrics such as `p50`, `p95`, `p99` latency and tokens/sec.
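
As one example of such monitoring, a loss-spike and NaN check can compare each new loss with the median of a recent window. This is a minimal sketch with hypothetical names and thresholds (`window`, `spike_ratio`, `min_history`); production systems typically combine it with gradient-norm checks and automatic rollback to the last good checkpoint.

```python
import math
from collections import deque
from statistics import median

class LossMonitor:
    """Flags non-finite losses and spikes relative to a rolling median."""

    def __init__(self, window=50, spike_ratio=2.0, min_history=5):
        self.history = deque(maxlen=window)
        self.spike_ratio = spike_ratio
        self.min_history = min_history

    def check(self, loss):
        if not math.isfinite(loss):
            return "non_finite"
        if (len(self.history) >= self.min_history
                and loss > self.spike_ratio * median(self.history)):
            return "spike"  # spikes are kept out of the baseline window
        self.history.append(loss)
        return "ok"
```
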

Worked example

For the prompt “Design a scalable MoE pretraining pipeline,” a strong candidate first clarifies the target: model size, token budget, context length, available GPU cluster, expected training duration, and whether the priority is lowest cost, fastest time-to-train, or best quality. Then they state assumptions, for example: “I’ll design for a trillion-token-scale decoder-only transformer with top-2 MoE layers, trained on a multi-node GPU cluster using mixed precision.” The answer can be organized around four pillars: data and batching, model architecture and routing, distributed training strategy, and reliability/evaluation.

In the architecture pillar, explain where MoE layers sit, how the router chooses experts, and how capacity factor controls the tradeoff between utilization and dropped tokens. In distributed training, describe a hybrid of data parallelism, tensor parallelism, expert parallelism, and `FSDP`/`ZeRO` optimizer sharding, with `NCCL` all-reduce and all-to-all communication as the main bottlenecks. In reliability, include checkpointing of model, optimizer, router state, RNG state, and data iterator state so training can resume deterministically enough after preemption.
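
The resume requirement can be illustrated with a toy loop: if the checkpoint captures the step, weights, data position, and RNG state, resuming reproduces an uninterrupted run. This sketch uses Python's `random` module and a single scalar weight as stand-ins; the function names are hypothetical, and a real MoE checkpoint would also hold optimizer and router state across many ranks.

```python
import random

def initial_state(seed):
    """Everything needed to resume: step, weight, data position, RNG state."""
    return {"step": 0, "weight": 0.0, "data_offset": 0,
            "rng_state": random.Random(seed).getstate()}

def train(state, data, num_steps):
    """Toy loop: consume one example per step and add RNG-driven noise."""
    rng = random.Random()
    rng.setstate(state["rng_state"])
    weight = state["weight"]
    offset = state["data_offset"]
    step = state["step"]
    for _ in range(num_steps):
        example = data[offset % len(data)]
        weight += 0.1 * example + rng.gauss(0.0, 0.01)
        offset += 1
        step += 1
    # The returned dict doubles as the checkpoint.
    return {"step": step, "weight": weight, "data_offset": offset,
            "rng_state": rng.getstate()}

if __name__ == "__main__":
    data = [1.0, 2.0, 3.0]
    straight = train(initial_state(0), data, 10)
    checkpoint = train(initial_state(0), data, 4)   # "preempted" at step 4
    resumed = train(checkpoint, data, 6)
    print(resumed == straight)
```

Dropping either the data offset or the RNG state from the checkpoint makes the resumed run diverge from the uninterrupted one, which is why both belong in the checkpoint alongside the model and optimizer.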

A specific tradeoff to flag is top-1 versus top-2 routing: top-1 is cheaper and simpler, while top-2 often improves quality and gradient flow but roughly doubles expert compute and all-to-all traffic per MoE layer, which can worsen congestion. A good close is: “If I had more time, I’d go deeper on kernel-level optimization, exact cluster topology-aware placement, and the evaluation suite used to decide whether the MoE model beats a dense baseline at equal training FLOPs.”
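
The communication side of that tradeoff can be estimated with simple arithmetic. The sketch below is a rough upper bound under stated assumptions: every routed token copy crosses the network once for dispatch and once for combine, no tokens are dropped, and activations are 2 bytes per element. The function name and example sizes are hypothetical.

```python
def all_to_all_bytes_per_layer(tokens, top_k, d_model, bytes_per_elem=2):
    """Dispatch plus combine traffic for one MoE layer's forward pass."""
    one_way = tokens * top_k * d_model * bytes_per_elem
    return 2 * one_way

if __name__ == "__main__":
    tokens, d_model = 8 * 4096, 4096   # illustrative batch and hidden size
    for k in (1, 2):
        gib = all_to_all_bytes_per_layer(tokens, k, d_model) / 2**30
        print(f"top-{k}: {gib:.2f} GiB per layer per forward pass")
```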

A second angle

For the prompt “Architect an asynchronous RL post-training system,” the same training-systems concepts apply, but the bottleneck shifts from pure throughput to coordination between generation, scoring, learning, and evaluation. Instead of static pretraining data, the system continuously produces new trajectories from a changing policy, so versioning is central: every sample should carry policy version, reward model version, prompt source, decoding parameters, and safety filters applied. Asynchrony improves hardware utilization because actors can keep generating while learners update, but it creates policy lag, where the learner trains on samples from stale policies. A strong answer discusses bounded staleness, replay buffers, KL penalties to a reference model, canary evaluation, and rollback if reward hacking or safety regressions appear. The framing is less about expert routing and more about controlling feedback loops without letting the system optimize the wrong reward.
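
Versioned samples, bounded staleness, and the KL penalty can be sketched together. The record fields mirror the metadata listed above; the class and function names, the staleness rule, and the use of a per-sequence log-probability difference as the KL estimate are illustrative assumptions rather than a specific system's design.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Trajectory:
    prompt_source: str
    policy_version: int         # version of the policy that generated it
    reward_model_version: int
    temperature: float          # decoding parameter used by the actor
    safety_filters: tuple       # names of filters applied before scoring
    reward: float               # raw reward-model score
    logprob_policy: float       # sequence log-prob under the sampling policy
    logprob_reference: float    # sequence log-prob under the frozen reference

def fresh_enough(batch, learner_version, max_staleness):
    """Keep samples at most `max_staleness` policy versions behind the learner."""
    return [t for t in batch
            if learner_version - t.policy_version <= max_staleness]

def kl_shaped_reward(t, beta):
    """Raw reward minus beta times a sample-based KL estimate to the reference."""
    return t.reward - beta * (t.logprob_policy - t.logprob_reference)
```

Tightening `max_staleness` trades actor utilization for on-policy data, and raising `beta` trades reward gain for staying close to the reference model; both knobs are worth naming explicitly in an interview answer.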

Common pitfalls

Pitfall: Treating model architecture as the whole system.

A tempting weak answer lists “use a transformer, add attention, train on GPUs” and stops there. A better answer covers the operational pieces that make the model trainable: sharding, checkpointing, data mixture validation, memory pressure, communication bottlenecks, evaluation gates, and rollback.

Pitfall: Ignoring numerical and shape correctness in attention.

For attention implementation, many candidates know the formula but miss mask semantics or stable softmax. Interviewers often look for details like applying the causal mask before softmax, using $-\infty$ or a large negative value correctly under mixed precision, and ensuring padding tokens receive zero probability.

Pitfall: Overclaiming evaluation quality.

Generative systems cannot be validated with one metric. Saying “we’ll check loss” is insufficient because lower loss can coexist with worse instruction following, unsafe outputs, or reward hacking. A stronger answer separates pretraining metrics, task benchmarks, human preference evaluation, safety classifiers, and production latency/cost metrics.

Connections

Interviewers may pivot from here into model serving, especially batching, KV-cache management, quantization, and latency/cost tradeoffs. They may also ask about feature/data quality for ML pipelines, online/offline evaluation parity, distributed systems debugging, or model safety and red-teaming for generative outputs.
