End-to-End PPO Training: Describe Your Pipeline
You are asked to explain, in concrete and reproducible terms, how you trained a policy with Proximal Policy Optimization (PPO) for a real-world machine learning engineering project.
Please cover the following, using specific design choices, numbers, and rationales:
Setup and Interfaces
- Environment setup: simulation vs. real, physics/sensor fidelity, time limits.
- Observation space: what features, dimensions, preprocessing/normalization.
- Action space: discrete vs. continuous, ranges, squashing/scaling (a minimal interface sketch follows this list).
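To illustrate the expected level of specificity, here is a minimal interface sketch, assuming a hypothetical 24-D proprioceptive observation and a 6-D continuous torque action with tanh-squashed outputs rescaled to an actuator limit; all dimensions, limits, and names are illustrative placeholders, not prescribed values:

```python
import numpy as np
from gymnasium import spaces

# Illustrative interface only: dimensions and limits are placeholders.
OBS_DIM, ACT_DIM = 24, 6
observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(OBS_DIM,), dtype=np.float32)
action_space = spaces.Box(low=-1.0, high=1.0, shape=(ACT_DIM,), dtype=np.float32)

TORQUE_LIMIT = 2.5  # N*m, hypothetical actuator limit

def scale_action(a_normalized: np.ndarray) -> np.ndarray:
    """Map a tanh-squashed policy output in [-1, 1] to physical torques."""
    return np.clip(a_normalized, -1.0, 1.0) * TORQUE_LIMIT
```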
Learning Signal and Data Collection
- Reward shaping: components, weights, potential-based shaping if applicable.
- Rollout collection: parallelization, rollout length (horizon), total batch size per update.
- Advantage estimation: GAE or alternatives; formulas and normalization (see the sketch after this list).
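For the advantage-estimation item, a minimal sketch of GAE(gamma, lambda) with per-batch advantage normalization, under the common convention that the value array carries one extra bootstrap entry for the final state; the function name and default coefficients are illustrative:

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """GAE: rewards and dones have length T, values has length T+1 (bootstrap appended)."""
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    last_gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last_gae = delta + gamma * lam * nonterminal * last_gae
        advantages[t] = last_gae
    returns = advantages + values[:-1]          # value targets, computed before normalization
    # Per-batch advantage normalization, commonly applied just before the PPO update.
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return advantages, returns
```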
Optimization and Stability
- Key PPO hyperparameters: clip range, policy/value learning rates and schedules, epochs, minibatch size, entropy and value loss coefficients, gradient clipping, target KL, value clipping (see the sketch after this list).
- Normalization: observation normalization, reward/return normalization or PopArt.
- Parallelization strategy: vectorized vs. distributed, CPU/GPU usage.
- Checkpointing and early stopping: what you save, when, and how; early-stopping criteria.
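As a reference point for the hyperparameter item, a minimal sketch of the clipped PPO objective with typical default coefficients (clip range 0.2, value coefficient 0.5, entropy coefficient 0.01); treat the exact values as illustrative starting points rather than prescribed settings:

```python
import torch

def ppo_loss(new_logp, old_logp, advantages, values, returns, entropy,
             clip_range=0.2, vf_coef=0.5, ent_coef=0.01):
    """Clipped surrogate policy loss + value MSE - entropy bonus (typical defaults)."""
    ratio = torch.exp(new_logp - old_logp)               # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()  # pessimistic (clipped) objective
    value_loss = ((values - returns) ** 2).mean()        # value clipping could be added here
    return policy_loss + vf_coef * value_loss - ent_coef * entropy.mean()
```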
Evaluation and Engineering Considerations
- Evaluation protocol: offline metrics, validation environments/seeds, OOD tests, ablations (see the sketch after this list).
- Handling stability and sample efficiency: common pitfalls and fixes.
- Sim-to-real transfer (if relevant): domain randomization, system ID, safety constraints, fine-tuning.
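For the evaluation-protocol item, a minimal sketch of deterministic evaluation over a fixed seed set; `make_env` and `policy(obs, deterministic=True)` are assumed placeholders for your own environment factory and policy interface, not a specific library API:

```python
import numpy as np

def evaluate(policy, make_env, seeds=(0, 1, 2, 3, 4), episodes_per_seed=10):
    """Deterministic evaluation over fixed seeds; returns mean and std of episode return."""
    returns = []
    for seed in seeds:
        env = make_env()                     # hypothetical factory building a Gymnasium env
        obs, _ = env.reset(seed=seed)        # seed once; later resets continue the RNG stream
        for _ in range(episodes_per_seed):
            done, ep_ret = False, 0.0
            while not done:
                action = policy(obs, deterministic=True)  # mean action, no exploration noise
                obs, reward, terminated, truncated, _ = env.step(action)
                ep_ret += reward
                done = terminated or truncated
            returns.append(ep_ret)
            obs, _ = env.reset()             # start the next episode
        env.close()
    return float(np.mean(returns)), float(np.std(returns))
```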
Be concise but specific; include concrete values (e.g., batch sizes, horizons, clip ranges) and justify trade-offs.