Design and train a PPO pipeline
Company: XPeng
Role: Machine Learning Engineer
Category: ML System Design
Difficulty: hard
Interview Round: Technical Screen
Describe how you trained PPO in your project end-to-end: environment setup (simulation or real), observation and action spaces, reward shaping, rollout collection, horizon, and advantage estimation (e.g., GAE). Specify key hyperparameters (clip range, learning rate/schedule, batch size, epochs, entropy coefficient), normalization (state/reward), parallelization strategy, checkpointing, and early stopping. Explain your evaluation protocol (offline metrics, validation environments, ablations) and how you handled stability, sample efficiency, and possible sim-to-real transfer.
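A strong answer would typically include the advantage-estimation step the question names. The sketch below is a minimal, dependency-free illustration of Generalized Advantage Estimation (GAE) over one collected rollout; the function name, argument layout, and default `gamma`/`lam` values are assumptions for illustration, not details from any particular project.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Compute GAE advantages and returns for one rollout of length T.

    rewards, values, dones: per-step lists collected during the rollout
                            (dones[t] is 1.0 if the episode ended at step t).
    last_value: critic's value estimate for the state after the final step,
                used to bootstrap the last TD residual.
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    # Sweep backwards so each step can reuse the accumulated GAE term.
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        nonterminal = 1.0 - dones[t]
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        # GAE recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    # Returns (value-function targets) are advantages plus the baseline.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

Setting `lam=1.0` recovers plain discounted-return advantages, while small `lam` trades variance for bias, which is exactly the stability/sample-efficiency trade-off the question probes.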
Quick Answer: This question evaluates proficiency in designing and training Proximal Policy Optimization (PPO) pipelines, covering environment interfacing, observation/action design, reward shaping, rollout collection and advantage estimation, hyperparameter selection, normalization, parallelization, evaluation protocols, and sim-to-real considerations.
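At the core of any such pipeline is PPO's clipped surrogate objective, controlled by the clip range the question asks about. The following is a minimal scalar sketch under assumed names (`log_probs_new`, `log_probs_old`, `clip_range`); a real implementation would operate on tensors with autograd.

```python
import math

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_range=0.2):
    """Clipped surrogate loss averaged over a minibatch (to be minimized).

    For each sample, the probability ratio r = pi_new(a|s) / pi_old(a|s)
    is clipped to [1 - clip_range, 1 + clip_range]; taking the minimum of
    the clipped and unclipped terms removes the incentive to move the
    policy far from the one that collected the rollout.
    """
    losses = []
    for lp_new, lp_old, adv in zip(log_probs_new, log_probs_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = max(min(ratio, 1.0 + clip_range), 1.0 - clip_range)
        # Negate because optimizers minimize; PPO maximizes the surrogate.
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)
```

With identical old and new log-probabilities the ratio is 1 and the loss reduces to the negative mean advantage, which is a useful sanity check when debugging a pipeline's first update epoch.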