This question evaluates proficiency in designing large-scale MapReduce-style batch processing systems, covering data schemas and partitioning, parallelization and sharding strategies, network optimization techniques, fault tolerance semantics, handling skew/stragglers, and performance/complexity estimation for ML feature aggregation.

You are designing a batch pipeline, using a MapReduce-style architecture, to aggregate raw event logs into daily user-level features for downstream machine learning. The system must scale to tens of terabytes per day, run reliably, and minimize resource usage.
Login required