Design a Hybrid Evaluation Platform
Company: Waymo
Role: Software Engineer
Category: ML System Design
Difficulty: Medium
Interview Round: Onsite
Design an evaluation platform for model outputs that supports both human evaluation and automated LLM-based evaluation.
The system should:
- ingest prompts, model outputs, references, and metadata
- support multiple evaluation rubrics and score types
- route some tasks to human reviewers and some to LLM judges
- collect scores, rationales, and reviewer metadata (a minimal data-model sketch follows this list)
- measure evaluator agreement and score quality
- aggregate results by model version, task type, and slice
- expose results through dashboards or APIs
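As a concrete starting point for the data model, here is a minimal sketch using Python dataclasses; the entity names (EvalTask, RubricScore, EvaluatorKind) and their fields are illustrative assumptions rather than a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional


class EvaluatorKind(Enum):
    HUMAN = "human"
    LLM_JUDGE = "llm_judge"


@dataclass
class EvalTask:
    """One unit of work: a prompt, a model output, and the rubric to apply."""
    task_id: str
    prompt: str
    model_output: str
    model_version: str
    task_type: str                      # e.g. "summarization", "planning"
    rubric_id: str
    reference: Optional[str] = None     # gold answer, if one exists
    metadata: dict = field(default_factory=dict)  # slice tags, source dataset, etc.


@dataclass
class RubricScore:
    """One evaluator's judgment of one task on one rubric dimension."""
    task_id: str
    rubric_id: str
    dimension: str                      # e.g. "correctness", "helpfulness"
    score: float                        # normalized to a shared scale, e.g. 0-1
    rationale: str
    evaluator_kind: EvaluatorKind
    evaluator_id: str                   # reviewer ID or judge-model version
    created_at: datetime = field(default_factory=datetime.utcnow)
```

Keying scores by (task_id, rubric_id, dimension, evaluator_id) keeps agreement computation and aggregation by model version, task type, and slice straightforward.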
Discuss the architecture, data model, workflow, calibration strategy, reliability checks, and how you would handle disagreement between human and automated evaluators.
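For evaluator agreement and disagreement handling, one concrete option is to compute a chance-corrected agreement statistic on items scored by both a human and the LLM judge, and to escalate individual items whose scores diverge beyond a tolerance. The sketch below is a minimal illustration assuming scores are discretized to a small ordinal scale; the tolerance value and function names are assumptions, and Krippendorff's alpha or rank correlation are common alternatives to Cohen's kappa.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Chance-corrected agreement between two evaluators on the same items."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (freq_a[c] / n) * (freq_b[c] / n) for c in set(freq_a) | set(freq_b)
    )
    if expected == 1.0:  # degenerate case: both evaluators always give the same label
        return 1.0
    return (observed - expected) / (1 - expected)


def needs_escalation(human_score: float, llm_score: float,
                     tolerance: float = 0.25) -> bool:
    """Flag an item whose human and LLM-judge scores diverge beyond a tolerance,
    so it can be re-routed to a second human reviewer for adjudication."""
    return abs(human_score - llm_score) > tolerance
```

Tracking kappa per rubric dimension and per judge-model version over time also serves as a reliability check: a sudden drop can signal rubric drift or a judge regression.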
Quick Answer: This question tests your ability to design a scalable ML evaluation platform, covering architecture, data modeling, human-in-the-loop workflows, LLM-based automated judging, calibration strategy, reliability checks, and how evaluator agreement and score quality are measured.
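For the calibration piece, a common approach is to maintain a set of anchor items that are periodically scored by both humans and the LLM judge, then fit a mapping from judge scores onto the human scale and apply it to new judge scores. The sketch below uses a piecewise-constant per-bin correction purely as an illustration; in practice isotonic regression or a small logistic model over the anchor set are typical choices, and the function names here are assumptions.

```python
import bisect
from statistics import mean


def fit_bin_calibration(judge_scores: list[float], human_scores: list[float],
                        n_bins: int = 5) -> list[tuple[float, float]]:
    """Fit a piecewise-constant map on [0, 1]: for each judge-score bin,
    record the mean human score of the anchor items that fall in it."""
    edges = [i / n_bins for i in range(1, n_bins)]
    buckets: list[list[float]] = [[] for _ in range(n_bins)]
    for j, h in zip(judge_scores, human_scores):
        buckets[bisect.bisect_right(edges, j)].append(h)
    # Fall back to the bin midpoint when no anchor items landed in a bin.
    return [(i / n_bins, mean(b) if b else (i + 0.5) / n_bins)
            for i, b in enumerate(buckets)]


def calibrate(judge_score: float, table: list[tuple[float, float]]) -> float:
    """Map a raw LLM-judge score onto the human scale using the fitted table."""
    lowers = [lo for lo, _ in table]
    return table[bisect.bisect_right(lowers, judge_score) - 1][1]
```

Recomputing this mapping on a schedule, or whenever the judge model version changes, keeps automated scores comparable to human scores across releases.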