PracHub

Design a Hybrid Evaluation Platform

Last updated: Apr 19, 2026

Quick Overview

This question evaluates skills in designing scalable ML evaluation platforms, covering architecture, data modeling, human-in-the-loop workflows, LLM-based automated judging, calibration strategies, reliability checks, and measurement of evaluator agreement and score quality.


Company: Waymo

Role: Software Engineer

Category: ML System Design

Difficulty: medium

Interview Round: Onsite


Feb 6, 2026, 12:00 AM

Design an evaluation platform for model outputs that supports both human evaluation and automated LLM-based evaluation.

The system should:

  • ingest prompts, model outputs, references, and metadata
  • support multiple evaluation rubrics and score types
  • route some tasks to human reviewers and some to LLM judges
  • collect scores, rationales, and reviewer metadata
  • measure evaluator agreement and score quality
  • aggregate results by model version, task type, and slice
  • expose results through dashboards or APIs
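
The requirements above suggest a small core data model plus a routing rule. The following is a minimal sketch under stated assumptions, not a reference design: every name here (`EvalItem`, `Rubric`, `Judgment`, `route`, the `human_fraction` parameter, and the `high_risk` metadata flag) is hypothetical and chosen only for illustration. It assumes every item is scored by an LLM judge, while high-risk items and a stable hash-based sample are additionally sent to human reviewers.

```python
from dataclasses import dataclass, field
from enum import Enum
import hashlib
from typing import Optional


class ScoreType(Enum):
    BINARY = "binary"
    LIKERT = "likert"
    PAIRWISE = "pairwise"


class EvaluatorKind(Enum):
    HUMAN = "human"
    LLM_JUDGE = "llm_judge"


@dataclass
class Rubric:
    rubric_id: str
    version: int              # rubrics are versioned so old scores stay interpretable
    score_type: ScoreType
    instructions: str


@dataclass
class EvalItem:
    item_id: str
    prompt: str
    model_output: str
    model_version: str
    task_type: str
    reference: Optional[str] = None
    metadata: dict = field(default_factory=dict)  # slice tags, risk flags, etc.


@dataclass
class Judgment:
    item_id: str
    rubric_id: str
    evaluator_kind: EvaluatorKind
    evaluator_id: str         # reviewer id, or judge model name plus prompt version
    score: float
    rationale: str


def route(item: EvalItem, human_fraction: float = 0.1) -> list[EvaluatorKind]:
    """Send every item to the LLM judge; add a human review for items
    flagged high_risk (assumed flag) or falling in a stable hash sample."""
    routes = [EvaluatorKind.LLM_JUDGE]
    digest = hashlib.sha256(item.item_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    if item.metadata.get("high_risk") or bucket < human_fraction:
        routes.append(EvaluatorKind.HUMAN)
    return routes
```

Hashing the item id, rather than drawing a random number, keeps the human sample stable across reruns, so the same items form the overlap set used to compare human and LLM-judge scores.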

Discuss the architecture, data model, workflow, calibration strategy, reliability checks, and how you would handle disagreement between human and automated evaluators.
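
One way to make the agreement and disagreement parts of this discussion concrete is sketched below. It is an illustration under assumptions, not a prescribed answer: `cohens_kappa` computes the standard chance-corrected agreement statistic for two raters over categorical labels, and `items_to_escalate` is a hypothetical helper that flags items where the human and LLM-judge scores differ by more than an assumed threshold `max_gap`, so they can be sent for adjudication.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items, in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def items_to_escalate(human_scores, judge_scores, max_gap=1):
    """Return sorted ids of items scored by both sides whose scores
    differ by more than max_gap (an assumed, tunable threshold)."""
    shared = human_scores.keys() & judge_scores.keys()
    return sorted(
        item_id
        for item_id in shared
        if abs(human_scores[item_id] - judge_scores[item_id]) > max_gap
    )
```

For example, `items_to_escalate({"a": 5, "b": 3}, {"a": 2, "b": 3})` returns `["a"]`. Plain kappa treats labels as unordered categories; for ordinal scales such as Likert ratings, a weighted kappa or Krippendorff's alpha may be a better fit.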

