
Debug online worse than offline model performance

Last updated: Mar 29, 2026

Quick Overview

This question tests whether you can diagnose a gap between offline and online model performance by reasoning about data distributions, feature computation, serving infrastructure, evaluation methodology, and feedback loops. It exercises skills in debugging model deployments and in operational analytics.

Company: Amazon

Role: Machine Learning Engineer

Category: ML System Design

Difficulty: medium

Interview Round: Onsite

Jan 6, 2026, 12:00 AM
Production ML: online performance worse than offline

You launch an ML model. Offline evaluation (validation/test) looked good, but after deployment the online metrics are significantly worse.

Question

What step-by-step troubleshooting order would you follow to identify and fix the issue? Include what you would check in data, features, serving, evaluation, and feedback loops.
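
One concrete way to start on the data and feature checks named in the question is to compare each feature's distribution in the training set against what the model actually receives in production. The sketch below is only an illustration under stated assumptions: it uses the population stability index (PSI) as the drift measure, which is a choice made here and not something the question prescribes, and the feature names `price` and `clicks_7d` are hypothetical. It assumes numeric features and that serving-time feature values are logged.

```python
import math
from bisect import bisect_right


def quantile_edges(reference, n_bins=10):
    """Interior bin edges taken from quantiles of the reference (training) sample."""
    ordered = sorted(reference)
    edges = []
    for i in range(1, n_bins):
        idx = min(len(ordered) - 1, (i * len(ordered)) // n_bins)
        edges.append(ordered[idx])
    # Drop duplicate edges so heavily tied values do not create empty bins.
    return sorted(set(edges))


def bin_fractions(values, edges, eps=1e-6):
    """Fraction of values per bin, floored at eps to keep the log finite."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    total = len(values)
    return [max(c / total, eps) for c in counts]


def psi(reference, live, n_bins=10):
    """Population stability index between a training sample and a serving sample."""
    edges = quantile_edges(reference, n_bins)
    p = bin_fractions(reference, edges)
    q = bin_fractions(live, edges)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def rank_features_by_drift(train_rows, serve_rows, feature_names):
    """Return (feature, psi) pairs sorted from most to least drifted."""
    scores = []
    for name in feature_names:
        ref = [row[name] for row in train_rows]
        live = [row[name] for row in serve_rows]
        scores.append((name, psi(ref, live)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical data: `clicks_7d` is always 0.0 online, as it would be if a
# feature lookup silently fell back to a default value at serving time.
train = [{"price": float(i % 100), "clicks_7d": float(i % 10)} for i in range(1000)]
serve = [{"price": float(i % 100), "clicks_7d": 0.0} for i in range(1000)]

for name, score in rank_features_by_drift(train, serve, ["price", "clicks_7d"]):
    print(f"{name}: PSI = {score:.3f}")
```

A ranking like this only narrows the search. A high score says that the training and serving distributions differ, not why, so the most drifted features are the ones to trace back through the feature pipeline and the serving path. What counts as a high score is a judgment call that depends on the feature and the bin count.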
