PracHub

Decide standardization, sparse numerics, correlated features

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a candidate's competency in feature engineering and preprocessing for mixed-type tabular data, including handling sparse counts and heavy-tailed monetary features, missingness and zero-inflation, correlated continuous measurements, model-specific scaling needs, and the design of leak-safe pipelines and validation strategies.


Company: Amazon

Role: Data Scientist

Category: Machine Learning

Difficulty: Medium

Interview Round: Technical Screen

You are given a tabular dataset for supervised learning with a target y and the following features:

  • F1: counts, mostly small integers with many zeros
  • F2: monetary amounts in dollars, heavy-tailed
  • F3: binary flag
  • F4 and F5: highly correlated continuous measurements

Tasks:

1) Decide exactly which features need standardization or normalization and why; specify the scaler and whether to fit on train only to avoid leakage.

2) Propose a principled approach for F1 when it has many zeros and missing values: imputation options, zero-inflated modeling, or transformations; justify how you will validate the choice.

3) With F4 and F5 strongly correlated (|r| > 0.9), describe three alternative strategies to select or transform features (e.g., VIF thresholding, L1-penalized model, PCA) and how to choose among them with cross-validation while keeping interpretability.

4) For three model families (linear/logistic with regularization, tree-based ensembles, and k-NN), specify exactly how your preprocessing differs and why scale and correlation matter differently.

5) Provide a leak-safe sklearn-style pipeline and cross-validation plan that evaluates these choices, including metrics, stratification, and how you would compare pipelines statistically.
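
As a starting point for the leak-safe pipeline the last task asks for, here is a minimal sketch rather than a reference answer. It rests on several assumptions that are not in the question: y is binary (the mention of stratification suggests classification), the data is synthetic, ROC AUC is the metric, and the specific choices (median imputation plus a missing indicator and log1p for F1, log1p for F2, L2-regularized logistic regression) are illustrative only. The helper name `build_pipeline` is hypothetical. Because every imputer and scaler lives inside the `Pipeline`, each one is re-fit on the training folds only during cross-validation.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic stand-in data (assumption: only the column names come from the question).
rng = np.random.default_rng(0)
n = 400
f4 = rng.normal(size=n)
X = pd.DataFrame({
    "F1": rng.poisson(0.5, size=n).astype(float),       # zero-heavy counts
    "F2": rng.lognormal(mean=3.0, sigma=1.5, size=n),   # heavy-tailed dollars
    "F3": rng.integers(0, 2, size=n),                   # binary flag
    "F4": f4,
    "F5": f4 + rng.normal(scale=0.1, size=n),           # strongly correlated with F4
})
X.loc[rng.random(n) < 0.1, "F1"] = np.nan               # inject missing counts
y = (f4 + 0.5 * X["F3"] + rng.normal(size=n) > 0).astype(int)  # assumed binary target


def build_pipeline(corr_cols):
    """Hypothetical helper: preprocessing + model, parameterized by which of F4/F5 to keep."""
    count_steps = Pipeline([
        # Median imputation plus an explicit missing-value indicator column.
        ("impute", SimpleImputer(strategy="median", add_indicator=True)),
        ("log1p", FunctionTransformer(np.log1p)),
        ("scale", StandardScaler()),
    ])
    money_steps = Pipeline([
        ("log1p", FunctionTransformer(np.log1p)),        # compress the heavy tail
        ("scale", StandardScaler()),
    ])
    preprocess = ColumnTransformer([
        ("count", count_steps, ["F1"]),
        ("money", money_steps, ["F2"]),
        ("corr", StandardScaler(), corr_cols),
        ("flag", "passthrough", ["F3"]),                 # binary flag left unscaled
    ])
    # Default LogisticRegression uses an L2 penalty.
    return Pipeline([("prep", preprocess), ("clf", LogisticRegression(C=1.0, max_iter=1000))])


# The same stratified folds are used for both candidates so fold scores are paired.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores_full = cross_val_score(build_pipeline(["F4", "F5"]), X, y, cv=cv, scoring="roc_auc")
scores_drop = cross_val_score(build_pipeline(["F4"]), X, y, cv=cv, scoring="roc_auc")

diff = scores_full - scores_drop                         # per-fold paired differences
print(f"full: {scores_full.mean():.3f}  drop F5: {scores_drop.mean():.3f}  "
      f"mean diff: {diff.mean():.3f}")
```

The per-fold differences are the natural input to a paired comparison, but fold scores share training data, so an ordinary paired t-test on them tends to be optimistic; repeated cross-validation with a variance correction is one commonly suggested alternative. Swapping the final estimator for a tree ensemble or k-NN, or the `corr` branch for PCA, only changes one step of the pipeline, which keeps the comparison across tasks 3 and 4 on the same folds.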

Oct 13, 2025, 9:49 PM
