PracHub

Implement robust word counts and min/max

Last updated: Mar 29, 2026

Quick Overview

This question evaluates proficiency in large-scale text processing, Unicode-aware tokenization and normalization, memory-efficient streaming and frequency counting (including implementing counting with plain dicts), and designing robust safe_min/safe_max semantics to handle NaNs, mixed comparable types, and stability concerns.


Company: Amazon

Role: Data Scientist

Category: Data Manipulation (SQL/Python)

Difficulty: Medium

Interview Round: Onsite

You receive a 50 GB UTF-8 text corpus on disk. Implement a Python solution that:

  • Streams the file without loading it fully into memory.
  • Counts case-insensitive word frequencies, treating "e-mail" and "email" as the same token, stripping punctuation, and normalizing Unicode (e.g., accented forms). Specify your tokenization rules for hyphens, apostrophes, and emojis.
  • Ignores a provided stopword list of ~500 words.
  • Emits the top 10 words with counts and their percentage of total tokens.
  • Reports time and memory complexity in Big-O and any practical optimizations (e.g., mmap, chunking, generators).

Then, implement the same without collections.Counter using only dicts. Finally, implement safe_min(iterable, key=None, default=sentinel) and safe_max(...) that:

  • Work with NaNs present (treat NaN as greater than all numbers for max, less than all for min) and mixed comparable types via a key function.
  • If the iterable is empty, return default when it is supplied and raise ValueError when it is not.
  • Are stable with equal keys.

Explain edge cases you tested.
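One possible shape for the dict-only streaming count is sketched below. It is not a reference solution. The function names (normalize, tokens, count_words, count_file, top_k) are hypothetical, and the tokenization rules are assumptions chosen for illustration: hyphens and apostrophes inside words are deleted (so "e-mail" becomes "email" and "don't" becomes "dont"), accents are stripped after NFKD decomposition, and emojis and other punctuation act as separators. The percentage denominator is assumed to be all tokens, stopwords included.

```python
import heapq
import re
import unicodedata

# Assumed rule: these hyphen/apostrophe characters are deleted, not split on.
_JOINERS = dict.fromkeys(map(ord, "-\u2010\u2011'\u2019"))
# Runs of Unicode letters/digits; emojis and punctuation separate tokens.
_WORD = re.compile(r"[^\W_]+")


def normalize(text):
    """Decompose (NFKD), casefold, drop combining marks, delete joiners."""
    text = unicodedata.normalize("NFKD", text).casefold()
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return text.translate(_JOINERS)


def tokens(lines):
    """Lazily yield normalized tokens from an iterable of text lines."""
    for line in lines:
        yield from _WORD.findall(normalize(line))


def count_words(lines, stopwords=()):
    """Return (counts, total) where total includes stopword tokens."""
    stop = {normalize(w) for w in stopwords}
    counts = {}
    total = 0
    for tok in tokens(lines):
        total += 1
        if tok not in stop:
            counts[tok] = counts.get(tok, 0) + 1  # plain dict, no Counter
    return counts, total


def count_file(path, stopwords=()):
    """Stream a UTF-8 file line by line; undecodable bytes are replaced."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return count_words(fh, stopwords)


def top_k(counts, total, k=10):
    """Top k as (word, count, percent); ties broken alphabetically."""
    best = heapq.nsmallest(k, counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, n, 100.0 * n / total) for word, n in best]
```

Under these assumptions the scan is O(N) in the input size, memory is O(V) for V distinct tokens, and the top-10 selection is O(V log k). Line iteration assumes the corpus contains newlines at reasonable intervals; a file with no line breaks would need fixed-size chunk reads with carry-over of the partial token at each chunk boundary.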

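The safe_min/safe_max semantics could be sketched as a single pass with strict comparisons, as below. This takes the NaN rule literally: NaN counts as the largest value for safe_max and the smallest for safe_min, so a NaN key wins in both functions. That reading is an assumption; a spec that wants NaNs skipped would need a different rule. The helper names (_MISSING, _is_nan, _extreme) are hypothetical, and NaN detection is assumed to cover float keys only.

```python
import math

_MISSING = object()  # sentinel: distinguishes "no default" from default=None


def _is_nan(value):
    # Assumption: only float NaNs (including float subclasses) are special.
    return isinstance(value, float) and math.isnan(value)


def _extreme(iterable, key, default, want_max, name):
    found = False
    best = best_key = None
    for item in iterable:
        k = item if key is None else key(item)
        if not found:
            found, best, best_key = True, item, k
        elif _is_nan(best_key):
            continue  # a NaN already wins; keep the first one (stability)
        elif _is_nan(k):
            best, best_key = item, k
        elif (k > best_key) if want_max else (k < best_key):
            # Strict comparison keeps the first of several equal keys.
            best, best_key = item, k
    if not found:
        if default is _MISSING:
            raise ValueError(f"{name}() arg is an empty iterable")
        return default
    return best


def safe_min(iterable, key=None, default=_MISSING):
    return _extreme(iterable, key, default, False, "safe_min")


def safe_max(iterable, key=None, default=_MISSING):
    return _extreme(iterable, key, default, True, "safe_max")
```

Edge cases worth exercising with a sketch like this include an empty iterable with and without default, default=None, a NaN in first, middle, and last position, all-NaN input, equal keys on distinct objects, and mixed types such as [1, "10", 2.5] compared through key=float.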

Related Interview Questions

  • Find recommended friend pairs by shared songs - Amazon (Medium)
  • Find recommended friend pairs by shared listening - Amazon (Easy)
  • Write SQL window functions for D7 retention - Amazon (Medium)
  • Find daily first-order merchants with SQL - Amazon (Medium)
  • Design student–course data models and SQL - Amazon (Medium)
Oct 13, 2025, 9:49 PM

© 2026 PracHub. All rights reserved.