PracHub
QuestionsPremiumLearningGuidesCheatsheetNEWCoaches

Quick Overview

This question evaluates understanding of stratified sampling, proportional allocation across groups, reproducible random sampling, data grouping and edge-case handling in an in-memory dataset.

  • medium
  • Perplexity
  • Coding & Algorithms
  • Data Scientist

Implement Stratified Sampling by Country

Company: Perplexity

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: medium

Interview Round: Technical Screen

You are given a large in-memory Python list named `records`, where each element is a dictionary with at least the following keys: - `country: str` - `device: str` - `timestamp: str | datetime` Each dictionary may also contain additional fields. Write a Python function: `def stratified_sample_by_country(records, sample_size, random_state=None):` The function should return exactly `sample_size` records, sampled **without replacement**, such that the distribution of `country` in the sample matches the original dataset as closely as possible. For example, if 10% of the input records have `country == "US"`, then approximately 10% of the sampled records should come from the US. Your solution should address: 1. How to group records by country. 2. How to compute the target number of samples per country. 3. How to handle non-integer target counts while still returning exactly `sample_size` rows. 4. How to make the sampling random but reproducible when `random_state` is provided. 5. Edge cases such as `sample_size = 0`, missing `country` values, and `sample_size > len(records)`. You should also briefly discuss the time and space complexity of your approach.

Quick Answer: This question evaluates understanding of stratified sampling, proportional allocation across groups, reproducible random sampling, data grouping and edge-case handling in an in-memory dataset.

You are given an in-memory Python list named records, where each element is a dictionary containing at least device and timestamp, and usually a country key. Implement a function solution(records, sample_size, random_state=None) that returns exactly sample_size records sampled without replacement so that the country distribution in the sample matches the original dataset as closely as possible. Treat missing country values and country=None as the same bucket: None. To make the result deterministic for evaluation, use this quota rule: 1. For each country bucket with count c, compute its ideal quota as sample_size * c / len(records). 2. Take the floor of each ideal quota. 3. If the floors do not add up to sample_size, give the remaining slots to the buckets with the largest fractional remainders. 4. If fractional remainders are tied, give the extra slot to the bucket whose str(country) is lexicographically smaller. After computing the quota for each country, randomly sample that many records from each bucket without replacement. If random_state is provided, the sampling must be reproducible. Return the sampled records in the same relative order they appeared in the original input list. If sample_size is 0, return an empty list. If sample_size is negative or greater than len(records), raise a ValueError.

Constraints

  • 0 <= sample_size <= len(records) for a valid returned result; otherwise raise ValueError
  • Each record is a dictionary and may contain additional fields beyond country, device, and timestamp
  • Sampling must be without replacement
  • Missing country values and country=None must be treated as the same bucket

Examples

Input: ([{'country': 'US', 'device': 'mobile', 'timestamp': '2024-01-01'}, {'country': 'US', 'device': 'mobile', 'timestamp': '2024-01-01'}, {'country': 'CA', 'device': 'desktop', 'timestamp': '2024-01-02'}, {'country': 'IN', 'device': 'tablet', 'timestamp': '2024-01-03'}], 2, 7)

Expected Output: [{'country': 'US', 'device': 'mobile', 'timestamp': '2024-01-01'}, {'country': 'CA', 'device': 'desktop', 'timestamp': '2024-01-02'}]

Explanation: Counts are US=2, CA=1, IN=1. Ideal quotas for sample_size=2 are US=1.0, CA=0.5, IN=0.5. After floors, one remaining slot goes to CA because CA and IN tie on remainder and 'CA' < 'IN'.

Input: ([{'country': 'US', 'device': 'mobile', 'timestamp': '2024-02-01'}, {'country': 'US', 'device': 'mobile', 'timestamp': '2024-02-01'}, {'device': 'desktop', 'timestamp': '2024-02-02'}, {'country': None, 'device': 'tablet', 'timestamp': '2024-02-03'}, {'country': 'FR', 'device': 'watch', 'timestamp': '2024-02-04'}], 4, 3)

Expected Output: [{'country': 'US', 'device': 'mobile', 'timestamp': '2024-02-01'}, {'device': 'desktop', 'timestamp': '2024-02-02'}, {'country': None, 'device': 'tablet', 'timestamp': '2024-02-03'}, {'country': 'FR', 'device': 'watch', 'timestamp': '2024-02-04'}]

Explanation: Missing country and None are grouped together. Bucket sizes are US=2, None=2, FR=1. Ideal quotas are 1.6, 1.6, and 0.8. Floors give 1, 1, 0, and the two leftover slots go to FR and None.

Input: ([{'country': 'JP', 'device': 'mobile', 'timestamp': '2024-03-01'}], 0, None)

Expected Output: []

Explanation: A sample size of 0 must always return an empty list.

Input: ([{'country': 'BR', 'device': 'mobile', 'timestamp': '2024-04-01'}, {'country': 'BR', 'device': 'desktop', 'timestamp': '2024-04-02'}, {'country': 'ZA', 'device': 'tablet', 'timestamp': '2024-04-03'}], 3, 11)

Expected Output: [{'country': 'BR', 'device': 'mobile', 'timestamp': '2024-04-01'}, {'country': 'BR', 'device': 'desktop', 'timestamp': '2024-04-02'}, {'country': 'ZA', 'device': 'tablet', 'timestamp': '2024-04-03'}]

Explanation: When sample_size equals the number of input records, all records must be returned.

Input: ([{'country': 'DE', 'device': 'mobile', 'timestamp': '2024-05-01'}, {'country': 'DE', 'device': 'mobile', 'timestamp': '2024-05-01'}, {'country': 'DE', 'device': 'mobile', 'timestamp': '2024-05-01'}, {'country': 'DE', 'device': 'mobile', 'timestamp': '2024-05-01'}], 2, 5)

Expected Output: [{'country': 'DE', 'device': 'mobile', 'timestamp': '2024-05-01'}, {'country': 'DE', 'device': 'mobile', 'timestamp': '2024-05-01'}]

Explanation: There is only one country bucket, so the function simply returns two records from that bucket without replacement.

Hints

  1. First group records by country and count how many records belong to each bucket, including a separate bucket for missing/None countries.
  2. Use floor(ideal_quota) for every country, then distribute the leftover slots by largest fractional remainder so the final sample size is exact.
Last updated: May 7, 2026

Related Coding Questions

  • Implement Task Dependencies and Failure - Perplexity (hard)
  • Implement time-versioned KV store with restore - Perplexity (medium)

Loading coding console...

PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.