Implement Stratified Sampling by Country
Company: Perplexity
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: medium
Interview Round: Technical Screen
Quick Answer: This question evaluates understanding of stratified sampling, proportional allocation across groups, reproducible random sampling, data grouping, and edge-case handling on an in-memory dataset.
Constraints
- sample_size must satisfy 0 <= sample_size <= len(records); otherwise raise ValueError
- Each record is a dictionary and may contain additional fields beyond country, device, and timestamp
- Sampling must be without replacement
- Records with no country key and records with country=None must be placed in the same bucket
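The constraints above can be checked before any sampling happens. The following is a minimal sketch, not a reference solution: the helper name `bucket_by_country` is hypothetical, and storing record indices (rather than the records themselves) is an assumption made so that duplicate records stay distinguishable.

```python
def bucket_by_country(records, sample_size):
    """Validate sample_size and group record indices by country.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    if not 0 <= sample_size <= len(records):
        raise ValueError("sample_size must be between 0 and len(records)")

    buckets = {}
    for index, record in enumerate(records):
        # dict.get returns None for a missing key, so records without a
        # 'country' field land in the same bucket as country=None.
        buckets.setdefault(record.get("country"), []).append(index)
    return buckets
```

Extra fields on a record are ignored, because only the country key is ever read.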
Examples
Input: ([{'country': 'US', 'device': 'mobile', 'timestamp': '2024-01-01'}, {'country': 'US', 'device': 'mobile', 'timestamp': '2024-01-01'}, {'country': 'CA', 'device': 'desktop', 'timestamp': '2024-01-02'}, {'country': 'IN', 'device': 'tablet', 'timestamp': '2024-01-03'}], 2, 7)
Expected Output: [{'country': 'US', 'device': 'mobile', 'timestamp': '2024-01-01'}, {'country': 'CA', 'device': 'desktop', 'timestamp': '2024-01-02'}]
Explanation: Counts are US=2, CA=1, IN=1. Ideal quotas for sample_size=2 are US=1.0, CA=0.5, IN=0.5. Floors give US=1, CA=0, IN=0, so one slot remains. It goes to CA because CA and IN tie on remainder (0.5 each) and 'CA' < 'IN'.
Input: ([{'country': 'US', 'device': 'mobile', 'timestamp': '2024-02-01'}, {'country': 'US', 'device': 'mobile', 'timestamp': '2024-02-01'}, {'device': 'desktop', 'timestamp': '2024-02-02'}, {'country': None, 'device': 'tablet', 'timestamp': '2024-02-03'}, {'country': 'FR', 'device': 'watch', 'timestamp': '2024-02-04'}], 4, 3)
Expected Output: [{'country': 'US', 'device': 'mobile', 'timestamp': '2024-02-01'}, {'device': 'desktop', 'timestamp': '2024-02-02'}, {'country': None, 'device': 'tablet', 'timestamp': '2024-02-03'}, {'country': 'FR', 'device': 'watch', 'timestamp': '2024-02-04'}]
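Explanation: Missing country and None are grouped together. Bucket sizes are US=2, None=2, FR=1. Ideal quotas for sample_size=4 are US=1.6, None=1.6, and FR=0.8. Floors give 1, 1, 0, leaving two slots. The first goes to FR, which has the largest remainder (0.8). US and None tie at 0.6, and the expected output shows the None bucket receiving the second slot.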
Input: ([{'country': 'JP', 'device': 'mobile', 'timestamp': '2024-03-01'}], 0, None)
Expected Output: []
Explanation: A sample size of 0 must always return an empty list.
Input: ([{'country': 'BR', 'device': 'mobile', 'timestamp': '2024-04-01'}, {'country': 'BR', 'device': 'desktop', 'timestamp': '2024-04-02'}, {'country': 'ZA', 'device': 'tablet', 'timestamp': '2024-04-03'}], 3, 11)
Expected Output: [{'country': 'BR', 'device': 'mobile', 'timestamp': '2024-04-01'}, {'country': 'BR', 'device': 'desktop', 'timestamp': '2024-04-02'}, {'country': 'ZA', 'device': 'tablet', 'timestamp': '2024-04-03'}]
Explanation: When sample_size equals the number of input records, all records must be returned.
Input: ([{'country': 'DE', 'device': 'mobile', 'timestamp': '2024-05-01'}, {'country': 'DE', 'device': 'mobile', 'timestamp': '2024-05-01'}, {'country': 'DE', 'device': 'mobile', 'timestamp': '2024-05-01'}, {'country': 'DE', 'device': 'mobile', 'timestamp': '2024-05-01'}], 2, 5)
Expected Output: [{'country': 'DE', 'device': 'mobile', 'timestamp': '2024-05-01'}, {'country': 'DE', 'device': 'mobile', 'timestamp': '2024-05-01'}]
Explanation: There is only one country bucket, so the function simply returns two records from that bucket without replacement.
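Once each bucket has a quota, the draw itself can be sketched as below. This is an illustration under stated assumptions: `draw_by_quota` is a hypothetical name, `buckets` is assumed to map each country to a list of record indices, and returning records in their original input order is inferred from the expected outputs above rather than stated by the problem.

```python
import random


def draw_by_quota(records, buckets, quotas, seed=None):
    """Draw quotas[country] indices from each bucket without replacement.

    Hypothetical helper: buckets maps country -> list of record indices,
    quotas maps country -> number of records to take from that bucket.
    """
    rng = random.Random(seed)  # a private generator keeps the draw reproducible
    chosen = []
    for country, indices in buckets.items():
        # random.Random.sample never picks the same element twice.
        chosen.extend(rng.sample(indices, quotas.get(country, 0)))
    # Assumption: results are returned in input order, as in the examples.
    return [records[i] for i in sorted(chosen)]
```

Using a dedicated `random.Random(seed)` instance rather than the module-level functions means the result does not depend on, or disturb, any other random state in the program.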
Hints
- First group records by country and count how many records belong to each bucket, including a separate bucket for missing/None countries.
- Use floor(ideal_quota) for every country, then distribute the leftover slots by largest fractional remainder so the final sample size is exact. Break remainder ties by country name, as in the first example.
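The allocation step described in the hints can be sketched as follows. The function name `allocate_quotas` is hypothetical. Ordering the None bucket ahead of named countries on a tie is an assumption, chosen because it matches the second example; the problem statement does not spell that rule out. Remainders are compared as integers to avoid floating-point ties being broken by rounding noise.

```python
def allocate_quotas(counts, sample_size):
    """Largest-remainder allocation of sample_size slots across buckets.

    counts maps country -> number of records in that bucket.
    Hypothetical helper; assumes 0 <= sample_size <= sum(counts.values()).
    """
    if not counts:
        return {}
    total = sum(counts.values())

    quotas, remainders = {}, {}
    for country, n in counts.items():
        # ideal quota = n * sample_size / total; divmod yields its floor and
        # the remainder as an exact integer numerator over `total`.
        quotas[country], remainders[country] = divmod(n * sample_size, total)

    leftover = sample_size - sum(quotas.values())
    # Largest remainder first; ties go to None (assumption), then by name.
    order = sorted(
        counts,
        key=lambda c: (-remainders[c], c is not None, str(c)),
    )
    for country in order[:leftover]:
        quotas[country] += 1
    return quotas
```

For the first example, `allocate_quotas({'US': 2, 'CA': 1, 'IN': 1}, 2)` gives US=1, CA=1, IN=0. For the second, `allocate_quotas({'US': 2, None: 2, 'FR': 1}, 4)` gives US=1, None=2, FR=1.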