Implement Election Report and Banking Pipeline
Company: Point72
Role: Data Engineer
Category: Coding & Algorithms
Difficulty: hard
Interview Round: Take-home Project
Quick Answer: This multipart question evaluates data engineering competencies including SQL report design for complex aggregation and ranking, plus PySpark-based ETL and transaction-level analytics.
Part 1: Election Exit Poll State Rankings
Constraints
- 0 <= len(candidates) <= 10^4
- 0 <= len(results) <= 2 * 10^5
- Candidate ids in candidates are unique
- State names are non-empty strings
- If a vote references a candidate id not present in candidates, ignore it
Examples
Input: ([(1, 'Davide Kentish'), (2, 'Thorstein Bridge')], [(1, 'California'), (1, 'Texas'), (2, 'California'), (1, 'California'), (2, 'Texas'), (1, 'Alabama'), (2, 'New York'), (2, 'Texas'), (1, 'California'), (2, 'California'), (1, 'Texas'), (2, 'Alabama'), (1, 'Texas'), (2, 'New York'), (1, 'Alabama'), (2, 'Texas'), (1, 'California'), (2, 'California'), (1, 'California')])
Expected Output: [{'candidate_name': 'Davide Kentish', '1st_place': 'California (5)', '2nd_place': 'Texas (3)', '3rd_place': 'Alabama (2)'}, {'candidate_name': 'Thorstein Bridge', '1st_place': 'California (3), Texas (3)', '2nd_place': 'New York (2)', '3rd_place': 'Alabama (1)'}]
Explanation: Davide Kentish has state counts California=5, Texas=3, Alabama=2. Thorstein Bridge has California=3, Texas=3, New York=2, Alabama=1, so California and Texas share 1st place.
Input: ([(1, 'Alice'), (2, 'Bob'), (3, 'Cara')], [(1, 'Utah'), (1, 'Ohio'), (3, 'Florida'), (1, 'Utah'), (1, 'Nevada'), (1, 'Ohio')])
Expected Output: [{'candidate_name': 'Alice', '1st_place': 'Ohio (2), Utah (2)', '2nd_place': 'Nevada (1)', '3rd_place': None}, {'candidate_name': 'Bob', '1st_place': None, '2nd_place': None, '3rd_place': None}, {'candidate_name': 'Cara', '1st_place': 'Florida (1)', '2nd_place': None, '3rd_place': None}]
Explanation: Alice has a tie for the top total with Ohio and Utah at 2 votes each. Bob has no votes, so all places are None. Cara has only one state with votes.
Input: ([(10, 'Lina'), (11, 'Mark')], [(10, 'Texas'), (11, 'Oregon'), (10, 'California'), (10, 'New York'), (11, 'Washington'), (10, 'Florida'), (10, 'California'), (11, 'Idaho'), (10, 'Texas'), (10, 'New York'), (11, 'Washington'), (10, 'Arizona'), (10, 'California'), (11, 'Oregon'), (10, 'Florida'), (10, 'New York'), (11, 'Idaho'), (10, 'Texas'), (10, 'California'), (10, 'New York')])
Expected Output: [{'candidate_name': 'Lina', '1st_place': 'California (4), New York (4)', '2nd_place': 'Texas (3)', '3rd_place': 'Florida (2)'}, {'candidate_name': 'Mark', '1st_place': 'Idaho (2), Oregon (2), Washington (2)', '2nd_place': None, '3rd_place': None}]
Explanation: Lina has four distinct totals but only the top three are returned, so Arizona (1) is omitted. Mark has only one distinct total shared by three states.
Input: ([(7, 'Solo')], [])
Expected Output: [{'candidate_name': 'Solo', '1st_place': None, '2nd_place': None, '3rd_place': None}]
Explanation: With no votes at all, the candidate still appears in the result with all places set to None.
Hints
- First count votes for each (candidate, state) pair.
- After counting, group states by their vote totals so you can rank distinct totals instead of individual states.
Part 2: Banking Transaction Validation and Source Account Stats
Constraints
- 0 <= len(accounts) <= 10^5
- 0 <= len(transactions) <= 2 * 10^5
- Account numbers in accounts are unique strings
- Balances and transfer amounts are non-negative integers
- Validation uses the original balance only; balances do not decrease as transactions are processed
Examples
Input: ([('A1', 100), ('A2', 50), ('A3', 200)], [('A1', 'A2', 70), ('A1', 'A4', 10), ('A2', 'A3', 60), ('A3', 'A1', 150), ('A1', 'A3', 70)])
Expected Output: {'valid_transactions': [('A1', 'A2', 70), ('A3', 'A1', 150), ('A1', 'A3', 70)], 'distinct_source_accounts': 2, 'top_source_accounts': {'A1': 2, 'A3': 1}}
Explanation: The second transaction is invalid because the destination account does not exist. The third is invalid because 60 exceeds A2's balance of 50.
Input: ([('X', 10)], [('X', 'Y', 5), ('Z', 'X', 1), ('X', 'X', 11)])
Expected Output: {'valid_transactions': [], 'distinct_source_accounts': 0, 'top_source_accounts': {}}
Explanation: This edge case has no valid transactions at all.
Input: ([('A', 100), ('B', 100), ('C', 100), ('D', 100)], [('B', 'A', 10), ('A', 'B', 20), ('B', 'C', 30), ('A', 'D', 40), ('C', 'C', 5)])
Expected Output: {'valid_transactions': [('B', 'A', 10), ('A', 'B', 20), ('B', 'C', 30), ('A', 'D', 40), ('C', 'C', 5)], 'distinct_source_accounts': 3, 'top_source_accounts': {'A': 2, 'B': 2, 'C': 1}}
Explanation: All transactions are valid. A and B tie with 2 transactions each, so the tie is broken by account number ascending.
Input: ([('A01', 100), ('A02', 100), ('A03', 100), ('A04', 100), ('A05', 100), ('A06', 100), ('A07', 100), ('A08', 100), ('A09', 100), ('A10', 100), ('A11', 100), ('A12', 100)], [('A01', 'A12', 1), ('A02', 'A12', 1), ('A03', 'A12', 1), ('A04', 'A12', 1), ('A05', 'A12', 1), ('A06', 'A12', 1), ('A07', 'A12', 1), ('A08', 'A12', 1), ('A09', 'A12', 1), ('A10', 'A12', 1), ('A11', 'A12', 1)])
Expected Output: {'valid_transactions': [('A01', 'A12', 1), ('A02', 'A12', 1), ('A03', 'A12', 1), ('A04', 'A12', 1), ('A05', 'A12', 1), ('A06', 'A12', 1), ('A07', 'A12', 1), ('A08', 'A12', 1), ('A09', 'A12', 1), ('A10', 'A12', 1), ('A11', 'A12', 1)], 'distinct_source_accounts': 11, 'top_source_accounts': {'A01': 1, 'A02': 1, 'A03': 1, 'A04': 1, 'A05': 1, 'A06': 1, 'A07': 1, 'A08': 1, 'A09': 1, 'A10': 1}}
Explanation: There are 11 distinct valid source accounts, but only the top 10 should be returned. Since all counts are equal, the lexicographically smallest 10 account numbers are kept.
Hints
- Build a hash map from account number to balance so account existence and balance checks are O(1).
- After filtering valid transactions, count source accounts and sort by count descending, then account number ascending.