PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This multipart question evaluates data engineering competencies including SQL report design for complex aggregation and ranking, plus PySpark-based ETL and transaction-level analytics.

  • hard
  • Point72
  • Coding & Algorithms
  • Data Engineer

Implement Election Report and Banking Pipeline

Company: Point72

Role: Data Engineer

Category: Coding & Algorithms

Difficulty: hard

Interview Round: Take-home Project

You are given two independent data engineering coding tasks. ## Task 1: SQL Election Exit Poll by State Report An election analytics team needs a report showing, for each candidate, the top three vote totals by state and the states where those totals occurred. ### Tables #### `candidates` | Column | Type | Description | |---|---:|---| | `id` | SMALLINT | Candidate ID | | `first_name` | VARCHAR(64) | Candidate first name | | `last_name` | VARCHAR(64) | Candidate last name | #### `results` | Column | Type | Description | |---|---:|---| | `candidate_id` | SMALLINT | Candidate ID | | `state` | VARCHAR(19) | U.S. state where a vote was recorded | Each row in `results` represents one vote for a candidate in a state. ### Output Return one row per candidate with the following columns: | Column | Description | |---|---| | `candidate_name` | Candidate first and last name concatenated with a single space. | | `1st_place` | State names and vote totals for the highest vote count. | | `2nd_place` | State names and vote totals for the second-highest distinct vote count. | | `3rd_place` | State names and vote totals for the third-highest distinct vote count. | Each place column should contain one or more values formatted as: ```text State Name (votes) ``` If multiple states have the same vote count for the same place, sort those states alphabetically and join them with `, `. For example: ```text Alabama (23), New York (23) ``` The top three places are based on distinct vote totals, not on individual states. For example, if a candidate has state vote counts `3, 3, 2, 1`, then the first place contains both states with `3`, the second place contains the state with `2`, and the third place contains the state with `1`. ### Sample Data #### `candidates` | id | first_name | last_name | |---:|---|---| | 1 | Davide | Kentish | | 2 | Thorstein | Bridge | #### `results` | candidate_id | state | |---:|---| | 1 | Alabama | | 1 | Alabama | | 1 | California | | 1 | California | | 1 | California | | 1 | California | | 1 | California | | 1 | Texas | | 1 | Texas | | 1 | Texas | | 2 | Alabama | | 2 | California | | 2 | California | | 2 | California | | 2 | New York | | 2 | New York | | 2 | Texas | | 2 | Texas | | 2 | Texas | ### Expected Output | candidate_name | 1st_place | 2nd_place | 3rd_place | |---|---|---|---| | Davide Kentish | California (5) | Texas (3) | Alabama (2) | | Thorstein Bridge | California (3), Texas (3) | New York (2) | Alabama (1) | Write a SQL query that produces this report. --- ## Task 2: PySpark Banking Data Mining Implement a PySpark job that reads account and transaction CSV files, filters valid transactions, and computes account-level transaction statistics. ### Input Files #### `accounts.csv` One row per account. | Column | Description | |---|---| | `accountNumber` | Account identifier | | `balance` | Account balance | #### `transactions.csv` One row per transfer transaction. | Column | Description | |---|---| | `fromAccountNumber` | Source account number | | `toAccountNumber` | Destination account number | | `transferAmount` | Amount transferred | ### Rules A transaction is valid if all of the following are true: 1. `fromAccountNumber` exists in `accounts.csv`. 2. `toAccountNumber` exists in `accounts.csv`. 3. `transferAmount` is less than or equal to the `balance` of the source account. ### Methods to Implement You are given a partially completed class `main.job.pipeline.PySparkJob`. Implement the following methods: ```python class PySparkJob: def init_spark_session(self) -> SparkSession: """Create and return a SparkSession with master 'local' and app name 'Banking Data Mining'.""" pass def extract_valid_transactions(self, accounts: DataFrame, transactions: DataFrame) -> DataFrame: """Return only valid transactions according to the validation rules above.""" pass def distinct_transactions(self, transactions: DataFrame) -> int: """Return the number of distinct source accounts, based on fromAccountNumber.""" pass def transactions_per_account(self, transactions: DataFrame) -> dict: """Return the top 10 fromAccountNumber values by transaction count as a dictionary.""" pass ``` The helper method `read_csv(file_path)` reads CSV files with headers. Unit tests are provided and must pass.

Quick Answer: This multipart question evaluates data engineering competencies including SQL report design for complex aggregation and ranking, plus PySpark-based ETL and transaction-level analytics.

Part 1: Election Exit Poll State Rankings

You are given a list of candidates and a list of votes. Each vote is recorded as a pair of candidate_id and state. Build a report with one row per candidate. For each candidate, count how many votes they received in each state, then determine the top three distinct vote totals. Each place should include all states that share that total, sorted alphabetically and formatted as 'State (votes)'. If a candidate has fewer than three distinct vote totals, use None for the missing places. Return rows sorted by candidate id ascending.

Constraints

  • 0 <= len(candidates) <= 10^4
  • 0 <= len(results) <= 2 * 10^5
  • Candidate ids in candidates are unique
  • State names are non-empty strings
  • If a vote references a candidate id not present in candidates, ignore it

Examples

Input: ([(1, 'Davide Kentish'), (2, 'Thorstein Bridge')], [(1, 'California'), (1, 'Texas'), (2, 'California'), (1, 'California'), (2, 'Texas'), (1, 'Alabama'), (2, 'New York'), (2, 'Texas'), (1, 'California'), (2, 'California'), (1, 'Texas'), (2, 'Alabama'), (1, 'Texas'), (2, 'New York'), (1, 'Alabama'), (2, 'Texas'), (1, 'California'), (2, 'California'), (1, 'California')])

Expected Output: [{'candidate_name': 'Davide Kentish', '1st_place': 'California (5)', '2nd_place': 'Texas (3)', '3rd_place': 'Alabama (2)'}, {'candidate_name': 'Thorstein Bridge', '1st_place': 'California (3), Texas (3)', '2nd_place': 'New York (2)', '3rd_place': 'Alabama (1)'}]

Explanation: Davide Kentish has state counts California=5, Texas=3, Alabama=2. Thorstein Bridge has California=3, Texas=3, New York=2, Alabama=1, so California and Texas share 1st place.

Input: ([(1, 'Alice'), (2, 'Bob'), (3, 'Cara')], [(1, 'Utah'), (1, 'Ohio'), (3, 'Florida'), (1, 'Utah'), (1, 'Nevada'), (1, 'Ohio')])

Expected Output: [{'candidate_name': 'Alice', '1st_place': 'Ohio (2), Utah (2)', '2nd_place': 'Nevada (1)', '3rd_place': None}, {'candidate_name': 'Bob', '1st_place': None, '2nd_place': None, '3rd_place': None}, {'candidate_name': 'Cara', '1st_place': 'Florida (1)', '2nd_place': None, '3rd_place': None}]

Explanation: Alice has a tie for the top total with Ohio and Utah at 2 votes each. Bob has no votes, so all places are None. Cara has only one state with votes.

Input: ([(10, 'Lina'), (11, 'Mark')], [(10, 'Texas'), (11, 'Oregon'), (10, 'California'), (10, 'New York'), (11, 'Washington'), (10, 'Florida'), (10, 'California'), (11, 'Idaho'), (10, 'Texas'), (10, 'New York'), (11, 'Washington'), (10, 'Arizona'), (10, 'California'), (11, 'Oregon'), (10, 'Florida'), (10, 'New York'), (11, 'Idaho'), (10, 'Texas'), (10, 'California'), (10, 'New York')])

Expected Output: [{'candidate_name': 'Lina', '1st_place': 'California (4), New York (4)', '2nd_place': 'Texas (3)', '3rd_place': 'Florida (2)'}, {'candidate_name': 'Mark', '1st_place': 'Idaho (2), Oregon (2), Washington (2)', '2nd_place': None, '3rd_place': None}]

Explanation: Lina has four distinct totals but only the top three are returned, so Arizona (1) is omitted. Mark has only one distinct total shared by three states.

Input: ([(7, 'Solo')], [])

Expected Output: [{'candidate_name': 'Solo', '1st_place': None, '2nd_place': None, '3rd_place': None}]

Explanation: With no votes at all, the candidate still appears in the result with all places set to None.

Hints

  1. First count votes for each (candidate, state) pair.
  2. After counting, group states by their vote totals so you can rank distinct totals instead of individual states.

Part 2: Banking Transaction Validation and Source Account Stats

You are given account records and transfer transactions. A transaction is valid if the source account exists, the destination account exists, and the transfer amount is less than or equal to the source account's original balance from the accounts list. Valid transactions do not change balances for later checks; every transaction is validated independently against the original balances. Return the valid transactions in original order, the number of distinct source accounts among valid transactions, and the top 10 source accounts by valid transaction count. When counts tie, sort account numbers in ascending order.

Constraints

  • 0 <= len(accounts) <= 10^5
  • 0 <= len(transactions) <= 2 * 10^5
  • Account numbers in accounts are unique strings
  • Balances and transfer amounts are non-negative integers
  • Validation uses the original balance only; balances do not decrease as transactions are processed

Examples

Input: ([('A1', 100), ('A2', 50), ('A3', 200)], [('A1', 'A2', 70), ('A1', 'A4', 10), ('A2', 'A3', 60), ('A3', 'A1', 150), ('A1', 'A3', 70)])

Expected Output: {'valid_transactions': [('A1', 'A2', 70), ('A3', 'A1', 150), ('A1', 'A3', 70)], 'distinct_source_accounts': 2, 'top_source_accounts': {'A1': 2, 'A3': 1}}

Explanation: The second transaction is invalid because the destination account does not exist. The third is invalid because 60 exceeds A2's balance of 50.

Input: ([('X', 10)], [('X', 'Y', 5), ('Z', 'X', 1), ('X', 'X', 11)])

Expected Output: {'valid_transactions': [], 'distinct_source_accounts': 0, 'top_source_accounts': {}}

Explanation: This edge case has no valid transactions at all.

Input: ([('A', 100), ('B', 100), ('C', 100), ('D', 100)], [('B', 'A', 10), ('A', 'B', 20), ('B', 'C', 30), ('A', 'D', 40), ('C', 'C', 5)])

Expected Output: {'valid_transactions': [('B', 'A', 10), ('A', 'B', 20), ('B', 'C', 30), ('A', 'D', 40), ('C', 'C', 5)], 'distinct_source_accounts': 3, 'top_source_accounts': {'A': 2, 'B': 2, 'C': 1}}

Explanation: All transactions are valid. A and B tie with 2 transactions each, so the tie is broken by account number ascending.

Input: ([('A01', 100), ('A02', 100), ('A03', 100), ('A04', 100), ('A05', 100), ('A06', 100), ('A07', 100), ('A08', 100), ('A09', 100), ('A10', 100), ('A11', 100), ('A12', 100)], [('A01', 'A12', 1), ('A02', 'A12', 1), ('A03', 'A12', 1), ('A04', 'A12', 1), ('A05', 'A12', 1), ('A06', 'A12', 1), ('A07', 'A12', 1), ('A08', 'A12', 1), ('A09', 'A12', 1), ('A10', 'A12', 1), ('A11', 'A12', 1)])

Expected Output: {'valid_transactions': [('A01', 'A12', 1), ('A02', 'A12', 1), ('A03', 'A12', 1), ('A04', 'A12', 1), ('A05', 'A12', 1), ('A06', 'A12', 1), ('A07', 'A12', 1), ('A08', 'A12', 1), ('A09', 'A12', 1), ('A10', 'A12', 1), ('A11', 'A12', 1)], 'distinct_source_accounts': 11, 'top_source_accounts': {'A01': 1, 'A02': 1, 'A03': 1, 'A04': 1, 'A05': 1, 'A06': 1, 'A07': 1, 'A08': 1, 'A09': 1, 'A10': 1}}

Explanation: There are 11 distinct valid source accounts, but only the top 10 should be returned. Since all counts are equal, the lexicographically smallest 10 account numbers are kept.

Hints

  1. Build a hash map from account number to balance so account existence and balance checks are O(1).
  2. After filtering valid transactions, count source accounts and sort by count descending, then account number ascending.
Last updated: May 7, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Solve SQL and PySpark Data Tasks - Point72 (easy)
  • Implement Portfolio Trading Optimizer - Point72 (hard)
  • Find the Smallest String After One Decrement - Point72 (medium)
  • Implement composition and mixin utilities - Point72 (hard)
  • Find kth missing integer and redundant operations - Point72 (easy)