Solve SQL and PySpark Data Tasks
Company: Point72
Role: Data Engineer
Category: Coding & Algorithms
Difficulty: easy
Interview Round: Take-home Project
Quick Answer: This question evaluates proficiency in SQL query design and PySpark DataFrame manipulation, covering aggregation, filtering, joins, time and numeric formatting, duplicate handling, and Spark session setup.
Part 1: Detect Repeated Crypto-Transfer Patterns
Constraints
- 1 <= len(dt) == 19 and every dt is in 'YYYY-MM-DD HH:MM:SS' format.
- 0 <= number of transactions <= 200000.
- amount is a valid non-negative decimal string.
- Lexicographic order of dt matches chronological order because the format is fixed.
Examples
Input: ([['A', 'X', '12', '2024-01-01 10:00:00'], ['A', 'X', '8.5', '2024-01-01 11:00:00'], ['A', 'Y', '7', '2024-01-01 09:00:00'], ['A', 'Y', '3', '2024-01-01 12:00:00'], ['B', 'Z', '5', '2024-01-01 13:00:00'], ['B', 'Z', '5', '2024-01-01 14:00:00'], ['C', 'Q', '1', '2024-01-01 15:00:00']],)
Expected Output: [['A', '09:00:00 - 12:00:00', '12.000000 + 8.500000 + 7.000000 + .. 1 more = 30.500000'], ['B', '13:00:00 - 14:00:00', '5.000000 + 5.000000 = 10.000000']]
Explanation: A-X, A-Y, and B-Z are suspicious pairs. Sender A has 4 included transactions, so only the top 3 are shown before '.. 1 more'.
Input: ([['alice', 'bob', '10', '2024-05-01 08:00:00'], ['alice', 'bob', '10', '2024-05-01 07:00:00'], ['dave', 'eve', '20', '2024-05-01 06:00:00'], ['dave', 'eve', '1', '2024-05-01 09:00:00']],)
Expected Output: [['dave', '06:00:00 - 09:00:00', '20.000000 + 1.000000 = 21.000000'], ['alice', '07:00:00 - 08:00:00', '10.000000 + 10.000000 = 20.000000']]
Explanation: For alice, equal amounts are ordered by earlier timestamp first. Final rows are sorted by total amount descending.
Input: ([['A', 'X', '1', '2024-01-01 10:00:00'], ['A', 'Y', '2', '2024-01-01 11:00:00'], ['B', 'Z', '3', '2024-01-01 12:00:00']],)
Expected Output: []
Explanation: No sender-recipient pair appears at least twice, so nothing remains after filtering.
Input: ([['M', 'N', '2', '2024-07-01 10:00:00'], ['M', 'N', '4', '2024-07-01 09:00:00'], ['M', 'N', '3', '2024-07-01 12:00:00'], ['X', 'Y', '1', '2024-07-01 13:00:00']],)
Expected Output: [['M', '09:00:00 - 12:00:00', '4.000000 + 3.000000 + 2.000000 = 9.000000']]
Explanation: The pair M-N appears 3 times, so all 3 are included and all amounts are shown.
Hints
- A clean approach is two-pass: first count each (sender, recipient) pair, then aggregate only the rows whose pair count is at least 2.
- Because timestamps are in a sortable fixed-width format, you can use normal string min/max for earliest and latest datetimes.
Part 2: Build the MedicalBilling Spark Configuration
Constraints
- 0 <= len(configs) <= 100000.
- Each config row has exactly 2 strings: [key, value].
- If a key appears multiple times, the last occurrence wins before the forced values are applied.
Examples
Input: ([['master', 'yarn'], ['spark.sql.shuffle.partitions', '8']],)
Expected Output: [['appName', 'MedicalBilling'], ['master', 'local'], ['spark.sql.shuffle.partitions', '8']]
Explanation: The existing master value is overwritten, while unrelated settings are preserved.
Input: ([['appName', 'OldApp'], ['master', 'cluster'], ['master', 'k8s']],)
Expected Output: [['appName', 'MedicalBilling'], ['master', 'local']]
Explanation: Repeated keys collapse to one value, and then the required values are forced.
Input: ([],)
Expected Output: [['appName', 'MedicalBilling'], ['master', 'local']]
Explanation: With no input settings, only the two required ones remain.
Input: ([['spark.executor.memory', '2g'], ['spark.executor.memory', '4g'], ['foo', 'bar']],)
Expected Output: [['appName', 'MedicalBilling'], ['foo', 'bar'], ['master', 'local'], ['spark.executor.memory', '4g']]
Explanation: The last value for spark.executor.memory is kept.
Hints
- A dictionary naturally models configuration overwrites.
- Apply the forced 'master' and 'appName' values after processing the input.
Part 3: Filter Medical Claims by Eligibility
Constraints
- 0 <= len(medical_df), len(eligibility_df) <= 200000.
- Each medical row has length 4 and each eligibility row has length 3.
- memberId comparisons are exact string matches.
Examples
Input: ([['c1', 'm1', '99.0', 'stale'], ['c2', 'm2', '50.0', 'old'], ['c3', 'm3', '70.0', 'name']], [['m2', 'Ann', 'Lee'], ['m1', 'Bob', 'Ray'], ['m1', 'Bob', 'Ray']])
Expected Output: [['c1', 'm1', '99.0', 'stale'], ['c2', 'm2', '50.0', 'old']]
Explanation: m1 and m2 are eligible, and repeated eligibility rows do not duplicate medical rows.
Input: ([['c1', 'm9', '10.0', 'x']], [['m1', 'A', 'B']])
Expected Output: []
Explanation: No medical memberId appears in eligibility.
Input: ([], [['m1', 'A', 'B']])
Expected Output: []
Explanation: Filtering an empty medical table returns an empty result.
Input: ([['c1', 'm1', '10.0', 'x'], ['c1', 'm1', '10.0', 'x']], [['m1', 'A', 'B']])
Expected Output: [['c1', 'm1', '10.0', 'x'], ['c1', 'm1', '10.0', 'x']]
Explanation: Duplicate medical rows are preserved if they are eligible.
Hints
- This is a semi-join: you only need to know whether a memberId exists, not all matching rows.
- Build a set of eligible memberIds first so each medical row can be checked in O(1) average time.
Part 4: Generate Full Names for Medical Claims
Constraints
- 0 <= len(medical_df), len(eligibility_df) <= 200000.
- Each medical row has length 4 and each eligibility row has length 3.
- If a medical row has no matching memberId in eligibility_df, it is omitted from the result.
Examples
Input: ([['c1', 'm1', '99.0', 'OLD'], ['c2', 'm2', '50.0', '']], [['m1', 'Bob', 'Ray'], ['m2', 'Ann', 'Lee']])
Expected Output: [['c1', 'm1', '99.0', 'Bob Ray'], ['c2', 'm2', '50.0', 'Ann Lee']]
Explanation: Both medical rows match eligibility and receive regenerated full names.
Input: ([['c1', 'm1', '10.0', 'x'], ['c2', 'm3', '20.0', 'y']], [['m1', 'A', 'B']])
Expected Output: [['c1', 'm1', '10.0', 'A B']]
Explanation: The row for m3 is omitted because it has no eligibility match.
Input: ([['c1', 'm1', '10.0', 'stale']], [['m1', 'First', 'Person'], ['m1', 'Second', 'Person']])
Expected Output: [['c1', 'm1', '10.0', 'First Person']]
Explanation: When eligibility repeats a memberId, the first occurrence is used.
Input: ([['c1', 'm1', '10.0', 'stale']], [])
Expected Output: []
Explanation: No eligibility rows means no joined result.
Hints
- Build a memberId -> 'first last' lookup once, then scan medical rows in order.
- Because the problem asks you to replace the name, the original fullName column should never be reused.
Part 5: Find the Highest Paid Medical Claim
Constraints
- 0 <= len(medical_with_name_df) <= 200000.
- paidAmount is a valid decimal string.
- memberId comparisons for tie-breaking are standard lexicographic string comparisons.
Examples
Input: ([['c1', 'm2', '99.0', 'Ann Lee'], ['c2', 'm1', '100.0', 'Bob Ray'], ['c3', 'm3', '50.0', 'Cara Doe']],)
Expected Output: ['c2', 'm1', '100.0', 'Bob Ray']
Explanation: The row with paidAmount 100.0 is the maximum.
Input: ([['c1', 'm2', '100.0', 'A'], ['c2', 'm1', '100.0', 'B'], ['c3', 'm3', '99.0', 'C']],)
Expected Output: ['c2', 'm1', '100.0', 'B']
Explanation: Two rows tie on amount, so the lexicographically smaller memberId 'm1' wins.
Input: ([],)
Expected Output: []
Explanation: Empty input returns an empty list.
Input: ([['c1', 'm1', '10.0', 'First'], ['c2', 'm1', '10.0', 'Second']],)
Expected Output: ['c1', 'm1', '10.0', 'First']
Explanation: When amount and memberId both tie, keep the earliest row.
Hints
- Compare paidAmount as a number, not as a string, otherwise '99.0' could incorrectly beat '100.0'.
- A single linear scan is enough if you keep track of the best row seen so far.
Part 6: Find the Total Paid Amount
Constraints
- 0 <= len(medical_df) <= 200000.
- paidAmount is a valid decimal string representing a whole-number value.
- The final answer fits in standard integer ranges.
Examples
Input: ([['c1', 'm1', '99.0', 'x'], ['c2', 'm2', '1.0', 'y']],)
Expected Output: 100
Explanation: 99.0 + 1.0 = 100.
Input: ([],)
Expected Output: 0
Explanation: The sum of an empty list is 0.
Input: ([['c1', 'm1', '0.0', ''], ['c2', 'm2', '100.0', ''], ['c3', 'm3', '250.0', '']],)
Expected Output: 350
Explanation: All amounts are added numerically.
Input: ([['c1', 'm1', '7', ''], ['c2', 'm2', '8.0', '']],)
Expected Output: 15
Explanation: Integer-looking and decimal-suffixed strings both count as numeric values.
Hints
- Do not concatenate or compare the amount strings directly; convert them to numbers first.
- Because the amounts are guaranteed to represent whole-number values, the final numeric sum can be converted to int.