PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep

Quick Overview

This question evaluates proficiency in SQL query design and PySpark DataFrame manipulation, covering aggregation, filtering, joins, time and numeric formatting, duplicate handling, and Spark session setup.

  • easy
  • Point72
  • Coding & Algorithms
  • Data Engineer

Solve SQL and PySpark Data Tasks

Company: Point72

Role: Data Engineer

Category: Coding & Algorithms

Difficulty: easy

Interview Round: Take-home Project

Complete the following two data engineering coding tasks. ### Part A: Detect repeated crypto-transfer patterns in PostgreSQL You are given a PostgreSQL table `crypto_transactions` with columns: - `sender` TEXT: sender wallet address - `recipient` TEXT: recipient wallet address - `amount` NUMERIC: transaction amount - `dt` VARCHAR(19): timestamp formatted as `YYYY-MM-DD HH:MM:SS` A sender-recipient pair is considered suspicious if there are at least 2 transactions from that sender to that recipient. Write a single SQL query that: 1. Finds all suspicious sender-recipient pairs. 2. Keeps all transaction rows that belong to those suspicious pairs. 3. Aggregates the remaining rows by `sender`. 4. Returns exactly these columns: - `sender`: the sender wallet address. - `period`: the time portion of the earliest and latest transaction for that sender, formatted as `HH:MM:SS - HH:MM:SS`. Use only the time component, not the date. - `transactions`: a formatted summary of that sender's included transaction amounts. For `transactions`: - Sort amounts by numeric `amount` descending; break ties by `dt` ascending. - Format every amount and total with exactly 6 digits after the decimal point and no leading padding spaces. - If the sender has 3 or fewer included transactions, concatenate all formatted amounts with ` + `, followed by ` = total`. Example: `12.000000 + 8.500000 = 20.500000`. - If the sender has more than 3 included transactions, show only the top 3 formatted amounts, then append ` + .. X more = total`, where `X` is the number of omitted transactions. The ellipsis is exactly two dots: `..`. Order the final result by total included amount descending; for ties, order by `sender` ascending. ### Part B: Implement PySpark medical-billing utilities You are given two PySpark DataFrames loaded from CSV files. Assume string columns unless stated otherwise. `eligibility_df` schema: - `memberId` - `firstName` - `lastName` `medical_df` schema: - `claimId` - `memberId` - `paidAmount`: a decimal number stored as a string, such as `99.0` or `100.0` - `fullName`: may be empty or stale and should be replaced when generating names Implement the following functions: 1. `init_spark_session()` - Return a `SparkSession` configured with master `local` and application name `MedicalBilling`. 2. `filter_medical(medical_df, eligibility_df)` - Return only rows from `medical_df` whose `memberId` exists in `eligibility_df`. - Preserve only the original `medical_df` columns. - Do not duplicate medical rows if `eligibility_df` contains repeated `memberId` values. 3. `generate_full_name(medical_df, eligibility_df)` - Join medical rows to eligibility data on `memberId`. - Replace any existing `fullName` column with `firstName` followed by a single space followed by `lastName`. - Return columns in this order: `claimId`, `memberId`, `paidAmount`, `fullName`. 4. `find_max_paid_member(medical_with_name_df)` - Treat `paidAmount` as a numeric value, not as a string. - Return the single row with the largest `paidAmount`; if there is a tie, return the row with lexicographically smallest `memberId`. 5. `find_total_paid_amount(medical_df)` - Cast `paidAmount` to a numeric type, sum it over all rows, and return the result as a Python integer. Assume input amounts represent whole-number currency values even if written with a decimal suffix.

Quick Answer: This question evaluates proficiency in SQL query design and PySpark DataFrame manipulation, covering aggregation, filtering, joins, time and numeric formatting, duplicate handling, and Spark session setup.

Part 1: Detect Repeated Crypto-Transfer Patterns

You are given a list-based version of a transaction table. Each transaction is represented as [sender, recipient, amount, dt], where amount is a decimal string and dt has format 'YYYY-MM-DD HH:MM:SS'. A sender-recipient pair is suspicious if it appears at least twice. Keep only transactions from suspicious pairs, then aggregate the remaining rows by sender. For each sender, build period as 'HH:MM:SS - HH:MM:SS' using the earliest and latest included timestamps for that sender. Also build a transactions summary: sort included transactions by numeric amount descending and timestamp ascending for ties; format every shown amount and the total with exactly 6 decimal places. If the sender has 3 or fewer included transactions, show all amounts joined by ' + ' and then ' = total'. If the sender has more than 3 included transactions, show only the top 3 amounts and then append ' + .. X more = total'. Return the final rows ordered by total included amount descending, then sender ascending.

Constraints

  • 1 <= len(dt) == 19 and every dt is in 'YYYY-MM-DD HH:MM:SS' format.
  • 0 <= number of transactions <= 200000.
  • amount is a valid non-negative decimal string.
  • Lexicographic order of dt matches chronological order because the format is fixed.

Examples

Input: ([['A', 'X', '12', '2024-01-01 10:00:00'], ['A', 'X', '8.5', '2024-01-01 11:00:00'], ['A', 'Y', '7', '2024-01-01 09:00:00'], ['A', 'Y', '3', '2024-01-01 12:00:00'], ['B', 'Z', '5', '2024-01-01 13:00:00'], ['B', 'Z', '5', '2024-01-01 14:00:00'], ['C', 'Q', '1', '2024-01-01 15:00:00']],)

Expected Output: [['A', '09:00:00 - 12:00:00', '12.000000 + 8.500000 + 7.000000 + .. 1 more = 30.500000'], ['B', '13:00:00 - 14:00:00', '5.000000 + 5.000000 = 10.000000']]

Explanation: A-X, A-Y, and B-Z are suspicious pairs. Sender A has 4 included transactions, so only the top 3 are shown before '.. 1 more'.

Input: ([['alice', 'bob', '10', '2024-05-01 08:00:00'], ['alice', 'bob', '10', '2024-05-01 07:00:00'], ['dave', 'eve', '20', '2024-05-01 06:00:00'], ['dave', 'eve', '1', '2024-05-01 09:00:00']],)

Expected Output: [['dave', '06:00:00 - 09:00:00', '20.000000 + 1.000000 = 21.000000'], ['alice', '07:00:00 - 08:00:00', '10.000000 + 10.000000 = 20.000000']]

Explanation: For alice, equal amounts are ordered by earlier timestamp first. Final rows are sorted by total amount descending.

Input: ([['A', 'X', '1', '2024-01-01 10:00:00'], ['A', 'Y', '2', '2024-01-01 11:00:00'], ['B', 'Z', '3', '2024-01-01 12:00:00']],)

Expected Output: []

Explanation: No sender-recipient pair appears at least twice, so nothing remains after filtering.

Input: ([['M', 'N', '2', '2024-07-01 10:00:00'], ['M', 'N', '4', '2024-07-01 09:00:00'], ['M', 'N', '3', '2024-07-01 12:00:00'], ['X', 'Y', '1', '2024-07-01 13:00:00']],)

Expected Output: [['M', '09:00:00 - 12:00:00', '4.000000 + 3.000000 + 2.000000 = 9.000000']]

Explanation: The pair M-N appears 3 times, so all 3 are included and all amounts are shown.

Hints

  1. A clean approach is two-pass: first count each (sender, recipient) pair, then aggregate only the rows whose pair count is at least 2.
  2. Because timestamps are in a sortable fixed-width format, you can use normal string min/max for earliest and latest datetimes.

Part 2: Build the MedicalBilling Spark Configuration

This is a language-agnostic version of init_spark_session(). You are given a list of Spark config pairs [key, value]. Build the final configuration by keeping only the last value for each repeated key, then forcibly setting master = 'local' and appName = 'MedicalBilling'. Return the normalized configuration as a list of [key, value] pairs sorted by key ascending.

Constraints

  • 0 <= len(configs) <= 100000.
  • Each config row has exactly 2 strings: [key, value].
  • If a key appears multiple times, the last occurrence wins before the forced values are applied.

Examples

Input: ([['master', 'yarn'], ['spark.sql.shuffle.partitions', '8']],)

Expected Output: [['appName', 'MedicalBilling'], ['master', 'local'], ['spark.sql.shuffle.partitions', '8']]

Explanation: The existing master value is overwritten, while unrelated settings are preserved.

Input: ([['appName', 'OldApp'], ['master', 'cluster'], ['master', 'k8s']],)

Expected Output: [['appName', 'MedicalBilling'], ['master', 'local']]

Explanation: Repeated keys collapse to one value, and then the required values are forced.

Input: ([],)

Expected Output: [['appName', 'MedicalBilling'], ['master', 'local']]

Explanation: With no input settings, only the two required ones remain.

Input: ([['spark.executor.memory', '2g'], ['spark.executor.memory', '4g'], ['foo', 'bar']],)

Expected Output: [['appName', 'MedicalBilling'], ['foo', 'bar'], ['master', 'local'], ['spark.executor.memory', '4g']]

Explanation: The last value for spark.executor.memory is kept.

Hints

  1. A dictionary naturally models configuration overwrites.
  2. Apply the forced 'master' and 'appName' values after processing the input.

Part 3: Filter Medical Claims by Eligibility

You are given list-based versions of two DataFrames. medical_df contains rows [claimId, memberId, paidAmount, fullName]. eligibility_df contains rows [memberId, firstName, lastName]. Return only the rows from medical_df whose memberId exists somewhere in eligibility_df. Preserve the original medical_df row order and keep only the original medical_df columns. Do not duplicate medical rows even if eligibility_df contains repeated memberId values.

Constraints

  • 0 <= len(medical_df), len(eligibility_df) <= 200000.
  • Each medical row has length 4 and each eligibility row has length 3.
  • memberId comparisons are exact string matches.

Examples

Input: ([['c1', 'm1', '99.0', 'stale'], ['c2', 'm2', '50.0', 'old'], ['c3', 'm3', '70.0', 'name']], [['m2', 'Ann', 'Lee'], ['m1', 'Bob', 'Ray'], ['m1', 'Bob', 'Ray']])

Expected Output: [['c1', 'm1', '99.0', 'stale'], ['c2', 'm2', '50.0', 'old']]

Explanation: m1 and m2 are eligible, and repeated eligibility rows do not duplicate medical rows.

Input: ([['c1', 'm9', '10.0', 'x']], [['m1', 'A', 'B']])

Expected Output: []

Explanation: No medical memberId appears in eligibility.

Input: ([], [['m1', 'A', 'B']])

Expected Output: []

Explanation: Filtering an empty medical table returns an empty result.

Input: ([['c1', 'm1', '10.0', 'x'], ['c1', 'm1', '10.0', 'x']], [['m1', 'A', 'B']])

Expected Output: [['c1', 'm1', '10.0', 'x'], ['c1', 'm1', '10.0', 'x']]

Explanation: Duplicate medical rows are preserved if they are eligible.

Hints

  1. This is a semi-join: you only need to know whether a memberId exists, not all matching rows.
  2. Build a set of eligible memberIds first so each medical row can be checked in O(1) average time.

Part 4: Generate Full Names for Medical Claims

You are given list-based versions of medical and eligibility DataFrames. medical_df contains rows [claimId, memberId, paidAmount, fullName]. eligibility_df contains rows [memberId, firstName, lastName]. For every medical row whose memberId exists in eligibility_df, create a new row [claimId, memberId, paidAmount, generatedFullName], where generatedFullName is firstName + ' ' + lastName. Ignore the old fullName value. Preserve the original order of matching medical rows. To keep the output deterministic, if eligibility_df contains the same memberId multiple times, use the first occurrence.

Constraints

  • 0 <= len(medical_df), len(eligibility_df) <= 200000.
  • Each medical row has length 4 and each eligibility row has length 3.
  • If a medical row has no matching memberId in eligibility_df, it is omitted from the result.

Examples

Input: ([['c1', 'm1', '99.0', 'OLD'], ['c2', 'm2', '50.0', '']], [['m1', 'Bob', 'Ray'], ['m2', 'Ann', 'Lee']])

Expected Output: [['c1', 'm1', '99.0', 'Bob Ray'], ['c2', 'm2', '50.0', 'Ann Lee']]

Explanation: Both medical rows match eligibility and receive regenerated full names.

Input: ([['c1', 'm1', '10.0', 'x'], ['c2', 'm3', '20.0', 'y']], [['m1', 'A', 'B']])

Expected Output: [['c1', 'm1', '10.0', 'A B']]

Explanation: The row for m3 is omitted because it has no eligibility match.

Input: ([['c1', 'm1', '10.0', 'stale']], [['m1', 'First', 'Person'], ['m1', 'Second', 'Person']])

Expected Output: [['c1', 'm1', '10.0', 'First Person']]

Explanation: When eligibility repeats a memberId, the first occurrence is used.

Input: ([['c1', 'm1', '10.0', 'stale']], [])

Expected Output: []

Explanation: No eligibility rows means no joined result.

Hints

  1. Build a memberId -> 'first last' lookup once, then scan medical rows in order.
  2. Because the problem asks you to replace the name, the original fullName column should never be reused.

Part 5: Find the Highest Paid Medical Claim

You are given a list-based version of an already enriched medical table. Each row is [claimId, memberId, paidAmount, fullName], where paidAmount is stored as a string. Return the single row with the largest numeric paidAmount. If several rows tie on paidAmount, return the one with the lexicographically smallest memberId. If there is still a tie, keep the earliest such row from the input. If the input is empty, return an empty list.

Constraints

  • 0 <= len(medical_with_name_df) <= 200000.
  • paidAmount is a valid decimal string.
  • memberId comparisons for tie-breaking are standard lexicographic string comparisons.

Examples

Input: ([['c1', 'm2', '99.0', 'Ann Lee'], ['c2', 'm1', '100.0', 'Bob Ray'], ['c3', 'm3', '50.0', 'Cara Doe']],)

Expected Output: ['c2', 'm1', '100.0', 'Bob Ray']

Explanation: The row with paidAmount 100.0 is the maximum.

Input: ([['c1', 'm2', '100.0', 'A'], ['c2', 'm1', '100.0', 'B'], ['c3', 'm3', '99.0', 'C']],)

Expected Output: ['c2', 'm1', '100.0', 'B']

Explanation: Two rows tie on amount, so the lexicographically smaller memberId 'm1' wins.

Input: ([],)

Expected Output: []

Explanation: Empty input returns an empty list.

Input: ([['c1', 'm1', '10.0', 'First'], ['c2', 'm1', '10.0', 'Second']],)

Expected Output: ['c1', 'm1', '10.0', 'First']

Explanation: When amount and memberId both tie, keep the earliest row.

Hints

  1. Compare paidAmount as a number, not as a string, otherwise '99.0' could incorrectly beat '100.0'.
  2. A single linear scan is enough if you keep track of the best row seen so far.

Part 6: Find the Total Paid Amount

You are given a list-based version of medical_df. Each row is [claimId, memberId, paidAmount, fullName], where paidAmount is stored as a string such as '99.0' or '100.0'. Cast each paidAmount to a numeric value, sum all rows, and return the total as a Python integer. You may assume every amount represents a whole-number currency value even if it includes a decimal suffix like '.0'.

Constraints

  • 0 <= len(medical_df) <= 200000.
  • paidAmount is a valid decimal string representing a whole-number value.
  • The final answer fits in standard integer ranges.

Examples

Input: ([['c1', 'm1', '99.0', 'x'], ['c2', 'm2', '1.0', 'y']],)

Expected Output: 100

Explanation: 99.0 + 1.0 = 100.

Input: ([],)

Expected Output: 0

Explanation: The sum of an empty list is 0.

Input: ([['c1', 'm1', '0.0', ''], ['c2', 'm2', '100.0', ''], ['c3', 'm3', '250.0', '']],)

Expected Output: 350

Explanation: All amounts are added numerically.

Input: ([['c1', 'm1', '7', ''], ['c2', 'm2', '8.0', '']],)

Expected Output: 15

Explanation: Integer-looking and decimal-suffixed strings both count as numeric values.

Hints

  1. Do not concatenate or compare the amount strings directly; convert them to numbers first.
  2. Because the amounts are guaranteed to represent whole-number values, the final numeric sum can be converted to int.
Last updated: May 30, 2026

Loading coding console...

PracHub

Master your tech interviews with 8,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.

Related Coding Questions

  • Implement Portfolio Trading Optimizer - Point72 (hard)
  • Implement Election Report and Banking Pipeline - Point72 (hard)
  • Find the Smallest String After One Decrement - Point72 (medium)
  • Implement composition and mixin utilities - Point72 (hard)
  • Find kth missing integer and redundant operations - Point72 (easy)