Write SQL/Python for ACH fraud analytics
Company: Gemini
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Onsite
As of today (2025-09-01), use the following schema and tiny samples to answer. Provide SQL for the SQL parts and Python for the Python part.
Schema
- users(user_id INT, created_at TIMESTAMP, country TEXT)
- transactions(txn_id TEXT, user_id INT, direction TEXT, method TEXT, amount_cents INT, created_at TIMESTAMP, status TEXT)
- ach_returns(txn_id TEXT, return_code TEXT, returned_at TIMESTAMP)
- devices(device_id TEXT, user_id INT, fingerprint TEXT, first_seen_at TIMESTAMP)
- logins(login_id TEXT, user_id INT, device_id TEXT, ip TEXT, ts TIMESTAMP)
Sample rows
users
user_id | created_at | country
1 | 2025-06-10 08:00:00 | US
2 | 2025-08-01 12:00:00 | US
3 | 2025-08-20 09:00:00 | CA
transactions
txn_id | user_id | direction | method | amount_cents | created_at | status
t1 | 1 | credit | ACH | 50000 | 2025-08-31 10:00:00 | posted
t2 | 1 | credit | ACH | 40000 | 2025-08-31 18:00:00 | posted
t3 | 1 | credit | ACH | 30000 | 2025-09-01 09:30:00 | posted
t4 | 1 | debit | ACH | 20000 | 2025-09-02 12:00:00 | posted
t5 | 2 | credit | ACH | 150000 | 2025-08-30 02:00:00 | posted
t6 | 2 | credit | ACH | 150000 | 2025-09-01 02:30:00 | posted
t7 | 3 | credit | CARD | 70000 | 2025-09-01 11:00:00 | posted
ach_returns
txn_id | return_code | returned_at
t2 | R10 | 2025-09-03 08:00:00
t5 | R01 | 2025-09-05 09:00:00
devices
device_id | user_id | fingerprint | first_seen_at
d1 | 1 | abc123 | 2025-08-01 09:00:00
d2 | 1 | abc124 | 2025-08-31 17:30:00
d3 | 2 | zyx999 | 2025-08-29 22:00:00
d4 | 3 | abc123 | 2025-09-01 10:45:00
logins
login_id | user_id | device_id | ip | ts
l1 | 1 | d1 | 1.1.1.1 | 2025-08-31 09:50:00
l2 | 1 | d2 | 1.1.1.1 | 2025-08-31 17:40:00
l3 | 1 | d1 | 2.2.2.2 | 2025-09-01 09:25:00
l4 | 2 | d3 | 3.3.3.3 | 2025-09-01 02:25:00
l5 | 3 | d4 | 4.4.4.4 | 2025-09-01 10:50:00
Tasks
A) SQL (window + joins): For each user, find the earliest rolling 24h window that ends on or before 2025-09-01 23:59:59 in which they have at least 3 ACH credit transactions and at least one of those credits is returned within 5 days of its created_at. Output: user_id, window_start, window_end, num_ach_credits_in_window, num_returns_within_5d, net_exposure_cents. Define net_exposure_cents as sum(amount_cents) of the ACH credits in that 24h window minus sum(amount_cents) of any ACH debits by the same user occurring between window_start and the earliest return timestamp for those credits. Use only transactions and ach_returns.
B) SQL (ranking + window): As of 2025-09-01, for each user, rank their devices by the count of returned ACH credits linked to that device in the last 30 days (i.e., credits within 2025-08-02…2025-09-01 whose txn_id appears in ach_returns with returned_at in the same interval, where the device is the most recent login within 60 minutes before the credit). Output the top device per user with columns: user_id, device_id, fingerprint, returned_count_30d.
C) Python: Given logins (streaming, possibly out of order by ≤5 minutes), collapse device fingerprints that differ by exactly one character (case-sensitive) into a canonical fingerprint (e.g., union-find over Hamming distance 1). Return all clusters where the canonical fingerprint is used by ≥3 distinct users within any 7-day window ending on 2025-09-01. Output list of tuples (canonical_fingerprint, window_start, window_end, distinct_user_count). State time/space complexity and how you handle late events and ties.
Quick Answer: This question evaluates time-windowed analytics, SQL window functions and joins, ranking and event correlation, streaming de-duplication in Python, and domain knowledge of ACH payment and return behaviors within the Data Manipulation (SQL/Python) category.