How do I approach Data Manipulation (SQL/Python) interview questions?

Data Manipulation (SQL/Python) questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master data manipulation (sql/python) interviews.

What difficulty level is this interview question?

This is a Medium difficulty Data Manipulation (SQL/Python) question, commonly asked during Onsite rounds at Gemini.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Gemini during technical interviews.

Write SQL/Python for ACH fraud analytics | Gemini Interview Question

Q: Write SQL/Python for ACH fraud analytics

This question evaluates time-windowed analytics, SQL window functions and joins, ranking and event correlation, streaming de-duplication in Python, and domain knowledge of ACH payment and return behaviors within the Data Manipulation (SQL/Python) category.

As of today (2025-09-01), use the following schema and tiny samples to answer. Provide SQL for the SQL parts and Python for the Python part.

Schema

users(user_id INT, created_at TIMESTAMP, country TEXT)
transactions(txn_id TEXT, user_id INT, direction TEXT, method TEXT, amount_cents INT, created_at TIMESTAMP, status TEXT)
ach_returns(txn_id TEXT, return_code TEXT, returned_at TIMESTAMP)
devices(device_id TEXT, user_id INT, fingerprint TEXT, first_seen_at TIMESTAMP)
logins(login_id TEXT, user_id INT, device_id TEXT, ip TEXT, ts TIMESTAMP)

Sample rows users user_id | created_at | country 1 | 2025-06-10 08:00:00 | US 2 | 2025-08-01 12:00:00 | US 3 | 2025-08-20 09:00:00 | CA

transactions txn_id | user_id | direction | method | amount_cents | created_at | status t1 | 1 | credit | ACH | 50000 | 2025-08-31 10:00:00 | posted t2 | 1 | credit | ACH | 40000 | 2025-08-31 18:00:00 | posted t3 | 1 | credit | ACH | 30000 | 2025-09-01 09:30:00 | posted t4 | 1 | debit | ACH | 20000 | 2025-09-02 12:00:00 | posted t5 | 2 | credit | ACH | 150000 | 2025-08-30 02:00:00 | posted t6 | 2 | credit | ACH | 150000 | 2025-09-01 02:30:00 | posted t7 | 3 | credit | CARD | 70000 | 2025-09-01 11:00:00 | posted

ach_returns txn_id | return_code | returned_at t2 | R10 | 2025-09-03 08:00:00 t5 | R01 | 2025-09-05 09:00:00

devices device_id | user_id | fingerprint | first_seen_at d1 | 1 | abc123 | 2025-08-01 09:00:00 d2 | 1 | abc124 | 2025-08-31 17:30:00 d3 | 2 | zyx999 | 2025-08-29 22:00:00 d4 | 3 | abc123 | 2025-09-01 10:45:00

logins login_id | user_id | device_id | ip | ts l1 | 1 | d1 | 1.1.1.1 | 2025-08-31 09:50:00 l2 | 1 | d2 | 1.1.1.1 | 2025-08-31 17:40:00 l3 | 1 | d1 | 2.2.2.2 | 2025-09-01 09:25:00 l4 | 2 | d3 | 3.3.3.3 | 2025-09-01 02:25:00 l5 | 3 | d4 | 4.4.4.4 | 2025-09-01 10:50:00

Tasks A) SQL (window + joins): For each user, find the earliest rolling 24h window that ends on or before 2025-09-01 23:59:59 in which they have at least 3 ACH credit transactions and at least one of those credits is returned within 5 days of its created_at. Output: user_id, window_start, window_end, num_ach_credits_in_window, num_returns_within_5d, net_exposure_cents. Define net_exposure_cents as sum(amount_cents) of the ACH credits in that 24h window minus sum(amount_cents) of any ACH debits by the same user occurring between window_start and the earliest return timestamp for those credits. Use only transactions and ach_returns. B) SQL (ranking + window): As of 2025-09-01, for each user, rank their devices by the count of returned ACH credits linked to that device in the last 30 days (i.e., credits within 2025-08-02…2025-09-01 whose txn_id appears in ach_returns with returned_at in the same interval, where the device is the most recent login within 60 minutes before the credit). Output the top device per user with columns: user_id, device_id, fingerprint, returned_count_30d. C) Python: Given logins (streaming, possibly out of order by ≤5 minutes), collapse device fingerprints that differ by exactly one character (case-sensitive) into a canonical fingerprint (e.g., union-find over Hamming distance 1). Return all clusters where the canonical fingerprint is used by ≥3 distinct users within any 7-day window ending on 2025-09-01. Output list of tuples (canonical_fingerprint, window_start, window_end, distinct_user_count). State time/space complexity and how you handle late events and ties.