This question evaluates understanding of pandas and SQL fundamentals—specifically pandas Series vs DataFrame distinctions, SQL WHERE vs HAVING semantics, and practical duplicate-detection in transactional data—targeting data manipulation competency for Data Engineer roles.
You are interviewing for a Data Engineer internship. Answer the following short data-manipulation questions:
Series
and a
DataFrame
? Compare their dimensionality, indexing behavior, and common use cases.
WHERE
and
HAVING
? Explain when each filter is applied and whether aggregate expressions can be used.
transactions_raw
with the following schema:
ingest_id BIGINT
account_id BIGINT
transaction_id BIGINT
amount DECIMAL(12,2)
transaction_ts TIMESTAMP
transaction_ts
is stored in UTC, and there are no foreign-key relationships relevant to this task. Define a duplicate as multiple rows with the same
(account_id, transaction_id, amount, transaction_ts)
. Write a SQL query that returns all duplicate groups with the output columns:
account_id, transaction_id, amount, transaction_ts, duplicate_count
.