This question evaluates understanding of pandas data structures (Series vs DataFrame), SQL filtering and aggregation semantics (WHERE vs HAVING), and the ability to construct SQL queries to identify duplicate records.
What is the difference between a pandas Series and a pandas DataFrame?
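As a minimal sketch of the distinction, assuming pandas is installed: a Series is a one-dimensional labeled array, while a DataFrame is a two-dimensional table whose columns are Series that share one row index. The sample values and the variable names `emails` and `users_df` are illustrative, not part of the question.

```python
import pandas as pd

# A Series: one dimension, one index, one dtype.
emails = pd.Series(["a@x.com", "b@x.com", "a@x.com"], name="email")

# A DataFrame: two dimensions, a shared row index, and named columns
# that can each carry their own dtype.
users_df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "email": ["a@x.com", "b@x.com", "a@x.com"],
})

# Selecting a single column label yields a Series;
# selecting a list of labels yields a DataFrame.
one_column = users_df["email"]
sub_frame = users_df[["email"]]

print(emails.ndim, users_df.ndim)
print(type(one_column).__name__, type(sub_frame).__name__)
```

The single-bracket versus double-bracket selection is a common place where the two types get confused in practice.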
What is the difference between WHERE and HAVING in SQL? How do they interact with GROUP BY and aggregates?
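The following sketch illustrates the ordering: WHERE filters individual rows before grouping, and HAVING filters groups after aggregates are computed. It runs the SQL through Python's built-in sqlite3 module. The sample rows are made up, and the column types are simplified to SQLite's INTEGER and TEXT as an assumption for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified SQLite types; timestamps are stored as ISO-8601 text (assumption).
conn.execute(
    "CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "a@x.com", "2024-01-01"),
        (2, "a@x.com", "2024-02-01"),
        (3, "b@x.com", "2024-03-01"),
        (4, "b@x.com", "2023-12-01"),
        (5, "c@x.com", "2024-04-01"),
    ],
)

# HAVING alone: every row is grouped, then groups with COUNT(*) > 1 are kept.
having_only = conn.execute("""
    SELECT email, COUNT(*) AS n
    FROM users
    GROUP BY email
    HAVING COUNT(*) > 1
    ORDER BY email
""").fetchall()

# WHERE + HAVING: rows from before 2024 are dropped first, so b@x.com
# contributes only one row to its group and no longer passes HAVING.
where_then_having = conn.execute("""
    SELECT email, COUNT(*) AS n
    FROM users
    WHERE created_at >= '2024-01-01'
    GROUP BY email
    HAVING COUNT(*) > 1
    ORDER BY email
""").fetchall()

print(having_only)
print(where_then_having)
```

An aggregate such as COUNT(*) cannot appear in WHERE, because the groups do not exist yet when WHERE is evaluated.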
Assume a table of users:
users(user_id BIGINT PRIMARY KEY, email VARCHAR, created_at TIMESTAMP)
A. Return emails that appear more than once, with their duplicate count (email, dup_count).
B. Return the full rows for users whose email is duplicated (user_id, email, created_at).
Treat NULL emails as non-duplicates unless specified otherwise.
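One possible approach to parts A and B, sketched with Python's built-in sqlite3 module so the queries can be run directly. The sample rows are invented, the column types are simplified for SQLite, and dup_count is assumed to mean the total number of rows sharing an email. Because GROUP BY places all NULLs into a single group, part A excludes them explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified SQLite types standing in for BIGINT / VARCHAR / TIMESTAMP.
conn.execute(
    "CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "a@x.com", "2024-01-01"),
        (2, "b@x.com", "2024-01-02"),
        (3, "a@x.com", "2024-01-03"),
        (4, None, "2024-01-04"),
        (5, None, "2024-01-05"),
        (6, "c@x.com", "2024-01-06"),
        (7, "c@x.com", "2024-01-07"),
        (8, "c@x.com", "2024-01-08"),
    ],
)

# Part A: emails that occur more than once, with their counts.
# NULL emails are filtered out so they are not reported as duplicates.
part_a = conn.execute("""
    SELECT email, COUNT(*) AS dup_count
    FROM users
    WHERE email IS NOT NULL
    GROUP BY email
    HAVING COUNT(*) > 1
    ORDER BY email
""").fetchall()

# Part B: full rows whose email is duplicated, found by joining
# back to the set of duplicated emails from part A.
part_b = conn.execute("""
    SELECT u.user_id, u.email, u.created_at
    FROM users AS u
    JOIN (
        SELECT email
        FROM users
        WHERE email IS NOT NULL
        GROUP BY email
        HAVING COUNT(*) > 1
    ) AS d ON d.email = u.email
    ORDER BY u.email, u.user_id
""").fetchall()

print(part_a)
print([row[0] for row in part_b])
```

On engines that support window functions, part B could instead use COUNT(*) OVER (PARTITION BY email) in a subquery and filter on that count; the join form is shown here because it reuses the part A query unchanged.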