This question evaluates familiarity with pandas data structures (Series vs DataFrame), SQL filtering semantics (WHERE vs HAVING), and practical SQL aggregation for detecting duplicate records, assessing competencies in data manipulation, aggregation, and basic data quality checks.
You are interviewing for a Data Engineer co-op/intern role. Answer the following short technical questions.
Python / pandas:
1. What is the difference between a pandas Series and a pandas DataFrame? Give one practical example of when you would use each.
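To make the Series vs DataFrame distinction concrete, here is a minimal sketch of the kind of example an answer might use. The data and variable names (`daily_signups`, `events`) are made up for illustration and are not part of the question.

```python
import pandas as pd

# A Series is a single labeled, one-dimensional column of values,
# e.g. one metric tracked per day.
daily_signups = pd.Series([12, 15, 9], index=["mon", "tue", "wed"], name="signups")

# A DataFrame is a two-dimensional table: several columns sharing one index,
# e.g. event records with multiple attributes per row.
events = pd.DataFrame(
    {
        "customer_email": ["a@example.com", "b@example.com", "a@example.com"],
        "source_system": ["web", "crm", "web"],
    }
)

# Selecting a single column of a DataFrame yields a Series.
emails = events["customer_email"]

print(type(emails).__name__)   # Series
print(events.shape)            # (3, 2)
print(daily_signups.mean())    # 12.0
```

A Series suits a single variable such as one time series or one column of IDs. A DataFrame suits tabular data where rows carry several related fields.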
SQL concepts:
2. What is the difference between WHERE and HAVING in SQL, and when should each be used?
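One way to illustrate the WHERE vs HAVING distinction is to use both in a single query. The sketch below runs the SQL through Python's built-in `sqlite3` module so it is self-contained. The `orders` table and its rows are hypothetical and exist only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table used only to illustrate the two clauses.
conn.execute("CREATE TABLE orders (customer TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("alice", "paid"),
        ("alice", "paid"),
        ("alice", "cancelled"),
        ("bob", "paid"),
    ],
)

# WHERE filters individual rows before grouping.
# HAVING filters whole groups after aggregation.
rows = conn.execute(
    """
    SELECT customer, COUNT(*) AS paid_orders
    FROM orders
    WHERE status = 'paid'
    GROUP BY customer
    HAVING COUNT(*) > 1
    """
).fetchall()
conn.close()

print(rows)  # [('alice', 2)]
```

Here the cancelled order is dropped by WHERE before any counting happens, and bob's single paid order is dropped by HAVING because his group's count is not greater than 1.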
SQL query task:
3. You have a table customer_events with the following columns:
event_id (BIGINT)
customer_email (STRING)
source_system (STRING)
created_at (TIMESTAMP)
Assume created_at is stored in UTC. Write a SQL query to find duplicate customer_email values in the table. A duplicate means the same customer_email appears more than once in the full table. Return these output columns:
customer_email
duplicate_count
Only include emails that appear more than once.
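A minimal sketch of one possible query for this task, run against an in-memory SQLite database so the result can be checked. The sample rows are invented, SQLite's INTEGER and TEXT types stand in for BIGINT, STRING, and TIMESTAMP, and the ORDER BY is an addition for deterministic output rather than a stated requirement. The sketch also assumes emails are compared exactly as stored, with no case or whitespace normalization.

```python
import sqlite3

# Group by email, count rows per group, keep only groups with more than one row.
DUPLICATE_EMAILS_SQL = """
SELECT customer_email, COUNT(*) AS duplicate_count
FROM customer_events
GROUP BY customer_email
HAVING COUNT(*) > 1
ORDER BY customer_email
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_events ("
    "event_id INTEGER, customer_email TEXT, source_system TEXT, created_at TEXT)"
)
# Invented sample data: a@ appears twice, b@ once, c@ three times.
conn.executemany(
    "INSERT INTO customer_events VALUES (?, ?, ?, ?)",
    [
        (1, "a@example.com", "web", "2024-01-01 10:00:00"),
        (2, "b@example.com", "crm", "2024-01-01 11:00:00"),
        (3, "a@example.com", "crm", "2024-01-02 09:30:00"),
        (4, "c@example.com", "web", "2024-01-02 12:00:00"),
        (5, "c@example.com", "web", "2024-01-03 08:15:00"),
        (6, "c@example.com", "app", "2024-01-03 08:20:00"),
    ],
)

duplicates = conn.execute(DUPLICATE_EMAILS_SQL).fetchall()
conn.close()

print(duplicates)  # [('a@example.com', 2), ('c@example.com', 3)]
```

The filter belongs in HAVING rather than WHERE because it depends on the aggregated count, which does not exist until after GROUP BY has run.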