Explain Pandas and SQL Basics
Company: Rbcroyalbank
Role: Data Engineer
Category: Data Manipulation (SQL/Python)
Difficulty: easy
Interview Round: Technical Screen
You are interviewing for a Data Engineer internship. Answer the following short data-manipulation questions:
1. In pandas, what is the difference between a `Series` and a `DataFrame`? Compare their dimensionality, indexing behavior, and common use cases.
2. In SQL, what is the difference between `WHERE` and `HAVING`? Explain when each filter is applied and whether aggregate expressions can be used.
3. Consider a table `transactions_raw` with the following schema:
- `ingest_id BIGINT`
- `account_id BIGINT`
- `transaction_id BIGINT`
- `amount DECIMAL(12,2)`
- `transaction_ts TIMESTAMP`
Assume `transaction_ts` is stored in UTC, and there are no foreign-key relationships relevant to this task. Define a duplicate as multiple rows with the same `(account_id, transaction_id, amount, transaction_ts)`. Write a SQL query that returns all duplicate groups with the output columns: `account_id, transaction_id, amount, transaction_ts, duplicate_count`.
Quick Answer: This question evaluates understanding of pandas and SQL fundamentals—specifically pandas Series vs DataFrame distinctions, SQL WHERE vs HAVING semantics, and practical duplicate-detection in transactional data—targeting data manipulation competency for Data Engineer roles.