Audit and onboard unfamiliar datasets safely
Company: Expedia
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Onsite
You inherit unfamiliar hotel search data with sparse documentation. Provide a concrete, ordered checklist to:
a) discover tables/columns and verify primary/foreign keys;
b) profile distributions, missingness, and timezone/locale/currency issues;
c) detect duplicates and many-to-many join inflation across searches, impressions, clicks, bookings, and cancellations;
d) validate event sequencing (search → impression → click → booking → cancellation) with watermarking and late-arrival windows;
e) compute metrics correctly (e.g., bookings per search within 7 days; margin = price − cost; GMV vs. contribution margin);
f) write data quality tests/data contracts (not-null, uniqueness, referential integrity, numeric ranges).
Include at least three pitfalls specific to travel data (e.g., multi-room bookings, partial cancellations/modifications, rebookings, cross-currency FX at booking vs. stay date, children vs. adults counts) and how you’d detect each.
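As an illustration of the key-verification and join-inflation checks the question asks about, here is a minimal sketch using Python's built-in sqlite3 module. The table names (`clicks`, `bookings`), their columns, the sample rows, and the helper `key_is_unique` are all hypothetical and chosen only for this example; a real schema would need its own discovery step first.

```python
import sqlite3

# Hypothetical toy schema: search 100 has two clicks and two bookings.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clicks   (click_id INTEGER, search_id INTEGER);
    CREATE TABLE bookings (booking_id INTEGER, search_id INTEGER);
    INSERT INTO clicks   VALUES (1, 100), (2, 100), (3, 101);
    INSERT INTO bookings VALUES (10, 100), (11, 100), (12, 101);
""")

def key_is_unique(conn, table, key):
    """Return True if no value of `key` repeats in `table`.

    Identifiers are interpolated directly, so pass trusted names only.
    """
    dup_groups = conn.execute(
        f"SELECT COUNT(*) FROM ("
        f"  SELECT {key} FROM {table} GROUP BY {key} HAVING COUNT(*) > 1"
        f")"
    ).fetchone()[0]
    return dup_groups == 0

# Candidate primary key holds; search_id is not unique on the click side.
click_id_unique = key_is_unique(conn, "clicks", "click_id")
search_id_unique = key_is_unique(conn, "clicks", "search_id")

# Joining two tables that are both many-per-search multiplies rows.
joined_rows, distinct_bookings = conn.execute("""
    SELECT COUNT(*), COUNT(DISTINCT b.booking_id)
    FROM clicks c
    JOIN bookings b ON b.search_id = c.search_id
""").fetchone()

# A ratio above 1 signals fan-out: bookings would be double counted.
inflation = joined_rows / distinct_bookings
print(click_id_unique, search_id_unique, joined_rows, distinct_bookings)
```

Comparing `COUNT(*)` with `COUNT(DISTINCT ...)` on the intended grain before and after each join is one cheap way to catch fan-out early; pre-aggregating one side to the join key before joining is a common remedy.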
Quick Answer: This question evaluates a data scientist's competency in dataset discovery, schema and key verification, profiling for distributions/missingness/timezones/currencies, duplicate and join-inflation detection, event-sequence validation, accurate metric computation, and the design of data-quality tests/data contracts within SQL/Python workflows.
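For the metric part of the question, the sketch below shows one possible way to compute bookings per search within 7 days without fan-out, alongside GMV and a simple margin. It assumes hypothetical `searches` and `bookings` tables, timestamps stored as UTC text in a single consistent format, a single currency, and margin defined as price − cost as in the prompt; whether the metric means bookings per search or the share of searches with at least one booking is itself an assumption to confirm, so both are computed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE searches (search_id INTEGER PRIMARY KEY, search_ts TEXT);
    CREATE TABLE bookings (booking_id INTEGER PRIMARY KEY, search_id INTEGER,
                           booking_ts TEXT, price REAL, cost REAL);
    INSERT INTO searches VALUES
        (1, '2024-01-01 00:00:00'),
        (2, '2024-01-01 00:00:00'),
        (3, '2024-01-02 00:00:00');
    INSERT INTO bookings VALUES
        (10, 1, '2024-01-03 00:00:00', 200, 150),
        (11, 1, '2024-01-04 00:00:00', 100,  80),
        (12, 2, '2024-01-10 00:00:00', 300, 250);  -- 9 days after its search
""")

# Aggregate to one row per search first, so the denominator is never inflated.
searches, bookings_7d, converted = conn.execute("""
    WITH per_search AS (
        SELECT s.search_id, COUNT(b.booking_id) AS bookings_7d
        FROM searches s
        LEFT JOIN bookings b
          ON b.search_id = s.search_id
         AND b.booking_ts >= s.search_ts
         AND julianday(b.booking_ts) - julianday(s.search_ts) <= 7
        GROUP BY s.search_id
    )
    SELECT COUNT(*), SUM(bookings_7d), SUM(bookings_7d > 0)
    FROM per_search
""").fetchone()

bookings_per_search = bookings_7d / searches   # volume definition
conversion_rate = converted / searches         # share-of-searches definition

# GMV is gross price; margin subtracts cost (single currency assumed).
gmv, margin = conn.execute(
    "SELECT SUM(price), SUM(price - cost) FROM bookings"
).fetchone()

print(searches, bookings_7d, converted, gmv, margin)
```

Putting the window condition in the `LEFT JOIN` clause rather than in `WHERE` keeps searches with no qualifying booking in the denominator, which is the usual source of overstated conversion rates.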