This question evaluates the ability to define and compute shop-level visibility KPIs in SQL, testing metric design (visibility_rate and impression_share), deduplication, session and window filtering, and data-quality concerns such as late-arriving events and bot traffic.

You own the 'shop visibility' KPI for a marketplace. Define a precise metric and write SQL to compute it over the last 7 days (use today = 2025-09-01, so the window is 2025-08-26 to 2025-09-01 inclusive). Provide two related definitions: (A) visibility_rate = unique US users who saw ≥1 active product from a shop ÷ active US users; (B) impression_share = impressions of a shop's active products ÷ total impressions of all active products. Exclude deactivated products, deduplicate repeated impressions of the same product by the same user within a session, and restrict the denominator to 'active users' (users with ≥1 session in the window). Return: shop_id, visibility_rate, impression_share, unique_viewers, active_users, window_start, window_end. Then: (1) show how you'd parameterize this per surface (feed vs shoppage); (2) handle late-arriving impressions and bot traffic; (3) outline indexes or clustering to keep the query efficient at 1B+ rows.
Schema and small ASCII samples you must use:

users(user_id INT, country STRING)
+---------+---------+
| user_id | country |
+---------+---------+
| 1       | US      |
| 2       | US      |
| 3       | CA      |
| 4       | US      |
+---------+---------+

shops(shop_id INT, name STRING)
+---------+-------+
| shop_id | name  |
+---------+-------+
| 10      | Alpha |
| 20      | Beta  |
+---------+-------+

products(product_id INT, shop_id INT, is_active BOOLEAN, created_at DATE)
+------------+---------+-----------+------------+
| product_id | shop_id | is_active | created_at |
+------------+---------+-----------+------------+
| 100        | 10      | true      | 2025-08-01 |
| 101        | 10      | true      | 2025-08-20 |
| 102        | 20      | true      | 2025-07-15 |
| 103        | 20      | false     | 2025-06-10 |
+------------+---------+-----------+------------+

impressions(impression_id BIGINT, ts TIMESTAMP, user_id INT, product_id INT, surface STRING)
+---------------+---------------------+---------+------------+----------+
| impression_id | ts                  | user_id | product_id | surface  |
+---------------+---------------------+---------+------------+----------+
| 1             | 2025-08-26 10:00:00 | 1       | 100        | feed     |
| 2             | 2025-08-27 11:00:00 | 2       | 100        | feed     |
| 3             | 2025-08-27 12:00:00 | 2       | 101        | shoppage |
| 4             | 2025-08-28 09:00:00 | 4       | 102        | feed     |
| 5             | 2025-09-01 13:00:00 | 1       | 102        | feed     |
+---------------+---------------------+---------+------------+----------+

sessions(session_id BIGINT, user_id INT, session_start TIMESTAMP)
+------------+---------+---------------------+
| session_id | user_id | session_start       |
+------------+---------+---------------------+
| 500        | 1       | 2025-08-28 08:00:00 |
| 501        | 2       | 2025-08-29 09:00:00 |
| 502        | 3       | 2025-08-30 10:00:00 |
| 503        | 4       | 2025-08-31 12:00:00 |
+------------+---------+---------------------+
Assume a SQL dialect with CTEs and DATE/TIMESTAMP functions. Write the full query for (A) and (B) and briefly explain how you'd adapt it for incremental daily jobs and backfills.
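Reference sketch for (A) and (B). This is a minimal sketch of one way the metrics could be computed on the schema above, assuming a PostgreSQL/BigQuery-style dialect. Because impressions carry no session_id and sessions have no end time, the sketch attributes each impression to the latest session that started at or before it, and it applies the per-session deduplication to both metrics; both choices are assumptions a candidate should state explicitly rather than the one "correct" reading.

WITH params AS (
    SELECT DATE '2025-08-26' AS window_start,
           DATE '2025-09-01' AS window_end
),
active_us_users AS (
    -- Denominator for (A): US users with at least one session in the window.
    SELECT DISTINCT s.user_id
    FROM sessions s
    JOIN users u ON u.user_id = s.user_id
    CROSS JOIN params p
    WHERE u.country = 'US'
      AND CAST(s.session_start AS DATE) BETWEEN p.window_start AND p.window_end
),
scoped_impressions AS (
    -- Impressions of active products inside the window, each attributed to the
    -- latest session that started at or before it (derived, since impressions
    -- carry no session_id).
    SELECT i.impression_id,
           i.user_id,
           i.product_id,
           pr.shop_id,
           MAX(s.session_start) AS session_start
    FROM impressions i
    JOIN products pr ON pr.product_id = i.product_id AND pr.is_active
    CROSS JOIN params p
    LEFT JOIN sessions s ON s.user_id = i.user_id AND s.session_start <= i.ts
    WHERE CAST(i.ts AS DATE) BETWEEN p.window_start AND p.window_end
    GROUP BY i.impression_id, i.user_id, i.product_id, pr.shop_id
),
deduped AS (
    -- One row per (user, product, session): repeated impressions of the same
    -- product by the same user within a session collapse to one.
    SELECT DISTINCT user_id, product_id, shop_id, session_start
    FROM scoped_impressions
),
shop_viewers AS (
    -- Numerator for (A): distinct active US users who saw >= 1 active product of the shop.
    SELECT d.shop_id, COUNT(DISTINCT d.user_id) AS unique_viewers
    FROM deduped d
    JOIN active_us_users a ON a.user_id = d.user_id
    GROUP BY d.shop_id
),
shop_impressions AS (
    -- Numerator for (B), using the deduplicated impressions.
    SELECT shop_id, COUNT(*) AS shop_impressions
    FROM deduped
    GROUP BY shop_id
),
totals AS (
    SELECT (SELECT COUNT(*) FROM deduped)         AS total_impressions,
           (SELECT COUNT(*) FROM active_us_users) AS active_users
)
SELECT sh.shop_id,
       COALESCE(v.unique_viewers, 0)    * 1.0 / NULLIF(t.active_users, 0)      AS visibility_rate,
       COALESCE(si.shop_impressions, 0) * 1.0 / NULLIF(t.total_impressions, 0) AS impression_share,
       COALESCE(v.unique_viewers, 0) AS unique_viewers,
       t.active_users,
       p.window_start,
       p.window_end
FROM shops sh
CROSS JOIN params p
CROSS JOIN totals t
LEFT JOIN shop_viewers v      ON v.shop_id  = sh.shop_id
LEFT JOIN shop_impressions si ON si.shop_id = sh.shop_id
ORDER BY sh.shop_id;

On the sample data this returns active_users = 3 (users 1, 2, 4) and unique_viewers = 2 for both shops, so visibility_rate ≈ 0.667 for each, with impression_share = 0.6 for shop 10 and 0.4 for shop 20; the dedup step changes nothing here because no user sees the same product twice. Note that the numerator is also restricted to active US users so the rate cannot exceed 1, and that the question leaves open whether (B) uses raw or deduplicated impressions; the sketch applies the dedup consistently.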
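For the follow-ups, a hedged sketch of one incremental/backfill-friendly shape. The table daily_shop_visibility, the bot_users list, and the run_date parameter mentioned in the comments are hypothetical names, not part of the given schema, and approximate-distinct (HLL) functions vary by warehouse.

-- Daily partial aggregate, one partition per (event_date, surface, shop_id); the
-- 7-day KPI is then a re-aggregation over 7 partitions, and per-surface variants
-- come from filtering this table on surface at read time. Distinct viewers are
-- not additive across days, so either store an approximate-distinct sketch (HLL)
-- per partition or accept a re-scan of deduplicated rows for exact counts.
-- Backfills re-run this statement for the affected dates; scheduling it with a
-- trailing reprocess window (e.g. the last 2-3 event dates each run) absorbs
-- late-arriving impressions without a separate correction job.
INSERT INTO daily_shop_visibility (event_date, surface, shop_id, impressions, distinct_viewers)
SELECT CAST(i.ts AS DATE)        AS event_date,
       i.surface                 AS surface,
       pr.shop_id                AS shop_id,
       COUNT(*)                  AS impressions,      -- apply the same per-session dedup first in practice
       COUNT(DISTINCT i.user_id) AS distinct_viewers  -- or an HLL sketch so partitions can be merged
FROM impressions i
JOIN products pr      ON pr.product_id = i.product_id AND pr.is_active
LEFT JOIN bot_users b ON b.user_id = i.user_id          -- hypothetical curated bot list
WHERE b.user_id IS NULL                                 -- drop bot traffic
  AND CAST(i.ts AS DATE) = DATE '2025-09-01'            -- run_date: parameterize per partition
GROUP BY CAST(i.ts AS DATE), i.surface, pr.shop_id;

At 1B+ rows the main lever is pruning the impression scan: partition (or cluster) impressions by DATE(ts) and secondarily by surface or product_id, keep products indexed on (product_id, is_active, shop_id) so the active-product join stays cheap, and index sessions on (user_id, session_start) so both session attribution and the active-user denominator avoid full scans.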