Compute churn metrics and rank top students
Company: Flatiron Health
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
You have two tasks.
Part A (R): You receive a Customers CSV with columns: id (INT), signup_date (YYYY-MM-DD), last_active_date (YYYY-MM-DD), churned (0/1/NULL), pay (DECIMAL/NULL). Clean and compute metrics under these rules: (1) Drop rows where id is NULL/blank; (2) If pay is NULL, set pay = 0; (3) If churned is NULL but pay > 0, set churned = 0 (treat as active); (4) Define active_days = DATEDIFF(last_active_date, signup_date); if last_active_date < signup_date, set active_days = 0; (5) Treat last_active_date as the churn date when churned = 1. Produce: (a) the average active_days across all remaining customers; and (b) the average pay among customers with churned = 1. Return both rounded to 2 decimals and explain briefly how each rule is enforced in your code. Use idiomatic tidyverse R.
Sample data (for format clarity):
Customers
+-----+-------------+------------------+---------+------+
| id | signup_date | last_active_date | churned | pay |
+-----+-------------+------------------+---------+------+
| 1 | 2025-01-10 | 2025-03-05 | 1 | 99 |
| 2 | 2025-02-01 | 2025-02-20 | NULL | 15 |
| 3 | 2025-01-15 | 2025-01-18 | 0 | NULL |
| NULL| 2025-02-10 | 2025-02-12 | 1 | 50 |
| 4 | 2025-02-11 | 2025-02-05 | 1 | 25 |
+-----+-------------+------------------+---------+------+
Part B (SQL, MySQL 8.0): Using the schema below, write ONE query that: (i) computes each student’s avg_score across all grades; (ii) excludes students with zero completed assignments; (iii) keeps only the top CEIL(N/2) students by avg_score where N is the number of students with at least one grade; break ties using student_id ASC; (iv) outputs student_id, name, avg_score (rounded to 1 decimal), and a dense_rank over the selected students ordered by avg_score DESC, student_id ASC; and (v) still works correctly as more students/grades are added.
Schema:
Teachers(id PK, name, classroom)
Students(id PK, name, primary_teacher_id FK -> Teachers.id)
Assignments(id PK, teacher_id FK -> Teachers.id)
Grades(student_id FK -> Students.id, assignment_id FK -> Assignments.id, grade DECIMAL(4,1), PK(student_id, assignment_id))
Sample tables:
Teachers
+----+------------+----------+
| id | name | classroom|
+----+------------+----------+
| 4 | Mr. Feeny | 301 |
| 8 | Mr. Cooper | 260 |
+----+------------+----------+
Students
+----+---------+--------------------+
| id | name | primary_teacher_id |
+----+---------+--------------------+
| 1 | Bobby | 4 |
| 2 | Susie | 8 |
| 3 | Deborah | 8 |
| 9 | Bruce | NULL |
+----+---------+--------------------+
Assignments
+----+------------+
| id | teacher_id |
+----+------------+
| 1 | 4 |
| 2 | 8 |
+----+------------+
Grades
+------------+---------------+-------+
| student_id | assignment_id | grade |
+------------+---------------+-------+
| 1 | 1 | 100.0 |
| 1 | 2 | 50.0 |
| 2 | 1 | 100.0 |
| 2 | 2 | 100.0 |
| 3 | 1 | 40.0 |
| 3 | 2 | 8.0 |
| 9 | 1 | 65.0 |
| 9 | 2 | 65.0 |
+------------+---------------+-------+
Quick Answer: This question evaluates proficiency in data cleaning, transformation, aggregation, and deterministic ranking across R (tidyverse) and SQL, focusing on handling missing values, date arithmetic, aggregation and rounding of metrics, and top-N selection with tie-breaking.