Compute churn metrics and rank top students

Q: How do I practice SQL interview questions?

PracHub provides an interactive SQL console where you can write and test queries against real database schemas. Get instant feedback and compare your solution with the expected output.

Q: What difficulty level is this coding question?

This is a medium difficulty Data Manipulation (SQL/Python) question, commonly asked during Technical Screen rounds at Flatiron Health.

Q: What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Flatiron Health during technical interviews.

Question

You have two tasks.

Part A (R): You receive a Customers CSV with columns: id (INT), signup_date (YYYY-MM-DD), last_active_date (YYYY-MM-DD), churned (0/1/NULL), pay (DECIMAL/NULL). Clean and compute metrics under these rules: (1) Drop rows where id is NULL/blank; (2) If pay is NULL, set pay = 0; (3) If churned is NULL but pay > 0, set churned = 0 (treat as active); (4) Define active_days = DATEDIFF(last_active_date, signup_date); if last_active_date < signup_date, set active_days = 0; (5) Treat last_active_date as the churn date when churned = 1. Produce: (a) the average active_days across all remaining customers; and (b) the average pay among customers with churned = 1. Return both rounded to 2 decimals and explain briefly how each rule is enforced in your code. Use idiomatic tidyverse R.
Sample data (for format clarity):
Customers
+-----+-------------+------------------+---------+------+
| id | signup_date | last_active_date | churned | pay |
+-----+-------------+------------------+---------+------+
| 1 | 2025-01-10 | 2025-03-05 | 1 | 99 |
| 2 | 2025-02-01 | 2025-02-20 | NULL | 15 |
| 3 | 2025-01-15 | 2025-01-18 | 0 | NULL |
| NULL| 2025-02-10 | 2025-02-12 | 1 | 50 |
| 4 | 2025-02-11 | 2025-02-05 | 1 | 25 |
+-----+-------------+------------------+---------+------+

Part B (SQL, MySQL 8.0): Using the schema below, write ONE query that: (i) computes each student’s avg_score across all grades; (ii) excludes students with zero completed assignments; (iii) keeps only the top CEIL(N/2) students by avg_score where N is the number of students with at least one grade; break ties using student_id ASC; (iv) outputs student_id, name, avg_score (rounded to 1 decimal), and a dense_rank over the selected students ordered by avg_score DESC, student_id ASC; and (v) still works correctly as more students/grades are added.
Schema:
Teachers(id PK, name, classroom)
Students(id PK, name, primary_teacher_id FK -> Teachers.id)
Assignments(id PK, teacher_id FK -> Teachers.id)
Grades(student_id FK -> Students.id, assignment_id FK -> Assignments.id, grade DECIMAL(4,1), PK(student_id, assignment_id))
Sample tables:
Teachers
+----+------------+----------+
| id | name | classroom|
+----+------------+----------+
| 4 | Mr. Feeny | 301 |
| 8 | Mr. Cooper | 260 |
+----+------------+----------+
Students
+----+---------+--------------------+
| id | name | primary_teacher_id |
+----+---------+--------------------+
| 1 | Bobby | 4 |
| 2 | Susie | 8 |
| 3 | Deborah | 8 |
| 9 | Bruce | NULL |
+----+---------+--------------------+
Assignments
+----+------------+
| id | teacher_id |
+----+------------+
| 1 | 4 |
| 2 | 8 |
+----+------------+
Grades
+------------+---------------+-------+
| student_id | assignment_id | grade |
+------------+---------------+-------+
| 1 | 1 | 100.0 |
| 1 | 2 | 50.0 |
| 2 | 1 | 100.0 |
| 2 | 2 | 100.0 |
| 3 | 1 | 40.0 |
| 3 | 2 | 8.0 |
| 9 | 1 | 65.0 |
| 9 | 2 | 65.0 |
+------------+---------------+-------+

PracHub · Accepted Answer

This question evaluates proficiency in data cleaning, transformation, aggregation, and deterministic ranking across R (tidyverse) and SQL, focusing on handling missing values, date arithmetic, aggregation and rounding of metrics, and top-N selection with tie-breaking.

Quick Overview

Quick Overview