Design student–course data models and SQL
Company: Amazon
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Onsite
Scenario: Model a university domain with Students, Courses, Departments, Instructors, and Enrollments.
Tasks:
1) OLTP ERD: Specify normalized tables with PRIMARY/FOREIGN keys and constraints (e.g., a student cannot enroll in two sections of the same course in the same term; grade in {A,B,C,D,F,Pass,Fail,Incomplete}). Include indexes you would create and why.
2) Warehouse: Design a star schema for analytics (FactEnrollment with measures like credits_attempted, credits_earned, points; DimStudent, DimCourse, DimDepartment, DimTerm, DimInstructor). Explain your grain and surrogate keys. Model SCD Type 2 on DimStudent for major changes; show columns (effective_start, effective_end, is_current) and how you would join for point-in-time correctness.
3) Partitioning: Propose partitioning/clustering for FactEnrollment at 100M+ rows. Justify by common query predicates and maintenance.
4) Write SQL:
a) List students who took both 'CS101' and 'MATH201' in the same term.
b) For the most recent completed term, return the top 3 departments by average GPA (weighted by credits). Break ties deterministically.
5) Briefly compare OLTP vs OLAP design trade-offs in this scenario and when you would prefer snowflake over star.
Quick Answer: This question evaluates data modeling, database design, and SQL proficiency—covering normalization, schema constraints, indexing, star-schema warehousing, slowly changing dimensions, partitioning strategies, and complex analytic queries—in the domain of relational databases and data warehousing (Data Manipulation/SQL).