This question evaluates data manipulation and aggregation skills in SQL and Pandas, including joins/unions across disjoint member tables, computation of specialty spend shares, identification of top age bands by claim counts with tie handling, year-based date filtering, and categorical recoding with monthly grouping.
You are given healthcare claims data split across member tables.
Assume the following schemas (types may be adapted to your SQL dialect):
mem1member_id
INT PRIMARY KEY
age_band
VARCHAR
mem2member_id
INT PRIMARY KEY
age_band
VARCHAR
claimclaim_id
INT PRIMARY KEY
member_id
INT
specialty
VARCHAR
paid_amt
DECIMAL(12,2)
claim_date
DATE
risk (provided but not necessarily needed)member_id
INT
risk_score
DECIMAL(10,4)
Assume claim.member_id joins to either mem1.member_id or mem2.member_id.
Assume claim_date is in UTC and “year = 2017” means calendar year 2017.
For each specialty, compute:
total_paid_amt
= total paid amount for that specialty
paid_share
= that specialty’s share of total paid amount across all claims
Output columns: specialty, total_paid_amt, paid_share.
Considering only claims in calendar year 2017:
mem1
and
mem2
into a single member set.
age_band
.
Output columns: age_band, claim_cnt.
You are given a Pandas DataFrame df_claims with columns:
paid_amt
(numeric)
gender
(values are
'M'
or
'F'
)
claim_timestamp
(string or datetime-like)
Tasks:
gender
from
M/F
to
male/female
.
paid_amt
for
each month in 2017
, grouped by
gender
and month.
Output: a DataFrame with columns like gender, month (1–12), total_paid_amt (and optionally year if you keep it).