PracHub

Train and analyze a classifier

Last updated: May 14, 2026

Quick Overview

This question evaluates proficiency in end-to-end machine learning engineering, including exploratory data analysis, time-aware train/validation/test splitting, model baseline and improvement, class imbalance strategies, leakage-safe hyperparameter tuning, metric computation and calibration, error analysis, reproducible training pipelines with CLI/config/seed control, explainability (feature importance/SHAP and ablations), and documentation of risks, fairness checks and monitoring hooks. It is commonly asked to assess practical ability to manage the full ML lifecycle and data hygiene in production-like scenarios, testing applied Data Manipulation (SQL/Python) and Machine Learning competencies with an emphasis on practical application while also requiring conceptual understanding of evaluation, calibration and fairness.


Company: OpenAI

Role: Machine Learning Engineer

Category: Data Manipulation (SQL/Python)

Difficulty: Medium

Interview Round: Technical Screen

Given a labeled dataset for binary classification, implement an end-to-end Python solution to train and analyze a classifier. Tasks: (1) perform EDA (missingness, outliers, leakage checks, target/feature drift over time), (2) create time-aware, stratified train/validation/test splits with proper cross-validation, (3) build a strong baseline and at least one improved model, (4) handle class imbalance (cost-sensitive loss, resampling, thresholds), (5) tune hyperparameters without leakage, (6) compute and compare metrics (ROC-AUC, PR-AUC, F1, calibration/Brier, confusion matrix at chosen threshold), (7) conduct error analysis by slice and feature, (8) produce a reproducible training script with CLI, config, and seed control, (9) explain feature importance/SHAP and validate with ablations, and (10) document risks, fairness checks, and monitoring hooks for production. Provide code snippets and explain your design choices.

Solution

### How to approach this in an interview

This is an open-ended ML coding task disguised as a "train a classifier" exercise. The interviewer is not grading whether `roc_auc_score` is high; they are grading whether you build a **leakage-free, reproducible, production-aware pipeline** and can defend every design choice.

The trap is rushing to `.fit()`. The signal is the order of operations: EDA → splits → leakage-safe preprocessing **inside CV** → imbalance/threshold handling → honest evaluation → error analysis → reproducibility → explainability → production risks.

Two principles thread through all 10 tasks:

1. **The test set is touched exactly once**, at the very end. Validation/CV drives all decisions; the held-out test set only confirms the final number.
2. **Every transform that learns from data (imputation, scaling, encoding, resampling, calibration, threshold selection) is fit on training folds only** and applied to validation/test. This is the single most common way candidates leak, and the most common way real production models silently overstate their accuracy.

I'll build this around scikit-learn `Pipeline` + a leak-safe CV, with a gradient-boosted tree as the workhorse. It handles mixed types, missing values, and nonlinearities, and needs little preprocessing, which makes it a good default for tabular binary classification.

---

### 1) EDA: missingness, outliers, leakage, drift

EDA is run **only on the training split** for any decision that feeds the model; looking at the full dataset (including test) to decide features is itself leakage. I do a quick read on the whole frame for shape/dtypes, then split, then do "deep" EDA on train.

```python
import numpy as np
import pandas as pd

def basic_profile(df: pd.DataFrame, target: str, time_col: str | None = None):
    print("shape:", df.shape)
    print("\ndtypes:\n", df.dtypes.value_counts())
    print("\ntarget balance:\n", df[target].value_counts(normalize=True))
    if time_col is not None:
        print("\ntime range:", df[time_col].min(), "to", df[time_col].max())

    # Missingness per column
    miss = df.isna().mean().sort_values(ascending=False)
    print("\ntop missing:\n", miss[miss > 0].head(20))

    # Numeric outlier sketch (1.5 * IQR fences), reported not auto-removed
    num = df.select_dtypes("number").drop(columns=[target], errors="ignore")
    q1, q3 = num.quantile(.25), num.quantile(.75)
    iqr = (q3 - q1).replace(0, np.nan)  # constant columns yield NaN fences, i.e. no outliers flagged
    out_rate = ((num < q1 - 1.5*iqr) | (num > q3 + 1.5*iqr)).mean()
    print("\noutlier rate (IQR):\n", out_rate.sort_values(ascending=False).head(10))
    return miss
```

**Missingness.** I distinguish *missing-completely-at-random* from *informative missingness*. If `is_na(x)` correlates with the target, I add an explicit `x_was_missing` indicator rather than silently imputing the signal away. Imputation itself happens **inside the pipeline** (next sections), never as a one-shot fill on the whole frame.

**Outliers.** I report them; I don't reflexively drop them. Tree models are invariant to monotone feature transforms and largely insensitive to feature outliers, so I usually keep them. I only winsorize/clip for linear or calibration-sensitive models, and the clip bounds are learned on train folds.

**Leakage checks** (the highest-value part of EDA):

- **Target leakage:** any feature with a suspiciously high univariate AUC (e.g. a single column giving AUC ≈ 0.99) is a red flag. I check `roc_auc_score(y, x)` per feature and manually inspect the top ones. They're often post-outcome fields (e.g. "refund_amount" predicting "churned").
- **Time leakage:** features computed using information from the future relative to the prediction timestamp (aggregates that include the label window).
- **Train/test contamination:** duplicate rows or the same entity (user/account) appearing in both splits.
  I dedupe and, if there's a group key, split by group.
- **ID-as-feature:** drop row IDs, hashes, and monotonically increasing keys that encode collection order.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def univariate_auc(df, target):
    y = df[target]
    scores = {}
    for c in df.columns.drop(target):
        x = df[c]
        if x.nunique() < 2:
            continue
        m = x.notna()  # build the mask BEFORE encoding: cat.codes maps NaN to -1
        if not pd.api.types.is_numeric_dtype(x):
            x = x.astype("category").cat.codes  # arbitrary category order: a rough screen only
        try:
            scores[c] = roc_auc_score(y[m], x[m])
        except ValueError:  # e.g. only one class left after dropping missing rows
            continue
    s = pd.Series(scores).map(lambda a: max(a, 1 - a))  # direction-agnostic
    return s.sort_values(ascending=False)  # inspect the top: likely leakers
```

**Drift over time.** If there's a timestamp, I check both *covariate drift* (feature distributions shifting across time buckets) and *label/prior drift* (base rate changing). A practical test: train a "domain classifier" to distinguish early vs. late periods. If its AUC is well above 0.5, the data is non-stationary and a time-aware split is mandatory, not optional. I also plot rolling target mean by time bucket.

---

### 2) Splits: time-aware, stratified, leakage-free CV

The right split depends on how the model is deployed. **If predictions are made on future data (almost always true in production), the test set must be the latest time period.** A random split overstates performance because the model sees the future during training.

```python
def time_split(df, time_col, test_frac=0.15, val_frac=0.15):
    # Oldest rows go to train, the middle block to validation, the newest to test.
    df = df.sort_values(time_col).reset_index(drop=True)
    n = len(df)
    tr_end = int(n * (1 - test_frac - val_frac))
    va_end = int(n * (1 - test_frac))
    return df.iloc[:tr_end], df.iloc[tr_end:va_end], df.iloc[va_end:]
```

- **Time-aware:** sort by timestamp; train = oldest, validation = middle, test = newest. No future row ever lands in train.
- **Cross-validation:** use `TimeSeriesSplit` (expanding window) so each fold trains on the past and validates on the future.
  `StratifiedKFold` defaults to `shuffle=False`, so it doesn't reorder rows, but it still leaks time: it interleaves early and late timestamps across folds, so a fold's training data can contain rows from the future relative to its validation rows. It's only acceptable when rows are genuinely i.i.d. and there's no temporal structure.
- **Grouping:** if an entity recurs (same user across rows), wrap in `GroupKFold` / `StratifiedGroupKFold` so no entity straddles train and validation. When both time and group constraints apply, I order by time and assign whole groups to contiguous time blocks.
- **Stratification:** with class imbalance, stratify on the target so the rare-class proportion is stable across folds. Stratification must not override the time ordering, though; for temporal data I rely on the time order, accept slightly uneven fold priors, and monitor them.

> Decision rule I'd state aloud: *"Is the deployment prediction made on future data?"* If yes → time split + `TimeSeriesSplit`. If the data is a static i.i.d. snapshot → stratified random k-fold. I'd ask the interviewer which regime we're in rather than assume.

---

### 3) Baseline + improved model

A baseline exists to make "good" meaningful. I always report a trivial baseline first.

- **Trivial baseline:** majority-class predictor and a marginal-rate predictor (predict the base rate for everyone). With a 1% positive rate, the majority predictor is 99% accurate, which is exactly why accuracy is the wrong metric and PR-AUC matters.
- **Strong linear baseline:** `LogisticRegression` with `class_weight="balanced"`, in a pipeline with imputation + scaling + one-hot. Interpretable, fast, and a genuinely competitive baseline on many tabular problems.
- **Improved model:** gradient-boosted trees (`HistGradientBoostingClassifier`, or XGBoost/LightGBM if available). They handle missing values natively, capture interactions, are robust to monotone transforms and outliers, and are typically among the top performers on tabular data.
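
To make the trivial-baseline bullet concrete, here is a minimal, dependency-free sketch. The helper name `majority_baseline_report` and the synthetic 1% label vector are illustrative assumptions, not part of the task's dataset.

```python
def majority_baseline_report(y):
    """Score a predictor that always outputs the majority class.

    Hypothetical helper for illustration; `y` is a list of 0/1 labels.
    """
    n = len(y)
    n_pos = sum(y)
    majority = 1 if 2 * n_pos > n else 0
    pred = [majority] * n

    accuracy = sum(p == t for p, t in zip(pred, y)) / n
    true_pos = sum(1 for p, t in zip(pred, y) if p == 1 and t == 1)
    recall = true_pos / n_pos if n_pos else float("nan")
    return {"accuracy": accuracy, "recall": recall, "base_rate": n_pos / n}


# Synthetic labels: 10 positives out of 1,000 rows (1% base rate).
labels = [1] * 10 + [0] * 990
report = majority_baseline_report(labels)
print(report)
```

Accuracy looks excellent while recall is zero. A constant-score predictor also has an average precision equal to the base rate, so PR-AUC should be read against that floor rather than against 0.5.
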

HGB has native categorical support, but it needs categoricals as **integer codes or pandas `category` dtype, not raw strings**; handing it a string column raises `ValueError: could not convert string to float`. So I wrap it in a thin preprocessor that ordinal-encodes the categoricals (numerics pass through untouched, NaNs preserved) and pass `categorical_features` so HGB still splits on them natively. The whole thing is a `Pipeline`, so it drops straight into `RandomizedSearchCV` and refits the encoder on each training fold.

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import HistGradientBoostingClassifier

def make_preprocessor(num_cols, cat_cols):
    num = Pipeline([("imp", SimpleImputer(strategy="median", add_indicator=True)),
                    ("sc", StandardScaler())])
    cat = Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                    ("oh", OneHotEncoder(handle_unknown="ignore", min_frequency=10))])
    return ColumnTransformer([("num", num, num_cols), ("cat", cat, cat_cols)])

def baseline(num_cols, cat_cols):
    return Pipeline([("prep", make_preprocessor(num_cols, cat_cols)),
                     ("clf", LogisticRegression(max_iter=1000, class_weight="balanced"))])

def gbm(num_cols, cat_cols):
    # HGB handles NaNs natively. Strings must be encoded to integer codes first;
    # then categorical_features tells HGB to split on them as true categories.
    # Unknown and missing categories both map to -1, which HGB treats as missing.
    # remainder="passthrough" keeps numerics (incl. NaN) untouched.
    enc = ColumnTransformer(
        [("cat", OrdinalEncoder(handle_unknown="use_encoded_value",
                                unknown_value=-1, encoded_missing_value=-1), cat_cols)],
        remainder="passthrough")
    # After the transform, the cat_cols occupy the first len(cat_cols) positions.
    cat_idx = list(range(len(cat_cols)))
    # Caveat: early_stopping holds out a random validation_fraction of each
    # training fold, so that internal split is not time-ordered.
    clf = HistGradientBoostingClassifier(
        learning_rate=0.05, max_iter=600, early_stopping=True,
        validation_fraction=0.1, l2_regularization=1.0,
        categorical_features=cat_idx, class_weight="balanced", random_state=42)
    return Pipeline([("enc", enc), ("clf", clf)])
```

The whole point of the `Pipeline`/`ColumnTransformer` wrapper: when this object goes into cross-validation, **imputation statistics, scaling means, one-hot vocabularies, and the ordinal encoder's category map are refit on each training fold**, so validation scores are leakage-free by construction.

---

### 4) Class imbalance: cost, resampling, thresholds

Three independent levers; I prefer the first and third and treat resampling as a last resort.

1. **Cost-sensitive learning (preferred).** `class_weight="balanced"` (or an explicit `scale_pos_weight = n_neg/n_pos` for XGBoost) reweights the loss so the rare class isn't ignored. No data duplication, no synthetic rows.
2. **Resampling.** SMOTE/random over-sampling or under-sampling. **Critical leakage trap: resample inside CV, on training folds only, never before splitting.** Use `imblearn.pipeline.Pipeline` so SMOTE sits in the pipeline, is applied only during `fit`, and is skipped when predicting on validation/test. Resampling distorts the predicted-probability scale, so it must be paired with calibration (section 6).
3. **Threshold tuning (almost always needed).** The default 0.5 threshold is arbitrary and usually wrong for imbalanced data. I choose the operating threshold from the **validation** set against the business cost: maximize F-beta, hit a target precision/recall, or minimize expected cost given a cost matrix. The threshold is a *decision*, separate from the *model*.
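
As a sketch of what the cost-sensitive lever computes: scikit-learn documents the `"balanced"` heuristic as `n_samples / (n_classes * class_count)`, and the XGBoost-style ratio is `n_neg / n_pos`. The helper name `balanced_weights` and the label counts below are hypothetical, chosen only for illustration.

```python
def balanced_weights(y):
    """Class weights for binary 0/1 labels using the 'balanced' heuristic:
    n_samples / (n_classes * class_count). Hypothetical helper, plain Python."""
    n = len(y)
    n_pos = sum(y)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    return {
        "w_neg": n / (2 * n_neg),
        "w_pos": n / (2 * n_pos),
        "scale_pos_weight": n_neg / n_pos,  # ratio form, as used for XGBoost
    }


# Synthetic labels: 10 positives, 990 negatives.
labels = [1] * 10 + [0] * 990
weights = balanced_weights(labels)
print(weights)
```

With these counts each positive row carries 99 times the loss weight of a negative row, which is also why the reweighted model's raw scores drift away from true probabilities and need the calibration check in section 6.
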

```python
import numpy as np
# Optional dependency, only needed if resampling is justified:
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE

def pick_threshold(y_true, proba, cost_fp=1.0, cost_fn=5.0):
    """Choose threshold minimizing expected cost on VALIDATION data."""
    y_true, proba = np.asarray(y_true), np.asarray(proba)  # avoid index-alignment surprises
    ts = np.linspace(0.01, 0.99, 99)
    best_t, best_cost = 0.5, np.inf
    for t in ts:
        pred = (proba >= t).astype(int)
        fp = ((pred == 1) & (y_true == 0)).sum()
        fn = ((pred == 0) & (y_true == 1)).sum()
        cost = cost_fp*fp + cost_fn*fn
        if cost < best_cost:
            best_cost, best_t = cost, t
    return float(best_t)
```

I'd state the asymmetry explicitly: in fraud/medical/churn, a false negative usually costs far more than a false positive, so `cost_fn > cost_fp` and the threshold drops well below 0.5.

---

### 5) Hyperparameter tuning without leakage

Tuning is where leakage creeps back in even after you've split correctly, because the search itself can overfit the validation data.

- **Search inside CV.** `RandomizedSearchCV`/`GridSearchCV` (or Optuna) wrap the **entire pipeline** and use the **time-aware CV** from section 2. Because the pipeline includes preprocessing and any resampling, each candidate is scored leakage-free.
- **Score on the right metric.** I tune for `average_precision` (PR-AUC) or `roc_auc` on imbalanced data, not accuracy.
- **Nested CV when feasible.** To get an unbiased estimate of the *tuned* model's performance, use nested CV: an inner loop selects hyperparameters, an outer loop estimates generalization. If that's too expensive, the held-out **test set** (touched once) plays the outer role.
- **Don't tune the threshold inside the AUC search.** AUC/PR-AUC are threshold-free; the threshold is fixed afterward on validation (section 4).
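
The claim that the search itself can overfit the validation data is easy to see in a small simulation. This is an illustrative sketch, not part of the pipeline: the 200 "candidates" are pure coin-flip predictors (an assumption chosen so that every candidate's true accuracy is exactly 0.5), and all names are hypothetical.

```python
import random

N_VAL, N_TEST, N_CANDIDATES = 200, 5000, 200


def accuracy(pred, y):
    return sum(p == t for p, t in zip(pred, y)) / len(y)


def candidate_predictions(seed):
    """One coin-flip 'model': its predictions for the validation rows, then the test rows."""
    r = random.Random(seed)
    return [r.randint(0, 1) for _ in range(N_VAL + N_TEST)]


rng = random.Random(0)
y_val = [rng.randint(0, 1) for _ in range(N_VAL)]
y_test = [rng.randint(0, 1) for _ in range(N_TEST)]

# "Tuning": score every candidate on validation and keep the winner.
val_scores = {
    seed: accuracy(candidate_predictions(seed)[:N_VAL], y_val)
    for seed in range(1, N_CANDIDATES + 1)
}
best_seed = max(val_scores, key=val_scores.get)
best_val = val_scores[best_seed]

# Honest estimate: score the winner on rows that played no part in the selection.
best_test = accuracy(candidate_predictions(best_seed)[N_VAL:], y_test)
print(f"winner on validation: {best_val:.3f}, same model on test: {best_test:.3f}")
```

The winner's validation score sits above 0.5 only because it is the maximum of many noisy scores. Nested CV, or the once-touched test set, plays the role of the second measurement here.
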

Because `gbm(...)` is a `Pipeline` with the classifier under the `"clf"` step, the search keys are prefixed `clf__`:

```python
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

param_dist = {
    "clf__learning_rate": [0.02, 0.05, 0.1],
    "clf__max_iter": [300, 600, 1000],
    "clf__max_leaf_nodes": [15, 31, 63],
    "clf__l2_regularization": [0.0, 1.0, 10.0],
    "clf__min_samples_leaf": [20, 50, 100],
}
# num_cols / cat_cols are the feature-name lists for your dataset (defined by the caller).
# TimeSeriesSplit assumes the rows are already sorted by time, as time_split leaves them.
search = RandomizedSearchCV(
    gbm(num_cols, cat_cols), param_distributions=param_dist, n_iter=30,
    scoring="average_precision", cv=TimeSeriesSplit(n_splits=5),
    random_state=42, n_jobs=-1, refit=True)
```

---

### 6) Metrics: ROC-AUC, PR-AUC, F1, calibration/Brier, confusion matrix

I report a **suite**, not one number, because each answers a different question. All numbers below come from the model trained on train(+val) and scored **once** on the held-out test set.

| Metric | What it tells me | When it matters |
|---|---|---|
| **ROC-AUC** | Ranking quality across all thresholds | General discrimination; insensitive to base rate, so it can look optimistic on imbalanced data |
| **PR-AUC / Average Precision** | Ranking quality focused on the positive class | **Primary metric under imbalance**; far more honest than ROC-AUC when positives are rare |
| **F1 / F-beta @ chosen threshold** | Precision/recall balance at the operating point | Once a threshold is set; F-beta weights recall higher when FNs are costly |
| **Brier score** | Squared error of predicted probabilities | When probabilities are used downstream (expected value, ranking by risk) |
| **Calibration curve** | Do predicted probabilities match observed frequencies? | Resampling/`class_weight` distort probabilities; recalibrate if so |
| **Confusion matrix @ threshold** | Concrete FP/FN counts at the deployed cutoff | Translating the model into business cost |

```python
from sklearn.metrics import (roc_auc_score, average_precision_score, f1_score,
                             brier_score_loss, confusion_matrix)
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

def evaluate(model, X, y, threshold):
    p = model.predict_proba(X)[:, 1]
    pred = (p >= threshold).astype(int)
    return {
        "roc_auc": roc_auc_score(y, p),
        "pr_auc": average_precision_score(y, p),
        "f1": f1_score(y, pred),
        "brier": brier_score_loss(y, p),
        # labels=[0, 1] keeps the matrix 2x2 even if one class is never predicted
        "cm": confusion_matrix(y, pred, labels=[0, 1]).tolist(),
    }
```

**Calibration.** If I used resampling or aggressive class weighting, the raw scores are not probabilities. I wrap the fitted model in `CalibratedClassifierCV` (isotonic when there is enough data, Platt/sigmoid for small data), **fit the calibrator on a held-out slice of training data, not on test**, and re-check the Brier score and reliability curve. I'd note the trade-off: `class_weight="balanced"` improves recall but worsens calibration, so when probabilities matter I either calibrate afterward or prefer threshold tuning over reweighting.

---

### 7) Error analysis by slice and feature

Aggregate metrics hide where the model fails. While iterating, I slice the **validation** (or out-of-fold) predictions and look for systematic patterns; acting on test-set slices would break the touch-once rule, so the test-set slice report is produced only for the final write-up.

- **By categorical slice:** compute PR-AUC / recall / FN-rate per segment (region, device, plan tier, time bucket). A model with 0.85 global AUC but 0.55 on the newest cohort is likely drifting.
- **By feature range:** bin a numeric feature and check error rate per bin; this reveals nonlinearity or threshold cliffs.
- **By confidence:** are high-confidence errors clustered? Inspect the most confident false positives/negatives individually. These are often label errors or genuine edge cases worth a feature.
- **Residual vs. feature:** plot error against each feature to find regions the model under-serves.
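
The "by confidence" check above can be sketched in a few lines of plain Python. The helper name `most_confident_errors` and the six-row toy data are assumptions made for illustration only.

```python
def most_confident_errors(y_true, proba, threshold=0.5, k=5):
    """Return row indices of the k most confident false positives and false negatives.

    Hypothetical helper: `y_true` holds 0/1 labels and `proba` the predicted
    positive-class probabilities, in the same row order.
    """
    rows = list(zip(range(len(y_true)), y_true, proba))
    false_pos = [(i, p) for i, t, p in rows if p >= threshold and t == 0]
    false_neg = [(i, p) for i, t, p in rows if p < threshold and t == 1]
    # Most confident FP = highest score; most confident FN = lowest score.
    false_pos.sort(key=lambda r: r[1], reverse=True)
    false_neg.sort(key=lambda r: r[1])
    return {"fp": [i for i, _ in false_pos[:k]],
            "fn": [i for i, _ in false_neg[:k]]}


y_true = [0, 0, 1, 1, 0, 1]
proba = [0.9, 0.2, 0.1, 0.8, 0.7, 0.4]
errors = most_confident_errors(y_true, proba, threshold=0.5, k=2)
print(errors)  # {'fp': [0, 4], 'fn': [2, 5]}
```

The returned indices are the rows to read by hand first: row 0 was scored 0.9 but is negative, and row 2 was scored 0.1 but is positive.
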

```python
import numpy as np
import pandas as pd
from sklearn.metrics import average_precision_score

def slice_report(df, y_true, proba, threshold, by):
    proba = np.asarray(proba)
    # Plain arrays avoid index-alignment surprises between df, y_true and proba.
    g = pd.DataFrame({"y": np.asarray(y_true), "p": proba,
                      "pred": (proba >= threshold).astype(int),
                      "by": df[by].to_numpy()})
    return g.groupby("by")[["y", "pred", "p"]].apply(lambda d: pd.Series({
        "n": len(d),
        "base_rate": d.y.mean(),
        "recall": d.pred[d.y == 1].mean() if (d.y == 1).any() else np.nan,
        "fp_rate": d.pred[d.y == 0].mean() if (d.y == 0).any() else np.nan,
        "pr_auc": average_precision_score(d.y, d.p) if d.y.nunique() > 1 else np.nan,
    })).sort_values("n", ascending=False)
```

I tie findings back to action: a weak slice means a missing feature, a sub-model, slice-specific thresholds, or more data for that segment.

---

### 8) Reproducible training script: CLI, config, seeds

Reproducibility is a first-class deliverable here, not boilerplate. Given the same config, seed, data, and environment, the script should reproduce the same metrics. Every helper it calls (`time_split`, `gbm`, `pick_threshold`, `evaluate`) is defined earlier in this answer, so once those definitions are in the file (or imported) the script runs end-to-end.

```python
# train.py
import argparse, hashlib, json, random
from pathlib import Path

import joblib, numpy as np, pandas as pd, yaml
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# time_split (§2), gbm (§3), pick_threshold (§4) and evaluate (§6) are the helpers
# defined earlier in this answer; paste them into this file or import them.

def set_seeds(seed: int):
    random.seed(seed); np.random.seed(seed)
    # also set framework seeds (e.g. torch.manual_seed) if used

def build_and_search(tr, cfg, seed):
    """Fit the tuned GBM pipeline via the §5 leakage-free RandomizedSearchCV."""
    est = gbm(cfg["num_cols"], cfg["cat_cols"])            # §3
    est.set_params(clf__random_state=seed)                 # thread the CLI seed into the model
    search = RandomizedSearchCV(
        est, param_distributions=cfg["param_dist"], n_iter=cfg.get("n_iter", 30),
        scoring="average_precision",
        cv=TimeSeriesSplit(n_splits=cfg.get("n_splits", 5)),
        random_state=seed, n_jobs=-1, refit=True)
    search.fit(tr[cfg["features"]], tr[cfg["target"]])
    return search.best_estimator_

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--config", required=True)  # YAML: paths, cols, model, search space
    ap.add_argument("--data", required=True)
    ap.add_argument("--out", default="artifacts/")
    ap.add_argument("--seed", type=int, default=42)
    args = ap.parse_args()

    with open(args.config) as f:
        cfg = yaml.safe_load(f)
    out = Path(args.out)
    out.mkdir(parents=True, exist_ok=True)  # the output directory may not exist yet
    set_seeds(args.seed)

    # Parse the timestamp so the time sort is chronological, not lexicographic.
    df = pd.read_csv(args.data, parse_dates=[cfg["time_col"]])
    tr, va, te = time_split(df, cfg["time_col"])
    model = build_and_search(tr, cfg, seed=args.seed)      # CV/tuning from §2/§5
    thr = pick_threshold(va[cfg["target"]],
                         model.predict_proba(va[cfg["features"]])[:, 1],
                         **cfg["costs"])
    metrics = evaluate(model, te[cfg["features"]], te[cfg["target"]], thr)  # test: once

    # Persist everything needed to reproduce + audit
    joblib.dump({"model": model, "threshold": thr, "features": cfg["features"]},
                out / "model.joblib")
    data_md5 = hashlib.md5(Path(args.data).read_bytes()).hexdigest()
    with open(out / "run.json", "w") as f:
        json.dump({"metrics": metrics, "threshold": float(thr), "seed": args.seed,
                   "config": cfg, "data_md5": data_md5}, f, indent=2)

if __name__ == "__main__":
    main()
```

Reproducibility checklist I'd mention:

- **Single seed** threaded into NumPy, the model, the CV splitter, and any sampler.
- **Config-as-code (YAML):** no magic numbers in the body; the config is logged with the run. (`num_cols`/`cat_cols`/`features`/`param_dist`/`costs` all live there.)
- **Pinned environment** (`requirements.txt`/lockfile, recorded library versions).
- **Data hash** stored so the run is tied to an exact dataset snapshot.
- **Versioned artifacts:** model, threshold, feature list, and metrics saved together; the threshold travels *with* the model (a model without its threshold is unusable).
- Deterministic flags for GPU frameworks if applicable.

---

### 9) Feature importance / SHAP + ablation validation

I never trust a single importance method; they disagree and several are biased.

- **Permutation importance** on the **validation/test** set: shuffle one feature, measure metric drop. Model-agnostic and measures impact on generalization (unlike tree `feature_importances_`, which is biased toward high-cardinality features and computed on train).
- **SHAP** for both global ranking (mean |SHAP|) and local explanations of individual predictions. `TreeExplainer` is exact and fast for GBMs. SHAP also surfaces interaction effects and direction, which permutation importance doesn't.
- **Ablation validation**, the part that makes importance *credible*: drop the top-k "important" features, retrain, and confirm the metric actually falls. If removing a "top" feature doesn't move the score, the importance was an artifact (often correlated-feature splitting). Conversely, add a known-irrelevant random column as a control; anything ranked below it is noise.

```python
import shap
from sklearn.inspection import permutation_importance
from sklearn.metrics import average_precision_score

# model, X_val, y_val: the fitted pipeline and the validation split from earlier sections.
# permutation_importance works on the whole pipeline (encoder + classifier),
# so feature names stay the raw column names.
pi = permutation_importance(model, X_val, y_val, scoring="average_precision",
                            n_repeats=20, random_state=42)

# SHAP runs on the fitted HGB; transform X through the pipeline's encoder first.
# Assumption: the installed shap version supports this HGB configuration;
# if it does not, fall back to a model-agnostic explainer.
enc = model.named_steps["enc"]
clf = model.named_steps["clf"]
explainer = shap.TreeExplainer(clf)
sv = explainer.shap_values(enc.transform(X_val))  # global: mean(|sv|); local: per-row

# Ablation: retrain without top-k and confirm the metric drops
def ablate(top_k_feats, train_fn, X_tr, y_tr, X_te, y_te):
    full = train_fn(X_tr, y_tr)
    red = train_fn(X_tr.drop(columns=top_k_feats), y_tr)
    return (average_precision_score(y_te, full.predict_proba(X_te)[:, 1]),
            average_precision_score(y_te, red.predict_proba(X_te.drop(columns=top_k_feats))[:, 1]))
```

I'd also flag: a feature that is hugely important *and* a nearly perfect predictor is a leakage suspect, so circle back to section 1.

---

### 10) Risks, fairness, monitoring for production

A model is a liability until it's monitored. I'd close by enumerating what breaks in production and how I'd catch it.

**Risks.**

- **Train/serve skew:** the same preprocessing must run at train and serve time. Shipping the whole `Pipeline` (not just the estimator) prevents the classic "scaler refit at serve" bug.
- **Drift:** covariate and label drift degrade the model silently. The time-split evaluation is a preview; production needs ongoing checks.
- **Stale thresholds:** if the base rate shifts, the fixed threshold's precision/recall move, so revisit it periodically.
- **Feedback loops:** if the model's decisions influence future labels (blocking users you'd never see convert), the training distribution self-corrupts.

**Fairness.**

- Compute slice metrics across sensitive groups (where legally/ethically appropriate): equalized odds (TPR/FPR parity), demographic parity, calibration-within-group. Report disparities; they rarely all hold simultaneously, so I'd name which fairness criterion the use case demands.
- Audit proxies: a feature correlated with a protected attribute can encode it even if the attribute is dropped. Use the same univariate/SHAP tooling to detect proxies.
- Mitigation options: reweighting, slice-specific thresholds, or constrained optimization, chosen with stakeholders, not unilaterally.

**Monitoring hooks.**

- **Input monitoring:** per-feature distribution stats and missingness rates vs. a training baseline (PSI/KL); alert on drift.
- **Prediction monitoring:** score distribution and positive-rate over time.
- **Performance monitoring:** once labels arrive, rolling PR-AUC/recall/calibration by slice; alert on regression.
- **Operational:** latency, error rate, fallback to base-rate prediction on failure.
- **Retraining trigger:** scheduled + drift/performance-triggered, with a champion/challenger (shadow) evaluation before promotion and the ability to roll back to the previous versioned artifact.

---

### What ties it together

If I had to compress the whole answer to one sentence for the interviewer: **split by time, fit every transform inside CV on training folds only, choose PR-AUC plus a cost-driven threshold over accuracy, validate explanations with ablations, and ship the pipeline plus its threshold behind drift and fairness monitoring.** The code is secondary; the discipline of *what you fit on what data, and what you measure* is the actual deliverable.
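
For the input-monitoring hook listed under section 10, a population stability index (PSI) check can be sketched in plain Python. The equal-width binning over the baseline range, the `eps` floor for empty bins, and the function name `psi` are assumptions of this sketch; production systems often bin on baseline quantiles instead.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population stability index of `actual` against the `expected` baseline.

    PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i), where e_i and a_i are
    the fractions of baseline and live values falling in bin i.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant baseline

    def fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays finite.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]   # training-time feature values
shifted = [v + 0.5 for v in baseline]      # the same feature after a mean shift
psi_same = psi(baseline, baseline)
psi_shift = psi(baseline, shifted)
print(psi_same, psi_shift)
```

An unchanged feature scores 0, and the shifted one scores far higher. A commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but the alert threshold is a policy choice that should be tuned per feature.
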

Jul 31, 2025, 12:00 AM