What difficulty level is this interview question?

This is a medium-difficulty Machine Learning question, commonly asked during onsite rounds at Creditkarma.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at Creditkarma during technical interviews.

Debug Sparse Multi-Task Ranking Models | Creditkarma Interview Question

Q: Debug Sparse Multi-Task Ranking Models

This question evaluates a candidate's ability to debug multi-task ranking models in production. It focuses on training stability, extreme label sparsity, loss and optimization choices, and training–serving feature pipelines, and it tests both practical application and conceptual understanding.

You are a Machine Learning Engineer training a multi-task ranking model for a sparse recommendation funnel at a fintech product. A single model predicts several funnel outcomes per candidate item — click, application, conversion, and approval — and the scores are combined to rank items.

Two problems have surfaced:

Training instability. Training loss decreases for the first few hundred steps, then becomes unstable or turns into NaN.
Offline–online gap. Offline metrics (e.g. AUC, log loss) look strong, but the model underperforms in a live online experiment.

Walk through how you would debug both problems. Your answer should address: likely causes of NaN/unstable loss in a multi-task model; a systematic debugging procedure for instability that appears after several hundred steps; suitable losses and optimization approaches for sparse conversion prediction; how to investigate training–serving skew; where to look first when online feature distributions differ from training distributions; and which architecture components to inspect when chasing feature skew.
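One way to make "instability that appears after several hundred steps" concrete is to log the loss of every task head separately and locate the first step at which any of them misbehaves. The sketch below assumes such a per-step, per-task loss log exists; the function name `first_divergence`, the spike rule, and its thresholds are illustrative choices, not part of the question.

```python
import math

def first_divergence(history, spike_factor=10.0, window=20):
    """Return (step, task, reason) for the earliest anomaly, or None.

    history: list of dicts, one per training step, mapping task name -> loss.
    An anomaly is a non-finite loss, or a loss above `spike_factor` times the
    mean of that task's previous `window` values.
    """
    recent = {}  # task name -> list of finite losses seen so far
    for step, losses in enumerate(history):
        for task, value in losses.items():
            if not math.isfinite(value):
                return step, task, "non-finite"
            past = recent.setdefault(task, [])
            if len(past) >= window:
                baseline = sum(past[-window:]) / window
                if value > spike_factor * baseline:
                    return step, task, "spike"
            past.append(value)
    return None

# Synthetic log: click loss falls smoothly, conversion loss spikes at step 300
# and only becomes NaN one step later.
history = [{"click": 0.7 - 0.001 * s, "conversion": 0.05} for s in range(300)]
history.append({"click": 0.4, "conversion": 3.0})
history.append({"click": 0.4, "conversion": float("nan")})
print(first_divergence(history))  # (300, 'conversion', 'spike')
```

Catching the spike one step before the NaN matters because the batch, gradients, and activations at that step are still finite and can be inspected.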

Constraints & Assumptions

Funnel sparsity is severe. Downstream labels are increasingly rare: clicks are common, but conversion/approval positive rates are on the order of $0.1\%$ to $0.5\%$.
Delayed labels. Conversion and approval outcomes arrive days or weeks after the impression; some are still censored at training time.
Shared-bottom multi-task architecture with a shared feature/embedding trunk and one prediction head per task. High-cardinality categorical features (user id, item id, merchant) go through embedding tables.
Production serving uses an online feature store; training reads features from an offline store/warehouse. The two stores are populated by partly different pipelines.
Optimizer is adaptive (Adam/AdamW family); training may use mixed precision.
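To get a feel for what positive rates of this size mean numerically, the following sketch computes the log-odds of the stated base rates and compares the expected log loss of a head that starts at a neutral prediction of 0.5 with one that starts at the base rate. Initializing a head's output bias to the base-rate log-odds is a common mitigation for rare labels; it is shown here as an assumption, not as something the question prescribes, and the helper names are illustrative.

```python
import math

def prior_logit(p):
    """Log-odds of a base rate p; a candidate initial bias for a sparse head."""
    return math.log(p / (1.0 - p))

def bce(p_pred, p_true):
    """Expected binary cross-entropy when labels are Bernoulli(p_true)."""
    return -(p_true * math.log(p_pred) + (1.0 - p_true) * math.log(1.0 - p_pred))

for rate in (0.001, 0.005):
    bias = prior_logit(rate)
    print(f"rate={rate}: bias={bias:.3f}, "
          f"loss at zero bias={bce(0.5, rate):.4f}, "
          f"loss at prior bias={bce(rate, rate):.4f}")
```

The biases come out at roughly -6.9 and -5.3. A head that starts at zero bias begins with a loss near $\ln 2 \approx 0.69$, far above the loss at the base rate, so its early gradients into the shared trunk are dominated by simply learning that positives are rare.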

Clarifying Questions to Ask

What exactly does "unstable" mean: NaN, Inf, a sudden loss spike, or oscillation? Is it deterministic at the same step across reruns?
Is the offline–online gap a comparison of the same metric, or of different ones (e.g. offline AUC vs. an online business KPI)? Is the online experiment correctly bucketed and adequately powered?
How are the per-task losses weighted today, and were any features, shards, or label pipelines changed shortly before the instability appeared?
What is the attribution/label-maturity window for conversion and approval, and how are still-censored examples labeled at training time?
Are offline and online features produced by the same transformation code, or two separate implementations? Is there a logged "serving features" dataset we can replay?
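If a logged "serving features" dataset does exist, the replay can be as simple as joining it to the offline features by request and counting disagreements per feature. The sketch below assumes both sides can be keyed by a shared request id; the function name and the feature names in the usage example are hypothetical.

```python
import math

def feature_mismatch_rates(offline_rows, serving_rows, rel_tol=1e-6):
    """Per-feature fraction of joined requests where the two stores disagree.

    Both inputs map a request id to a {feature_name: value} dict. Only request
    ids present on both sides are compared; a feature missing on one side
    counts as a mismatch.
    """
    counts, mismatches = {}, {}
    for request_id in offline_rows.keys() & serving_rows.keys():
        off, srv = offline_rows[request_id], serving_rows[request_id]
        for name in off.keys() | srv.keys():
            counts[name] = counts.get(name, 0) + 1
            a, b = off.get(name), srv.get(name)
            if isinstance(a, float) and isinstance(b, float):
                same = math.isclose(a, b, rel_tol=rel_tol)
            else:
                same = a == b
            if not same:
                mismatches[name] = mismatches.get(name, 0) + 1
    return {name: mismatches.get(name, 0) / counts[name] for name in counts}

# Hypothetical feature names; the serving side returns a constant default.
offline = {"r1": {"credit_score_bucket": 7, "days_since_signup": 30.0},
           "r2": {"credit_score_bucket": 3, "days_since_signup": 12.0}}
serving = {"r1": {"credit_score_bucket": 0, "days_since_signup": 30.0},
           "r2": {"credit_score_bucket": 0, "days_since_signup": 12.0}}
rates = feature_mismatch_rates(offline, serving)
print(sorted(rates.items(), key=lambda kv: -kv[1]))
```

Sorting by mismatch rate puts the features most likely to explain an offline–online gap at the top of the list.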

What a Strong Answer Covers

A localization-first mindset — reproducing the failure and isolating the first point of divergence with targeted instrumentation, rather than reciting a generic list of causes.
Breadth across the relevant failure modes for instability — numerics, optimization dynamics, data/inputs, and multi-task interaction — paired with a concrete bisection strategy to pin down which one fired.
Imbalance- and delay-aware loss/optimization design for an extreme positive rate, treating the choice of per-task loss and the handling of censored/delayed labels as distinct decisions.
A disciplined offline–online methodology that distinguishes the candidate explanations and proposes a single decisive test to cut between them.
A pipeline-level mental model of feature skew — the path a feature travels from event to served prediction — and an appreciation for the logging needed to reconstruct any single prediction after the fact.
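The point about treating the per-task loss and the handling of censored labels as separate decisions can be sketched as two small functions: a numerically stable binary cross-entropy on raw logits with an optional positive-class weight, and a reduction that skips examples whose label has not matured. Skipping censored examples is only the simplest of several possible treatments and is an assumption here, as are the names `bce_with_logits`, `masked_task_loss`, `pos_weight`, and `matured`.

```python
import math

def bce_with_logits(logit, label, pos_weight=1.0):
    """Binary cross-entropy on a raw logit, stable for large |logit|.

    Uses softplus(x) = max(x, 0) + log1p(exp(-|x|)), so exp never overflows.
    """
    softplus_pos = max(logit, 0.0) + math.log1p(math.exp(-abs(logit)))  # -log(1 - sigmoid)
    softplus_neg = softplus_pos - logit                                  # -log(sigmoid)
    return pos_weight * label * softplus_neg + (1.0 - label) * softplus_pos

def masked_task_loss(logits, labels, matured, pos_weight=1.0):
    """Mean loss over examples with matured labels; censored ones are skipped."""
    total, n = 0.0, 0
    for logit, label, ok in zip(logits, labels, matured):
        if ok:
            total += bce_with_logits(logit, label, pos_weight)
            n += 1
    return total / n if n else 0.0

# An extreme logit stays finite, and the censored second example is ignored.
print(bce_with_logits(1000.0, 0.0))
print(masked_task_loss([0.0, 50.0], [1.0, 0.0], [True, False]))
```

Keeping the mask outside the loss function means the imbalance treatment (`pos_weight`, or a different loss entirely) can be changed without touching the label-maturity logic.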

Follow-up Questions

The instability disappears when you disable mixed precision but you need it for throughput — how do you keep mixed precision and stability?
Your offline replay shows a single high-importance feature is consistently a constant default at serving time but populated offline. What's your remediation, and how would you have caught it earlier?
Conversion labels mature 14 days after impression. How does this delay change both your loss/label construction and your offline evaluation protocol?
How would you set up continuous monitoring so the next training–serving skew is detected automatically before it ships?
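For the monitoring question, one widely used drift score is the population stability index (PSI), computed per feature between a training-time histogram and a serving-time histogram over the same bins. The sketch below is a minimal version under that assumption; the binning and the alert threshold are left to the reader and would need tuning per feature.

```python
import math

def psi(train_counts, serve_counts, eps=1e-6):
    """Population stability index between two histograms over the same bins."""
    train_total, serve_total = sum(train_counts), sum(serve_counts)
    score = 0.0
    for t, s in zip(train_counts, serve_counts):
        p = max(t / train_total, eps)  # training-side bin share
        q = max(s / serve_total, eps)  # serving-side bin share
        score += (q - p) * math.log(q / p)
    return score

# Identical histograms score zero; a feature that collapses to a single
# default bin at serving time scores very high.
print(psi([25, 25, 25, 25], [25, 25, 25, 25]))
print(psi([25, 25, 25, 25], [100, 0, 0, 0]))
```

A feature that is populated offline but served as a constant default, as in the second follow-up, shows up as all serving mass in one bin, which this score penalizes heavily.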
