Debug Sparse Multi-Task Ranking Models
Company: Creditkarma
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: medium
Interview Round: Onsite
You are a Machine Learning Engineer training a **multi-task ranking model** for a sparse recommendation funnel at a fintech product. A single model predicts several funnel outcomes per candidate item — **click**, **application**, **conversion**, and **approval** — and the scores are combined to rank items.
Two problems have surfaced:
1. **Training instability.** Training loss decreases for the first few hundred steps, then becomes unstable or turns into `NaN`.
2. **Offline–online gap.** Offline metrics (e.g. AUC, log loss) look strong, but the model underperforms in a live online experiment.
Walk through how you would debug both problems. Your answer should address:
- Likely causes of `NaN`/unstable loss in a multi-task model.
- A systematic debugging procedure for instability that appears after several hundred steps.
- Suitable losses and optimization approaches for sparse conversion prediction.
- How to investigate training–serving skew.
- Where to look first when online feature distributions differ from training distributions.
- Which architecture components to inspect when chasing feature skew.
```hint The shape of the failure is a clue
A config bug that's wrong from the start usually blows up immediately. A failure that waits a few hundred steps must be triggered by *something that only becomes true later*. What kinds of things change as training progresses — and what would you have to capture at the failure step to tell which one fired?
```
```hint Localize before you theorize
There's a long menu of plausible causes here. Before reciting it, ask what single piece of evidence would let you throw most of the menu away. What's the cheapest experiment that separates a model/loss bug from a data bug?
```
```hint Severe imbalance reshapes the loss question
At a $0.1\%$ positive rate, think about what a naive loss does to the few positives — and, separately, what it even *means* to call an example "negative" when its conversion label may simply not have arrived yet. Are those two issues the same problem?
```
```hint Strong offline, weak online is rarely the model
If the model trains well and scores well offline, the discrepancy probably lives between the two environments rather than inside the network. List the ways "offline" and "online" can see different inputs or be judged by different yardsticks — then ask which one test could distinguish "the model sees different data online" from "the data's fine but my offline measurement was optimistic."
```
```hint Skew is plumbing, not math
If a feature is computed differently in the two worlds, the bug isn't in the loss — it's somewhere along the path a value travels before it reaches the model. Picture that path end to end and ask, at each handoff, what could make the same logical feature come out differently.
```
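One way to act on the first two hints is to snapshot a few checkpoints at every step and, when training fails, report the earliest one that contains a non-finite value. The sketch below is a minimal, framework-agnostic illustration: the checkpoint names and the `first_nonfinite` helper are hypothetical, and in a real trainer the values would come from tensors rather than Python lists.

```python
import math

def first_nonfinite(named_values):
    """Return the name of the first checkpoint containing NaN/Inf, or None.

    `named_values` maps hypothetical checkpoint names (inputs, per-task
    logits, per-task losses) to lists of floats, inserted in the order
    the values are produced during the step.
    """
    for name, values in named_values.items():
        if any(not math.isfinite(v) for v in values):
            return name
    return None

# Illustrative snapshot captured at the failing step (made-up values).
step_snapshot = {
    "inputs/income": [52000.0, 61000.0],
    "logits/click": [0.3, -1.2],
    "logits/approval": [4.1, float("inf")],
    "loss/approval": [float("nan")],
}
print(first_nonfinite(step_snapshot))  # logits/approval
```

If the earliest non-finite checkpoint is an input, the problem more likely sits in the data; if the inputs are clean and a logit or loss goes first, the model, loss, or optimizer is the more likely suspect.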
### Constraints & Assumptions
- **Funnel sparsity is severe.** Labels become rarer at each stage down the funnel: clicks are common, but conversion/approval positive rates are on the order of $0.1\%$–$0.5\%$.
- **Delayed labels.** Conversion and approval outcomes arrive days or weeks after the impression; some are still censored at training time.
- **Shared-bottom multi-task architecture** with a shared feature/embedding trunk and one prediction head per task. High-cardinality categorical features (user id, item id, merchant) go through embedding tables.
- **Production serving** uses an online feature store; training reads features from an offline store/warehouse. The two stores are populated by partly different pipelines.
- Optimizer is adaptive (Adam/AdamW family); training may use mixed precision.
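Two of these constraints bear directly on how the per-task loss is built. A loss computed from probabilities can hit `log(0)` when a logit saturates, and a censored label should not be silently treated as a negative. The sketch below shows one possible treatment, assuming a hypothetical per-example `matured` flag that marks whether the label window has closed. It uses the standard numerically stable form of binary cross-entropy on logits and simply excludes censored examples. Masking is only one option; reweighting or explicit delayed-feedback modeling are alternatives.

```python
import math

def bce_with_logits(logit, label):
    """Binary cross-entropy computed from the logit, stable for large |logit|.

    Equivalent to -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))], rewritten
    as max(z, 0) - z*y + log(1 + exp(-|z|)) so exp() never overflows.
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def masked_task_loss(logits, labels, matured):
    """Mean loss over examples whose label has matured; censored ones are skipped."""
    losses = [bce_with_logits(z, y) for z, y, m in zip(logits, labels, matured) if m]
    return sum(losses) / len(losses) if losses else 0.0

# A saturated logit stays finite instead of producing log(0).
print(bce_with_logits(1000.0, 0.0))  # 1000.0
# The third example is still inside its label window (hypothetical flag), so it is ignored.
print(masked_task_loss([0.0, 0.0, 5.0], [1.0, 0.0, 0.0], [True, True, False]))
```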
### Clarifying Questions to Ask
- What exactly does "unstable" mean — `NaN`, `Inf`, a sudden loss spike, or oscillation? Is it deterministic at the same step across reruns?
- Is the offline–online gap measured on the *same* metric, or is an offline metric such as AUC being compared against an online business KPI? Is the online experiment correctly bucketed, with enough statistical power?
- How are the per-task losses weighted today, and were any features, shards, or label pipelines changed shortly before the instability appeared?
- What is the attribution/label-maturity window for conversion and approval, and how are still-censored examples labeled at training time?
- Are offline and online features produced by the same transformation code, or two separate implementations? Is there a logged "serving features" dataset we can replay?
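If a logged serving-features dataset does exist, a per-feature distribution comparison against the offline training values is a cheap first pass. The sketch below computes the Population Stability Index (PSI) over fixed bin edges. The bin edges, sample values, and function name are illustrative assumptions, and any alert threshold would need to be calibrated for the feature at hand.

```python
import math

def psi(offline, serving, edges, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature.

    `edges` are ascending bin boundaries; values fall into len(edges) + 1 bins.
    Bin fractions are floored at `eps` so empty bins do not cause log(0).
    """
    def bin_fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v > e for e in edges)] += 1
        return [max(c / len(values), eps) for c in counts]

    p, q = bin_fractions(offline), bin_fractions(serving)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

# Made-up values for a single numeric feature.
offline_scores = [620, 680, 700, 710, 745, 790]
# Hypothetical serving log where the feature mostly falls back to a default of 0.
serving_scores = [0, 0, 0, 0, 705, 760]
edges = [600, 700, 750]

print(psi(offline_scores, offline_scores, edges))  # 0.0
# 0.25 is a commonly quoted rule of thumb, used here only for illustration.
print(psi(offline_scores, serving_scores, edges) > 0.25)  # True
```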
### What a Strong Answer Covers
- **A localization-first mindset** — reproducing the failure and isolating the first point of divergence with targeted instrumentation, rather than reciting a generic list of causes.
- **Breadth across the relevant failure modes** for instability — numerics, optimization dynamics, data/inputs, and multi-task interaction — paired with a concrete bisection strategy to pin down which one fired.
- **Imbalance- and delay-aware loss/optimization design** for an extreme positive rate, treating the choice of per-task loss and the handling of censored/delayed labels as distinct decisions.
- **A disciplined offline–online methodology** that distinguishes the candidate explanations and proposes a single decisive test to cut between them.
- **A pipeline-level mental model of feature skew** — the path a feature travels from event to served prediction — and an appreciation for the logging needed to reconstruct any single prediction after the fact.
### Follow-up Questions
- The instability disappears when you disable mixed precision, but you need mixed precision for throughput. How do you keep mixed precision *and* stability?
- Your offline replay shows that a single high-importance feature is always a constant default at serving time but is populated offline. What's your remediation, and how would you have caught it earlier?
- Conversion labels mature 14 days after impression. How does this delay change both your loss/label construction and your offline evaluation protocol?
- How would you set up continuous monitoring so the *next* training–serving skew is detected automatically before it ships?
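For the 14-day maturity follow-up, it can help to make the three possible label states explicit before deciding how each is used in training and evaluation. The sketch below is one possible labeling rule, not a prescribed one: it assumes a conversion is attributed only if it occurs within the window, and all names are hypothetical.

```python
from datetime import datetime, timedelta

MATURITY_WINDOW = timedelta(days=14)  # taken from the follow-up question

def label_state(impression_at, converted_at, as_of):
    """Classify an impression as 'positive', 'negative', or 'censored' as of a given time."""
    observed = converted_at is not None and converted_at <= as_of
    if observed and converted_at - impression_at <= MATURITY_WINDOW:
        return "positive"
    if as_of - impression_at >= MATURITY_WINDOW:
        return "negative"  # window closed with no attributed conversion
    return "censored"      # window still open; outcome unknown

as_of = datetime(2024, 3, 31)  # illustrative snapshot date
print(label_state(datetime(2024, 3, 1), datetime(2024, 3, 9), as_of))  # positive
print(label_state(datetime(2024, 3, 1), None, as_of))                  # negative
print(label_state(datetime(2024, 3, 25), None, as_of))                 # censored
```

Under this rule, an offline evaluation set would include only impressions at least 14 days older than the evaluation snapshot, so that no censored example is scored as a negative.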
Quick Answer: This question evaluates a candidate's ability to debug multi-task ranking models in production: diagnosing training instability, handling extreme label sparsity and delayed labels, choosing losses and optimization approaches, and tracing training–serving feature skew. It tests both practical application and conceptual understanding.