Compare DCN v1 vs v2 and A/B test
Company: Apple
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: medium
Interview Round: Technical Screen
You are an ML engineer on a recommender/ads team building a CTR/CVR prediction model with a **Deep & Cross Network (DCN)**. Your team currently serves **DCN v1** and is considering migrating to **DCN v2**. You need to articulate the modeling tradeoffs and then design the online experiment that decides whether the new model ships.
### Constraints & Assumptions
- Large-scale industrial setting: high-cardinality sparse categorical features (user/item/context IDs) represented as embeddings, plus dense features.
- The model outputs a calibrated probability (CTR and/or CVR) consumed by a downstream ranking/bidding stage.
- Online serving has a strict latency budget (e.g., p99 within a few milliseconds for the model forward pass).
- You have an experimentation platform that supports sticky, hash-based bucketing and standard frequentist analysis.
### Clarifying Questions to Ask
- What is the **objective** the downstream system optimizes — ranking quality, revenue/ROAS, or a multi-objective blend — and what is the single business KPI the A/B test must move?
- How **calibration-sensitive** is the consumer of the score (e.g., a second-price auction or pacing controller that needs absolute pCTR, not just relative order)?
- What is the available **training data volume** and feature cardinality, and is the current v1 model **underfitting** (metrics plateau) or **overfitting**?
- What is the **serving latency and parameter budget**, and is there headroom for a larger cross network or low-rank approximations?
- What is the **randomization granularity** the platform supports (user, device, request) and is there shared state (auctions, pacing, budgets) that couples units?
- How fast does the **training loop ingest serving logs** — does the candidate model's own exposures feed back into its next training window during the experiment?
### Part A — DCN v1 vs DCN v2
Explain the **key architectural differences** between **DCN v1** and **DCN v2** (the cross-network design), and for each version discuss (1) what feature interactions it models well, (2) training/serving cost and stability, and (3) when you would prefer it in production.
```hint Cross layer math
Write out the cross-layer update for each version. v1's per-layer weight is a **vector** $\mathbf{w}_l$, so the cross term scales $\mathbf{x}_0$ by a single scalar $\mathbf{x}_l^\top\mathbf{w}_l$ — think about the *rank* of that interaction.
```
```hint What v2 changes
v1's rank bottleneck comes from its scalar gate. Consider how raising the *rank* of the per-layer interaction term would change what feature combinations the cross network can represent — and what that implies for parameter count and serving cost.
```
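For reference, the cross-layer updates these two hints point at can be written out as below. This follows the commonly published formulation and reuses the hint's notation: $\mathbf{x}_0$ is the $d$-dimensional input (concatenated embeddings plus dense features) and $\mathbf{x}_l$ is the output of cross layer $l$. The bias $\mathbf{b}_l$, the matrix $W_l$, and the low-rank factors $U_l, V_l$ with rank $r$ are symbols introduced here for illustration.

$$
\begin{aligned}
\text{v1:}\quad & \mathbf{x}_{l+1} = \mathbf{x}_0\,(\mathbf{x}_l^\top \mathbf{w}_l) + \mathbf{b}_l + \mathbf{x}_l, && \mathbf{w}_l, \mathbf{b}_l \in \mathbb{R}^{d} \\
\text{v2:}\quad & \mathbf{x}_{l+1} = \mathbf{x}_0 \odot (W_l\,\mathbf{x}_l + \mathbf{b}_l) + \mathbf{x}_l, && W_l \in \mathbb{R}^{d \times d} \\
\text{v2, low-rank:}\quad & \mathbf{x}_{l+1} = \mathbf{x}_0 \odot \big(U_l\,(V_l^\top \mathbf{x}_l) + \mathbf{b}_l\big) + \mathbf{x}_l, && U_l, V_l \in \mathbb{R}^{d \times r},\ r \ll d
\end{aligned}
$$

Per cross layer this gives roughly $2d$ parameters for v1, $d^2 + d$ for v2, and $2dr + d$ for the low-rank variant.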
```hint Architecture topology
Don't forget the two ways the cross and deep towers can be combined (one feeding the other vs. side-by-side and concatenated) — and note that this topological choice is largely orthogonal to which cross-layer design you pick.
```
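The two topologies can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes v2-style cross layers and a plain ReLU MLP, omits the final logit layer, and all function names and the `(W, b)` weight layout are hypothetical.

```python
import numpy as np

def cross_stack(x0, cross_weights):
    """Apply v2-style cross layers: x_{l+1} = x0 * (W x_l + b) + x_l."""
    x = x0
    for W, b in cross_weights:
        x = x0 * (x @ W.T + b) + x
    return x

def deep_tower(x, deep_weights):
    """Plain ReLU MLP; each W has shape (out_dim, in_dim)."""
    for W, b in deep_weights:
        x = np.maximum(x @ W.T + b, 0.0)
    return x

def stacked(x0, cross_weights, deep_weights):
    """Stacked: the cross network's output is the deep tower's input."""
    return deep_tower(cross_stack(x0, cross_weights), deep_weights)

def parallel(x0, cross_weights, deep_weights):
    """Parallel: both towers read x0 and their outputs are concatenated."""
    cross_out = cross_stack(x0, cross_weights)
    deep_out = deep_tower(x0, deep_weights)
    return np.concatenate([cross_out, deep_out], axis=1)

# Toy shapes (assumed): batch of 3, input width 8, one hidden layer of width 4.
rng = np.random.default_rng(0)
batch, d, hidden = 3, 8, 4
x0 = rng.normal(size=(batch, d))
cross_weights = [(rng.normal(size=(d, d)) * 0.1, np.zeros(d)) for _ in range(2)]
deep_weights = [(rng.normal(size=(hidden, d)) * 0.1, np.zeros(hidden))]

stacked_out = stacked(x0, cross_weights, deep_weights)
parallel_out = parallel(x0, cross_weights, deep_weights)
print(stacked_out.shape, parallel_out.shape)
```

Swapping `cross_stack` for a v1-style layer leaves both wrappers unchanged, which is the sense in which the topology choice is orthogonal to the cross-layer design.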
#### What This Part Should Cover
- Correct, explicit cross-layer formulas for v1 (vector weight, rank-1 cross) and v2 (matrix weight with elementwise product), and why v2 is strictly more expressive.
- The low-rank / mixture-of-experts variant of v2 and the cost/quality tradeoff it enables.
- Stacked vs. parallel deep-and-cross topologies.
- Parameter count, FLOPs, latency, overfitting/regularization, and a clear "prefer v1 when… / prefer v2 when…" decision grounded in data scale and latency budget.
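The points above can be made concrete with a small NumPy sketch of the three cross-layer variants and their per-layer parameter counts. It assumes batched inputs of shape `(batch, d)` and the commonly published update rules; the function names and the rank `r` are illustrative choices, not part of the question.

```python
import numpy as np

def cross_v1(x0, xl, w, b):
    """DCN v1: x0 * (xl . w) + b + xl, where w is a vector of shape (d,)."""
    gate = xl @ w                       # shape (batch,): one scalar per example
    return x0 * gate[:, None] + b + xl

def cross_v2(x0, xl, W, b):
    """DCN v2: x0 * (W xl + b) + xl (elementwise product), W of shape (d, d)."""
    return x0 * (xl @ W.T + b) + xl

def cross_v2_low_rank(x0, xl, U, V, b):
    """Low-rank DCN v2: W is replaced by U @ V.T, with U and V of shape (d, r)."""
    return x0 * ((xl @ V) @ U.T + b) + xl

def params_per_layer(d, r):
    """Weights plus bias for one cross layer of each variant."""
    return {"v1": 2 * d, "v2": d * d + d, "v2_low_rank": 2 * d * r + d}

rng = np.random.default_rng(0)
batch, d, r = 4, 16, 2
x0 = rng.normal(size=(batch, d))
xl = rng.normal(size=(batch, d))
w = rng.normal(size=d)
U = rng.normal(size=(d, r))
V = rng.normal(size=(d, r))
zero_b = np.zeros(d)

# With zero biases, v1 matches v2 when every row of W equals w (a rank-1 W).
W_rank1 = np.tile(w, (d, 1))
v1_matches_rank1_v2 = np.allclose(
    cross_v1(x0, xl, w, zero_b), cross_v2(x0, xl, W_rank1, zero_b)
)

# The low-rank layer matches the full layer whose matrix is U @ V.T.
low_rank_matches_full = np.allclose(
    cross_v2_low_rank(x0, xl, U, V, zero_b), cross_v2(x0, xl, U @ V.T, zero_b)
)

print(v1_matches_rank1_v2, low_rank_matches_full)
print(params_per_layer(d=1024, r=64))
```

At `d = 1024`, a full v2 layer carries about a million parameters against roughly two thousand for v1, and the low-rank variant with `r = 64` sits in between. That gap is the cost side of the "prefer v1 when / prefer v2 when" decision.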
### Part B — Online A/B Test for the New Model
Design an **end-to-end A/B test** to decide whether DCN v2 replaces DCN v1: experiment design (randomization unit, traffic split, duration), primary and guardrail metrics, how you handle novelty effects / interference / learning-to-rank feedback loops, and how you determine significance and decide to launch or roll back.
```hint Pick the unit first
The randomization unit is the load-bearing decision. Ask what shared resource could leak between treatment and control (auctions, budgets, pacing, a shared candidate pool) — that determines whether user/device-level randomization is enough or you need cluster/budget-split designs.
```
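Once the unit is chosen, sticky hash-based bucketing can be sketched as follows. This is a minimal illustration assuming user-level randomization; `assign_arm`, the salt format, and the 10,000-bucket resolution are hypothetical and do not describe any specific experimentation platform.

```python
import hashlib

NUM_BUCKETS = 10_000  # assumed resolution: 0.01% of traffic per bucket

def bucket_of(unit_id: str, experiment_salt: str) -> int:
    """Map a unit to a stable bucket; the salt decorrelates experiments."""
    key = f"{experiment_salt}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

def assign_arm(unit_id: str, experiment_salt: str, treatment_fraction: float) -> str:
    """Units whose bucket falls below the threshold get the candidate model."""
    threshold = round(treatment_fraction * NUM_BUCKETS)
    return "treatment" if bucket_of(unit_id, experiment_salt) < threshold else "control"

# The same user always lands in the same arm for a given salt.
print(assign_arm("user-123", "dcn-v2-launch", 0.05))
```

Because assignment is a threshold on a fixed bucket, ramping from 5% to 50% only adds users to treatment and never moves an already-exposed user back to control. This handles stickiness only; interference through shared auctions or budgets still needs a separate design decision.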
```hint Metrics hierarchy
Separate **offline** model metrics (AUC/LogLoss/calibration — diagnostics, not launch criteria) from the **online primary** business KPI and the **guardrails**. Power the test against a pre-registered minimum detectable effect on the primary metric.
```
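A rough power calculation for the primary metric might look like the sketch below. It assumes a per-user binary outcome (for example, "user converted at least once") and the standard normal-approximation formula for comparing two proportions. The function name and the example numbers are illustrative. A ratio metric such as clicks per impression would need a variance estimate from historical per-user data instead.

```python
import math
from statistics import NormalDist

def users_per_arm(baseline_rate, relative_mde, alpha=0.05, power=0.8):
    """Approximate sample size per arm for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1.0 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Assumed example: 2% baseline rate, 2% relative MDE, alpha 0.05, power 0.8.
n = users_per_arm(baseline_rate=0.02, relative_mde=0.02)
print(n)  # on the order of two million users per arm
```

Dividing the required sample by the eligible traffic allocated to each arm gives a lower bound on duration. In practice it is usually rounded up to whole weeks, so that day-of-week effects are balanced and a novelty spike has time to decay.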
```hint Feedback loops
A ranking model changes what users are shown, and therefore what gets logged, and those logs train the next model. Think about exposure bias, consistent logging across arms, and ramp/duration choices to separate a real lift from a novelty spike.
```
#### Clarifying Questions for this Part
- Are budgets/pacing **shared** across the treatment and control populations (the classic interference trap for ads experiments)?
- Will the candidate model be **retrained on its own experiment logs** mid-flight, or is the training data frozen for the duration?
#### What This Part Should Cover
- A justified randomization unit and split/ramp plan, with explicit reasoning about interference and sticky bucketing.
- A pre-registered primary KPI, an MDE-based power/duration calculation, and a concrete guardrail set (latency, errors, calibration, content/policy health).
- Correct handling of novelty effects, exposure bias, and learning-to-rank feedback loops (consistent logging, ramping, counterfactual/interleaving diagnostics where appropriate).
- A sound significance methodology (per-unit aggregation, heavy-tail-aware tests, multiple-comparison and peeking corrections) plus explicit launch/rollback thresholds and post-launch monitoring.
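The per-unit aggregation point can be illustrated with a delta-method z-test for a ratio metric such as CTR under user-level randomization. This is a minimal sketch assuming each arm is summarized as per-user click and impression counts; the function names are hypothetical, and it covers neither heavy-tail handling nor the multiple-comparison and peeking corrections, which would be layered on top.

```python
import math
from statistics import NormalDist, fmean

def ratio_and_variance(clicks, impressions):
    """CTR = sum(clicks) / sum(impressions), with delta-method variance over users."""
    n = len(clicks)
    mean_c, mean_m = fmean(clicks), fmean(impressions)
    ratio = mean_c / mean_m
    var_c = sum((c - mean_c) ** 2 for c in clicks) / (n - 1)
    var_m = sum((m - mean_m) ** 2 for m in impressions) / (n - 1)
    cov = sum((c - mean_c) * (m - mean_m) for c, m in zip(clicks, impressions)) / (n - 1)
    var_ratio = (var_c - 2.0 * ratio * cov + ratio ** 2 * var_m) / (n * mean_m ** 2)
    return ratio, var_ratio

def ctr_z_test(control, treatment):
    """Two-sided z-test on the CTR difference; each arm is (clicks, impressions)."""
    r_c, v_c = ratio_and_variance(*control)
    r_t, v_t = ratio_and_variance(*treatment)
    z = (r_t - r_c) / math.sqrt(v_c + v_t)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return {"relative_lift": r_t / r_c - 1.0, "z": z, "p_value": p_value}

# Toy per-user data (fabricated for illustration only): 200 users per arm.
control = ([i % 2 for i in range(200)], [10] * 200)
treatment = ([1 + i % 2 for i in range(200)], [10] * 200)
result = ctr_z_test(control, treatment)
print(result)
```

Treating every impression as an independent sample would understate the variance, because impressions from the same user are correlated. Aggregating to the randomization unit first is what keeps the test honest.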
### Follow-up Questions
- DCN v2 shows a clear **offline AUC gain**, but the A/B test shows a **flat or negative online primary KPI**. Walk through the diagnoses you would rule out, in order.
- The pCTR consumer is a **second-price auction**. How does a miscalibrated-but-higher-AUC model affect bidding, and how would you detect and fix the calibration regression?
- You can only afford a **fixed parameter/latency budget**. How would you decide between (a) DCN v2 with a low-rank cross network, (b) a deeper v1, and (c) v2 with fewer cross layers — and what offline ablation would inform the choice before any online test?
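For the calibration follow-up, one possible diagnostic is sketched below: the ratio of predicted to observed positives plus an expected calibration error (ECE) over equal-frequency bins. It assumes logged `(pCTR, click)` pairs from one arm; the function name and bin count are illustrative. Equal-frequency bins are used because CTR predictions tend to cluster near zero.

```python
def calibration_report(p_pred, labels, n_bins=10):
    """Return predicted/observed ratio and ECE over equal-frequency bins."""
    positives = sum(labels)
    if positives == 0:
        raise ValueError("need at least one positive label")
    pairs = sorted(zip(p_pred, labels))   # sort by predicted probability
    n = len(pairs)
    ece = 0.0
    for k in range(n_bins):
        chunk = pairs[k * n // n_bins:(k + 1) * n // n_bins]
        if not chunk:
            continue
        mean_p = sum(p for p, _ in chunk) / len(chunk)
        mean_y = sum(y for _, y in chunk) / len(chunk)
        ece += len(chunk) / n * abs(mean_p - mean_y)
    return {"pred_over_observed": sum(p_pred) / positives, "ece": ece}

# Toy data (fabricated): the model predicts 0.2 everywhere but only 10% click.
report = calibration_report([0.2] * 10, [1] + [0] * 9, n_bins=1)
print(report)
```

A ratio well above 1 means the model over-predicts, which in an auction translates into systematically inflated bids. Comparing this report between arms, and per segment, helps localize the regression before a post-hoc recalibration layer is considered.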
Quick Answer: This question tests understanding of the DCN v1 and v2 architectures, feature-interaction modeling, production training and serving tradeoffs, and end-to-end online experimentation for CTR/CVR models in recommender and ads systems. It belongs to the Machine Learning domain, with a focus on ranking and personalization.