What difficulty level is this interview question?

This is a medium difficulty Machine Learning question, commonly asked during Technical Screen rounds at Apple.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at Apple during technical interviews.

Compare DCN v1 vs v2 and A/B test | Apple Interview Question

Q: Compare DCN v1 vs v2 and A/B test

This question tests understanding of deep learning model architectures (DCN v1 vs v2), feature interaction modeling, production training and serving trade-offs, and end-to-end online experimentation for CTR/CVR recommender and ads systems. It belongs to the Machine Learning domain, with a focus on ranking and personalization.

You are an ML engineer on a recommender/ads team building a CTR/CVR prediction model with a Deep & Cross Network (DCN). Your team currently serves DCN v1 and is considering migrating to DCN v2. You need to articulate the modeling tradeoffs and then design the online experiment that decides whether the new model ships.

Constraints & Assumptions

Large-scale industrial setting: high-cardinality sparse categorical features (user/item/context IDs) represented as embeddings, plus dense features.
The model outputs a calibrated probability (CTR and/or CVR) consumed by a downstream ranking/bidding stage.
Online serving has a strict latency budget (e.g., p99 within a few milliseconds for the model forward pass).
You have an experimentation platform that supports sticky, hash-based bucketing and standard frequentist analysis.
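
To make "sticky, hash-based bucketing" concrete, here is a minimal sketch of one way such an assignment could work. The function name `assign_bucket`, the salt string, and the arm size are hypothetical and not taken from any particular experimentation platform; the only idea illustrated is that hashing a stable unit ID with a per-experiment salt gives the same arm on every request.

```python
import hashlib


def assign_bucket(unit_id: str, experiment_salt: str, arm_pct: float = 0.05) -> str:
    """Deterministically map a unit (e.g. a user ID) to an experiment arm.

    Hypothetical helper: the same (salt, unit_id) pair always lands in the
    same arm, which is what makes the assignment "sticky".
    """
    key = f"{experiment_salt}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Interpret the first 32 bits of the hash as a uniform value in [0, 1).
    u = int(digest[:8], 16) / 2**32
    if u < arm_pct:
        return "control"
    if u < 2 * arm_pct:
        return "treatment"
    return "unassigned"
```

Using a different salt per experiment keeps assignments in one experiment independent of assignments in another, and raising `arm_pct` during a ramp only moves previously unassigned units into an arm in this layout, so units already in control stay there but units in the old treatment range may shift; a production platform would typically handle ramps more carefully.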

Clarifying Questions to Ask

What is the objective the downstream system optimizes — ranking quality, revenue/ROAS, or a multi-objective blend — and what is the single business KPI the A/B test must move?
How calibration-sensitive is the consumer of the score (e.g., a second-price auction or pacing controller that needs absolute pCTR, not just relative order)?
What is the available training data volume and feature cardinality, and is the current v1 model underfitting (metrics plateau) or overfitting?
What is the serving latency and parameter budget, and is there headroom for a larger cross network or low-rank approximations?
What is the randomization granularity the platform supports (user, device, request) and is there shared state (auctions, pacing, budgets) that couples units?
How fast does the training loop ingest serving logs — does the candidate model's own exposures feed back into its next training window during the experiment?

Part A — DCN v1 vs DCN v2

Explain the key architectural differences between DCN v1 and DCN v2 (the cross-network design), and for each version discuss (1) what feature interactions it models well, (2) training/serving cost and stability, and (3) when you would prefer it in production.

What This Part Should Cover

Correct, explicit cross-layer formulas for v1 (vector weight, rank-1 cross) and v2 (matrix weight with elementwise product), and why v2 is strictly more expressive.
The low-rank / mixture-of-experts variant of v2 and the cost/quality tradeoff it enables.
Stacked vs. parallel deep-and-cross topologies.
Parameter count, FLOPs, latency, overfitting/regularization, and a clear "prefer v1 when… / prefer v2 when…" decision grounded in data scale and latency budget.
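
As a reference point for the formulas named above, the cross layers as published are: v1, x_{l+1} = x_0 (x_l^T w_l) + b_l + x_l; v2, x_{l+1} = x_0 ⊙ (W_l x_l + b_l) + x_l; and low-rank v2, x_{l+1} = x_0 ⊙ (U_l (V_l^T x_l) + b_l) + x_l. The NumPy sketch below implements one layer of each for a single example vector of dimension d. The function names and the per-layer parameter counter are illustrative assumptions, and the mixture-of-experts gating and batching are left out.

```python
import numpy as np


def cross_v1(x0, xl, w, b):
    # w has shape (d,): xl @ w is a scalar, so the cross term x0 * scalar is rank-1.
    return x0 * (xl @ w) + b + xl


def cross_v2(x0, xl, W, b):
    # W has shape (d, d): each output coordinate gets its own mix of xl
    # before the elementwise product with x0.
    return x0 * (W @ xl + b) + xl


def cross_v2_lowrank(x0, xl, U, V, b):
    # U and V have shape (d, r) with r << d: W is approximated by U @ V.T,
    # computed as two thin matrix-vector products.
    return x0 * (U @ (V.T @ xl) + b) + xl


def cross_layer_params(d, r):
    # Weights plus bias per cross layer, for input dimension d and rank r.
    return {"v1": 2 * d, "v2": d * d + d, "v2_lowrank": 2 * d * r + d}
```

With zero bias, setting W to the rank-1 matrix `np.outer(np.ones(d), w)` makes `cross_v2` reproduce `cross_v1`, which is one way to see why v2 is at least as expressive per layer. The cost side is visible in the counts: for d = 1024 and r = 64, a layer has 2,048 parameters in v1, 1,049,600 in full v2, and 132,096 in low-rank v2.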

Part B — Online A/B Test for the New Model

Design an end-to-end A/B test to decide whether DCN v2 replaces DCN v1: experiment design (randomization unit, traffic split, duration), primary and guardrail metrics, how you handle novelty effects / interference / learning-to-rank feedback loops, and how you determine significance and decide to launch or roll back.

Clarifying Questions for this Part

Are budgets/pacing shared across the treatment and control populations (the classic interference trap for ads experiments)?
Will the candidate model be retrained on its own experiment logs mid-flight, or is the training data frozen for the duration?

What This Part Should Cover

A justified randomization unit and split/ramp plan, with explicit reasoning about interference and sticky bucketing.
A pre-registered primary KPI, an MDE-based power/duration calculation, and a concrete guardrail set (latency, errors, calibration, content/policy health).
Correct handling of novelty effects, exposure bias, and learning-to-rank feedback loops (consistent logging, ramping, counterfactual/interleaving diagnostics where appropriate).
A sound significance methodology (per-unit aggregation, heavy-tail-aware tests, multiple-comparison and peeking corrections) plus explicit launch/rollback thresholds and post-launch monitoring.
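
The MDE-based power calculation and per-unit aggregation mentioned above can be sketched with the standard two-sample normal approximation, n per arm = 2 (z_{1-alpha/2} + z_{power})^2 sigma^2 / delta^2. Everything below is an illustration under stated assumptions: the metric is already aggregated to one value per randomization unit (for example per-user CTR), both arms have similar variance, and the function names are hypothetical. It does not cover heavy-tail handling, multiple-comparison correction, or sequential (peeking) corrections.

```python
import math
from statistics import NormalDist, mean, variance


def units_per_arm(baseline_mean, std_dev, relative_mde, alpha=0.05, power=0.8):
    """Approximate units needed per arm for a two-sided test at the given MDE."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    delta = baseline_mean * relative_mde  # absolute effect to detect
    return math.ceil(2 * (z_alpha + z_power) ** 2 * std_dev**2 / delta**2)


def welch_z_test(control, treatment):
    """Two-sided test on per-unit metric values (one number per user).

    Uses unequal-variance standard errors with a normal reference
    distribution, which is reasonable only for large samples.
    """
    se = math.sqrt(variance(control) / len(control)
                   + variance(treatment) / len(treatment))
    z = (mean(treatment) - mean(control)) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```

For example, with an assumed per-user baseline of 0.05, a per-user standard deviation of 0.1, and a 1% relative MDE, `units_per_arm(0.05, 0.1, 0.01)` comes out at roughly 630,000 users per arm; dividing by the expected number of distinct users entering each arm per day gives a first estimate of duration, which is usually rounded up to whole weeks to cover weekly seasonality.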

Follow-up Questions

Offline evaluation shows a clear AUC gain for v2, but the A/B test shows a flat or negative online primary KPI. Walk through the diagnoses you would rule out, in order.
The pCTR consumer is a second-price auction. How does a miscalibrated-but-higher-AUC model affect bidding, and how would you detect and fix the calibration regression?
You can only afford a fixed parameter/latency budget. How would you decide between (a) DCN v2 with a low-rank cross network, (b) a deeper v1, and (c) v2 with fewer cross layers, and what offline ablation would inform the choice before any online test?
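
For the calibration follow-up, it may help to see why AUC alone cannot detect the regression: any monotone rescaling of the scores leaves the ranking, and therefore AUC, unchanged while shifting every absolute pCTR. The sketch below shows two common diagnostics, the predicted-to-observed ratio and a binned expected calibration error. The function names and the equal-frequency binning are illustrative choices, not a prescribed method.

```python
import numpy as np


def calibration_ratio(p_pred, y):
    """Sum of predicted clicks over sum of observed clicks (1.0 is ideal)."""
    p_pred, y = np.asarray(p_pred, dtype=float), np.asarray(y, dtype=float)
    return p_pred.sum() / y.sum()


def expected_calibration_error(p_pred, y, n_bins=10):
    """Weighted mean |avg prediction - observed rate| over equal-frequency bins."""
    p_pred, y = np.asarray(p_pred, dtype=float), np.asarray(y, dtype=float)
    order = np.argsort(p_pred)
    ece = 0.0
    for idx in np.array_split(order, n_bins):
        if len(idx) == 0:
            continue
        gap = abs(p_pred[idx].mean() - y[idx].mean())
        ece += len(idx) / len(p_pred) * gap
    return ece
```

If a model's scores were uniformly inflated by 1.5x, the ratio would read about 1.5 and the binned error would grow, even though the ordering of impressions is identical. Slicing both diagnostics by segment (placement, advertiser, new vs returning users) is a common next step, since an aggregate ratio near 1.0 can hide offsetting errors.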
