Define QKV for recommender cross-attention
Company: TikTok
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: hard
Interview Round: Technical Screen
You are designing a deep-learning–based recommendation system that uses a Transformer-style **cross-attention** block to model the interaction between a user and a candidate item.
The model has these typical inputs:
- A **user behavior sequence**: a list of items the user has interacted with in the past, each already embedded as a vector (e.g., size `d`).
- A **candidate item** whose relevance score you want to predict, also embedded as a vector of size `d`.
- Optional **context features** (time, device, location, etc.) that can also be embedded.
You decide to use a cross-attention layer somewhere in the model rather than only self-attention.
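Before answering, it helps to fix the shapes involved. A minimal single-head scaled dot-product attention sketch in NumPy (the names `H`, `c`, and `attention` are illustrative, not part of the question):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # shared embedding size (the `d` above)
T = 8   # length of the user behavior sequence

H = rng.normal(size=(T, d))  # user behavior sequence: T item embeddings
c = rng.normal(size=(1, d))  # candidate item embedding, kept as a 1 x d matrix

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (n_q, n_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)            # softmax over keys
    return w @ V                                      # (n_q, d_v)

out = attention(c, H, H)  # candidate attends over the history: shape (1, d)
```

The only hard constraint this exposes is dimensional: Q and K must share the key dimension so `Q @ K.T` is defined, while V determines the output dimension.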
1. Propose a concrete way to define the **Query (Q)**, **Key (K)**, and **Value (V)** tensors in this cross-attention block using the inputs above. Explain what each of Q, K, and V represents semantically.
2. Give at least **two different reasonable design choices** for how to set up Q, K, and V (for example, one where the candidate item is the query and one where the user history is the query). For each design, explain:
- What is used as Q, K, and V.
- What interaction the attention mechanism is modeling.
- Pros and cons or when that design is preferable.
3. Briefly explain how cross-attention here differs from self-attention within the user behavior sequence, and why cross-attention can be useful in recommendation systems.
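For concreteness, the two design choices asked about in question 2 can be sketched side by side. This is a single-head toy with random stand-in projections; `W_q`, `W_k`, `W_v`, and the context embedding `x` are illustrative assumptions, not the required answer:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 16, 8
H = rng.normal(size=(T, d))  # user history: T item embeddings
c = rng.normal(size=(1, d))  # candidate item embedding
x = rng.normal(size=(1, d))  # optional context embedding (time, device, ...)

# Learned linear projections in a real model; random stand-ins here.
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

# Design A: candidate is the query; history supplies keys and values.
# Output: one vector summarizing the history *as seen by* this candidate,
# i.e. a candidate-aware user-interest representation.
user_repr = attention(c @ W_q, H @ W_k, H @ W_v)   # shape (1, d)

# Design B: history items are the queries; candidate (plus context) supplies
# keys and values. Note that with a single key the softmax is degenerate
# (weight 1), so this direction only becomes interesting once the K/V set
# contains more than one vector, e.g. candidate stacked with context.
kv = np.vstack([c, x])                              # (2, d) key/value set
hist_repr = attention(H @ W_q, kv @ W_k, kv @ W_v)  # shape (T, d)
```

Design A yields a fixed-size user vector regardless of `T`, which is convenient for a downstream scoring MLP; Design B keeps a per-history-item representation, which is useful when later layers pool or further attend over the sequence.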
Quick Answer: This question tests understanding of Transformer-style cross-attention and the concrete design of Query, Key, and Value tensors in a deep-learning recommender: what each tensor represents semantically, how the embeddings must align dimensionally, and how the chosen attention direction shapes the interaction modeled between user history, the candidate item, and context.