You are asked to explain core Transformer / deep learning components.
Part A — Self-attention pseudocode
Write clear pseudocode (not full code) for scaled dot-product self-attention for a single attention head. Your pseudocode should include:
- Inputs/outputs and tensor shapes (batch size B, sequence length T, model dim d_model, head dim d_k)
- Computing Q, K, V via linear projections
- Computing attention logits and applying scaling
- Softmax and weighted sum
- (Optional but recommended) Handling an attention mask (padding mask or causal mask)
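
For orientation, here is a minimal NumPy sketch of the steps above. The function name self_attention, the projection names W_q, W_k, W_v, and the mask convention (True where attention is allowed) are illustrative assumptions, not a required interface; your answer can stay at the pseudocode level.

```python
import numpy as np

def self_attention(x, W_q, W_k, W_v, mask=None):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    x:             [B, T, d_model] input embeddings
    W_q, W_k, W_v: [d_model, d_k]  projection matrices
    mask:          optional boolean array broadcastable to [B, T, T],
                   True where attention is allowed, False where blocked
    returns:       [B, T, d_k] attended values
    """
    d_k = W_q.shape[-1]

    # Linear projections of the inputs into queries, keys, and values
    Q = x @ W_q  # [B, T, d_k]
    K = x @ W_k  # [B, T, d_k]
    V = x @ W_v  # [B, T, d_k]

    # Attention logits, scaled by 1 / sqrt(d_k)
    logits = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # [B, T, T]

    # Blocked positions (padding or future tokens) get -inf before softmax
    if mask is not None:
        logits = np.where(mask, logits, -np.inf)

    # Softmax over the key dimension (numerically stabilized)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # [B, T, T]

    # Weighted sum of values
    return weights @ V  # [B, T, d_k]
```
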
Part B — Cross-entropy pseudocode
Write pseudocode for multi-class cross-entropy loss for a batch of examples, given:
- Model logits z of shape [B, C]
- Ground-truth labels, either as class indices [B] or one-hot [B, C]
- Return a scalar loss (mean over batch)
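
For reference, a minimal NumPy sketch assuming labels are given as class indices (a one-hot matrix can be reduced to indices with one_hot.argmax(axis=-1)); the name cross_entropy is illustrative:

```python
import numpy as np

def cross_entropy(z, labels):
    """Mean multi-class cross-entropy loss (illustrative sketch).

    z:      [B, C] logits
    labels: [B]    integer class indices
    returns a scalar: the negative log-likelihood averaged over the batch
    """
    # Numerically stable log-softmax: subtract the row-wise max before exp
    z = z - z.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))  # [B, C]

    # Negative log-probability assigned to the true class of each example
    B = z.shape[0]
    nll = -log_probs[np.arange(B), labels]  # [B]

    # Mean over the batch gives the scalar loss
    return nll.mean()
```
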
Part C — Concept questions
- In a Transformer block, what is the role of the position-wise feed-forward network (FFN) relative to attention? Why is it needed? (A short sketch of what "position-wise" means follows these questions.)
- Why do we scale the dot-product attention scores by 1 / sqrt(d_k) before applying softmax? What problem does it address? (A small numeric demo follows below.)
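
To make the first question concrete, here is a minimal sketch of what "position-wise" means; the names W1, b1, W2, b2, d_ff and the ReLU nonlinearity are illustrative assumptions:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network (illustrative sketch).

    x:  [B, T, d_model]
    W1: [d_model, d_ff], b1: [d_ff]
    W2: [d_ff, d_model], b2: [d_model]
    The same two-layer MLP is applied to every position independently;
    unlike attention, no information moves between positions here.
    """
    h = np.maximum(0.0, x @ W1 + b1)  # ReLU hidden layer, [B, T, d_ff]
    return h @ W2 + b2                # back to [B, T, d_model]
```
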
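For the second question, a quick NumPy experiment that illustrates the behavior the scaling is meant to control; the sample size and dimensions are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (4, 64, 1024):
    # Random unit-variance queries and keys
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))

    dots = (q * k).sum(axis=-1)    # raw dot products
    scaled = dots / np.sqrt(d_k)   # scaled dot products

    print(f"d_k={d_k:5d}  raw std={dots.std():7.2f}  scaled std={scaled.std():5.2f}")

# The std of the raw dot products grows like sqrt(d_k), pushing softmax toward
# saturation and vanishing gradients; after scaling it stays near 1 for any d_k.
```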