
Implement CLIP Contrastive Loss

Last updated: May 11, 2026

Quick Overview

This question tests whether you can implement a contrastive loss that aligns image and text embeddings. It covers similarity matrices, the symmetric image-text contrastive loss, normalization, temperature scaling, and label construction.

Company: Uber

Role: Machine Learning Engineer

Category: Machine Learning

Difficulty: medium

Interview Round: Technical Screen



Apr 3, 2026, 12:00 AM

Given a minibatch of paired image and text embeddings, implement the symmetric contrastive loss used in CLIP-style image-text representation learning.

You are given:

  • `image_embeddings`: a tensor of shape `(batch_size, embedding_dim)`.
  • `text_embeddings`: a tensor of shape `(batch_size, embedding_dim)`.
  • The `i`-th image corresponds to the `i`-th text.

Compute a similarity matrix between every image embedding and every text embedding. Then compute:

  • Image-to-text loss: cross entropy over each image row, where the correct class for row `i` is text `i`.
  • Text-to-image loss: cross entropy over each text row (equivalently, cross entropy on the transposed similarity matrix), where the correct class for row `i` is image `i`.
  • Final loss: the average of the two losses.
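
Written out, the three quantities above can be expressed as follows. The notation is introduced here for illustration and is not part of the original question: N is the batch size, \hat{u}_i and \hat{v}_j are the L2-normalized image and text embeddings, and \tau is a temperature. Normalizing the embeddings and dividing by \tau are assumptions commonly made in CLIP-style training.

```latex
\begin{align}
s_{ij} &= \frac{\hat{u}_i^{\top} \hat{v}_j}{\tau} \\
\mathcal{L}_{\text{i2t}} &= -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ij})} \\
\mathcal{L}_{\text{t2i}} &= -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ji})} \\
\mathcal{L} &= \tfrac{1}{2} \left( \mathcal{L}_{\text{i2t}} + \mathcal{L}_{\text{t2i}} \right)
\end{align}
```

The only difference between the two directions is the normalizer: the image-to-text term sums over row `i` of the similarity matrix, while the text-to-image term sums over column `i`.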

Implement this loss function and explain any important details such as normalization, temperature scaling, and label construction.
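
One way to approach this is sketched below in plain Python, with nested lists standing in for tensors so that each step is visible. It is a minimal reference under stated assumptions, not a definitive implementation: the helper names (`l2_normalize`, `log_softmax`, `cross_entropy_diagonal`) are hypothetical, the embeddings are L2-normalized so that dot products are cosine similarities, and the temperature is a fixed argument with an assumed default of 0.07 (CLIP-style models typically learn this scale instead).

```python
import math


def l2_normalize(rows, eps=1e-12):
    """Scale each row to unit L2 norm so dot products become cosine similarities."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / max(norm, eps) for x in row])
    return out


def log_softmax(logits):
    """Numerically stable log-softmax of one row (subtract the max before exp)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def cross_entropy_diagonal(logits):
    """Mean cross entropy where the correct class for row i is column i.

    The labels are implicitly [0, 1, ..., n - 1] because pair i sits on the diagonal.
    """
    n = len(logits)
    return -sum(log_softmax(logits[i])[i] for i in range(n)) / n


def clip_contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    """Symmetric image-text contrastive loss for a batch of paired embeddings."""
    if len(image_embeddings) != len(text_embeddings):
        raise ValueError("image and text batches must have the same size")

    img = l2_normalize(image_embeddings)
    txt = l2_normalize(text_embeddings)

    # logits[i][j] = cosine similarity of image i and text j, divided by temperature.
    logits = [
        [sum(a * b for a, b in zip(u, v)) / temperature for v in txt]
        for u in img
    ]
    # Transposing turns text columns into rows for the text-to-image direction.
    logits_t = [list(col) for col in zip(*logits)]

    loss_i2t = cross_entropy_diagonal(logits)    # each image picks its text
    loss_t2i = cross_entropy_diagonal(logits_t)  # each text picks its image
    return 0.5 * (loss_i2t + loss_t2i)
```

A few details worth explaining alongside the code. Normalization makes the loss depend on direction rather than magnitude, so rescaling an embedding does not change the result. A smaller temperature sharpens the softmax and penalizes hard negatives more strongly. The labels are just the row indices, because every other item in the batch acts as a negative. With a tensor library the same steps collapse to a matrix product and two cross-entropy calls; in PyTorch, for example, that is `F.cross_entropy(logits, labels)` and `F.cross_entropy(logits.T, labels)` with `labels = torch.arange(batch_size)`.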

