How do I approach ML System Design interview questions?

ML System Design questions require an understanding of core concepts, plus practice. PracHub provides solutions with explanations to help you master ML System Design interviews.

What difficulty level is this interview question?

This is a medium difficulty ML System Design question, commonly asked during Technical Screen rounds at TikTok.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at TikTok during technical interviews.

Design video captioning under compute limits

This question evaluates expertise in ML system design for multimodal large models, covering deployment under compute and GPU memory constraints as well as large-scale retrieval and processing of video captions and embeddings.

Scenario

You are deploying a multimodal large model that generates captions for videos.

Part A — Deployment under compute / VRAM constraints

The model takes a video as input (frames, optionally with audio) and outputs a text caption.
You must meet latency/throughput goals while staying within tight compute and GPU memory (VRAM) limits.

Prompt: Describe how you would design the end-to-end system (modeling + serving) to reliably deploy this capability under constrained compute/VRAM.
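One way to ground an answer to Part A is a back-of-the-envelope VRAM budget. The sketch below is not a prescribed solution: it assumes a decoder-style model whose memory is dominated by weights plus a key/value cache, and every number and name in it (`ModelSpec`, the 7B-parameter shape, the 10% reserve) is hypothetical. It shows how weight precision changes the batch size that fits, and how uniform frame sampling caps the visual input length.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ModelSpec:
    # All values are hypothetical, chosen only to illustrate the arithmetic.
    n_params: float   # total parameter count
    n_layers: int     # number of transformer layers
    n_kv_heads: int   # key/value heads per layer
    head_dim: int     # dimension of each head


def weight_bytes(spec: ModelSpec, bytes_per_param: float) -> float:
    """Memory for weights, e.g. 2 bytes/param for fp16, 0.5 for 4-bit."""
    return spec.n_params * bytes_per_param


def kv_cache_bytes(spec: ModelSpec, seq_len: int, batch_size: int,
                   bytes_per_elem: int = 2) -> int:
    """KV cache size; the factor 2 covers one key and one value tensor per layer."""
    return (2 * spec.n_layers * spec.n_kv_heads * spec.head_dim
            * seq_len * batch_size * bytes_per_elem)


def max_batch_size(spec: ModelSpec, vram_gib: float, bytes_per_param: float,
                   seq_len: int, reserve_frac: float = 0.1) -> int:
    """Largest batch whose KV cache fits after weights and a safety reserve."""
    budget = vram_gib * GIB * (1 - reserve_frac) - weight_bytes(spec, bytes_per_param)
    if budget <= 0:
        return 0
    return int(budget // kv_cache_bytes(spec, seq_len, 1))


def sample_frame_indices(n_frames: int, max_frames: int) -> list[int]:
    """Uniformly pick at most max_frames frame indices to bound visual tokens."""
    if n_frames <= max_frames:
        return list(range(n_frames))
    step = n_frames / max_frames
    return [int(i * step) for i in range(max_frames)]
```

For example, with a hypothetical `ModelSpec(7e9, 32, 8, 128)` on a 24 GiB GPU and a 2048-token context, this estimate allows a batch of 34 at fp16 weights and 73 at 4-bit weights. It ignores activations and the vision encoder, so a real budget would need profiling on the target hardware.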

Part B — Fast retrieval for brand ads + watermarking

Assume you already have:

A caption for each video (possibly multiple captions per video)
An embedding vector per video (or per caption)

A brand advertiser wants to quickly find videos relevant to a query (text and/or example creative) and then apply a watermark to matched videos.

Prompt: How would you design and optimize the retrieval + processing pipeline to make this search and watermarking fast at scale? Include indexing, filtering/ranking, caching, and system trade-offs.
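The pieces named in the Part B prompt (indexing, filtering, ranking, caching, then watermarking) can be sketched in miniature. This is an illustrative toy, not a production design: the catalog, metadata fields (`region`, `brand_safe`), and function names are all hypothetical, and the brute-force scan stands in for the approximate nearest-neighbor index a system at scale would likely use.

```python
import heapq
import math
from functools import lru_cache

# Hypothetical toy catalog: video_id -> (embedding, metadata).
CATALOG = {
    "v1": ((1.0, 0.0, 0.0), {"region": "US", "brand_safe": True}),
    "v2": ((0.9, 0.1, 0.0), {"region": "US", "brand_safe": False}),
    "v3": ((0.0, 1.0, 0.0), {"region": "US", "brand_safe": True}),
    "v4": ((0.8, 0.2, 0.0), {"region": "EU", "brand_safe": True}),
}


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return tuple(x / norm for x in vec)


# Index time: normalize once so query-time scoring is a plain dot product.
INDEX = {vid: (normalize(emb), meta) for vid, (emb, meta) in CATALOG.items()}


@lru_cache(maxsize=1024)
def search(query_vec, region, k=2):
    """Filter by metadata, score by cosine similarity, return the top-k (score, id).

    query_vec must be a tuple so that repeated queries hit the cache.
    """
    q = normalize(query_vec)
    candidates = (
        (sum(a * b for a, b in zip(q, emb)), vid)
        for vid, (emb, meta) in INDEX.items()
        if meta["region"] == region and meta["brand_safe"]
    )
    return tuple(heapq.nlargest(k, candidates))


def enqueue_watermark_jobs(results, already_done):
    """Return matched video ids that still need a watermark, skipping finished ones."""
    return [vid for _, vid in results if vid not in already_done]
```

Calling `search((1.0, 0.0, 0.0), "US")` ranks `v1` ahead of `v3` and excludes `v2` (not brand safe) and `v4` (wrong region). Tracking already-watermarked ids keeps the expensive video-processing step idempotent, so a repeated query adds no new work.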
