PracHub

Design video captioning under compute limits

Last updated: Jun 21, 2026

Quick Overview

This question evaluates expertise in ML system design for multimodal large models, covering deployment under compute and GPU memory constraints as well as large-scale retrieval and processing of video captions and embeddings.

Company: TikTok

Role: Machine Learning Engineer

Category: ML System Design

Difficulty: medium

Interview Round: Technical Screen

## Scenario

You work on a multimodal team at a large short-video platform. The team has a **multimodal large model** that takes a video (sampled frames, with audio as an optional input) and generates a **text caption** describing the content. You now own the problem of turning this model into a production capability, and then building a feature on top of it.

The work splits into two parts:

- **Part A:** deploy the captioning model so it meets latency/throughput goals while staying inside tight **compute** and **GPU memory (VRAM)** budgets.
- **Part B:** given captions and embeddings already exist, build a fast pipeline that lets a brand advertiser find relevant videos for a query and then watermark the matches at scale.

### Constraints & Assumptions

- The platform ingests a very high volume of new videos continuously; the captioning model must serve both backfill of the existing corpus and a steady stream of new uploads.
- GPU fleet is finite and shared with other workloads, so VRAM per replica and total GPU-hours are hard constraints; you cannot simply scale out infinitely.
- Videos vary widely in length (seconds to many minutes) and resolution.
- For Part B, assume a corpus on the order of hundreds of millions to billions of videos, each with one or more captions and at least one embedding vector.
- Watermarking re-encodes or overlays media; it is meaningfully more expensive than a metadata write.

### Clarifying Questions to Ask

- **Latency mode:** Is captioning needed online (on upload / on request, with a per-video SLO) or is an offline/batch path acceptable, with online reserved for new or priority content?
- **Caption shape:** One short caption per video, multi-sentence, or per-segment captions for long videos? How many languages?
- **Quality bar & eval:** How is caption quality measured and what is the minimum acceptable quality after any compression (quantization/distillation)?
- **Part B query type:** Are advertiser queries text only, or also example creatives (image/video)? What recall vs. latency does the advertiser tooling need?
- **Watermark semantics:** Is the watermark a visible overlay, an invisible/forensic mark, or both? Does it require re-encoding the full video or can it be applied to a derivative/preview?
- **Brand safety:** Must matched videos pass policy/brand-safety checks before a watermark is applied, and who owns that gate?

### Part A: Deployment under compute / VRAM constraints

**Prompt:** Design the end-to-end system (modeling + serving) to reliably deploy video captioning under constrained compute and VRAM. Cover how you reduce the cost of the multimodal input, how you fit the model in memory, how you structure the serving path (online vs. offline/batch), and how you store outputs and monitor the system.

```hint Where to start
The dominant cost in video captioning is usually the *input*, not the text decode. Think about how much of the video the model actually needs to see before you think about the LLM.
```

```hint Fitting the model
List the independent levers that trade quality/latency for VRAM: weight precision (quantization), the KV-cache during decoding, model partitioning across GPUs, and replacing the model itself (distillation / a smaller student, adapters like LoRA).
```

```hint Serving shape
Captioning rarely needs to be strictly online. Consider an offline-first / batch design with an online fallback only for new or priority content; it makes GPU load predictable and SLOs achievable.
```

#### What This Part Should Cover

- **Input cost reduction:** frame sampling / keyframe selection, spatial downsampling, clip-based encoding of long videos, conditional use of the audio branch.
- **Memory-fitting levers:** quantization (8/4-bit), KV-cache control, tensor/pipeline parallelism, distillation, adapters, with the trade-off each makes.
- **Serving architecture:** a justified choice between online, offline/batch, and a hybrid, plus batching/async overlap of vision-encode and text-decode.
- **Storage & observability:** versioned captions and embeddings, plus quality eval and serving health (GPU utilization, OOM rate, queue time, p95).

### Part B: Fast retrieval for brand ads + watermarking

You already have a caption for each video (possibly several) and an embedding vector per video (or per caption). A brand advertiser wants to **quickly find videos relevant to a query** (text and/or an example creative) and then **apply a watermark** to the matched videos.

**Prompt:** Design and optimize the retrieval + processing pipeline so that both the search and the watermarking are fast at scale. Address indexing, filtering/ranking, caching, the watermarking path, and the main system trade-offs and failure modes.

```hint Retrieval structure
At billions of items, exact search is infeasible. Think in two stages: a cheap candidate generator over the whole corpus, then a more expensive re-ranker over a small top-K.
```

```hint Narrowing the search
You have more than embeddings: captions (text) and metadata. Combining approximate vector search with lexical matching and hard metadata prefilters cuts both cost and false positives.
```

```hint Decoupling the heavy work
Watermarking re-encodes media and must not block search latency. Think about pushing it to an async, idempotent job pipeline keyed so the same video isn't processed twice.
```

#### Clarifying Questions for this Part

- Does the advertiser need real-time interactive search, or is a batched campaign-level run acceptable?
- Must the index reflect brand-new videos within minutes (streaming updates), or is a periodic rebuild fine?

#### What This Part Should Cover

- **Two-stage retrieval:** ANN candidate generation (e.g., HNSW or IVF-PQ) followed by heavier re-ranking, with a clear latency/recall rationale.
- **Hybrid search & filtering:** semantic + lexical + metadata prefilters; handling of multiple embedding types and long-video segment-level retrieval.
- **Caching & sharding:** query/result and query-embedding caches; index sharding by language/region.
- **Watermarking at scale:** async idempotent job queue, batching per campaign, pre-generated overlays, CDN/object-store derivatives with access control, and a brand-safety gate before marking.
- **Trade-offs & failure modes:** index freshness vs. cost, embedding-version compatibility between query and index, false positives, segment-to-video mapping.

### What a Strong Answer Covers

These dimensions span both parts:

- **Cost as the first-class constraint:** every design choice is justified against the stated compute/VRAM (Part A) or query/processing-cost (Part B) budget, not just accuracy.
- **Versioning discipline:** captions and embeddings are versioned, and a query in Part B is only compared against an index built with a compatible embedding version.
- **Graceful degradation & observability:** the system has a fallback path under load and exposes the metrics (latency percentiles, GPU/OOM, queue depth, retrieval recall) needed to detect regressions.

### Follow-up Questions

- In Part A, if a 4-bit quantized model drops caption quality below the bar for a minority of videos, how would you detect those cases and route them to a higher-fidelity path?
- In Part B, how do you keep the ANN index fresh as new videos arrive continuously without rebuilding it from scratch, and what staleness would the advertiser actually observe?
- The original interview also asked supporting ML-fundamentals questions (e.g., dropout, normalization differences and how norms are handled at inference, and RL in LLM post-training). Pick one and explain how it connects to deploying or fine-tuning the captioning model.
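
For the "input cost reduction" point in Part A, it can help to put rough numbers on how frame count and resolution drive the visual token count. The sketch below is one possible illustration, not the interviewer's expected method: `sample_frame_indices` and `visual_tokens` are hypothetical helpers, and the ViT-style patch size of 14 is an assumption rather than a property of the model in the question.

```python
def sample_frame_indices(n_frames: int, max_frames: int) -> list[int]:
    """Pick at most `max_frames` evenly spaced frame indices.

    The video is split into equal-width bins and the frame nearest the
    centre of each bin is taken, so short and long videos both stay
    within the same frame budget.
    """
    if n_frames <= 0 or max_frames <= 0:
        return []
    k = min(n_frames, max_frames)
    return [int((i + 0.5) * n_frames / k) for i in range(k)]


def visual_tokens(n_sampled: int, height: int, width: int, patch: int = 14) -> int:
    """Approximate visual tokens for a patch-based encoder.

    Assumes one token per non-overlapping `patch` x `patch` tile and no
    token merging or pooling (both assumptions for illustration).
    """
    return n_sampled * (height // patch) * (width // patch)
```

Under these assumptions, halving each spatial dimension cuts the token count by roughly 4x, while halving the frame budget cuts it by 2x, which is why spatial downsampling and frame sampling are usually discussed together.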
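
For the "memory-fitting levers" in Part A, a back-of-the-envelope VRAM estimate makes the quantization and KV-cache trade-off concrete. This is a minimal sketch under stated assumptions: `DecoderConfig`, the 2 GiB overhead default, and the fp16 KV-cache element size are all illustrative choices, and the formula ignores activations, the vision encoder, and allocator fragmentation.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class DecoderConfig:
    """Hypothetical text-decoder shape; not taken from the question."""
    n_params: float
    n_layers: int
    n_kv_heads: int
    head_dim: int


def weight_bytes(cfg: DecoderConfig, bits_per_weight: int) -> float:
    """Memory for the weights at a given precision (16, 8, or 4 bits)."""
    return cfg.n_params * bits_per_weight / 8


def kv_cache_bytes(cfg: DecoderConfig, seq_len: int, batch_size: int,
                   bytes_per_elem: int = 2) -> int:
    """KV-cache size: one K and one V tensor per layer (the factor 2)."""
    return (2 * cfg.n_layers * cfg.n_kv_heads * cfg.head_dim
            * seq_len * batch_size * bytes_per_elem)


def max_batch_size(cfg: DecoderConfig, vram_gib: float, bits_per_weight: int,
                   seq_len: int, overhead_gib: float = 2.0) -> int:
    """Largest batch whose KV cache fits after weights and fixed overhead."""
    free = vram_gib * GIB - weight_bytes(cfg, bits_per_weight) - overhead_gib * GIB
    per_sequence = kv_cache_bytes(cfg, seq_len, batch_size=1)
    return max(0, int(free // per_sequence))
```

As a usage example, with an assumed `DecoderConfig(7e9, 32, 32, 128)` on a 24 GiB card and 2048-token sequences, `max_batch_size` gives 8 concurrent sequences at 16-bit weights and 18 at 4-bit. The point is that quantization frees VRAM for a larger batch, not that these particular numbers apply to any real model.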
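
The "serving shape" hint in Part A suggests an offline-first design with an online path reserved for new or priority content. One way to picture that is a single priority queue feeding fixed-size GPU batches. The sketch below is an in-memory toy with hypothetical names (`CaptionScheduler`, `ONLINE`, `BACKFILL`); a production system would presumably use a durable queue and per-class capacity limits.

```python
import heapq
import itertools

ONLINE = 0    # new or priority uploads: served first
BACKFILL = 1  # existing corpus: fills whatever batch capacity is left


class CaptionScheduler:
    """Builds captioning batches, draining priority work before backfill."""

    def __init__(self, max_batch: int) -> None:
        self.max_batch = max_batch
        self._heap: list[tuple[int, int, str]] = []
        # Monotonic counter keeps FIFO order within a priority class.
        self._counter = itertools.count()

    def submit(self, video_id: str, priority: int = BACKFILL) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), video_id))

    def next_batch(self) -> list[str]:
        """Pop up to `max_batch` video ids, lowest priority value first."""
        batch: list[str] = []
        while self._heap and len(batch) < self.max_batch:
            _, _, video_id = heapq.heappop(self._heap)
            batch.append(video_id)
        return batch
```

Because backfill only takes the batch slots that priority traffic leaves free, GPU load stays close to constant while new uploads still see short queue times. A real design would also need to address starvation of backfill under sustained priority load.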
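
For Part B, the two-stage retrieval structure with metadata prefilters and the embedding-version check can be sketched in a few lines. This is illustrative only: stage 1 uses an exact dot-product scan as a stand-in for a real ANN index (HNSW or IVF-PQ), the re-ranker is a simple blend of vector similarity and caption term overlap, and the `Video` record, the `search` function, and the 0.5/0.5 weights are all assumptions.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Video:
    video_id: str
    embedding: tuple[float, ...]  # assumed L2-normalised
    caption: str
    language: str
    embedding_version: str


def dot(a, b) -> float:
    return sum(x * y for x, y in zip(a, b))


def search(query_vec, query_text: str, corpus: list[Video], *,
           language: str, embedding_version: str,
           top_k: int = 100, top_n: int = 10) -> list[Video]:
    # Hard prefilter: metadata match, and only vectors produced by the
    # same embedding version as the query are comparable.
    eligible = [v for v in corpus
                if v.language == language
                and v.embedding_version == embedding_version]

    # Stage 1: cheap candidate generation (exact scan standing in for ANN).
    candidates = heapq.nlargest(
        top_k, eligible, key=lambda v: dot(query_vec, v.embedding))

    # Stage 2: heavier re-rank over the small candidate set only.
    query_terms = set(query_text.lower().split())

    def rerank_score(v: Video) -> float:
        caption_terms = set(v.caption.lower().split())
        overlap = len(query_terms & caption_terms) / max(1, len(query_terms))
        return 0.5 * dot(query_vec, v.embedding) + 0.5 * overlap

    return sorted(candidates, key=rerank_score, reverse=True)[:top_n]
```

The design choice worth defending in an answer is the split itself: `top_k` bounds how much work the expensive stage does, so recall is tuned in stage 1 and precision in stage 2.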
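
The "decoupling the heavy work" hint in Part B asks for an async, idempotent watermarking pipeline. A minimal sketch of the idempotency part follows, assuming a job is uniquely identified by video, campaign, and watermark version. `watermark_job_key` and `WatermarkQueue` are hypothetical names, and the in-memory set stands in for whatever durable deduplication store a real queue would use.

```python
import hashlib
import json


def watermark_job_key(video_id: str, campaign_id: str,
                      watermark_version: str) -> str:
    """Deterministic key: the same inputs always map to the same job."""
    raw = json.dumps([video_id, campaign_id, watermark_version])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class WatermarkQueue:
    """In-memory stand-in for a durable, deduplicating job queue."""

    def __init__(self) -> None:
        self._seen: set[str] = set()
        self.pending: list[tuple[str, str, str]] = []

    def enqueue(self, video_id: str, campaign_id: str,
                watermark_version: str, brand_safe: bool) -> bool:
        """Return True if a new job was queued, False if skipped."""
        if not brand_safe:
            # Brand-safety gate runs before any expensive re-encode.
            return False
        key = watermark_job_key(video_id, campaign_id, watermark_version)
        if key in self._seen:
            # Duplicate submission or retry: do not process twice.
            return False
        self._seen.add(key)
        self.pending.append((key, video_id, campaign_id))
        return True
```

Including the watermark version in the key means a changed overlay produces a new job, while retries of the same request collapse into one. Search latency is unaffected because `enqueue` only records work; the re-encode happens in workers that drain `pending`.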

Feb 12, 2026, 12:00 AM
