Design a Multimodal Neural Network

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a candidate's ability to design production-grade multimodal machine learning systems: architecture choices for text and image encoders, cross-modal fusion strategies, training objectives for joint retrieval and classification, robustness to missing or noisy modalities, and scalable, low-latency serving. It is commonly asked in ML system design interviews to assess both conceptual understanding and practical machine learning engineering, and it probes model alignment, evaluation metrics, deployment trade-offs, and domain adaptation strategies.

Company: Amazon

Role: Machine Learning Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Onsite

Design a multimodal neural network that fuses text and images to perform retrieval and classification. Specify encoders for each modality, the fusion strategy (early fusion, late fusion, or cross-attention), training objectives (e.g., contrastive loss, joint classification), handling of missing or noisy modalities, data alignment/augmentation, and evaluation metrics. Discuss scalability, serving latency, and how you would fine-tune on new domains.
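As one illustration of the training objectives named above, the sketch below combines a symmetric contrastive (InfoNCE-style) loss over paired image and text embeddings with a cross-entropy classification loss. It is a minimal pure-Python sketch, not a reference solution: the function names, the temperature of 0.07, and the `cls_weight` of 0.5 are all assumptions chosen for illustration, and a real system would compute this on tensors with an autodiff framework.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (guarding against a zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE: pair i (image i, text i) is the positive,
    # every other item in the batch is a negative.
    img = [l2_normalize(v) for v in img_emb]
    txt = [l2_normalize(v) for v in txt_emb]
    n = len(img)
    logits = [[sum(a * b for a, b in zip(img[i], txt[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text direction: softmax over each row.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: softmax over each column.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)

def cross_entropy(class_logits, labels):
    # Mean negative log-likelihood of the true class.
    n = len(labels)
    return -sum(log_softmax(class_logits[i])[labels[i]] for i in range(n)) / n

def joint_loss(img_emb, txt_emb, class_logits, labels, cls_weight=0.5):
    # Hypothetical weighting: contrastive term plus a scaled classification term.
    return (contrastive_loss(img_emb, txt_emb)
            + cls_weight * cross_entropy(class_logits, labels))
```

The weighting between the two terms is a tuning choice; a candidate would normally discuss how it trades retrieval quality against classification accuracy.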

Sep 6, 2025, 12:00 AM

Design Prompt: Multimodal Text–Image Retrieval and Classification

Context

You are building a production system that uses both text (titles/descriptions/queries) and images to support:

  • Cross-modal retrieval (e.g., text-to-image and image-to-text search)
  • Item classification (e.g., product category or attributes)

Assume you have paired (image, text) examples for training and must serve at scale with low latency.
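To make the retrieval task concrete, here is a minimal sketch of text-to-image search over precomputed image embeddings, together with a Recall@K check. It uses exact (brute-force) cosine similarity as a baseline; at production scale this step would typically be replaced by an approximate nearest-neighbor index. The item ids, the two-dimensional vectors, and the helper names are hypothetical.

```python
import math

def normalize(v):
    # Unit-normalize so that a dot product equals cosine similarity.
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def build_index(embeddings):
    # Precompute normalized vectors once, offline, keyed by item id.
    return {item_id: normalize(v) for item_id, v in embeddings.items()}

def top_k(query_emb, index, k):
    # Exact search: score every item, then keep the k best.
    q = normalize(query_emb)
    scored = [(sum(a * b for a, b in zip(q, v)), item_id)
              for item_id, v in index.items()]
    # Highest score first; item id breaks ties deterministically.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [item_id for _, item_id in scored[:k]]

def recall_at_k(queries, index, k):
    # Fraction of queries whose ground-truth item appears in the top k.
    hits = sum(1 for q, target in queries if target in top_k(q, index, k))
    return hits / len(queries)

# Hypothetical image embeddings and text-query embeddings in a shared space.
index = build_index({"img_a": [1.0, 0.0], "img_b": [0.0, 1.0], "img_c": [0.7, 0.7]})
queries = [([0.9, 0.1], "img_a"), ([0.1, 0.9], "img_b"), ([0.9, 0.1], "img_b")]
print(top_k([0.9, 0.1], index, 2))
print(recall_at_k(queries, index, 1))
```

Because item embeddings are computed offline, only the query encoder runs at request time, which is the main reason a dual-encoder design is attractive under a tight latency budget.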

Requirements

  1. Encoders
    • Specify the encoder architecture for each modality (text and image), embedding dimensionality, and any projection layers.
  2. Fusion strategy
    • Choose and justify early fusion, late fusion, or cross-attention (or a hybrid), and describe how features are combined.
  3. Training objectives
    • Include contrastive alignment for retrieval and a joint classification objective. Detail the loss functions.
  4. Missing or noisy modalities
    • Describe how to train for and handle cases where a modality is missing or low quality at inference time.
  5. Data alignment and augmentation
    • Explain how to align training data across modalities and augment it to improve robustness.
  6. Evaluation metrics
    • Define metrics for retrieval and classification, plus robustness and efficiency metrics.
  7. Scalability and serving latency
    • Discuss training/inference scale-out, approximate nearest-neighbor search, batching, quantization, and latency budgets.
  8. Fine-tuning on new domains
    • Outline approaches for rapid domain adaptation with limited labels.
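For requirements 2 and 4, one possible approach (an assumption, not the only valid answer) is late fusion by averaging whichever modality embeddings are present, combined with modality dropout during training so the model learns to cope with a missing input. The sketch below uses hypothetical helper names and a plain mean; a learned gate or cross-attention layer could take its place.

```python
import random

def fuse(img_emb, txt_emb):
    # Late fusion: average the embeddings of the modalities that are present.
    # None marks a missing modality.
    present = [e for e in (img_emb, txt_emb) if e is not None]
    if not present:
        raise ValueError("at least one modality is required")
    dim = len(present[0])
    return [sum(e[d] for e in present) / len(present) for d in range(dim)]

def modality_dropout(img_emb, txt_emb, p_drop=0.3, rng=random):
    # Training-time augmentation: with probability p_drop, hide exactly one
    # modality (chosen uniformly). Never drops both, and leaves examples that
    # are already missing a modality unchanged.
    if img_emb is not None and txt_emb is not None and rng.random() < p_drop:
        if rng.random() < 0.5:
            return None, txt_emb
        return img_emb, None
    return img_emb, txt_emb

# Usage: apply dropout during training, then fuse what remains.
rng = random.Random(0)
img, txt = modality_dropout([1.0, 3.0], [3.0, 5.0], p_drop=0.3, rng=rng)
fused = fuse(img, txt)
```

At inference time the same `fuse` call handles a request that arrives with only text or only an image, so the serving path needs no special case.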
