Design pipeline using classification and embedding services
Company: Scale AI
Role: Software Engineer
Category: ML System Design
Difficulty: medium
Interview Round: Onsite
You are given two **black-box ML services**:
1. **Classification Service**
   - Input: One or more text documents.
   - Output: A label for each document (e.g., topic or category).
2. **Embedding Service**
   - Input: One or more text documents.
   - Output: A vector embedding (e.g., 768-dim float vector) for each document.
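To make the two contracts concrete, here is a minimal typed sketch of how a client might see them. The method names `classify` and `embed` and the in-memory fakes are assumptions for illustration only; the problem does not specify the services' real APIs.

```python
from typing import List, Protocol, Sequence


class ClassificationService(Protocol):
    """Hypothetical client interface: returns one label per input document."""

    def classify(self, documents: Sequence[str]) -> List[str]: ...


class EmbeddingService(Protocol):
    """Hypothetical client interface: returns one vector per input document."""

    def embed(self, documents: Sequence[str]) -> List[List[float]]: ...


class FakeClassifier:
    """In-memory stand-in for local testing; not the real service."""

    def classify(self, documents: Sequence[str]) -> List[str]:
        # Toy rule so tests are deterministic: label by document length.
        return ["long" if len(d) > 20 else "short" for d in documents]


class FakeEmbedder:
    """In-memory stand-in; the default of 768 mirrors the example dimension."""

    def __init__(self, dim: int = 768) -> None:
        self.dim = dim

    def embed(self, documents: Sequence[str]) -> List[List[float]]:
        # Toy vector: the document length repeated `dim` times.
        return [[float(len(d))] * self.dim for d in documents]
```

Both interfaces accept a sequence of documents, which is what makes request batching possible later in the design.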
You need to design a system that:
- Accepts file uploads from users (each file contains one or more text documents).
- Supports both **single-file** and **bulk** upload (up to **1,000 files** in one request).
- For each document:
  - Computes a classification label using the classification service.
  - Computes an embedding using the embedding service.
- Stores results so they can be queried later (e.g., by user, file, or semantic search).
- Satisfies both:
  - **Low latency** for small/single uploads.
  - **High throughput** for large/bulk uploads.
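One possible way to reconcile the two goals is to route each request by size: small uploads are processed inline, larger ones are queued and acknowledged immediately. The sketch below shows that decision only. The thresholds `SYNC_MAX_FILES` and `SYNC_MAX_BYTES`, the function name `route_upload`, and the choice of HTTP 200 vs. 202 are assumptions, not part of the problem statement.

```python
from dataclasses import dataclass

SYNC_MAX_FILES = 5          # assumed threshold; tune from latency measurements
SYNC_MAX_BYTES = 1_000_000  # assumed threshold for total upload size
BULK_MAX_FILES = 1_000      # upper limit stated in the problem


@dataclass
class UploadDecision:
    mode: str         # "sync" or "async"
    http_status: int  # 200 with inline results, or 202 with a job id to poll


def route_upload(file_count: int, total_bytes: int) -> UploadDecision:
    """Pick the processing path for one upload request."""
    if file_count < 1:
        raise ValueError("at least one file is required")
    if file_count > BULK_MAX_FILES:
        raise ValueError(f"at most {BULK_MAX_FILES} files per request")
    if file_count <= SYNC_MAX_FILES and total_bytes <= SYNC_MAX_BYTES:
        # Small request: process inline and return labels/embeddings directly.
        return UploadDecision(mode="sync", http_status=200)
    # Large request: enqueue the work and let the client poll for status.
    return UploadDecision(mode="async", http_status=202)
```

For example, `route_upload(1, 10_000)` selects the synchronous path, while `route_upload(1000, 5_000_000)` selects the asynchronous one.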
**Task**
Design the end-to-end pipeline and APIs. Specifically address:
1. **API Design**
   - How clients upload files (single and bulk up to 1,000 files).
   - What responses they receive (synchronous vs asynchronous).
2. **Architecture**
   - How you orchestrate calls to the classification and embedding services.
   - How you store raw files, parsed text, labels, and embeddings.
   - How you achieve both low latency and high throughput.
3. **Scalability & Performance**
   - How to handle 1,000-file uploads without running out of memory or violating latency goals.
   - Batching, queuing, and concurrency strategies when talking to the ML services.
4. **Reliability & Observability**
   - Error handling for partial failures (e.g., some files fail to process).
   - Monitoring, logging, and metrics.
Assume you cannot change the internals of the classification and embedding services; you may only call their APIs.
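Because the services can only be called, not modified, batching and concurrency have to live in the caller. The following is a minimal sketch of one approach, assuming both services expose async batch calls: documents are split into fixed-size batches, a semaphore bounds the number of in-flight batches, and a failed batch is recorded per document instead of aborting the whole upload. The names `process_documents`, `batch_size`, and `max_in_flight`, and the result dictionary shape, are hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Sequence

# Assumed shape of an async batch call: list of documents in, list of results out.
BatchFn = Callable[[Sequence[str]], Awaitable[List[Any]]]


async def process_documents(
    docs: Sequence[str],
    classify: BatchFn,
    embed: BatchFn,
    batch_size: int = 32,
    max_in_flight: int = 4,
) -> List[Dict[str, Any]]:
    """Label and embed every document, returning one result per document."""
    semaphore = asyncio.Semaphore(max_in_flight)  # caps concurrent batches
    results: List[Dict[str, Any]] = [{} for _ in docs]

    async def run_batch(start: int, batch: Sequence[str]) -> None:
        async with semaphore:
            try:
                # Call both services for the same batch concurrently.
                labels, vectors = await asyncio.gather(classify(batch), embed(batch))
                for offset, (label, vector) in enumerate(zip(labels, vectors)):
                    results[start + offset] = {
                        "status": "ok", "label": label, "embedding": vector,
                    }
            except Exception as exc:
                # Partial failure: mark only this batch's documents as failed.
                for offset in range(len(batch)):
                    results[start + offset] = {"status": "failed", "error": str(exc)}

    await asyncio.gather(
        *(run_batch(i, docs[i:i + batch_size]) for i in range(0, len(docs), batch_size))
    )
    return results
```

A fuller design would likely add retries with backoff for failed batches and emit per-batch latency and error metrics; those are omitted here to keep the sketch short.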
Quick Answer: This question evaluates a candidate's proficiency in ML system design, covering API design, service orchestration, data storage modeling, scalability strategies, and reliability/observability when integrating black-box classification and embedding services.