System Design: End-to-End Email Spam Detection
Context
Design an end-to-end system that detects and handles spam email at scale. Assume you are building for a large consumer email service with high throughput and strict latency requirements. The design should cover data, ML, serving, experimentation, and operations.
Requirements
- Problem Definition and Labeling
  - Define the objective(s) and action outcomes (e.g., block, quarantine, inbox with banner).
  - Labeling sources and policies.
- Data Sources and Collection
  - Inbound traffic, user reports, honeypots, abuse teams, reputation feeds.
  - Collection, sampling, retention, and governance.
- Feature Engineering
  - Content features (text, URLs, attachments), headers, sender/domain/IP reputation, network/behavioral signals.
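To make the header and content feature categories concrete, here is a minimal sketch that pulls a few signals out of a raw message using Python's standard `email` package. The feature names (`from_domain`, `reply_to_mismatch`, `num_urls`) and the simple URL regex are illustrative assumptions, not a prescribed feature set, and only single-part plain-text bodies are handled.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Deliberately simple URL pattern; an assumption for illustration only.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def domain_of(address):
    """Return the lowercased part after the last '@' (or '' if absent)."""
    if "@" not in address:
        return ""
    return address.rpartition("@")[2].lower()


def extract_features(raw_message):
    """Extract a few hypothetical header/content features from raw RFC 5322 text."""
    msg = message_from_string(raw_message)
    _, from_addr = parseaddr(msg.get("From", ""))
    _, reply_addr = parseaddr(msg.get("Reply-To", ""))
    from_domain = domain_of(from_addr)
    reply_domain = domain_of(reply_addr)

    # Only single-part bodies are inspected in this sketch.
    body = msg.get_payload() if not msg.is_multipart() else ""
    urls = URL_RE.findall(body)

    return {
        "from_domain": from_domain,
        # A Reply-To pointing at a different domain is one commonly used signal.
        "reply_to_mismatch": bool(reply_domain) and reply_domain != from_domain,
        "num_urls": len(urls),
        "subject_length": len(msg.get("Subject", "")),
    }
```

A production pipeline would also need MIME traversal, attachment inspection, and reputation lookups, none of which are shown here.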
- Model Choices and Training
  - Baseline rules, supervised ML models, online learning.
  - Handling class imbalance, feature hashing, model calibration.
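The feature hashing and class imbalance items can be illustrated with a small sketch. It assumes token features, a hypothetical bucket count of 2^18, a signed hashing trick, and inverse-frequency ("balanced") class weights; none of these choices are mandated by the requirements.

```python
import hashlib
from collections import Counter

NUM_BUCKETS = 2 ** 18  # assumed size; trades memory against hash collisions


def hash_features(tokens, num_buckets=NUM_BUCKETS):
    """Map tokens to a sparse {bucket_index: signed_count} vector."""
    vec = {}
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:8], "big") % num_buckets
        # A separate byte picks the sign so colliding tokens tend to cancel.
        sign = 1.0 if digest[8] & 1 == 0 else -1.0
        vec[idx] = vec.get(idx, 0.0) + sign
    return vec


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: n / (len(counts) * k) for cls, k in counts.items()}
```

With 9 ham examples and 1 spam example, `class_weights` gives the spam class a weight of 5.0, so each rare positive contributes more to the training loss. Weights like these change the score distribution, which is one reason calibration appears in the same bullet.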
- Serving Architecture and Constraints
  - Placement in the mail pipeline, APIs, latency/throughput targets, caching, fallbacks.
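One possible shape for the fallback requirement is sketched below: the model call gets a latency budget, and a cheaper rule-based scorer is used if the budget is exceeded or the model fails. The function names, the 50 ms default budget, and the thread-pool approach are all assumptions for illustration; a real mail pipeline would more likely enforce deadlines at the RPC layer.

```python
import concurrent.futures

# Shared worker pool for model calls; the size is an arbitrary assumption.
_EXECUTOR = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def score_with_fallback(model_score, rule_score, message, budget_seconds=0.05):
    """Return (score, source), where source is "model" or "rules_fallback".

    model_score and rule_score are hypothetical callables mapping a message
    to a spam score in [0, 1].
    """
    future = _EXECUTOR.submit(model_score, message)
    try:
        return future.result(timeout=budget_seconds), "model"
    except concurrent.futures.TimeoutError:
        # The budget was exceeded; cancel() only helps if the call has not
        # started yet, so a slow call may still finish in the background.
        future.cancel()
        return rule_score(message), "rules_fallback"
    except Exception:
        # The model raised; degrade to rules rather than delay delivery.
        return rule_score(message), "rules_fallback"
```

Recording the returned `source` alongside each decision makes it possible to monitor how often the fallback path is taken.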
- Thresholding and Calibration
  - Score-to-action mapping, per-segment thresholds, calibration methods.
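A score-to-action mapping with per-segment thresholds might look like the sketch below. It assumes the model emits a calibrated spam probability in [0, 1]; the threshold values and the `known_contact` segment are hypothetical, while the action names follow the outcomes listed under Problem Definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    banner: float      # at or above: deliver to inbox with a warning banner
    quarantine: float  # at or above: move to quarantine / spam folder
    block: float       # at or above: reject outright


# Hypothetical values; in practice these would be tuned on held-out data.
DEFAULT_THRESHOLDS = Thresholds(banner=0.50, quarantine=0.90, block=0.99)
SEGMENT_THRESHOLDS = {
    # Assumed segment: be more conservative for senders the user has emailed.
    "known_contact": Thresholds(banner=0.80, quarantine=0.97, block=0.999),
}


def decide_action(score, segment="default"):
    """Map a calibrated spam probability to an action for the given segment."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability in [0, 1]")
    t = SEGMENT_THRESHOLDS.get(segment, DEFAULT_THRESHOLDS)
    if score >= t.block:
        return "block"
    if score >= t.quarantine:
        return "quarantine"
    if score >= t.banner:
        return "inbox_with_banner"
    return "inbox"
```

With these assumed values, a score of 0.95 yields `quarantine` for the default segment but only `inbox_with_banner` for `known_contact`, which shows how segment thresholds trade recall for fewer false positives where mistakes are costlier.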
- Evaluation Metrics
  - Precision, recall, ROC/PR analysis, and cost-weighted metrics.
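The metrics above can be computed directly from confusion counts. The sketch below treats spam as the positive class (label 1) and uses a hypothetical cost ratio in which a false positive (legitimate mail treated as spam) costs ten times a false negative; the ratio is an assumption to be replaced by product-specific costs.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) with spam = 1 as the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn); 0.0 if undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_cost(fp, fn, n, cost_fp=10.0, cost_fn=1.0):
    """Average misclassification cost per message under assumed unit costs."""
    return (cost_fp * fp + cost_fn * fn) / n
```

For example, with `y_true = [1, 1, 1, 0, 0, 0, 0, 0]` and `y_pred = [1, 1, 0, 1, 0, 0, 0, 0]`, the counts are tp=2, fp=1, fn=1, tn=4, so precision and recall are both 2/3 and the average cost is (10·1 + 1·1) / 8 = 1.375.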
- Abuse/Adversarial Defenses and Feedback Loops
  - Evasion tactics, spoofing defenses, URL/attachment handling, user feedback integration.
- Cold Start, Concept Drift, Retraining Cadence
  - New senders/domains, seasonal drift, automated retraining.
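One common way to turn drift into an automated retraining trigger is the Population Stability Index (PSI) over a binned feature or score distribution. The sketch below is a minimal version under stated assumptions: both histograms share the same bins, and the 0.25 trigger is a widely quoted rule of thumb rather than anything specified here.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two histograms over identical bins.

    PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i), where e_i and a_i are
    bin fractions. eps guards against empty bins.
    """
    if len(expected_counts) != len(actual_counts):
        raise ValueError("histograms must share the same bins")
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    if e_total <= 0 or a_total <= 0:
        raise ValueError("histograms must be non-empty")
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_frac = max(e / e_total, eps)
        a_frac = max(a / a_total, eps)
        score += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return score


def needs_retraining(psi_value, threshold=0.25):
    """Flag drift when PSI reaches an assumed threshold."""
    return psi_value >= threshold
```

Here `expected_counts` would come from the training window and `actual_counts` from recent traffic. Identical distributions give a PSI of 0, and the value grows as the two diverge.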
- Online Experimentation
  - A/B testing, ramp strategies, guardrails.
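For ramped rollouts, a deterministic hash-based assignment keeps each user in the same arm as exposure grows. The following is a sketch only: the experiment name, the 10,000-bucket granularity, and the choice of user ID as the randomization unit are assumptions.

```python
import hashlib

NUM_BUCKETS = 10_000  # assumed granularity: 0.01% per bucket


def bucket(user_id, experiment):
    """Deterministically map (experiment, user) to a bucket in [0, NUM_BUCKETS)."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def in_treatment(user_id, experiment, ramp_percent):
    """True if the user falls inside the current ramp (0 to 100 percent)."""
    if not 0.0 <= ramp_percent <= 100.0:
        raise ValueError("ramp_percent must be between 0 and 100")
    return bucket(user_id, experiment) < ramp_percent * NUM_BUCKETS / 100
```

Because a user's bucket is fixed, anyone in treatment at 1% remains in treatment at 5% and 50%, so ramping up never reshuffles existing users. Including the experiment name in the hash key keeps assignments independent across experiments.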
- Monitoring, Logging, Rollback
  - Real-time and batch monitoring, alerting, safe rollback.
- Privacy and Compliance
  - Data minimization, encryption, regional residency, user controls.
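As one illustration of data minimization in logging, the sketch below keeps only an allow-listed set of fields and replaces the sender address with a keyed pseudonym. The field names and allow-list are hypothetical, and lowercasing the whole address is a simplifying assumption. Key management, rotation, and regional storage are out of scope for this example.

```python
import hashlib
import hmac

# Hypothetical allow-list: anything not named here is dropped before logging.
ALLOWED_FIELDS = ("score", "action", "num_urls")


def pseudonymize(address, key):
    """Return a keyed HMAC-SHA256 pseudonym (hex) for an email address."""
    normalized = address.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def minimize_log_record(record, key):
    """Keep allow-listed fields and swap the raw sender for a pseudonym."""
    out = {k: v for k, v in record.items() if k in ALLOWED_FIELDS}
    if "sender" in record:
        out["sender_pseudonym"] = pseudonymize(record["sender"], key)
    return out
```

A keyed HMAC is used rather than a plain hash so that pseudonyms cannot be reversed by hashing a list of known addresses without access to the key.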