System Design: Hierarchical Multi-Label Classifier for Noisy Taxonomy
Context
You have a catalog of items with hierarchical tags (e.g., Category → Subcategory → Leaf). Tags are:
- Not mutually exclusive (an item can belong to multiple leaves/paths).
- Inconsistent across levels (naming, missing parents, duplicate/overlapping nodes).
Design a production-ready classifier that predicts consistent hierarchical labels for new items, given raw item data (e.g., title, description, images, structured attributes).
Requirements
- Define the label space (multi-label vs. multi-class) and decide whether to predict leaves only or leaves plus all ancestors.
- Propose data cleaning and taxonomy normalization steps (deduplication, synonym mapping, cycle detection, DAG enforcement, multi-parent handling).
- Choose model architecture(s) that capture label dependencies (e.g., top-down, multi-task per level, label-graph models) and explain trade-offs.
- Specify loss functions (e.g., binary cross-entropy) and any hierarchical or constraint-aware losses that enforce parent–child consistency and capture label co-occurrence.
- Define thresholding, calibration, and decoding to turn scores into a valid hierarchical set (e.g., per-label thresholds, hierarchical closure, beam search).
- Handle severe class imbalance and long-tail labels.
- Propose evaluation metrics at leaf and hierarchy levels (micro/macro F1, PR-AUC, hierarchical precision/recall, path metrics) and how to construct validation splits.
- Explain training data requirements and strategies to obtain labels at scale (weak supervision, semi-supervised, active learning, PU-learning).
- Set realistic inference latency/throughput targets and model size constraints, plus optimization tactics.
- Plan monitoring and maintenance: data/label drift, calibration, constraint violations, taxonomy updates, human-in-the-loop.
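
To make the "valid hierarchical set" requirement concrete, here is a minimal sketch of per-label thresholding followed by hierarchical (ancestor) closure over a multi-parent DAG. It is one possible reading of that requirement, not a prescribed solution. The names `decode`, `ancestor_closure`, `PARENTS`, `SCORES`, the default threshold of 0.5, and the toy taxonomy are all hypothetical and chosen only for illustration.

```python
from collections import deque


def ancestor_closure(labels, parents):
    """Return the given labels plus every ancestor reachable via `parents`.

    `parents` maps a label to a list of its direct parents (assumed shape).
    The visited set also guarantees termination if the taxonomy still
    contains a cycle that normalization missed.
    """
    closed = set(labels)
    queue = deque(closed)
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in closed:
                closed.add(parent)
                queue.append(parent)
    return closed


def decode(scores, thresholds, parents, default_threshold=0.5):
    """Keep labels whose score meets their threshold, then add ancestors."""
    picked = {
        label
        for label, score in scores.items()
        if score >= thresholds.get(label, default_threshold)
    }
    return ancestor_closure(picked, parents)


# Hypothetical toy taxonomy: "running_shoes" has two parents.
PARENTS = {
    "running_shoes": ["shoes", "sports"],
    "shoes": ["apparel"],
    "sports": [],
    "apparel": [],
}
# Hypothetical model scores; "shoes" and "sports" fall below threshold
# on their own but are restored by the closure step.
SCORES = {"running_shoes": 0.81, "shoes": 0.40, "sports": 0.30, "apparel": 0.95}

result = decode(SCORES, {"running_shoes": 0.6}, PARENTS)
```

Closure of this kind adds ancestors of confident descendants; the opposite policy (dropping any label whose parent falls below threshold) trades recall for precision, and which one fits depends on the decision made about predicting leaves versus all ancestors.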