Design a reaction-factor prediction system
Company: Google
Role: Machine Learning Engineer
Category: ML System Design
Difficulty: hard
Interview Round: Technical Screen
You are given a dataset of chemical reactions with columns: molecule1_name, molecule2_name, reaction_factor (real-valued). Design an end-to-end approach to predict reaction_factor. Cover:
1) EDA: checks/plots for label distribution, missingness, outliers, duplicates, leakage; normalization of molecule naming (synonyms, casing), and detection of duplicate reactions in reversed order.
2) Feature engineering: converting molecule names to usable representations (lookups to SMILES/InChI, RDKit descriptors, learned embeddings, graph encodings); constructing pairwise interaction features (concatenation, differences, products, attention/cross features); handling unseen molecules and symmetry (A+B ≈ B+A).
3) Data splitting: propose and justify random, grouped-by-molecule, leave-one-molecule-out (cold-start), scaffold-based, or time-based splits; explain how each estimates generalization and prevents leakage.
4) Modeling: baselines (linear/elastic net), tree ensembles (XGBoost/LightGBM), neural models (MLP on descriptors, Transformer on SMILES, GNN/Graph Transformer on molecular graphs); uncertainty estimation and calibration.
5) Training and evaluation: loss/metrics (MAE, RMSE, R²), cross-validation strategy, hyperparameter search, early stopping, error analysis by scaffold/family/frequency.
6) Generalization and robustness: standardization, regularization, data augmentation (e.g., SMILES enumeration), pretraining on large molecular corpora, semi-supervised learning, ensembling, adversarial validation, domain shift checks; design of ablations and offline comparisons against a baseline model.
7) Production: feature pipelines, reproducibility, monitoring, retraining triggers, and scientific safety considerations.
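Items 1 and 2 both depend on treating A+B and B+A as the same reaction. The sketch below shows one way that could look, using only the standard library. The helper names (normalize_name, canonical_pair, merge_duplicates, symmetric_features) are hypothetical, synonym resolution is left out because it would need an external lookup (for example, names mapped to canonical SMILES), and averaging the labels of duplicate pairs is an assumption: conflicting labels may deserve manual inspection first.

```python
from collections import defaultdict
from statistics import mean


def normalize_name(name):
    """Casefold and collapse whitespace. Synonyms are NOT resolved here."""
    return " ".join(name.strip().casefold().split())


def canonical_pair(a, b):
    """Return an order-independent key so (A, B) and (B, A) collide."""
    return tuple(sorted((normalize_name(a), normalize_name(b))))


def merge_duplicates(rows):
    """Group rows by canonical pair and average their labels.

    rows: iterable of (molecule1_name, molecule2_name, reaction_factor).
    Averaging is an illustrative choice, not the only reasonable one.
    """
    grouped = defaultdict(list)
    for a, b, y in rows:
        grouped[canonical_pair(a, b)].append(y)
    return {pair: mean(ys) for pair, ys in grouped.items()}


def symmetric_features(x, y):
    """Combine two per-molecule vectors into swap-invariant pair features.

    Element-wise sum, absolute difference, and product all give the same
    result for (x, y) and (y, x).
    """
    sums = [xi + yi for xi, yi in zip(x, y)]
    abs_diffs = [abs(xi - yi) for xi, yi in zip(x, y)]
    products = [xi * yi for xi, yi in zip(x, y)]
    return sums + abs_diffs + products


rows = [
    ("Water", "Ethanol", 1.0),
    ("ethanol ", "WATER", 3.0),  # same reaction, reversed order and casing
    ("A", "B", 5.0),
]
merged = merge_duplicates(rows)
```

With these toy rows, the two water/ethanol entries collapse into a single key. Plain concatenation of the two molecule vectors would not be swap-invariant, so a model using it would need order augmentation (training on both orders) instead.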
Quick Answer: This question evaluates competency in designing end-to-end machine learning systems for molecular pair regression tasks, including data quality and normalization, molecular representation and feature engineering, model selection, evaluation metrics, uncertainty estimation, and production considerations.
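For the grouped-by-molecule and cold-start splits in item 3, a minimal sketch is shown here, assuming molecule names have already been normalized. The function name cold_start_split, the holdout_fraction parameter, and the three-way bucketing are illustrative choices rather than a standard API. The idea is to hold out a set of molecules, train only on pairs where both molecules are seen, and report test error separately for pairs with one unseen molecule and pairs with two.

```python
import random


def cold_start_split(pairs, holdout_fraction=0.2, seed=0):
    """Split pair indices by molecule so held-out molecules never reach training.

    pairs: list of (molecule1, molecule2) tuples with normalized names.
    Returns (held_out, train_idx, test_one_unseen_idx, test_both_unseen_idx).
    """
    # Sort before shuffling so the split is reproducible for a given seed.
    molecules = sorted({m for a, b in pairs for m in (a, b)})
    rng = random.Random(seed)
    rng.shuffle(molecules)

    n_holdout = max(1, round(len(molecules) * holdout_fraction))
    held_out = set(molecules[:n_holdout])

    # Bucket 0: both molecules seen, 1: one unseen, 2: both unseen.
    buckets = ([], [], [])
    for i, (a, b) in enumerate(pairs):
        n_unseen = int(a in held_out) + int(b in held_out)
        buckets[n_unseen].append(i)

    train_idx, test_one, test_both = buckets
    return held_out, train_idx, test_one, test_both


pairs = [
    ("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
    ("d", "e"), ("e", "f"), ("a", "f"), ("b", "e"),
]
held_out, train_idx, test_one, test_both = cold_start_split(
    pairs, holdout_fraction=0.34, seed=1
)
```

A random split over rows would let the same molecule appear in both train and test, which tends to overstate performance on new chemistry. Keeping the one-unseen and both-unseen buckets separate makes it visible how quickly error grows as the pair moves further from the training distribution.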