End-to-End System Design: Predicting a Reaction Factor from Molecule Pairs
Context and goal
- You have a tabular dataset with columns:
  - molecule1_name (string)
  - molecule2_name (string)
  - reaction_factor (real-valued target)
- Task: Design an end-to-end ML approach to predict reaction_factor for a pair of molecules. Assume you can query public cheminformatics resources (e.g., to map names to SMILES/InChI) and use standard cheminformatics tooling.
Explicit assumptions (state any others you need)
- reaction_factor is a continuous scalar (e.g., yield, rate, affinity) with potential skew and outliers.
- The order of molecules is not expected to change the physics significantly (A+B ≈ B+A), but the dataset may contain both orders.
- No reaction conditions are provided; if present, treat them as additional features but avoid leakage.
Requirements
- Exploratory Data Analysis (EDA)
  - Label checks: distribution and candidate transformations (e.g., log for skew); outliers; replicate consistency.
  - Data quality: missingness; duplicates; reversed-order duplicates (A+B vs B+A); potential leakage.
  - Name normalization: casing, synonyms, canonical identifiers; detect inconsistent naming and collisions (see the normalization sketch after this list).
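A minimal normalization sketch, assuming RDKit and PubChem's PUG REST name-resolution endpoint are available; the helper names (`name_to_smiles`, `canonicalize`, `pair_key`) are illustrative, not part of the dataset spec.

```python
# Sketch: resolve raw names to canonical identifiers so synonyms and casing
# variants collapse to one molecule, and build an order-independent pair key
# to surface reversed-order (A+B vs B+A) duplicates.
import requests
from rdkit import Chem

# PubChem PUG REST; an assumption here, any name-to-structure service works.
PUG_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{}/property/CanonicalSMILES/TXT"

def name_to_smiles(name: str) -> str | None:
    """Resolve a molecule name to SMILES; None signals a failed lookup."""
    resp = requests.get(PUG_URL.format(name.strip()), timeout=10)
    if resp.status_code != 200:
        return None
    return resp.text.strip().splitlines()[0]

def canonicalize(smiles: str) -> tuple[str, str] | None:
    """Return (canonical SMILES, InChIKey); the key is the collision check."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol), Chem.MolToInchiKey(mol)

def pair_key(key_a: str, key_b: str) -> str:
    """Order-independent pair key: A+B and B+A map to the same key."""
    return "|".join(sorted((key_a, key_b)))
```

Two different names resolving to the same InChIKey is a synonym to merge; one name resolving to different structures across queries is a collision to flag for manual review.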
- Feature engineering
  - Map names to molecular representations (SMILES/InChI); compute descriptors (RDKit), fingerprints, embeddings, or graph encodings.
  - Construct pairwise features: concatenation, symmetric aggregations (sum/max/abs-diff/product), cross/attention features.
  - Handle unseen molecules; enforce symmetry (A+B ≈ B+A) in the design (a symmetric featurization sketch follows this list).
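A featurization sketch, assuming RDKit Morgan fingerprints; any per-molecule vector (descriptors, learned embeddings) plugs into the same symmetric aggregation, and the function names are illustrative.

```python
# Sketch: order-invariant pair features, so phi(A, B) == phi(B, A) by
# construction. Sum, abs-diff, and elementwise product are all symmetric.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_fp(smiles: str, n_bits: int = 2048) -> np.ndarray:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"unparseable SMILES: {smiles}")
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    arr = np.zeros(n_bits, dtype=np.float32)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

def pair_features(smiles_a: str, smiles_b: str) -> np.ndarray:
    fa, fb = morgan_fp(smiles_a), morgan_fp(smiles_b)
    return np.concatenate([fa + fb, np.abs(fa - fb), fa * fb])
```

Structure-derived features also handle unseen molecules gracefully: fingerprints are computed from the SMILES alone, so no per-molecule lookup table is required at inference time.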
- Data splitting
  - Propose and justify: random split; grouped-by-molecule split; leave-one-molecule-out (cold-start); scaffold-based split; time-based split (a cold-start split sketch follows this list).
  - Explain what each split estimates and how it mitigates leakage.
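A sketch of the grouped-by-molecule (cold-start) split, assuming a pandas DataFrame with the three columns above; the bucket names are illustrative.

```python
# Sketch: hold out whole molecules, not rows. Pairs mixing a seen and an
# unseen molecule go into a separate bucket so the test set measures strict
# cold-start performance (both molecules unseen at training time).
import numpy as np
import pandas as pd

def cold_start_split(df: pd.DataFrame, test_frac: float = 0.2, seed: int = 0):
    rng = np.random.default_rng(seed)
    mols = pd.unique(df[["molecule1_name", "molecule2_name"]].to_numpy().ravel())
    held_out = set(rng.choice(mols, size=int(test_frac * len(mols)), replace=False))
    a = df["molecule1_name"].isin(held_out)
    b = df["molecule2_name"].isin(held_out)
    return df[~a & ~b], df[a & b], df[a ^ b]  # train, cold-start test, mixed
```

A random row split, by contrast, lets the same molecule appear on both sides and therefore estimates in-distribution interpolation only.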
- Modeling
  - Baselines: linear/ridge/elastic net.
  - Tree ensembles: XGBoost/LightGBM/CatBoost.
  - Neural models: MLP on descriptors; Transformer on SMILES; GNN/Graph Transformer on molecular graphs with a pairwise architecture.
  - Uncertainty estimation and calibration strategies (a seed-ensemble sketch follows this list).
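A baseline sketch with a simple seed ensemble for uncertainty, assuming scikit-learn; XGBoost/LightGBM/CatBoost drop in the same way, and the ensemble spread is only a rough uncertainty proxy that still needs calibration against held-out residuals.

```python
# Sketch: tree-ensemble baseline plus a seed ensemble; the spread across
# members serves as a rough, uncalibrated uncertainty estimate.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

def fit_seed_ensemble(X_train, y_train, n_members: int = 5):
    return [
        HistGradientBoostingRegressor(random_state=seed).fit(X_train, y_train)
        for seed in range(n_members)
    ]

def predict_with_uncertainty(members, X):
    preds = np.stack([m.predict(X) for m in members])
    return preds.mean(axis=0), preds.std(axis=0)  # point estimate, spread
```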
- Training and evaluation
  - Loss and metrics: MAE, RMSE, R²; robust losses (e.g., Huber) for heavy-tailed labels.
  - Cross-validation strategy aligned to your split choices; hyperparameter search; early stopping (a grouped-CV sketch follows this list).
  - Error analysis: by scaffold/family/frequency and other covariates.
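An evaluation sketch aligned to the split choices, assuming scikit-learn and numpy arrays; grouping by the order-independent pair key from the EDA sketch keeps duplicates and A+B/B+A reversals inside one fold, while molecule-level leakage still needs the cold-start split above.

```python
# Sketch: grouped cross-validation with the metrics named above. `groups`
# is one order-independent pair key per row (e.g., from pair_key above).
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GroupKFold

def grouped_cv_scores(model, X, y, groups, n_splits: int = 5):
    scores = []
    for tr, te in GroupKFold(n_splits=n_splits).split(X, y, groups):
        model.fit(X[tr], y[tr])
        pred = model.predict(X[te])
        scores.append({
            "MAE": mean_absolute_error(y[te], pred),
            "RMSE": mean_squared_error(y[te], pred) ** 0.5,
            "R2": r2_score(y[te], pred),
        })
    return scores
```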
- Generalization and robustness
  - Standardization, regularization.
  - Data augmentation (e.g., SMILES enumeration), pretraining on large corpora, semi-supervised learning, ensembling.
  - Adversarial validation, domain-shift checks (a sketch follows this list).
  - Ablations and offline A/B comparisons.
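An adversarial-validation sketch for the domain-shift check, assuming scikit-learn; the classifier choice and fold count are arbitrary.

```python
# Sketch: train a classifier to distinguish train rows from test rows.
# Cross-validated AUC near 0.5 means the sets are indistinguishable;
# AUC near 1.0 flags a distribution shift the evaluation must account for.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def adversarial_validation_auc(X_train, X_test) -> float:
    X = np.vstack([X_train, X_test])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
```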
- Production considerations
  - Feature pipelines, caching, reproducibility/versioning (a caching sketch follows this list).
  - Monitoring (data/label drift, uncertainty), retraining triggers.
  - Scientific safety checks and guardrails.
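A caching sketch for the serving-side feature pipeline, reusing the illustrative `canonicalize` and `morgan_fp` helpers from earlier; a real deployment would swap the in-memory dict for a persistent, versioned store.

```python
# Sketch: cache per-molecule features keyed by InChIKey so synonyms and
# re-resolved names hit the same entry and descriptors are computed once.
import numpy as np

_FEATURE_CACHE: dict[str, np.ndarray] = {}

def cached_features(smiles: str) -> np.ndarray:
    resolved = canonicalize(smiles)  # (canonical SMILES, InChIKey) or None
    if resolved is None:
        raise ValueError(f"unparseable SMILES: {smiles}")
    canonical_smiles, inchikey = resolved
    if inchikey not in _FEATURE_CACHE:
        _FEATURE_CACHE[inchikey] = morgan_fp(canonical_smiles)
    return _FEATURE_CACHE[inchikey]
```

Keying on InChIKey rather than the raw name also keeps cache hits stable when the name-resolution service changes its synonym handling.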