PracHub

Design a reaction-factor prediction system

Last updated: Mar 29, 2026

Quick Overview

This question evaluates competency in designing end-to-end machine learning systems for molecular pair regression tasks, including data quality and normalization, molecular representation and feature engineering, model selection, evaluation metrics, uncertainty estimation, and production considerations.

Company: Google

Role: Machine Learning Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Technical Screen

You are given a dataset of chemical reactions with columns: molecule1_name, molecule2_name, reaction_factor (real-valued). Design an end-to-end approach to predict reaction_factor. Cover:

  1. EDA: checks/plots for label distribution, missingness, outliers, duplicates, leakage; normalization of molecule naming (synonyms, casing), and detection of duplicate reactions in reversed order.
  2. Feature engineering: converting molecule names to usable representations (lookups to SMILES/InChI, RDKit descriptors, learned embeddings, graph encodings); constructing pairwise interaction features (concatenation, differences, products, attention/cross features); handling unseen molecules and symmetry (A+B ≈ B+A).
  3. Data splitting: propose and justify random, grouped-by-molecule, leave-one-molecule-out (cold-start), scaffold-based, or time-based splits; explain how each estimates generalization and prevents leakage.
  4. Modeling: baselines (linear/elastic net), tree ensembles (XGBoost/LightGBM), neural models (MLP on descriptors, Transformer on SMILES, GNN/Graph Transformer on molecular graphs); uncertainty estimation and calibration.
  5. Training and evaluation: loss/metrics (MAE, RMSE, R²), cross-validation strategy, hyperparameter search, early stopping, error analysis by scaffold/family/frequency.
  6. Generalization and robustness: standardization, regularization, data augmentation (e.g., SMILES enumeration), pretraining on large molecular corpora, semi-supervised learning, ensembling, adversarial validation, domain shift checks; design of ablations and offline A/B comparisons.
  7. Production: feature pipelines, reproducibility, monitoring, retraining triggers, and scientific safety considerations.


Sep 6, 2025, 12:00 AM

End-to-End System Design: Predicting a Reaction Factor from Molecule Pairs

Context and goal

  • You have a tabular dataset with columns:
    • molecule1_name (string)
    • molecule2_name (string)
    • reaction_factor (real-valued target)
  • Task: Design an end-to-end ML approach to predict reaction_factor for a pair of molecules. Assume you can query public cheminformatics resources (e.g., to map names to SMILES/InChI) and use standard cheminformatics tooling.
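
Before any name-to-structure lookup, raw names usually need a light cleanup so that trivially different spellings resolve to the same key. The following is a minimal sketch of that step, not a definitive implementation. The function normalize_name and the SYNONYMS table are hypothetical and for illustration only; in practice the synonym mapping would come from whichever name-resolution resource is available. Note that case folding is an assumption that can be wrong for names where case carries meaning (for example stereo descriptors), so it should be validated against the data.

```python
import re
import unicodedata

# Hypothetical synonym table (illustrative entries only). A real pipeline
# would populate this from a name-resolution lookup, not hard-code it.
SYNONYMS = {
    "etoh": "ethanol",
    "ethyl alcohol": "ethanol",
    "meoh": "methanol",
}


def normalize_name(raw: str) -> str:
    """Return a canonical lookup key for a raw molecule name."""
    # Unify Unicode variants (e.g., full-width characters) first.
    text = unicodedata.normalize("NFKC", raw)
    # Remove surrounding whitespace and ignore case differences.
    text = text.strip().casefold()
    # Collapse internal runs of whitespace to a single space.
    text = re.sub(r"\s+", " ", text)
    # Map known synonyms to one preferred name; unknown names pass through.
    return SYNONYMS.get(text, text)


if __name__ == "__main__":
    for name in ["  Ethyl   Alcohol ", "EtOH", "Benzene"]:
        print(repr(name), "->", normalize_name(name))
```

Names that pass through unchanged are candidates for a manual or database-backed review, since they may be unresolved synonyms or genuine new molecules.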

Explicit assumptions (state any others you need)

  • reaction_factor is a continuous scalar (e.g., yield, rate, affinity) with potential skew and outliers.
  • The order of molecules is not expected to change the physics significantly (A+B ≈ B+A), but the dataset may contain both orders.
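  • No reaction conditions (e.g., temperature, solvent) are assumed to be provided; if any are present, treat them as additional features, but only when they are known before the reaction is run, so that no feature leaks the outcome.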
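
Under the assumption that order does not matter (A+B ≈ B+A), one way to find reversed-order duplicates is to key each row by an order-independent pair and group the labels. This is a minimal sketch; pair_key and group_replicates are hypothetical helper names, the rows are made-up illustrative data, and the names are assumed to be already normalized. Taking the median of replicates is one possible aggregation choice, not the only one.

```python
from collections import defaultdict
from statistics import median


def pair_key(mol_a: str, mol_b: str) -> tuple:
    """Order-independent key: (A, B) and (B, A) map to the same tuple."""
    return tuple(sorted((mol_a, mol_b)))


def group_replicates(rows):
    """Group labels by unordered pair.

    rows: iterable of (molecule1_name, molecule2_name, reaction_factor).
    Returns a dict mapping pair_key -> list of observed labels.
    """
    groups = defaultdict(list)
    for mol_1, mol_2, label in rows:
        groups[pair_key(mol_1, mol_2)].append(label)
    return dict(groups)


# Illustrative rows (made-up values, names assumed already normalized).
rows = [
    ("ethanol", "acetic acid", 0.75),
    ("acetic acid", "ethanol", 0.25),  # same pair, reversed order
    ("benzene", "toluene", 0.5),
]

groups = group_replicates(rows)
# Pairs observed more than once (in either order) are replicates to inspect.
replicated = {key: labels for key, labels in groups.items() if len(labels) > 1}
# One label per unordered pair; the spread of replicates hints at label noise.
aggregated = {key: median(labels) for key, labels in groups.items()}
```

Grouping before splitting also matters for evaluation: if (A, B) lands in training and (B, A) in test, the test score is inflated by a near-copy of a training row.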

Requirements

  1. Exploratory Data Analysis (EDA)
    • Label checks: distribution/transformations; outliers; replicate consistency.
    • Data quality: missingness; duplicates; reversed-order duplicates; potential leakage.
    • Name normalization: casing, synonyms, canonical identifiers; detect inconsistent naming and collisions.
  2. Feature engineering
    • Map names to molecular representations (SMILES/InChI), compute descriptors (RDKit), fingerprints, embeddings, or graph encodings.
    • Construct pairwise features: concatenation, symmetric aggregations (sum/max/abs-diff/product), cross/attention features.
    • Handle unseen molecules; enforce symmetry (A+B ≈ B+A) in the design.
  3. Data splitting
    • Propose and justify: random split; grouped-by-molecule split; leave-one-molecule-out (cold-start); scaffold-based split; time-based split.
    • Explain what each split estimates and how it mitigates leakage.
  4. Modeling
    • Baselines: linear/ridge/elastic net.
    • Tree ensembles: XGBoost/LightGBM/CatBoost.
    • Neural models: MLP on descriptors; Transformer on SMILES; GNN/Graph Transformer on molecular graphs with a pairwise architecture.
    • Uncertainty estimation and calibration strategies.
  5. Training and evaluation
    • Loss and metrics: MAE, RMSE, R²; robust losses.
    • Cross-validation strategy aligned to your split choices; hyperparameter search; early stopping.
    • Error analysis: by scaffold/family/frequency and other covariates.
  6. Generalization and robustness
    • Standardization, regularization.
    • Data augmentation (e.g., SMILES enumeration), pretraining on large corpora, semi-supervised learning, ensembling.
    • Adversarial validation, domain-shift checks.
    • Ablations and offline A/B comparisons.
  7. Production considerations
    • Feature pipelines, caching, reproducibility/versioning.
    • Monitoring (data/label drift, uncertainty), retraining triggers.
    • Scientific safety checks and guardrails.
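
The symmetric aggregations listed under feature engineering can be sketched as follows. This is an illustrative example under stated assumptions: each molecule is assumed to be already encoded as a fixed-length numeric vector (for instance descriptors or a fingerprint), and symmetric_pair_features is a hypothetical helper name. Because every component is invariant to swapping the inputs, a model trained on these features gives the same prediction for A+B and B+A by construction.

```python
def symmetric_pair_features(vec_a, vec_b):
    """Combine two per-molecule vectors into order-invariant pair features.

    Returns the concatenation of element-wise sum, absolute difference,
    product, and maximum. Each is unchanged if vec_a and vec_b are swapped.
    """
    if len(vec_a) != len(vec_b):
        raise ValueError("both molecule vectors must have the same length")
    total = [a + b for a, b in zip(vec_a, vec_b)]
    abs_diff = [abs(a - b) for a, b in zip(vec_a, vec_b)]
    product = [a * b for a, b in zip(vec_a, vec_b)]
    maximum = [max(a, b) for a, b in zip(vec_a, vec_b)]
    return total + abs_diff + product + maximum


# Made-up three-dimensional vectors standing in for molecule descriptors.
mol_a = [1.0, 0.0, 2.0]
mol_b = [0.5, 3.0, 2.0]

features_ab = symmetric_pair_features(mol_a, mol_b)
features_ba = symmetric_pair_features(mol_b, mol_a)
```

Plain concatenation, by contrast, is order-dependent; if it is used, a common workaround is to train on both orders or to sort the pair first.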
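
A cold-start (molecule-held-out) split from the data splitting requirement can be sketched in a few lines. This is one possible reading, stated as an assumption: a random subset of molecules is held out, every pair that contains a held-out molecule goes to the test set, and training uses only pairs in which neither molecule is held out. The function name cold_start_split and its parameters are hypothetical.

```python
import random


def cold_start_split(pairs, holdout_fraction=0.2, seed=0):
    """Split pairs so that held-out molecules never appear in training.

    pairs: list of tuples whose first two fields are molecule identifiers.
    Returns (train_pairs, test_pairs, held_out_molecules).
    """
    # Sort for determinism before shuffling with a seeded generator.
    molecules = sorted({mol for a, b, *_ in pairs for mol in (a, b)})
    rng = random.Random(seed)
    rng.shuffle(molecules)
    n_holdout = max(1, round(holdout_fraction * len(molecules)))
    held_out = set(molecules[:n_holdout])
    # Training pairs contain no held-out molecule at all.
    train = [p for p in pairs if p[0] not in held_out and p[1] not in held_out]
    # Test pairs contain at least one molecule unseen during training.
    test = [p for p in pairs if p[0] in held_out or p[1] in held_out]
    return train, test, held_out


# Illustrative pairs with made-up labels.
pairs = [
    ("a", "b", 1.0),
    ("a", "c", 2.0),
    ("b", "c", 3.0),
    ("d", "e", 4.0),
    ("c", "d", 5.0),
]
train_pairs, test_pairs, held_out = cold_start_split(pairs)
```

A random row-level split estimates performance on new pairs of already-seen molecules; this split instead estimates performance when at least one molecule is new, which is usually the harder and more realistic case.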
