Design a DNA-sequence optimization loop
Company: Lila
Role: Machine Learning Engineer
Category: ML System Design
Difficulty: medium
Interview Round: Onsite
You are building an ML-driven platform to **optimize DNA sequences** (e.g., a promoter, an enhancer, or a codon-optimized gene) for a target lab-measured property (e.g., expression level, binding strength, stability).
You have:
- A **robotic wet-lab** that can synthesize and assay a *batch* of candidate sequences per day.
- Historical data: `(sequence, assay_result, metadata)` where assay results are **noisy** and may vary by batch.
- A **sequence model** (could be a Transformer/LLM-style model) that can generate or score sequences.
- Hard constraints (examples): GC content range, forbidden motifs, max homopolymer length, sequence length bounds.
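
To make the hard constraints concrete, here is a minimal sketch of a feasibility check that could sit in front of the wet lab. The thresholds, the `Constraints` class, and the `violations` helper are illustrative assumptions, not values given in the problem. The EcoRI site `GAATTC` stands in for whatever forbidden motifs the real assay would require.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Constraints:
    # All defaults are hypothetical placeholders for illustration.
    min_len: int = 20
    max_len: int = 200
    gc_min: float = 0.40
    gc_max: float = 0.60
    max_homopolymer: int = 5
    forbidden_motifs: tuple = ("GAATTC",)  # EcoRI recognition site


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (seq must be non-empty)."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of a single repeated base."""
    return max(len(list(run)) for _, run in groupby(seq))


def violations(seq: str, c: Constraints) -> list:
    """Return the names of all violated constraints (empty list = feasible)."""
    seq = seq.upper()
    if not seq or set(seq) - set("ACGT"):
        return ["alphabet"]
    problems = []
    if not c.min_len <= len(seq) <= c.max_len:
        problems.append("length")
    if not c.gc_min <= gc_content(seq) <= c.gc_max:
        problems.append("gc_content")
    if longest_homopolymer(seq) > c.max_homopolymer:
        problems.append("homopolymer")
    if any(motif in seq for motif in c.forbidden_motifs):
        problems.append("forbidden_motif")
    return problems
```

Returning the list of violated constraints, rather than a single boolean, makes it easier to log why candidates are rejected and to spot a generator that keeps failing the same check.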
Design an end-to-end system that repeatedly proposes sequences, runs experiments, and learns from results.
Address:
1. How you represent sequences and incorporate constraints.
2. How you generate candidate sequences (search / Bayesian optimization / evolutionary / RL / LLM prompting, etc.).
3. How you balance **exploration vs. exploitation** and handle noisy measurements.
4. How you choose a **batch** of sequences each round (not just one).
5. How you evaluate progress and decide when to stop.
6. Key failure modes (mode collapse, assay drift, data leakage, overfitting to the simulator/predictor) and mitigations.
7. What you would log/monitor in production.
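
As one possible way to approach points 3 and 4 together, the sketch below greedily picks a batch by upper confidence bound (UCB) while enforcing a minimum Hamming distance between selected sequences. It is a simple heuristic under stated assumptions, not a prescribed answer: `mean` and `std` are assumed to come from some surrogate model (for example, an ensemble), and `beta`, `min_distance`, and the function names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Mismatches between two sequences; length difference counts as mismatches."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def select_batch(candidates, mean, std, batch_size, beta=1.0, min_distance=2):
    """Greedy diverse batch selection.

    Ranks candidates by UCB = mean + beta * std, then walks down the ranking
    and keeps a candidate only if it is at least `min_distance` away from
    every sequence already chosen.
    """
    ucb = [m + beta * s for m, s in zip(mean, std)]
    order = sorted(range(len(candidates)), key=lambda i: ucb[i], reverse=True)
    chosen = []
    for i in order:
        if len(chosen) == batch_size:
            break
        if all(hamming(candidates[i], candidates[j]) >= min_distance for j in chosen):
            chosen.append(i)
    return [candidates[i] for i in chosen]


# Example: a larger beta shifts the batch toward uncertain candidates.
pool = ["AAAA", "AAAT", "CCGG", "TTTT"]
pred_mean = [1.0, 0.9, 0.5, 0.2]
pred_std = [0.1, 0.1, 0.3, 0.5]
exploit_batch = select_batch(pool, pred_mean, pred_std, batch_size=2, beta=1.0)
explore_batch = select_batch(pool, pred_mean, pred_std, batch_size=2, beta=3.0)
```

With `beta=1.0`, the near-duplicate `AAAT` is skipped in favor of a more distant sequence, which is the kind of diversity pressure that helps counter mode collapse. Raising `beta` trades predicted performance for information gain. The batch can come back smaller than `batch_size` if too few candidates satisfy the distance rule.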
Quick Answer: This question evaluates a candidate's ability to design an end-to-end ML-driven experimental optimization loop for DNA sequence engineering, including sequence representation, constraint enforcement, candidate generation, batch experimental design, and learning from noisy assay measurements.