You are building an ML-driven platform to optimize DNA sequences (e.g., a promoter, an enhancer, or a codon-optimized gene) for a target lab-measured property (e.g., expression level, binding strength, stability).
You have:
- A robotic wet-lab that can synthesize/run an assay on a batch of candidate sequences per day.
- Historical data: (sequence, assay_result, metadata) where assay results are noisy and may vary by batch.
- A sequence model (could be a Transformer/LLM-style model) that can generate or score sequences.
- Hard constraints (examples): GC content range, forbidden motifs, max homopolymer length, sequence length bounds.
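For concreteness, the example hard constraints above could be expressed as a simple feasibility check. This is only a minimal sketch: the thresholds, the motif list, and the names `Constraints`, `gc_content`, `longest_homopolymer`, and `violations` are illustrative assumptions, not part of the problem statement.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraints:
    # All default values are hypothetical placeholders chosen for illustration.
    gc_min: float = 0.40
    gc_max: float = 0.60
    forbidden_motifs: tuple = ("GAATTC", "GGATCC")  # EcoRI and BamHI sites
    max_homopolymer: int = 5
    min_len: int = 50
    max_len: int = 200


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of a single repeated base."""
    return max((len(m.group(0)) for m in re.finditer(r"(.)\1*", seq)), default=0)


def violations(seq: str, c: Constraints = Constraints()) -> list:
    """Return the names of violated constraints; an empty list means feasible."""
    seq = seq.upper()
    found = []
    if not c.min_len <= len(seq) <= c.max_len:
        found.append("length")
    if set(seq) - set("ACGT"):
        found.append("alphabet")
    if not c.gc_min <= gc_content(seq) <= c.gc_max:
        found.append("gc_content")
    for motif in c.forbidden_motifs:
        if motif in seq:
            found.append("forbidden_motif:" + motif)
    if longest_homopolymer(seq) > c.max_homopolymer:
        found.append("homopolymer")
    return found
```

A check of this kind could serve as a filter on generated candidates or as a rejection step inside a search loop. Returning the list of violated constraints, rather than a single boolean, also makes it possible to log why candidates were rejected.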
Design an end-to-end system that repeatedly proposes sequences, runs experiments, and learns from results.
Address:
- How you represent sequences and incorporate constraints.
- How you generate candidate sequences (search / Bayesian optimization / evolutionary / RL / LLM prompting, etc.).
- How you balance exploration vs. exploitation and handle noisy measurements.
- How you choose a batch of sequences each round (not just one).
- How you evaluate progress and decide when to stop.
- Key failure modes (mode collapse, assay drift, data leakage, overfitting to simulator/predictor) and mitigations.
- What you would log/monitor in production.