Build real-vs-fake DNA classifier
Company: Jane Street
Role: Data Scientist
Category: Machine Learning
Difficulty: medium
Interview Round: Technical Screen
Suppose you are given DNA sequences represented as strings over {A, C, G, T}, and sequence lengths may vary. You have:
- a small labeled dataset containing both real DNA sequences and fake DNA sequences,
- a much larger dataset containing only confirmed real DNA sequences.
Design a machine learning system that predicts whether a new DNA sequence is real or fake.
Discuss:
1. how you would represent the sequences (for example, k-mer features, handcrafted biological features, embeddings, or sequence models),
2. whether you would frame this as standard supervised classification, anomaly detection / one-class learning, or a hybrid approach,
3. how you would create or augment fake examples without making them unrealistically easy to distinguish,
4. what model(s) you would try first and why,
5. how you would evaluate performance under class imbalance and possible distribution shift,
6. how you would prevent train/test leakage from duplicated or highly similar sequences,
7. how you would calibrate the final decision threshold if falsely classifying a real sequence as fake is costly.
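As a starting point for items 1 and 3, the sketch below shows one possible baseline: normalized k-mer frequencies as a fixed-length representation for variable-length sequences, and a composition-preserving shuffle as one way to generate fake examples that cannot be separated by base composition alone. The function names `kmer_frequencies` and `shuffled_decoy` are hypothetical, and the choice of k = 3 is an assumption for illustration, not part of the question.

```python
from collections import Counter
from itertools import product
import random

ALPHABET = "ACGT"


def kmer_frequencies(seq, k=3):
    """Return a fixed-length vector (4**k entries) of normalized k-mer frequencies.

    Normalizing by the number of k-mers makes sequences of different
    lengths comparable. Sequences shorter than k map to an all-zero vector.
    """
    vocab = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in vocab)
    if total == 0:
        return [0.0] * len(vocab)
    return [counts[m] / total for m in vocab]


def shuffled_decoy(seq, rng):
    """Return a synthetic fake: the same bases in random order.

    Base composition (and therefore GC content) is preserved, so a
    classifier must rely on higher-order structure to tell it apart.
    """
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for reproducibility
    real = "ACGTACGTTGCA"
    fake = shuffled_decoy(real, rng)
    print(len(kmer_frequencies(real, k=3)), fake)
```

A plain shuffle destroys all k-mer structure for k ≥ 2, so these decoys may still be easier than the fakes in the labeled set. A harder variant, under the same assumptions, would sample from a low-order Markov chain fitted to the real sequences.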
Quick Answer: This question tests applied machine learning on biological sequence data: sequence representation, problem framing (supervised classification vs. anomaly detection / one-class learning), construction of realistic fake examples, model choice, evaluation under class imbalance and distribution shift, prevention of train/test leakage, and calibration of the decision threshold when flagging a real sequence as fake is costly.
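For the threshold question in item 7, one simple option is to use the large real-only dataset directly: score a held-out set of confirmed real sequences and pick the threshold that keeps the fraction of reals flagged as fake below a chosen budget. The sketch below assumes a model that outputs a "fakeness" score where higher means more likely fake. The helper name `threshold_for_max_fpr` and the budget value are hypothetical.

```python
import math


def threshold_for_max_fpr(real_scores, max_fpr):
    """Pick a threshold from scores of held-out REAL sequences.

    A sequence is flagged as fake when score > threshold. The returned
    threshold guarantees that at most `max_fpr` of the held-out real
    sequences would be flagged.
    """
    if not real_scores:
        raise ValueError("need at least one real score")
    ordered = sorted(real_scores)
    n = len(ordered)
    # Number of real sequences we are allowed to flag (rounded down,
    # which errs on the side of fewer false alarms).
    allowed = min(math.floor(max_fpr * n), n - 1)
    # Everything strictly above this value is flagged: at most `allowed` items.
    return ordered[n - allowed - 1]


if __name__ == "__main__":
    held_out_real_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    t = threshold_for_max_fpr(held_out_real_scores, max_fpr=0.2)
    flagged = sum(s > t for s in held_out_real_scores)
    print(t, flagged / len(held_out_real_scores))
```

The held-out real sequences used here should not overlap with, or be near-duplicates of, the training sequences. Otherwise the measured false-positive rate is likely to be optimistic.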