Design a Real-vs-Fake DNA Classifier
Company: Jane Street
Role: Data Scientist
Category: Machine Learning
Difficulty: medium
Interview Round: Technical Screen
##### Question
You are given DNA sequences over the alphabet `{A, C, G, T}`, where sequence lengths may vary. You have:
- a **small labeled dataset** containing both **real** and **fake** DNA sequences, and
- a **much larger dataset** containing **only confirmed real** DNA sequences.
Design a machine learning system that predicts whether a new DNA sequence is real or fake. Address the following:
1. **Representation.** How would you represent DNA sequences for modeling (for example, k-mer features, handcrafted biological features, learned embeddings, or CNN/RNN/Transformer sequence models)?
2. **Problem framing.** Given that the extra data contain only positive (real) examples, would you frame this as standard supervised classification, positive-unlabeled learning, anomaly detection / one-class learning, self-supervised pretraining, or a hybrid? How would you leverage the large real-only dataset?
3. **Synthetic negatives.** Would you generate or augment fake sequences? If so, how would you create realistic synthetic negatives that are hard to distinguish from real sequences, rather than ones whose generation artifacts make the problem artificially easy?
4. **Model choice.** What model(s) would you try first, and why? When would you move from simple baselines to deeper sequence models?
5. **Training under imbalance.** How would you train given limited labeled data and likely class imbalance (and the risk that synthetic fakes may not match real-world fakes)?
6. **Evaluation.** How would you evaluate performance, choose metrics, and avoid leakage from duplicated or highly similar sequences?
7. **Threshold calibration.** How would you calibrate the final decision threshold if falsely classifying a real sequence as fake is costly?
8. **Failure modes.** What domain-specific issues would you watch for, such as reverse-complement symmetry, variable sequence length, GC-content shortcut learning, near-duplicate leakage, distribution shift, overfitting, and poor calibration?
Your answer should propose a practical modeling strategy, explain trade-offs, and describe how you would validate that the classifier is learning biological structure rather than artifacts.
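To make the representation, synthetic-negative, and GC-shortcut points concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a reference solution: the function names (`canonical_kmer_freqs`, `shuffled_negative`) and the choice of `k = 3` are hypothetical, and a plain base shuffle is only the simplest composition-preserving negative. Canonical k-mers (each k-mer merged with its reverse complement) give a length-normalized feature vector that is identical for a sequence and its reverse complement. Shuffling a real sequence yields a fake with exactly the same length and GC content, so a model cannot separate the two on composition alone.

```python
import random
from collections import Counter

# Translation table for complementing bases over the alphabet {A, C, G, T}.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmer_freqs(seq: str, k: int = 3) -> dict:
    """Relative frequencies of canonical k-mers.

    Each k-mer is merged with its reverse complement (the lexicographically
    smaller of the two is kept), so a sequence and its reverse complement map
    to the same features. Dividing by the total count normalizes for length.
    """
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[min(kmer, reverse_complement(kmer))] += 1
    total = sum(counts.values())
    if total == 0:  # sequence shorter than k
        return {}
    return {kmer: c / total for kmer, c in counts.items()}


def shuffled_negative(seq: str, rng: random.Random) -> str:
    """Hypothetical synthetic negative: permute the bases of a real sequence.

    Length and base composition (hence GC content) are preserved exactly;
    only the ordering, and therefore higher-order k-mer structure, changes.
    """
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


# Example usage with a made-up sequence (not real data).
real = "ACGTTGCAAGGCTAACCGTA"
fake = shuffled_negative(real, random.Random(0))
features_real = canonical_kmer_freqs(real)
features_fake = canonical_kmer_freqs(fake)
```

A shuffle like this destroys all ordering, so it is likely too easy as the only source of negatives; harder variants (for example, shuffles that preserve dinucleotide counts, or samples from a Markov model fit to real sequences) follow the same pattern of matching low-order statistics while breaking higher-order structure.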
Quick Answer: A Jane Street Data Scientist technical-screen question on building a classifier that distinguishes real from fake DNA sequences, given a small labeled real/fake set plus a large real-only corpus. It tests sequence representation (k-mers, embeddings, CNN/Transformer), hybrid supervised + anomaly-detection framing, realistic hard-negative generation, evaluation under class imbalance, leakage prevention, threshold calibration, and biological failure modes like reverse-complement symmetry and GC-content shortcuts.
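The threshold-calibration point can be sketched as follows. This is a minimal illustration, assuming the model outputs a score where higher means "more likely fake" and that a held-out slice of the large real-only dataset (deduplicated against the training data) is available. The function name `threshold_for_real_fpr` and the example scores are hypothetical. The idea is to pick the threshold from the empirical score distribution of held-out real sequences so that the fraction of real sequences flagged as fake stays at or below a chosen budget.

```python
import math


def threshold_for_real_fpr(real_scores, max_fpr):
    """Pick a threshold t so that at most `max_fpr` of held-out real
    sequences have score > t (a sequence is flagged as fake when score > t).

    Assumes higher scores mean "more fake-like".
    """
    if not real_scores:
        raise ValueError("need at least one held-out real score")
    if not 0.0 <= max_fpr < 1.0:
        raise ValueError("max_fpr must be in [0, 1)")
    ordered = sorted(real_scores)
    n = len(ordered)
    # Number of held-out real sequences we are allowed to flag.
    allowed = min(math.floor(max_fpr * n), n - 1)
    # Everything strictly above this value is flagged; ties stay unflagged,
    # which keeps the empirical false-positive rate within budget.
    return ordered[n - allowed - 1]


# Example usage with made-up scores for ten held-out real sequences.
held_out_real = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
t = threshold_for_real_fpr(held_out_real, max_fpr=0.1)
flagged = sum(score > t for score in held_out_real)
```

Because the guarantee is only empirical, the held-out real set should be large enough for the target rate to be estimated reliably, and the threshold should be rechecked if the distribution of incoming sequences shifts.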