Design a Real-vs-Fake DNA Classifier
Company: Jane Street
Role: Data Scientist
Category: Machine Learning
Difficulty: Medium
Interview Round: Technical Screen
You are given DNA sequences over the alphabet {A, C, G, T}. A small labeled dataset contains both **real** and **fake** DNA sequences. In addition, you have a much larger dataset containing **only real** DNA sequences.
Design a machine learning approach to classify whether a DNA sequence is real or fake.
Address the following:
1. How would you represent DNA sequences for modeling (for example, k-mer features, learned embeddings, CNN/RNN/Transformer-based sequence models, or biologically motivated features)?
2. How would you leverage the large real-only dataset? Would you frame this as standard supervised learning, positive-unlabeled learning, anomaly detection / one-class classification, self-supervised pretraining, or a hybrid approach?
3. Would you generate additional fake sequences to augment training? If yes, how would you create synthetic negatives that are realistic and not trivially easy for the model to distinguish?
4. How would you train and evaluate the model given limited labeled data, likely class imbalance, and the risk that synthetic fake sequences may not match real-world fake sequences?
5. What important failure modes or domain-specific issues would you check for, such as reverse-complement symmetry, variable sequence length, GC-content bias, duplicated or near-duplicated sequences, and distribution shift between train and test data?
Your answer should propose a practical modeling strategy, explain trade-offs, and describe how you would validate that the classifier is learning biological structure rather than artifacts.
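As a starting point for items 1, 3, and 5, the following minimal sketch illustrates three of the ideas named in the prompt: reverse-complement-aware k-mer features, a GC-content check, and a simple synthetic negative. It is one possible baseline under stated assumptions, not a recommended final design. All function names are hypothetical, and the code assumes uppercase sequences containing only A, C, G, and T.

```python
import random
from collections import Counter
from itertools import product

# Assumption: sequences are uppercase strings over {A, C, G, T} only.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmer_features(seq: str, k: int = 3) -> dict[str, float]:
    """Normalized counts of canonical k-mers.

    A k-mer and its reverse complement share one bin (the lexicographically
    smaller of the two), so a sequence and its reverse complement map to
    the same feature vector.
    """
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[min(kmer, reverse_complement(kmer))] += 1
    total = sum(counts.values())
    # Fixed vocabulary so every sequence yields the same feature keys.
    vocab = sorted(
        {min(w, reverse_complement(w)) for w in map("".join, product("ACGT", repeat=k))}
    )
    return {w: (counts[w] / total if total else 0.0) for w in vocab}


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def shuffled_negative(seq: str, rng: random.Random) -> str:
    """Synthetic negative: permute the bases of a real sequence.

    Length and base composition (hence GC content) are preserved exactly,
    so a classifier cannot separate these negatives on those artifacts alone.
    """
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


if __name__ == "__main__":
    rng = random.Random(0)
    real = "ACGTTGCAACGGATCCGTAA"
    fake = shuffled_negative(real, rng)
    print(gc_content(real), gc_content(fake))
    print(canonical_kmer_features(real, k=3) == canonical_kmer_features(reverse_complement(real), k=3))
```

A plain base shuffle destroys all higher-order structure, so any model that looks at k-mers with k of 2 or more will probably find these negatives easy. It is most useful as a control that rules out length and GC-content shortcuts. Harder negatives, for example shuffles that preserve dinucleotide frequencies or samples from a generative model fit to the real-only data, would be a natural next step.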
Quick Answer: This question evaluates a candidate's expertise in machine learning for biological sequence data, covering representation learning for DNA, use of large unlabeled datasets (e.g.