##### Question

You are given DNA sequences over the alphabet `{A, C, G, T}`, where sequence lengths may vary. You have:

- a **small labeled dataset** containing both **real** and **fake** DNA sequences, and
- a **much larger dataset** containing **only confirmed real** DNA sequences.

Design a machine learning system that predicts whether a new DNA sequence is real or fake. Address the following:

1. **Representation.** How would you represent DNA sequences for modeling (for example, k-mer features, handcrafted biological features, learned embeddings, or CNN/RNN/Transformer sequence models)?
2. **Problem framing.** Given that most of the extra data are positive (real) examples only, would you frame this as standard supervised classification, positive-unlabeled learning, anomaly detection / one-class learning, self-supervised pretraining, or a hybrid? How would you leverage the large real-only dataset?
3. **Synthetic negatives.** Would you generate or augment fake sequences? If so, how would you create realistic synthetic negatives that are not trivially easy to distinguish (i.e., that don't introduce artifacts making the problem artificially easy)?
4. **Model choice.** What model(s) would you try first, and why? When would you move from simple baselines to deeper sequence models?
5. **Training under imbalance.** How would you train given limited labeled data and likely class imbalance (and the risk that synthetic fakes may not match real-world fakes)?
6. **Evaluation.** How would you evaluate performance, choose metrics, and avoid leakage from duplicated or highly similar sequences?
7. **Threshold calibration.** How would you calibrate the final decision threshold if falsely classifying a real sequence as fake is costly?
8. **Failure modes.** What domain-specific issues would you watch for, such as reverse-complement symmetry, variable sequence length, GC-content shortcut learning, near-duplicate leakage, distribution shift, overfitting, and calibration?

Your answer should propose a practical modeling strategy, explain trade-offs, and describe how you would validate that the classifier is learning biological structure rather than artifacts.
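
As one illustration of the k-mer representation and the reverse-complement symmetry named in the question, here is a minimal sketch of a possible baseline featurizer. It is not a reference solution: the function names (`reverse_complement`, `canonical_kmer_counts`, `kmer_frequencies`) and the default `k = 3` are assumptions chosen for the example. Each k-mer is merged with its reverse complement so that a sequence and its opposite strand map to the same features, and counts are normalized to frequencies so that variable sequence length does not dominate.

```python
from collections import Counter

# Translation table for complementing bases (illustrative helper, not from the source).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count k-mers, merging each k-mer with its reverse complement.

    The canonical form is the lexicographically smaller of the k-mer and
    its reverse complement, so both strands give identical counts.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) - set("ACGT"):
            continue  # skip windows with characters outside the alphabet
        counts[min(kmer, reverse_complement(kmer))] += 1
    return counts


def kmer_frequencies(seq: str, k: int = 3) -> dict:
    """Length-normalized canonical k-mer frequencies (empty dict if no k-mers)."""
    counts = canonical_kmer_counts(seq, k)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}
```

Features like these could feed a simple linear or tree-based baseline before any deeper sequence model is tried. Whether they capture biological structure or only shortcuts such as GC content would still need to be checked separately.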

A Jane Street Data Scientist technical-screen question on building a classifier that distinguishes real from fake DNA sequences, given a small labeled real/fake set plus a large real-only corpus. It tests sequence representation (k-mers, embeddings, CNN/Transformer), hybrid supervised + anomaly-detection framing, realistic hard-negative generation, evaluation under class imbalance, leakage prevention, threshold calibration, and biological failure modes like reverse-complement symmetry and GC-content shortcuts.
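
For the threshold-calibration part, one simple option is to pick the cutoff from held-out real sequences alone, which suits a setting where real-only data is plentiful. The sketch below assumes a hypothetical model that outputs a "fake" score per sequence (higher means more likely fake). The name `threshold_for_real_fpr` and the parameter `max_false_fake_rate` are illustrative, not part of the question.

```python
import math


def threshold_for_real_fpr(real_scores, max_false_fake_rate=0.01):
    """Pick a threshold t so that the rule `score > t` flags at most the
    requested fraction of held-out real sequences as fake.

    real_scores: "fake" scores of held-out sequences known to be real.
    """
    if not real_scores:
        raise ValueError("need at least one held-out real score")
    if not 0.0 <= max_false_fake_rate < 1.0:
        raise ValueError("max_false_fake_rate must be in [0, 1)")
    ordered = sorted(real_scores)
    n = len(ordered)
    # Number of real sequences we are allowed to misclassify as fake.
    allowed = math.floor(max_false_fake_rate * n)
    # At most `allowed` scores are strictly greater than this element,
    # even when there are ties.
    return ordered[n - allowed - 1]
```

A sequence would then be called fake only if its score exceeds the returned threshold. The held-out real set should be disjoint from, and not nearly identical to, anything used in training; otherwise the estimated false-alarm rate will be optimistic.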

What difficulty level is this interview question?

This is a medium difficulty Machine Learning question, commonly asked during Technical Screen rounds at Jane Street.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Jane Street during technical interviews.

Design a Real-vs-Fake DNA Classifier | Jane Street Interview Question
