
Design a Real-vs-Fake DNA Classifier

Last updated: Jun 15, 2026

Quick Overview

A Jane Street Data Scientist technical-screen question on building a classifier that distinguishes real from fake DNA sequences, given a small labeled real/fake set plus a large real-only corpus. It tests sequence representation (k-mers, embeddings, CNN/Transformer), hybrid supervised + anomaly-detection framing, realistic hard-negative generation, evaluation under class imbalance, leakage prevention, threshold calibration, and biological failure modes like reverse-complement symmetry and GC-content shortcuts.



Company: Jane Street

Role: Data Scientist

Category: Machine Learning

Difficulty: medium

Interview Round: Technical Screen



Jan 27, 2026, 12:00 AM
Question

You are given DNA sequences over the alphabet {A, C, G, T}, where sequence lengths may vary. You have:

  • a small labeled dataset containing both real and fake DNA sequences, and
  • a much larger dataset containing only confirmed real DNA sequences.

Design a machine learning system that predicts whether a new DNA sequence is real or fake. Address the following:

  1. Representation. How would you represent DNA sequences for modeling (for example, k-mer features, handcrafted biological features, learned embeddings, or CNN/RNN/Transformer sequence models)?
  2. Problem framing. Given that most of the extra data are positive (real) examples only, would you frame this as standard supervised classification, positive-unlabeled learning, anomaly detection / one-class learning, self-supervised pretraining, or a hybrid? How would you leverage the large real-only dataset?
  3. Synthetic negatives. Would you generate or augment fake sequences? If so, how would you create realistic synthetic negatives that are not trivially easy to distinguish from real sequences (that is, negatives whose generation process does not introduce artifacts that make the problem artificially easy)?
  4. Model choice. What model(s) would you try first, and why? When would you move from simple baselines to deeper sequence models?
  5. Training under imbalance. How would you train given limited labeled data and likely class imbalance (and the risk that synthetic fakes may not match real-world fakes)?
  6. Evaluation. How would you evaluate performance, choose metrics, and avoid leakage from duplicated or highly similar sequences?
  7. Threshold calibration. How would you calibrate the final decision threshold if falsely classifying a real sequence as fake is costly?
  8. Failure modes. What domain-specific issues would you watch for, such as reverse-complement symmetry, variable sequence length, GC-content shortcut learning, near-duplicate leakage, distribution shift, overfitting, and calibration?
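As one illustration of points 1 and 8, the sketch below builds k-mer features that are invariant to reverse-complement strand and normalized for variable sequence length. It is a minimal example under assumptions, not a reference solution: the function names, the default `k=3`, and the choice to skip k-mers containing characters outside `{A, C, G, T}` are all hypothetical.

```python
from collections import Counter

# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count k-mers, merging each k-mer with its reverse complement.

    The canonical form is the lexicographically smaller of the k-mer and
    its reverse complement, so a sequence and its reverse complement
    produce identical counts.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: silently skip windows with non-ACGT characters.
        if set(kmer) <= set("ACGT"):
            counts[min(kmer, reverse_complement(kmer))] += 1
    return counts


def canonical_kmer_frequencies(seq: str, k: int = 3) -> dict:
    """Normalize counts to frequencies so sequence length does not dominate."""
    counts = canonical_kmer_counts(seq, k)
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()} if total else {}
```

Because both strands map to the same feature vector, a downstream model cannot treat strand orientation as a signal. Frequencies rather than raw counts reduce, but do not eliminate, the risk that the model keys on sequence length.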

Your answer should propose a practical modeling strategy, explain trade-offs, and describe how you would validate that the classifier is learning biological structure rather than artifacts.
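One possible way to check for artifacts, sketched here under assumptions, is to measure how well trivial features alone separate the labeled real and fake sets. If GC content or length by itself gives an AUC far from 0.5, a model trained on that data may be learning a shortcut rather than biological structure. The helper names (`gc_content`, `auc`, `shortcut_report`) are hypothetical, and the pairwise AUC is written for clarity on a small labeled set, not for speed.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def auc(scores_pos, scores_neg) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as 0.5. This is the pairwise (Mann-Whitney) form of ROC AUC,
    quadratic in the number of examples.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def shortcut_report(real_seqs, fake_seqs) -> dict:
    """AUC of each trivial feature alone, folded so 0.5 means no signal."""
    report = {}
    for name, feature in (("gc_content", gc_content), ("length", len)):
        a = auc([feature(s) for s in real_seqs],
                [feature(s) for s in fake_seqs])
        report[name] = max(a, 1.0 - a)  # direction of the shortcut is irrelevant
    return report


# Toy example: the fakes are AT-only, so GC content separates them perfectly.
print(shortcut_report(["GGCC", "GCGC"], ["ATAT", "AATT"]))
```

A value near 1.0 for a trivial feature suggests the negatives (especially synthetic ones) should be regenerated to match the real distribution on that feature, or that evaluation should be stratified by it.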

