PracHub

Design a Real-vs-Fake DNA Classifier

Last updated: Jun 15, 2026

Quick Overview

A Jane Street Data Scientist technical-screen question on building a classifier that distinguishes real from fake DNA sequences, given a small labeled real/fake set plus a large real-only corpus. It tests sequence representation (k-mers, embeddings, CNN/Transformer), hybrid supervised + anomaly-detection framing, realistic hard-negative generation, evaluation under class imbalance, leakage prevention, threshold calibration, and biological failure modes like reverse-complement symmetry and GC-content shortcuts.


Company: Jane Street

Role: Data Scientist

Category: Machine Learning

Difficulty: medium

Interview Round: Technical Screen



Jan 27, 2026, 12:00 AM
Question

You are given DNA sequences over the alphabet {A, C, G, T}, where sequence lengths may vary. You have:

  • a small labeled dataset containing both real and fake DNA sequences, and
  • a much larger dataset containing only confirmed real DNA sequences.

Design a machine learning system that predicts whether a new DNA sequence is real or fake. Address the following:

  1. Representation. How would you represent DNA sequences for modeling (for example, k-mer features, handcrafted biological features, learned embeddings, or CNN/RNN/Transformer sequence models)?
  2. Problem framing. Given that the additional data contain only positive (real) examples, would you frame this as standard supervised classification, positive-unlabeled learning, anomaly detection / one-class learning, self-supervised pretraining, or a hybrid? How would you leverage the large real-only dataset?
  3. Synthetic negatives. Would you generate or augment fake sequences? If so, how would you create realistic synthetic negatives that are not trivially easy to distinguish, i.e., negatives whose generation process does not leave artifacts that make the problem artificially easy?
  4. Model choice. What model(s) would you try first, and why? When would you move from simple baselines to deeper sequence models?
  5. Training under imbalance. How would you train given limited labeled data and likely class imbalance (and the risk that synthetic fakes may not match real-world fakes)?
  6. Evaluation. How would you evaluate performance, choose metrics, and avoid leakage from duplicated or highly similar sequences?
  7. Threshold calibration. How would you calibrate the final decision threshold if falsely classifying a real sequence as fake is costly?
  8. Failure modes. What domain-specific issues would you watch for, such as reverse-complement symmetry, variable sequence length, GC-content shortcut learning, near-duplicate leakage, distribution shift, overfitting, and calibration?
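
To make items 1, 3, and 8 concrete, here is a minimal Python sketch of one possible baseline, not a reference answer. It computes k-mer frequencies that are invariant to reverse complementation (each k-mer is merged with its reverse complement), and it builds a simple synthetic negative by shuffling a real sequence, which preserves length and base composition (and therefore GC content) so a model cannot separate the classes on those shortcuts alone. The function names `canonical_kmer_freqs` and `shuffled_negative` and the default `k=3` are illustrative assumptions, and real-world fakes may look nothing like shuffled sequences.

```python
import random
from collections import Counter

# Translation table for complementing bases (A<->T, C<->G).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmer_freqs(seq: str, k: int = 3) -> dict[str, float]:
    """Relative frequencies of canonical k-mers (hypothetical helper).

    A k-mer and its reverse complement are counted under the same key
    (the lexicographically smaller of the two), so a sequence and its
    reverse complement produce identical features. Windows containing
    characters outside {A, C, G, T} are skipped.
    """
    seq = seq.upper()
    counts: Counter[str] = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[min(kmer, reverse_complement(kmer))] += 1
    total = sum(counts.values())
    # Normalising by the window count reduces the dependence on length.
    return {kmer: c / total for kmer, c in counts.items()} if total else {}


def shuffled_negative(seq: str, rng: random.Random) -> str:
    """Synthetic negative (assumption): permute the bases of a real sequence.

    Length and base composition are preserved exactly; higher-order
    structure (dinucleotide and longer patterns) is destroyed.
    """
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)
```

Because the shuffle keeps single-base composition fixed, any separation a classifier achieves on these negatives must come from higher-order structure. A stricter variant would also preserve dinucleotide counts, which makes the negatives harder still.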

Your answer should propose a practical modeling strategy, explain trade-offs, and describe how you would validate that the classifier is learning biological structure rather than artifacts.
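
For item 7, one way to use the large real-only dataset is to set the decision threshold directly from held-out real sequences. The sketch below assumes a model that outputs a "fake" score per sequence (higher means more likely fake) and picks the smallest observed cutoff that flags at most a chosen fraction of held-out reals. The function name `threshold_for_real_fpr` and the rule "flag when score > threshold" are assumptions made for illustration.

```python
import math
from typing import Sequence


def threshold_for_real_fpr(real_scores: Sequence[float], max_fpr: float) -> float:
    """Pick a cutoff t so that at most `max_fpr` of held-out reals score above t.

    `real_scores` are fake-scores assigned to confirmed real sequences that
    were not used for training. A sequence is flagged as fake when its
    score is strictly greater than the returned threshold.
    """
    if not real_scores:
        raise ValueError("need at least one held-out real score")
    if not 0.0 <= max_fpr < 1.0:
        raise ValueError("max_fpr must be in [0, 1)")
    ordered = sorted(real_scores)
    n = len(ordered)
    allowed = math.floor(max_fpr * n)  # number of reals we may wrongly flag
    # Everything strictly above this order statistic is flagged, which is
    # at most `allowed` of the held-out reals (fewer if scores tie).
    return ordered[n - allowed - 1]


def is_flagged_fake(score: float, threshold: float) -> bool:
    """Apply the decision rule assumed above."""
    return score > threshold
```

The held-out reals used here should be de-duplicated against the training data (for example, by clustering near-identical sequences and splitting by cluster); otherwise the estimated false-positive rate will be optimistic.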
