PracHub

Design a Real-vs-Fake DNA Classifier

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a candidate's expertise in machine learning for biological sequence data, covering representation learning for DNA, use of large unlabeled datasets (e.g.


Company: Jane Street

Role: Data Scientist

Category: Machine Learning

Difficulty: medium

Interview Round: Technical Screen


Jan 27, 2026, 12:00 AM

You are given DNA sequences over the alphabet {A, C, G, T}. A small labeled dataset contains both real and fake DNA sequences. In addition, you have a much larger dataset containing only real DNA sequences.

Design a machine learning approach to classify whether a DNA sequence is real or fake.

Address the following:

  1. How would you represent DNA sequences for modeling (for example, k-mer features, learned embeddings, CNN/RNN/Transformer-based sequence models, or biologically motivated features)?
  2. How would you leverage the large real-only dataset? Would you frame this as standard supervised learning, positive-unlabeled learning, anomaly detection / one-class classification, self-supervised pretraining, or a hybrid approach?
  3. Would you generate additional fake sequences to augment training? If yes, how would you create synthetic negatives that are realistic and not trivially easy for the model to distinguish?
  4. How would you train and evaluate the model given limited labeled data, likely class imbalance, and the risk that synthetic fake sequences may not match real-world fake sequences?
  5. What important failure modes or domain-specific issues would you check for, such as reverse-complement symmetry, variable sequence length, GC-content bias, duplicated or near-duplicated sequences, and distribution shift between train and test data?

Your answer should propose a practical modeling strategy, explain trade-offs, and describe how you would validate that the classifier is learning biological structure rather than artifacts.
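
As a starting point for points 1, 3, and 5, the following is a minimal Python sketch, not a reference solution. It shows reverse-complement-canonical k-mer counts (so a sequence and its reverse complement get identical features), a GC-content check, and one possible way to draw synthetic negatives from a low-order Markov chain fit on the real-only data. All function names and the defaults `k=3` and `order=2` are hypothetical choices made for illustration. Markov samples preserve short-range k-mer statistics, so they are harder to separate than uniform random strings, but nothing guarantees they resemble the real-world fakes in the labeled set.

```python
import random
from collections import Counter, defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over {A, C, G, T}."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical_kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count k-mers, merging each k-mer with its reverse complement.

    The key is the lexicographically smaller of the two strings, so a
    sequence and its reverse complement yield identical counts.
    """
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[min(kmer, reverse_complement(kmer))] += 1
    return counts


def gc_content(seq: str) -> float:
    """Fraction of G or C bases; 0.0 for an empty sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def markov_negative(real_seqs, length, order=2, seed=0):
    """Sample one synthetic negative from a Markov chain fit on real_seqs.

    Assumes order >= 1. Contexts never seen in the real data fall back to
    a uniformly random base.
    """
    rng = random.Random(seed)
    transitions = defaultdict(Counter)  # context -> next-base counts
    starts = []
    for s in real_seqs:
        if len(s) <= order:
            continue
        starts.append(s[:order])
        for i in range(len(s) - order):
            transitions[s[i:i + order]][s[i + order]] += 1
    if not starts:
        raise ValueError("need at least one real sequence longer than `order`")
    out = list(rng.choice(starts))
    while len(out) < length:
        nxt = transitions.get("".join(out[-order:]))
        if not nxt:
            out.append(rng.choice("ACGT"))
            continue
        bases, weights = zip(*nxt.items())
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out[:length])
```

One way to use this is to compare `gc_content` and `canonical_kmer_counts` distributions between the real sequences, the labeled fakes, and the generated negatives. If a classifier separates the classes on GC content alone, it is likely learning an artifact rather than biological structure.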
