PracHub

Build real-vs-fake DNA classifier

Last updated: Mar 29, 2026

Quick Overview

This question evaluates competency in machine learning for biological sequence data. It covers sequence representation, model framing (supervised classification vs. anomaly detection), construction of realistic fake examples, model choice, evaluation under class imbalance and distribution shift, prevention of train/test leakage, and decision-threshold calibration when misclassifying a real sequence as fake is costly.

Company: Jane Street

Role: Data Scientist

Category: Machine Learning

Difficulty: medium

Interview Round: Technical Screen

Feb 3, 2026, 12:00 AM

Suppose you are given DNA sequences of varying length, represented as strings over {A, C, G, T}. You have:

  • a small labeled dataset containing both real DNA sequences and fake DNA sequences,
  • a much larger dataset containing only confirmed real DNA sequences.

Design a machine learning system that predicts whether a new DNA sequence is real or fake.

Discuss:

  1. how you would represent the sequences (for example, k-mer features, handcrafted biological features, embeddings, or sequence models),
  2. whether you would frame this as standard supervised classification, anomaly detection / one-class learning, or a hybrid approach,
  3. how you would create or augment fake examples without making them unrealistically easy to distinguish,
  4. what model(s) you would try first and why,
  5. how you would evaluate performance under class imbalance and possible distribution shift,
  6. how you would prevent train/test leakage from duplicated or highly similar sequences,
  7. how you would calibrate the final decision threshold if falsely classifying a real sequence as fake is costly.
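
As a starting point for items 1 and 7, the sketch below shows one possible way to turn variable-length sequences into fixed-size k-mer frequency vectors and to pick a decision threshold from held-out real sequences only. It is an illustration under stated assumptions, not a reference solution: the function names `kmer_profile` and `threshold_for_fpr` are hypothetical, the scoring model that produces the "fakeness" scores is out of scope, and the convention that a higher score means "more likely fake" is assumed.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_profile(seq: str, k: int = 3) -> list[float]:
    """Map a sequence to normalized k-mer frequencies in a fixed order.

    Every sequence, whatever its length, becomes a vector of size 4**k,
    which any standard classifier or one-class model can consume.
    """
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    # Enumerate all k-mers in lexicographic order so positions are stable.
    vocab = ("".join(p) for p in product(ALPHABET, repeat=k))
    return [counts[kmer] / total if total else 0.0 for kmer in vocab]


def threshold_for_fpr(real_scores: list[float], max_fpr: float) -> float:
    """Choose a threshold so at most max_fpr of real sequences are flagged.

    Assumes higher score = more fake-looking, and that a sequence is
    flagged as fake only when its score is strictly above the threshold.
    real_scores should come from held-out real sequences that the model
    never saw during training.
    """
    ranked = sorted(real_scores, reverse=True)
    allowed = math.floor(max_fpr * len(ranked))  # reals we may wrongly flag
    if allowed >= len(ranked):
        return float("-inf")  # every real sequence may be flagged
    return ranked[allowed]


if __name__ == "__main__":
    print(kmer_profile("ACGT", k=1))
    held_out_real = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    print(threshold_for_fpr(held_out_real, max_fpr=0.2))
```

Because the large real-only dataset is far bigger than the labeled set, it gives a much tighter estimate of the false-positive rate on real sequences than the labeled data alone could, which is why the threshold here is set from real scores. This only holds if those held-out real sequences are not duplicates or near-duplicates of training sequences (item 6).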
