PracHub

Predict Stock Prices from Google Search Data

Last updated: Jul 2, 2026

Company: Two Sigma

Role: Data Scientist

Category: Machine Learning

Difficulty: medium

Interview Round: Technical Screen

You are given access to historical Google search data (relative search volume over time for arbitrary query terms) along with standard historical market data (daily prices and trading volume) for a universe of stocks. Design a model that uses the search data to predict stock prices.

Walk through your approach end to end: how you frame the prediction target, which search queries you would use and what features you would build from them, what model you would fit, how you would train and validate it without fooling yourself, and how you would decide whether the resulting signal is genuinely useful. Expect the interviewer to keep pressing with "what else?" after each answer, so be prepared to go deeper on features, failure modes, and validation at every step.

```hint Pick the right target
Raw price levels are non-stationary (close to a random walk), so a model that "predicts the price" mostly learns yesterday's price. Think about predicting **forward returns** (or even volatility/volume) over a chosen horizon instead, and about whether you want a time-series forecast for one asset or a **cross-sectional ranking** across many stocks.
```

```hint What search data actually gives you
Public search-volume data is **relative, normalized, sampled, and revised**: the number for a given week can change when it is re-downloaded later. Useful features are usually about *abnormal attention*: current search intensity versus that query's own trailing history (e.g., a z-score or log-change), not the raw level.
```

```hint The validation trap
The fastest way to a fake result here is temporal leakage: shuffled K-fold, features built with future information, or query terms picked because you already know they "worked" historically. Think **walk-forward evaluation**, point-in-time data, and an honest accounting of how many hypotheses you tested.
```
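
The target and feature ideas from the hints can be sketched in a few lines. This is only an illustration under assumptions: the function names `forward_log_returns` and `abnormal_attention`, the `log(1 + v)` transform, and the window length are hypothetical choices, not something the question prescribes.

```python
import math
from statistics import mean, stdev


def forward_log_returns(prices, horizon):
    """Log return from day t to day t + horizon; None where the future is unknown."""
    n = len(prices)
    return [
        math.log(prices[t + horizon] / prices[t]) if t + horizon < n else None
        for t in range(n)
    ]


def abnormal_attention(search_index, window):
    """Z-score of log search volume against its own trailing window.

    Only values strictly before t enter the mean and standard deviation,
    so the feature at t carries no look-ahead. Returns None until a full
    window exists, or when the trailing window is flat.
    """
    logs = [math.log(1.0 + v) for v in search_index]  # assumed transform
    scores = []
    for t in range(len(logs)):
        if t < window:
            scores.append(None)
            continue
        history = logs[t - window:t]
        sd = stdev(history)
        scores.append((logs[t] - mean(history)) / sd if sd > 0 else None)
    return scores


prices = [100.0, 110.0, 121.0]
targets = forward_log_returns(prices, horizon=1)

search = [10, 12, 11, 13, 50]  # made-up index values with a spike on the last day
attention = abnormal_attention(search, window=4)
```

Because the z-score compares each query only with its own history, it sidesteps the fact that the index is relative rather than an absolute count.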

Dec 1, 2024, 12:00 AM

Constraints & Assumptions

  • Search data is available at daily (or weekly) granularity per query term, as a normalized relative-volume index rather than absolute counts, and only becomes available with some delay after the period it covers.
  • The stock universe is a set of reasonably liquid equities with standard daily open/high/low/close/volume history.
  • You may assume enough history (several years) to fit and evaluate a model, but the signal-to-noise ratio of any return-prediction problem is very low.
  • The interviewer cares about modeling judgment and statistical honesty, not about production infrastructure.
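
The availability delay in the first constraint is where leakage most easily creeps in. A minimal sketch of a point-in-time join follows, assuming a fixed publication lag; `point_in_time_join` and `delay_days` are hypothetical names, and the real delay would have to be confirmed with the data provider.

```python
from bisect import bisect_right
from datetime import date, timedelta


def point_in_time_join(trading_days, search_obs, delay_days):
    """For each trading day, return the latest search value already published.

    search_obs: list of (period_date, value) pairs sorted by date.
    delay_days: assumed publication lag after the period the value covers.
    """
    available = [d + timedelta(days=delay_days) for d, _ in search_obs]
    values = [v for _, v in search_obs]
    joined = []
    for day in trading_days:
        # Number of observations whose publication date is on or before `day`.
        k = bisect_right(available, day)
        joined.append(values[k - 1] if k > 0 else None)
    return joined


search_obs = [
    (date(2024, 1, 1), 40),
    (date(2024, 1, 2), 55),
    (date(2024, 1, 3), 90),
]
trading_days = [date(2024, 1, d) for d in (2, 3, 4, 5)]
aligned = point_in_time_join(trading_days, search_obs, delay_days=2)
```

This handles the lag only. Revisions need a separate safeguard, such as storing each download with its retrieval date and using the vintage that existed at prediction time.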

Clarifying Questions to Ask

  • What prediction horizon do you have in mind — next day, next week, longer? Search data is slow-moving, so the horizon constrains everything downstream.
  • Are we predicting a single stock's price path, or ranking a cross-section of many stocks (e.g., for a long-short strategy)?
  • Should the output be a price level, a return, a direction (up/down), or would predicting volatility/volume also count as success?
  • Exactly when does the search data for day $t$ become available, and can historical values be revised after the fact?
  • Is success measured by statistical accuracy (e.g., correlation with realized returns) or by economic value (a cost-aware backtest)?
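
If the answer to the last question is "statistical accuracy", two common yardsticks are the rank correlation between signal and realized return and the directional hit rate. The sketch below uses only the standard library; the helper names (`average_ranks`, `rank_ic`, `hit_rate`) and the sample numbers are illustrative assumptions.

```python
import math


def average_ranks(xs):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; NaN if either input has zero variance."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")
    return cov / math.sqrt(var_a * var_b)


def rank_ic(signal, realized):
    """Spearman rank correlation between a signal and realized returns."""
    return pearson(average_ranks(signal), average_ranks(realized))


def hit_rate(signal, realized):
    """Share of non-zero pairs where signal and realized return agree in sign."""
    pairs = [(s, r) for s, r in zip(signal, realized) if s != 0 and r != 0]
    return sum((s > 0) == (r > 0) for s, r in pairs) / len(pairs)


signal = [0.5, -0.2, 0.1, -0.3]       # made-up model scores
realized = [0.01, 0.02, 0.03, -0.01]  # made-up forward returns
ic = rank_ic(signal, realized)
hits = hit_rate(signal, realized)
```

Neither number accounts for trading costs, which is why the economic-value alternative in the question calls for a separate cost-aware backtest.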

What a Strong Answer Covers

  • Problem framing: predicting returns (or volatility) rather than price levels, an explicit horizon, and a deliberate choice between time-series and cross-sectional formulations.
  • Data realism: normalization/sampling/revision quirks of public search data, availability lag, point-in-time discipline, and query-selection issues (ambiguous tickers, company vs. product terms).
  • Feature construction: abnormal-attention measures (z-scores, log-changes vs. trailing history), lags, spike indicators, and controls for known drivers like past returns, volume, and volatility.
  • Model choice matched to signal-to-noise: starting simple and regularized before reaching for complex models, with a reasoned justification.
  • Leakage-free validation: walk-forward or purged time-series splits, no shuffled cross-validation, hyperparameters tuned only on past data.
  • Honest evaluation and skepticism: out-of-sample rank correlation / direction accuracy, cost-aware backtest versus simple baselines, and explicit treatment of multiple testing, reverse causality (attention chasing past returns), non-stationarity, and signal decay.
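
The "simple and regularized" and "walk-forward" points above can be combined in one small sketch: an expanding-window split with an embargo before each test block, and a one-feature ridge fit in closed form. The function names and the `embargo` and `penalty` parameters are hypothetical, and a real answer would likely use more features and an intercept.

```python
def walk_forward_splits(n_obs, initial_train, test_size, embargo):
    """Yield (train, test) index ranges with training strictly before test.

    The last `embargo` observations before each test block are dropped so
    that forward-return labels overlapping the test period are never
    trained on. Setting embargo to at least the label horizon is the
    assumption here; initial_train must exceed embargo.
    """
    start = initial_train
    while start + test_size <= n_obs:
        yield range(0, start - embargo), range(start, start + test_size)
        start += test_size


def fit_ridge_1d(x, y, penalty):
    """Closed-form ridge slope for a single feature with no intercept."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + penalty)


def walk_forward_predict(x, y, initial_train, test_size, embargo, penalty):
    """Out-of-sample predictions keyed by index; each uses only past data."""
    preds = {}
    splits = walk_forward_splits(len(x), initial_train, test_size, embargo)
    for train, test in splits:
        beta = fit_ridge_1d([x[i] for i in train], [y[i] for i in train], penalty)
        for i in test:
            preds[i] = beta * x[i]
    return preds


splits = list(walk_forward_splits(n_obs=10, initial_train=4, test_size=2, embargo=1))

x = [float(i) for i in range(1, 11)]  # toy feature
y = [2.0 * v for v in x]              # toy target, exactly linear in x
preds = walk_forward_predict(x, y, initial_train=4, test_size=2, embargo=1, penalty=0.0)
```

Any hyperparameter such as `penalty` would be chosen inside each training window, for example with a nested split of the same kind, rather than on the test blocks.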

Follow-up Questions

  • With thousands of stocks and an unlimited choice of query terms, how do you select queries without turning the whole exercise into data snooping?
  • Your backtest shows a strong in-sample Sharpe ratio that collapses out of sample. Walk through how you would diagnose what went wrong.
  • Search attention often reacts to price moves rather than leading them. How would you establish that your feature actually leads returns instead of lagging them?
  • Suppose the firm already runs momentum and reversal signals. How would you test whether your search-based signal adds incremental value rather than repackaging what they already have?
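
For the reverse-causality follow-up, one simple descriptive check is a lead-lag correlation profile. The sketch below is an assumption-laden illustration, not a causal test: `lead_lag_profile` is a hypothetical helper, and the toy data is built so that attention reacts to the previous day's return.

```python
import math


def correlation(a, b):
    """Pearson correlation; NaN if either input has zero variance."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")
    return cov / math.sqrt(var_a * var_b)


def lead_lag_profile(attention, returns, max_lag):
    """Correlation of attention[t] with returns[t + k] for k in [-max_lag, max_lag].

    k > 0: attention precedes returns (the predictive direction).
    k < 0: returns precede attention (attention reacting to price moves).
    """
    n = len(attention)
    profile = {}
    for k in range(-max_lag, max_lag + 1):
        pairs = [(attention[t], returns[t + k]) for t in range(n) if 0 <= t + k < n]
        xs, ys = zip(*pairs)
        profile[k] = correlation(xs, ys)
    return profile


# Toy data: attention on day t simply echoes the return on day t - 1.
returns = [0.01, -0.02, 0.03, 0.00, -0.01, 0.02, -0.03, 0.01]
attention = [0.0] + returns[:-1]
profile = lead_lag_profile(attention, returns, max_lag=2)
```

A profile that peaks at negative `k`, as it does for this toy data, suggests attention is chasing past returns. A useful signal would need to keep its correlation at positive `k` after controlling for those past returns.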