Build a regression model for wind power output

Q: Build a regression model for wind power output

This is an ML System Design interview question from Citadel for Data Scientist roles. View the full question and solution on PracHub.

Question

Task: Snapshot Regression for Turbine-Level Power Prediction (Non–Time-Series)

You are given turbine-level SCADA snapshots and concurrent weather data. Build a non–time-series regression model that predicts instantaneous (e.g., 1–10 minute averaged) turbine power output using only features available at that same snapshot.

Assume data may include: wind speed and direction (from nacelle and/or met mast), air temperature, pressure, humidity, turbulence intensity (TI), turbine operational signals (e.g., rotor speed, pitch, yaw), turbine metadata (rated power, rotor diameter, hub height, model), and site metadata (elevation, terrain roughness). No sequence modeling is allowed.

Describe and justify the following:

Candidate Features and Preprocessing

Weather and turbine features, including derived physics-based features (e.g., air density, dynamic pressure, power-curve proxies).
Encoding of wind direction, yaw misalignment, and turbulence/shear.
Normalization/standardization choices and handling of categorical/site/turbine identifiers.
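
As an illustration of the derived features named above, here is a minimal sketch, not a reference implementation. It assumes dry air (humidity is ignored in the density formula), pressure in Pa, temperature in °C, and angles in degrees. The function names and feature keys are hypothetical.

```python
import math

R_DRY_AIR = 287.05  # J/(kg*K), specific gas constant of dry air


def air_density(pressure_pa, temp_c):
    """Dry-air density from the ideal gas law: rho = p / (R * T)."""
    return pressure_pa / (R_DRY_AIR * (temp_c + 273.15))


def wrap_deg(angle):
    """Wrap an angle difference into [-180, 180) degrees."""
    return (angle + 180.0) % 360.0 - 180.0


def snapshot_features(wind_speed, wind_dir_deg, nacelle_dir_deg, pressure_pa, temp_c):
    """Build physics-based features for a single snapshot."""
    rho = air_density(pressure_pa, temp_c)
    theta = math.radians(wind_dir_deg)
    yaw_err = wrap_deg(wind_dir_deg - nacelle_dir_deg)
    return {
        "rho": rho,
        "dyn_pressure": 0.5 * rho * wind_speed ** 2,   # q = 1/2 rho v^2
        "power_density": 0.5 * rho * wind_speed ** 3,  # W/m^2, power-curve proxy
        "dir_sin": math.sin(theta),  # sin/cos pair avoids the 0/360 jump
        "dir_cos": math.cos(theta),
        "yaw_err_deg": yaw_err,
        "yaw_cos": math.cos(math.radians(yaw_err)),
    }
```

The sin/cos pair keeps 359° and 1° close together in feature space, and wrapping the yaw error avoids spurious values near ±360°.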

Handling Data Issues

Strategy for missing or noisy sensors; imputations and quality flags.
Outlier detection and treatment, including curtailment or abnormal operating modes.
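
One possible way to turn the data-issue handling above into explicit quality flags is sketched below. Every threshold (cut-in of 3 m/s, rated wind speed of 12 m/s, 5° pitch, 5% over rated) is an illustrative assumption rather than a value from the question, and the field names are hypothetical.

```python
def quality_flags(row, rated_kw, cut_in=3.0, rated_ws=12.0):
    """Return boolean quality flags for one snapshot (thresholds are illustrative)."""
    ws = row["wind_speed"]
    power = row["power_kw"]
    pitch = row.get("pitch_deg")  # None when the sensor value is missing
    in_ramp = cut_in + 2.0 < ws < rated_ws
    return {
        "missing_pitch": pitch is None,
        # Power clearly above nameplate suggests a sensor or scaling fault.
        "over_rated": power > 1.05 * rated_kw,
        # Plenty of wind but no output: downtime, fault, or full curtailment.
        "stopped_in_wind": ws > cut_in + 2.0 and power < 0.01 * rated_kw,
        # Blades pitched out below rated wind speed hints at partial curtailment.
        "possible_curtailment": pitch is not None and in_ramp and pitch > 5.0,
    }


def usable_for_training(flags):
    """Keep a snapshot only if no abnormal-operation flag is raised."""
    return not (
        flags["over_rated"]
        or flags["stopped_in_wind"]
        or flags["possible_curtailment"]
    )
```

Keeping the flags as columns, rather than silently dropping rows, lets the same logic serve both training-set filtering and inference-time diagnostics.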

Model Choices and Physics Encoding (no sequence models)

Compare: regularized linear models, gradient boosting, random forest, shallow MLP, GAMs.
How to encode known physics (e.g., approximate power curve, monotonicity in wind speed below rated, saturation at rated power) via features, constraints, or loss design.
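
A simple way to encode the physics mentioned above is to compute an idealized power curve, P = 1/2 · rho · A · Cp · v^3 capped at rated power, and let the learned model predict only a residual on top of it. The sketch below assumes a constant power coefficient Cp of 0.42 and fixed cut-in and cut-out speeds. These are placeholders, and real turbines have a speed-dependent Cp.

```python
import math


def physics_power_kw(ws, rho, rotor_diameter_m, rated_kw,
                     cp=0.42, cut_in=3.0, cut_out=25.0):
    """Idealized power curve in kW: zero outside [cut_in, cut_out), capped at rated."""
    if ws < cut_in or ws >= cut_out:
        return 0.0
    area = math.pi * (rotor_diameter_m / 2.0) ** 2  # swept rotor area, m^2
    p_kw = 0.5 * rho * area * cp * ws ** 3 / 1000.0
    return min(p_kw, rated_kw)


def constrained_prediction(baseline_kw, residual_kw, rated_kw):
    """Add a learned residual to the physics baseline, then clip to [0, rated]."""
    return min(max(baseline_kw + residual_kw, 0.0), rated_kw)
```

The baseline is non-decreasing in wind speed between cut-in and cut-out by construction, and the final clip enforces saturation regardless of what the residual model outputs. Tree-based libraries that support monotonic constraints are an alternative route to the same property.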

Validation Strategy for Generalization

Cross-validation across sites/turbines and across wind-speed regimes (e.g., below cut-in, near rated, above rated) to ensure robustness and avoid leakage.
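
To make the leakage concern concrete, the following sketch splits by turbine (or site) identifier so that no group appears in both train and test, and labels each snapshot with a wind-speed regime for stratified error reporting. The round-robin fold assignment and the regime boundaries are simplifying assumptions. Library utilities such as scikit-learn's `GroupKFold` offer a more balanced version of the same idea.

```python
def group_k_fold(groups, n_splits):
    """Yield (train_idx, test_idx) pairs with every group kept entirely on one side."""
    unique = sorted(set(groups))
    if n_splits > len(unique):
        raise ValueError("n_splits cannot exceed the number of distinct groups")
    # Round-robin assignment of groups to folds (no size balancing).
    fold_of = {g: i % n_splits for i, g in enumerate(unique)}
    for k in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of[g] == k]
        train = [i for i, g in enumerate(groups) if fold_of[g] != k]
        yield train, test


def regime(ws, cut_in=3.0, rated_ws=12.0, cut_out=25.0):
    """Label a snapshot by wind-speed regime for per-regime error reporting."""
    if ws < cut_in:
        return "below_cut_in"
    if ws < rated_ws:
        return "ramp"
    if ws < cut_out:
        return "rated"
    return "above_cut_out"
```

Reporting errors separately per regime on each held-out group makes it harder for a strong result in the ramp region to hide failures near cut-in or at rated power.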

Evaluation Metrics and Error Structure

Metrics: RMSE, MAE, MAPE and their pitfalls; alternatives for low-power regimes.
Treatment of heteroscedastic errors and the cap at rated power.
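
The MAPE pitfall in low-power regimes is easy to see with a small worked example. The sketch below uses three made-up snapshots with the same 20 kW absolute error, for a hypothetical 2000 kW turbine, and compares MAPE with MAE normalized by rated power.

```python
import math


def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def nmae(y_true, y_pred, rated_kw):
    """MAE as a fraction of rated power; stable when actual power is near zero."""
    return mae(y_true, y_pred) / rated_kw


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction; undefined for zero actuals."""
    return sum(abs(a - b) / abs(a) for a, b in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only: every prediction is off by exactly 20 kW.
actual_kw = [5.0, 1000.0, 2000.0]
predicted_kw = [25.0, 1020.0, 1980.0]
print(mae(actual_kw, predicted_kw))           # 20.0
print(nmae(actual_kw, predicted_kw, 2000.0))  # 0.01
print(mape(actual_kw, predicted_kw) > 1.0)    # True: dominated by the 5 kW snapshot
```

The single near-zero snapshot contributes a 400% error and pushes MAPE above 100%, while the rated-power-normalized MAE stays at 1%.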

Uncertainty Estimation and Calibration

Methods to produce and calibrate predictive intervals/uncertainty.
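
One method that fits this requirement is split conformal prediction, shown here as a minimal sketch. It is one option among several (quantile regression is another) and is not presented as the intended solution. It assumes a held-out calibration set of absolute residuals that is exchangeable with the data seen at inference time. The function names are hypothetical.

```python
import math


def conformal_halfwidth(abs_residuals, coverage=0.9):
    """Half-width q such that [yhat - q, yhat + q] targets the requested coverage."""
    n = len(abs_residuals)
    k = math.ceil((n + 1) * coverage)  # finite-sample corrected rank
    if k > n:
        raise ValueError("too few calibration points for this coverage level")
    return sorted(abs_residuals)[k - 1]


def predictive_interval(y_hat_kw, halfwidth_kw, rated_kw):
    """Symmetric interval around the prediction, clipped to the physical range."""
    return (max(y_hat_kw - halfwidth_kw, 0.0), min(y_hat_kw + halfwidth_kw, rated_kw))
```

Because errors are heteroscedastic across the power curve, a single global half-width tends to be too wide near cut-in and too narrow in the steep ramp region. Computing a separate half-width per wind-speed regime is a common refinement.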

Safeguards and Edge Cases

Extrapolation detection and fallbacks.
Curtailment and availability scenarios: detect, model, or exclude.
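
As a rough illustration of extrapolation detection with a fallback, the sketch below records the per-feature range seen in training and routes any snapshot outside that envelope (plus a small margin) to a simpler fallback predictor such as a physics-based power curve. The 5% margin, the function names, and the min/max envelope are all assumptions. A per-feature box does not catch unusual combinations of individually in-range features.

```python
def fit_envelope(rows, features):
    """Record the (min, max) of each feature over the training snapshots."""
    return {f: (min(r[f] for r in rows), max(r[f] for r in rows)) for f in features}


def out_of_envelope(row, envelope, margin=0.05):
    """Return the names of features lying outside the padded training range."""
    flagged = []
    for name, (lo, hi) in envelope.items():
        pad = margin * (hi - lo)
        if not (lo - pad <= row[name] <= hi + pad):
            flagged.append(name)
    return flagged


def predict_with_fallback(row, envelope, model_predict, fallback_predict):
    """Use the learned model in-range, otherwise the fallback; report which was used."""
    if out_of_envelope(row, envelope):
        return fallback_predict(row), "fallback"
    return model_predict(row), "model"
```

Returning the source label alongside the prediction makes it possible to monitor how often the fallback fires, which is itself a useful drift signal.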

Provide a structured, engineering-ready plan with formulas when relevant, and note key pitfalls and validation guardrails.
