Senior Data Scientist Studio
Enterprise-grade skill for world-class data science. Includes structured workflows, validation checks, and reusable patterns for development.
A comprehensive skill for senior data scientists covering experiment design, statistical modeling, feature engineering, model evaluation, and research-to-production workflows using Python's scientific computing ecosystem.
When to Use This Skill
Choose this skill when:
- Designing A/B tests and statistical experiments with proper sample sizes
- Building feature engineering pipelines for machine learning models
- Evaluating model performance with appropriate metrics and cross-validation
- Creating reproducible research notebooks and experiment tracking
- Communicating model results to stakeholders with visualizations
Consider alternatives when:
- Deploying models to production → use an ML engineering skill
- Building deep learning models → use a deep learning skill
- Working on computer vision → use a CV engineering skill
- Need data pipeline infrastructure → use a data engineering skill
Quick Start
```python
# End-to-end experiment workflow
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
import mlflow

mlflow.set_experiment("churn_prediction_v2")

with mlflow.start_run(run_name="gbm_baseline"):
    # Feature engineering pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', GradientBoostingClassifier(
            n_estimators=200,
            max_depth=5,
            learning_rate=0.1,
        )),
    ])

    # Stratified cross-validation
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring='roc_auc')

    mlflow.log_params({'n_estimators': 200, 'max_depth': 5, 'lr': 0.1})
    mlflow.log_metric('cv_auc_mean', scores.mean())
    mlflow.log_metric('cv_auc_std', scores.std())
    print(f"CV AUC: {scores.mean():.4f} ± {scores.std():.4f}")
```
Core Concepts
Experiment Design Checklist
| Step | Question | Tool |
|---|---|---|
| Hypothesis | What specific effect are we measuring? | Domain knowledge |
| Metric | What's the primary success metric? | Business alignment |
| Sample Size | How many observations do we need? | Power analysis |
| Randomization | How do we assign to treatment/control? | Stratified random |
| Duration | How long do we run the experiment? | Variance estimation |
| Analysis | What statistical test do we use? | t-test, chi-squared, bootstrap |
Feature Engineering Patterns
```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class TemporalFeatureExtractor(BaseEstimator, TransformerMixin):
    """Extract temporal features from datetime columns."""

    def __init__(self, datetime_col: str):
        self.datetime_col = datetime_col

    def fit(self, X, y=None):
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        df = X.copy()
        dt = pd.to_datetime(df[self.datetime_col])
        df[f'{self.datetime_col}_hour'] = dt.dt.hour
        df[f'{self.datetime_col}_dayofweek'] = dt.dt.dayofweek
        df[f'{self.datetime_col}_month'] = dt.dt.month
        df[f'{self.datetime_col}_is_weekend'] = (dt.dt.dayofweek >= 5).astype(int)
        # Cyclical encoding for hour
        df[f'{self.datetime_col}_hour_sin'] = np.sin(2 * np.pi * dt.dt.hour / 24)
        df[f'{self.datetime_col}_hour_cos'] = np.cos(2 * np.pi * dt.dt.hour / 24)
        return df.drop(columns=[self.datetime_col])

class AggregateFeatures(BaseEstimator, TransformerMixin):
    """Create aggregate features grouped by a key."""

    def __init__(self, group_col: str, agg_col: str):
        self.group_col = group_col
        self.agg_col = agg_col

    def fit(self, X, y=None):
        self.agg_stats_ = X.groupby(self.group_col)[self.agg_col].agg(
            ['mean', 'std', 'min', 'max', 'count']
        )
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        return X.merge(self.agg_stats_, on=self.group_col, how='left',
                       suffixes=('', '_agg'))
```
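The cyclical encoding in `TemporalFeatureExtractor` exists because hour 23 and hour 0 are adjacent in time but maximally distant as raw integers. A quick standalone check of the two distance notions:

```python
import numpy as np

# Hours 23 and 0 are one hour apart in time, 23 apart numerically.
# Cyclical encoding maps each hour to a point on the unit circle,
# where adjacent hours are close again.
def encode_hour(h):
    return np.array([np.sin(2 * np.pi * h / 24), np.cos(2 * np.pi * h / 24)])

d_raw = abs(23 - 0)                              # naive distance: 23
d_cyc = np.linalg.norm(encode_hour(23) - encode_hour(0))  # chord length on the circle
print(f"raw gap: {d_raw}, cyclical gap: {d_cyc:.3f}")
```

Distance-sensitive models (k-NN, clustering, linear models) benefit most; tree ensembles can often split on the raw hour directly.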
Model Evaluation Framework
```python
import matplotlib.pyplot as plt
from sklearn.metrics import (
    classification_report, roc_auc_score, precision_recall_curve,
    average_precision_score, confusion_matrix,
)

def evaluate_classifier(model, X_test, y_test, threshold=0.5):
    y_prob = model.predict_proba(X_test)[:, 1]
    y_pred = (y_prob >= threshold).astype(int)

    print(classification_report(y_test, y_pred))
    print(f"ROC AUC: {roc_auc_score(y_test, y_prob):.4f}")
    print(f"Average Precision: {average_precision_score(y_test, y_prob):.4f}")

    # Precision-Recall curve for imbalanced datasets
    precision, recall, thresholds = precision_recall_curve(y_test, y_prob)
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    axes[0].plot(recall, precision)
    axes[0].set_xlabel('Recall')
    axes[0].set_ylabel('Precision')
    axes[0].set_title('Precision-Recall Curve')

    # Threshold analysis
    f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1] + 1e-8)
    optimal_idx = f1_scores.argmax()
    optimal_threshold = thresholds[optimal_idx]
    axes[1].plot(thresholds, f1_scores)
    axes[1].axvline(optimal_threshold, color='r', linestyle='--')
    axes[1].set_title(f'Optimal threshold: {optimal_threshold:.3f}')
    plt.tight_layout()
    return fig, optimal_threshold
```
Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| experimentTracker | string | 'mlflow' | Experiment tracking: mlflow, wandb, or neptune |
| crossValidationFolds | number | 5 | Number of CV folds |
| randomSeed | number | 42 | Global random seed for reproducibility |
| significanceLevel | number | 0.05 | Statistical significance threshold (alpha) |
| imbalanceStrategy | string | 'smote' | Class imbalance handling: smote, undersample, or weight |
| featureSelection | string | 'importance' | Feature selection: importance, correlation, or mutual_info |
Best Practices
- Always split data before any transformation or feature engineering — Data leakage is the most common cause of models that perform well in notebooks but fail in production. Hold out test data completely before fitting scalers, encoders, or computing aggregate features.
- Use stratified cross-validation for imbalanced datasets — Random splits can create folds where the minority class is absent. Stratified K-fold preserves class proportions in every fold. Use ROC AUC or Average Precision instead of accuracy for imbalanced problems.
- Track every experiment with hyperparameters and metrics — Log model type, hyperparameters, feature set, training data version, and all evaluation metrics. Six months from now, you need to reproduce or improve on your best model without guessing what you tried.
- Calculate sample size before running A/B tests — Under-powered experiments waste time and resources. Use power analysis to determine the minimum sample size for detecting your minimum detectable effect. Running tests too short produces false negatives.
- Communicate results with calibrated uncertainty — Report confidence intervals, not just point estimates. Say "we're 95% confident the conversion rate increase is between 1.2% and 3.8%" rather than "the conversion rate increased by 2.5%." Stakeholders need to understand the uncertainty.
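The last practice can be sketched with a percentile bootstrap on simulated A/B outcomes. The conversion rates and sample sizes below are hypothetical, chosen only to illustrate the interval:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical per-user conversion outcomes from an A/B test
control = rng.binomial(1, 0.10, size=5000)
treatment = rng.binomial(1, 0.125, size=5000)

# Percentile bootstrap of the difference in conversion rates
n_boot = 5000
diffs = np.empty(n_boot)
for i in range(n_boot):
    c = rng.choice(control, size=control.size, replace=True)
    t = rng.choice(treatment, size=treatment.size, replace=True)
    diffs[i] = t.mean() - c.mean()

lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"Observed lift: {treatment.mean() - control.mean():+.3f}")
print(f"95% CI: [{lo:+.3f}, {hi:+.3f}]")
```

Reporting the interval [lo, hi] rather than the point estimate is exactly the "1.2% to 3.8%" framing recommended above; the bootstrap needs no distributional assumptions beyond i.i.d. sampling.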
Common Issues
Model performs well on CV but poorly in production — This usually indicates data leakage or distribution shift. Verify that your CV mimics production conditions (temporal splits for time-series data). Check if features available in training are unavailable at prediction time.
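For the temporal-split point, scikit-learn's `TimeSeriesSplit` keeps every training fold strictly earlier than its test fold, so CV mimics the production situation of predicting the future from the past. A minimal illustration on twelve ordered observations:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 observations ordered oldest to newest (indices stand in for months)
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # Every training index precedes every test index
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Using shuffled K-fold here would let the model train on observations that come after its test points, which is precisely the leakage described above.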
Feature importance varies across different methods — Permutation importance, SHAP values, and built-in feature importance give different rankings. Permutation importance is the most reliable for production decisions. Use SHAP for individual prediction explanations.
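A minimal sketch of permutation importance via `sklearn.inspection.permutation_importance`, on synthetic data (the dataset here is generated for illustration, not part of the skill's workflow): shuffling one feature at a time on held-out data and measuring the score drop.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 5 features, only 2 carry signal
X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# Shuffle each column n_repeats times on the held-out set and record
# how much the score degrades; larger drop = more important feature
result = permutation_importance(model, X_te, y_te, n_repeats=10,
                                random_state=42)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"± {result.importances_std[i]:.3f}")
```

Because it is computed on held-out data against the actual model, it avoids the bias of impurity-based importances toward high-cardinality features, which is why it is preferred for production decisions.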
A/B test shows significant result but effect disappears — Novelty effects and selection bias can inflate early results. Run tests for at least two full business cycles. Check for Simpson's paradox by analyzing results across segments.
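The Simpson's paradox check can be illustrated with hypothetical segment data where the treatment wins in every segment yet loses in the pooled comparison because allocation is skewed toward a low-converting segment (all numbers below are invented for the illustration):

```python
import pandas as pd

# Hypothetical A/B results broken out by platform segment
df = pd.DataFrame({
    'segment':   ['mobile', 'mobile', 'desktop', 'desktop'],
    'variant':   ['control', 'treatment', 'control', 'treatment'],
    'users':     [2000, 8000, 8000, 2000],
    'converted': [200, 960, 3200, 840],
})
df['rate'] = df['converted'] / df['users']

# Treatment beats control within each segment (12% > 10%, 42% > 40%)...
print(df)

# ...but loses after pooling, because treatment traffic was concentrated
# in the low-converting mobile segment
pooled = df.groupby('variant')[['converted', 'users']].sum()
pooled['rate'] = pooled['converted'] / pooled['users']
print(pooled)
```

Always compare per-segment and pooled rates before declaring a winner; a reversal like this signals unbalanced assignment rather than a real effect.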