Senior Data Scientist Studio
Enterprise-grade skill for world-class data science. Includes structured workflows, validation checks, and reusable patterns for development.
A comprehensive skill for senior data scientists covering experiment design, statistical modeling, feature engineering, model evaluation, and research-to-production workflows using Python's scientific computing ecosystem.
When to Use This Skill
Choose this skill when:
- Designing A/B tests and statistical experiments with proper sample sizes
- Building feature engineering pipelines for machine learning models
- Evaluating model performance with appropriate metrics and cross-validation
- Creating reproducible research notebooks and experiment tracking
- Communicating model results to stakeholders with visualizations
Consider alternatives when:
- Deploying models to production → use an ML engineering skill
- Building deep learning models → use a deep learning skill
- Working on computer vision → use a CV engineering skill
- Need data pipeline infrastructure → use a data engineering skill
Quick Start
```python
# End-to-end experiment workflow
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
import mlflow

mlflow.set_experiment("churn_prediction_v2")

with mlflow.start_run(run_name="gbm_baseline"):
    # Feature engineering pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', GradientBoostingClassifier(
            n_estimators=200,
            max_depth=5,
            learning_rate=0.1,
        )),
    ])

    # Stratified cross-validation
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring='roc_auc')

    mlflow.log_params({'n_estimators': 200, 'max_depth': 5, 'lr': 0.1})
    mlflow.log_metric('cv_auc_mean', scores.mean())
    mlflow.log_metric('cv_auc_std', scores.std())
    print(f"CV AUC: {scores.mean():.4f} ± {scores.std():.4f}")
```
Core Concepts
Experiment Design Checklist
| Step | Question | Tool |
|---|---|---|
| Hypothesis | What specific effect are we measuring? | Domain knowledge |
| Metric | What's the primary success metric? | Business alignment |
| Sample Size | How many observations do we need? | Power analysis |
| Randomization | How do we assign to treatment/control? | Stratified random |
| Duration | How long do we run the experiment? | Variance estimation |
| Analysis | What statistical test do we use? | t-test, chi-squared, bootstrap |
Feature Engineering Patterns
```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class TemporalFeatureExtractor(BaseEstimator, TransformerMixin):
    """Extract temporal features from datetime columns."""

    def __init__(self, datetime_col: str):
        self.datetime_col = datetime_col

    def fit(self, X, y=None):
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        df = X.copy()
        dt = pd.to_datetime(df[self.datetime_col])
        df[f'{self.datetime_col}_hour'] = dt.dt.hour
        df[f'{self.datetime_col}_dayofweek'] = dt.dt.dayofweek
        df[f'{self.datetime_col}_month'] = dt.dt.month
        df[f'{self.datetime_col}_is_weekend'] = (dt.dt.dayofweek >= 5).astype(int)
        # Cyclical encoding for hour
        df[f'{self.datetime_col}_hour_sin'] = np.sin(2 * np.pi * dt.dt.hour / 24)
        df[f'{self.datetime_col}_hour_cos'] = np.cos(2 * np.pi * dt.dt.hour / 24)
        return df.drop(columns=[self.datetime_col])

class AggregateFeatures(BaseEstimator, TransformerMixin):
    """Create aggregate features grouped by a key."""

    def __init__(self, group_col: str, agg_col: str):
        self.group_col = group_col
        self.agg_col = agg_col

    def fit(self, X, y=None):
        self.agg_stats_ = X.groupby(self.group_col)[self.agg_col].agg(
            ['mean', 'std', 'min', 'max', 'count']
        )
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        return X.merge(self.agg_stats_, on=self.group_col, how='left',
                       suffixes=('', '_agg'))
```
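The cyclical encoding in `TemporalFeatureExtractor` exists because hour 23 and hour 0 are adjacent in time but maximally distant as raw integers. A quick standalone check of the two distance notions:

```python
import numpy as np

# Hours 23 and 0 are one hour apart in time, 23 apart numerically.
# Cyclical encoding maps each hour to a point on the unit circle,
# where adjacent hours are close again.
def encode_hour(h):
    return np.array([np.sin(2 * np.pi * h / 24), np.cos(2 * np.pi * h / 24)])

d_raw = abs(23 - 0)                              # naive distance: 23
d_cyc = np.linalg.norm(encode_hour(23) - encode_hour(0))  # chord length on the circle
print(f"raw gap: {d_raw}, cyclical gap: {d_cyc:.3f}")
```

Distance-sensitive models (k-NN, clustering, linear models) benefit most; tree ensembles can often split on the raw hour directly.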
Model Evaluation Framework
```python
import matplotlib.pyplot as plt
from sklearn.metrics import (
    classification_report, roc_auc_score, precision_recall_curve,
    average_precision_score, confusion_matrix,
)

def evaluate_classifier(model, X_test, y_test, threshold=0.5):
    y_prob = model.predict_proba(X_test)[:, 1]
    y_pred = (y_prob >= threshold).astype(int)

    print(classification_report(y_test, y_pred))
    print(f"ROC AUC: {roc_auc_score(y_test, y_prob):.4f}")
    print(f"Average Precision: {average_precision_score(y_test, y_prob):.4f}")

    # Precision-Recall curve for imbalanced datasets
    precision, recall, thresholds = precision_recall_curve(y_test, y_prob)
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    axes[0].plot(recall, precision)
    axes[0].set_xlabel('Recall')
    axes[0].set_ylabel('Precision')
    axes[0].set_title('Precision-Recall Curve')

    # Threshold analysis
    f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1] + 1e-8)
    optimal_idx = f1_scores.argmax()
    optimal_threshold = thresholds[optimal_idx]
    axes[1].plot(thresholds, f1_scores)
    axes[1].axvline(optimal_threshold, color='r', linestyle='--')
    axes[1].set_title(f'Optimal threshold: {optimal_threshold:.3f}')
    plt.tight_layout()
    return fig, optimal_threshold
```
Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| experimentTracker | string | 'mlflow' | Experiment tracking: mlflow, wandb, or neptune |
| crossValidationFolds | number | 5 | Number of CV folds |
| randomSeed | number | 42 | Global random seed for reproducibility |
| significanceLevel | number | 0.05 | Statistical significance threshold (alpha) |
| imbalanceStrategy | string | 'smote' | Class imbalance handling: smote, undersample, or weight |
| featureSelection | string | 'importance' | Feature selection: importance, correlation, or mutual_info |
Best Practices
- Always split data before any transformation or feature engineering — Data leakage is the most common cause of models that perform well in notebooks but fail in production. Hold out test data completely before fitting scalers, encoders, or computing aggregate features.
- Use stratified cross-validation for imbalanced datasets — Random splits can create folds where the minority class is absent. Stratified K-fold preserves class proportions in every fold. Use ROC AUC or Average Precision instead of accuracy for imbalanced problems.
- Track every experiment with hyperparameters and metrics — Log model type, hyperparameters, feature set, training data version, and all evaluation metrics. Six months from now, you need to reproduce or improve on your best model without guessing what you tried.
- Calculate sample size before running A/B tests — Under-powered experiments waste time and resources. Use power analysis to determine the minimum sample size for detecting your minimum detectable effect. Running tests too short produces false negatives.
- Communicate results with calibrated uncertainty — Report confidence intervals, not just point estimates. Say "we're 95% confident the conversion rate increase is between 1.2% and 3.8%" rather than "the conversion rate increased by 2.5%." Stakeholders need to understand the uncertainty.
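The last practice can be sketched with a percentile bootstrap on simulated A/B outcomes. The conversion rates and sample sizes below are hypothetical, chosen only to illustrate the interval:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical per-user conversion outcomes from an A/B test
control = rng.binomial(1, 0.10, size=5000)
treatment = rng.binomial(1, 0.125, size=5000)

# Percentile bootstrap of the difference in conversion rates
n_boot = 5000
diffs = np.empty(n_boot)
for i in range(n_boot):
    c = rng.choice(control, size=control.size, replace=True)
    t = rng.choice(treatment, size=treatment.size, replace=True)
    diffs[i] = t.mean() - c.mean()

lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"Observed lift: {treatment.mean() - control.mean():+.3f}")
print(f"95% CI: [{lo:+.3f}, {hi:+.3f}]")
```

Reporting the interval [lo, hi] rather than the point estimate is exactly the "1.2% to 3.8%" framing recommended above; the bootstrap needs no distributional assumptions beyond i.i.d. sampling.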
Common Issues
Model performs well on CV but poorly in production — This usually indicates data leakage or distribution shift. Verify that your CV mimics production conditions (temporal splits for time-series data). Check if features available in training are unavailable at prediction time.
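For the temporal-split point, scikit-learn's `TimeSeriesSplit` keeps every training fold strictly earlier than its test fold, so CV mimics the production situation of predicting the future from the past. A minimal illustration on twelve ordered observations:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 observations ordered oldest to newest (indices stand in for months)
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # Every training index precedes every test index
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Using shuffled K-fold here would let the model train on observations that come after its test points, which is precisely the leakage described above.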
Feature importance varies across different methods — Permutation importance, SHAP values, and built-in feature importance give different rankings. Permutation importance is the most reliable for production decisions. Use SHAP for individual prediction explanations.
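A minimal sketch of permutation importance via `sklearn.inspection.permutation_importance`, on synthetic data (the dataset here is generated for illustration, not part of the skill's workflow): shuffling one feature at a time on held-out data and measuring the score drop.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 5 features, only 2 carry signal
X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# Shuffle each column n_repeats times on the held-out set and record
# how much the score degrades; larger drop = more important feature
result = permutation_importance(model, X_te, y_te, n_repeats=10,
                                random_state=42)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"± {result.importances_std[i]:.3f}")
```

Because it is computed on held-out data against the actual model, it avoids the bias of impurity-based importances toward high-cardinality features, which is why it is preferred for production decisions.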
A/B test shows significant result but effect disappears — Novelty effects and selection bias can inflate early results. Run tests for at least two full business cycles. Check for Simpson's paradox by analyzing results across segments.
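The Simpson's paradox check can be illustrated with hypothetical segment data where the treatment wins in every segment yet loses in the pooled comparison because allocation is skewed toward a low-converting segment (all numbers below are invented for the illustration):

```python
import pandas as pd

# Hypothetical A/B results broken out by platform segment
df = pd.DataFrame({
    'segment':   ['mobile', 'mobile', 'desktop', 'desktop'],
    'variant':   ['control', 'treatment', 'control', 'treatment'],
    'users':     [2000, 8000, 8000, 2000],
    'converted': [200, 960, 3200, 840],
})
df['rate'] = df['converted'] / df['users']

# Treatment beats control within each segment (12% > 10%, 42% > 40%)...
print(df)

# ...but loses after pooling, because treatment traffic was concentrated
# in the low-converting mobile segment
pooled = df.groupby('variant')[['converted', 'users']].sum()
pooled['rate'] = pooled['converted'] / pooled['users']
print(pooled)
```

Always compare per-segment and pooled rates before declaring a winner; a reversal like this signals unbalanced assignment rather than a real effect.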