Senior Data Scientist Studio

A comprehensive skill for senior data scientists covering experiment design, statistical modeling, feature engineering, model evaluation, and research-to-production workflows using Python's scientific computing ecosystem.

When to Use This Skill

Choose this skill when:

  • Designing A/B tests and statistical experiments with proper sample sizes
  • Building feature engineering pipelines for machine learning models
  • Evaluating model performance with appropriate metrics and cross-validation
  • Creating reproducible research notebooks and experiment tracking
  • Communicating model results to stakeholders with visualizations

Consider alternatives when:

  • Deploying models to production → use an ML engineering skill
  • Building deep learning models → use a deep learning skill
  • Working on computer vision → use a CV engineering skill
  • Need data pipeline infrastructure → use a data engineering skill

Quick Start

```python
# End-to-end experiment workflow
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
import mlflow

# X_train, y_train are assumed to be prepared upstream (feature matrix / labels)
mlflow.set_experiment("churn_prediction_v2")

with mlflow.start_run(run_name="gbm_baseline"):
    # Feature engineering pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', GradientBoostingClassifier(
            n_estimators=200,
            max_depth=5,
            learning_rate=0.1,
        )),
    ])

    # Stratified cross-validation
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring='roc_auc')

    mlflow.log_params({'n_estimators': 200, 'max_depth': 5, 'lr': 0.1})
    mlflow.log_metric('cv_auc_mean', scores.mean())
    mlflow.log_metric('cv_auc_std', scores.std())
    print(f"CV AUC: {scores.mean():.4f} ± {scores.std():.4f}")
```

Core Concepts

Experiment Design Checklist

| Step | Question | Tool |
| --- | --- | --- |
| Hypothesis | What specific effect are we measuring? | Domain knowledge |
| Metric | What's the primary success metric? | Business alignment |
| Sample Size | How many observations do we need? | Power analysis |
| Randomization | How do we assign to treatment/control? | Stratified randomization |
| Duration | How long do we run the experiment? | Variance estimation |
| Analysis | What statistical test do we use? | t-test, chi-squared, bootstrap |
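
The Sample Size row is the step most often skipped in practice. As a minimal sketch, statsmodels' power module can solve for the per-group sample size of a two-proportion A/B test; the baseline rate and minimum detectable effect below are illustrative assumptions, not recommendations.

```python
# Minimum sample size per group for a two-proportion A/B test.
# baseline_rate and mde are illustrative assumptions -- substitute your own.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10   # assumed current conversion rate
mde = 0.02             # smallest absolute lift worth detecting

# Cohen's h effect size for the two proportions
effect_size = proportion_effectsize(baseline_rate + mde, baseline_rate)

n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,            # significance level
    power=0.8,             # 1 - beta
    alternative='two-sided',
)
print(f"Required sample size per group: {n_per_group:.0f}")
```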

Feature Engineering Patterns

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class TemporalFeatureExtractor(BaseEstimator, TransformerMixin):
    """Extract temporal features from datetime columns."""

    def __init__(self, datetime_col: str):
        self.datetime_col = datetime_col

    def fit(self, X, y=None):
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        df = X.copy()
        dt = pd.to_datetime(df[self.datetime_col])
        df[f'{self.datetime_col}_hour'] = dt.dt.hour
        df[f'{self.datetime_col}_dayofweek'] = dt.dt.dayofweek
        df[f'{self.datetime_col}_month'] = dt.dt.month
        df[f'{self.datetime_col}_is_weekend'] = (dt.dt.dayofweek >= 5).astype(int)
        # Cyclical encoding for hour: keeps hour 23 adjacent to hour 0
        df[f'{self.datetime_col}_hour_sin'] = np.sin(2 * np.pi * dt.dt.hour / 24)
        df[f'{self.datetime_col}_hour_cos'] = np.cos(2 * np.pi * dt.dt.hour / 24)
        return df.drop(columns=[self.datetime_col])


class AggregateFeatures(BaseEstimator, TransformerMixin):
    """Create aggregate features grouped by a key."""

    def __init__(self, group_col: str, agg_col: str):
        self.group_col = group_col
        self.agg_col = agg_col

    def fit(self, X, y=None):
        # Learn aggregates from training data only to avoid leakage;
        # reset_index makes group_col a column so transform can merge on it
        self.agg_stats_ = (
            X.groupby(self.group_col)[self.agg_col]
            .agg(['mean', 'std', 'min', 'max', 'count'])
            .add_prefix(f'{self.agg_col}_')
            .reset_index()
        )
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        return X.merge(self.agg_stats_, on=self.group_col, how='left')
```
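
A minimal usage sketch for the transformers above, composed with sklearn's Pipeline. The column names (`signup_ts`, `customer_id`, `amount`) are hypothetical placeholders for your own schema.

```python
# Hypothetical usage: column names are placeholders, X_train/X_test assumed prepared
from sklearn.pipeline import Pipeline

features = Pipeline([
    ('temporal', TemporalFeatureExtractor(datetime_col='signup_ts')),
    ('aggregates', AggregateFeatures(group_col='customer_id', agg_col='amount')),
])

X_train_feats = features.fit_transform(X_train)  # fit aggregate stats on train only
X_test_feats = features.transform(X_test)        # reuse train-time stats at test time
```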

Model Evaluation Framework

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (
    classification_report, roc_auc_score,
    precision_recall_curve, average_precision_score,
)


def evaluate_classifier(model, X_test, y_test, threshold=0.5):
    y_prob = model.predict_proba(X_test)[:, 1]
    y_pred = (y_prob >= threshold).astype(int)

    print(classification_report(y_test, y_pred))
    print(f"ROC AUC: {roc_auc_score(y_test, y_prob):.4f}")
    print(f"Average Precision: {average_precision_score(y_test, y_prob):.4f}")

    # Precision-Recall curve for imbalanced datasets
    precision, recall, thresholds = precision_recall_curve(y_test, y_prob)

    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    axes[0].plot(recall, precision)
    axes[0].set_xlabel('Recall')
    axes[0].set_ylabel('Precision')
    axes[0].set_title('Precision-Recall Curve')

    # Threshold analysis (the last precision/recall pair has no threshold)
    f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1] + 1e-8)
    optimal_idx = f1_scores.argmax()
    optimal_threshold = thresholds[optimal_idx]
    axes[1].plot(thresholds, f1_scores)
    axes[1].axvline(optimal_threshold, color='r', linestyle='--')
    axes[1].set_title(f'Optimal threshold: {optimal_threshold:.3f}')
    plt.tight_layout()

    return fig, optimal_threshold
```

Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| experimentTracker | string | 'mlflow' | Experiment tracking backend: mlflow, wandb, or neptune |
| crossValidationFolds | number | 5 | Number of cross-validation folds |
| randomSeed | number | 42 | Global random seed for reproducibility |
| significanceLevel | number | 0.05 | Statistical significance threshold (alpha) |
| imbalanceStrategy | string | 'smote' | Class imbalance handling: smote, undersample, or weight |
| featureSelection | string | 'importance' | Feature selection method: importance, correlation, or mutual_info |
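
As one hedged illustration of how these options could map to code (the table doesn't prescribe an implementation), the `imbalanceStrategy`, `crossValidationFolds`, and `randomSeed` settings might translate into an imbalanced-learn pipeline like this; the `config` dict is hypothetical:

```python
# Hypothetical mapping from the configuration table to concrete components
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold

config = {  # mirrors the table defaults
    'crossValidationFolds': 5,
    'randomSeed': 42,
    'imbalanceStrategy': 'smote',
}

pipeline = ImbPipeline([
    # imblearn applies SMOTE only when fitting, so validation folds stay untouched
    ('resample', SMOTE(random_state=config['randomSeed'])),
    ('model', GradientBoostingClassifier(random_state=config['randomSeed'])),
])
cv = StratifiedKFold(n_splits=config['crossValidationFolds'],
                     shuffle=True, random_state=config['randomSeed'])
```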

Best Practices

  1. Always split data before any transformation or feature engineering — Data leakage is the most common cause of models that perform well in notebooks but fail in production. Hold out test data completely before fitting scalers, encoders, or computing aggregate features.

  2. Use stratified cross-validation for imbalanced datasets — Random splits can create folds where the minority class is absent. Stratified K-fold preserves class proportions in every fold. Use ROC AUC or Average Precision instead of accuracy for imbalanced problems.

  3. Track every experiment with hyperparameters and metrics — Log model type, hyperparameters, feature set, training data version, and all evaluation metrics. Six months from now, you need to reproduce or improve on your best model without guessing what you tried.

  4. Calculate sample size before running A/B tests — Under-powered experiments waste time and resources. Use power analysis to determine the minimum sample size needed to detect your minimum detectable effect (a worked sketch appears under the Experiment Design Checklist above). Running tests for too short a period produces false negatives.

  5. Communicate results with calibrated uncertainty — Report confidence intervals, not just point estimates. Say "we're 95% confident the conversion rate increase is between 1.2% and 3.8%" rather than "the conversion rate increased by 2.5%." Stakeholders need to understand the uncertainty.
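
A minimal sketch of practice 5: a percentile-bootstrap confidence interval for treatment-vs-control lift. The outcome arrays here are simulated purely for illustration.

```python
# Bootstrap CI for a difference in conversion rates (simulated data)
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_diff_ci(control, treatment, n_boot=10_000, alpha=0.05):
    """Percentile CI for mean(treatment) - mean(control)."""
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        c = rng.choice(control, size=len(control), replace=True)
        t = rng.choice(treatment, size=len(treatment), replace=True)
        diffs[i] = t.mean() - c.mean()
    lo, hi = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return diffs.mean(), (lo, hi)

# 0/1 conversion outcomes -- hypothetical, for illustration only
control = rng.binomial(1, 0.10, size=5_000)
treatment = rng.binomial(1, 0.125, size=5_000)
point, (lo, hi) = bootstrap_diff_ci(control, treatment)
print(f"Lift: {point:.3%} (95% CI: {lo:.3%} to {hi:.3%})")
```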

Common Issues

Model performs well on CV but poorly in production — This usually indicates data leakage or distribution shift. Verify that your CV scheme mimics production conditions (temporal splits for time-series data), and check for features that are available at training time but missing or stale at prediction time.
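
For the time-series case, sklearn's TimeSeriesSplit keeps every validation fold strictly after its training fold. A minimal sketch, assuming the data is already sorted chronologically and reusing `pipeline`, `X_train`, `y_train` from the Quick Start:

```python
# Temporal CV: each validation fold comes strictly after its training data
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

tscv = TimeSeriesSplit(n_splits=5)  # rows must be in chronological order
scores = cross_val_score(pipeline, X_train, y_train, cv=tscv, scoring='roc_auc')
```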

Feature importance varies across different methods — Permutation importance, SHAP values, and built-in feature importance often give different rankings. Permutation importance, computed on held-out data, is generally the most reliable basis for production decisions; use SHAP for individual prediction explanations.
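
sklearn exposes permutation importance directly; a short sketch reusing the fitted `pipeline` and held-out `X_test`, `y_test` assumed from the earlier examples:

```python
# Permutation importance: score drop when one feature's values are shuffled
from sklearn.inspection import permutation_importance

result = permutation_importance(
    pipeline, X_test, y_test,          # pipeline must already be fitted
    scoring='roc_auc', n_repeats=10, random_state=42,
)
for idx in result.importances_mean.argsort()[::-1][:10]:
    print(f"{X_test.columns[idx]}: "
          f"{result.importances_mean[idx]:.4f} ± {result.importances_std[idx]:.4f}")
```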

A/B test shows significant result but effect disappears — Novelty effects and selection bias can inflate early results. Run tests for at least two full business cycles. Check for Simpson's paradox by analyzing results across segments.
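
A hedged sketch of the segment check described above: compare per-segment lift against the pooled lift to surface Simpson's paradox. The dataframe and column names (`segment`, `group`, `converted`) are hypothetical.

```python
# Segment-level check: per-segment lift vs pooled lift (hypothetical schema)
import pandas as pd

# df assumed to have columns: 'segment', 'group' ('control'/'treatment'), 'converted'
by_segment = (
    df.groupby(['segment', 'group'])['converted'].mean()
      .unstack('group')
)
by_segment['lift'] = by_segment['treatment'] - by_segment['control']
pooled = df.groupby('group')['converted'].mean()

print(by_segment)  # a pooled lift that reverses per segment signals Simpson's paradox
print(f"Pooled lift: {pooled['treatment'] - pooled['control']:.3%}")
```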
