Architect Data Scientist
An enterprise-grade agent for statistical analysis, machine learning, and experiment design. Includes structured workflows, validation checks, and reusable patterns for data and AI work.
An agent for rigorous data science work covering statistical analysis, machine learning model development, experimentation design, and translating complex data patterns into actionable business recommendations.
When to Use This Agent
Choose Data Scientist when:
- Designing and analyzing A/B tests or experiments with statistical rigor
- Building predictive or classification models from structured data
- Performing exploratory data analysis to uncover business insights
- Evaluating model performance and selecting optimal approaches
- Communicating technical findings to non-technical stakeholders
Consider alternatives when:
- Building production ML infrastructure (use an ML engineer agent)
- Doing basic data cleaning without statistical analysis (use a data analyst agent)
- Deploying models to production servers (use an AI engineer agent)
Quick Start
```yaml
# .claude/agents/architect-data-scientist.yml
name: Data Scientist
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a senior data scientist. Apply rigorous statistical methodology
  to analyze data, build models, and design experiments. Always validate
  assumptions, quantify uncertainty, and translate findings into
  actionable business recommendations.
```
Example invocation:
```bash
claude --agent architect-data-scientist "Analyze our user churn data, identify the top predictive features, build a classification model, and recommend interventions for the highest-risk segments"
```
Core Concepts
Data Science Workflow
| Phase | Activities | Outputs |
|---|---|---|
| Problem Framing | Define business question, success metric | Problem statement, KPIs |
| Data Collection | Gather, join, validate data sources | Clean dataset |
| EDA | Distributions, correlations, outliers | Insight report, hypotheses |
| Feature Engineering | Create, select, transform features | Feature matrix |
| Modeling | Train, tune, validate models | Trained model, metrics |
| Evaluation | Test set performance, business impact | Performance report |
| Communication | Visualize, explain, recommend | Stakeholder presentation |
Model Selection Guide
```python
# Decision framework for model selection
task_type = identify_task(target_variable)

if task_type == "classification":
    if interpretability_required:
        models = [LogisticRegression, DecisionTree, RuleFit]
    elif dataset_size < 10_000:
        models = [RandomForest, GradientBoosting, SVM]
    else:
        models = [XGBoost, LightGBM, CatBoost]
elif task_type == "regression":
    if linear_relationship:
        models = [LinearRegression, ElasticNet, Ridge]
    else:
        models = [XGBoost, LightGBM, RandomForest]

# Always compare against a simple baseline
baseline = MostFrequentClass() if task_type == "classification" else MeanPredictor()
```
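The framework above insists on a simple baseline before anything complex. A minimal sketch of a majority-class baseline in pure Python (the helper name and toy labels are illustrative, not part of any library):

```python
from collections import Counter

def majority_class_baseline(y_train, y_test):
    """Predict the most frequent training label for every test row.

    Any model worth deploying must beat this accuracy floor.
    """
    majority = Counter(y_train).most_common(1)[0][0]
    correct = sum(1 for y in y_test if y == majority)
    return majority, correct / len(y_test)

# Toy churn labels: 0 = retained, 1 = churned
y_train = [0, 0, 0, 1, 0, 1, 0, 0]
y_test = [0, 1, 0, 0]

label, acc = majority_class_baseline(y_train, y_test)
print(label, acc)  # majority class 0, baseline accuracy 0.75
```

If a gradient-boosted model only edges past this number, the simpler model (or no model) may be the better product decision.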
Experiment Design Framework
1. State the hypothesis clearly
H₀: No difference between control and treatment
H₁: Treatment increases conversion by ≥ 2%
2. Calculate required sample size
- Effect size, significance level (α = 0.05), power (1 − β = 0.8)
3. Define guardrail metrics
- Primary: conversion rate
- Guardrails: revenue per user, support tickets, latency
4. Run for the predetermined duration
- No peeking, no early stopping without sequential methods
5. Analyze with appropriate tests
- Proportions: chi-squared or z-test
- Means: t-test or Mann-Whitney U
- Always report confidence intervals, not just p-values
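Steps 2 and 5 can be sketched with nothing but the standard library's `statistics.NormalDist`. This is a rough implementation of the standard two-proportion formulas, not a substitute for a proper stats package; the function names and the example counts are made up:

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal

def sample_size_per_arm(p1, p2, alpha=0.05, power=0.8):
    """Required n per group to detect a shift from p1 to p2."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_beta = Z.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(((z_alpha + z_beta) ** 2 * variance) / (p2 - p1) ** 2)

def two_proportion_ztest(x1, n1, x2, n2):
    """Pooled two-sided z-test; returns (z statistic, p-value)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x2 / n2 - x1 / n1) / se
    return z, 2 * (1 - Z.cdf(abs(z)))

def diff_confidence_interval(x1, n1, x2, n2, alpha=0.05):
    """Unpooled CI for the difference in proportions."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    margin = Z.inv_cdf(1 - alpha / 2) * se
    return (p2 - p1) - margin, (p2 - p1) + margin

n = sample_size_per_arm(0.10, 0.12)  # detect a 2pp lift from a 10% base rate
z, p = two_proportion_ztest(480, 4000, 540, 4000)
lo, hi = diff_confidence_interval(480, 4000, 540, 4000)
```

Reporting `lo`..`hi` alongside `p` follows the guideline above: the interval tells stakeholders how large the effect plausibly is, not just whether it exists.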
Configuration
| Parameter | Description | Default |
|---|---|---|
| significance_level | Alpha for statistical tests | 0.05 |
| cv_folds | Cross-validation fold count | 5 |
| test_size | Hold-out test set proportion | 0.2 |
| random_state | Seed for reproducibility | 42 |
| feature_selection | Feature selection method | Mutual information |
| hyperparameter_search | Tuning strategy | Optuna/Bayesian |
| reporting_format | Output format for findings | Markdown + plots |
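One way to carry these parameters through an analysis is a small config object. The class and field names here simply mirror the table above; they are not an API the agent defines:

```python
from dataclasses import dataclass

@dataclass
class AnalysisConfig:
    """Defaults taken from the configuration table above."""
    significance_level: float = 0.05
    cv_folds: int = 5
    test_size: float = 0.2
    random_state: int = 42
    feature_selection: str = "mutual_information"
    hyperparameter_search: str = "bayesian"
    reporting_format: str = "markdown+plots"

cfg = AnalysisConfig(cv_folds=10)  # override only what you need
```

Pinning `random_state` in one place keeps splits, shuffles, and tuning runs reproducible across the whole workflow.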
Best Practices
- Start with a simple baseline before building complex models. A mean predictor for regression or majority-class predictor for classification establishes the floor. If XGBoost only beats the baseline by 2%, the added complexity may not justify the operational cost. Baselines also catch data leakage: if your model performs suspiciously well, compare it against the baseline to sanity-check.
- Validate on time-based splits for temporal data. Random train/test splits on time-series data leak future information into training. Always split chronologically: train on the past, validate on the future. This matches production conditions where your model only sees historical data and prevents overly optimistic performance estimates.
- Report confidence intervals, not just point estimates. Saying "model accuracy is 87%" is less useful than "model accuracy is 87% ± 3% (95% CI)." Confidence intervals communicate uncertainty and help stakeholders make calibrated decisions. A model with 85% ± 1% accuracy may be preferable to one with 87% ± 8% for high-stakes applications.
- Engineer features from domain knowledge, not just statistical methods. Automated feature selection finds statistically significant patterns but misses domain-meaningful combinations. A "days since last purchase" feature built from domain understanding often outperforms dozens of auto-generated interaction terms. Talk to domain experts and encode their mental models as features.
- Document assumptions and limitations alongside results. Every model makes assumptions: stationarity, independence, feature distributions. State them explicitly so stakeholders understand when the model's predictions should be trusted and when they should not. A model trained on summer data may not generalize to winter; say so rather than letting stakeholders discover it through bad predictions.
Common Issues
Model performs well in development but poorly in production. Check for data leakage first: features computed from the target variable, future data leaking into training rows, or different feature computation between training and serving. Next, check for distribution shift: production data may have different patterns than training data. Monitor feature distributions in production and retrain when significant drift is detected.
Stakeholders dismiss statistically significant results as not meaningful. Statistical significance and practical significance are different. A p-value of 0.001 on a 0.1% conversion rate improvement may be statistically significant but not worth acting on. Always pair statistical results with business impact estimates: "This 2% conversion improvement translates to approximately $150K in annual revenue."
Feature importance rankings change between model runs. This instability usually indicates correlated features. When two features carry similar information, the model randomly assigns importance between them. Address this by grouping correlated features (using clustering or domain knowledge), using permutation importance instead of built-in feature importance, and reporting importance at the feature group level.
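The permutation-importance remedy mentioned above is easy to implement from scratch: shuffle one feature column and measure how much the metric drops. This is a simplified sketch with a toy model; the function names and data are hypothetical:

```python
import random

def permutation_importance(predict, X, y, metric, n_repeats=10, seed=42):
    """Importance of column j = average metric drop when column j is shuffled.

    Works with any predict(row) callable and any metric(y_true, y_pred).
    """
    rng = random.Random(seed)
    baseline = metric(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            score = metric(y, [predict(row) for row in X_perm])
            drops.append(baseline - score)
        importances.append(sum(drops) / n_repeats)
    return importances

def accuracy(y_true, y_pred):
    return sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)

# Toy setup: the label depends only on feature 0; feature 1 is noise
X = [[i % 2, i % 3] for i in range(60)]
y = [row[0] for row in X]
predict = lambda row: row[0]

imp = permutation_importance(predict, X, y, accuracy)
# imp[0] is large, imp[1] is zero: shuffling the noise column changes nothing
```

Because the score is measured through the model's predictions rather than its internals, correlated features still split credit, but the estimates are averaged over repeats and comparable across model families.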