Architect Data Scientist
An enterprise-grade agent for statistical analysis, machine learning, and experiment design. Includes structured workflows, validation checks, and reusable patterns for data and AI work.
An agent for rigorous data science work covering statistical analysis, machine learning model development, experimentation design, and translating complex data patterns into actionable business recommendations.
When to Use This Agent
Choose Data Scientist when:
- Designing and analyzing A/B tests or experiments with statistical rigor
- Building predictive or classification models from structured data
- Performing exploratory data analysis to uncover business insights
- Evaluating model performance and selecting optimal approaches
- Communicating technical findings to non-technical stakeholders
Consider alternatives when:
- Building production ML infrastructure (use an ML engineer agent)
- Doing basic data cleaning without statistical analysis (use a data analyst agent)
- Deploying models to production servers (use an AI engineer agent)
Quick Start
```yaml
# .claude/agents/architect-data-scientist.yml
name: Data Scientist
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a senior data scientist. Apply rigorous statistical methodology
  to analyze data, build models, and design experiments. Always validate
  assumptions, quantify uncertainty, and translate findings into
  actionable business recommendations.
```
Example invocation:
```bash
claude --agent architect-data-scientist "Analyze our user churn data, identify the top predictive features, build a classification model, and recommend interventions for the highest-risk segments"
```
Core Concepts
Data Science Workflow
| Phase | Activities | Outputs |
|---|---|---|
| Problem Framing | Define business question, success metric | Problem statement, KPIs |
| Data Collection | Gather, join, validate data sources | Clean dataset |
| EDA | Distributions, correlations, outliers | Insight report, hypotheses |
| Feature Engineering | Create, select, transform features | Feature matrix |
| Modeling | Train, tune, validate models | Trained model, metrics |
| Evaluation | Test set performance, business impact | Performance report |
| Communication | Visualize, explain, recommend | Stakeholder presentation |
Model Selection Guide
```python
# Decision framework for model selection
task_type = identify_task(target_variable)

if task_type == "classification":
    if interpretability_required:
        models = [LogisticRegression, DecisionTree, RuleFit]
    elif dataset_size < 10_000:
        models = [RandomForest, GradientBoosting, SVM]
    else:
        models = [XGBoost, LightGBM, CatBoost]
elif task_type == "regression":
    if linear_relationship:
        models = [LinearRegression, ElasticNet, Ridge]
    else:
        models = [XGBoost, LightGBM, RandomForest]

# Always compare against a simple baseline
baseline = MostFrequentClass() if task_type == "classification" else MeanPredictor()
```
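The framework above insists on a simple baseline before anything complex. A minimal sketch of a majority-class baseline in pure Python (the helper name and toy labels are illustrative, not part of any library):

```python
from collections import Counter

def majority_class_baseline(y_train, y_test):
    """Predict the most frequent training label for every test row.

    Any model worth deploying must beat this accuracy floor.
    """
    majority = Counter(y_train).most_common(1)[0][0]
    correct = sum(1 for y in y_test if y == majority)
    return majority, correct / len(y_test)

# Toy churn labels: 0 = retained, 1 = churned
y_train = [0, 0, 0, 1, 0, 1, 0, 0]
y_test = [0, 1, 0, 0]

label, acc = majority_class_baseline(y_train, y_test)
print(label, acc)  # majority class 0, baseline accuracy 0.75
```

If a gradient-boosted model only edges past this number, the simpler model (or no model) may be the better product decision.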
Experiment Design Framework
1. State the hypothesis clearly
H₀: No difference between control and treatment
H₁: Treatment increases conversion by ≥ 2%
2. Calculate required sample size
- Effect size, significance level (α = 0.05), power (1 − β = 0.8)
3. Define guardrail metrics
- Primary: conversion rate
- Guardrails: revenue per user, support tickets, latency
4. Run for the predetermined duration
- No peeking, no early stopping without sequential methods
5. Analyze with appropriate tests
- Proportions: chi-squared or z-test
- Means: t-test or Mann-Whitney U
- Always report confidence intervals, not just p-values
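Steps 2 and 5 can be sketched with nothing but the standard library's `statistics.NormalDist`. This is a rough implementation of the standard two-proportion formulas, not a substitute for a proper stats package; the function names and the example counts are made up:

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal

def sample_size_per_arm(p1, p2, alpha=0.05, power=0.8):
    """Required n per group to detect a shift from p1 to p2."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_beta = Z.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(((z_alpha + z_beta) ** 2 * variance) / (p2 - p1) ** 2)

def two_proportion_ztest(x1, n1, x2, n2):
    """Pooled two-sided z-test; returns (z statistic, p-value)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x2 / n2 - x1 / n1) / se
    return z, 2 * (1 - Z.cdf(abs(z)))

def diff_confidence_interval(x1, n1, x2, n2, alpha=0.05):
    """Unpooled CI for the difference in proportions."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    margin = Z.inv_cdf(1 - alpha / 2) * se
    return (p2 - p1) - margin, (p2 - p1) + margin

n = sample_size_per_arm(0.10, 0.12)  # detect a 2pp lift from a 10% base rate
z, p = two_proportion_ztest(480, 4000, 540, 4000)
lo, hi = diff_confidence_interval(480, 4000, 540, 4000)
```

Reporting `lo`..`hi` alongside `p` follows the guideline above: the interval tells stakeholders how large the effect plausibly is, not just whether it exists.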
Configuration
| Parameter | Description | Default |
|---|---|---|
| significance_level | Alpha for statistical tests | 0.05 |
| cv_folds | Cross-validation fold count | 5 |
| test_size | Hold-out test set proportion | 0.2 |
| random_state | Seed for reproducibility | 42 |
| feature_selection | Feature selection method | Mutual information |
| hyperparameter_search | Tuning strategy | Optuna/Bayesian |
| reporting_format | Output format for findings | Markdown + plots |
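One way to carry these parameters through an analysis is a small config object. The class and field names here simply mirror the table above; they are not an API the agent defines:

```python
from dataclasses import dataclass

@dataclass
class AnalysisConfig:
    """Defaults taken from the configuration table above."""
    significance_level: float = 0.05
    cv_folds: int = 5
    test_size: float = 0.2
    random_state: int = 42
    feature_selection: str = "mutual_information"
    hyperparameter_search: str = "bayesian"
    reporting_format: str = "markdown+plots"

cfg = AnalysisConfig(cv_folds=10)  # override only what you need
```

Pinning `random_state` in one place keeps splits, shuffles, and tuning runs reproducible across the whole workflow.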
Best Practices
- Start with a simple baseline before building complex models. A mean predictor for regression or majority-class predictor for classification establishes the floor. If XGBoost only beats the baseline by 2%, the added complexity may not justify the operational cost. Baselines also catch data leakage: if your model performs suspiciously well, compare it against the baseline to sanity-check.
- Validate on time-based splits for temporal data. Random train/test splits on time-series data leak future information into training. Always split chronologically: train on the past, validate on the future. This matches production conditions where your model only sees historical data and prevents overly optimistic performance estimates.
- Report confidence intervals, not just point estimates. Saying "model accuracy is 87%" is less useful than "model accuracy is 87% ± 3% (95% CI)." Confidence intervals communicate uncertainty and help stakeholders make calibrated decisions. A model with 85% ± 1% accuracy may be preferable to one with 87% ± 8% for high-stakes applications.
- Engineer features from domain knowledge, not just statistical methods. Automated feature selection finds statistically significant patterns but misses domain-meaningful combinations. A "days since last purchase" feature built from domain understanding often outperforms dozens of auto-generated interaction terms. Talk to domain experts and encode their mental models as features.
- Document assumptions and limitations alongside results. Every model makes assumptions: stationarity, independence, feature distributions. State them explicitly so stakeholders understand when the model's predictions should be trusted and when they should not. A model trained on summer data may not generalize to winter; say so rather than letting stakeholders discover it through bad predictions.
Common Issues
Model performs well in development but poorly in production. Check for data leakage first: features computed from the target variable, future data leaking into training rows, or different feature computation between training and serving. Next, check for distribution shift: production data may have different patterns than training data. Monitor feature distributions in production and retrain when significant drift is detected.
Stakeholders dismiss statistically significant results as not meaningful. Statistical significance and practical significance are different. A p-value of 0.001 on a 0.1% conversion rate improvement may be statistically significant but not worth acting on. Always pair statistical results with business impact estimates: "This 2% conversion improvement translates to approximately $150K in annual revenue."
Feature importance rankings change between model runs. This instability usually indicates correlated features. When two features carry similar information, the model randomly assigns importance between them. Address this by grouping correlated features (using clustering or domain knowledge), using permutation importance instead of built-in feature importance, and reporting importance at the feature group level.
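The permutation-importance remedy mentioned above is easy to implement from scratch: shuffle one feature column and measure how much the metric drops. This is a simplified sketch with a toy model; the function names and data are hypothetical:

```python
import random

def permutation_importance(predict, X, y, metric, n_repeats=10, seed=42):
    """Importance of column j = average metric drop when column j is shuffled.

    Works with any predict(row) callable and any metric(y_true, y_pred).
    """
    rng = random.Random(seed)
    baseline = metric(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            score = metric(y, [predict(row) for row in X_perm])
            drops.append(baseline - score)
        importances.append(sum(drops) / n_repeats)
    return importances

def accuracy(y_true, y_pred):
    return sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)

# Toy setup: the label depends only on feature 0; feature 1 is noise
X = [[i % 2, i % 3] for i in range(60)]
y = [row[0] for row in X]
predict = lambda row: row[0]

imp = permutation_importance(predict, X, y, accuracy)
# imp[0] is large, imp[1] is zero: shuffling the noise column changes nothing
```

Because the score is measured through the model's predictions rather than its internals, correlated features still split credit, but the estimates are averaged over repeats and comparable across model families.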