Scikit Learn Toolkit
Production-ready skill for machine learning in Python with scikit-learn. Includes structured workflows, validation checks, and reusable patterns for scientific computing.
Build, train, evaluate, and deploy classical machine learning models using scikit-learn, Python's standard library for supervised and unsupervised learning. This skill covers preprocessing pipelines, model selection, hyperparameter tuning, cross-validation, feature engineering, and production-ready model persistence.
When to Use This Skill
Choose Scikit Learn Toolkit when you need to:
- Train classification, regression, or clustering models on tabular data
- Build end-to-end ML pipelines with preprocessing, feature selection, and model training
- Perform hyperparameter tuning with grid search or randomized search
- Evaluate models with cross-validation and appropriate scoring metrics
Consider alternatives when:
- You need deep learning or neural networks (use PyTorch or TensorFlow)
- You need gradient boosting at scale (use XGBoost, LightGBM, or CatBoost directly)
- You need time-series forecasting with specialized models (use statsmodels or Prophet)
Quick Start
```bash
pip install scikit-learn pandas numpy matplotlib
```
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

# Build pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Cross-validate
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='f1')
print(f"CV F1: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# Train and evaluate
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred, target_names=data.target_names))
```
Core Concepts
Model Selection Guide
| Task | Algorithms | When to Use |
|---|---|---|
| Binary classification | LogisticRegression, SVC, RandomForest | Labeled data, two classes |
| Multi-class classification | RandomForest, GradientBoosting, SVC(OvR) | Labeled data, 3+ classes |
| Regression | LinearRegression, Ridge, Lasso, SVR | Continuous target variable |
| Clustering | KMeans, DBSCAN, AgglomerativeClustering | No labels, find groups |
| Dimensionality reduction | PCA, t-SNE, UMAP | High-dimensional data |
| Anomaly detection | IsolationForest, OneClassSVM, LOF | Find outliers |
| Feature selection | SelectKBest, RFE, mutual_info | Reduce feature set |
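As a minimal sketch of the unsupervised rows in the table above, here is KMeans clustering on synthetic data (the dataset and cluster count are assumptions for illustration). Scaling first matters because KMeans is distance-based:

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import numpy as np

# Synthetic data: 3 well-separated groups, no labels used for fitting
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Scale features so no single dimension dominates the distance metric
X_scaled = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X_scaled)

print(len(np.unique(labels)))  # 3 distinct cluster labels
```

The same fit/predict pattern applies to DBSCAN and AgglomerativeClustering; only the constructor parameters change.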
Full ML Pipeline with Tuning
```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Define feature types
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['education', 'employment']

# Preprocessing for each type
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Full pipeline
pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('clf', GradientBoostingClassifier(random_state=42))
])

# Hyperparameter search
param_distributions = {
    'clf__n_estimators': randint(50, 300),
    'clf__max_depth': randint(3, 10),
    'clf__learning_rate': uniform(0.01, 0.3),
    'clf__subsample': uniform(0.6, 0.4),
    'clf__min_samples_leaf': randint(5, 50),
}
search = RandomizedSearchCV(
    pipeline, param_distributions, n_iter=50, cv=5,
    scoring='roc_auc', random_state=42, n_jobs=-1
)
# search.fit(X_train, y_train)
# print(f"Best AUC: {search.best_score_:.3f}")
# print(f"Best params: {search.best_params_}")
```
Configuration
| Parameter | Description | Default |
|---|---|---|
| test_size | Fraction of data held out for testing | 0.2 |
| cv | Number of cross-validation folds | 5 |
| scoring | Evaluation metric for model selection | "accuracy" |
| n_jobs | Parallel workers (-1 = all cores) | 1 |
| random_state | Reproducibility seed | 42 |
| n_iter | Iterations for RandomizedSearchCV | 50 |
| handle_unknown | OneHotEncoder behavior for unseen categories | "ignore" |
| max_features | Features considered per split (tree models) | "sqrt" |
Best Practices
- **Always use pipelines to prevent data leakage** — Fitting a scaler on the full dataset before splitting leaks test information into training. Use `Pipeline` to chain preprocessing and modeling so that transformations are fit only on training data during cross-validation.
- **Stratify splits for imbalanced classification** — Use `stratify=y` in `train_test_split` and `StratifiedKFold` for cross-validation to maintain class proportions. Without stratification, minority classes may be missing from some folds, producing unreliable performance estimates.
- **Choose metrics that match your business objective** — Accuracy is misleading for imbalanced data. Use F1-score or precision/recall for classification, RMSE or MAE for regression, and ROC-AUC when ranking matters. Set the `scoring` parameter consistently across cross-validation and hyperparameter tuning.
- **Use RandomizedSearchCV over GridSearchCV for large search spaces** — Grid search evaluates every combination and is exponentially expensive. Randomized search samples a fixed number of combinations and finds near-optimal parameters in a fraction of the time, especially with 4+ hyperparameters.
- **Persist models with joblib, not pickle** — Save trained pipelines with `joblib.dump(pipeline, 'model.joblib')` and load with `joblib.load()`. Joblib is more efficient for objects containing NumPy arrays. Always save the scikit-learn version alongside the model to catch version incompatibilities on deployment.
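The persistence advice above can be sketched as follows. Bundling the model with its training-time `sklearn.__version__` is one simple convention (the dict layout and filename here are illustrative assumptions, not a scikit-learn API):

```python
import joblib
import sklearn
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
]).fit(X, y)

# Save the pipeline together with the scikit-learn version that trained it
artifact = {'model': pipeline, 'sklearn_version': sklearn.__version__}
joblib.dump(artifact, 'model.joblib')

# At load time, compare versions before serving predictions
loaded = joblib.load('model.joblib')
if loaded['sklearn_version'] != sklearn.__version__:
    print('Warning: model was trained under a different scikit-learn version')
preds = loaded['model'].predict(X[:5])
```

Persisting the whole pipeline (preprocessing plus model) means production code never has to re-implement the transformations.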
Common Issues
"NotFittedError: This estimator is not fitted yet" — You called predict() or transform() on a model or transformer that hasn't been trained. Ensure you call fit() or fit_transform() first. In pipelines, calling pipeline.fit(X_train, y_train) fits all steps sequentially.
Feature names mismatch between training and prediction — When using DataFrames, scikit-learn v1.0+ checks that column names at prediction time match training. Ensure your production data has the same columns in the same order. Use pipeline.feature_names_in_ to verify expected features.
Cross-validation scores are much higher than test scores — This indicates data leakage, typically from preprocessing before splitting. Move all transformations inside the pipeline. Also check for duplicate rows (same sample in train and test) or temporal leakage (future data in training set).