
Scikit Learn Toolkit

Production-ready skill for machine learning in Python with scikit-learn. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill · Cliptics · scientific · v1.0.0 · MIT


Build, train, evaluate, and deploy classical machine learning models using scikit-learn, Python's standard library for supervised and unsupervised learning. This skill covers preprocessing pipelines, model selection, hyperparameter tuning, cross-validation, feature engineering, and production-ready model persistence.

When to Use This Skill

Choose Scikit Learn Toolkit when you need to:

  • Train classification, regression, or clustering models on tabular data
  • Build end-to-end ML pipelines with preprocessing, feature selection, and model training
  • Perform hyperparameter tuning with grid search or randomized search
  • Evaluate models with cross-validation and appropriate scoring metrics

Consider alternatives when:

  • You need deep learning or neural networks (use PyTorch or TensorFlow)
  • You need gradient boosting at scale (use XGBoost, LightGBM, or CatBoost directly)
  • You need time-series forecasting with specialized models (use statsmodels or Prophet)

Quick Start

pip install scikit-learn pandas numpy matplotlib
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

# Build pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Cross-validate
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='f1')
print(f"CV F1: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# Train and evaluate
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred, target_names=data.target_names))

Core Concepts

Model Selection Guide

| Task | Algorithms | When to Use |
|---|---|---|
| Binary classification | LogisticRegression, SVC, RandomForest | Labeled data, two classes |
| Multi-class classification | RandomForest, GradientBoosting, SVC (OvR) | Labeled data, 3+ classes |
| Regression | LinearRegression, Ridge, Lasso, SVR | Continuous target variable |
| Clustering | KMeans, DBSCAN, AgglomerativeClustering | No labels, find groups |
| Dimensionality reduction | PCA, t-SNE, UMAP | High-dimensional data |
| Anomaly detection | IsolationForest, OneClassSVM, LOF | Find outliers |
| Feature selection | SelectKBest, RFE, mutual_info | Reduce feature set |
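The unsupervised rows of the table above can be sketched end to end. The synthetic data and the choice of three clusters are illustrative assumptions, not a recommendation:

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Synthetic data with 3 groups (illustrative assumption)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Cluster without labels, then score cluster separation
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)
print(f"Silhouette: {silhouette_score(X_scaled, labels):.3f}")

# Reduce to 2 components for plotting or downstream models
X_2d = PCA(n_components=2, random_state=42).fit_transform(X_scaled)
print(X_2d.shape)
```

In practice the number of clusters is unknown; sweeping n_clusters and comparing silhouette scores is a common way to choose it.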

Full ML Pipeline with Tuning

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Define feature types
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['education', 'employment']

# Preprocessing for each type
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Full pipeline
pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('clf', GradientBoostingClassifier(random_state=42))
])

# Hyperparameter search
param_distributions = {
    'clf__n_estimators': randint(50, 300),
    'clf__max_depth': randint(3, 10),
    'clf__learning_rate': uniform(0.01, 0.3),
    'clf__subsample': uniform(0.6, 0.4),
    'clf__min_samples_leaf': randint(5, 50),
}
search = RandomizedSearchCV(
    pipeline, param_distributions, n_iter=50, cv=5,
    scoring='roc_auc', random_state=42, n_jobs=-1
)
# search.fit(X_train, y_train)
# print(f"Best AUC: {search.best_score_:.3f}")
# print(f"Best params: {search.best_params_}")

Configuration

| Parameter | Description | Default |
|---|---|---|
| test_size | Fraction of data held out for testing | 0.2 |
| cv | Number of cross-validation folds | 5 |
| scoring | Evaluation metric for model selection | "accuracy" |
| n_jobs | Parallel workers (-1 = all cores) | 1 |
| random_state | Reproducibility seed | 42 |
| n_iter | Iterations for RandomizedSearchCV | 50 |
| handle_unknown | OneHotEncoder behavior for unseen categories | "ignore" |
| max_features | Features considered per split (tree models) | "sqrt" |

Best Practices

  1. Always use pipelines to prevent data leakage — Fitting a scaler on the full dataset before splitting leaks test information into training. Use Pipeline to chain preprocessing and modeling so that transformations are fit only on training data during cross-validation.

  2. Stratify splits for imbalanced classification — Use stratify=y in train_test_split and StratifiedKFold for cross-validation to maintain class proportions. Without stratification, minority classes may be missing from some folds, producing unreliable performance estimates.

  3. Choose metrics that match your business objective — Accuracy is misleading for imbalanced data. Use F1-score or precision/recall for classification, RMSE or MAE for regression, and ROC-AUC when ranking matters. Set the scoring parameter consistently across cross-validation and hyperparameter tuning.

  4. Use RandomizedSearchCV over GridSearchCV for large search spaces — Grid search evaluates every combination and is exponentially expensive. Randomized search samples a fixed number of combinations and finds near-optimal parameters in a fraction of the time, especially with 4+ hyperparameters.

  5. Persist models with joblib, not pickle — Save trained pipelines with joblib.dump(pipeline, 'model.joblib') and load with joblib.load(). Joblib is more efficient for objects containing NumPy arrays. Always save the scikit-learn version alongside the model to catch version incompatibilities on deployment.
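Item 5 can be sketched as follows. The artifact filename and the dictionary layout for bundling the version alongside the model are illustrative assumptions:

```python
import joblib
import sklearn
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
]).fit(X, y)

# Save the fitted pipeline plus the sklearn version used to train it
artifact = {'model': pipeline, 'sklearn_version': sklearn.__version__}
joblib.dump(artifact, 'model.joblib')

# Load and check the version before serving predictions
loaded = joblib.load('model.joblib')
if loaded['sklearn_version'] != sklearn.__version__:
    raise RuntimeError('scikit-learn version mismatch; retrain or pin the version')
print(loaded['model'].predict(X[:3]))
```

A stricter deployment check might pin the exact version in requirements rather than failing at load time.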

Common Issues

"NotFittedError: This estimator is not fitted yet" — You called predict() or transform() on a model or transformer that hasn't been trained. Ensure you call fit() or fit_transform() first. In pipelines, calling pipeline.fit(X_train, y_train) fits all steps sequentially.

Feature names mismatch between training and prediction — When using DataFrames, scikit-learn v1.0+ checks that column names at prediction time match training. Ensure your production data has the same columns in the same order. Use pipeline.feature_names_in_ to verify expected features.

Cross-validation scores are much higher than test scores — This indicates data leakage, typically from preprocessing before splitting. Move all transformations inside the pipeline. Also check for duplicate rows (same sample in train and test) or temporal leakage (future data in training set).
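The preprocessing-leakage failure mode can be demonstrated by comparing a scaler fit on the full dataset against one fit inside a pipeline. On this dataset the gap is small, but the leaky variant is still methodologically wrong:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Leaky: scaler fit on ALL rows, so every CV fold has seen test-fold statistics
X_leaky = StandardScaler().fit_transform(X)
leaky = cross_val_score(SVC(), X_leaky, y, cv=5).mean()

# Correct: the scaler is refit on only the training rows of each fold
clean = cross_val_score(make_pipeline(StandardScaler(), SVC()), X, y, cv=5).mean()

print(f"leaky={leaky:.4f} clean={clean:.4f}")
```

Leakage effects grow with feature-dependent steps like feature selection or target encoding, where fitting outside the pipeline can inflate CV scores dramatically.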
