
Scikit Learn Toolkit

Production-ready skill for machine learning in Python with scikit-learn. Includes structured workflows, validation checks, and reusable patterns for scientific computing.

Skill · Cliptics · scientific · v1.0.0 · MIT


Build, train, evaluate, and deploy classical machine learning models using scikit-learn, Python's standard library for supervised and unsupervised learning. This skill covers preprocessing pipelines, model selection, hyperparameter tuning, cross-validation, feature engineering, and production-ready model persistence.

When to Use This Skill

Choose Scikit Learn Toolkit when you need to:

  • Train classification, regression, or clustering models on tabular data
  • Build end-to-end ML pipelines with preprocessing, feature selection, and model training
  • Perform hyperparameter tuning with grid search or randomized search
  • Evaluate models with cross-validation and appropriate scoring metrics

Consider alternatives when:

  • You need deep learning or neural networks (use PyTorch or TensorFlow)
  • You need gradient boosting at scale (use XGBoost, LightGBM, or CatBoost directly)
  • You need time-series forecasting with specialized models (use statsmodels or Prophet)

Quick Start

pip install scikit-learn pandas numpy matplotlib
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

# Build pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Cross-validate
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='f1')
print(f"CV F1: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# Train and evaluate
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred, target_names=data.target_names))

Core Concepts

Model Selection Guide

| Task | Algorithms | When to Use |
|---|---|---|
| Binary classification | LogisticRegression, SVC, RandomForest | Labeled data, two classes |
| Multi-class classification | RandomForest, GradientBoosting, SVC (OvR) | Labeled data, 3+ classes |
| Regression | LinearRegression, Ridge, Lasso, SVR | Continuous target variable |
| Clustering | KMeans, DBSCAN, AgglomerativeClustering | No labels, find groups |
| Dimensionality reduction | PCA, t-SNE, UMAP | High-dimensional data |
| Anomaly detection | IsolationForest, OneClassSVM, LOF | Find outliers |
| Feature selection | SelectKBest, RFE, mutual_info | Reduce feature set |
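The unsupervised rows of the table above can be sketched end to end. The synthetic data and the choice of three clusters are illustrative assumptions, not a recommendation:

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Synthetic data with 3 groups (illustrative assumption)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Cluster without labels, then score cluster separation
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)
print(f"Silhouette: {silhouette_score(X_scaled, labels):.3f}")

# Reduce to 2 components for plotting or downstream models
X_2d = PCA(n_components=2, random_state=42).fit_transform(X_scaled)
print(X_2d.shape)
```

In practice the number of clusters is unknown; sweeping n_clusters and comparing silhouette scores is a common way to choose it.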

Full ML Pipeline with Tuning

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Define feature types
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['education', 'employment']

# Preprocessing for each type
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Full pipeline
pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('clf', GradientBoostingClassifier(random_state=42))
])

# Hyperparameter search
param_distributions = {
    'clf__n_estimators': randint(50, 300),
    'clf__max_depth': randint(3, 10),
    'clf__learning_rate': uniform(0.01, 0.3),
    'clf__subsample': uniform(0.6, 0.4),
    'clf__min_samples_leaf': randint(5, 50),
}
search = RandomizedSearchCV(
    pipeline, param_distributions, n_iter=50, cv=5,
    scoring='roc_auc', random_state=42, n_jobs=-1
)
# search.fit(X_train, y_train)
# print(f"Best AUC: {search.best_score_:.3f}")
# print(f"Best params: {search.best_params_}")

Configuration

| Parameter | Description | Default |
|---|---|---|
| test_size | Fraction of data held out for testing | 0.2 |
| cv | Number of cross-validation folds | 5 |
| scoring | Evaluation metric for model selection | "accuracy" |
| n_jobs | Parallel workers (-1 = all cores) | 1 |
| random_state | Reproducibility seed | 42 |
| n_iter | Iterations for RandomizedSearchCV | 50 |
| handle_unknown | OneHotEncoder behavior for unseen categories | "ignore" |
| max_features | Features considered per split (tree models) | "sqrt" |

Best Practices

  1. Always use pipelines to prevent data leakage — Fitting a scaler on the full dataset before splitting leaks test information into training. Use Pipeline to chain preprocessing and modeling so that transformations are fit only on training data during cross-validation.

  2. Stratify splits for imbalanced classification — Use stratify=y in train_test_split and StratifiedKFold for cross-validation to maintain class proportions. Without stratification, minority classes may be missing from some folds, producing unreliable performance estimates.

  3. Choose metrics that match your business objective — Accuracy is misleading for imbalanced data. Use F1-score or precision/recall for classification, RMSE or MAE for regression, and ROC-AUC when ranking matters. Set the scoring parameter consistently across cross-validation and hyperparameter tuning.

  4. Use RandomizedSearchCV over GridSearchCV for large search spaces — Grid search evaluates every combination and is exponentially expensive. Randomized search samples a fixed number of combinations and finds near-optimal parameters in a fraction of the time, especially with 4+ hyperparameters.

  5. Persist models with joblib, not pickle — Save trained pipelines with joblib.dump(pipeline, 'model.joblib') and load with joblib.load(). Joblib is more efficient for objects containing NumPy arrays. Always save the scikit-learn version alongside the model to catch version incompatibilities on deployment.
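Item 5 can be sketched as follows. The artifact filename and the dictionary layout for bundling the version alongside the model are illustrative assumptions:

```python
import joblib
import sklearn
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
]).fit(X, y)

# Save the fitted pipeline plus the sklearn version used to train it
artifact = {'model': pipeline, 'sklearn_version': sklearn.__version__}
joblib.dump(artifact, 'model.joblib')

# Load and check the version before serving predictions
loaded = joblib.load('model.joblib')
if loaded['sklearn_version'] != sklearn.__version__:
    raise RuntimeError('scikit-learn version mismatch; retrain or pin the version')
print(loaded['model'].predict(X[:3]))
```

A stricter deployment check might pin the exact version in requirements rather than failing at load time.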

Common Issues

"NotFittedError: This estimator is not fitted yet" — You called predict() or transform() on a model or transformer that hasn't been trained. Ensure you call fit() or fit_transform() first. In pipelines, calling pipeline.fit(X_train, y_train) fits all steps sequentially.

Feature names mismatch between training and prediction — When using DataFrames, scikit-learn v1.0+ checks that column names at prediction time match training. Ensure your production data has the same columns in the same order. Use pipeline.feature_names_in_ to verify expected features.

Cross-validation scores are much higher than test scores — This indicates data leakage, typically from preprocessing before splitting. Move all transformations inside the pipeline. Also check for duplicate rows (same sample in train and test) or temporal leakage (future data in training set).
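The preprocessing-leakage failure mode can be demonstrated by comparing a scaler fit on the full dataset against one fit inside a pipeline. On this dataset the gap is small, but the leaky variant is still methodologically wrong:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Leaky: scaler fit on ALL rows, so every CV fold has seen test-fold statistics
X_leaky = StandardScaler().fit_transform(X)
leaky = cross_val_score(SVC(), X_leaky, y, cv=5).mean()

# Correct: the scaler is refit on only the training rows of each fold
clean = cross_val_score(make_pipeline(StandardScaler(), SVC()), X, y, cv=5).mean()

print(f"leaky={leaky:.4f} clean={clean:.4f}")
```

Leakage effects grow with feature-dependent steps like feature selection or target encoding, where fitting outside the pipeline can inflate CV scores dramatically.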
