Comprehensive MLOps Module
Overview
MLOps (Machine Learning Operations) is the discipline that combines machine learning, DevOps, and data engineering to standardize and streamline the entire ML lifecycle from experimentation through production deployment and ongoing monitoring. As organizations move beyond proof-of-concept ML models to production systems serving real users, MLOps practices become essential for ensuring reproducibility, reliability, scalability, and maintainability. The MLOps lifecycle encompasses data management and versioning, feature engineering, experiment tracking, model training and validation, CI/CD for ML, model serving and deployment, monitoring and observability, and automated retraining pipelines. This module covers the complete MLOps toolkit: from foundational practices like version control for data and models, through production deployment patterns, to emerging trends like AgentOps, edge ML, and explainability-integrated pipelines. Whether you are a data scientist transitioning models to production or an ML engineer building platform infrastructure, this guide provides the practical knowledge needed to implement MLOps at any scale.
When to Use
- Productionizing ML models: Move models from Jupyter notebooks to reliable, monitored production services with automated deployment pipelines.
- Establishing ML infrastructure: Set up the foundational tools and practices for a team or organization beginning its ML journey.
- Improving reproducibility: Implement data versioning, experiment tracking, and environment management to ensure any result can be reproduced.
- Scaling ML operations: Handle growing model counts, increasing data volumes, and expanding team sizes with standardized processes.
- Monitoring model performance: Detect data drift, model degradation, and prediction quality issues before they impact users.
- Compliance and auditing: Maintain audit trails of model versions, training data, and deployment decisions for regulatory requirements.
Quick Start
Core Tool Installation
```bash
# Experiment tracking and model registry
pip install mlflow

# Data versioning
pip install dvc
dvc init

# Pipeline orchestration (choose one)
pip install apache-airflow   # Workflow orchestration
pip install prefect          # Modern alternative
pip install zenml            # ML-specific pipelines

# Model serving
pip install fastapi uvicorn  # Custom API serving
pip install bentoml          # ML-specific serving
pip install tritonclient     # NVIDIA Triton client

# Monitoring
pip install evidently        # Data and model monitoring
pip install whylogs          # Data logging and profiling
```
Minimal MLOps Pipeline
```python
import mlflow
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# 1. Track experiment
mlflow.set_tracking_uri("./mlruns")
mlflow.set_experiment("customer-churn-v1")

with mlflow.start_run(run_name="rf-baseline"):
    # 2. Load and split data
    df = pd.read_csv("data/customers.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        df.drop("churn", axis=1), df["churn"], test_size=0.2, random_state=42
    )

    # 3. Train with logged parameters
    params = {"n_estimators": 100, "max_depth": 10, "random_state": 42}
    mlflow.log_params(params)
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    # 4. Evaluate and log metrics
    y_pred = model.predict(X_test)
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1_score": f1_score(y_test, y_pred),
    }
    mlflow.log_metrics(metrics)

    # 5. Register model
    mlflow.sklearn.log_model(
        model, "model", registered_model_name="customer-churn-model"
    )
    print(f"Accuracy: {metrics['accuracy']:.3f}")
    print(f"F1 Score: {metrics['f1_score']:.3f}")
```
Core Concepts
The MLOps Lifecycle
```text
+------------+     +-------------+     +------------+
|    Data    |---->|   Feature   |---->|   Model    |
| Management |     | Engineering |     |  Training  |
+------------+     +-------------+     +------------+
      ^                                      |
      |            +------------+            |
      |            |   Model    |<-----------+
      |            | Evaluation |
      |            +------------+
      |                  |
      |            +------------+      +------------+
      |            |   CI/CD    |----->|   Model    |
      |            |  for ML    |      |  Serving   |
      |            +------------+      +------------+
      |                                      |
      |            +------------+            |
      +------------| Monitoring |<-----------+
                   | & Retrain  |
                   +------------+
```
Data Versioning with DVC
```bash
# Initialize DVC in your project
dvc init

# Track large data files
dvc add data/training_data.parquet
git add data/training_data.parquet.dvc data/.gitignore
git commit -m "Add training data v1"

# Configure remote storage
dvc remote add -d s3storage s3://my-bucket/dvc-store
dvc push

# Create a reproducible pipeline
cat > dvc.yaml << 'EOF'
stages:
  prepare:
    cmd: python src/prepare_data.py
    deps:
      - src/prepare_data.py
      - data/raw/
    outs:
      - data/processed/
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/
    params:
      - train.n_estimators
      - train.max_depth
    outs:
      - models/model.pkl
    metrics:
      - metrics/scores.json:
          cache: false
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/processed/test.csv
    metrics:
      - metrics/eval.json:
          cache: false
EOF

# Run the pipeline
dvc repro

# Compare experiments
dvc metrics diff
dvc params diff
```
Feature Store with Feast
```python
# feature_repo/feature_definitions.py
from datetime import timedelta

from feast import Entity, Feature, FeatureStore, FeatureView, FileSource, ValueType

# Define data source
customer_source = FileSource(
    path="data/customer_features.parquet",
    timestamp_field="event_timestamp",
)

# Define entity
customer = Entity(
    name="customer_id",
    value_type=ValueType.INT64,
    description="Customer identifier",
)

# Define feature view
customer_features = FeatureView(
    name="customer_features",
    entities=[customer],
    ttl=timedelta(days=1),
    features=[
        Feature(name="total_purchases", dtype=ValueType.FLOAT),
        Feature(name="days_since_last_purchase", dtype=ValueType.INT64),
        Feature(name="avg_order_value", dtype=ValueType.FLOAT),
    ],
    source=customer_source,
)

# Retrieve features for training.
# entity_df is a DataFrame with customer_id and event_timestamp columns
# identifying which rows (and as of when) to fetch.
store = FeatureStore(repo_path="feature_repo/")
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "customer_features:total_purchases",
        "customer_features:days_since_last_purchase",
        "customer_features:avg_order_value",
    ],
).to_df()
```
Model Serving with FastAPI
```python
# serve.py
import mlflow
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Customer Churn Prediction API")

# Load model from registry
model = mlflow.sklearn.load_model("models:/customer-churn-model/Production")


class PredictionRequest(BaseModel):
    total_purchases: float
    days_since_last_purchase: int
    avg_order_value: float
    support_tickets: int


class PredictionResponse(BaseModel):
    churn_probability: float
    will_churn: bool
    model_version: str


@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    features = np.array([[
        request.total_purchases,
        request.days_since_last_purchase,
        request.avg_order_value,
        request.support_tickets,
    ]])
    proba = model.predict_proba(features)[0][1]
    return PredictionResponse(
        churn_probability=round(float(proba), 4),
        will_churn=proba > 0.5,
        model_version="1.0.0",
    )


@app.get("/health")
async def health():
    return {"status": "healthy", "model_loaded": model is not None}
```
Monitoring with Evidently
```python
import pandas as pd
from evidently.metric_preset import (
    DataDriftPreset,
    DataQualityPreset,
    TargetDriftPreset,
)
from evidently.report import Report

# Load reference (training) and current (production) data
reference_data = pd.read_csv("data/training_data.csv")
current_data = pd.read_csv("data/production_batch_latest.csv")

# Create data drift report
drift_report = Report(metrics=[
    DataDriftPreset(),
    DataQualityPreset(),
    TargetDriftPreset(),
])
drift_report.run(reference_data=reference_data, current_data=current_data)

# Save report
drift_report.save_html("reports/drift_report.html")

# Programmatic drift detection
drift_results = drift_report.as_dict()
dataset_drift = drift_results["metrics"][0]["result"]["dataset_drift"]
if dataset_drift:
    print("DATA DRIFT DETECTED - triggering retraining pipeline")
    # trigger_retraining()
```
CI/CD Pipeline for ML
```yaml
# .github/workflows/ml-pipeline.yml
name: ML Pipeline

on:
  push:
    branches: [main]
    paths:
      - 'src/**'
      - 'data/**'
      - 'configs/**'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run unit tests
        run: pytest tests/unit/ -v
      - name: Run data validation
        run: python scripts/validate_data.py
      - name: Run model tests
        run: pytest tests/model/ -v

  train:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Pull data from DVC
        run: |
          pip install dvc[s3]
          dvc pull
      - name: Train model
        run: |
          python src/train.py
          dvc repro
      - name: Evaluate model
        run: python src/evaluate.py
      - name: Check quality gates
        run: |
          python - << 'EOF'
          import json
          metrics = json.load(open('metrics/eval.json'))
          assert metrics['accuracy'] > 0.85, f"Accuracy {metrics['accuracy']} below threshold"
          assert metrics['f1_score'] > 0.80, f"F1 {metrics['f1_score']} below threshold"
          print('All quality gates passed')
          EOF

  deploy:
    needs: train
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - name: Deploy model
        run: |
          mlflow models serve -m "models:/customer-churn-model/Production" \
            -p 8000 --no-conda
```
Configuration Reference
MLOps Maturity Levels
| Level | Description | Practices |
|---|---|---|
| Level 0 | Manual | Jupyter notebooks, manual deployment, no monitoring |
| Level 1 | Automated Training | Experiment tracking, DVC, automated training pipeline |
| Level 2 | CI/CD for ML | Automated testing, model validation gates, automated deployment |
| Level 3 | Full Automation | Automated retraining on drift, A/B testing, feature stores |
Tool Categories
| Category | Tools | Purpose |
|---|---|---|
| Experiment Tracking | MLflow, W&B, Neptune, CometML | Log parameters, metrics, artifacts |
| Data Versioning | DVC, LakeFS, Delta Lake | Track data and model versions |
| Pipeline Orchestration | Airflow, Prefect, ZenML, Kubeflow | Automate ML workflows |
| Feature Store | Feast, Tecton, Hopsworks | Manage and serve ML features |
| Model Serving | BentoML, Triton, TorchServe, Seldon | Deploy models as APIs |
| Monitoring | Evidently, WhyLabs, NannyML, Arize | Detect drift and degradation |
| Cloud Platforms | SageMaker, Vertex AI, Azure ML | End-to-end managed MLOps |
Key Metrics to Monitor
| Metric | Description | Alert Threshold |
|---|---|---|
| Prediction latency (p99) | 99th percentile response time | > 500ms |
| Prediction throughput | Requests per second | < 80% of capacity |
| Data drift score | Statistical distance from training data | > 0.1 (PSI) |
| Model accuracy | Rolling accuracy on labeled samples | < training accuracy - 5% |
| Feature missing rate | Percentage of null/missing features | > 5% |
| Memory usage | Model serving process memory | > 80% of limit |
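The drift-score threshold above refers to the Population Stability Index (PSI), which compares the binned distribution of a feature in production against its distribution in training data. A minimal sketch of the computation, using synthetic data and an illustrative function name (not from any monitoring library):

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI = sum over bins of (cur% - ref%) * ln(cur% / ref%)."""
    # Bin edges from reference quantiles, so each bin holds ~1/n_bins of reference
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the reference range
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    # Clip to avoid log(0) / division by zero for empty bins
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=10_000)   # "training" feature
stable = rng.normal(loc=0.0, scale=1.0, size=10_000)      # same distribution
shifted = rng.normal(loc=0.5, scale=1.0, size=10_000)     # mean shifted by 0.5 sd

print(f"PSI (stable):  {population_stability_index(reference, stable):.4f}")
print(f"PSI (shifted): {population_stability_index(reference, shifted):.4f}")
```

With the common rule of thumb, PSI below 0.1 means no significant shift, 0.1 to 0.25 warrants investigation, and above 0.25 indicates a major distribution change.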
Best Practices
- Version everything: Data, code, models, configurations, and environments must all be versioned. Use Git for code, DVC for data and models, and Docker for environments. If you cannot reproduce a result from six months ago, your MLOps is incomplete.
- Implement quality gates in CI/CD: Do not deploy models that fail validation. Define minimum accuracy, F1 score, latency, and fairness thresholds. Automate these checks in your deployment pipeline so human error cannot bypass them.
- Monitor data drift, not just model accuracy: Model accuracy degrades silently as input data distributions shift. Use statistical tests (PSI, KS test, Jensen-Shannon divergence) to detect drift before it impacts predictions.
- Use a feature store for shared feature engineering: Feature stores prevent training-serving skew by ensuring the same feature computation logic is used in both training and inference. Feast is a solid open-source option.
- Separate training and serving infrastructure: Training requires GPUs and burst compute. Serving requires low latency and high availability. Do not conflate these requirements. Use separate infrastructure optimized for each workload.
- Implement shadow deployments before blue-green: Before switching production traffic to a new model, run it in shadow mode alongside the current model. Compare predictions without impacting users.
- Automate retraining with drift triggers: Set up automated pipelines that retrain models when data drift exceeds thresholds or when a scheduled evaluation shows performance degradation. Include human approval gates for production promotion.
- Log prediction inputs and outputs: Store prediction requests and responses for debugging, auditing, and future retraining. Implement sampling for high-throughput systems to manage storage costs.
- Use model registries for lifecycle management: MLflow Model Registry or equivalent provides stage transitions (Staging, Production, Archived), approval workflows, and lineage tracking. Never deploy models without going through the registry.
- Document your ML system architecture: Maintain a diagram showing data flows, training pipelines, serving infrastructure, and monitoring systems. This documentation is essential for onboarding, debugging, and compliance.
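The shadow-deployment practice above can be sketched as a serving wrapper that returns only the champion model's prediction while silently logging the candidate's output for offline comparison. Everything here is a stand-in (simple threshold "models", an in-memory log) rather than a real serving framework:

```python
import random

def champion_predict(features):
    # Current production model (stand-in: threshold on a single feature)
    return 1 if features["score"] > 0.5 else 0

def shadow_predict(features):
    # Candidate model running in shadow mode (stand-in: a different threshold)
    return 1 if features["score"] > 0.55 else 0

def serve(features, shadow_log):
    served = champion_predict(features)  # only this prediction reaches the user
    shadow = shadow_predict(features)    # computed and logged, never returned
    shadow_log.append({"champion": served, "shadow": shadow})
    return served

random.seed(0)
log = []
for _ in range(1_000):
    serve({"score": random.random()}, log)

# Offline, compare the two models on identical traffic before any cutover
disagreement = sum(r["champion"] != r["shadow"] for r in log) / len(log)
print(f"Champion/shadow disagreement rate: {disagreement:.1%}")
```

A high disagreement rate flags the candidate for review before it ever serves traffic; in a real system the log would go to durable storage and be joined with delayed ground-truth labels.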
Troubleshooting
Model performs well in training but poorly in production
This is usually caused by training-serving skew. Verify that feature engineering logic is identical between training and serving. Check for data leakage in training. Compare feature distributions between training data and production data using Evidently.

Retraining pipeline produces worse models
Add quality gates that compare new model metrics against the current production model. Only promote models that meet or exceed current performance. Log training data snapshots to diagnose data quality issues.

Model serving latency exceeds SLA
Profile the serving pipeline to identify bottlenecks (feature computation, model inference, postprocessing). Consider model optimization (quantization, pruning, distillation). Use batching for throughput-oriented workloads. Scale horizontally with a load balancer.

Data drift detected but model accuracy is stable
Not all drift affects model performance. Review which features are drifting and whether they are important to the model. Update your drift detection thresholds based on actual impact. Some drift is seasonal and expected.

Experiment tracking database growing too large
Archive old experiments and delete artifacts for failed runs. Set retention policies for logged artifacts. Use artifact storage (S3, GCS) instead of the tracking database for large files.
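The metric-comparison gate suggested for retraining pipelines can be sketched as a metric-by-metric check against the current production model. The function name, tolerance, and metric values below are illustrative:

```python
def passes_quality_gates(candidate, production, min_delta=-0.005):
    """Promote only if the candidate meets or exceeds production on every metric.

    min_delta is a small tolerance that absorbs run-to-run training noise.
    Returns (passed, list_of_failed_metric_names).
    """
    failures = [
        name for name in production
        if candidate.get(name, 0.0) - production[name] < min_delta
    ]
    return len(failures) == 0, failures

# Metrics for the current production model vs. a freshly retrained candidate
production_metrics = {"accuracy": 0.87, "f1_score": 0.82}
candidate_metrics = {"accuracy": 0.88, "f1_score": 0.80}  # F1 regressed

ok, failed = passes_quality_gates(candidate_metrics, production_metrics)
print("promote" if ok else f"reject (failed: {failed})")
```

In practice the two metric dictionaries would come from your evaluation artifacts (for example, the `metrics/eval.json` files of the candidate run and the production run), and a rejected candidate would never reach the model registry's Production stage.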