
Comprehensive MLOps Module

Overview

MLOps (Machine Learning Operations) combines machine learning, DevOps, and data engineering to standardize and streamline the entire ML lifecycle, from experimentation through production deployment and ongoing monitoring. As organizations move beyond proof-of-concept models to production systems serving real users, MLOps practices become essential for ensuring reproducibility, reliability, scalability, and maintainability.

The MLOps lifecycle encompasses data management and versioning, feature engineering, experiment tracking, model training and validation, CI/CD for ML, model serving and deployment, monitoring and observability, and automated retraining pipelines.

This module covers the complete MLOps toolkit: from foundational practices like version control for data and models, through production deployment patterns, to emerging trends like AgentOps, edge ML, and explainability-integrated pipelines. Whether you are a data scientist transitioning models to production or an ML engineer building platform infrastructure, this guide provides the practical knowledge needed to implement MLOps at any scale.

When to Use

  • Productionizing ML models: Move models from Jupyter notebooks to reliable, monitored production services with automated deployment pipelines.
  • Establishing ML infrastructure: Set up the foundational tools and practices for a team or organization beginning its ML journey.
  • Improving reproducibility: Implement data versioning, experiment tracking, and environment management to ensure any result can be reproduced.
  • Scaling ML operations: Handle growing model counts, increasing data volumes, and expanding team sizes with standardized processes.
  • Monitoring model performance: Detect data drift, model degradation, and prediction quality issues before they impact users.
  • Compliance and auditing: Maintain audit trails of model versions, training data, and deployment decisions for regulatory requirements.

Quick Start

Core Tool Installation

```bash
# Experiment tracking and model registry
pip install mlflow

# Data versioning
pip install dvc
dvc init

# Pipeline orchestration (choose one)
pip install apache-airflow   # Workflow orchestration
pip install prefect          # Modern alternative
pip install zenml            # ML-specific pipelines

# Model serving
pip install fastapi uvicorn  # Custom API serving
pip install bentoml          # ML-specific serving
pip install tritonclient     # NVIDIA Triton client

# Monitoring
pip install evidently        # Data and model monitoring
pip install whylogs          # Data logging and profiling
```

Minimal MLOps Pipeline

```python
import mlflow
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# 1. Track experiment
mlflow.set_tracking_uri("./mlruns")
mlflow.set_experiment("customer-churn-v1")

with mlflow.start_run(run_name="rf-baseline"):
    # 2. Load and split data
    df = pd.read_csv("data/customers.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        df.drop("churn", axis=1), df["churn"], test_size=0.2, random_state=42
    )

    # 3. Train with logged parameters
    params = {"n_estimators": 100, "max_depth": 10, "random_state": 42}
    mlflow.log_params(params)
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    # 4. Evaluate and log metrics
    y_pred = model.predict(X_test)
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1_score": f1_score(y_test, y_pred),
    }
    mlflow.log_metrics(metrics)

    # 5. Register model
    mlflow.sklearn.log_model(
        model, "model", registered_model_name="customer-churn-model"
    )

    print(f"Accuracy: {metrics['accuracy']:.3f}")
    print(f"F1 Score: {metrics['f1_score']:.3f}")
```

Core Concepts

The MLOps Lifecycle

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    MLOps Lifecycle                           β”‚
β”‚                                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”               β”‚
β”‚  β”‚   Data    │──→│  Feature  │──→│  Model   β”‚               β”‚
β”‚  β”‚Management β”‚   β”‚Engineeringβ”‚   β”‚ Training β”‚               β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜               β”‚
β”‚       β”‚                              β”‚                      β”‚
β”‚       β”‚         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”‚                      β”‚
β”‚       β”‚         β”‚   Model  β”‚β†β”€β”€β”€β”€β”€β”€β”€β”€β”˜                      β”‚
β”‚       β”‚         β”‚Evaluationβ”‚                                β”‚
β”‚       β”‚         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                β”‚
β”‚       β”‚              β”‚                                      β”‚
β”‚       β”‚         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”               β”‚
β”‚       β”‚         β”‚   CI/CD  │──→│  Model   β”‚               β”‚
β”‚       β”‚         β”‚    ML    β”‚   β”‚ Serving  β”‚               β”‚
β”‚       β”‚         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜               β”‚
β”‚       β”‚                              β”‚                      β”‚
β”‚       β”‚         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”‚                      β”‚
β”‚       └─────────│Monitoringβ”‚β†β”€β”€β”€β”€β”€β”€β”€β”€β”˜                      β”‚
β”‚                 β”‚& Retrain β”‚                                β”‚
β”‚                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Data Versioning with DVC

```bash
# Initialize DVC in your project
dvc init

# Track large data files
dvc add data/training_data.parquet
git add data/training_data.parquet.dvc data/.gitignore
git commit -m "Add training data v1"

# Configure remote storage
dvc remote add -d s3storage s3://my-bucket/dvc-store
dvc push

# Create a reproducible pipeline
cat > dvc.yaml << 'EOF'
stages:
  prepare:
    cmd: python src/prepare_data.py
    deps:
      - src/prepare_data.py
      - data/raw/
    outs:
      - data/processed/
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/
    params:
      - train.n_estimators
      - train.max_depth
    outs:
      - models/model.pkl
    metrics:
      - metrics/scores.json:
          cache: false
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/processed/test.csv
    metrics:
      - metrics/eval.json:
          cache: false
EOF

# Run the pipeline
dvc repro

# Compare experiments
dvc metrics diff
dvc params diff
```
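
The `params` entries in the `train` stage assume a `params.yaml` at the repository root, which `dvc repro` tracks for change detection. A minimal sketch (the values are illustrative and should match what `src/train.py` reads):

```yaml
# params.yaml (illustrative values)
train:
  n_estimators: 100
  max_depth: 10
```

When either value changes, `dvc repro` re-runs the `train` stage and everything downstream of it.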

Feature Store with Feast

```python
# feature_repo/feature_definitions.py
from datetime import timedelta
from feast import Entity, Feature, FeatureView, FileSource, ValueType

# Define data source
customer_source = FileSource(
    path="data/customer_features.parquet",
    timestamp_field="event_timestamp",
)

# Define entity
customer = Entity(
    name="customer_id",
    value_type=ValueType.INT64,
    description="Customer identifier",
)

# Define feature view
customer_features = FeatureView(
    name="customer_features",
    entities=[customer],
    ttl=timedelta(days=1),
    features=[
        Feature(name="total_purchases", dtype=ValueType.FLOAT),
        Feature(name="days_since_last_purchase", dtype=ValueType.INT64),
        Feature(name="avg_order_value", dtype=ValueType.FLOAT),
    ],
    source=customer_source,
)

# Retrieve features for training
from feast import FeatureStore

store = FeatureStore(repo_path="feature_repo/")
training_df = store.get_historical_features(
    # entity_df: a DataFrame of customer_id values plus event_timestamp rows
    entity_df=entity_df,
    features=[
        "customer_features:total_purchases",
        "customer_features:days_since_last_purchase",
        "customer_features:avg_order_value",
    ],
).to_df()
```

Model Serving with FastAPI

```python
# serve.py
from fastapi import FastAPI
from pydantic import BaseModel
import mlflow
import numpy as np

app = FastAPI(title="Customer Churn Prediction API")

# Load model from registry
model = mlflow.sklearn.load_model("models:/customer-churn-model/Production")

class PredictionRequest(BaseModel):
    total_purchases: float
    days_since_last_purchase: int
    avg_order_value: float
    support_tickets: int

class PredictionResponse(BaseModel):
    churn_probability: float
    will_churn: bool
    model_version: str

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    features = np.array([[
        request.total_purchases,
        request.days_since_last_purchase,
        request.avg_order_value,
        request.support_tickets,
    ]])
    proba = model.predict_proba(features)[0][1]
    return PredictionResponse(
        churn_probability=round(float(proba), 4),
        will_churn=proba > 0.5,
        model_version="1.0.0",
    )

@app.get("/health")
async def health():
    return {"status": "healthy", "model_loaded": model is not None}
```

Monitoring with Evidently

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import (
    DataDriftPreset,
    DataQualityPreset,
    TargetDriftPreset,
)

# Load reference (training) and current (production) data
reference_data = pd.read_csv("data/training_data.csv")
current_data = pd.read_csv("data/production_batch_latest.csv")

# Create data drift report
drift_report = Report(metrics=[
    DataDriftPreset(),
    DataQualityPreset(),
    TargetDriftPreset(),
])
drift_report.run(reference_data=reference_data, current_data=current_data)

# Save report
drift_report.save_html("reports/drift_report.html")

# Programmatic drift detection
drift_results = drift_report.as_dict()
dataset_drift = drift_results["metrics"][0]["result"]["dataset_drift"]
if dataset_drift:
    print("DATA DRIFT DETECTED - triggering retraining pipeline")
    # trigger_retraining()
```

CI/CD Pipeline for ML

```yaml
# .github/workflows/ml-pipeline.yml
name: ML Pipeline

on:
  push:
    branches: [main]
    paths:
      - 'src/**'
      - 'data/**'
      - 'configs/**'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run unit tests
        run: pytest tests/unit/ -v
      - name: Run data validation
        run: python scripts/validate_data.py
      - name: Run model tests
        run: pytest tests/model/ -v

  train:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Pull data from DVC
        run: |
          pip install dvc[s3]
          dvc pull
      - name: Train model
        run: |
          python src/train.py
          dvc repro
      - name: Evaluate model
        run: python src/evaluate.py
      - name: Check quality gates
        run: |
          python -c "
          import json
          metrics = json.load(open('metrics/eval.json'))
          assert metrics['accuracy'] > 0.85, f'Accuracy {metrics[\"accuracy\"]} below threshold'
          assert metrics['f1_score'] > 0.80, f'F1 {metrics[\"f1_score\"]} below threshold'
          print('All quality gates passed')
          "

  deploy:
    needs: train
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - name: Deploy model
        run: |
          mlflow models serve -m "models:/customer-churn-model/Production" \
            -p 8000 --no-conda
```

Configuration Reference

MLOps Maturity Levels

| Level | Description | Practices |
|---|---|---|
| Level 0 | Manual | Jupyter notebooks, manual deployment, no monitoring |
| Level 1 | Automated Training | Experiment tracking, DVC, automated training pipeline |
| Level 2 | CI/CD for ML | Automated testing, model validation gates, automated deployment |
| Level 3 | Full Automation | Automated retraining on drift, A/B testing, feature stores |

Tool Categories

| Category | Tools | Purpose |
|---|---|---|
| Experiment Tracking | MLflow, W&B, Neptune, CometML | Log parameters, metrics, artifacts |
| Data Versioning | DVC, LakeFS, Delta Lake | Track data and model versions |
| Pipeline Orchestration | Airflow, Prefect, ZenML, Kubeflow | Automate ML workflows |
| Feature Store | Feast, Tecton, Hopsworks | Manage and serve ML features |
| Model Serving | BentoML, Triton, TorchServe, Seldon | Deploy models as APIs |
| Monitoring | Evidently, WhyLabs, NannyML, Arize | Detect drift and degradation |
| Cloud Platforms | SageMaker, Vertex AI, Azure ML | End-to-end managed MLOps |

Key Metrics to Monitor

| Metric | Description | Alert Threshold |
|---|---|---|
| Prediction latency (p99) | 99th percentile response time | > 500 ms |
| Prediction throughput | Requests per second | < 80% of capacity |
| Data drift score | Statistical distance from training data | > 0.1 (PSI) |
| Model accuracy | Rolling accuracy on labeled samples | < training accuracy - 5% |
| Feature missing rate | Percentage of null/missing features | > 5% |
| Memory usage | Model serving process memory | > 80% of limit |
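
The PSI threshold above is simple enough to compute without a monitoring library. A minimal sketch using only numpy; the quantile binning, the epsilon guard, and the `psi` helper name are illustrative choices, not a standard API:

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two 1-D samples."""
    # Bin edges from the reference distribution; quantiles avoid empty bins
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf

    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)

    # A small epsilon keeps the log finite when a bin is empty
    eps = 1e-6
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 10_000)
prod_feature = rng.normal(0.5, 1.0, 10_000)  # mean-shifted distribution

print(psi(train_feature, train_feature))  # identical data: effectively zero
print(psi(train_feature, prod_feature) > 0.1)  # shifted data trips the alert
```

A common rule of thumb is PSI < 0.1 for no significant shift and PSI > 0.25 for major shift, but thresholds should be tuned per feature.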

Best Practices

  1. Version everything: Data, code, models, configurations, and environments must all be versioned. Use Git for code, DVC for data and models, and Docker for environments. If you cannot reproduce a result from six months ago, your MLOps is incomplete.

  2. Implement quality gates in CI/CD: Do not deploy models that fail validation. Define minimum accuracy, F1 score, latency, and fairness thresholds. Automate these checks in your deployment pipeline so human error cannot bypass them.

  3. Monitor data drift, not just model accuracy: Model accuracy degrades silently as input data distributions shift. Use statistical tests (PSI, KS test, Jensen-Shannon divergence) to detect drift before it impacts predictions.

  4. Use a feature store for shared feature engineering: Feature stores prevent training-serving skew by ensuring the same feature computation logic is used in both training and inference. Feast is a solid open-source option.

  5. Separate training and serving infrastructure: Training requires GPUs and burst compute. Serving requires low latency and high availability. Do not conflate these requirements. Use separate infrastructure optimized for each workload.

  6. Implement shadow deployments before blue-green: Before switching production traffic to a new model, run it in shadow mode alongside the current model. Compare predictions without impacting users.

  7. Automate retraining with drift triggers: Set up automated pipelines that retrain models when data drift exceeds thresholds or when a scheduled evaluation shows performance degradation. Include human approval gates for production promotion.

  8. Log prediction inputs and outputs: Store prediction requests and responses for debugging, auditing, and future retraining. Implement sampling for high-throughput systems to manage storage costs.

  9. Use model registries for lifecycle management: MLflow Model Registry or equivalent provides stage transitions (Staging, Production, Archived), approval workflows, and lineage tracking. Never deploy models without going through the registry.

  10. Document your ML system architecture: Maintain a diagram showing data flows, training pipelines, serving infrastructure, and monitoring systems. This documentation is essential for onboarding, debugging, and compliance.
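
Practice 6 above reduces to a small piece of bookkeeping: send each request to both models, serve the primary's answer, and record disagreement. A minimal sketch in plain Python; the model stubs and thresholds are hypothetical stand-ins for real predictors:

```python
from dataclasses import dataclass, field

@dataclass
class ShadowComparator:
    """Serve the primary model's prediction while logging shadow disagreement."""
    disagreements: int = 0
    total: int = 0
    mismatches: list = field(default_factory=list)

    def predict(self, features, primary, shadow):
        primary_pred = primary(features)
        shadow_pred = shadow(features)  # never returned to the caller
        self.total += 1
        if primary_pred != shadow_pred:
            self.disagreements += 1
            self.mismatches.append((features, primary_pred, shadow_pred))
        return primary_pred

    @property
    def disagreement_rate(self) -> float:
        return self.disagreements / self.total if self.total else 0.0

# Hypothetical stand-ins for the current and candidate models
current_model = lambda x: int(x[0] > 0.5)
candidate_model = lambda x: int(x[0] > 0.6)

cmp = ShadowComparator()
for x in ([0.3], [0.55], [0.9], [0.7]):
    cmp.predict(x, current_model, candidate_model)

print(cmp.disagreement_rate)  # 0.25: only [0.55] splits the two thresholds
```

Reviewing the logged mismatches before promotion tells you exactly where the candidate model would have changed user-visible behavior.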

Troubleshooting

Model performs well in training but poorly in production This is usually caused by training-serving skew. Verify that feature engineering logic is identical between training and serving. Check for data leakage in training. Compare feature distributions between training data and production data using Evidently.
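
A quick way to check for this skew is to compare per-feature distributions directly. A sketch using the two-sample Kolmogorov-Smirnov statistic, implemented with numpy; the 0.1 cutoff is an illustrative starting point, not a universal threshold:

```python
import numpy as np

def ks_statistic(a: np.ndarray, b: np.ndarray) -> float:
    """Maximum gap between the empirical CDFs of two samples."""
    values = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), values, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), values, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(1)
train_col = rng.normal(0, 1, 5_000)
serve_col = rng.normal(0, 2, 5_000)  # same mean, inflated variance

stat = ks_statistic(train_col, serve_col)
if stat > 0.1:  # illustrative cutoff
    print(f"feature distribution mismatch (KS={stat:.3f})")
```

Run this per feature on a training snapshot versus a recent production batch; features with large KS values are the first place to look for divergent preprocessing.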

Retraining pipeline produces worse models Add quality gates that compare new model metrics against the current production model. Only promote models that meet or exceed current performance. Log training data snapshots to diagnose data quality issues.
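
The comparison gate described above fits in a few lines. A sketch in plain Python; the metric names and the no-regression tolerance are assumptions to adapt to your own evaluation output:

```python
def should_promote(candidate: dict, champion: dict, tolerance: float = 0.0) -> bool:
    """Promote only if the candidate matches or beats the champion on every shared metric."""
    shared = set(candidate) & set(champion)
    if not shared:
        raise ValueError("no common metrics to compare")
    return all(candidate[m] >= champion[m] - tolerance for m in shared)

champion = {"accuracy": 0.91, "f1_score": 0.88}
retrained = {"accuracy": 0.89, "f1_score": 0.90}  # better F1, worse accuracy

print(should_promote(retrained, champion))                  # False: accuracy regressed
print(should_promote(retrained, champion, tolerance=0.03))  # True: within tolerance
```

Running this check in CI, against metrics pulled from the model registry for the current production version, makes "new model is worse" a pipeline failure instead of a production incident.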

Model serving latency exceeds SLA Profile the serving pipeline to identify bottlenecks (feature computation, model inference, postprocessing). Consider model optimization (quantization, pruning, distillation). Use batching for throughput-oriented workloads. Scale horizontally with a load balancer.

Data drift detected but model accuracy is stable Not all drift affects model performance. Review which features are drifting and whether they are important to the model. Update your drift detection thresholds based on actual impact. Some drift is seasonal and expected.

Experiment tracking database growing too large Archive old experiments and delete artifacts for failed runs. Set retention policies for logged artifacts. Use artifact storage (S3, GCS) instead of the tracking database for large files.
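
Probabilistic sampling, mentioned in practice 8, also keeps prediction-log volume bounded. A minimal sketch; the sample rate and in-memory buffer are illustrative, and production code would write to durable storage such as S3 or GCS:

```python
import random

class SampledLogger:
    """Log a fixed fraction of prediction records to bound storage growth."""

    def __init__(self, sample_rate: float = 0.01, seed=None):
        self.sample_rate = sample_rate
        self.rng = random.Random(seed)
        self.records = []  # stand-in for an object store or log pipeline

    def log(self, request: dict, response: dict) -> bool:
        """Keep this record with probability sample_rate; report whether it was kept."""
        if self.rng.random() < self.sample_rate:
            self.records.append({"request": request, "response": response})
            return True
        return False

logger = SampledLogger(sample_rate=0.1, seed=42)
for i in range(10_000):
    logger.log({"customer_id": i}, {"churn_probability": 0.2})

print(len(logger.records))  # roughly 1,000 of 10,000 requests retained
```

For rare but important slices (errors, low-confidence predictions), log at 100% regardless of the base rate so debugging data is never sampled away.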
