Comprehensive MLOps Module
Overview
MLOps (Machine Learning Operations) is the discipline that combines machine learning, DevOps, and data engineering to standardize and streamline the entire ML lifecycle from experimentation through production deployment and ongoing monitoring. As organizations move beyond proof-of-concept ML models to production systems serving real users, MLOps practices become essential for ensuring reproducibility, reliability, scalability, and maintainability. The MLOps lifecycle encompasses data management and versioning, feature engineering, experiment tracking, model training and validation, CI/CD for ML, model serving and deployment, monitoring and observability, and automated retraining pipelines. This module covers the complete MLOps toolkit: from foundational practices like version control for data and models, through production deployment patterns, to emerging trends like AgentOps, edge ML, and explainability-integrated pipelines. Whether you are a data scientist transitioning models to production or an ML engineer building platform infrastructure, this guide provides the practical knowledge needed to implement MLOps at any scale.
When to Use
- Productionizing ML models: Move models from Jupyter notebooks to reliable, monitored production services with automated deployment pipelines.
- Establishing ML infrastructure: Set up the foundational tools and practices for a team or organization beginning its ML journey.
- Improving reproducibility: Implement data versioning, experiment tracking, and environment management to ensure any result can be reproduced.
- Scaling ML operations: Handle growing model counts, increasing data volumes, and expanding team sizes with standardized processes.
- Monitoring model performance: Detect data drift, model degradation, and prediction quality issues before they impact users.
- Compliance and auditing: Maintain audit trails of model versions, training data, and deployment decisions for regulatory requirements.
Quick Start
Core Tool Installation
```bash
# Experiment tracking and model registry
pip install mlflow

# Data versioning
pip install dvc
dvc init

# Pipeline orchestration (choose one)
pip install apache-airflow   # Workflow orchestration
pip install prefect          # Modern alternative
pip install zenml            # ML-specific pipelines

# Model serving
pip install fastapi uvicorn  # Custom API serving
pip install bentoml          # ML-specific serving
pip install tritonclient     # NVIDIA Triton client

# Monitoring
pip install evidently        # Data and model monitoring
pip install whylogs          # Data logging and profiling
```
Minimal MLOps Pipeline
```python
import mlflow
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# 1. Track experiment
mlflow.set_tracking_uri("./mlruns")
mlflow.set_experiment("customer-churn-v1")

with mlflow.start_run(run_name="rf-baseline"):
    # 2. Load and split data
    df = pd.read_csv("data/customers.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        df.drop("churn", axis=1), df["churn"], test_size=0.2, random_state=42
    )

    # 3. Train with logged parameters
    params = {"n_estimators": 100, "max_depth": 10, "random_state": 42}
    mlflow.log_params(params)
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    # 4. Evaluate and log metrics
    y_pred = model.predict(X_test)
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1_score": f1_score(y_test, y_pred),
    }
    mlflow.log_metrics(metrics)

    # 5. Register model
    mlflow.sklearn.log_model(
        model, "model", registered_model_name="customer-churn-model"
    )
    print(f"Accuracy: {metrics['accuracy']:.3f}")
    print(f"F1 Score: {metrics['f1_score']:.3f}")
```
Core Concepts
The MLOps Lifecycle
```text
+------------+     +-------------+     +------------+
|    Data    |---->|   Feature   |---->|   Model    |
| Management |     | Engineering |     |  Training  |
+------------+     +-------------+     +------------+
      ^                                      |
      |            +------------+            |
      |            |   Model    |<-----------+
      |            | Evaluation |
      |            +------------+
      |                  |
      |            +------------+      +------------+
      |            |   CI/CD    |----->|   Model    |
      |            |  for ML    |      |  Serving   |
      |            +------------+      +------------+
      |                                      |
      |            +------------+            |
      +------------| Monitoring |<-----------+
                   | & Retrain  |
                   +------------+
```
Data Versioning with DVC
```bash
# Initialize DVC in your project
dvc init

# Track large data files
dvc add data/training_data.parquet
git add data/training_data.parquet.dvc data/.gitignore
git commit -m "Add training data v1"

# Configure remote storage
dvc remote add -d s3storage s3://my-bucket/dvc-store
dvc push

# Create a reproducible pipeline
cat > dvc.yaml << 'EOF'
stages:
  prepare:
    cmd: python src/prepare_data.py
    deps:
      - src/prepare_data.py
      - data/raw/
    outs:
      - data/processed/
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/
    params:
      - train.n_estimators
      - train.max_depth
    outs:
      - models/model.pkl
    metrics:
      - metrics/scores.json:
          cache: false
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/processed/test.csv
    metrics:
      - metrics/eval.json:
          cache: false
EOF

# Run the pipeline
dvc repro

# Compare experiments
dvc metrics diff
dvc params diff
```
Feature Store with Feast
```python
# feature_repo/feature_definitions.py
from datetime import timedelta

from feast import Entity, Feature, FeatureStore, FeatureView, FileSource, ValueType

# Define data source
customer_source = FileSource(
    path="data/customer_features.parquet",
    timestamp_field="event_timestamp",
)

# Define entity
customer = Entity(
    name="customer_id",
    value_type=ValueType.INT64,
    description="Customer identifier",
)

# Define feature view
customer_features = FeatureView(
    name="customer_features",
    entities=[customer],
    ttl=timedelta(days=1),
    features=[
        Feature(name="total_purchases", dtype=ValueType.FLOAT),
        Feature(name="days_since_last_purchase", dtype=ValueType.INT64),
        Feature(name="avg_order_value", dtype=ValueType.FLOAT),
    ],
    source=customer_source,
)

# Retrieve features for training.
# entity_df is a DataFrame with customer_id and event_timestamp columns
# identifying which rows (and as of when) to fetch.
store = FeatureStore(repo_path="feature_repo/")
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "customer_features:total_purchases",
        "customer_features:days_since_last_purchase",
        "customer_features:avg_order_value",
    ],
).to_df()
```
Model Serving with FastAPI
```python
# serve.py
import mlflow
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Customer Churn Prediction API")

# Load model from registry
model = mlflow.sklearn.load_model("models:/customer-churn-model/Production")


class PredictionRequest(BaseModel):
    total_purchases: float
    days_since_last_purchase: int
    avg_order_value: float
    support_tickets: int


class PredictionResponse(BaseModel):
    churn_probability: float
    will_churn: bool
    model_version: str


@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    features = np.array([[
        request.total_purchases,
        request.days_since_last_purchase,
        request.avg_order_value,
        request.support_tickets,
    ]])
    proba = model.predict_proba(features)[0][1]
    return PredictionResponse(
        churn_probability=round(float(proba), 4),
        will_churn=proba > 0.5,
        model_version="1.0.0",
    )


@app.get("/health")
async def health():
    return {"status": "healthy", "model_loaded": model is not None}
```
Monitoring with Evidently
```python
import pandas as pd
from evidently.metric_preset import (
    DataDriftPreset,
    DataQualityPreset,
    TargetDriftPreset,
)
from evidently.report import Report

# Load reference (training) and current (production) data
reference_data = pd.read_csv("data/training_data.csv")
current_data = pd.read_csv("data/production_batch_latest.csv")

# Create data drift report
drift_report = Report(metrics=[
    DataDriftPreset(),
    DataQualityPreset(),
    TargetDriftPreset(),
])
drift_report.run(reference_data=reference_data, current_data=current_data)

# Save report
drift_report.save_html("reports/drift_report.html")

# Programmatic drift detection
drift_results = drift_report.as_dict()
dataset_drift = drift_results["metrics"][0]["result"]["dataset_drift"]
if dataset_drift:
    print("DATA DRIFT DETECTED - triggering retraining pipeline")
    # trigger_retraining()
```
CI/CD Pipeline for ML
```yaml
# .github/workflows/ml-pipeline.yml
name: ML Pipeline

on:
  push:
    branches: [main]
    paths:
      - 'src/**'
      - 'data/**'
      - 'configs/**'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run unit tests
        run: pytest tests/unit/ -v
      - name: Run data validation
        run: python scripts/validate_data.py
      - name: Run model tests
        run: pytest tests/model/ -v

  train:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Pull data from DVC
        run: |
          pip install dvc[s3]
          dvc pull
      - name: Train model
        run: |
          python src/train.py
          dvc repro
      - name: Evaluate model
        run: python src/evaluate.py
      - name: Check quality gates
        run: |
          python - << 'EOF'
          import json
          metrics = json.load(open('metrics/eval.json'))
          assert metrics['accuracy'] > 0.85, f"Accuracy {metrics['accuracy']} below threshold"
          assert metrics['f1_score'] > 0.80, f"F1 {metrics['f1_score']} below threshold"
          print('All quality gates passed')
          EOF

  deploy:
    needs: train
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - name: Deploy model
        run: |
          mlflow models serve -m "models:/customer-churn-model/Production" \
            -p 8000 --no-conda
```
Configuration Reference
MLOps Maturity Levels
| Level | Description | Practices |
|---|---|---|
| Level 0 | Manual | Jupyter notebooks, manual deployment, no monitoring |
| Level 1 | Automated Training | Experiment tracking, DVC, automated training pipeline |
| Level 2 | CI/CD for ML | Automated testing, model validation gates, automated deployment |
| Level 3 | Full Automation | Automated retraining on drift, A/B testing, feature stores |
Tool Categories
| Category | Tools | Purpose |
|---|---|---|
| Experiment Tracking | MLflow, W&B, Neptune, CometML | Log parameters, metrics, artifacts |
| Data Versioning | DVC, LakeFS, Delta Lake | Track data and model versions |
| Pipeline Orchestration | Airflow, Prefect, ZenML, Kubeflow | Automate ML workflows |
| Feature Store | Feast, Tecton, Hopsworks | Manage and serve ML features |
| Model Serving | BentoML, Triton, TorchServe, Seldon | Deploy models as APIs |
| Monitoring | Evidently, WhyLabs, NannyML, Arize | Detect drift and degradation |
| Cloud Platforms | SageMaker, Vertex AI, Azure ML | End-to-end managed MLOps |
Key Metrics to Monitor
| Metric | Description | Alert Threshold |
|---|---|---|
| Prediction latency (p99) | 99th percentile response time | > 500ms |
| Prediction throughput | Requests per second | < 80% of capacity |
| Data drift score | Statistical distance from training data | > 0.1 (PSI) |
| Model accuracy | Rolling accuracy on labeled samples | < training accuracy - 5% |
| Feature missing rate | Percentage of null/missing features | > 5% |
| Memory usage | Model serving process memory | > 80% of limit |
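The drift-score threshold above refers to the Population Stability Index (PSI), which compares the binned distribution of a feature in production against its distribution in training data. A minimal sketch of the computation, using synthetic data and an illustrative function name (not from any monitoring library):

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI = sum over bins of (cur% - ref%) * ln(cur% / ref%)."""
    # Bin edges from reference quantiles, so each bin holds ~1/n_bins of reference
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the reference range
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    # Clip to avoid log(0) / division by zero for empty bins
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=10_000)   # "training" feature
stable = rng.normal(loc=0.0, scale=1.0, size=10_000)      # same distribution
shifted = rng.normal(loc=0.5, scale=1.0, size=10_000)     # mean shifted by 0.5 sd

print(f"PSI (stable):  {population_stability_index(reference, stable):.4f}")
print(f"PSI (shifted): {population_stability_index(reference, shifted):.4f}")
```

With the common rule of thumb, PSI below 0.1 means no significant shift, 0.1 to 0.25 warrants investigation, and above 0.25 indicates a major distribution change.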
Best Practices
- Version everything: Data, code, models, configurations, and environments must all be versioned. Use Git for code, DVC for data and models, and Docker for environments. If you cannot reproduce a result from six months ago, your MLOps is incomplete.
- Implement quality gates in CI/CD: Do not deploy models that fail validation. Define minimum accuracy, F1 score, latency, and fairness thresholds. Automate these checks in your deployment pipeline so human error cannot bypass them.
- Monitor data drift, not just model accuracy: Model accuracy degrades silently as input data distributions shift. Use statistical tests (PSI, KS test, Jensen-Shannon divergence) to detect drift before it impacts predictions.
- Use a feature store for shared feature engineering: Feature stores prevent training-serving skew by ensuring the same feature computation logic is used in both training and inference. Feast is a solid open-source option.
- Separate training and serving infrastructure: Training requires GPUs and burst compute. Serving requires low latency and high availability. Do not conflate these requirements. Use separate infrastructure optimized for each workload.
- Implement shadow deployments before blue-green: Before switching production traffic to a new model, run it in shadow mode alongside the current model. Compare predictions without impacting users.
- Automate retraining with drift triggers: Set up automated pipelines that retrain models when data drift exceeds thresholds or when a scheduled evaluation shows performance degradation. Include human approval gates for production promotion.
- Log prediction inputs and outputs: Store prediction requests and responses for debugging, auditing, and future retraining. Implement sampling for high-throughput systems to manage storage costs.
- Use model registries for lifecycle management: MLflow Model Registry or equivalent provides stage transitions (Staging, Production, Archived), approval workflows, and lineage tracking. Never deploy models without going through the registry.
- Document your ML system architecture: Maintain a diagram showing data flows, training pipelines, serving infrastructure, and monitoring systems. This documentation is essential for onboarding, debugging, and compliance.
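The shadow-deployment practice above can be sketched as a serving wrapper that returns only the champion model's prediction while silently logging the candidate's output for offline comparison. Everything here is a stand-in (simple threshold "models", an in-memory log) rather than a real serving framework:

```python
import random

def champion_predict(features):
    # Current production model (stand-in: threshold on a single feature)
    return 1 if features["score"] > 0.5 else 0

def shadow_predict(features):
    # Candidate model running in shadow mode (stand-in: a different threshold)
    return 1 if features["score"] > 0.55 else 0

def serve(features, shadow_log):
    served = champion_predict(features)  # only this prediction reaches the user
    shadow = shadow_predict(features)    # computed and logged, never returned
    shadow_log.append({"champion": served, "shadow": shadow})
    return served

random.seed(0)
log = []
for _ in range(1_000):
    serve({"score": random.random()}, log)

# Offline, compare the two models on identical traffic before any cutover
disagreement = sum(r["champion"] != r["shadow"] for r in log) / len(log)
print(f"Champion/shadow disagreement rate: {disagreement:.1%}")
```

A high disagreement rate flags the candidate for review before it ever serves traffic; in a real system the log would go to durable storage and be joined with delayed ground-truth labels.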
Troubleshooting
Model performs well in training but poorly in production
This is usually caused by training-serving skew. Verify that feature engineering logic is identical between training and serving. Check for data leakage in training. Compare feature distributions between training data and production data using Evidently.

Retraining pipeline produces worse models
Add quality gates that compare new model metrics against the current production model. Only promote models that meet or exceed current performance. Log training data snapshots to diagnose data quality issues.

Model serving latency exceeds SLA
Profile the serving pipeline to identify bottlenecks (feature computation, model inference, postprocessing). Consider model optimization (quantization, pruning, distillation). Use batching for throughput-oriented workloads. Scale horizontally with a load balancer.

Data drift detected but model accuracy is stable
Not all drift affects model performance. Review which features are drifting and whether they are important to the model. Update your drift detection thresholds based on actual impact. Some drift is seasonal and expected.

Experiment tracking database growing too large
Archive old experiments and delete artifacts for failed runs. Set retention policies for logged artifacts. Use artifact storage (S3, GCS) instead of the tracking database for large files.
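The metric-comparison gate suggested for retraining pipelines can be sketched as a metric-by-metric check against the current production model. The function name, tolerance, and metric values below are illustrative:

```python
def passes_quality_gates(candidate, production, min_delta=-0.005):
    """Promote only if the candidate meets or exceeds production on every metric.

    min_delta is a small tolerance that absorbs run-to-run training noise.
    Returns (passed, list_of_failed_metric_names).
    """
    failures = [
        name for name in production
        if candidate.get(name, 0.0) - production[name] < min_delta
    ]
    return len(failures) == 0, failures

# Metrics for the current production model vs. a freshly retrained candidate
production_metrics = {"accuracy": 0.87, "f1_score": 0.82}
candidate_metrics = {"accuracy": 0.88, "f1_score": 0.80}  # F1 regressed

ok, failed = passes_quality_gates(candidate_metrics, production_metrics)
print("promote" if ok else f"reject (failed: {failed})")
```

In practice the two metric dictionaries would come from your evaluation artifacts (for example, the `metrics/eval.json` files of the candidate run and the production run), and a rejected candidate would never reach the model registry's Production stage.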