There is a painful truth in the ML industry: by one widely cited estimate, 87% of ML projects never make it to production.
Why? Because training a model to 95% accuracy in a Jupyter Notebook is one thing. Deploying that model to serve millions of users, monitoring its performance, retraining it when the data changes, and maintaining it for 2-3 years? That is an entirely different story.
MLOps (Machine Learning Operations) is the discipline that closes the gap between research and production. It combines ML, DevOps, and Data Engineering to build ML systems that are reliable, scalable, and maintainable.
According to Google's paper "Hidden Technical Debt in Machine Learning Systems", only about 5% of the code in an ML system is actual ML code. The other 95%? Infrastructure, data collection, monitoring, deployment, and configuration.
Unlike traditional software engineering (code → test → deploy → done), ML projects are a continuous cycle.
1. Problem Definition & Data Collection
        ↓
2. Data Preparation & EDA
        ↓
3. Feature Engineering
        ↓
4. Model Training & Experimentation
        ↓
5. Model Evaluation
        ↓
6. Model Deployment
        ↓
7. Monitoring & Maintenance
        ↓
8. Data Drift Detection
        ↓
   Retrain? → YES → back to Step 3
Key differences vs traditional software:
| Traditional Software | ML Systems |
|---|---|
| Code-centric | Data-centric |
| Deterministic | Probabilistic |
| Manual testing | Automated + Manual testing |
| Deploy once | Continuous retraining |
| Static behavior | Behavior changes over time |
Problem definition is the most important phase: get it wrong here and all downstream effort is wasted.
Questions to ask:
Is ML the right solution?
What is the business metric?
What is the ML metric?
Example: E-commerce Product Recommendation
Business Goal: Increase revenue
Business Metric: Revenue per user, Conversion rate
ML Problem: Recommendation (predict items the user will buy)
ML Metric: Precision@K, Recall@K, NDCG (see the sketch below)
Success Criteria: +10% click-through rate, +5% conversion
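To make the ranking metrics concrete, here is a minimal sketch of Precision@K and Recall@K for a single user; the helper functions and item IDs are purely illustrative.

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations the user actually interacted with."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the top-k recommendations."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(relevant) if relevant else 0.0

# Example: the model recommends 5 items, the user actually bought 3 items, 2 of which were recommended
recommended = ["A", "B", "C", "D", "E"]
relevant = {"B", "E", "F"}
print(precision_at_k(recommended, relevant, k=5))  # 2/5 = 0.4
print(recall_at_k(recommended, relevant, k=5))     # 2/3 ≈ 0.67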
The steps: data collection, cleaning, feature engineering, training, evaluation.
Key point: Track everything - data versions, code versions, hyperparameters, metrics.
Deployment is not just "copying the model to a server".
Deployment strategies:
1. Shadow Deployment
Request → Primary Model (v1.0) → Response to User
       └─→ Shadow Model (v2.0) → Log predictions (never served to users)
Compare v1.0 vs v2.0 predictions:
- Are they similar? (sanity check)
- Is v2.0 better on recent data?
Use case: test a new model without affecting users. (Note that this differs from a canary release, which routes a small share of real traffic to the new model.)
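A minimal sketch of how a prediction service might mirror traffic to a shadow model; the model objects and logger are passed in and purely illustrative, and in practice the shadow call would usually run asynchronously so it cannot add latency.

def handle_request(features, primary_model, shadow_model, logger):
    # Primary model: its prediction is what the user receives.
    primary_pred = primary_model.predict(features)

    # Shadow model: same input, but the result is only logged, never served.
    try:
        shadow_pred = shadow_model.predict(features)
        logger.log({
            "features": features,
            "primary": primary_pred,
            "shadow": shadow_pred,
        })
    except Exception as exc:
        # A failing shadow model must never break the user-facing path.
        logger.log({"shadow_error": str(exc)})

    return primary_pred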
2. A/B Testing
50% Users → Model v1.0
50% Users → Model v2.0
Compare metrics:
- Click-through rate
- Conversion rate
- Revenue per user
Statistical significance is required (to rule out v2.0 only looking better by chance).
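One common way to check this for conversion rates is a two-proportion z-test; a sketch using statsmodels, with illustrative counts:

from statsmodels.stats.proportion import proportions_ztest

# Conversions and user counts per variant (illustrative numbers)
conversions = [520, 580]    # v1.0, v2.0
users = [10_000, 10_000]

z_stat, p_value = proportions_ztest(conversions, users)
if p_value < 0.05:
    print("Difference is statistically significant")
else:
    print("Not enough evidence that v2.0 is better; keep collecting data")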
3. Multi-Armed Bandit
Dynamic allocation based on performance:
Initially: 50% v1.0, 50% v2.0
After 1000 requests:
v2.0 performs better → 30% v1.0, 70% v2.0
After 5000 requests:
v2.0 clearly wins → 10% v1.0, 90% v2.0
Advantage: traffic allocation is optimized automatically, minimizing regret.
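As an illustration, here is a minimal Thompson-sampling router, one of several bandit algorithms that implement this idea; the class, the success/failure counters, and the model names are illustrative, not a specific product's API.

import random

class ThompsonSamplingRouter:
    """Route each request to the model whose sampled conversion rate is highest."""

    def __init__(self, model_names):
        # Beta(1, 1) prior = uniform; counters are updated from observed outcomes.
        self.stats = {name: {"success": 1, "failure": 1} for name in model_names}

    def choose_model(self):
        samples = {
            name: random.betavariate(s["success"], s["failure"])
            for name, s in self.stats.items()
        }
        return max(samples, key=samples.get)

    def record_outcome(self, name, converted):
        key = "success" if converted else "failure"
        self.stats[name][key] += 1

# Usage: better-performing models gradually receive more traffic.
router = ThompsonSamplingRouter(["v1.0", "v2.0"])
model = router.choose_model()
# ... serve the prediction with `model`, observe whether the user converted ...
router.record_outcome(model, converted=True)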
4. Blue-Green Deployment
Blue (v1.0): Currently serving 100%
Green (v2.0): Deployed but receiving 0% traffic
Test Green thoroughly → Switch 100% of traffic to Green
Keep Blue as a backup → instant rollback if needed
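In application code the essence is that the switch (and any rollback) is a pointer update rather than a redeploy. A minimal in-process sketch, assuming the two model files named in the registry section further below (in practice the switch usually happens at the load balancer or service mesh):

import joblib

# Both versions stay deployed side by side; the router holds a single "active" pointer.
deployments = {
    "blue": joblib.load("fraud_detection_v1.0.pkl"),    # currently live
    "green": joblib.load("fraud_detection_v2.0.pkl"),   # deployed, receiving 0% traffic
}
active = "blue"

def predict(features):
    return deployments[active].predict(features)

# After Green passes its checks, switching (and rolling back) is one assignment:
# active = "green"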
"If you can't measure it, you can't improve it."
What to monitor:
1. Model Performance Metrics
# Log predictions and actuals
{
    "timestamp": "2024-02-01T10:30:00",
    "prediction": 0.85,
    "actual": 1,          # if/when available
    "user_id": "12345",
    "features": {...}
}

# Calculate metrics periodically
daily_accuracy = compute_accuracy(predictions, actuals)
if daily_accuracy < threshold:
    alert_team()
Challenge: actuals (ground truth) are often not available immediately.
Example: in the recommendation scenario above, a click arrives within minutes, but whether the user actually converts (or returns the product) may only be known days or weeks later.
→ Use proxy metrics (such as click-through rate) while waiting for the actuals.
2. Data Quality
# Monitor input distributions
daily_mean_age = input_data['age'].mean()
if abs(daily_mean_age - baseline_mean_age) > threshold:
alert("Data distribution shift detected!")
# Missing values
missing_rate = input_data.isnull().sum() / len(input_data)
if missing_rate > 0.05:
alert("High missing rate!")
# Outliers
outliers = detect_outliers(input_data)
if len(outliers) > threshold:
alert("Unusual data detected!")
3. System Metrics
# Latency (response time)
p50_latency_ms = 50     # median
p95_latency_ms = 200    # 95th percentile
p99_latency_ms = 500    # 99th percentile
if p95_latency_ms > sla_ms:
    alert("Latency SLA violation!")

# Throughput
requests_per_second = 1000
if requests_per_second > capacity:
    autoscale()

# Error rate
error_rate = errors / total_requests
if error_rate > 0.01:  # 1%
    alert("High error rate!")
4. Business Metrics
Revenue impact, User engagement, Conversion rate
Link ML metrics → business metrics to prove the model's value.
Data Drift: the distribution of the input data changes over time.
Example: Credit Scoring Model
Training Data (2020):
- Average income: $50k
- Average age: 35
- Smartphone ownership: 70%
Production Data (2024):
- Average income: $65k ← Drift!
- Average age: 32 ← Drift!
- Smartphone ownership: 95% ← Drift!
A model trained on 2020 data → its predictions are no longer accurate.
Types of drift:
1. Covariate Shift (Feature drift)
P(X) changes, but P(Y|X) stays same
Example:
- Training: 70% desktop users, 30% mobile
- Production: 30% desktop, 70% mobile
The learned relationship P(Y|X) is still valid, but the model now sees a different X distribution (and may be less reliable in regions it rarely saw during training).
2. Concept Drift
P(Y|X) changes (relationship between features and target)
Example: Spam Detection
- 2010: "Nigerian prince" → Spam
- 2024: "Nigerian prince" → Rare, không còn phổ biến
New spam: AI-generated emails, phishing links
Patterns thay đổi → Model needs retrain
3. Label Drift
P(Y) changes (the distribution of the target variable)
Example: Fraud Detection
- Training: 1% fraud rate
- Production: 5% fraud rate (new fraud techniques)
Detecting Drift:
Statistical Tests:
from scipy.stats import ks_2samp
# Kolmogorov-Smirnov Test
statistic, p_value = ks_2samp(training_data['age'],
production_data['age'])
if p_value < 0.05:
print("Significant distribution change detected!")
Population Stability Index (PSI):
PSI = Σ over bins of (actual% - expected%) * ln(actual% / expected%)
PSI < 0.1: No significant change
PSI 0.1-0.25: Moderate change (investigate)
PSI > 0.25: Significant change (retrain!)
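A minimal PSI implementation, assuming a continuous feature binned by training-set percentiles; the function name, the bin count, and the small clipping constant are illustrative choices.

import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a training (expected) and production (actual) sample."""
    # Bin edges come from the training distribution (assumes a continuous feature).
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf   # catch values outside the training range

    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # Avoid division by zero / log(0) in empty bins.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)

    return np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))

# Example, reusing the data from the KS test above:
# psi_value = psi(training_data['age'], production_data['age'])
# if psi_value > 0.25: investigate and consider retraining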
Monitoring Predictions:
# Track the prediction distribution
baseline_mean_prediction = 0.15
current_mean_prediction = 0.35   # Spike!

# Alert if it drifts
if abs(current_mean_prediction - baseline_mean_prediction) > threshold:
    investigate_drift()
Nightmare scenario:
Data Scientist: "The model hits 92% accuracy!"
[One month later]
Engineer: "I can only reproduce 87%... What happened?"
Sources of non-reproducibility:
# ❌ BAD: no seed, results change on every run
np.random.shuffle(data)
train, test = split(data)

# ✅ GOOD: fix every source of randomness
import random
import numpy as np
import tensorflow as tf
import torch

np.random.seed(42)
random.seed(42)
tf.random.set_seed(42)
torch.manual_seed(42)
train, test = split(data)
Set seeds for: NumPy, Python's built-in random, TensorFlow, and PyTorch (exactly as in the snippet above).
Problem: the data changes, but those changes are not tracked.
v1.0: trained on data_2024_01_15.csv
v1.1: trained on data_2024_02_01.csv ← which data produced which model?
Solution: Data Version Control (DVC)
# Track data with DVC
dvc add data/training_data.csv
git add data/training_data.csv.dvc
git commit -m "Add training data v1.0"

# Each code version → linked to a specific data version
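DVC also exposes a Python API, so a training script can read the data exactly as it was at a given Git revision; the tag name "v1.0" below is illustrative.

import pandas as pd
import dvc.api

# Open data/training_data.csv as it was at Git revision "v1.0" of this repo
with dvc.api.open("data/training_data.csv", rev="v1.0") as f:
    train_df = pd.read_csv(f)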
"Works on my machine" problem:
Dev machine: Python 3.8, TensorFlow 2.10, scikit-learn 1.0
Production: Python 3.9, TensorFlow 2.12, scikit-learn 1.2
→ Different results!
Solutions:
Docker:
FROM python:3.8
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "train.py"]
requirements.txt (pinned versions):
tensorflow==2.10.0
scikit-learn==1.0.2
pandas==1.4.3
numpy==1.23.1
Poetry/Pipenv: lock dependencies, including transitive dependencies.
Track everything about each experiment:
import mlflow
mlflow.start_run()
# Log parameters
mlflow.log_param("learning_rate", 0.001)
mlflow.log_param("batch_size", 32)
mlflow.log_param("model_type", "ResNet50")
# Log metrics
mlflow.log_metric("train_accuracy", 0.92)
mlflow.log_metric("val_accuracy", 0.88)
# Log model
mlflow.sklearn.log_model(model, "model")
# Log artifacts (plots, confusion matrix)
mlflow.log_artifact("confusion_matrix.png")
mlflow.end_run()
MLflow UI: Compare experiments side-by-side.
Experiment 1: lr=0.001, batch=32 → val_acc=0.88
Experiment 2: lr=0.0001, batch=64 → val_acc=0.91 ← Better!
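Beyond the UI, runs can also be compared programmatically with `mlflow.search_runs`, which returns a pandas DataFrame; the metric and parameter names below match the logging calls above.

import mlflow

# One row per run; sort by validation accuracy, best first
runs = mlflow.search_runs(order_by=["metrics.val_accuracy DESC"])
best = runs.iloc[0]
print(best["run_id"], best["params.learning_rate"], best["metrics.val_accuracy"])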
Challenge: many models, many versions; which one is deployed where?
fraud_detection_v1.0.pkl ← Production
fraud_detection_v1.1.pkl ← Staging
fraud_detection_v2.0.pkl ← Experimental
fraud_detection_v2.1.pkl ← ???
Model Registry: a central repository for models.
Workflow:
1. Train the model
2. Register the model in the registry:
   - Model binary/weights
   - Metadata (hyperparameters, metrics)
   - Training data version
   - Code version (Git commit)
3. Tag the model:
   - "staging", "production", "archived"
4. Deploy the model from the registry
5. Track which version is deployed where
Example with MLflow:
import mlflow
from mlflow.tracking import MlflowClient

# Register the model
mlflow.register_model(
    "runs:/<run_id>/model",
    "fraud_detection"
)

# Promote it to Production
client = MlflowClient()
client.transition_model_version_stage(
    name="fraud_detection",
    version=3,
    stage="Production"
)

# Load the Production model
model = mlflow.pyfunc.load_model(
    model_uri="models:/fraud_detection/Production"
)
Benefits: a single source of truth for every model, full lineage (data, code, hyperparameters, metrics) per version, and controlled promotion or rollback between stages.
The fully manual workflow (no MLOps) looks like this:
Data Scientist:
1. Manually extract data from the database
2. Train the model on a laptop
3. Save model.pkl
4. Email the engineer: "Please deploy this!"
5. The engineer manually copies it to the server
Problems:
- Not reproducible
- Not automated
- No monitoring
- No versioning
The next step up: an automated training pipeline. Characteristics:
1. Scheduled data extraction
2. Automated preprocessing
3. Automated training
4. Automated evaluation
5. Model registry
Deployment: still manual or semi-automated
Improvements: training is reproducible, runs on a schedule instead of on someone's laptop, and every model version lands in the registry.
Still missing: automated deployment, production monitoring, and drift-triggered retraining.
Full MLOps:
1. Code commit → Trigger CI/CD
2. Automated tests (data validation, model tests)
3. Automated training on new data
4. Automated evaluation
5. Automated deployment (if metrics meet threshold)
6. Automated monitoring
7. Automated retraining (if drift detected)
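As a sketch of step 7, the monitoring job can own the retraining decision. The thresholds below mirror the retrain triggers documented further down in this article, and `trigger_training_pipeline` is an illustrative stand-in for starting an Airflow/Kubeflow run; the inputs would come from the monitoring snippets above.

def should_retrain(val_accuracy, psi_value,
                   accuracy_floor=0.85, psi_threshold=0.25):
    """Retrain when performance degrades or the input data has drifted."""
    return val_accuracy < accuracy_floor or psi_value > psi_threshold

# Run by the monitoring job (e.g. daily), feeding in the metrics computed above:
# if should_retrain(daily_accuracy, psi_value):
#     trigger_training_pipeline()   # illustrative: e.g. start an Airflow DAG run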
Characteristics: the whole train-evaluate-deploy-monitor loop runs automatically (continuous training); engineers review and approve rather than execute each step by hand.
Tools ecosystem:
Data Versioning: DVC, Pachyderm
Experiment Tracking: MLflow, Weights & Biases, Neptune
Model Registry: MLflow, Seldon
Pipeline Orchestration: Airflow, Kubeflow, Prefect
Monitoring: Evidently AI, Whylabs, Fiddler
✅ Start simple, iterate: you do not need the full automated stack on day one; add versioning, tracking, and automation as the pain points appear.
✅ Separate training and serving code (but share the preprocessing logic, as sketched below):
/training
- train.py
- preprocess.py
- evaluate.py
/serving
- api.py
- preprocess.py ← SAME as training
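A minimal sketch of such a shared preprocess.py; the column names and the idea of passing in training-time statistics (so serving never recomputes them) are illustrative.

# preprocess.py: imported by both train.py and api.py, so the exact same
# transformations run at training time and at serving time.
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame, age_median: float) -> pd.DataFrame:
    """Single source of truth for feature transformations.

    Statistics learned from the training data (like age_median) are passed in,
    so serving reuses the training-time values instead of recomputing them.
    """
    df = df.copy()
    df["age"] = df["age"].fillna(age_median)      # same imputation in train & serve
    df["log_income"] = np.log1p(df["income"])     # same transform in train & serve
    return df

# train.py:  age_median = raw_df["age"].median()  (persist it alongside the model)
#            X = preprocess(raw_df, age_median)
# api.py:    X = preprocess(pd.DataFrame([payload]), age_median_loaded_with_model)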
✅ Version everything: code (Git), data (DVC), models (model registry), and environments (Docker, pinned requirements).
✅ Monitor actively, not passively: set alerts on model, data, and system metrics instead of waiting for users to report that something is wrong.
✅ Document assumptions:
"""
Model: Random Forest Classifier
Training Data: user_behavior_2024_01.csv (50k samples)
Assumptions:
- Users active in last 30 days
- Exclude users with <5 sessions
- Feature 'age' missing → impute with median
Expected Performance:
- Train Accuracy: 0.92
- Val Accuracy: 0.88
- Precision: 0.85
- Recall: 0.90
Retrain Trigger:
- Val Accuracy drops below 0.85
- Data drift PSI > 0.25
"""
✅ Test it like software:
import time

def test_model_input_shape():
    assert model.input_shape == (None, 10)

def test_model_output_range():
    pred = model.predict(sample_input)
    assert 0.0 <= float(pred) <= 1.0

def test_model_inference_time():
    start = time.time()
    model.predict(sample_input)
    latency = time.time() - start
    assert latency < 0.1  # 100 ms
✅ Graceful degradation:
try:
prediction = ml_model.predict(features)
except Exception as e:
log_error(e)
# Fallback to rule-based or simple heuristic
prediction = fallback_model.predict(features)
Critical mindset shift:
Research mindset: "The model hits 95% accuracy!"
Production mindset: "The model hits 95% accuracy, can be deployed, serves 1,000 req/s,
is monitored, retrains automatically when drift is detected, and can be maintained for two years."
Next steps:
In the next article, we will step into the world of Generative AI & LLMs: exploring the Transformer architecture and how Large Language Models work.
This article is part of the series "From Zero to AI Engineer" - Module 5: Deep Learning & Computer Vision