There is a painful truth in the ML industry: by one widely cited estimate, 87% of ML projects never make it to production.
Why? Because training a model to 95% accuracy in a Jupyter Notebook is one thing. Deploying that model to serve millions of users, monitoring its performance, retraining it when the data changes, and maintaining it for 2-3 years? That is an entirely different story.
MLOps (Machine Learning Operations) is the discipline that closes the gap between research and production. It combines ML, DevOps, and Data Engineering to build ML systems that are reliable, scalable, and maintainable.
According to Google's paper "Hidden Technical Debt in Machine Learning Systems", only about 5% of the code in an ML system is actual ML code. The other 95%? Infrastructure, data collection, monitoring, deployment, and configuration.
Unlike traditional software engineering (code → test → deploy → done), ML projects are a continuous cycle.
1. Problem Definition & Data Collection
        ↓
2. Data Preparation & EDA
        ↓
3. Feature Engineering
        ↓
4. Model Training & Experimentation
        ↓
5. Model Evaluation
        ↓
6. Model Deployment
        ↓
7. Monitoring & Maintenance
        ↓
8. Data Drift Detection
        ↓
   Retrain? → YES → back to Step 3
Key differences vs traditional software:
| Traditional Software | ML Systems |
|---|---|
| Code-centric | Data-centric |
| Deterministic | Probabilistic |
| Manual testing | Automated + Manual testing |
| Deploy once | Continuous retraining |
| Static behavior | Behavior changes over time |
Problem definition is the most important phase: get it wrong here and all downstream effort is wasted.
Questions to ask:
Is ML the right solution?
What is the business metric?
What is the ML metric?
Example: E-commerce Product Recommendation
Business Goal: Increase revenue
Business Metric: Revenue per user, Conversion rate
ML Problem: Recommendation (predict items the user will buy)
ML Metric: Precision@K, Recall@K, NDCG (see the sketch below)
Success Criteria: +10% click-through rate, +5% conversion
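To make the ranking metrics concrete, here is a minimal sketch of Precision@K and Recall@K for a single user; the helper functions and item IDs are purely illustrative.

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations the user actually interacted with."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the top-k recommendations."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(relevant) if relevant else 0.0

# Example: the model recommends 5 items, the user actually bought 3 items, 2 of which were recommended
recommended = ["A", "B", "C", "D", "E"]
relevant = {"B", "E", "F"}
print(precision_at_k(recommended, relevant, k=5))  # 2/5 = 0.4
print(recall_at_k(recommended, relevant, k=5))     # 2/3 ≈ 0.67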
The steps: data collection, cleaning, feature engineering, training, evaluation.
Key point: Track everything - data versions, code versions, hyperparameters, metrics.
Deployment is not just "copying the model to a server".
Deployment strategies:
1. Shadow Deployment
Request → Primary Model (v1.0) → Response to User
       └─→ Shadow Model (v2.0) → Log predictions (never served to users)
Compare v1.0 vs v2.0 predictions:
- Are they similar? (sanity check)
- Is v2.0 better on recent data?
Use case: test a new model without affecting users. (Note that this differs from a canary release, which routes a small share of real traffic to the new model.)
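A minimal sketch of how a prediction service might mirror traffic to a shadow model; the model objects and logger are passed in and purely illustrative, and in practice the shadow call would usually run asynchronously so it cannot add latency.

def handle_request(features, primary_model, shadow_model, logger):
    # Primary model: its prediction is what the user receives.
    primary_pred = primary_model.predict(features)

    # Shadow model: same input, but the result is only logged, never served.
    try:
        shadow_pred = shadow_model.predict(features)
        logger.log({
            "features": features,
            "primary": primary_pred,
            "shadow": shadow_pred,
        })
    except Exception as exc:
        # A failing shadow model must never break the user-facing path.
        logger.log({"shadow_error": str(exc)})

    return primary_pred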
2. A/B Testing
50% Users → Model v1.0
50% Users → Model v2.0
Compare metrics:
- Click-through rate
- Conversion rate
- Revenue per user
Statistical significance is required (to rule out v2.0 only looking better by chance).
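One common way to check this for conversion rates is a two-proportion z-test; a sketch using statsmodels, with illustrative counts:

from statsmodels.stats.proportion import proportions_ztest

# Conversions and user counts per variant (illustrative numbers)
conversions = [520, 580]    # v1.0, v2.0
users = [10_000, 10_000]

z_stat, p_value = proportions_ztest(conversions, users)
if p_value < 0.05:
    print("Difference is statistically significant")
else:
    print("Not enough evidence that v2.0 is better; keep collecting data")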
3. Multi-Armed Bandit
Dynamic allocation based on performance:
Initially: 50% v1.0, 50% v2.0
After 1000 requests:
v2.0 performs better → 30% v1.0, 70% v2.0
After 5000 requests:
v2.0 clearly wins → 10% v1.0, 90% v2.0
Advantage: traffic allocation is optimized automatically, minimizing regret.
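As an illustration, here is a minimal Thompson-sampling router, one of several bandit algorithms that implement this idea; the class, the success/failure counters, and the model names are illustrative, not a specific product's API.

import random

class ThompsonSamplingRouter:
    """Route each request to the model whose sampled conversion rate is highest."""

    def __init__(self, model_names):
        # Beta(1, 1) prior = uniform; counters are updated from observed outcomes.
        self.stats = {name: {"success": 1, "failure": 1} for name in model_names}

    def choose_model(self):
        samples = {
            name: random.betavariate(s["success"], s["failure"])
            for name, s in self.stats.items()
        }
        return max(samples, key=samples.get)

    def record_outcome(self, name, converted):
        key = "success" if converted else "failure"
        self.stats[name][key] += 1

# Usage: better-performing models gradually receive more traffic.
router = ThompsonSamplingRouter(["v1.0", "v2.0"])
model = router.choose_model()
# ... serve the prediction with `model`, observe whether the user converted ...
router.record_outcome(model, converted=True)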
4. Blue-Green Deployment
Blue (v1.0): Currently serving 100%
Green (v2.0): Deployed but receiving 0% traffic
Test Green thoroughly → Switch 100% of traffic to Green
Keep Blue as a backup → instant rollback if needed
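In application code the essence is that the switch (and any rollback) is a pointer update rather than a redeploy. A minimal in-process sketch, assuming the two model files named in the registry section further below (in practice the switch usually happens at the load balancer or service mesh):

import joblib

# Both versions stay deployed side by side; the router holds a single "active" pointer.
deployments = {
    "blue": joblib.load("fraud_detection_v1.0.pkl"),    # currently live
    "green": joblib.load("fraud_detection_v2.0.pkl"),   # deployed, receiving 0% traffic
}
active = "blue"

def predict(features):
    return deployments[active].predict(features)

# After Green passes its checks, switching (and rolling back) is one assignment:
# active = "green"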
"If you can't measure it, you can't improve it."
What to monitor:
1. Model Performance Metrics
# Log predictions and actuals
{
    "timestamp": "2024-02-01T10:30:00",
    "prediction": 0.85,
    "actual": 1,          # if/when available
    "user_id": "12345",
    "features": {...}
}

# Calculate metrics periodically
daily_accuracy = compute_accuracy(predictions, actuals)
if daily_accuracy < threshold:
    alert_team()
Challenge: actuals (ground truth) are often not available immediately.
Example: in the recommendation scenario above, a click arrives within minutes, but whether the user actually converts (or returns the product) may only be known days or weeks later.
→ Use proxy metrics (such as click-through rate) while waiting for the actuals.
2. Data Quality
# Monitor input distributions
daily_mean_age = input_data['age'].mean()
if abs(daily_mean_age - baseline_mean_age) > threshold:
alert("Data distribution shift detected!")
# Missing values
missing_rate = input_data.isnull().sum() / len(input_data)
if missing_rate > 0.05:
alert("High missing rate!")
# Outliers
outliers = detect_outliers(input_data)
if len(outliers) > threshold:
alert("Unusual data detected!")
3. System Metrics
# Latency (response time)
p50_latency_ms = 50     # median
p95_latency_ms = 200    # 95th percentile
p99_latency_ms = 500    # 99th percentile
if p95_latency_ms > sla_ms:
    alert("Latency SLA violation!")

# Throughput
requests_per_second = 1000
if requests_per_second > capacity:
    autoscale()

# Error rate
error_rate = errors / total_requests
if error_rate > 0.01:  # 1%
    alert("High error rate!")
4. Business Metrics
Revenue impact, User engagement, Conversion rate
Link ML metrics → business metrics to prove the model's value.
Data Drift: the distribution of the input data changes over time.
Example: Credit Scoring Model
Training Data (2020):
- Average income: $50k
- Average age: 35
- Smartphone ownership: 70%
Production Data (2024):
- Average income: $65k ← Drift!
- Average age: 32 ← Drift!
- Smartphone ownership: 95% ← Drift!
A model trained on 2020 data → its predictions are no longer accurate.
Types of drift:
1. Covariate Shift (Feature drift)
P(X) changes, but P(Y|X) stays same
Example:
- Training: 70% desktop users, 30% mobile
- Production: 30% desktop, 70% mobile
The learned relationship P(Y|X) is still valid, but the model now sees a different X distribution (and may be less reliable in regions it rarely saw during training).
2. Concept Drift
P(Y|X) changes (relationship between features and target)
Example: Spam Detection
- 2010: "Nigerian prince" → Spam
- 2024: "Nigerian prince" → Rare, không còn phổ biến
New spam: AI-generated emails, phishing links
Patterns thay đổi → Model needs retrain
3. Label Drift
P(Y) changes (the distribution of the target variable)
Example: Fraud Detection
- Training: 1% fraud rate
- Production: 5% fraud rate (new fraud techniques)
Detecting Drift:
Statistical Tests:
from scipy.stats import ks_2samp
# Kolmogorov-Smirnov Test
statistic, p_value = ks_2samp(training_data['age'],
production_data['age'])
if p_value < 0.05:
print("Significant distribution change detected!")
Population Stability Index (PSI):
PSI = Σ over bins of (actual% - expected%) * ln(actual% / expected%)
PSI < 0.1: No significant change
PSI 0.1-0.25: Moderate change (investigate)
PSI > 0.25: Significant change (retrain!)
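A minimal PSI implementation, assuming a continuous feature binned by training-set percentiles; the function name, the bin count, and the small clipping constant are illustrative choices.

import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a training (expected) and production (actual) sample."""
    # Bin edges come from the training distribution (assumes a continuous feature).
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf   # catch values outside the training range

    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # Avoid division by zero / log(0) in empty bins.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)

    return np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))

# Example, reusing the data from the KS test above:
# psi_value = psi(training_data['age'], production_data['age'])
# if psi_value > 0.25: investigate and consider retraining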
Monitoring Predictions:
# Track the prediction distribution
baseline_mean_prediction = 0.15
current_mean_prediction = 0.35   # Spike!

# Alert if it drifts
if abs(current_mean_prediction - baseline_mean_prediction) > threshold:
    investigate_drift()
Nightmare scenario:
Data Scientist: "The model hits 92% accuracy!"
[One month later]
Engineer: "I can only reproduce 87%... What happened?"
Sources of non-reproducibility:
# ❌ BAD: no seed, results change on every run
np.random.shuffle(data)
train, test = split(data)

# ✅ GOOD: fix every source of randomness
import random
import numpy as np
import tensorflow as tf
import torch

np.random.seed(42)
random.seed(42)
tf.random.set_seed(42)
torch.manual_seed(42)
train, test = split(data)
Set seeds for: NumPy, Python's built-in random, TensorFlow, and PyTorch (exactly as in the snippet above).
Problem: the data changes, but those changes are not tracked.
v1.0: trained on data_2024_01_15.csv
v1.1: trained on data_2024_02_01.csv ← which data produced which model?
Solution: Data Version Control (DVC)
# Track data with DVC
dvc add data/training_data.csv
git add data/training_data.csv.dvc
git commit -m "Add training data v1.0"

# Each code version → linked to a specific data version
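DVC also exposes a Python API, so a training script can read the data exactly as it was at a given Git revision; the tag name "v1.0" below is illustrative.

import pandas as pd
import dvc.api

# Open data/training_data.csv as it was at Git revision "v1.0" of this repo
with dvc.api.open("data/training_data.csv", rev="v1.0") as f:
    train_df = pd.read_csv(f)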
"Works on my machine" problem:
Dev machine: Python 3.8, TensorFlow 2.10, scikit-learn 1.0
Production: Python 3.9, TensorFlow 2.12, scikit-learn 1.2
→ Different results!
Solutions:
Docker:
FROM python:3.8
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "train.py"]
requirements.txt (pinned versions):
tensorflow==2.10.0
scikit-learn==1.0.2
pandas==1.4.3
numpy==1.23.1
Poetry/Pipenv: lock dependencies, including transitive dependencies.
Track everything about each experiment:
import mlflow
mlflow.start_run()
# Log parameters
mlflow.log_param("learning_rate", 0.001)
mlflow.log_param("batch_size", 32)
mlflow.log_param("model_type", "ResNet50")
# Log metrics
mlflow.log_metric("train_accuracy", 0.92)
mlflow.log_metric("val_accuracy", 0.88)
# Log model
mlflow.sklearn.log_model(model, "model")
# Log artifacts (plots, confusion matrix)
mlflow.log_artifact("confusion_matrix.png")
mlflow.end_run()
MLflow UI: Compare experiments side-by-side.
Experiment 1: lr=0.001, batch=32 → val_acc=0.88
Experiment 2: lr=0.0001, batch=64 → val_acc=0.91 ← Better!
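Beyond the UI, runs can also be compared programmatically with `mlflow.search_runs`, which returns a pandas DataFrame; the metric and parameter names below match the logging calls above.

import mlflow

# One row per run; sort by validation accuracy, best first
runs = mlflow.search_runs(order_by=["metrics.val_accuracy DESC"])
best = runs.iloc[0]
print(best["run_id"], best["params.learning_rate"], best["metrics.val_accuracy"])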
Challenge: many models, many versions; which one is deployed where?
fraud_detection_v1.0.pkl ← Production
fraud_detection_v1.1.pkl ← Staging
fraud_detection_v2.0.pkl ← Experimental
fraud_detection_v2.1.pkl ← ???
Model Registry: a central repository for models.
Workflow:
1. Train the model
2. Register the model in the registry:
   - Model binary/weights
   - Metadata (hyperparameters, metrics)
   - Training data version
   - Code version (Git commit)
3. Tag the model:
   - "staging", "production", "archived"
4. Deploy the model from the registry
5. Track which version is deployed where
Example with MLflow:
import mlflow
from mlflow.tracking import MlflowClient

# Register the model
mlflow.register_model(
    "runs:/<run_id>/model",
    "fraud_detection"
)

# Promote it to Production
client = MlflowClient()
client.transition_model_version_stage(
    name="fraud_detection",
    version=3,
    stage="Production"
)

# Load the Production model
model = mlflow.pyfunc.load_model(
    model_uri="models:/fraud_detection/Production"
)
Benefits: a single source of truth for every model, full lineage (data, code, hyperparameters, metrics) per version, and controlled promotion or rollback between stages.
The fully manual workflow (no MLOps) looks like this:
Data Scientist:
1. Manually extract data from the database
2. Train the model on a laptop
3. Save model.pkl
4. Email the engineer: "Please deploy this!"
5. The engineer manually copies it to the server
Problems:
- Not reproducible
- Not automated
- No monitoring
- No versioning
The next step up: an automated training pipeline. Characteristics:
1. Scheduled data extraction
2. Automated preprocessing
3. Automated training
4. Automated evaluation
5. Model registry
Deployment: still manual or semi-automated
Improvements: training is reproducible, runs on a schedule instead of on someone's laptop, and every model version lands in the registry.
Still missing: automated deployment, production monitoring, and drift-triggered retraining.
Full MLOps:
1. Code commit → Trigger CI/CD
2. Automated tests (data validation, model tests)
3. Automated training on new data
4. Automated evaluation
5. Automated deployment (if metrics meet threshold)
6. Automated monitoring
7. Automated retraining (if drift detected)
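As a sketch of step 7, the monitoring job can own the retraining decision. The thresholds below mirror the retrain triggers documented further down in this article, and `trigger_training_pipeline` is an illustrative stand-in for starting an Airflow/Kubeflow run; the inputs would come from the monitoring snippets above.

def should_retrain(val_accuracy, psi_value,
                   accuracy_floor=0.85, psi_threshold=0.25):
    """Retrain when performance degrades or the input data has drifted."""
    return val_accuracy < accuracy_floor or psi_value > psi_threshold

# Run by the monitoring job (e.g. daily), feeding in the metrics computed above:
# if should_retrain(daily_accuracy, psi_value):
#     trigger_training_pipeline()   # illustrative: e.g. start an Airflow DAG run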
Characteristics: the whole train-evaluate-deploy-monitor loop runs automatically (continuous training); engineers review and approve rather than execute each step by hand.
Tools ecosystem:
Data Versioning: DVC, Pachyderm
Experiment Tracking: MLflow, Weights & Biases, Neptune
Model Registry: MLflow, Seldon
Pipeline Orchestration: Airflow, Kubeflow, Prefect
Monitoring: Evidently AI, Whylabs, Fiddler
✅ Start simple, iterate: you do not need the full automated stack on day one; add versioning, tracking, and automation as the pain points appear.
✅ Separate training and serving code (but share the preprocessing logic, as sketched below):
/training
- train.py
- preprocess.py
- evaluate.py
/serving
- api.py
- preprocess.py ← SAME as training
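A minimal sketch of such a shared preprocess.py; the column names and the idea of passing in training-time statistics (so serving never recomputes them) are illustrative.

# preprocess.py: imported by both train.py and api.py, so the exact same
# transformations run at training time and at serving time.
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame, age_median: float) -> pd.DataFrame:
    """Single source of truth for feature transformations.

    Statistics learned from the training data (like age_median) are passed in,
    so serving reuses the training-time values instead of recomputing them.
    """
    df = df.copy()
    df["age"] = df["age"].fillna(age_median)      # same imputation in train & serve
    df["log_income"] = np.log1p(df["income"])     # same transform in train & serve
    return df

# train.py:  age_median = raw_df["age"].median()  (persist it alongside the model)
#            X = preprocess(raw_df, age_median)
# api.py:    X = preprocess(pd.DataFrame([payload]), age_median_loaded_with_model)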
✅ Version everything: code (Git), data (DVC), models (model registry), and environments (Docker, pinned requirements).
✅ Monitor actively, not passively: set alerts on model, data, and system metrics instead of waiting for users to report that something is wrong.
✅ Document assumptions:
"""
Model: Random Forest Classifier
Training Data: user_behavior_2024_01.csv (50k samples)
Assumptions:
- Users active in last 30 days
- Exclude users with <5 sessions
- Feature 'age' missing → impute with median
Expected Performance:
- Train Accuracy: 0.92
- Val Accuracy: 0.88
- Precision: 0.85
- Recall: 0.90
Retrain Trigger:
- Val Accuracy drops below 0.85
- Data drift PSI > 0.25
"""
✅ Test it like software:
import time

def test_model_input_shape():
    assert model.input_shape == (None, 10)

def test_model_output_range():
    pred = model.predict(sample_input)
    assert 0.0 <= float(pred) <= 1.0

def test_model_inference_time():
    start = time.time()
    model.predict(sample_input)
    latency = time.time() - start
    assert latency < 0.1  # 100 ms
✅ Graceful degradation:
try:
prediction = ml_model.predict(features)
except Exception as e:
log_error(e)
# Fallback to rule-based or simple heuristic
prediction = fallback_model.predict(features)
Critical mindset shift:
Research mindset: "The model hits 95% accuracy!"
Production mindset: "The model hits 95% accuracy, can be deployed, serves 1,000 req/s,
is monitored, retrains automatically when drift is detected, and can be maintained for two years."
Next steps:
In the next article, we will step into the world of Generative AI & LLMs: exploring the Transformer architecture and how Large Language Models work.
This article is part of the series "From Zero to AI Engineer" - Module 5: Deep Learning & Computer Vision