Computer Vision Architectures: Teaching Computers to "See"

A 2-year-old can tell dogs from cats after just a few examples. Teaching a computer to do the same was one of the hardest problems in AI for decades.

In 2012, AlexNet changed everything: it won the ImageNet competition with a top-5 error rate more than 10.8 percentage points lower than the best traditional approach. That breakthrough marked the start of the Deep Learning boom in Computer Vision.

Today, Computer Vision goes far beyond image classification: it powers self-driving cars (detecting pedestrians and traffic signs), medical diagnosis (detecting cancer in X-rays), face recognition, AR/VR, and countless other applications.

It all starts with one question: how can a computer "understand" an image?

The Problem with Fully Connected Networks for Images

Why are MLPs inefficient for images?

Example: a 224×224-pixel RGB image (3 channels) = 224 × 224 × 3 = 150,528 inputs

MLP Architecture:
Input (150,528) → Hidden Layer 1 (1000) → Hidden Layer 2 (500) → Output (10)

Number of parameters:
Layer 1: 150,528 × 1000 = 150,528,000 (~150.5 million parameters!)
Layer 2: 1000 × 500 = 500,000
Output: 500 × 10 = 5,000
Total: ~151 million weights (biases excluded), almost all of them in the first layer!
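
The count can be reproduced with a few lines of Python. This is only a minimal sketch: the helper name `count_mlp_parameters` is made up for illustration, and it reports biases separately because the figures above count weights only.

```python
# Layer widths from the MLP example: input -> hidden 1 -> hidden 2 -> output.
layer_sizes = [224 * 224 * 3, 1000, 500, 10]

def count_mlp_parameters(sizes):
    """Return (weights, biases) for a fully connected network (hypothetical helper)."""
    # One weight per (input unit, output unit) pair in every layer.
    weights = sum(n_in * n_out for n_in, n_out in zip(sizes, sizes[1:]))
    # One bias per non-input unit.
    biases = sum(sizes[1:])
    return weights, biases

weights, biases = count_mlp_parameters(layer_sizes)
first_layer = layer_sizes[0] * layer_sizes[1]
print(f"weights: {weights:,}  biases: {biases:,}")
print(f"share of weights in the first layer: {first_layer / weights:.2%}")
```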

Problems:

Too many parameters → easy to overfit, huge memory footprint
No use of spatial structure: a fully connected layer treats every pixel independently and has no notion that nearby pixels are related
No translation invariance: a dog in the left corner vs the right corner → the model sees two completely different patterns

Example of the translation problem:

Input 1: Dog in the left corner   Input 2: Dog in the right corner
[🐕     ]                         [     🐕]

MLP: These two inputs are completely different patterns!
→ It has to learn each position separately → inefficient

Solution: Convolutional Neural Networks (CNNs)

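Convolutional Neural Networks: A Revolutionary Architecture

The Math of Convolution

Convolution operation: slide a filter (kernel) over the image and compute a dot product at each position. (Strictly speaking, deep learning frameworks compute cross-correlation, i.e. the kernel is not flipped; the example below follows that convention.)

Simple example: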

Input image (5×5):          Kernel/Filter (3×3):
1  2  3  4  5               1  0  -1
6  7  8  9  10              1  0  -1
11 12 13 14 15              1  0  -1
16 17 18 19 20
21 22 23 24 25

Convolution at the top-left position:
┌─────────┐
│1  2  3 │ 4  5
│6  7  8 │ 9  10      × Kernel →
│11 12 13│ 14 15
└─────────┘
16 17 18  19 20

Output = (1×1 + 2×0 + 3×(-1)) + 
         (6×1 + 7×0 + 8×(-1)) + 
         (11×1 + 12×0 + 13×(-1))
       = (1 + 0 - 3) + (6 + 0 - 8) + (11 + 0 - 13)
       = -2 + (-2) + (-2) = -6

Slide the kernel one step to the right:

   ┌─────────┐
1  │2  3  4  │ 5
6  │7  8  9  │ 10     × Kernel → Output = (2-4) + (7-9) + (12-14) = -6
11 │12 13 14 │ 15
   └─────────┘
16  17 18 19   20
21  22 23 24   25

After sliding over the whole image → Feature map (output):

Output (3×3):
-6  -6  -6
-6  -6  -6
-6  -6  -6
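
As a sanity check on the worked example, here is a minimal pure-Python sketch of the sliding-window operation. It assumes a square image and kernel with no padding; the function name `conv2d_valid` is hypothetical, and like most deep learning frameworks it does not flip the kernel.

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` (both square lists of lists), no padding."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # Dot product between the kernel and the window at (i, j).
            total = 0
            for di in range(k):
                for dj in range(k):
                    total += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

# The 5×5 image holding 1..25 and the vertical-edge kernel from the example.
image = [[5 * r + c + 1 for c in range(5)] for r in range(5)]
kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]

feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)
```

Every entry comes out as -6 because each row of this image increases by the same amount from left to right, so the horizontal gradient is constant everywhere.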

What Filters Mean

Filters learn basic patterns:

Vertical Edge Detector:

1  0  -1
1  0  -1
1  0  -1

Detects vertical edges (brightness changing from light to dark along the horizontal direction)

Horizontal Edge Detector:

1   1   1
0   0   0
-1  -1  -1

Detects horizontal edges

Blur/Smoothing:

1/9  1/9  1/9
1/9  1/9  1/9
1/9  1/9  1/9

Averages the surrounding pixels → blurs the image

Sharpen:

0   -1   0
-1   5  -1
0   -1   0

Enhances edges and local contrast

Key strength: a CNN learns its filters automatically instead of relying on hand-crafted ones!

Convolution Hyperparameters

1. Kernel Size:

3×3: Most common (VGG, ResNet)
5×5: Wider receptive field
7×7: Very wide, typically used in the first layer
1×1: "Network in Network", reduces the channel dimension

2. Stride: the step size when sliding the kernel.

Stride = 1 (heavy overlap):
┌───┐
│ A │ B  C
└───┘
  ┌───┐
  │ B │ C  D
  └───┘

Stride = 2 (skip pixels):
┌───┐
│ A │ B  C  D
└───┘
        ┌───┐
        │ C │ D  E
        └───┘

Stride = 1: Output size is close to the input size
Stride = 2: Output width and height are each roughly halved (downsampling)

3. Padding:

Problem: every convolution makes the output smaller than the input.

Input: 5×5
Kernel: 3×3, Stride=1
Output: 3×3  ← Border information is lost!

Solution: add padding (zeros) around the input.

Original (5×5):          With Padding (7×7):
1  2  3  4  5           0  0  0  0  0  0  0
6  7  8  9  10          0  1  2  3  4  5  0
11 12 13 14 15    →     0  6  7  8  9 10  0
16 17 18 19 20          0 11 12 13 14 15  0
21 22 23 24 25          0 16 17 18 19 20  0
                        0 21 22 23 24 25  0
                        0  0  0  0  0  0  0

Convolve with a 3×3 kernel → Output: 5×5 (same as input!)

Types:

Valid padding (no padding): Output shrinks
Same padding: Pad so that output size = input size (at stride 1)
Full padding: Pad even more so that the output is larger than the input

Formula:

Output_size = floor((Input_size - Kernel_size + 2×Padding) / Stride) + 1
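
The formula translates directly into code. The sketch below is illustrative only (the function name `conv_output_size` is an assumption); it uses integer division, which rounds down when the stride does not divide evenly.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

valid = conv_output_size(5, 3)                  # no padding: 5 -> 3
same = conv_output_size(5, 3, padding=1)        # pad 1 on each side: 5 -> 5
full = conv_output_size(5, 3, padding=2)        # pad kernel_size - 1: 5 -> 7
strided = conv_output_size(224, 7, padding=3, stride=2)  # 7×7 kernel, stride 2: 224 -> 112

print(valid, same, full, strided)
```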

4. Number of Filters:

Each filter learns a different pattern → more filters = more feature maps.

Input: 224×224×3 (RGB)
Apply 64 filters (3×3, stride 1, same padding)
→ Output: 224×224×64 (64 feature maps)

Filter 1 learns vertical edges
Filter 2 learns horizontal edges
Filter 3 learns curves
...
Filter 64 learns complex textures
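
To see how cheap this layer is compared with the fully connected layer from the MLP example, the sketch below counts its parameters. It assumes each filter spans all input channels and has one bias; the helper name `conv_layer_parameters` is hypothetical.

```python
def conv_layer_parameters(kernel_size, in_channels, num_filters, bias=True):
    """Parameters of a conv layer: each filter is kernel_size × kernel_size × in_channels."""
    weights = kernel_size * kernel_size * in_channels * num_filters
    return weights + (num_filters if bias else 0)

conv_params = conv_layer_parameters(kernel_size=3, in_channels=3, num_filters=64)
fc_params = 224 * 224 * 3 * 1000  # first MLP layer from the earlier example (weights only)

print(f"conv layer: {conv_params:,} parameters")
print(f"fully connected layer: {fc_params:,} parameters")
```

The conv layer's parameter count does not depend on the image size, because the same filters are reused at every position.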

Pooling Layers: Downsampling

Purpose:

Reduce spatial dimensions (width, height) → less computation and fewer parameters in later layers
Add a small amount of translation invariance (a dog shifted by 1-2 pixels is still detected)
Increase the receptive field of later layers

Max Pooling (the most common):

Input (4×4):              Max Pool (2×2, stride=2):
1   3   2   4             ┌─────┐ ┌─────┐
5   6   1   2       →     │max=6│ │max=4│
7   2   8   3             └─────┘ └─────┘
3   4   1   9             ┌─────┐ ┌─────┐
                          │max=7│ │max=9│
Output (2×2):             └─────┘ └─────┘
6   4
7   9

Picks the largest value in each region.

Average Pooling:

Averages the values in each region instead of taking the max.

Input region:     Average Pooling:
1  3              (1+3+5+6)/4 = 3.75
5  6

Global Average Pooling (GAP):

Averages an entire feature map down to a single number.

Feature map (4×4):      GAP:
1  2  3  4              Average all
5  6  7  8        →     = (1+2+...+16)/16
9 10 11 12              = 8.5
13 14 15 16

Used at the end of a CNN instead of Flatten → drastically reduces the parameters of the classifier head.
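
The three pooling variants can be sketched in pure Python on the same small inputs used above. The helper names (`pool2d`, `mean`, `global_average_pool`) are made up for this illustration, and the code assumes a single-channel feature map stored as a list of lists.

```python
def mean(values):
    return sum(values) / len(values)

def pool2d(feature_map, size=2, stride=2, reduce=max):
    """Apply `reduce` (max or mean) to each size×size window."""
    output = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(reduce(window))
        output.append(row)
    return output

def global_average_pool(feature_map):
    """Collapse the whole feature map into one number."""
    return mean([value for row in feature_map for value in row])

x = [[1, 3, 2, 4],
     [5, 6, 1, 2],
     [7, 2, 8, 3],
     [3, 4, 1, 9]]

max_pooled = pool2d(x)                 # 2×2 max pooling, stride 2
avg_pooled = pool2d(x, reduce=mean)    # 2×2 average pooling, stride 2
gap = global_average_pool([[1, 2, 3, 4], [5, 6, 7, 8],
                           [9, 10, 11, 12], [13, 14, 15, 16]])
print(max_pooled, avg_pooled, gap)
```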

CNN Architecture Pattern

Typical CNN:

Input Image (224×224×3)
    ↓
[CONV → ReLU → POOL] × N  ← Feature extraction
    ↓
[CONV → ReLU → POOL] × M
    ↓
[CONV → ReLU] × K
    ↓
Flatten
    ↓
[Fully Connected → ReLU] × L  ← Classification
    ↓
Softmax Output (Classes)

Pattern:

Early layers: small kernels (3×3) and fewer filters → learn low-level features (edges, textures)
Deeper layers: more filters, larger receptive field → learn high-level features (objects, shapes)
Gradual downsampling via pooling or strided convolutions

Iconic CNN Architectures

LeNet-5 (1998): The Pioneer

Developed by Yann LeCun and colleagues for handwritten digit recognition (MNIST).

Architecture:
Input (32×32×1)
→ CONV (5×5, 6 filters) → AvgPool
→ CONV (5×5, 16 filters) → AvgPool
→ Flatten → FC (120) → FC (84) → Output (10)

Parameters: ~60K

Significance: a proof of concept that CNNs work for image classification.

AlexNet (2012): The Breakthrough

Alex Krizhevsky, ImageNet winner 2012.

Architecture:
Input (224×224×3)
→ CONV (11×11, stride=4, 96 filters) → MaxPool
→ CONV (5×5, 256 filters) → MaxPool
→ CONV (3×3, 384 filters)
→ CONV (3×3, 384 filters)
→ CONV (3×3, 256 filters) → MaxPool
→ FC (4096) → Dropout → FC (4096) → Dropout
→ Output (1000)

Parameters: 60 million

Innovations:

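ReLU activation (instead of sigmoid/tanh) → trains faster (about 6x faster than tanh in the paper's CIFAR-10 experiment)
Dropout (0.5) to prevent overfitting
Data augmentation (flips, crops, color jittering)
GPU training (2 GPUs in parallel)

Impact: top-5 error rate dropped from about 26% (runner-up) to 15.3% → kicked off the deep learning boom in computer vision!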

VGGNet (2014): Deeper is Better

Visual Geometry Group, Oxford.

Key insight: stack several 3×3 convs instead of using one large conv.

2 conv 3×3 = receptive field 5×5
3 conv 3×3 = receptive field 7×7

But with fewer parameters and more non-linearities!
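
Both claims can be checked numerically. The sketch below assumes stride-1 convolutions, the same number of input and output channels in every layer, and no biases; the channel count of 64 and the helper names are chosen only for illustration.

```python
def stacked_receptive_field(num_layers, kernel_size=3):
    """Receptive field of `num_layers` stacked stride-1 convs."""
    # Each extra layer widens the field by (kernel_size - 1) pixels.
    return 1 + num_layers * (kernel_size - 1)

def conv_weights(kernel_size, channels):
    """Weights of one conv layer with `channels` inputs and outputs (no bias)."""
    return kernel_size ** 2 * channels ** 2

channels = 64  # hypothetical channel count
stacked = 3 * conv_weights(3, channels)   # three 3×3 layers
single = conv_weights(7, channels)        # one 7×7 layer

print(stacked_receptive_field(2), stacked_receptive_field(3))
print(f"3 × (3×3): {stacked:,} weights   1 × (7×7): {single:,} weights")
```

In general the comparison is 27·C² weights against 49·C² for the same 7×7 receptive field, with two extra non-linearities in between.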

VGG-16 Architecture:

Input (224×224×3)
→ [CONV 3×3, 64] × 2 → MaxPool
→ [CONV 3×3, 128] × 2 → MaxPool
→ [CONV 3×3, 256] × 3 → MaxPool
→ [CONV 3×3, 512] × 3 → MaxPool
→ [CONV 3×3, 512] × 3 → MaxPool
→ FC (4096) → FC (4096) → Output (1000)

Parameters: 138 million (very heavy!)
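
The 138 million figure can be reproduced from the architecture listing. This is a rough sketch under the usual assumptions (3×3 convs with same padding and biases, 2×2 max pooling that halves the resolution, biases on the FC layers); the names `VGG16_CONFIG` and `count_vgg_parameters` are illustrative, not from any library.

```python
# Numbers are conv output channels; "M" marks a 2×2 max pool.
VGG16_CONFIG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                512, 512, 512, "M", 512, 512, 512, "M"]
FC_SIZES = [4096, 4096, 1000]

def count_vgg_parameters(config, fc_sizes, input_size=224, in_channels=3):
    total = 0
    channels, size = in_channels, input_size
    for layer in config:
        if layer == "M":
            size //= 2                                   # pooling halves width and height
        else:
            total += 3 * 3 * channels * layer + layer    # conv weights + biases
            channels = layer
    features = size * size * channels                    # flattened input to the first FC layer
    for width in fc_sizes:
        total += features * width + width                # FC weights + biases
        features = width
    return total

vgg16_params = count_vgg_parameters(VGG16_CONFIG, FC_SIZES)
first_fc = 7 * 7 * 512 * 4096 + 4096
print(f"total: {vgg16_params:,}   first FC layer alone: {first_fc:,}")
```

Most of the weight sits in the first fully connected layer, which is the part that Global Average Pooling removes in later architectures.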

Principle: Simple, uniform architecture - easy to understand and implement.

Drawback: too many parameters → slow, high memory usage.

GoogLeNet/Inception (2014): "Network in Network"

Google.

Key insight: instead of picking one kernel size (1×1, 3×3, 5×5), why not use ALL of them?

Inception Module:

                    Input
                      │
        ┌─────────────┼─────────────┬──────────┐
        │             │             │          │
     1×1 conv      3×3 conv      5×5 conv   MaxPool
       (64)          (128)         (32)      (32)
        │             │             │          │
        └─────────────┴─────────────┴──────────┘
                      │
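              Concatenate (64 + 128 + 32 + 32 = 256 channels)

(The MaxPool branch is followed by a 1×1 conv with 32 filters; pooling alone does not change the channel count.)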

1×1 convolution trick: reduce the channel dimension before applying the expensive 3×3 and 5×5 convs.

Input: 256 channels
→ 1×1 conv (64 filters) → 64 channels  ← Dimensionality reduction
→ 3×3 conv → Cheaper!
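
A back-of-the-envelope count shows how much the reduction saves. The 256 input channels and the 64-filter 1×1 conv come from the example above; the 28×28 feature-map size and the 128 output filters of the 3×3 conv are assumptions made only for this sketch, as is the helper name `conv_multiplies`.

```python
def conv_multiplies(height, width, kernel_size, in_channels, out_channels):
    """Multiplications for a stride-1, same-padded convolution."""
    return height * width * kernel_size ** 2 * in_channels * out_channels

H = W = 28   # assumed feature-map size
OUT = 128    # assumed number of 3×3 filters

# 3×3 conv applied directly to all 256 channels.
direct = conv_multiplies(H, W, 3, 256, OUT)
# 1×1 conv down to 64 channels, then the 3×3 conv.
reduced = conv_multiplies(H, W, 1, 256, 64) + conv_multiplies(H, W, 3, 64, OUT)

print(f"direct: {direct:,}")
print(f"with 1×1 reduction: {reduced:,}")
print(f"saving: {direct / reduced:.1f}x")
```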

GoogLeNet:

22 layers
About 5 million parameters (12x fewer than AlexNet!)
Auxiliary classifiers in the middle of the network to help gradient flow

ResNet (2015): The Revolution

Microsoft Research.

Problem: very deep plain networks (>20 layers) suffer from the degradation problem: training error goes up when more layers are added!

Training Error
    │╲╲
    │ ╲ ╲_____  ← 56-layer (higher error)
    │  ╲
    │   ╲_____  ← 20-layer (better!)
    └────────── Training iterations

This is not overfitting (the training error itself is higher); the deeper network is simply harder to optimize.

Solution: Skip Connections (Residual Connections)

Normal block:           Residual block:
x → CONV → ReLU →      x ──────────────┐
    CONV → ReLU →         │             │
    Output F(x)           CONV → ReLU   │ (skip/shortcut)
                          CONV          │
                          │             │
                          + ←───────────┘
                          ReLU
                          Output: F(x) + x

Insight: it is easier to learn the residual F(x) = H(x) - x than to learn H(x) directly.

If the identity mapping is optimal → F(x) only needs to learn to output 0 (much easier)!
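
The identity argument can be illustrated with a toy block. In this sketch plain matrix-vector products stand in for the CONV layers, and all names (`linear`, `plain_block`, `residual_block`) are hypothetical; it is meant to show the wiring, not a real ResNet layer.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(weights, v):
    """Matrix-vector product; stands in for a CONV layer in this toy example."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def plain_block(x, w1, w2):
    # x -> CONV -> ReLU -> CONV -> ReLU
    return relu(linear(w2, relu(linear(w1, x))))

def residual_block(x, w1, w2):
    fx = linear(w2, relu(linear(w1, x)))            # F(x): CONV -> ReLU -> CONV
    return relu([f + a for f, a in zip(fx, x)])     # ReLU(F(x) + x)

x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

plain_out = plain_block(x, zeros, zeros)        # all-zero weights wipe out the input
residual_out = residual_block(x, zeros, zeros)  # all-zero weights pass the input through
print(plain_out, residual_out)
```

With all weights at zero, the residual block already behaves as the identity (for non-negative inputs, because of the final ReLU), whereas the plain block would have to learn specific weights to reproduce its input.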

ResNet-50 Architecture:

Input (224×224×3)
→ CONV 7×7, stride=2 → MaxPool
→ [Residual Block] × 3
→ [Residual Block] × 4
→ [Residual Block] × 6
→ [Residual Block] × 3
→ Global Average Pooling
→ FC (1000)

Parameters: 25.5 million

Variants:

ResNet-18, ResNet-34: Shallower
ResNet-50, ResNet-101, ResNet-152: Deeper
ResNet-1001 / ResNet-1202: Extreme (1000+ layers, explored on CIFAR-10)

Impact: made it possible to train very deep networks (100+ layers) → state of the art on many tasks.

Object Detection: Not Just "What" but Also "Where"

Classification: "This image contains a dog"
Object Detection: "There are 2 dogs: one at (x1,y1,w1,h1), one at (x2,y2,w2,h2)"

Key Concepts

1. Bounding Box:

Representation: (x, y, width, height)
or (x_min, y_min, x_max, y_max)

┌─────────────────┐
│                 │
│   ┌───────┐     │
│   │ DOG   │     │
│   │  🐕   │     │
│   └───────┘     │
│    (x,y,w,h)    │
└─────────────────┘

2. Intersection over Union (IoU):

A metric for evaluating bounding box predictions.

Predicted Box (P):  ┌─────┐
Ground Truth (G):      ┌─────┐
                       │ ∩   │  ← Intersection
                    ┌──┴──┬──┘
                    │     │
                    └─────┘
                    ← Union →

IoU = Area(P ∩ G) / Area(P ∪ G)

IoU = 1.0: Perfect overlap
IoU > 0.5: Usually considered "good"
IoU > 0.7: High quality detection
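
A minimal sketch of the IoU computation, assuming boxes are given in the (x_min, y_min, x_max, y_max) representation. The function name `iou` and the sample boxes are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    # Width and height of the overlap; clamp at 0 when the boxes do not intersect.
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

predicted = (0, 0, 2, 2)
ground_truth = (1, 1, 3, 3)
print(iou(predicted, ground_truth))   # overlap area 1, union area 7
print(iou(predicted, predicted))      # identical boxes
print(iou(predicted, (5, 5, 6, 6)))   # disjoint boxes
```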

3. Anchor Boxes:

Pre-defined boxes with different aspect ratios and scales.

Anchor boxes at each grid cell:
┌──┐  ┌────┐  ┌──────┐  ← 3 scales
│  │  │    │  │      │
└──┘  └────┘  └──────┘

     1:1    1:2    1:3   ← 3 aspect ratios

The model predicts offsets relative to the anchor boxes instead of predicting boxes from scratch.
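
The sketch below generates anchors for one grid cell and decodes a predicted offset. Several things here are assumptions rather than part of the text above: the concrete scales, reading each ratio as width:height, keeping the area fixed per scale, and the offset parameterization (the one used in the R-CNN family: center shifts scaled by anchor size, log-scale width and height). Function names are hypothetical.

```python
import math

def make_anchors(cx, cy, scales, aspect_ratios):
    """Anchors centered at (cx, cy) as (x_min, y_min, x_max, y_max)."""
    anchors = []
    for scale in scales:
        for ratio in aspect_ratios:          # ratio = width / height (assumed convention)
            w = scale * math.sqrt(ratio)     # keeps area equal to scale**2
            h = scale / math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

def apply_offsets(anchor, offsets):
    """Decode (tx, ty, tw, th) relative to an anchor (R-CNN style, assumed)."""
    wa, ha = anchor[2] - anchor[0], anchor[3] - anchor[1]
    xa, ya = anchor[0] + wa / 2, anchor[1] + ha / 2
    tx, ty, tw, th = offsets
    cx, cy = xa + tx * wa, ya + ty * ha
    w, h = wa * math.exp(tw), ha * math.exp(th)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

anchors = make_anchors(16, 16, scales=[8, 16, 32], aspect_ratios=[1.0, 1 / 2, 1 / 3])
unchanged = apply_offsets(anchors[0], (0.0, 0.0, 0.0, 0.0))   # zero offsets keep the anchor
shifted = apply_offsets(anchors[0], (0.5, 0.0, 0.0, 0.0))     # move right by half a width
print(len(anchors), unchanged, shifted)
```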

4. Non-Maximum Suppression (NMS):

Problem: the model can predict many boxes for the same object.

        ┌─────┐
      ┌─┼─────┼─┐
    ┌─┼─┼─────┼─┼─┐
    │ │🐕│     │ │ │ ← several overlapping boxes for a single dog!
    └─┼─┼─────┼─┼─┘
      └─┼─────┼─┘
        └─────┘

NMS Algorithm:

1. Sort boxes by confidence score (high → low)
2. Pick the box with the highest confidence and keep it
3. Remove every remaining box whose IoU with that box exceeds a threshold (e.g. 0.5)
4. Repeat with the remaining boxes

Result: ideally only the single best box is kept for each object.
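
The four steps can be sketched as a short greedy loop. This is a plain-Python illustration for a single class, with boxes assumed to be in (x_min, y_min, x_max, y_max) format; the function names and the sample boxes are made up.

```python
def iou(a, b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first."""
    # Step 1: sort indices by confidence, high to low.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        # Step 2: keep the most confident remaining box.
        best = order.pop(0)
        keep.append(best)
        # Step 3: drop boxes that overlap it too much; step 4: loop on the rest.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

# Three overlapping boxes on one object plus one separate object.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (11, 9, 49, 51), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7, 0.95]

kept = non_max_suppression(boxes, scores)
print(kept)
```

In multi-class detectors this procedure is usually run separately per class, so that overlapping objects of different classes do not suppress each other.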

YOLO (You Only Look Once): Real-time Detection

Key idea: treat object detection as a regression problem and predict boxes and classes in a single forward pass.

Architecture:

Input Image (448×448×3)
    ↓
CNN Backbone (Darknet)
    ↓
Output: S×S×(B×5 + C)

S×S: Grid (e.g. 7×7)
B: Number of boxes per cell (e.g. 2)
5: (x, y, w, h, confidence)
C: Number of classes (class probabilities are predicted once per cell)
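
To make the output layout concrete, the sketch below computes the tensor shape and splits one cell's prediction vector into boxes and class probabilities. C = 20 (the PASCAL VOC class count) and the ordering "boxes first, then class probabilities" are assumptions for illustration; the helper names are hypothetical.

```python
def yolo_output_shape(grid_size, boxes_per_cell, num_classes):
    """Shape of the S×S×(B×5 + C) output tensor."""
    return (grid_size, grid_size, boxes_per_cell * 5 + num_classes)

def split_cell(cell_vector, boxes_per_cell):
    """Split one cell's vector into B boxes of 5 values and C class probabilities."""
    boxes = [cell_vector[b * 5:(b + 1) * 5] for b in range(boxes_per_cell)]
    class_probs = cell_vector[boxes_per_cell * 5:]
    return boxes, class_probs

S, B, C = 7, 2, 20   # C = 20 is an assumed class count
shape = yolo_output_shape(S, B, C)
total_values = shape[0] * shape[1] * shape[2]

# A dummy cell vector 0..29 just to show where each value ends up.
cell_boxes, cell_class_probs = split_cell(list(range(shape[2])), B)
print(shape, total_values)
print(cell_boxes, len(cell_class_probs))
```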

Workflow:

1. Split the image into a 7×7 grid
2. Each cell predicts 2 bounding boxes, each with (x, y, w, h, confidence)
3. Each cell also predicts one set of class probabilities shared by its boxes
4. Apply NMS to remove duplicates

Advantages:

Very fast: 45 FPS (real-time!)
End-to-end training
Global context (sees the whole image, not just regions)

Drawbacks:

Weak on small objects (because the grid is coarse)
Struggles with objects that are close together

Versions:

YOLOv1 (2015)
YOLOv2/YOLO9000 (2016): Faster, better
YOLOv3 (2018): Multi-scale predictions
YOLOv4, YOLOv5 (2020), YOLOv8 (2023): later iterations with further speed and accuracy gains

Faster R-CNN: Two-stage Detector

Generally more accurate but slower than YOLO.

Stage 1: Region Proposal Network (RPN)

Generates region proposals (potential objects): about 2000 per image during training, around 300 at test time in the original paper

Stage 2: Classification & Refinement

Crop features for each proposal (RoI pooling) → classify & refine the box