Fine-tuning & Adaptation: Customizing LLMs for Your Domain

RAG gives LLMs access to external knowledge, but sometimes you need the LLM to change its behavior - write in a specific style, understand domain-specific terminology, or follow instructions better. That's where fine-tuning comes in.

In this article, we'll explore fine-tuning techniques ranging from full fine-tuning to parameter-efficient methods such as LoRA and QLoRA.

Full Fine-tuning vs Parameter-Efficient Fine-Tuning (PEFT)

Full Fine-tuning

Update all of the model's parameters.

# Pseudocode
for epoch in range(num_epochs):
    for batch in training_data:
        # Forward pass
        outputs = model(batch.inputs)
        loss = compute_loss(outputs, batch.labels)

        # Backward pass - updates ALL ~7B parameters
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Advantages:

  • Maximum adaptation capability
  • Best performance potential

Disadvantages:

  • Requires huge GPU memory (a 7B model's weights alone are ~28GB in FP32; gradients and optimizer states push training memory far higher)
  • Slow training
  • Risk of catastrophic forgetting
  • Need separate copy for each task

When to use: research settings, abundant compute, or when you need maximum performance

Parameter-Efficient Fine-Tuning (PEFT)

Freeze most of the parameters and train only a small fraction.

# Pseudocode
# Freeze the base model
for param in base_model.parameters():
    param.requires_grad = False

# Add small trainable modules (LoRAAdapter is a placeholder name)
adapter = LoRAAdapter(rank=8)  # Only ~0.1% of parameters

# Train only the adapter - the optimizer sees adapter parameters only
optimizer = Optimizer(adapter.parameters())
for batch in training_data:
    outputs = model_with_adapter(batch.inputs)
    loss = compute_loss(outputs, batch.labels)
    optimizer.zero_grad()
    loss.backward()  # Only adapter gradients are computed
    optimizer.step()

Advantages:

  • 10-100x less GPU memory
  • Faster training
  • One base model + multiple adapters
  • Less catastrophic forgetting

Disadvantages:

  • Slightly lower performance than full fine-tuning (usually <5%)

LoRA - Low-Rank Adaptation

LoRA is the most widely used PEFT technique.

The Intuition Behind LoRA

Instead of updating the weight matrix W (dimension d × d) directly, LoRA adds a low-rank decomposition:

Original: W ∈ ℝ^(d×d)

LoRA: W' = W + BA
  - W: Frozen original weights (d × d)
  - B: Trainable (d × r)
  - A: Trainable (r × d)
  - r << d (rank, typically 4-64)

Parameters to train: d×r + r×d = 2dr
Instead of: d×d

Example: d=4096, r=8
  - Full: 4096×4096 = 16M parameters
  - LoRA: 2×4096×8 = 65K parameters (~0.4%)
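
To make the decomposition concrete, here is a minimal sketch of a LoRA-style linear layer in plain PyTorch. It is illustrative only - the peft library's real implementation differs in details such as dropout and weight handling - but it shows the frozen path plus the scaled low-rank update:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=8, alpha=32):
        super().__init__()
        # Frozen pretrained weight (random here only for illustration)
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # Trainable (r × d_in)
        self.B = nn.Parameter(torch.zeros(d_out, r))         # Trainable (d_out × r), starts at zero
        self.scaling = alpha / r                              # lora_alpha / r

    def forward(self, x):
        # Frozen path + scaled low-rank update: xWᵀ + s · xAᵀBᵀ
        return x @ self.weight.T + self.scaling * (x @ self.A.T @ self.B.T)

Because B starts at zero, the adapted layer behaves exactly like the original at the start of training, and only the A/B matrices receive gradients.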

LoRA Implementation

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto"
)

# Configure LoRA
lora_config = LoraConfig(
    r=8,                          # Rank
    lora_alpha=32,                # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA
model = get_peft_model(base_model, lora_config)

# Check trainable parameters
model.print_trainable_parameters()
# Output: trainable params: 4,194,304 || all params: 6,738,415,616 || trainable%: 0.06%

Training with LoRA

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,                    # Mixed precision
    logging_steps=10,
    save_steps=100
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

# Train
trainer.train()

# Save LoRA adapter (only ~10MB instead of 14GB!)
model.save_pretrained("./lora-adapter")

Inference with LoRA

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "./lora-adapter"
)

# Generate
inputs = tokenizer("Explain machine learning:", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0]))

LoRA Hyperparameters

r (rank):

  • Lower (4-8): Fewer parameters, faster, sufficient for simple tasks
  • Higher (16-64): More expressive, better performance, more memory

lora_alpha:

  • Scaling factor = lora_alpha / r
  • Typical: lora_alpha = 2r or 4r
  • Higher alpha = stronger adaptation

target_modules:

# Minimal (fastest)
target_modules=["q_proj", "v_proj"]

# Recommended (balanced)
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]

# Extensive (best performance)
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

QLoRA - Quantized LoRA

QLoRA combines LoRA with quantization to reduce memory even further.

What Is Quantization?

Reduce the precision of the weights:

FP32: 32 bits per parameter  (4 bytes)
FP16: 16 bits per parameter  (2 bytes) - 50% reduction
INT8:  8 bits per parameter  (1 byte)  - 75% reduction
INT4:  4 bits per parameter  (0.5 byte)- 87.5% reduction
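
As a quick sanity check on these numbers, here is the back-of-the-envelope memory calculation for a 7B-parameter model (weights only, ignoring activations and optimizer state):

# Approximate weight memory for a 7B-parameter model at each precision
params = 7e9
for name, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name}: {params * bytes_per_param / 1e9:.1f} GB")
# FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB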

QLoRA innovation:

  • Base model: 4-bit quantized (NF4 - Normal Float 4)
  • LoRA adapters: 16-bit training
  • Double quantization: Quantize quantization constants
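
To see what double quantization actually buys, here is a rough calculation using the QLoRA paper's default block sizes (a 32-bit constant per 64 weights for the first quantization, then 8-bit constants in blocks of 256 for the second):

# Overhead of quantization constants, in bits per model parameter (approximate)
single_quant = 32 / 64                   # 0.5 bits/param without double quantization
double_quant = 8 / 64 + 32 / (64 * 256)  # ≈ 0.127 bits/param with double quantization
print(single_quant, double_quant)        # saves ≈ 0.37 bits/param, roughly 0.3GB on a 7B model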

Memory Comparison

7B model:
- FP32: ~28GB
- FP16: ~14GB
- LoRA (FP16): ~14GB base + 100MB adapter
- QLoRA (4-bit): ~3.5GB base + 100MB adapter ✨

Can fine-tune 7B on 1x RTX 3090 (24GB)!

QLoRA Implementation

import torch

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # NormalFloat4
    bnb_4bit_use_double_quant=True,   # Double quantization
    bnb_4bit_compute_dtype=torch.float16
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for training
model = prepare_model_for_kbit_training(model)

# Add LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

# Train normally (reuse the same Trainer setup as in the LoRA example)
trainer.train()

QLoRA vs LoRA Trade-offs

Metric    | LoRA                  | QLoRA
Memory    | ~14GB (7B)            | ~3.5GB (7B)
Speed     | Faster                | ~20% slower
Quality   | Slightly better       | Very close (~1-2% diff)
Use case  | GPUs with enough VRAM | Consumer GPUs

Adapters - Alternative PEFT Method

Insert small neural network modules into the model.

import torch.nn as nn

# Conceptual sketch - Hugging Face peft does not ship a classic bottleneck
# AdapterConfig; "Houlsby-style" adapters are typically used via the AdapterHub
# `adapters` library. The core idea is a small bottleneck MLP with a skip connection:
class BottleneckAdapter(nn.Module):
    def __init__(self, d_model, reduction_factor=16):
        super().__init__()
        self.down = nn.Linear(d_model, d_model // reduction_factor)  # Down-project
        self.up = nn.Linear(d_model // reduction_factor, d_model)    # Up-project
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # Skip connection

Architecture:

Layer Output
    ↓
Adapter Module:
  ↓ Down-project (d → d/16)
  ↓ Activation
  ↓ Up-project (d/16 → d)
  ↓ Skip connection
    ↓
Next Layer

Advantages:

  • Similar efficiency to LoRA
  • Can stack multiple adapters

Disadvantages:

  • Adds inference latency (sequential modules)

Catastrophic Forgetting

When you fine-tune, the model can "forget" its original capabilities.

Problem Example

# Before fine-tuning on medical data
model("What is photosynthesis?")
# "Photosynthesis is the process plants use to convert sunlight..."

# After fine-tuning
model("What is photosynthesis?")
# "Photosynthesis is... [medical jargon] ...cells... [confused response]"

Solutions

1. Mix original data:

# Fine-tuning dataset
train_data = domain_specific_data + general_data_sample

# Ratio: 80% domain, 20% general

2. Lower learning rate:

# Too aggressive: 5e-4
# Better: 1e-4 or 2e-5
training_args = TrainingArguments(
    learning_rate=2e-5  # Gentle adaptation
)

3. Use PEFT (LoRA):

  • Base model frozen → retains original knowledge
  • Adapter adds new capabilities

4. Regularization:

# Elastic Weight Consolidation (EWC)
# Penalize changes to important weights
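
As a rough illustration of the EWC idea (not part of the original recipe), the sketch below adds a penalty that discourages weights from drifting away from their pre-fine-tuning values, weighted by a per-parameter importance estimate. The `fisher` and `theta_star` dictionaries are assumed to be precomputed:

# Minimal EWC-style penalty sketch (illustrative; names are placeholders)
def ewc_penalty(model, fisher, theta_star, lam=0.1):
    penalty = 0.0
    for name, param in model.named_parameters():
        if name in fisher:
            # Penalize movement away from the original weights,
            # weighted by each weight's estimated importance
            penalty = penalty + (fisher[name] * (param - theta_star[name]) ** 2).sum()
    return lam * penalty

# During fine-tuning: loss = task_loss + ewc_penalty(model, fisher, theta_star)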

Dataset Preparation for Instruction Tuning

Fine-tuning for instruction-following requires proper formatting.

Instruction Dataset Format

{
  "instruction": "Summarize the following article:",
  "input": "Long article text here...",
  "output": "Summary of the article..."
}

Alpaca format:

def format_alpaca(example):
    if example["input"]:
        prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{example["instruction"]}

### Input:
{example["input"]}

### Response:
{example["output"]}"""
    else:
        prompt = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{example["instruction"]}

### Response:
{example["output"]}"""
    
    return {"text": prompt}

Data Quality Matters

# ❌ BAD - Low quality
{
  "instruction": "tell me about dogs",
  "output": "dogs r cool"
}

# ✅ GOOD - High quality
{
  "instruction": "Provide a comprehensive overview of dog breeds suitable for apartment living.",
  "output": "Several dog breeds adapt well to apartment living due to their size and temperament. Small breeds like Cavalier King Charles Spaniels are gentle and require moderate exercise..."
}

Dataset quality > quantity:

  • 1,000 high-quality examples > 10,000 noisy ones
  • Diverse instructions (question answering, summarization, creative writing)
  • Correct, detailed, well-formatted outputs
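
To make "quality over quantity" actionable, a minimal sketch of heuristic filters is shown below; the thresholds and the `raw_data` variable are illustrative, and real pipelines usually add deduplication and manual review on top:

# Illustrative quality filters for instruction data (thresholds are assumptions)
def passes_quality_filter(example):
    instruction = example["instruction"].strip()
    output = example["output"].strip()
    return (
        len(output.split()) >= 20          # Reject one-line, low-effort answers
        and len(instruction.split()) >= 4  # Reject vague instructions
        and output[:1].isupper()           # Crude formatting check
    )

clean_data = [ex for ex in raw_data if passes_quality_filter(ex)]
print(f"Kept {len(clean_data)} / {len(raw_data)} examples")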

Creating Synthetic Data with GPT-4

import json

from openai import OpenAI

client = OpenAI()

def generate_training_examples(topic, num_examples=100):
    examples = []
    
    for _ in range(num_examples):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": f"""Generate a training example for fine-tuning an AI assistant specialized in {topic}.
                
Format:
- Instruction: A question or task
- Input: Optional context (can be empty)
- Output: Detailed, accurate response

Return as JSON."""
            }]
        )
        
        # Note: the model may not always return valid JSON - validate/parse defensively in production
        example = json.loads(response.choices[0].message.content)
        examples.append(example)
    
    return examples

# Generate data
medical_data = generate_training_examples("medical diagnosis", 500)

Fine-tuning Best Practices

1. Start with Base or Instruction-tuned?

Base model (e.g., Llama-2-7b):

  • For domain-specific language modeling
  • When you want full control

Instruction-tuned (e.g., Llama-2-7b-chat):

  • For chat/assistant applications
  • Already follows instructions well
  • Easier to adapt

# For chatbot
base = "meta-llama/Llama-2-7b-chat-hf"  # Recommended

# For domain modeling
base = "meta-llama/Llama-2-7b-hf"

2. Monitor Overfitting

from transformers import TrainerCallback

class OverfitCallback(TrainerCallback):
    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        # Compare the latest logged training loss against the eval loss
        train_losses = [log["loss"] for log in state.log_history if "loss" in log]
        if not train_losses or metrics is None:
            return control

        if metrics["eval_loss"] > train_losses[-1] * 1.5:
            print("⚠️ Possible overfitting!")

        return control

trainer = Trainer(
    model=model,
    args=training_args,
    callbacks=[OverfitCallback()]
)

3. Evaluate Properly

# Don't just look at loss
# Test on held-out examples

test_prompts = [
    "Explain quantum computing:",
    "Write a haiku about spring:",
    "Debug this Python code: ..."
]

for prompt in test_prompts:
    output = generate(model, prompt)  # generate(): your own helper wrapping tokenizer + model.generate
    print(f"Prompt: {prompt}")
    print(f"Output: {output}\n")

4. Learning Rate Schedule

from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Warmup: Gradually increase LR
# Decay: Gradually decrease LR

optimizer = AdamW(model.parameters(), lr=2e-4)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,
    num_training_steps=1000
)

5. Gradient Accumulation

Train with a larger effective batch size despite limited GPU memory:

training_args = TrainingArguments(
    per_device_train_batch_size=1,  # Physical batch size
    gradient_accumulation_steps=16,  # Accumulate 16 batches
    # Effective batch size = 1 × 16 = 16
)
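
For intuition, here is a minimal sketch of what the Trainer does internally when accumulating gradients (names like train_dataloader are illustrative):

accumulation_steps = 16

optimizer.zero_grad()
for step, batch in enumerate(train_dataloader):
    loss = model(**batch).loss
    (loss / accumulation_steps).backward()  # Scale so the accumulated gradient matches one big batch

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # One parameter update per 16 micro-batches
        optimizer.zero_grad()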

Merging LoRA Adapters

Combine the base model and adapter into a single model:

from peft import PeftModel

# Load base + adapter
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_model = PeftModel.from_pretrained(base_model, "./lora-adapter")

# Merge
merged_model = lora_model.merge_and_unload()

# Save as standard model
merged_model.save_pretrained("./merged-model")

When to merge:

  • Production deployment (remove LoRA overhead)
  • Sharing model publicly
  • Don't need to swap adapters

When NOT to merge:

  • Multiple adapters per base model
  • Frequently updating adapters
  • Limited storage (adapters are tiny)
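
If you do keep adapters separate, peft can attach several adapters to one base model and switch between them at runtime; a brief sketch (adapter paths and names are illustrative):

from peft import PeftModel

# One frozen base model, several task-specific adapters
model = PeftModel.from_pretrained(base_model, "./lora-medical", adapter_name="medical")
model.load_adapter("./lora-legal", adapter_name="legal")

model.set_adapter("medical")  # Route requests through the medical adapter
# ... generate ...
model.set_adapter("legal")    # Switch adapters without reloading the base model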

Key Takeaways

  • Full fine-tuning updates all parameters - powerful but expensive
  • PEFT methods (LoRA, QLoRA, Adapters) train <1% parameters - efficient
  • LoRA adds low-rank matrices: ~0.1-1% of parameters, with performance close to full fine-tuning
  • QLoRA combines LoRA + 4-bit quantization: train 7B on consumer GPUs
  • Catastrophic forgetting can be mitigated by mixing in general data, lowering the learning rate, and using PEFT methods
  • Dataset quality > quantity: 1K good examples > 10K bad ones
  • Instruction tuning format: Clear instruction/input/output structure
  • Best practices: Monitor overfitting, evaluate qualitatively, use warmup, gradient accumulation

In the next article, we'll explore AI Safety & Alignment - hallucinations, RLHF, and evaluation frameworks for LLMs.


This article is part of the "From Zero to AI Engineer" series - Module 8: Advanced GenAI Topics