Advanced Retrieval Strategies: Optimizing RAG Performance

In the previous article, we covered vector search and embeddings - the foundation of RAG. But simply embedding an entire document and searching over it usually gives poor results. In this article, we'll explore advanced techniques to improve retrieval quality - the factor that makes or breaks a RAG system.

A good RAG system needs to answer: "How do we retrieve the right context, at the right granularity, for the LLM?"

Document Chunking Strategies

Chunking is the process of splitting long documents into smaller pieces for embedding and search.

Why Chunk?

The problem with embedding a whole document:

# A 10,000-word document about "Machine Learning"
# User query: "What is gradient descent?"

# If we embed the whole document:
# - The vector mixes information from every topic in the doc
# - Diluted semantics - "gradient descent" is only ~2% of the content
# - The LLM receives 10,000 words when the answer needs ~200

Benefits of chunking:

  • Focused semantics - each chunk covers one specific topic
  • Efficient token usage - only relevant chunks are passed to the LLM
  • Better retrieval precision

1. Fixed-size Chunking

Split the document into pieces of a fixed number of tokens or characters.

def fixed_size_chunk(text, chunk_size=512, overlap=50):
    """
    Chunk text with fixed size and overlap.
    
    Args:
        chunk_size: Number of characters per chunk
        overlap: Characters overlap between chunks
    """
    chunks = []
    start = 0
    
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        start += chunk_size - overlap  # Overlap to preserve context
    
    return chunks

# Example
text = "Long document..." * 1000
chunks = fixed_size_chunk(text, chunk_size=1000, overlap=100)

What is overlap and why is it needed?

Chunk 1: [...........................]
Chunk 2:              [...........................]
         |--- overlap ---|

# Overlap helps:
# - Avoid cutting sentences/paragraphs mid-way
# - Preserve context continuity between chunks
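
As a quick sanity check (a minimal sketch reusing the fixed_size_chunk function above), the tail of each chunk should reappear at the head of the next:

# Toy text: 300 characters
sample = "abcdefghij" * 30
chunks = fixed_size_chunk(sample, chunk_size=100, overlap=20)

# The last 20 characters of chunk 1 are the first 20 characters of chunk 2
print(chunks[0][-20:] == chunks[1][:20])  # True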

Pros:

  • Simple and predictable
  • Uniform chunk sizes (easy to batch process)

Cons:

  • Doesn't respect semantic boundaries (may cut mid-sentence or mid-paragraph)
  • Poor fit for structured content

When to use: Simple use cases, homogeneous documents

2. Sentence-based Chunking

Split on natural boundaries (sentences).

import nltk
nltk.download('punkt')

def sentence_chunk(text, max_sentences=5):
    """Chunk by sentences."""
    sentences = nltk.sent_tokenize(text)
    
    chunks = []
    current_chunk = []
    
    for sent in sentences:
        current_chunk.append(sent)
        
        if len(current_chunk) >= max_sentences:
            chunks.append(' '.join(current_chunk))
            current_chunk = []
    
    # Add remaining sentences
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    
    return chunks

Pros:

  • Preserve sentence integrity
  • More readable chunks

Cons:

  • Variable chunk sizes (harder to optimize)
  • Doesn't handle very long sentences well (a mitigation is sketched below)
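
One way to mitigate the long-sentence problem (a sketch, not part of the original snippets, reusing nltk from above) is to group sentences by a character budget instead of a fixed sentence count:

def sentence_chunk_by_chars(text, max_chars=1000):
    """Group sentences until a character budget is reached."""
    sentences = nltk.sent_tokenize(text)
    chunks, current, current_len = [], [], 0
    
    for sent in sentences:
        # Close the current chunk if adding this sentence would exceed the budget
        if current and current_len + len(sent) > max_chars:
            chunks.append(' '.join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += len(sent)
    
    if current:
        chunks.append(' '.join(current))
    return chunks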

3. Semantic Chunking

Split based on semantic similarity - start a new chunk when the topic changes.

from sentence_transformers import SentenceTransformer
import numpy as np

def semantic_chunk(text, similarity_threshold=0.5):
    """
    Chunk based on semantic similarity between sentences.
    """
    model = SentenceTransformer('all-MiniLM-L6-v2')
    sentences = nltk.sent_tokenize(text)
    
    # Embed all sentences (normalized so the dot product below is cosine similarity)
    embeddings = model.encode(sentences, normalize_embeddings=True)
    
    chunks = []
    current_chunk = [sentences[0]]
    
    for i in range(1, len(sentences)):
        # Cosine similarity with the previous sentence
        similarity = np.dot(embeddings[i], embeddings[i-1])
        
        if similarity > similarity_threshold:
            # Similar topic - add to current chunk
            current_chunk.append(sentences[i])
        else:
            # Topic changed - start new chunk
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentences[i]]
    
    # Add last chunk
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    
    return chunks

Pros:

  • Topic-coherent chunks
  • Natural boundaries

Cons:

  • Computationally expensive (embed every sentence)
  • Variable chunk sizes

When to use: High-quality requirements, documents with varied topics

4. Recursive Chunking

Split hierarchically, respecting the document structure.

from langchain.text_splitter import RecursiveCharacterTextSplitter

def recursive_chunk(text, chunk_size=1000, chunk_overlap=200):
    """
    Recursively split by different separators.
    Priority: \n\n (paragraphs) → \n (lines) → . (sentences) → space
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    
    chunks = splitter.split_text(text)
    return chunks

# Example
text = """
# Chapter 1: Introduction

Machine learning is a subset of AI.

It enables computers to learn from data.

# Chapter 2: Algorithms

Common algorithms include:
- Linear Regression
- Decision Trees
"""

chunks = recursive_chunk(text)
# Respects headings, paragraphs, lists

Pros:

  • Respects document structure
  • Balances semantic boundaries and size control

Cons:

  • Separators need tuning for each document type

When to use: Structured documents (markdown, code, articles)

5. Specialized Chunking

Code chunking:

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=100
)

# Respects function/class boundaries
code = """
def function1():
    pass

class MyClass:
    def method1(self):
        pass
"""

chunks = python_splitter.split_text(code)

Markdown chunking:

from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
chunks = splitter.split_text(markdown_text)

# Each chunk includes header hierarchy as metadata

Chunk Size Optimization

Trade-offs:

Small chunks (256 tokens):
✅ Precise retrieval
✅ Less noise
❌ May lose context
❌ More chunks = slower search

Large chunks (2048 tokens):
✅ More context
✅ Fewer chunks
❌ Diluted semantics
❌ Expensive LLM processing

Recommended sizes:

  • QA systems: 256-512 tokens (precise answers)
  • Summarization: 1024-2048 tokens (need context)
  • Code: 512-1024 tokens (function-level)

How to find optimal size:

# Experiment with different sizes
# (create_chunks, evaluate_retrieval, test_queries are placeholders for your own pipeline)
chunk_sizes = [256, 512, 1024, 2048]
results = {}

for size in chunk_sizes:
    chunks = create_chunks(documents, chunk_size=size)
    
    # Evaluate retrieval quality
    precision = evaluate_retrieval(chunks, test_queries)
    results[size] = precision

# Plot and choose best

Hybrid Search: Keyword + Semantic

Pure vector search can miss exact matches. Hybrid search combines keyword and semantic retrieval.

BM25 is an evolution of TF-IDF that ranks documents by keyword relevance.

from rank_bm25 import BM25Okapi

# Corpus
documents = [
    "Machine learning is a subset of AI",
    "Deep learning uses neural networks",
    "Python is a programming language"
]

# Tokenize
tokenized_docs = [doc.lower().split() for doc in documents]

# Build BM25 index
bm25 = BM25Okapi(tokenized_docs)

# Query
query = "machine learning AI"
tokenized_query = query.lower().split()

# Get scores
scores = bm25.get_scores(tokenized_query)
print(scores)  # [high, low, low]

BM25 strengths:

  • Exact keyword matching
  • Acronyms, product names, IDs
  • Technical terms

BM25 weaknesses:

  • No semantic understanding
  • Vocabulary mismatch problem (illustrated below)
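
A small illustration of the vocabulary mismatch problem, reusing the BM25Okapi and SentenceTransformer imports from above (the exact similarity value will vary by model):

docs = ["The car broke down on the highway"]
bm25 = BM25Okapi([d.lower().split() for d in docs])

# No shared keyword between query and document → BM25 score is 0
print(bm25.get_scores("automobile repair".lower().split()))  # [0.0]

# An embedding model still links "car" and "automobile"
model = SentenceTransformer('all-MiniLM-L6-v2')
embs = model.encode(["automobile repair", docs[0]], normalize_embeddings=True)
print(float(embs[0] @ embs[1]))  # clearly above 0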

Method 1: Reciprocal Rank Fusion (RRF)

def reciprocal_rank_fusion(bm25_results, vector_results, k=60):
    """
    Combine rankings from multiple sources.
    
    Score for doc_id = Σ(1 / (k + rank_i))
    """
    scores = {}
    
    # Add BM25 scores
    for rank, doc_id in enumerate(bm25_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    
    # Add vector scores
    for rank, doc_id in enumerate(vector_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    
    # Sort by combined score
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, _ in ranked]

# Example
bm25_results = [1, 3, 5]    # doc IDs ranked by BM25
vector_results = [2, 1, 4]  # doc IDs ranked by vector search

combined = reciprocal_rank_fusion(bm25_results, vector_results)
# [1, 2, 3, ...]  # doc 1 appears in both → higher score

Method 2: Weighted Fusion

def weighted_fusion(bm25_scores, vector_scores, alpha=0.5):
    """
    Combine scores with weights.
    
    Final score = alpha * bm25_score + (1-alpha) * vector_score
    """
    combined = {}
    
    all_doc_ids = set(bm25_scores.keys()) | set(vector_scores.keys())
    
    for doc_id in all_doc_ids:
        bm25_score = bm25_scores.get(doc_id, 0)
        vector_score = vector_scores.get(doc_id, 0)
        
        # Scores should be normalized to [0, 1] before combining (see the helper sketched below)
        combined[doc_id] = alpha * bm25_score + (1 - alpha) * vector_score
    
    return sorted(combined.items(), key=lambda x: x[1], reverse=True)
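
The fusion above assumes both score sets are already on the same scale. A minimal min-max normalization sketch, assuming bm25_scores and vector_scores are {doc_id: score} dicts from the two retrievers (min_max_normalize is a hypothetical helper, not a library function):

def min_max_normalize(scores):
    """Scale a {doc_id: score} dict to the [0, 1] range."""
    values = list(scores.values())
    lo, hi = min(values), max(values)
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}  # all scores equal - nothing to distinguish
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

# Normalize before fusing
ranked = weighted_fusion(
    min_max_normalize(bm25_scores),
    min_max_normalize(vector_scores),
    alpha=0.5
)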

Tuning alpha:

  • α = 1.0: Pure keyword search
  • α = 0.5: Balanced
  • α = 0.0: Pure semantic search

Experiment to find optimal:

# evaluate() and ground_truth are placeholders for your own evaluation setup
for alpha in [0.1, 0.3, 0.5, 0.7, 0.9]:
    results = weighted_fusion(bm25_scores, vector_scores, alpha)
    precision = evaluate(results, ground_truth)
    print(f"Alpha {alpha}: Precision {precision}")

Hybrid Search with Vector Databases

Weaviate example:

result = client.query.get("Document", ["text"]) \
    .with_hybrid(
        query="machine learning",
        alpha=0.5  # 0.5 = balanced keyword/vector
    ) \
    .with_limit(10) \
    .do()

Qdrant example:

from qdrant_client import models

# Sketch assuming the collection was created with a sparse vector named "keywords"
# and a dense vector named "vectors"; query_sparse / query_dense are the query
# encoded in each representation.
result = client.query_points(
    collection_name="documents",
    prefetch=[
        models.Prefetch(query=query_sparse, using="keywords", limit=20),
        models.Prefetch(query=query_dense, using="vectors", limit=20)
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),  # Reciprocal Rank Fusion
    limit=10
)

Re-ranking: Improve Top Results

Retrieval returns top-k (e.g., 100) candidates. Re-ranking refines to top-n (e.g., 5) best.

Why Re-rank?

Two-stage retrieval:

Stage 1: Fast retrieval (Bi-encoder)
  ├─ Query embedding: Single forward pass
  ├─ Compare with millions of docs: Fast (dot product)
  └─ Top-100 candidates

Stage 2: Precise re-ranking (Cross-encoder)
  ├─ Query + Doc together: Attention across both
  ├─ More accurate but slower
  └─ Top-5 results

Bi-encoder vs Cross-encoder

Bi-encoder (Retrieval):

# Encode query and document separately (model = a bi-encoder, e.g. SentenceTransformer)
query_emb = model.encode("What is Python?")
doc_emb = model.encode("Python is a programming language")

# Compare
similarity = cosine_similarity(query_emb, doc_emb)

Pros: Fast - can pre-compute doc embeddings
Cons: No interaction between query and doc

Cross-encoder (Re-ranking):

from sentence_transformers import CrossEncoder

model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Encode query + doc together
score = model.predict([
    ("What is Python?", "Python is a programming language"),
    ("What is Python?", "JavaScript is also a language")
])

# score[0] > score[1]

Pros: More accurate - full attention
Cons: Slow - can't pre-compute

Re-ranking Pipeline

import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Models
retriever = SentenceTransformer('all-MiniLM-L6-v2')
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def retrieve_and_rerank(query, documents, top_k=5):
    # Stage 1: Fast retrieval (top-100)
    query_emb = retriever.encode(query)
    doc_embs = retriever.encode(documents)
    
    similarities = [
        cosine_similarity(query_emb, doc_emb) 
        for doc_emb in doc_embs
    ]
    
    # Get top-100 candidates
    top_100_indices = np.argsort(similarities)[-100:][::-1]
    candidates = [documents[i] for i in top_100_indices]
    
    # Stage 2: Precise re-ranking
    pairs = [(query, doc) for doc in candidates]
    rerank_scores = reranker.predict(pairs)
    
    # Get top-k
    top_k_indices = np.argsort(rerank_scores)[-top_k:][::-1]
    final_results = [candidates[i] for i in top_k_indices]
    
    return final_results

Illustrative performance impact:

Retrieval only: ~100ms, 70% accuracy
Retrieval + re-ranking: ~150ms, 85% accuracy

Trade-off: +50ms for +15 points of accuracy → usually worth it!

Cohere Re-rank API

import cohere

co = cohere.Client("your-key")

# Retrieved documents
documents = [
    "Document 1 text...",
    "Document 2 text...",
    # ... 100 documents
]

# Re-rank
results = co.rerank(
    query="What is machine learning?",
    documents=documents,
    top_n=5,
    model="rerank-english-v2.0"
)

# Get top 5 re-ranked
for idx, result in enumerate(results.results):
    print(f"{idx}: {documents[result.index]} (score: {result.relevance_score})")

Query Transformation Techniques

User queries are often not well-suited for retrieval as-is.

1. Query Expansion

Thêm related terms để expand coverage.

from transformers import pipeline

# Use LLM to expand query
def expand_query(query):
    generator = pipeline('text-generation', model='gpt2')
    
    prompt = f"""Expand this search query with related terms:
    Query: {query}
    Expanded:"""
    
    result = generator(prompt, max_new_tokens=30)  # gpt2 is only illustrative; an instruction-tuned model works better
    return result[0]['generated_text']

# Example
original = "machine learning"
expanded = expand_query(original)
# "machine learning artificial intelligence neural networks deep learning"

Classic expansion with synonyms:

import nltk
from nltk.corpus import wordnet
nltk.download('wordnet')

def expand_with_synonyms(query):
    words = query.split()
    expanded = []
    
    for word in words:
        synonyms = wordnet.synsets(word)
        for syn in synonyms[:2]:  # Top 2 synonyms
            expanded.extend(syn.lemma_names())
    
    return ' '.join(set(words + expanded))

# "car" → "car automobile vehicle motorcar"

2. Query Decomposition

Split a complex query into sub-queries.

def decompose_query(complex_query):
    """
    Example:
    "Compare Python and JavaScript for web development"
    →
    [
        "Python for web development",
        "JavaScript for web development",
        "Python vs JavaScript comparison"
    ]
    """
    # Use LLM to decompose
    prompt = f"""Break down this complex query into simpler sub-queries:
    Query: {complex_query}
    Sub-queries:"""
    
    # Call an LLM of your choice (llm is a placeholder client here)
    sub_queries = llm.generate(prompt)
    return sub_queries

# Retrieve for each sub-query and combine results
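
A minimal sketch of that retrieve-and-combine step. It generalizes the reciprocal_rank_fusion idea from the hybrid search section to any number of ranked lists; vector_db.search and embed_model are placeholders, and hits are assumed to expose a doc id:

def fuse_rankings(ranked_lists, k=60):
    """RRF over any number of ranked doc-ID lists."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return [doc_id for doc_id, _ in sorted(scores.items(), key=lambda x: x[1], reverse=True)]

def multi_query_retrieve(sub_queries, vector_db, top_k=5):
    ranked_lists = []
    for sub_query in sub_queries:
        hits = vector_db.search(embed_model.encode(sub_query), top_k=20)
        ranked_lists.append([hit.id for hit in hits])  # keep doc IDs in ranked order
    return fuse_rankings(ranked_lists)[:top_k]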

3. Hypothetical Document Embeddings (HyDE)

Generate hypothetical answer, embed it, use for search.

def hyde_search(query, vector_db):
    # llm and embed_model below are placeholders for your LLM / embedding clients
    # Step 1: Generate a hypothetical document
    prompt = f"""Write a detailed answer to this question:
    Question: {query}
    Answer:"""
    
    hypothetical_doc = llm.generate(prompt)
    
    # Step 2: Embed hypothetical doc
    hyp_embedding = embed_model.encode(hypothetical_doc)
    
    # Step 3: Search with hypothetical embedding
    results = vector_db.search(hyp_embedding, top_k=5)
    
    return results

# Why this works:
# - Query: "What is Python?" → short, generic embedding
# - Hypothetical answer: "Python is a high-level..." → rich, detailed embedding
# - Better matches with actual documents

Key Takeaways

  • Chunking strategies: Fixed-size (simple), Semantic (quality), Recursive (structured docs)
  • Chunk size: 256-512 tokens for QA, 1024-2048 for summarization
  • Hybrid search combines BM25 (keywords) + vector search (semantics) via RRF or weighted fusion
  • Re-ranking with a cross-encoder improves top-result accuracy (illustratively +15 points for +50ms)
  • Query optimization: Expansion (add terms), Decomposition (split complex queries), HyDE (generate a hypothetical doc)
  • Best practice: multi-stage pipeline: Query transformation → Hybrid retrieval → Re-ranking (see the sketch below)
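
To make that pipeline concrete, here is a minimal end-to-end sketch combining the ideas above. It assumes decompose_query returns a list of sub-query strings, reranker is the cross-encoder from the re-ranking section, and hybrid_search is a placeholder for your BM25 + vector fusion step:

def rag_retrieve(query, documents, top_k=5):
    # 1. Query transformation (optional decomposition into sub-queries)
    queries = [query] + decompose_query(query)
    
    # 2. Hybrid retrieval for each query, then de-duplicate candidates
    candidates = []
    for q in queries:
        candidates.extend(hybrid_search(q, documents, limit=20))
    candidates = list(dict.fromkeys(candidates))
    
    # 3. Cross-encoder re-ranking of the merged candidate pool
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]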

In the next article, we'll explore LLM Orchestration Frameworks - LangChain, LlamaIndex, and how to build complex RAG applications.


This article is part of the series "From Zero to AI Engineer" - Module 7: RAG (Retrieval-Augmented Generation)