Vector Search & Embeddings: The Foundation of RAG Systems

When you search for "dog" on Google, it also understands "puppy", "canine", "pet". When you ask a chatbot "How do I reset my password?", it matches a document titled "Password Recovery Guide" even though the word "reset" never appears. Behind all of this are vector embeddings and semantic search.

RAG (Retrieval-Augmented Generation) - the technique that lets LLMs access a knowledge base - relies entirely on vector search. A solid understanding of embeddings and vector search is the first step toward building effective RAG systems.

What Are Embeddings?

An embedding represents text (or images, audio) as a vector (an array of numbers) in a high-dimensional space, in a way that preserves semantic meaning.

From Text to Numbers

"The cat sat on the mat"
         ↓ Embedding Model
[0.234, -0.567, 0.891, ..., 0.123]  # A 768-dimensional vector (for example)

Why do we need embeddings?

  • Computers don't understand text, only numbers
  • Embeddings capture semantic meaning
  • They let us compare similarity mathematically

Semantic Similarity

Texts with similar meanings → vectors that are close together in the space.

# Example with sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "The cat sleeps on the couch",
    "A feline rests on the sofa",      # Semantic similar
    "Python is a programming language"  # Semantic different
]

embeddings = model.encode(sentences)

# embeddings[0] and embeddings[1] are very close to each other
# embeddings[0] and embeddings[2] are far apart

Dense vs Sparse Vectors

Sparse vectors: most values are 0 (for example, Bag of Words)

# Vocabulary: ["cat", "dog", "fish", "bird"]
"cat and dog" → [1, 1, 0, 0]  # Chỉ có 2 non-zero values

Dense vectors: most values are non-zero (embeddings are usually dense)

"cat and dog" → [0.234, -0.567, 0.891, 0.123, ...]  # Tất cả đều có giá trị

Trade-offs:

  • Sparse: efficient storage, but weak at capturing semantics
  • Dense: better semantics, but takes more space (see the sketch below)
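
To make the contrast concrete, here is a small sketch comparing a Bag-of-Words sparse representation (via scikit-learn's CountVectorizer) with a dense embedding from the sentence-transformers model used above; shapes and outputs are illustrative.

from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer

texts = ["cat and dog", "a feline and a canine"]

# Sparse: Bag of Words - one dimension per vocabulary word, mostly zeros
bow = CountVectorizer().fit_transform(texts)
print(bow.shape, bow.toarray())  # the two texts only overlap on "and"

# Dense: every dimension carries a value, and synonyms end up close together
model = SentenceTransformer('all-MiniLM-L6-v2')
dense = model.encode(texts)
print(dense.shape)  # (2, 384)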

Embedding Models

1. Word Embeddings - Word2Vec, GloVe

One embedding per individual word.

# Word2Vec example (conceptual)
"king" → [0.2, 0.5, 0.8, ...]
"queen" → [0.25, 0.48, 0.75, ...]
"man" → [-0.1, 0.3, 0.6, ...]
"woman" → [-0.08, 0.28, 0.55, ...]

# Famous example: king - man + woman ≈ queen
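
The analogy can be checked directly with gensim's pretrained word vectors. A minimal sketch, assuming the glove-wiki-gigaword-50 vectors (downloaded by gensim on first use):

import gensim.downloader as api

# Load pretrained GloVe word vectors (downloaded on first call)
glove = api.load("glove-wiki-gigaword-50")

# king - man + woman ≈ ?
result = glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically [('queen', ...)] with these vectors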

Limitations:

  • Cannot handle polysemy (words with multiple meanings)
    • "bank" (financial institution) vs "bank" (riverbank) → same vector
  • No context awareness

2. Contextual Embeddings - BERT, RoBERTa

The embedding depends on the context.

# "bank" có vectors khác nhau tùy context
sentence1 = "I deposited money at the bank"
sentence2 = "We sat by the river bank"

# BERT produces different vectors for "bank" in these two sentences

How it works:

  • Input: the full sentence/paragraph
  • Output: one vector per token, based on its context
  • The [CLS] token embedding is often used as the sentence representation (see the sketch below)
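
A minimal sketch of this with Hugging Face transformers and bert-base-uncased (model and library choices assumed here, not prescribed by the article): extract the contextual vector of the token "bank" in each sentence and compare them.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "I deposited money at the bank",
    "We sat by the river bank",
]

bank_vectors = []
for sent in sentences:
    inputs = tokenizer(sent, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # last_hidden_state: (1, num_tokens, 768) - one contextual vector per token
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    bank_vectors.append(outputs.last_hidden_state[0, tokens.index("bank")])

# The two "bank" vectors differ because their contexts differ
sim = torch.nn.functional.cosine_similarity(bank_vectors[0], bank_vectors[1], dim=0)
print(sim.item())  # noticeably below 1.0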

3. Sentence Embeddings - Sentence-BERT, MPNet

Optimized to produce high-quality sentence-level embeddings.

from sentence_transformers import SentenceTransformer

# Popular models
model = SentenceTransformer('all-MiniLM-L6-v2')      # Small, fast
# model = SentenceTransformer('all-mpnet-base-v2')   # Better quality
# model = SentenceTransformer('all-MiniLM-L12-v2')   # Balanced

text = "Artificial intelligence is transforming healthcare"
embedding = model.encode(text)

print(embedding.shape)  # (384,) - a 384-dimensional vector

Sentence-BERT architecture:

Input Sentence
     ↓
BERT Encoder
     ↓
Pooling (mean/max/CLS)
     ↓
Dense Vector (sentence embedding)
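
The pooling step can be sketched in a few lines. This assumes we already have per-token vectors from the encoder (random placeholders below); mean pooling averages the real tokens into one fixed-size sentence vector.

import torch

# Placeholder token vectors from a BERT-style encoder: (num_tokens, hidden_size)
token_embeddings = torch.randn(12, 768)
attention_mask = torch.ones(12)  # 1 = real token, 0 = padding

# Mean pooling over the non-padding tokens only
mask = attention_mask.unsqueeze(-1)                               # (12, 1)
sentence_embedding = (token_embeddings * mask).sum(dim=0) / mask.sum()
print(sentence_embedding.shape)  # torch.Size([768])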

4. Specialized Embeddings

OpenAI Embeddings:

from openai import OpenAI

client = OpenAI(api_key="your-key")

response = client.embeddings.create(
    model="text-embedding-3-small",  # 1536 dimensions
    input="Your text here"
)

embedding = response.data[0].embedding

Models:

  • text-embedding-3-small: 1536D, fast, cheap
  • text-embedding-3-large: 3072D, best quality
  • text-embedding-ada-002: Legacy, still good

Cohere Embeddings:

import cohere

co = cohere.Client("your-key")

response = co.embed(
    texts=["Text 1", "Text 2"],
    model="embed-english-v3.0",
    input_type="search_document"  # or "search_query"
)

embeddings = response.embeddings

The input types matter:

  • search_document: for documents being indexed
  • search_query: for user queries
  • This asymmetric optimization improves search quality (see the sketch below)
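
A sketch of the asymmetric usage, reusing the Cohere client from above: documents are embedded with input_type="search_document", the query with input_type="search_query", and results are ranked by cosine similarity (the document titles are illustrative).

import numpy as np
import cohere

co = cohere.Client("your-key")

docs = ["Password Recovery Guide", "Team holiday schedule"]
doc_embs = co.embed(
    texts=docs, model="embed-english-v3.0", input_type="search_document"
).embeddings

query_emb = co.embed(
    texts=["How do I reset my password?"],
    model="embed-english-v3.0",
    input_type="search_query",
).embeddings[0]

# Rank documents by cosine similarity to the query
for doc, emb in zip(docs, doc_embs):
    sim = np.dot(query_emb, emb) / (np.linalg.norm(query_emb) * np.linalg.norm(emb))
    print(f"{doc}: {sim:.4f}")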

Similarity Measures

How do we know whether two vectors are "close"?

1. Cosine Similarity (most common)

Measures the angle between two vectors.

cosine_similarity = (A · B) / (||A|| × ||B||)

Range: [-1, 1]
  1: Identical direction
  0: Unrelated (orthogonal)
 -1: Exactly opposite

Implementation:

import numpy as np

def cosine_similarity(v1, v2):
    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    return dot_product / (norm_v1 * norm_v2)

# Example
vec1 = np.array([1, 2, 3])
vec2 = np.array([2, 4, 6])  # Same direction, 2x magnitude

similarity = cosine_similarity(vec1, vec2)
print(similarity)  # 1.0 (perfect match)

Advantages:

  • Unaffected by magnitude (only the direction matters)
  • Well-suited to text embeddings

When to use: text search, document similarity

2. Euclidean Distance

Measures the straight-line distance between two points.

euclidean_distance = √(Σ(ai - bi)²)

Range: [0, ∞]
  0: Identical
  Larger: More different

Implementation:

def euclidean_distance(v1, v2):
    return np.sqrt(np.sum((v1 - v2) ** 2))

# Or use scipy
from scipy.spatial.distance import euclidean

distance = euclidean(vec1, vec2)

When to use: image embeddings, spatial data

3. Dot Product

The simplest similarity measure - multiply element-wise and sum.

dot_product = Σ(ai × bi)

Range: [-∞, ∞]

Implementation:

similarity = np.dot(v1, v2)

Note: if the vectors are normalized (||v|| = 1), the dot product equals cosine similarity!

# Normalize vectors
v1_norm = v1 / np.linalg.norm(v1)
v2_norm = v2 / np.linalg.norm(v2)

# Now dot product = cosine similarity
np.isclose(np.dot(v1_norm, v2_norm), cosine_similarity(v1, v2))  # True (up to floating-point error)

Advantage: the fastest
When to use: when vectors are already normalized and speed matters

Comparing Them in Practice

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

query = "How to train a machine learning model?"
documents = [
    "Machine learning model training guide",
    "Cooking pasta recipes",
    "Deep learning tutorial for beginners"
]

# Encode
query_emb = model.encode(query)
doc_embs = model.encode(documents)

# Calculate similarities
for i, doc in enumerate(documents):
    cos_sim = util.cos_sim(query_emb, doc_embs[i])
    print(f"{doc}: {cos_sim.item():.4f}")

# Output:
# Machine learning model training guide: 0.7234
# Cooking pasta recipes: 0.1256
# Deep learning tutorial for beginners: 0.6891

Vector Search Algorithms

With millions of vectors, how do we quickly find the k nearest ones?

Brute-force k-NN (k-Nearest Neighbors)

Compare the query against every vector.

def knn_search(query, vectors, k=5):
    similarities = []
    
    for vec in vectors:
        sim = cosine_similarity(query, vec)
        similarities.append(sim)
    
    # Get top-k indices
    top_k_indices = np.argsort(similarities)[-k:][::-1]
    return top_k_indices

# Complexity: O(n) - linear scan

Advantage: 100% accurate (exact search)
Disadvantage: too slow for large datasets (millions of vectors)
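
The loop above can be vectorized with NumPy - still O(n), but with a much smaller constant. A sketch:

import numpy as np

def knn_search_vectorized(query, vectors, k=5):
    # Normalize rows so one matrix-vector product yields all cosine similarities
    vectors_norm = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    similarities = vectors_norm @ query_norm    # shape (n,)
    return np.argsort(similarities)[-k:][::-1]  # indices of the top-k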

Approximate Nearest Neighbors (ANN)

Trade a little accuracy for a huge gain in speed.

For example:

  • Exact: finds 100% of the true top-5
  • ANN: finds ~95% of the true top-5, but is 100-1000x faster

1. HNSW (Hierarchical Navigable Small World)

Builds a graph structure with multiple layers.

Layer 2: [Few nodes, long range connections]
Layer 1: [More nodes, medium range]
Layer 0: [All nodes, short range]

How it works:

  1. Start at the highest layer (coarse search)
  2. Find an entry point close to the query
  3. Move down to lower layers, refining the search
  4. At layer 0, find the k nearest neighbors

Advantages:

  • Very fast (sub-millisecond queries)
  • Good recall (>95%)
  • Efficient memory usage

Disadvantages:

  • Building the index takes time
  • Not update-friendly (rebuilds may be needed)

Used by: FAISS, Hnswlib, Milvus, Weaviate
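
A minimal hnswlib sketch (parameter values are illustrative, not recommendations): M controls how many neighbors each node keeps, and ef controls how much of the graph is explored at query time.

import hnswlib
import numpy as np

dim, num_elements = 384, 100000
data = np.random.random((num_elements, dim)).astype('float32')

index = hnswlib.Index(space='cosine', dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))

index.set_ef(50)  # higher ef = better recall, slower queries (must be >= k)
labels, distances = index.knn_query(data[:1], k=5)
print(labels, distances)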

2. IVF (Inverted File Index)

Divides the vector space into clusters.

1. Cluster the vectors into N groups (e.g., 1000 clusters)
2. For each query, find the M nearest clusters (e.g., 10 clusters)
3. Search only within those M clusters

Advantages:

  • Dramatically reduces the search space
  • Scalable

Disadvantages:

  • Lower recall than HNSW if not tuned well

Used by: FAISS (IVF-Flat, IVF-PQ)
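
A sketch of those steps in FAISS (random vectors stand in for real embeddings): the index is trained to find the cluster centroids, and nprobe is the number of nearest clusters searched per query.

import faiss
import numpy as np

dimension, num_vectors = 384, 100000
vectors = np.random.random((num_vectors, dimension)).astype('float32')

quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, 1000)  # 1000 clusters

index.train(vectors)  # k-means clustering of the vector space
index.add(vectors)

index.nprobe = 10     # search only the 10 nearest clusters per query
query = np.random.random((1, dimension)).astype('float32')
distances, indices = index.search(query, 5)
print(indices[0])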

3. Product Quantization (PQ)

Compresses vectors to reduce memory.

Original: [768 dimensions × 4 bytes] = 3072 bytes
After PQ: [96 codes × 1 byte] = 96 bytes (32x smaller!)

How it works:

  1. Split each vector into subvectors (e.g., 768D → 96 subvectors of 8D each)
  2. Cluster each subvector position into 256 centroids
  3. Store the centroid ID (1 byte) instead of the original values

Trade-off: less memory and faster search, at some cost in accuracy

Used by: FAISS (PQ, IVF-PQ)

FAISS Example

import faiss
import numpy as np

# Sample data
dimension = 384
num_vectors = 100000

# Generate random vectors (in practice, use real embeddings)
vectors = np.random.random((num_vectors, dimension)).astype('float32')

# Build HNSW index
index = faiss.IndexHNSWFlat(dimension, 32)  # 32 = M (neighbors per node)
index.add(vectors)

# Search
query = np.random.random((1, dimension)).astype('float32')
k = 5

distances, indices = index.search(query, k)

print(f"Top {k} nearest neighbors:")
print(f"Indices: {indices[0]}")
print(f"Distances: {distances[0]}")

FAISS index types:

# Exact search (brute force)
index = faiss.IndexFlatL2(dimension)

# HNSW
index = faiss.IndexHNSWFlat(dimension, 32)

# IVF (must be trained with index.train(vectors) before index.add())
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, 100)  # 100 clusters

# IVF + PQ (memory efficient, also requires training)
index = faiss.IndexIVFPQ(quantizer, dimension, 100, 8, 8)  # 8 subvectors, 8 bits each

Vector Database Architecture

Vector databases are optimized specifically for vector search.

Key components:

┌─────────────────────────────────────┐
│         Vector Database             │
├─────────────────────────────────────┤
│ 1. Storage Layer                    │
│    - Vectors                        │
│    - Metadata (text, tags, etc.)    │
│                                     │
│ 2. Index Layer (HNSW/IVF)           │
│    - Optimize search                │
│                                     │
│ 3. Query Engine                     │
│    - Vector search                  │
│    - Hybrid search (vector + filters)│
│                                     │
│ 4. API Layer                        │
│    - CRUD operations                │
│    - Search endpoints               │
└─────────────────────────────────────┘

1. Pinecone (Managed, Cloud-only)

import pinecone

# Initialize
pinecone.init(api_key="your-key", environment="us-west1-gcp")

# Create index
pinecone.create_index(
    name="my-index",
    dimension=384,
    metric="cosine"
)

# Connect
index = pinecone.Index("my-index")

# Upsert vectors
vectors = [
    ("id1", [0.1, 0.2, ...], {"text": "Document 1"}),
    ("id2", [0.3, 0.4, ...], {"text": "Document 2"})
]
index.upsert(vectors)

# Search
results = index.query(
    vector=[0.1, 0.2, ...],
    top_k=5,
    include_metadata=True
)

2. Weaviate (Open source, self-hosted)

import weaviate

client = weaviate.Client("http://localhost:8080")

# Create schema
schema = {
    "class": "Document",
    "vectorizer": "text2vec-openai",
    "properties": [
        {"name": "text", "dataType": ["text"]}
    ]
}
client.schema.create_class(schema)

# Add data (auto-vectorizes)
client.data_object.create(
    class_name="Document",
    data_object={"text": "Machine learning tutorial"}
)

# Search
result = client.query.get("Document", ["text"]) \
    .with_near_text({"concepts": ["AI training"]}) \
    .with_limit(5) \
    .do()

3. Qdrant (Open source, Rust-based)

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient("localhost", port=6333)

# Create collection
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE)
)

# Insert
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=1,
            vector=[0.1, 0.2, ...],
            payload={"text": "Document 1"}
        )
    ]
)

# Search
results = client.search(
    collection_name="documents",
    query_vector=[0.1, 0.2, ...],
    limit=5
)

4. ChromaDB (Embedded, dev-friendly)

import chromadb

client = chromadb.Client()

# Create collection
collection = client.create_collection("documents")

# Add documents (auto-embeds)
collection.add(
    documents=["Document 1", "Document 2"],
    ids=["id1", "id2"]
)

# Query
results = collection.query(
    query_texts=["machine learning"],
    n_results=5
)
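
ChromaDB also supports the "hybrid search (vector + filters)" idea from the architecture above via metadata filters. A small sketch (collection name and metadata keys are illustrative):

import chromadb

client = chromadb.Client()
collection = client.create_collection("filtered_docs")

collection.add(
    documents=["Intro to neural networks", "Pasta carbonara recipe"],
    metadatas=[{"topic": "ml"}, {"topic": "cooking"}],
    ids=["doc1", "doc2"]
)

# Vector search restricted to documents whose metadata matches the filter
results = collection.query(
    query_texts=["machine learning"],
    n_results=5,
    where={"topic": "ml"}
)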

Which Vector Database Should You Choose?

  • Pinecone - Best for: production when you don't want to manage infrastructure. Pros: fully managed, easy scaling. Cons: cost, vendor lock-in.
  • Weaviate - Best for: flexible schemas, hybrid search. Pros: rich features, good docs. Cons: setup complexity.
  • Qdrant - Best for: performance-critical, self-hosted deployments. Pros: fast (Rust), strong filtering. Cons: newer ecosystem.
  • ChromaDB - Best for: prototyping, embedded apps. Pros: super easy, no server needed. Cons: not for large scale.
  • FAISS - Best for: DIY, custom solutions. Pros: maximum control, free. Cons: no built-in persistence.

Key Takeaways

  • Embeddings convert text to dense vectors capturing semantic meaning
  • Popular models: Sentence-BERT (open source), OpenAI embeddings (commercial)
  • Similarity measures: cosine (text), Euclidean (images), dot product (fast)
  • Exact k-NN is accurate but slow: O(n)
  • ANN algorithms (HNSW, IVF) trade a little accuracy for speed (100-1000x faster)
  • Vector databases optimize the storage, indexing, and querying of vectors
  • Choose a database based on: scale, features, managed vs. self-hosted

In the next article, we will explore Advanced Retrieval Strategies - chunking, hybrid search, re-ranking, and query optimization techniques for RAG systems.


This article is part of the series "From Zero to AI Engineer" - Module 7: RAG (Retrieval-Augmented Generation)