LLM Orchestration Frameworks: Build Complex AI Applications

When you start building RAG applications, managing prompt templates, retrieval logic, memory, and error handling can become complex very quickly. LLM orchestration frameworks exist to abstract this complexity away, letting developers focus on business logic instead of low-level implementation.

In this post, we'll explore the two most popular frameworks - LangChain and LlamaIndex - along with patterns for building production-ready LLM applications.

LLM Application Architecture

Before diving into the frameworks, let's look at the high-level architecture:

┌─────────────────────────────────────────┐
│         LLM Application                 │
├─────────────────────────────────────────┤
│                                         │
│  ┌──────────┐      ┌──────────────┐   │
│  │  User    │──────▶│   Prompt     │   │
│  │  Query   │      │  Engineering  │   │
│  └──────────┘      └──────────────┘   │
│                           │            │
│                           ▼            │
│  ┌──────────┐      ┌──────────────┐   │
│  │ Vector   │◀─────│  Retrieval   │   │
│  │   DB     │      │    Logic     │   │
│  └──────────┘      └──────────────┘   │
│                           │            │
│                           ▼            │
│  ┌──────────┐      ┌──────────────┐   │
│  │  Memory  │◀────▶│     LLM      │   │
│  │  Store   │      │    Calling   │   │
│  └──────────┘      └──────────────┘   │
│                           │            │
│                           ▼            │
│                    ┌──────────────┐   │
│                    │   Response   │   │
│                    │  Processing  │   │
│                    └──────────────┘   │
│                                         │
└─────────────────────────────────────────┘

Core Abstractions

1. Chains - Sequential Operations

A Chain is a sequence of operations in which the output of one step becomes the input of the next.

Example workflow:

User Query
    ↓
Retrieval (get relevant docs)
    ↓
Prompt Construction (format docs + query)
    ↓
LLM Call
    ↓
Response
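
As a mental model, here is a minimal, framework-free sketch of that workflow; retrieve_docs, build_prompt, and call_llm are hypothetical placeholders standing in for real retrieval, templating, and API-call code:

def retrieve_docs(query):           # hypothetical: vector search over your corpus
    return ["doc snippet 1", "doc snippet 2"]

def build_prompt(query, docs):      # hypothetical: prompt templating
    context = "\n".join(docs)
    return f"Answer using this context:\n{context}\n\nQuestion: {query}"

def call_llm(prompt):               # hypothetical: wraps your LLM provider's API
    return "LLM answer..."

def simple_chain(query):
    docs = retrieve_docs(query)         # Retrieval
    prompt = build_prompt(query, docs)  # Prompt construction
    return call_llm(prompt)             # LLM call → response

print(simple_chain("What is machine learning?"))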

2. Routers - Conditional Logic

A Router decides which path the flow takes based on the input.

User Query
    ↓
Router
    ├─→ SQL Chain (if database query)
    ├─→ RAG Chain (if knowledge retrieval)
    └─→ Math Chain (if calculation)
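
A minimal sketch of the same idea in plain Python; classify_query and the three chains are hypothetical stand-ins (in practice, the classification step is often itself an LLM call):

def classify_query(query):                  # hypothetical classifier
    if "+" in query or "average" in query:
        return "math"
    if "SELECT" in query.upper() or "table" in query:
        return "sql"
    return "rag"

def route(query):
    chains = {                              # hypothetical downstream chains
        "sql": lambda q: f"SQL chain handles: {q}",
        "rag": lambda q: f"RAG chain handles: {q}",
        "math": lambda q: f"Math chain handles: {q}",
    }
    return chains[classify_query(query)](query)

print(route("What is 2 + 2?"))   # → routed to the Math chain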

3. Memory - Conversation State

Memory stores the conversation history so the application can maintain context across turns.

Turn 1: "What is Python?"
    → LLM: "Python is a programming language..."

Turn 2: "What are its benefits?" 
    → Memory: Include Turn 1 context
    → LLM understands "its" refers to Python
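
Conceptually, memory just means prepending stored turns to the next prompt; a tiny sketch with a hypothetical call_llm placeholder:

history = []   # conversation memory

def call_llm(prompt):               # hypothetical LLM call
    return "Python is easy to learn, widely used, ..."

def chat(user_message):
    # Include previous turns so the model can resolve references like "its"
    context = "\n".join(history)
    prompt = f"{context}\nUser: {user_message}\nAssistant:"
    answer = call_llm(prompt)
    history.append(f"User: {user_message}")
    history.append(f"Assistant: {answer}")
    return answer

chat("What is Python?")
chat("What are its benefits?")   # prompt now contains Turn 1, so "its" = Python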

LangChain - The Swiss Army Knife

LangChain is a general-purpose framework for building LLM applications.

Basic Example - Simple Chain

from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# 1. Setup LLM
llm = OpenAI(temperature=0.7)

# 2. Create prompt template
prompt = PromptTemplate(
    input_variables=["product"],
    template="What is a good name for a company that makes {product}?"
)

# 3. Create chain
chain = LLMChain(llm=llm, prompt=prompt)

# 4. Run
result = chain.run("eco-friendly water bottles")
print(result)  # "AquaPure Eco Bottles"

RAG with LangChain

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

# 1. Load documents
loader = TextLoader("docs.txt")
documents = loader.load()

# 2. Split into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

# 3. Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# 4. Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",  # "stuff" = put all docs in prompt
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3})
)

# 5. Query
query = "What is machine learning?"
answer = qa_chain.run(query)
print(answer)

Chain Types in RAG

1. Stuff Chain

# Put ALL retrieved docs into single prompt
chain_type="stuff"

# Pros: Simple, best quality (LLM sees all context)
# Cons: Limited by context window
# Use when: < 4-5 documents

2. Map-Reduce Chain

chain_type="map_reduce"

# Process:
# 1. Map: Summarize each doc individually
# 2. Reduce: Combine summaries into final answer

# Pros: Handle many documents
# Cons: Multiple LLM calls (slower, expensive)
# Use when: Many documents, need comprehensive answer

3. Refine Chain

chain_type="refine"

# Process:
# 1. Answer based on doc 1
# 2. Refine answer with doc 2
# 3. Refine again with doc 3...

# Pros: Iterative improvement
# Cons: Sequential (slow), early docs have more influence
# Use when: Documents have a natural order

4. Map-Rerank Chain

chain_type="map_rerank"

# Process:
# 1. Score each doc independently
# 2. Return answer from highest-scored doc

# Pros: Fast, good for QA
# Cons: Only uses single "best" doc
# Use when: Answer likely in one doc
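
Switching strategies is just the chain_type argument. For example, a map_reduce variant of the earlier RetrievalQA setup might look like this (reusing the vectorstore built above; the larger k is illustrative, since the context window is no longer the limit):

qa_map_reduce = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="map_reduce",   # summarize each doc, then combine the summaries
    retriever=vectorstore.as_retriever(search_kwargs={"k": 10})
)

answer = qa_map_reduce.run("Summarize everything about machine learning")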

Memory Management

ConversationBufferMemory - Store all messages:

from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain

memory = ConversationBufferMemory()

conversation = ConversationChain(
    llm=OpenAI(),
    memory=memory
)

conversation.predict(input="Hi, I'm John")
# "Hello John! How can I help you?"

conversation.predict(input="What's my name?")
# "Your name is John!"

ConversationBufferWindowMemory - Keep last K messages:

from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(k=2)

# Only remembers last 2 interactions
# Pros: Limit token usage
# Cons: Forget older context

ConversationSummaryMemory - Summarize old messages:

from langchain.memory import ConversationSummaryMemory

memory = ConversationSummaryMemory(llm=OpenAI())

# Automatically summarizes old messages
# Pros: Compress long conversations
# Cons: May lose details

VectorStoreRetrieverMemory - Semantic search over history:

from langchain.memory import VectorStoreRetrieverMemory

memory = VectorStoreRetrieverMemory(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 2})
)

# Retrieves relevant past messages, not just recent
# Good for: Long conversations, topic switching

Agents - Dynamic Tool Use

Agents decide which tools to use based on the query.

from langchain.agents import load_tools, initialize_agent, AgentType
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)

# Load tools
tools = load_tools(
    ["serpapi", "llm-math"],  # Google search + calculator
    llm=llm
)

# Initialize agent
agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# Query
agent.run("What is the square root of the population of France?")

# Agent reasoning:
# 1. "I need to find population of France" → Use Google search
# 2. "Now I need to calculate square root" → Use calculator
# 3. Return final answer

Agent types:

  • ZERO_SHOT_REACT_DESCRIPTION: Choose tool based on description
  • CONVERSATIONAL_REACT_DESCRIPTION: With memory (see the sketch after this list)
  • STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION: For complex tools
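
For example, the conversational variant pairs the same ReAct loop with chat memory; a minimal sketch reusing the tools and llm defined above (memory_key="chat_history" is the key this agent type expects):

from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="chat_history")

chat_agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.CONVERSATIONAL_REACT_DESCRIPTION,
    memory=memory,
    verbose=True
)

chat_agent.run("My name is John. What is 15% of 200?")
chat_agent.run("What's my name?")   # memory lets the agent answer "John"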

Custom Tools

from langchain.tools import Tool

def search_database(query):
    # Your custom logic
    return f"Database results for: {query}"

database_tool = Tool(
    name="Database Search",
    func=search_database,
    description="Useful for searching company database. Input: search query"
)

tools = [database_tool]
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION)

LlamaIndex - Specialized for RAG

LlamaIndex (formerly GPT Index) is optimized specifically for data indexing and retrieval.

Basic RAG with LlamaIndex

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import OpenAI

# 1. Load documents
documents = SimpleDirectoryReader("data/").load_data()

# 2. Configure LLM
llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
service_context = ServiceContext.from_defaults(llm=llm)

# 3. Create index
index = VectorStoreIndex.from_documents(
    documents,
    service_context=service_context
)

# 4. Query
query_engine = index.as_query_engine()
response = query_engine.query("What is machine learning?")
print(response)

Advanced Indexing Strategies

1. List Index - Sequential scan:

from llama_index import ListIndex

index = ListIndex.from_documents(documents)

# Use when: Small datasets, need to check all docs

2. Tree Index - Hierarchical summarization:

from llama_index import TreeIndex

index = TreeIndex.from_documents(documents)

# Builds tree: leaf nodes = chunks, parent nodes = summaries
# Good for: Summarization tasks

3. Keyword Table Index - Keyword extraction:

from llama_index import KeywordTableIndex

index = KeywordTableIndex.from_documents(documents)

# Extracts keywords from each chunk
# Good for: Keyword-based retrieval

4. Knowledge Graph Index:

from llama_index import KnowledgeGraphIndex

index = KnowledgeGraphIndex.from_documents(documents)

# Extracts entities and relationships
# Good for: Complex relationship queries

Response Synthesis

LlamaIndex offers flexible response modes:

query_engine = index.as_query_engine(
    response_mode="compact"  # Options: compact, tree_summarize, refine, etc.
)

# compact: Concatenate chunks, call LLM once
# tree_summarize: Build tree, summarize bottom-up
# refine: Iteratively refine answer with each chunk

Query Transformations

from llama_index.indices.query.query_transform import HyDEQueryTransform
from llama_index.query_engine import TransformQueryEngine

# HyDE: generate a hypothetical answer document and retrieve with its embedding
hyde = HyDEQueryTransform(include_original=True)

# Wrap the base query engine so HyDE is applied automatically before retrieval
query_engine = index.as_query_engine()
hyde_query_engine = TransformQueryEngine(query_engine, hyde)

response = hyde_query_engine.query("What is machine learning?")

Composability - Combine Multiple Indexes

from llama_index import ComposableGraph

# Multiple indexes for different data sources
vector_index1 = VectorStoreIndex.from_documents(docs1)
vector_index2 = VectorStoreIndex.from_documents(docs2)

# Compose into graph
graph = ComposableGraph.from_indices(
    ListIndex,
    [vector_index1, vector_index2],
    index_summaries=["Summary of docs1", "Summary of docs2"]
)

# Query decomposition across indexes
query_engine = graph.as_query_engine()
response = query_engine.query("Compare information from both sources")

LangChain vs LlamaIndex

Feature            LangChain                      LlamaIndex
-----------------  -----------------------------  ---------------------------
Primary Focus      General LLM apps               RAG / data indexing
Flexibility        Very flexible, many options    Opinionated, best practices
Learning Curve     Steeper                        Gentler
RAG Performance    Good                           Excellent
Agent Support      Excellent                      Limited
Memory Options     Many types                     Basic
Data Connectors    ~100                           ~160 (LlamaHub)
Use When           Complex workflows, agents      RAG-focused apps

Complementary use:

# Use LlamaIndex for indexing
from llama_index import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)

# Wrap it as a retriever LangChain can consume
from langchain.retrievers import LlamaIndexRetriever

retriever = LlamaIndexRetriever(index=index)

# Use in LangChain chain
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever
)

Production Patterns

1. Error Handling

from langchain.callbacks import get_openai_callback
import logging

def safe_query(chain, query, max_retries=3):
    for attempt in range(max_retries):
        try:
            with get_openai_callback() as cb:
                response = chain.run(query)

                # Log token usage
                logging.info(f"Tokens used: {cb.total_tokens}")

                return response
        except Exception as e:
            logging.error(f"Attempt {attempt + 1} failed: {e}")

    # All retries exhausted - return a graceful fallback
    return "Unable to process query. Please try again."

2. Caching

from langchain.cache import InMemoryCache, SQLiteCache
import langchain

# In-memory cache
langchain.llm_cache = InMemoryCache()

# Persistent cache
langchain.llm_cache = SQLiteCache(database_path=".langchain.db")

# Same query won't call LLM twice

3. Streaming Responses

from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

llm = OpenAI(
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()]
)

# Prints response token by token

4. Cost Tracking

from langchain.callbacks import get_openai_callback

with get_openai_callback() as cb:
    result = chain.run("query")
    
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Total Cost (USD): ${cb.total_cost}")

5. Evaluation

from llama_index.evaluation import FaithfulnessEvaluator

evaluator = FaithfulnessEvaluator(service_context=service_context)

# Evaluate response quality: is the answer grounded in the retrieved context?
response = query_engine.query("What is ML?")
eval_result = evaluator.evaluate_response(response=response)

print(f"Faithfulness: {eval_result.passing}")  # True/False

Best Practices

Start simple, iterate

# Start with basic RAG
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()

# Then add: chunking strategy, re-ranking, hybrid search, etc.

Monitor token usage

# Always wrap expensive calls
with get_openai_callback() as cb:
    response = chain.run(query)
    if cb.total_cost > 0.10:  # $0.10 threshold
        alert_high_cost()

Use appropriate memory type

# Short conversations: ConversationBufferMemory
# Long conversations: ConversationSummaryMemory
# Topic-based: VectorStoreRetrieverMemory

Implement fallbacks

from langchain.llms import OpenAI, Anthropic

llm = OpenAI()
fallback_llm = Anthropic()

try:
    response = llm(prompt)
except Exception:
    response = fallback_llm(prompt)

Version control prompts

# Store prompts in files, not hardcoded
with open("prompts/qa_prompt.txt") as f:
    prompt_template = f.read()

# Easy to iterate without code changes
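
The loaded text can then be plugged into a PromptTemplate as usual; a small sketch reusing the OpenAI LLM import from earlier (from_template infers input variables from the {placeholders} in the file):

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

prompt = PromptTemplate.from_template(prompt_template)
qa_chain = LLMChain(llm=OpenAI(), prompt=prompt)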

Key Takeaways

  • Orchestration frameworks abstract complexity: chains, memory, tools
  • LangChain: General-purpose, flexible, good for agents and complex workflows
  • LlamaIndex: RAG-specialized, easier learning curve, excellent data connectors
  • Core abstractions: Chains (sequential), Routers (conditional), Memory (state)
  • Memory types: Buffer (all), Window (recent), Summary (compressed), Vector (semantic)
  • Production considerations: Error handling, caching, streaming, cost tracking
  • Best practice: Start simple, monitor costs, version prompts, implement fallbacks

In the next post, we'll explore Fine-tuning & Adaptation - LoRA, QLoRA, and how to customize LLMs for specific domains.


This article is part of the series "From Zero to AI Engineer" - Module 7: RAG (Retrieval-Augmented Generation)