This guide assumes you have a solid understanding of Python programming, basic concepts of Large Language Models (LLMs), and familiarity with web APIs. For the Go section, familiarity with Go syntax and concurrency patterns is beneficial. We'll dive into the specifics of LangChain, LangGraph, and FastAPI patterns relevant to building LLM-powered systems, and contrast them with a Go approach.


Introduction: Why LangChain, LangGraph, and FastAPI?

Developing applications powered by Large Language Models (LLMs) often involves more than just sending a prompt to an API. Real-world use cases require orchestrating multiple LLM calls, interacting with external tools and data sources, maintaining state, and handling complex decision-making logic.

  • LangChain (Python) provides the foundational building blocks for creating LLM applications. It offers abstractions for models, prompts, document loaders, vector stores, agents, and chains, simplifying the composition of complex LLM workflows.
  • LangGraph (Python) extends LangChain by introducing a way to define LLM workflows as graphs. This is crucial for building applications with cycles, conditional logic, and explicit state management – capabilities often required for sophisticated agents and multi-step reasoning processes.
  • FastAPI (Python) is a modern, high-performance Python web framework ideal for building the API layer for your LLM applications. Its asynchronous nature handles I/O-bound operations efficiently, and its built-in data validation ensures robust input/output handling.
  • Go (Golang) offers an alternative approach, particularly appealing for performance-critical applications or teams already invested in the Go ecosystem. While lacking a direct equivalent to LangChain’s breadth, Go provides strong concurrency, performance, and a growing set of libraries for interacting with LLMs and related technologies.

This guide primarily focuses on the Python stack (LangChain/LangGraph/FastAPI) due to its maturity in the LLM space but also provides insights into achieving similar goals with Go.


Prerequisites

Before diving in, ensure you have:

  1. Python: Version 3.8 or higher for LangChain/FastAPI (3.10+ recommended, since some examples use the str | None union syntax).
  2. Go: Version 1.18 or higher (for the Go section).
  3. Package Management: pip and venv (Python), Go Modules (Go).
  4. LLM API Access: An API key for an LLM provider (e.g., OpenAI, Anthropic, Google Gemini).
  5. Basic Understanding:
    • LLM concepts (prompts, tokens, embeddings).
    • REST APIs and HTTP methods.
    • Asynchronous programming (async/await in Python, Goroutines/Channels in Go).

LangChain: The Foundation for LLM Composition (Python)

LangChain offers the components and interfaces to build applications that leverage LLMs for reasoning and interaction with external systems in Python.

Core Concepts (LangChain)

  • Components: Standardized building blocks like Models, Prompts, Document Loaders, Text Splitters, Embeddings, Vector Stores, Retrievers, Output Parsers.
  • Composition (LCEL): The primary way to chain components together using the LangChain Expression Language (| syntax).
  • Use Cases: Abstractions for common applications like Question Answering (RAG), Agents, Chatbots, Summarization, Data Extraction.

Setting Up Your LangChain Environment

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
 
# Install core LangChain and common integrations
pip install langchain langchain-openai langchain-community faiss-cpu python-dotenv
 
# (Optional) Install LangSmith for tracing/debugging
# pip install langsmith
# export LANGCHAIN_TRACING_V2="true"
# export LANGCHAIN_API_KEY="..." # Your LangSmith API key
 
# Create a .env file for API keys
echo "OPENAI_API_KEY='your-openai-api-key'" > .env

Use environment variables (e.g., via python-dotenv) to manage API keys securely. Never hardcode them in your source code.

LangChain Expression Language (LCEL): The Core

LCEL is fundamental to modern LangChain development. It’s not just syntactic sugar; it provides significant benefits:

  • Composable: Easily chain components using the pipe operator (|). chain = prompt | model | parser
  • Streaming Support: Chains built with LCEL inherently support streaming using .stream() and .astream(). This allows you to process tokens as they are generated by the LLM.
  • Async Support: Native async operations via .ainvoke() and .astream(). Essential for performance in I/O-bound applications (like web servers).
  • Batching: Efficiently process multiple inputs using .batch() and .abatch().
  • Parallelism: Execute parts of a chain in parallel using RunnableParallel (often implicitly handled).
  • Debugging & Tracing: Integrates seamlessly with LangSmith for visualization and debugging.
  • Input/Output Schemas: Automatically infers input and output types for better validation and IDE support.

Always aim to build your workflows using LCEL wherever possible to leverage these benefits.
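
As a minimal sketch (assuming OPENAI_API_KEY is set and langchain-openai is installed), a basic LCEL chain and its invocation modes look like this:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_template("Explain {topic} in one sentence.")
llm = ChatOpenAI(model="gpt-4o-mini")
chain = prompt | llm | StrOutputParser()  # compose with the pipe operator

print(chain.invoke({"topic": "vector embeddings"}))            # single call
for chunk in chain.stream({"topic": "RAG"}):                   # token-by-token streaming
    print(chunk, end="", flush=True)
print(chain.batch([{"topic": "LCEL"}, {"topic": "agents"}]))   # batched inputs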

Key Functionality: Models, Prompts, Parsers

  • Models: LangChain provides interfaces for LLMs (text completion) and ChatModels (message-based interaction). Integrations exist for numerous providers (OpenAI, Anthropic, Cohere, Hugging Face, local models via Ollama, etc.).
  • Prompts: PromptTemplate and ChatPromptTemplate allow dynamic creation of prompts using input variables. Message templates (SystemMessage, HumanMessage, AIMessage) structure conversations for chat models. Techniques like few-shot prompting can be implemented here.
  • Output Parsers: Structure the raw LLM output. Examples include StrOutputParser (default string), JsonOutputParser (parse JSON), PydanticOutputParser (parse into Pydantic objects), XMLOutputParser, and custom parsers built by subclassing BaseOutputParser. Companion parsers such as OutputFixingParser and RetryOutputParser can recover from malformed outputs.
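
For example, here is a hedged sketch of structured extraction with PydanticOutputParser (the Person schema is purely illustrative):

from pydantic import BaseModel, Field
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

class Person(BaseModel):
    name: str = Field(description="The person's full name")
    age: int = Field(description="The person's age in years")

parser = PydanticOutputParser(pydantic_object=Person)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract the requested fields.\n{format_instructions}"),
    ("human", "{text}"),
]).partial(format_instructions=parser.get_format_instructions())

chain = prompt | ChatOpenAI(model="gpt-4o-mini", temperature=0) | parser
person = chain.invoke({"text": "Ada Lovelace was 36 when she died."})
print(person.name, person.age)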

Key Functionality: Retrieval-Augmented Generation (RAG)

RAG enhances LLM responses by grounding them in external knowledge. LangChain provides a comprehensive toolkit:

  1. Load: DocumentLoader interfaces fetch data (PDFs, web pages, Notion, databases, etc.). E.g., PyPDFLoader, WebBaseLoader, CSVLoader.
  2. Split: TextSplitter components break large documents into smaller, manageable chunks suitable for embedding models and context windows. E.g., RecursiveCharacterTextSplitter, CharacterTextSplitter, TokenTextSplitter. Chunking strategy (size, overlap) is critical for retrieval quality.
  3. Embed: Embeddings interfaces (e.g., OpenAIEmbeddings, HuggingFaceEmbeddings) convert text chunks into dense vector representations.
  4. Store: VectorStore interfaces store these embeddings and allow efficient similarity searches. E.g., FAISS, Chroma, Pinecone, Qdrant, Postgres + pgvector.
  5. Retrieve: Retriever interfaces fetch relevant chunks based on a query (typically using vector similarity search, but also keyword or hybrid methods). vectorstore.as_retriever() is common. Advanced techniques include multi-query retrieval, contextual compression, and parent document retrieval.
  6. Generate: An LCEL chain (often using create_stuff_documents_chain or custom prompts) takes the original query and the retrieved documents as context, feeding them to an LLM to generate the final answer.

Common RAG chain patterns:

  • Stuff: Simplest approach; concatenates all retrieved documents into the prompt context. Fails if documents exceed the context window. (create_stuff_documents_chain)
  • Map-Reduce: Processes each document individually (map step) and then combines the results (reduce step). Handles large numbers of documents.
  • Refine: Processes documents sequentially, refining the answer with each new document. Good for building detailed answers but involves more LLM calls.
  • Map-Rerank: Processes each document individually, ranks them based on relevance/confidence, and returns the best result. Faster than refine but might miss synthesized answers.

LangChain’s create_retrieval_chain simplifies setting up a common RAG workflow combining a retriever and a generation chain.
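
A minimal sketch of these helpers, assuming a retriever and llm have already been created (as elsewhere in this guide):

from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

qa_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the provided context:\n\n{context}"),
    ("human", "{input}"),
])
combine_docs_chain = create_stuff_documents_chain(llm, qa_prompt)   # "stuff" pattern
rag_chain = create_retrieval_chain(retriever, combine_docs_chain)

result = rag_chain.invoke({"input": "What is task decomposition?"})
print(result["answer"])        # generated answer
print(len(result["context"]))  # retrieved Documents used as context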

Key Functionality: Tool Use & Function Calling

LLMs become vastly more powerful when they can interact with the outside world. LangChain facilitates this:

  • Tool Abstraction: The Tool class wraps any function (e.g., database query, API call, web search, calculator) with a name and description. The LLM uses the description to understand when and how to use the tool.
  • Model-Specific Function Calling: Models like OpenAI’s GPT series have built-in capabilities to declare available functions/tools and return structured JSON indicating which tool to call with which arguments. LangChain standardizes this:
    • ChatModel.bind_tools([...]): Attaches Tool definitions (or Pydantic models for structured output) directly to the model. The model’s response will include tool_calls if it decides to use one.
    • ChatModel.with_structured_output(...): Forces the model to return output matching a specific Pydantic schema or function definition, useful for data extraction.
  • Parsing Tool Calls: LangChain provides output parsers (OpenAIToolsAgentOutputParser, JsonOutputToolsParser) to extract tool invocation requests from the LLM’s response.
# Example: Defining and binding a tool (OpenAI context)
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
 
@tool
def search_web(query: str) -> str:
  """Searches the web for the given query and returns top results."""
  print(f"--- MOCK WEB SEARCH: {query} ---")
  # Replace with actual search API call
  return f"Results for '{query}': ..."
 
llm = ChatOpenAI(model="gpt-4o-mini")
llm_with_tools = llm.bind_tools([search_web]) # Make the LLM aware of the tool
 
# Ask a question that might require the tool
ai_msg = llm_with_tools.invoke([HumanMessage("What's the weather in London?")])
 
# Inspect the response for tool calls
print(ai_msg.tool_calls)
# [{'name': 'search_web', 'args': {'query': 'weather in London'}, 'id': '...'}]

Key Functionality: Agents

Agents use an LLM as a “reasoning engine” to decide a sequence of actions (often involving tool calls) to accomplish a goal.

  • Core Loop: Prompt (includes goal, history, available tools) → LLM decides action/tool call → Parse action → Execute action (call tool) → Observe result → Update prompt → Repeat until goal achieved or limit reached.
  • Agent Types: LangChain offers pre-built agent types implementing different reasoning strategies:
    • ReAct (Reasoning and Acting): LLM explicitly verbalizes Thought → Action → Observation steps. Good for debuggability.
    • OpenAI Functions/Tools Agent: Leverages the model’s native ability to call functions/tools. Often more reliable and efficient.
    • Self-Ask with Search: Specifically designed for questions requiring factual lookups.
    • Conversational Agents: Maintain memory of the interaction.
  • AgentExecutor: The runtime environment that orchestrates the agent loop, manages state, calls tools, and handles errors/limits. create_openai_tools_agent and AgentExecutor are common building blocks.
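
A brief sketch of wiring these together, reusing the search_web tool defined earlier (prompt wording is illustrative):

from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

agent_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("human", "{input}"),
    MessagesPlaceholder("agent_scratchpad"),  # required slot for intermediate tool calls
])
llm = ChatOpenAI(model="gpt-4o-mini")
agent = create_openai_tools_agent(llm, [search_web], agent_prompt)
executor = AgentExecutor(agent=agent, tools=[search_web], verbose=True, max_iterations=5)

result = executor.invoke({"input": "What's the weather in London?"})
print(result["output"])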

Agents offer a powerful, automated way to solve complex tasks but can sometimes be hard to control or debug precisely. This motivates the explicit control offered by LangGraph for many stateful tasks.

Key Functionality: Memory

Memory allows chains and agents to retain information about previous interactions, enabling contextual conversations.

  • Types:
    • ConversationBufferMemory: Stores messages verbatim. Simple but can exceed context limits.
    • ConversationBufferWindowMemory: Keeps only the last K messages.
    • ConversationSummaryMemory: Uses an LLM to summarize the conversation progressively. Keeps context concise but adds latency/cost.
    • ConversationSummaryBufferMemory: Combines buffering recent messages with summarizing older ones.
    • VectorStoreRetrieverMemory: Stores interactions in a vector store and retrieves relevant past messages based on semantic similarity.
  • Integration: Memory objects are typically added to Chains (LLMChain, ConversationChain) or AgentExecutor instances. They automatically format the history into the prompt based on their specific strategy.
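
A minimal sketch using the classic ConversationBufferMemory with ConversationChain (newer code may prefer message-history runnables, but this illustrates the idea):

from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain
from langchain_openai import ChatOpenAI

memory = ConversationBufferMemory()
conversation = ConversationChain(llm=ChatOpenAI(model="gpt-4o-mini"), memory=memory)

conversation.predict(input="Hi, I'm Alice.")
conversation.predict(input="What's my name?")  # history is injected into the prompt automatically
print(memory.buffer)                           # inspect the stored conversation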

LangGraph: Adding State and Cycles to LLM Workflows (Python)

LangGraph builds upon LangChain’s components to create complex, stateful, and potentially cyclic workflows as graphs. It’s ideal when you need fine-grained control over the execution flow, explicit state management, and capabilities like reflection or human-in-the-loop interactions.

Why LangGraph? Moving Beyond Sequential Chains

Sequential chains lack easy ways to handle cycles, complex branching, explicit state, or human-in-the-loop scenarios. LangGraph provides: Explicit State, Controllable Cycles, Modularity, Debuggability.

Core Concepts (LangGraph)

  • StateGraph / Graph: The main builder. Requires defining a State Schema.
  • State Schema: The canonical data structure passed between nodes. Usually a TypedDict or Pydantic BaseModel. Crucially, nodes modify the state by returning dictionaries with keys matching the schema.
  • Nodes (add_node): Python functions or LangChain Runnables. Receive the entire current state, return a partial dictionary containing only the state updates.
  • Edges (add_edge, add_conditional_edges): Define transitions. Conditional edges are key for dynamic routing.
  • Entry/Finish Points (set_entry_point, set_finish_point, END): Control start and termination.

Setting Up LangGraph

pip install langgraph

Key Functionality: Explicit State Management

The state object is central. Use TypedDict or Pydantic. Annotated[..., operator.add] is useful for appending to lists (like history). Each node receives the current state and returns a dictionary specifying what to update.

from typing import TypedDict, List, Annotated
import operator
 
class AgentState(TypedDict):
    input: str
    chat_history: Annotated[List[str], operator.add] # Use operator.add to append
    # Intermediate steps, agent scratchpad, tool outputs etc.
    intermediate_steps: Annotated[list, operator.add]
    agent_outcome: str | None # Final outcome

  • TypedDict or Pydantic: Define the structure clearly.
  • Annotated[..., operator.add]: A common pattern for list fields (like chat history or intermediate steps) where nodes should append rather than overwrite. LangGraph knows to use operator.add to combine the existing state value with the returned value. You can define custom reducer functions too.

Key Functionality: Nodes as Computation Units

Nodes are the workhorses of the graph: they perform actions, operating on and updating the state. A node can be a simple Python function or a complex LangChain chain (LCEL Runnable).

# Node using a LangChain chain
def generate_response_node(state: AgentState):
    # Assume 'llm_chain' is a pre-configured LCEL chain
    # It might use state['input'] and state['chat_history']
    response = llm_chain.invoke({"input": state['input'], "history": state['chat_history']})
    # Return update dict for the state
    return {"agent_outcome": response, "chat_history": [f"AI: {response}"]} # Append AI response

Key Functionality: Conditional Edges & Routing

This is where LangGraph’s power for control flow shines. A function evaluates the current state and returns the name of the next node to execute.

def should_call_tool(state: AgentState) -> str:
    """Checks the latest AI message for tool calls."""
    # Logic to parse the last message in state['chat_history'] or state['agent_outcome']
    # for tool invocation requests (e.g., using OpenAI function calling parsing)
    last_message = state['chat_history'][-1] # Simplified access
    if "tool_call_request" in last_message: # Pseudocode check
        return "call_tool_node" # Route to the tool execution node
    else:
        return END # No tool call needed, finish.
 
# In graph definition:
# workflow.add_node("agent", agent_node) # Node that generates response/tool request
# workflow.add_node("call_tool", tool_node) # Node that executes the tool
 
workflow.add_conditional_edges(
    "agent", # Source node
    should_call_tool, # Decision function
    {
        "call_tool_node": "call_tool", # Map return value to next node name
        END: END
    }
)
# Need an edge from call_tool back to agent to process tool result
workflow.add_edge("call_tool", "agent")

Key Functionality: Cycles for Iteration & Reflection

Conditional edges easily create loops, essential for complex behaviors:

  • Reflection/Self-Correction (sketched in code after this list):
    1. generate: LLM produces an initial draft.
    2. grade: Another LLM call (or heuristic) evaluates the draft based on criteria (e.g., relevance, safety, factual consistency with retrieved docs). State updated with grade.
    3. decide_reflection: Conditional edge checks the grade. If good → END. If bad → reflect.
    4. reflect: LLM node takes the draft and critique, generates reflection notes or rewrite instructions. State updated.
    5. Edge back to generate (or a dedicated rewrite node) which uses the reflection notes.
  • Iterative Tool Use:
    1. plan: LLM decides which tool (if any) to use next based on goal and history. State updated with planned tool call.
    2. decide_tool: Conditional edge routes to specific tool node or generate_final_response if no tool needed.
    3. execute_tool_X: Node calls the specific tool. State updated with tool output.
    4. Edge back to plan to process the tool output and decide the next step.
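
A small, runnable sketch of the reflection loop using stub nodes (the node bodies are placeholders; real nodes would call LLM chains for drafting, grading, and critique):

from typing import TypedDict
from langgraph.graph import StateGraph, END

class ReflectionState(TypedDict):
    draft: str
    grade: str
    reflection: str

def generate_node(state: ReflectionState) -> dict:
    # Use the critique on the second pass; a real node would prompt an LLM here
    return {"draft": "revised draft" if state.get("reflection") else "initial draft"}

def grade_node(state: ReflectionState) -> dict:
    return {"grade": "good" if "revised" in state["draft"] else "bad"}

def reflect_node(state: ReflectionState) -> dict:
    return {"reflection": "be more specific"}

workflow = StateGraph(ReflectionState)
workflow.add_node("generate", generate_node)
workflow.add_node("grade", grade_node)
workflow.add_node("reflect", reflect_node)

workflow.set_entry_point("generate")
workflow.add_edge("generate", "grade")

def decide_reflection(state: ReflectionState) -> str:
    return END if state["grade"] == "good" else "reflect"

workflow.add_conditional_edges("grade", decide_reflection, {END: END, "reflect": "reflect"})
workflow.add_edge("reflect", "generate")  # the cycle: regenerate using the critique

app = workflow.compile()
print(app.invoke({"draft": "", "grade": "", "reflection": ""}))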

Key Functionality: Explicit Tool Calling in Graphs

Instead of relying solely on an AgentExecutor, you can model tool interaction explicitly in the graph:

  1. Agent Node: LLM decides if a tool is needed and which one, potentially using model.bind_tools. Updates state with the tool call request.
  2. Parsing Node (Optional but Recommended): Extracts the tool name and arguments from the LLM response. Updates state.
  3. Conditional Edge: Routes to the correct tool execution node based on the parsed tool name, or to a final response node if no tool was called.
  4. Tool Execution Nodes: Separate nodes for each tool (e.g., web_search_node, calculator_node). They execute the tool using arguments from the state. Update state with the tool’s output string or structured data. Use ToolNode for convenience.
  5. Response Generation Node: Takes tool outputs from the state and generates the final response for the user.
  6. Edges: Connect tool nodes back to the agent/planning node to process results, or to the final response node.

This provides granular control over tool execution, error handling per tool, and state updates. ToolNode is a LangGraph utility that simplifies creating nodes that execute LangChain Tool objects.
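
A hedged sketch of the ToolNode pattern, assuming a message-based state (a messages list of BaseMessage objects), an existing workflow with an "agent" node that may emit tool_calls, and the search_web tool from earlier:

from langgraph.prebuilt import ToolNode, tools_condition

# ToolNode reads tool_calls from the most recent AIMessage in state["messages"]
# and appends the resulting ToolMessages to the state.
tool_node = ToolNode([search_web])
workflow.add_node("tools", tool_node)

# tools_condition routes to "tools" when the last AI message requested a tool,
# otherwise to END.
workflow.add_conditional_edges("agent", tools_condition)
workflow.add_edge("tools", "agent")  # feed tool results back to the agent node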

Key Functionality: Managing Conversation History

The graph’s state is the natural place to store history. Use Annotated[List[BaseMessage], operator.add] (or similar for strings) in the state schema. Nodes responsible for user input or agent responses append new messages to this list. Subsequent nodes receive the updated history via the state object.

# State definition
class ChatState(TypedDict):
    # Use BaseMessage for richer chat structure
    messages: Annotated[List[BaseMessage], operator.add]
 
# Node appending user message
def add_user_input(state: ChatState):
    # Assume user_input comes from API request or similar
    user_input = "Hello there!" # Example
    return {"messages": [HumanMessage(content=user_input)]}
 
# Node appending AI message
def generate_ai_response(state: ChatState):
    # llm_chain uses state['messages']
    ai_response = llm_chain.invoke({"messages": state['messages']})
    return {"messages": [AIMessage(content=ai_response.content)]}

Key Functionality: Persistence & Checkpointing

For long-running graphs or fault tolerance, LangGraph needs to save its state. Checkpointing allows you to pause and resume execution.

  • Checkpointer: An object responsible for saving/loading the graph’s state. LangGraph provides backends:
    • MemorySaver: In-memory (for testing).
    • SqliteSaver: Saves state to a SQLite database.
    • PostgresSaver: Saves state to PostgreSQL (provided by the separate langgraph-checkpoint-postgres package).
  • Usage: Pass the checkpointer when compiling the graph. Use a unique thread_id (or conversation ID) for each independent run you want to be able to resume.
from langgraph.checkpoint.sqlite import SqliteSaver
 
memory = SqliteSaver.from_conn_string(":memory:") # In-memory example
# memory = SqliteSaver.from_conn_string("checkpoints.sqlite") # File example
 
# Compile with checkpointer
app = workflow.compile(checkpointer=memory)
 
# Invoke with a unique config for the conversation/thread
config = {"configurable": {"thread_id": "user-123-conv-456"}}
app.invoke({"messages": [HumanMessage(content="Hi")]}, config=config)
 
# Later, resume the same conversation
app.invoke({"messages": [HumanMessage(content="How are you?")]}, config=config)

Key Functionality: Human-in-the-Loop

Pause execution and wait for external input.

  • Interrupts: Use interrupt_before=["node_name"] or interrupt_after=["node_name"] during compilation to force the graph to pause at specific points.
  • Resuming: When interrupted, invoke/stream will return. You can then inspect the state, potentially get human feedback, update the state if needed, and then call invoke/stream again with the same config (thread_id) to resume execution from where it paused.

This is powerful for scenarios requiring human review, approval, or correction steps within an automated workflow.
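
A sketch of an approval gate before tool execution, reusing the checkpointer and graph pattern from the previous sections (node and thread names are illustrative):

# Pause before the node that executes tools, so a human can review the pending call
app = workflow.compile(checkpointer=memory, interrupt_before=["call_tool"])

config = {"configurable": {"thread_id": "review-demo-1"}}
app.invoke({"messages": [HumanMessage(content="Please delete all records")]}, config=config)

# Execution is now paused. Inspect the checkpointed state (and update it if needed).
snapshot = app.get_state(config)
print(snapshot.next)  # e.g. ('call_tool',)

# Resume from the checkpoint: pass None as input with the same thread config
app.invoke(None, config=config)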

LangGraph vs. LangChain Agents

| Feature | LangChain Agents (AgentExecutor) | LangGraph |
| --- | --- | --- |
| Control Flow | More implicit (driven by LLM reasoning & agent type) | Explicit (defined by nodes and edges) |
| State | Managed internally (memory, scratchpad) | Explicitly defined schema, passed between nodes |
| Cycles | Possible but harder to control/visualize | Naturally supported via graph edges |
| Debugging | Can be opaque (relies on LLM thoughts/LangSmith) | More transparent (visualize graph, state steps) |
| Flexibility | High-level, quicker for standard patterns | Maximum flexibility for custom/complex logic |
| Complexity | Lower boilerplate for simple cases | More boilerplate code for defining graph |
| Use Case | Simpler tool use, chatbots, quick prototypes | Multi-step reasoning, reflection, human-in-loop, stateful processes, controlled tool use |

LangGraph often uses LangChain components (models, tools, chains) within its nodes. They are complementary, with LangGraph providing a more structured execution framework.


Example Project: Simple RAG Chatbot with LangGraph (Python)

Let’s sketch out a Retrieval-Augmented Generation (RAG) chatbot using LangGraph. This chatbot will answer questions based on provided documents, potentially refining its answer if needed.

Goal

Create a chatbot that:

  1. Takes a user question.
  2. Retrieves relevant documents from a vector store.
  3. Generates an initial answer based on the question and retrieved documents.
  4. (Optionally) Grades the answer for relevance/hallucination against the documents.
  5. (Optionally) If the answer is poor, re-generates it.
  6. Returns the final answer.

Components

  • LangChain: ChatOpenAI, ChatPromptTemplate, StrOutputParser, Document Loaders (PyPDFLoader, WebBaseLoader), RecursiveCharacterTextSplitter, OpenAIEmbeddings, FAISS (or another vector store), create_stuff_documents_chain.
  • LangGraph: StateGraph to manage the flow.
  • Python: Basic file handling, environment variables.

Implementation Sketch

from typing import TypedDict, List, Annotated
import operator
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import WebBaseLoader # Example loader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.combine_documents import create_stuff_documents_chain
from langgraph.graph import StateGraph, END
 
# --- Setup (Load Data, Embed, Store) ---
# This part is usually done once offline or as a setup step
 
load_dotenv()
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0) # Use a capable model
embeddings = OpenAIEmbeddings()
 
# Example: Load data from a web page
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()
 
# Split documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
 
# Create vector store
try:
    vectorstore = FAISS.from_documents(documents=splits, embedding=embeddings)
    retriever = vectorstore.as_retriever()
    print("Vector store created.")
except Exception as e:
    print(f"Error creating vector store: {e}")
    # Handle error appropriately, maybe exit or use a fallback
    retriever = None # Ensure retriever is defined
 
# --- LangGraph Definition ---
 
# 1. Define State
class RagState(TypedDict):
    question: str
    documents: List[str] # Could be List[Document] for more structure
    generation: str
    # Optional: Add fields for grading, rewrite counts etc.
 
# 2. Define Nodes
 
def retrieve_docs(state: RagState) -> dict:
    """Retrieves documents based on the question."""
    print("---NODE: RETRIEVE---")
    if not retriever:
        print("Retriever not available.")
        return {"documents": []}
    question = state['question']
    documents = retriever.invoke(question)
    print(f"Retrieved {len(documents)} documents.")
    # Store document content as strings for simplicity here
    doc_contents = [doc.page_content for doc in documents]
    return {"documents": doc_contents}
 
def generate_answer(state: RagState) -> dict:
    """Generates an answer using the LLM."""
    print("---NODE: GENERATE---")
    question = state['question']
    documents = state['documents'] # Use the string content
 
    # Create simple prompt and chain for generation
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are an assistant for question-answering tasks. Use the following retrieved context to answer the question. If you don't know, say that.\n\nContext:\n{context}"),
        ("human", "Question: {question}")
    ])
 
    # Format documents into context string
    context_str = "\n\n".join(documents)
 
    rag_chain = prompt | llm | StrOutputParser()
 
    generation = rag_chain.invoke({"context": context_str, "question": question})
    print("Generated answer.")
    return {"generation": generation}
 
# Optional Nodes: grade_answer, rewrite_answer etc.
 
# 3. Build the Graph
workflow = StateGraph(RagState)
 
workflow.add_node("retrieve", retrieve_docs)
workflow.add_node("generate", generate_answer)
 
# 4. Define Edges
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "generate")
workflow.add_edge("generate", END) # Simple flow: retrieve -> generate -> end
 
# Compile the graph
app = workflow.compile()
 
# --- Run the RAG Chatbot ---
if retriever: # Only run if retriever was successfully created
    user_question = "What are the main challenges of LLM memory?"
    inputs = {"question": user_question}
    final_answer = None
    for output in app.stream(inputs):
        # output contains the state updates at each step
        for key, value in output.items():
            print(f"Finished node '{key}':")
            # print(value) # Print the full state update for debugging
            if key == "generate":
                final_answer = value.get("generation")
    print("\n---FINAL ANSWER---")
    print(final_answer or "No answer generated.")
else:
    print("Cannot run RAG: Retriever initialization failed.")
 

This is a simplified RAG. A production system might include:

  • Better error handling.
  • Nodes for grading document relevance before generation.
  • Nodes for grading the generated answer against the documents (hallucination check).
  • Conditional edges to loop back for rewriting if the grade is low.
  • More sophisticated state management (e.g., storing Document objects).

FastAPI: Building Performant APIs for LLM Apps (Python)

FastAPI provides the web interface for your Python-based LLM application.

Why FastAPI?

  • Performance: Asynchronous, built on Starlette/Pydantic.
  • Async Support: Native async/await for I/O efficiency.
  • Data Validation: Pydantic models for robust request/response handling.
  • Automatic Docs: Interactive Swagger UI / ReDoc.
  • Developer Experience: Easy, great editor support, dependency injection.

Recommended Project Structure

A good structure promotes maintainability and scalability:

my_llm_app/
├── app/
│   ├── __init__.py
│   ├── main.py           # FastAPI app instance, basic middleware
│   ├── api/              # API Routers/Endpoints
│   │   ├── __init__.py
│   │   └── v1/
│   │       ├── __init__.py
│   │       ├── endpoints/
│   │       │   ├── __init__.py
│   │       │   └── chat.py     # Example endpoint for chat
│   │       └── schemas.py      # Pydantic models for this version
│   ├── core/             # Core logic, configuration
│   │   ├── __init__.py
│   │   └── config.py     # Settings management (e.g., API keys)
│   ├── services/         # Business logic, LangChain/Graph integration
│   │   ├── __init__.py
│   │   └── rag_service.py # Service interacting with the RAG graph
│   ├── models/           # Data models (if different from API schemas)
│   │   └── __init__.py
│   └── utils/            # Utility functions
│       └── __init__.py
├── tests/                # Unit and integration tests
│   └── ...
├── .env                  # Environment variables (add to .gitignore)
├── .gitignore
├── requirements.txt      # Project dependencies
└── README.md

Key Implementation Patterns

Asynchronous Operations

Use async def for your endpoint functions and any I/O-bound operations within them (like calling your LangGraph app.ainvoke or app.astream).

# Example in app/api/v1/endpoints/chat.py
from fastapi import APIRouter, HTTPException
from app.api.v1 import schemas # Pydantic models
from app.services.rag_service import get_rag_app # Assume service provides compiled graph
 
router = APIRouter()
rag_app = get_rag_app() # Get the compiled LangGraph app
 
@router.post("/invoke", response_model=schemas.ChatResponse)
async def invoke_rag_chat(request: schemas.ChatRequest):
    """Runs the RAG graph synchronously (for simplicity here)"""
    try:
        # Use ainvoke for true async operation with the graph
        final_state = await rag_app.ainvoke({"question": request.question})
        answer = final_state.get("generation", "Error: No answer generated.")
        return schemas.ChatResponse(answer=answer)
    except Exception as e:
        # Log the exception e
        raise HTTPException(status_code=500, detail="Internal Server Error")
 
# Add Streaming endpoint later

Dependency Injection

FastAPI’s Depends system is great for managing resources like database connections or pre-initialized LangChain/Graph components.

# Example in app/services/rag_service.py
from functools import lru_cache
from langgraph.graph import StateGraph # ... other imports
 
@lru_cache() # Cache the compiled graph instance
def get_rag_app():
    # ... (Build and compile your LangGraph app here) ...
    workflow = StateGraph(...)
    # ... add nodes/edges ...
    app = workflow.compile()
    print("Compiled RAG App") # Verify it's created once
    return app
 
# Example in app/api/v1/endpoints/chat.py
from fastapi import APIRouter, Depends
# ... other imports
 
router = APIRouter()
 
@router.post("/invoke_di", response_model=schemas.ChatResponse)
async def invoke_rag_chat_di(
    request: schemas.ChatRequest,
    rag_app = Depends(get_rag_app) # Inject the compiled app
):
    # ... (rest of the endpoint logic using rag_app.ainvoke) ...
    pass

Pydantic Models

Define clear input (Request) and output (Response) schemas using Pydantic for validation and documentation.

# Example in app/api/v1/schemas.py
from pydantic import BaseModel
 
class ChatRequest(BaseModel):
    question: str
    user_id: str | None = None # Optional field
 
class ChatResponse(BaseModel):
    answer: str
    # Add other relevant fields like sources, conversation_id, etc.

Routers

Organize endpoints into logical groups using APIRouter. Include these routers in your main FastAPI app instance.

# Example in app/main.py
from fastapi import FastAPI
from app.api.v1.endpoints import chat
 
app = FastAPI(title="LLM RAG API")
 
app.include_router(chat.router, prefix="/api/v1/chat", tags=["chat"])
 
@app.get("/")
def read_root():
    return {"message": "Welcome to the LLM RAG API"}

Configuration Management

Use a library like pydantic-settings (BaseSettings) or environment variables (python-dotenv) to manage configuration (API keys, model names, etc.) cleanly.

# Example in app/core/config.py
from pydantic_settings import BaseSettings, SettingsConfigDict
 
class Settings(BaseSettings):
    openai_api_key: str
    rag_model_name: str = "gpt-4o-mini"
    vector_store_path: str = "vectorstore_db"
 
    model_config = SettingsConfigDict(env_file='.env') # Load from .env
 
settings = Settings()
 
# Access keys like: settings.openai_api_key

Setting Up FastAPI

pip install fastapi "uvicorn[standard]" pydantic pydantic-settings

To run the server (assuming your main app instance is in app/main.py):

uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

The --reload flag automatically restarts the server when code changes, useful for development.


Integrating LangChain/LangGraph with FastAPI (Python)

The key is to treat your LangChain/LangGraph application (app.compile() result) as a resource that FastAPI endpoints can access, typically via a service layer or dependency injection.

Structuring the Integration

  1. Initialization: Compile your LangGraph app once when the FastAPI application starts. Using Depends with lru_cache is a common pattern. Avoid recompiling the graph on every request.
  2. Service Layer: Create functions (e.g., in app/services/) that encapsulate the logic of interacting with the compiled graph (app.invoke, app.ainvoke, app.stream, app.astream).
  3. Endpoints: FastAPI endpoint functions call the service layer functions, handle request/response mapping using Pydantic models, and manage HTTP exceptions.
  4. Async All the Way: Use async def for endpoints and service functions that call ainvoke or astream to avoid blocking the server’s event loop.

Example Endpoint (Putting it together)

# In app/services/rag_service.py
# ... (imports and get_rag_app function as before) ...
 
async def run_rag_query(question: str) -> str:
    """Runs the RAG graph asynchronously and returns the answer."""
    rag_app = get_rag_app() # Get the cached instance
    try:
        # Use ainvoke for async execution
        final_state = await rag_app.ainvoke({"question": question})
        answer = final_state.get("generation", "Error: Could not generate answer.")
        return answer
    except Exception as e:
        print(f"Error invoking RAG graph: {e}") # Log the error
        # Re-raise or return a specific error message
        return "Error: An internal error occurred during processing."
 
# In app/api/v1/endpoints/chat.py
from fastapi import APIRouter, HTTPException, Depends
from app.api.v1 import schemas
from app.services import rag_service
 
router = APIRouter()
 
@router.post("/ask", response_model=schemas.ChatResponse)
async def ask_question(request: schemas.ChatRequest):
    """Endpoint to ask a question to the RAG system."""
    try:
        answer = await rag_service.run_rag_query(request.question)
        if "Error:" in answer: # Basic error check from service
             raise HTTPException(status_code=500, detail=answer)
        return schemas.ChatResponse(answer=answer)
    except Exception as e:
        # Log exception e
        raise HTTPException(status_code=500, detail="An unexpected server error occurred.")
 
# In app/main.py
# ... (include the router as shown before) ...

Exploring Alternatives: Building LLM Applications in Go

While Python dominates the LLM framework space, Go offers compelling advantages for certain use cases, particularly around performance, concurrency, and static typing.

The Go Ecosystem for LLMs

The Go ecosystem is less mature than Python’s for high-level LLM abstractions like LangChain, but core components are available:

  • LLM Client Libraries:
    • go-openai: Popular community-maintained client for OpenAI APIs (https://github.com/sashabaranov/go-openai).
    • Clients for other providers (Anthropic, Cohere, Google) often exist but might be less comprehensive or official.
  • Vector Databases: Many vector DBs offer Go clients (e.g., Qdrant, Weaviate, Pinecone). You’ll typically interact directly with their APIs.
  • Embeddings: You might call embedding APIs directly using HTTP clients or use provider-specific libraries if available (like go-openai for OpenAI embeddings).
  • Frameworks/Orchestration: Less common. You typically build the orchestration logic (like RAG steps or agent loops) manually using Go’s standard library features (goroutines, channels, HTTP client). There isn’t a direct Go equivalent to LangChain/LangGraph providing the same level of abstraction yet.

Conceptual Approach: RAG Chatbot in Go

Implementing RAG in Go involves manually wiring together the steps:

  1. Setup: Initialize API clients (LLM provider, vector DB) using API keys (read from env vars).
  2. Load/Split/Embed/Store (Often Offline): Write separate Go scripts or functions to:
    • Read documents (e.g., from files using os, bufio).
    • Split text (implement chunking logic manually or use basic string splitting).
    • Generate embeddings (call embedding API endpoint).
    • Store vectors (use the vector DB’s Go client).
  3. Runtime Retrieval:
    • Receive user query (e.g., via an HTTP server using net/http).
    • Generate embedding for the query.
    • Query the vector DB using its Go client to find relevant document chunks.
  4. Runtime Generation:
    • Construct the prompt string, including the user query and the retrieved document content.
    • Use the LLM client library (go-openai) to call the chat completion endpoint with the constructed prompt.
    • Handle the response and potential errors.
  5. API Server (Optional): Wrap the logic in an HTTP server (net/http, Gin, Echo) to expose it as an API.

Example Project: Simple RAG Chatbot Sketch (Go)

This is a highly simplified sketch focusing on the core LLM interaction, omitting proper vector storage/retrieval and error handling for brevity.

package main
 
import (
	"context"
	"fmt"
	"log"
	"os"
	"strings"
 
	openai "github.com/sashabaranov/go-openai" // Needs: go get github.com/sashabaranov/go-openai
	"github.com/joho/godotenv"                  // Needs: go get github.com/joho/godotenv
)
 
// --- Configuration ---
func loadAPIKey() string {
	err := godotenv.Load() // Load .env file
	if err != nil {
		log.Println("No .env file found, reading from environment")
	}
	apiKey := os.Getenv("OPENAI_API_KEY")
	if apiKey == "" {
		log.Fatal("OPENAI_API_KEY environment variable not set")
	}
	return apiKey
}
 
// --- Simple Retrieval Placeholder ---
// In a real app, this would query a vector database
func retrieveContext(query string) (string, error) {
	log.Printf("Retrieving context for query: %s\n", query)
	// --- Placeholder ---
	// Simulate retrieving relevant info based on the query.
	// Replace this with actual vector DB query logic.
	hardcodedDocs := map[string]string{
		"llm":      "Large Language Models (LLMs) are deep learning models trained on massive text datasets.",
		"langchain": "LangChain is a framework for developing applications powered by language models.",
		"go":       "Go, also known as Golang, is a statically typed, compiled programming language designed at Google.",
	}
	var relevantContext []string
	for keyword, doc := range hardcodedDocs {
		if strings.Contains(strings.ToLower(query), keyword) {
			relevantContext = append(relevantContext, doc)
		}
	}
	if len(relevantContext) == 0 {
		return "No specific context found for this query.", nil // Provide default context?
	}
	return strings.Join(relevantContext, "\n"), nil
	// --- End Placeholder ---
}
 
// --- Generation using OpenAI ---
func generateAnswer(client *openai.Client, query string, contextStr string) (string, error) {
	log.Println("Generating answer...")
	systemPrompt := fmt.Sprintf(`You are a helpful assistant. Answer the user's query based ONLY on the provided context. If the context doesn't contain the answer, say "I don't have information about that in the provided context."
 
Context:
---
%s
---`, contextStr)
 
	resp, err := client.CreateChatCompletion(
		context.Background(),
		openai.ChatCompletionRequest{
			Model: openai.GPT3Dot5Turbo, // Or GPT4oMini etc.
			Messages: []openai.ChatCompletionMessage{
				{
					Role:    openai.ChatMessageRoleSystem,
					Content: systemPrompt,
				},
				{
					Role:    openai.ChatMessageRoleUser,
					Content: query,
				},
			},
			MaxTokens:   150,
			Temperature: 0.2, // Lower temperature for factual RAG
		},
	)
 
	if err != nil {
		return "", fmt.Errorf("chat completion failed: %w", err)
	}
 
	if len(resp.Choices) > 0 {
		return resp.Choices[0].Message.Content, nil
	}
 
	return "", fmt.Errorf("no response choices received")
}
 
func main() {
	apiKey := loadAPIKey()
	client := openai.NewClient(apiKey)
 
	userQuery := "What is Go?" // Example query
 
	// 1. Retrieve Context (Simplified)
	retrievedContext, err := retrieveContext(userQuery)
	if err != nil {
		log.Fatalf("Failed to retrieve context: %v", err)
	}
	log.Printf("Retrieved Context:\n---\n%s\n---\n", retrievedContext)
 
 
	// 2. Generate Answer
	answer, err := generateAnswer(client, userQuery, retrievedContext)
	if err != nil {
		log.Fatalf("Failed to generate answer: %v", err)
	}
 
	fmt.Printf("\nQuery: %s\nAnswer: %s\n", userQuery, answer)
}
 

This Go example is highly simplified. A production Go RAG system requires significant effort in building robust document processing pipelines, managing vector DB interactions, creating effective prompting strategies, handling concurrency for API requests, and implementing error handling and logging, much of which LangChain abstracts away in Python.

Go vs. Python/LangChain for LLM Apps

  • Go:
    • Pros: Excellent performance, strong concurrency model (goroutines), static typing, simpler deployment (single binary). Good choice for performance-critical APIs or services integrating LLM calls.
    • Cons: Less mature LLM ecosystem, requires more manual implementation of orchestration logic (RAG steps, agent loops), smaller community specifically focused on LLM frameworks.
  • Python (with LangChain/LangGraph):
    • Pros: Rich ecosystem, high-level abstractions (LangChain) drastically speed up development, large community, rapid prototyping. Ideal for complex agentic workflows, research, and when developer velocity is prioritized.
    • Cons: Can be slower (GIL, interpretation overhead), potentially more complex dependency management, deployment might require more setup (WSGI/ASGI servers).

Choice depends on: Team expertise, performance requirements, need for complex agentic behavior vs. simpler integration, development speed priorities.


Advanced Considerations (General)

These apply whether using Python or Go.

Streaming Responses

Use server-sent events (SSE) or WebSockets.

  • Python/FastAPI: StreamingResponse. LangChain/Graph’s .astream() integrates well.
  • Go: Use net/http Flusher interface for SSE, or libraries like gorilla/websocket. Call LLM streaming endpoints and forward chunks.
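
A sketch of an SSE endpoint in FastAPI, reusing the rag_app and schemas from the earlier examples (the chunk handling assumes the graph's state updates include a generation field):

from fastapi import APIRouter
from fastapi.responses import StreamingResponse

router = APIRouter()

@router.post("/stream")
async def stream_rag(request: schemas.ChatRequest):
    async def event_stream():
        # astream yields {node_name: state_update} dicts as each node finishes
        async for chunk in rag_app.astream({"question": request.question}):
            for node_name, update in chunk.items():
                if "generation" in update:
                    yield f"data: {update['generation']}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")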

Handling Long-Running Processes

For tasks exceeding HTTP timeout limits:

  • Python: FastAPI BackgroundTasks, Celery, RQ, Arq.
  • Go: Launch goroutines, use worker pools, potentially integrate with queues like NATS or RabbitMQ.
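
For example, a minimal FastAPI BackgroundTasks sketch (a real system would persist the result under a job ID the client can poll; run_rag_query is the service function from earlier):

from fastapi import BackgroundTasks

@router.post("/ask_async", status_code=202)
async def ask_async(request: schemas.ChatRequest, background_tasks: BackgroundTasks):
    # Schedule the (async) RAG call to run after the 202 response is sent
    background_tasks.add_task(rag_service.run_rag_query, request.question)
    return {"status": "accepted"}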

Observability (LangSmith & Alternatives)

  • LangSmith (Python): Excellent for tracing LangChain/Graph execution.
  • General: OpenTelemetry is a cross-language standard. Instrument your code (Python/Go) to send traces and logs to backends like Jaeger, Datadog, Honeycomb. Manually log key steps in Go applications.

Security

  • API Key Management: Use environment variables, secrets managers (Vault, cloud provider secrets).
  • Input Validation: Crucial in both languages. Pydantic in Python, struct tags or validation libraries (e.g., validator) in Go. Sanitize inputs.
  • Output Parsing/Sanitization: Never trust LLM output implicitly, especially if used for downstream actions (DB queries, code execution). Validate and sanitize.
  • Rate Limiting: Implement on API gateways or within the application (libraries available for both FastAPI and Go frameworks).
  • Authentication/Authorization: Protect API endpoints (OAuth2, API Keys, JWT).

Troubleshooting Common Issues

  • Dependency Conflicts (Python): Use virtual environments, pin dependencies (pip freeze > requirements.txt).
  • API Key Errors: Check environment variables, provider quotas/access.
  • LangGraph State Issues (Python): Ensure nodes return correct update dictionaries matching the schema. Use LangSmith/logging. Check Annotated usage.
  • Async/Concurrency Errors: Correct await (Python), proper channel usage/mutexes (Go). Watch for race conditions.
  • Prompt Engineering Issues: Iterate on prompts, check context length limits, temperature settings.
  • RAG Retrieval Quality: Improve chunking, embedding models, retriever settings (k value, thresholds), consider hybrid search.
  • Graph Structure Errors (Python): Visualize the compiled graph with app.get_graph().print_ascii(). Ensure conditional edges cover all cases.
  • Go Build/Dependency Issues: Check go.mod, ensure correct imports and module paths.

Resources and Further Learning