LangGraph Best Practices: A Comprehensive Developer Guide
I’ve spent a good amount of time building and reviewing LangGraph-powered systems, both single-agent workflows and fairly complex multi-actor setups. Below is a practical, opinionated playbook of LangGraph best practices that you can lift into your own projects. It focuses on what matters in production: clear state, controllable flow, durable memory, predictable streaming, strong error boundaries, and operational visibility.
1) Core Architecture & State Design
1.1 Keep state boring—and typed
Your state object is the backbone of the graph. Keep it minimal, explicit, and typed. LangGraph supports TypedDict, Pydantic, and dataclasses; choose one and stick to it across the codebase for consistency. Use reducer helpers (e.g., add_messages) only where you truly need accumulation. Don’t dump transient values into state; pass them through function scope.
Example: Minimal typed state with a reducer for accumulating messages
from typing import Annotated, TypedDict, Any, Optional
from langgraph.graph.message import add_messages
class AppState(TypedDict, total=False):
    messages: Annotated[list, add_messages]
    current_step: str
    result: dict[str, Any]
    error_count: int
    max_steps: int
    last_error: Optional[dict[str, Any]]
1.2 Immutability mindset in node functions
Treat each node like a pure function: return a partial state update rather than mutating inputs. It makes testing easier and keeps edge routing predictable.
Example: Pure node returning partial updates
def classify_intent_node(state: AppState) -> dict:
    # add_messages coerces dict inputs into message objects, so read .content
    user_msg = state["messages"][-1].content
    intent = "search" if "find" in user_msg.lower() else "chat"
    return {"current_step": "classified", "result": {"intent": intent}}
1.3 Validation at the boundaries
Validate inbound and outbound state at each node boundary; simple schema checks and guards avoid downstream “mystery errors.”
Example: Lightweight state validation
from pydantic import BaseModel, ValidationError

class ResultModel(BaseModel):
    intent: str

def validate_result_node(state: AppState) -> dict:
    try:
        ResultModel(**state.get("result", {}))
        return {"current_step": "validated"}
    except ValidationError as e:
        return {"current_step": "error", "last_error": {"type": "validation", "detail": str(e)}}
2) Graph Flow & Edges
2.1 Prefer simple edges; add conditionals only where behavior branches
Sequence simple edges for linear steps; use conditional edges when the state genuinely branches.
Example: Conditional routing function
from langgraph.graph import StateGraph, START

builder = StateGraph(AppState)
builder.add_node("classify", classify_intent_node)
builder.add_node("validate", validate_result_node)

# Entry point and simple edge: START -> classify -> validate
builder.add_edge(START, "classify")
builder.add_edge("classify", "validate")

# Conditional: validate -> (error_handler | proceed)
def route_after_validate(state: AppState) -> str:
    if state.get("current_step") == "error":
        return "error_handler"
    return "proceed"

builder.add_node("error_handler", lambda s: {"error_count": s.get("error_count", 0) + 1})
builder.add_node("proceed", lambda s: {"current_step": "done"})
builder.add_conditional_edges("validate", route_after_validate, {
    "error_handler": "error_handler",
    "proceed": "proceed",
})
2.2 Tame cycles with guardrails
Cycles are normal in agentic systems (retry, clarify, tool-call loops). Add hard stops: a max_steps
counter; exponential backoff on repeated failures; explicit exit conditions for “no progress.”
Example: Bounding a cycle with a step limit
def should_continue(state: AppState) -> str:
    steps = state.get("error_count", 0)
    if steps >= state.get("max_steps", 3):
        return "halt"
    return "retry"

builder.add_node("halt", lambda s: {"current_step": "halted"})
builder.add_conditional_edges("error_handler", should_continue, {
    "halt": "halt",
    "retry": "classify",  # loop back to the top of the cycle
})
2.3 Multi-agent patterns: supervisor → specialists
Use a small “supervisor” node to classify the request and delegate to specialist agents; aggregate results centrally.
Example: Supervisor routing to specialist agents
SPECIALISTS = {
    "search": "search_agent",
    "chat": "chat_agent",
}

builder.add_node("supervisor", classify_intent_node)
for name in SPECIALISTS.values():
    builder.add_node(name, lambda s, name=name: {"result": {name: f"handled {s['result']['intent']}"}})

def route_to_specialist(state: AppState) -> str:
    intent = state["result"]["intent"]
    return SPECIALISTS.get(intent, "chat_agent")

builder.add_conditional_edges("supervisor", route_to_specialist, {
    "search_agent": "search_agent",
    "chat_agent": "chat_agent",
})
3) Memory, Persistence & Threads
3.1 Use a production checkpointer (Postgres)
For anything beyond demos, plug in a Postgres-backed checkpointer so you can pause/resume, inspect state, and survive process restarts.
Example: Postgres checkpointer
from langgraph.checkpoint.postgres import PostgresSaver
from psycopg_pool import ConnectionPool

DB_URI = "postgresql://user:pass@host:5432/langgraph?sslmode=require"

# Hand the pool itself to the saver so connections aren't returned to the pool
# while the compiled graph still needs them; autocommit lets setup() create tables.
pool = ConnectionPool(conninfo=DB_URI, max_size=10, kwargs={"autocommit": True})
saver = PostgresSaver(pool)
saver.setup()  # creates the checkpoint tables if they don't exist
compiled = builder.compile(checkpointer=saver)
3.2 Treat thread_id as a first-class key
Every invocation should carry a meaningful thread_id so checkpoints and HITL pauses attach to the right conversation or workflow instance.
Example: Invoke with thread and namespace
config = {
    "configurable": {
        "thread_id": f"tenant-{tenant_id}:user-{user_id}:session-{session_id}",
        "checkpoint_ns": f"tenant-{tenant_id}",
    }
}

final_state = compiled.invoke({
    "messages": [{"role": "user", "content": "Find hotels in Berlin"}],
    "max_steps": 3,
}, config=config)
3.3 Namespacing for long-term memories
Implement cross-thread or durable user preferences with a clean namespace strategy.
Example: Storing a user preference
from langgraph.store.memory import InMemoryStore

# InMemoryStore is fine for demos; use a persistent store for truly durable prefs
store = InMemoryStore()
user_ns = (f"tenant-{tenant_id}", f"user-{user_id}", "prefs")
store.put(user_ns, "language", {"preferred": "concise", "level": "advanced"})
prefs = store.get(user_ns, "language")
4) Streaming & Performance
4.1 Choose the right stream mode for your UI
LangGraph offers multiple streaming modes: messages (token-level UX), updates (state deltas), values (full state snapshots each step), and custom when you want to push tailored payloads to the client.
Example: Efficient streaming of state updates
for delta in compiled.stream({"messages": messages}, stream_mode="updates", config=config):
    # Send only changes to the frontend (bandwidth-friendly)
    handle_delta(delta)
4.2 Parallelize independent work with the Send API
When steps are independent (e.g., run N tool calls), use the Send API to dispatch work in parallel and rejoin.
Example: Fan-out with Send and aggregate
from langgraph.types import Send

# Wire this via add_conditional_edges: returning a list of Send objects
# dispatches N parallel "worker" invocations (the map step).
def fanout_node(state: AppState):
    tasks = state.get("pending", [])
    return [Send("worker", {"task": t}) for t in tasks]

# Each worker receives the payload passed to Send, not the full graph state.
def worker_node(payload: dict):
    t = payload["task"]
    return {"result": {t: f"done:{t}"}}

def aggregate_node(state: AppState):
    # Parallel writes to "result" need a merging reducer on that key; by the
    # time this node runs, the merged dict is ready to rank or filter.
    return {"current_step": "aggregated", "result": state.get("result", {})}
4.3 Control your context
Monitor prompt + tool I/O size. Compress history (summaries, selective retention) and switch to cheaper/faster models for non-critical steps. Keep prompts as data for A/B testing without code changes.
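Example: Compacting history before a model call. A minimal sketch assuming langchain_core’s trim_messages helper; the message budget and len-based counter are illustrative placeholders to tune per model.
from langchain_core.messages import trim_messages

def compact_context_node(state: AppState) -> dict:
    # Build a bounded view of the conversation for the next model call.
    # token_counter=len counts messages, not tokens; swap in a model-aware
    # counter (or an LLM-based summarizer) for production use.
    trimmed = trim_messages(
        state["messages"],
        strategy="last",        # keep the most recent turns
        max_tokens=20,          # with token_counter=len this is a message count
        token_counter=len,
        include_system=True,
        start_on="human",
    )
    # Keep the compacted view transient: hand it to the model here rather
    # than writing the trimmed list back into durable state.
    return {"result": {"context_messages": len(trimmed)}}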
5) Error Handling & Resilience
5.1 Handle errors at multiple levels
Good graphs fail gracefully: node level (typed error objects written into state), graph level (conditional edges to error_handler with retry/fallback), and app level (circuit breakers, rate limiting, alerting).
Example: Central error handler and retry policy
MAX_RETRIES = 2

def risky_node(state: AppState):
    try:
        # risky operation ...
        raise RuntimeError("transient failure")
    except Exception as e:
        return {
            "current_step": "error",
            "last_error": {"type": "exception", "detail": str(e)},
            "error_count": state.get("error_count", 0) + 1,
        }

def retry_or_fallback(state: AppState) -> str:
    if state.get("error_count", 0) > MAX_RETRIES:
        return "fallback"
    return "retry"

builder.add_node("risky_node", risky_node)
builder.add_node("fallback", lambda s: {"result": {"note": "using cached answer"}})
builder.add_conditional_edges("error_handler", retry_or_fallback, {
    "retry": "risky_node",
    "fallback": "fallback",
})
5.2 Bound retries and degrade gracefully
For LLM/tool instability, limit retries and switch to simpler fallbacks (lighter model, cached response, human escalation). Keep the user informed with helpful messages.
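Example: A degradation ladder after repeated failures. The tier names and the human_escalation route below are hypothetical; the point is that each recorded failure drops to a cheaper strategy instead of retrying forever.
# Illustrative tiers: primary model -> lighter model -> cached answer
FALLBACK_TIERS = ["primary_model", "light_model", "cached_answer"]

def pick_tier(state: AppState) -> str:
    # One tier per recorded failure; beyond that, hand off to a human.
    failures = state.get("error_count", 0)
    if failures < len(FALLBACK_TIERS):
        return FALLBACK_TIERS[failures]
    return "human_escalation"

def degrade_node(state: AppState) -> dict:
    tier = pick_tier(state)
    # Keep the user informed about the degraded path instead of failing silently.
    note = f"Answering via {tier} after {state.get('error_count', 0)} failure(s)."
    return {"result": {"strategy": tier, "user_note": note}}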
6) Human-in-the-Loop (HITL)
6.1 Interrupt where human judgment adds value
Use dynamic interrupt inside nodes to pause on sensitive actions (purchases, PII use, high-risk tool calls). Resume with the operator’s decision attached to state.
Example: Approval pause and resume
from langgraph.types import interrupt

def approval_node(state: AppState):
    action = state.get("proposed_action", {})
    if action.get("risk_level") == "high":
        decision = interrupt({
            "action": action,
            "request": "approval",
        })  # resume payload, e.g. {"approved": bool, "note": str}
        return {"approved": decision.get("approved", False), "review_note": decision.get("note")}
    return {"approved": True}
6.2 Design the resume path
After an interrupt, restore minimal context and continue deterministically. Avoid brittle re-execution by capturing the exact decision payload and relevant inputs in state before pausing.
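Example: Resuming a paused thread with the operator’s decision. A sketch that assumes the approval_node from 6.1 and the same thread-scoped config; the value passed to Command(resume=...) becomes the return value of interrupt() inside the node.
from langgraph.types import Command

# First invocation pauses inside approval_node's interrupt() for high-risk actions.
paused = compiled.invoke(
    {"messages": [{"role": "user", "content": "Proceed with the flagged purchase"}]},
    config=config,
)

# Later, resume the same thread with the captured decision payload.
resumed = compiled.invoke(
    Command(resume={"approved": True, "note": "Reviewed by ops on call"}),
    config=config,
)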
7) Testing & Quality
7.1 Test graphs, not just functions
Write tests that exercise the graph: construct a tiny state, invoke/ainvoke, and assert on the resulting state and chosen edge.
Example: Unit-style test of a small graph
import pytest

@pytest.mark.asyncio
async def test_happy_path(async_compiled_graph):
    init = {"messages": [{"role": "user", "content": "find cafes"}], "max_steps": 2}
    out = await async_compiled_graph.ainvoke(init, config={"configurable": {"thread_id": "t1"}})
    assert out["current_step"] in {"done", "validated", "proceed"}
7.2 Mock external tools
Isolate LLM and network/tool calls in tests. Provide deterministic mock outputs for edge selection and error branches. Treat “tool schemas” like contracts.
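Example: Injecting a deterministic stub for a tool-backed node. The search_node_factory pattern below is just one way to keep the real backend out of tests; the names are illustrative.
def fake_search_tool(query: str) -> dict:
    # Deterministic stand-in for the real search backend.
    return {"hits": [f"stub result for {query}"]}

def search_node_factory(search_fn):
    # The node takes its tool as a dependency, so tests can inject the stub
    # and production code can pass the real client-backed function.
    def search_node(state: AppState) -> dict:
        query = state["messages"][-1].content
        return {"result": {"search": search_fn(query)}}
    return search_node

# In tests:      builder.add_node("search_agent", search_node_factory(fake_search_tool))
# In production: builder.add_node("search_agent", search_node_factory(real_search_client))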
7.3 Property-style checks for state invariants
Assert invariants like “only one of {approved, rejected} is set,” or “retry_count never exceeds max.”
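Example: An invariant checker run after every test invocation. The specific checks mirror the AppState fields used throughout this guide; adapt them to your own schema.
def assert_state_invariants(state: AppState) -> None:
    # The error counter must never exceed the configured ceiling.
    assert state.get("error_count", 0) <= state.get("max_steps", 3)
    # An error step must always carry a structured error payload.
    if state.get("current_step") == "error":
        assert state.get("last_error") is not None
    # Approval flags, if present, must be booleans (no "maybe" states).
    if "approved" in state:
        assert isinstance(state["approved"], bool)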
8) Deployment, Observability & Scale
8.1 Configuration per environment
Move provider choices, model names, and feature flags to config, and inject them as configurable inputs at runtime.
Example: Environment-aware config
import os
def runtime_config(thread: str, ns: str):
return {
"configurable": {
"thread_id": thread,
"checkpoint_ns": ns,
"model_provider": os.getenv("LLM_PROVIDER", "openai"),
"route_traces": True,
}
}
8.2 Trace everything
Use LangSmith (or your APM/OTEL stack) for traces, token accounting, and step timing. Log selected edges, tool payloads (sanitized), retries, and outcomes.
Example: Basic LangSmith setup (conceptual)
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "my-langgraph-app"
os.environ["LANGCHAIN_API_KEY"] = "***"
# Traces will automatically include node spans when using LangChain integrations
8.3 Right-sized durability & pooling
- Postgres checkpointer with connection pooling.
- Prune old checkpoints by policy.
- Scale workers horizontally; keep graphs stateless beyond the checkpointer.
8.4 Streaming at scale
Prefer updates for dashboards and reserve messages for chat-like UX. Combine modes only when distinct subscribers truly need them.
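Example: Combining stream modes for distinct subscribers. When a list of modes is passed, the stream yields (mode, chunk) tuples; push_to_dashboard and push_to_chat_ui are placeholder consumers.
for mode, chunk in compiled.stream(
    {"messages": messages},
    stream_mode=["updates", "messages"],  # only combine when both are truly needed
    config=config,
):
    if mode == "updates":
        push_to_dashboard(chunk)   # state deltas for monitoring
    else:
        push_to_chat_ui(chunk)     # token stream for the chat front end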
9) Security & Compliance
9.1 Treat state as sensitive
State often carries prompts, user inputs, and tool outputs. Sanitize PII, encrypt at rest (DB, object store), and scrub logs. If you mirror parts of state into analytics, do it through a privacy-aware export path.
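Example: A sanitization pass before state leaves the trust boundary. A minimal sketch; the key list and regex are illustrative and no substitute for a proper PII pipeline.
import re

REDACTED_KEYS = {"api_key", "email", "phone", "ssn"}  # illustrative, extend per policy
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def sanitize_for_export(value):
    # Recursively mask sensitive keys and obvious PII before logging,
    # tracing, or mirroring state into analytics.
    if isinstance(value, dict):
        return {
            k: "***" if k in REDACTED_KEYS else sanitize_for_export(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [sanitize_for_export(v) for v in value]
    if isinstance(value, str):
        return EMAIL_RE.sub("[redacted-email]", value)
    return value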
9.2 Harden your edges and tools
Validate external inputs (schema + range checks), authenticate tool backends, apply rate limits, and prefer allowlists over wildcards for tool execution. For multi-tenant deployments, enforce row-level security or scoped queries keyed by tenant_id + thread_id.
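Example: An allowlist guard in front of tool execution. A sketch of the idea; the tool names and the shape of the tool_call payload are assumptions, not a fixed schema.
ALLOWED_TOOLS = {"search_hotels", "get_weather"}  # explicit allowlist, no wildcards

def guarded_tool_node(state: AppState) -> dict:
    call = state.get("result", {}).get("tool_call", {})
    name = call.get("name", "")
    if name not in ALLOWED_TOOLS:
        return {
            "current_step": "error",
            "last_error": {"type": "tool_denied", "detail": f"tool '{name}' is not allowlisted"},
        }
    # Authenticate against the tool backend and apply per-tenant rate limits here,
    # scoping any queries by tenant_id + thread_id.
    return {"current_step": "tool_approved"}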
10) Practical Reference Patterns
10.1 Minimal, typed state with reducers
Use TypedDict/Pydantic and add reducers only when you need accumulation of messages or results.
10.2 Supervisor → agent fan-out with Send API
Have the supervisor classify the task, fan out to N workers in parallel via Send, and aggregate into a single node for ranking/decision.
10.3 Durable HITL checkpoints
Before an interrupt, snapshot the proposed action and minimal inputs into state so the operator sees exactly what they’re approving. Resume with the approval payload merged into state.
11) Troubleshooting Checklist
- A branch isn’t firing? Re-check your conditional function return strings and mapping dict—small typos cause silent misroutes.
- Streams look heavy? Switch to updates or add a custom channel for just the fields your UI needs.
- Graphs “forget” progress? Verify you’re consistently sending thread_id and that your checkpointer is set up in the same namespace.
- Parallel fan-out stalls? Ensure each Send target consumes compatible partial state and returns the shape expected by the aggregator.
12) Putting It All Together
A resilient LangGraph application looks like this:
- State: small, typed, and validated; reducers used sparingly.
- Flow: simple edges where possible; conditional edges only at real decision points; bounded cycles.
- Memory: Postgres checkpointer with thread-scoped checkpoints; namespaced long-term preferences.
- Streaming: deliberate choice of messages/updates/values/custom per UX and bandwidth needs.
- Errors: node-, graph-, and app-level handling with graceful degradation and escalation.
- HITL: precise interrupt points and deterministic resume paths.
- Ops: environment-based config, full tracing, connection pooling, and cost monitoring.
If you follow these practices, your team will ship graphs that are easier to reason about, test, and scale—without turning your orchestration layer into a ball of yarn.